CloudWatch Metrics and Logs Insights for DevOps Observability

CloudWatch metrics and CloudWatch Logs Insights form the observability backbone that every DOP-C02 monitoring scenario assumes you already understand at the implementation level. Where the Associate-tier exam tests "can you create an alarm", the DevOps Engineer Professional exam tests whether you can pick the right ingestion path (agent vs PutMetricData API vs metric filter vs embedded metric format), pick the right query tool (Logs Insights vs Athena vs OpenSearch), pick the right retention and lifecycle plan (log group retention vs S3 archive vs Firehose to S3), and pick the right cardinality strategy (dimensions vs Contributor Insights vs structured fields). Get any of these four picks wrong and you either pay too much, miss the data you need at incident time, or paint yourself into a corner that you cannot refactor without re-instrumenting.

This guide assumes you already know basic CloudWatch concepts (metric, alarm, dashboard) and basic CloudWatch Logs concepts (log group, log stream). It focuses on the DOP-C02 implementation depth — namespace and dimension design for cardinality control, the unified CloudWatch agent on EC2 and on-premises, the difference between standard and high-resolution metrics, the embedded metric format (EMF) for emitting metrics from Lambda and containers without the PutMetricData call, metric filters that turn log events into metrics, metric streams that ship every metric to Firehose for cheap long-term analytics, Logs Insights query language at the level required by exam scenarios, subscription filters that fan out logs to Lambda or Kinesis, retention policies that cut storage cost, KMS encryption of log groups, cross-account observability monitoring accounts, and the trap-rich corners of vended logs versus custom logs versus container logs. Domain 4 (Monitoring and Logging, 15 percent) and Domain 5.3 (Troubleshoot system and application failures) both lean on this material.

Why CloudWatch Metrics and Logs Insights Matter on DOP-C02

DOP-C02 weights monitoring at 15 percent and incident response at 14 percent — together that is 29 percent of the exam, almost a third of your score. Both domains route through CloudWatch metrics and CloudWatch Logs Insights because nearly every "what alarm fires", "what query finds the bad request", "how do you build a dashboard for the on-call engineer" scenario lands here. The exam wants the DevOps engineer who can ingest at the right layer, query at the right latency, and retain at the right cost. CloudWatch metrics and Logs Insights are not just monitoring features; they are the shared vocabulary that EventBridge rules, alarms, runbooks, and rollback signals all consume.

The exam style is implementation-detail-rich. A typical stem reads: "An ECS Fargate workload emits 12,000 unique customer IDs per minute, each tagged on a metric. The team needs a per-customer P95 latency view but the CloudWatch bill is 8x budget. Which two changes reduce cost while preserving per-customer drill-down?" The answer is not "use CloudWatch differently" — it is the specific combination of stopping the per-customer dimension explosion and switching to Contributor Insights for top-N customer cardinality, plus moving raw events to CloudWatch Logs with EMF so you keep the data without paying per-metric cost. Knowing these tools and their cost models is the DevOps Engineer Professional skill.

Namespace: a logical container for metrics. AWS-vended namespaces start with AWS/ (e.g., AWS/EC2); custom namespaces are anything else. Namespaces partition metrics so dashboards and alarms can filter cleanly.
Dimension: a name-value pair that uniquely identifies a metric within a namespace. Each unique combination is a separate billable metric. Maximum 30 dimensions per metric, and high cardinality dimensions are the number-one CloudWatch cost driver.
Resolution: standard resolution is 1-minute granularity (default); high resolution is 1-second granularity (extra cost, only for sub-minute alarms).
Embedded Metric Format (EMF): a JSON log format that CloudWatch parses into metrics automatically. Write a JSON log line, get a metric — no PutMetricData call, no SDK, works from Lambda and containers natively.
Metric filter: a filter applied to a log group that emits a CloudWatch metric whenever a log event matches a pattern. Useful for converting unstructured logs (e.g., ERROR text) into countable metrics.
Metric stream: a continuous near-real-time stream of every CloudWatch metric to Firehose, delivered as JSON or OpenTelemetry. Used for cheap long-term archival to S3 or third-party analytics.
Logs Insights: a purpose-built query language for CloudWatch Logs with fields, filter, stats, parse, sort, and limit operators. Charged per GB of data scanned.
Subscription filter: a real-time delivery rule on a log group that ships matching events to Lambda, Kinesis Data Streams, Firehose, or OpenSearch.
Contributor Insights: a built-in CloudWatch tool that finds the top-N contributors to a metric (e.g., top IPs, top customers) without exploding dimension cardinality.
Cross-account observability: the monitoring-account model that lets one account view metrics, logs, and traces from many source accounts through Observability Access Manager (OAM) links that connect each source account to a sink in the monitoring account.
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html

Plain-Language Explanation: CloudWatch Metrics and Logs Insights

The CloudWatch surface area is wide, and without analogies the namespace, dimension, resolution, EMF, and metric-stream concepts blur together. Three different analogies isolate different parts of the system.

Analogy 1: The Hospital Vital-Signs Monitor and Chart Room

Picture a busy hospital ward. Every patient (every EC2 instance, every Lambda invocation, every container task) is wired to a vital-signs monitor — heart rate, blood pressure, oxygen saturation, temperature. Those readings are CloudWatch metrics: continuous numeric samples at fixed intervals. The hospital does not store every individual reading on a clipboard; it stores them in a central monitor station indexed by ward, by bed, by patient (the namespace and dimensions). At the same time, the nurses write chart notes — narrative text about what happened ("patient was anxious at 03:14, gave reassurance"). Those are CloudWatch Logs: timestamped, mostly text, queryable with Logs Insights the way a doctor reviews charts to reconstruct a shift. When a doctor needs to know "how many patients had a fever this hour", that is a metric filter — pattern-matching across chart notes and emitting a metric. When the head of cardiology wants every reading from every monitor exported to the research database for long-term study, that is a metric stream to Firehose. The DOP-C02 exam wants you to choose the right artifact: never store narrative chart notes as numeric metrics (high cardinality, expensive), and never try to query numeric vital signs in the chart room (use the metric station). Picking the wrong tool is the DOP-C02 trap.

Analogy 2: The Restaurant Cash Register and Order Tickets

A restaurant runs two parallel record systems. The cash register keeps a running tally: total sales by hour, by station, by waiter. That is CloudWatch metrics: aggregated, dimensional, designed for fast lookup. The order tickets stack up on a spike on the wall, every individual order with its line items. That is CloudWatch Logs: high-volume, semi-structured, queried later when the manager wants to investigate a specific complaint. The chef does not look up "how many burgers were sold today" by reading every ticket; that would be slow and the ticket pile would fill the kitchen. The chef looks at the register tape (a CloudWatch metric). But when a customer disputes a charge, the chef pulls the specific ticket from the pile (a Logs Insights query with filter and parse). If the restaurant wants to predict next week's burger demand, they archive the register tapes weekly to a research basement (the metric stream to S3) where data scientists run Athena queries on the archive. EMF is the kitchen's clever trick of having the order ticket itself contain a tearable receipt that goes into the register: one piece of paper, two record systems updated.

Analogy 3: The Astronomy Observatory with Light Sensors and Image Plates

An observatory has two instruments. Light sensors continuously sample brightness at one-second resolution and dump readings into a numeric database. That is high-resolution CloudWatch metrics: fast, expensive per channel, used for short-term precision. Image plates are wide-field photographic exposures taken every 10 minutes, capturing the whole sky in raw form. That is CloudWatch Logs: bulk, low-frequency, archived for years, queried only when something interesting happens. When astronomers want to find "the brightest object in the Andromeda quadrant in the last hour", they run a Logs Insights query that scans plates filtered by region. When they want to track Polaris's brightness over a year, they look at the metric stream archive in cold storage. Contributor Insights is the ranking tool that reports "the top 10 brightest objects across the entire sky last night" without storing a separate metric per object; it computes the ranking from the log data itself. The observatory does not put every star as a separate dimension on its brightness metric; that would create millions of metrics and bankrupt the institute. It uses Contributor Insights to summarize the long tail.

For DOP-C02 stems centered on "metrics vs logs ingestion choice", reach for the hospital vital-signs vs chart notes analogy. For stems centered on "high-cardinality cost explosion and how to fix it", reach for the restaurant register vs ticket pile analogy. For stems centered on "long-term archive vs real-time query", reach for the observatory sensors vs plates analogy. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html

CloudWatch Metric Anatomy — Namespace, Dimension, Resolution

The first DOP-C02 implementation choice is how a metric is shaped. Get the shape wrong and your bill explodes or your alarms cannot find the data they need.

Namespaces partition metrics for clarity

Every metric lives in a namespace. AWS service metrics use AWS/<Service> (e.g., AWS/Lambda, AWS/ECS). Your custom metrics should use a clear application or team prefix (e.g., MyCompany/Checkout). Namespaces never collide across accounts. The exam tests that you know AWS/EC2 only contains hypervisor-visible metrics like CPUUtilization and NetworkIn; OS-level metrics like memory and disk usage require the CloudWatch agent and end up under CWAgent (the default custom namespace).

Dimensions identify a metric instance

A dimension is a name-value pair that, together with the metric name and namespace, uniquely identifies one billable time series. CPUUtilization with dimension {InstanceId=i-abc123} is one metric; the same CPUUtilization with {InstanceId=i-def456} is a separate metric. Every unique dimension combination is a new metric — and CloudWatch charges per metric. The DOP-C02 cardinality trap: tagging a metric with customer_id, request_id, or any high-cardinality field will multiply your metrics by the cardinality of that field. Never put unbounded fields in dimensions; use Contributor Insights or EMF logs instead.

Resolution determines granularity and cost

Standard-resolution metrics are sampled every 60 seconds. High-resolution custom metrics support 1, 5, 10, 30, and 60-second granularity. Only set high resolution when you genuinely need sub-minute alarms — e.g., hot-path API latency. Standard alarms evaluate per minute; high-resolution alarms evaluate every 10 or 30 seconds at higher cost. AWS-vended metrics from EC2 are 5-minute by default and 1-minute with detailed monitoring enabled (extra cost).

A custom metric exists from the moment any data is published with a unique namespace + name + dimension combination. Stopping publication does not delete the metric: CloudWatch has no delete operation for metrics, and the data points simply age out under the 15-month retention schedule. Custom metric charges are prorated by the hour and accrue while data is being published, so accidentally publishing a million unique-dimension metrics runs up the bill for as long as the emitter keeps running. Always validate dimension cardinality before turning on PutMetricData at scale, and prefer EMF log fields for high-cardinality data because logs are billed by ingest volume, not by unique field combinations. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html
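The multiplication behind the cardinality trap can be checked with a few lines of arithmetic. This is a minimal sketch: the dimension names and cardinalities are illustrative assumptions, and it computes only the worst-case count of distinct time series, not a price.

```python
from math import prod

def metric_count(dimension_cardinalities: dict[str, int]) -> int:
    """Worst-case number of distinct time series for one metric name.

    Every unique combination of dimension values is a separate metric,
    so the count is the product of the per-dimension cardinalities.
    """
    return prod(dimension_cardinalities.values())

# Hypothetical dimension sets for a single Latency metric.
low = metric_count({"Service": 5, "Region": 3})
high = metric_count({"Service": 5, "Region": 3, "customer_id": 12_000})

print(low)   # bounded dimensions only
print(high)  # same metric after adding an unbounded customer_id dimension
```

Adding one unbounded dimension multiplies the series count by its cardinality, which is why such fields belong in log payloads rather than in dimensions.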

The Unified CloudWatch Agent — One Agent for Metrics and Logs

The unified CloudWatch agent replaces both the legacy SSM CloudWatch plugin and the standalone CloudWatch Logs agent. It runs on EC2 (Linux and Windows), on-premises servers, and ECS-on-EC2 nodes. The agent reads a JSON configuration that defines:

Metrics to collect: CPU per-core, memory, disk usage, swap, network connections, process count — anything not visible to the hypervisor.
Logs to ship: file paths to tail, log group names, log stream names, retention.
StatsD and collectd ingestion: the agent listens on UDP for application-emitted metrics from any client.
Procstat plugin: per-process CPU and memory tracking for specific binaries.

You author the agent configuration once, store it in SSM Parameter Store, and use SSM Run Command (AWS-ConfigureAWSPackage to install the package, AmazonCloudWatch-ManageAgent to apply the config) to push it to fleets. This pattern is heavily tested: on DOP-C02 the right answer is not a custom Ansible role but SSM Distributor plus State Manager, which keep the agent installed and running with the right config.

A common DOP-C02 stem reports a memory leak that the team only discovered after the OOM killer terminated processes. The trap distractor says "enable detailed monitoring on the EC2 instance". Detailed monitoring only changes resolution from 5-minute to 1-minute on AWS-vended metrics — it does not add memory or disk metrics. The right answer is to install the unified CloudWatch agent and ship mem_used_percent and disk_used_percent to the CWAgent namespace, then alarm on those custom metrics. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html
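A minimal sketch of the agent configuration that the memory scenario calls for, built as a Python dict and serialized to JSON. The structure follows the agent's documented metrics section as I understand it; the choice of the root filesystem and the ASG roll-up are assumptions for illustration.

```python
import json

def build_agent_config(namespace: str = "CWAgent") -> dict:
    """Return an agent config that collects memory and disk usage."""
    return {
        "metrics": {
            "namespace": namespace,
            # Attach instance and Auto Scaling group identity to each datapoint.
            "append_dimensions": {
                "InstanceId": "${aws:InstanceId}",
                "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
            },
            # Also publish a roll-up keyed only by the Auto Scaling group.
            "aggregation_dimensions": [["AutoScalingGroupName"]],
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                # Under "disk", "used_percent" is reported as disk_used_percent.
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        }
    }

config_json = json.dumps(build_agent_config(), indent=2)
print(config_json)
```

The resulting JSON is the kind of document that would be stored in Parameter Store and applied to the fleet.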

Embedded Metric Format (EMF) — Logs and Metrics in One Write

EMF is a JSON log format that CloudWatch automatically parses to extract metrics. Write a single structured log line; CloudWatch reads it and emits both the log event (queryable in Logs Insights) and the metric (alarmable, dashboardable). The format declares a _aws.CloudWatchMetrics envelope describing which fields are dimensions, which are metric values, and what unit. EMF is ideal for Lambda and ECS Fargate because:

No PutMetricData synchronous API call (which adds latency and can fail).
High-cardinality dimensions stay in the log payload, not as billable metrics.
One write produces both telemetry artifacts.

Powertools for AWS Lambda and the CloudWatch embedded metric format client libraries generate the JSON for you; AWS publishes client libraries for Node.js, Python, Java, and .NET. EMF is heavily emphasized on DOP-C02 because it solves the cardinality problem without sacrificing visibility.
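To make the envelope concrete, here is a hand-rolled sketch of one EMF log line without any helper library. The namespace, metric name, and field names are hypothetical; in a Lambda function, printing the line to stdout is enough for it to reach CloudWatch Logs.

```python
import json
import time

def emf_log_line(namespace, service, latency_ms, customer_id, timestamp_ms=None):
    """Build one EMF-formatted JSON log line.

    Only "Service" is declared as a dimension. customer_id rides along as a
    plain log field, so it stays queryable without creating a metric per customer.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    event = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [["Service"]],
                    "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}],
                }
            ],
        },
        "Service": service,          # dimension value
        "Latency": latency_ms,       # metric value
        "customer_id": customer_id,  # high-cardinality log field, not a dimension
    }
    return json.dumps(event)

print(emf_log_line("MyCompany/Checkout", "checkout", 182.5, "cust-0042"))
```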

Metric Filters — Turning Log Events Into Metrics

A metric filter is a saved pattern on a log group that emits a CloudWatch metric whenever a matching event arrives. The classic use case is converting a free-text ERROR log into a count metric for alarming. Filter syntax supports JSON logs ({ $.level = "ERROR" }), space-delimited tokens ([ip, user, ..., status_code=5*, size]), and plain substring matches.

Metric filter values can come from log fields (e.g., $.duration to track request duration) or be a literal 1 to count events. You then alarm on the resulting metric. Metric filters only apply to events ingested after the filter was created — they do not retroactively scan old logs. That is a DOP-C02 trap distractor.
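A sketch of the request parameters for an error-count metric filter. The parameter names follow the CloudWatch Logs PutMetricFilter API as exposed by boto3; the log group, filter name, and namespace are hypothetical, and the actual call is left as a comment because it needs AWS credentials.

```python
def error_count_filter(log_group: str) -> dict:
    """Parameters for a filter that counts JSON log events with level ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "ErrorCount",
        "filterPattern": '{ $.level = "ERROR" }',
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": "MyCompany/Checkout",
                "metricValue": "1",   # literal 1: count matching events
                "defaultValue": 0.0,  # publish 0 when nothing matches
            }
        ],
    }

params = error_count_filter("/aws/lambda/checkout")
# With boto3 installed and credentials configured, one would call:
#   boto3.client("logs").put_metric_filter(**params)
print(params["filterPattern"])
```

Setting a default value of 0 keeps the metric continuous, which makes alarms on it behave more predictably during quiet periods.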

Logs Insights Query Language — The DevOps SQL of CloudWatch

Logs Insights is the right tool for on-demand log query at incident time. It uses a pipeline syntax with these operators:

fields @timestamp, @message, @logStream — select fields.
filter status >= 500 — narrow rows.
parse @message "user=* path=*" as user, path — extract from unstructured text.
stats count() by bin(5m), service — aggregate.
sort @timestamp desc — order.
limit 100 — cap.

Logs Insights is billed by GB of data scanned. It is not designed for continuous queries; for that, use a metric filter or subscription filter. A single query can span multiple log groups (up to 50). Query results remain retrievable for 7 days; that window applies to results, while the data you can query is bounded by each log group's retention. Common DOP-C02 query patterns:

Find the top 10 5xx-returning paths in the last hour.
Compute P95 request latency bucketed by endpoint.
Trace a user's session across multiple microservice log groups using a request ID.

The exam can show partial queries with one operator missing. Memorize the typical order: fields → filter → parse → stats → sort → limit. The order is a convention rather than a grammar rule: commands run left to right through the pipe, so a parse must come before any filter or stats that uses the fields it extracts. Aggregations require stats (count, sum, avg, min, max, count_distinct, percentiles via pct(field, 95)). Bucketing time uses bin(<duration>). Logs Insights does not support joins across log groups in the SQL sense; use shared field names like a request ID and stats by instead. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html
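Two of the query patterns above, written out as query strings. This sketch assumes JSON-structured logs that already expose path, status, and duration fields (hypothetical names); unstructured logs would need a parse stage first.

```python
def pipeline(*stages: str) -> str:
    """Join Logs Insights commands with the pipe separator, one per line."""
    return "\n| ".join(stages)

# Top 10 paths returning 5xx responses.
TOP_5XX_PATHS = pipeline(
    "fields @timestamp, path, status",
    "filter status >= 500",
    "stats count(*) as errors by path",
    "sort errors desc",
    "limit 10",
)

# P95 latency per path, slowest first.
P95_BY_PATH = pipeline(
    "fields path, duration",
    "stats pct(duration, 95) as p95 by path",
    "sort p95 desc",
    "limit 10",
)

print(TOP_5XX_PATHS)
print(P95_BY_PATH)
```

Either string can be pasted into the Logs Insights console or passed as the query string of a StartQuery request along with the log group names and a time range.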

Subscription Filters — Real-Time Log Fan-Out

Subscription filters deliver matching log events in real time to one of four destinations:

AWS Lambda — for low-volume, low-throughput processing.
Kinesis Data Streams — for high-throughput, multi-consumer fan-out.
Amazon Data Firehose — for batched delivery to S3, Redshift, OpenSearch, or third parties.
OpenSearch Service — for full-text search and Kibana dashboards.

Each log group supports up to two subscription filters. The filter pattern syntax matches the metric filter syntax. For DOP-C02 scenarios that say "ship every error log to a security team's account in real time", the answer is a subscription filter to Kinesis Data Streams (which supports cross-account delivery), not "use S3 Event Notifications" or "schedule a Lambda".

Metric Streams — Cheap Long-Term Metric Archive

Metric streams continuously deliver every CloudWatch metric (or a filtered subset) to Amazon Data Firehose in near real time, in JSON or OpenTelemetry 0.7 / 1.0 format. The stream goes to Firehose, which can land it in S3, Splunk, Datadog, New Relic, or any HTTP endpoint. Use cases:

Cheap long-term retention beyond CloudWatch's 15-month cap.
Third-party APM integration without per-metric pull-API throttling.
Cost optimization: query archived metrics in Athena instead of paying CloudWatch metric data API costs.

A common exam stem: "compliance requires 7-year metric retention". The right answer is a metric stream to Firehose to S3 with a lifecycle policy, not "increase CloudWatch retention" (which only goes to 15 months).
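A sketch of the request for such a metric stream. The parameter names follow the CloudWatch PutMetricStream API as exposed by boto3; the stream name and both ARNs are placeholders, not real resources.

```python
def metric_stream_request(name, firehose_arn, role_arn, namespaces):
    """Parameters for a metric stream limited to the given namespaces."""
    return {
        "Name": name,
        "FirehoseArn": firehose_arn,  # delivery stream that lands data in S3
        "RoleArn": role_arn,          # role CloudWatch assumes to write to Firehose
        "OutputFormat": "json",       # or an OpenTelemetry format
        "IncludeFilters": [{"Namespace": ns} for ns in namespaces],
    }

request = metric_stream_request(
    name="compliance-archive",
    firehose_arn="arn:aws:firehose:us-east-1:111122223333:deliverystream/metrics-to-s3",
    role_arn="arn:aws:iam::111122223333:role/metric-stream-to-firehose",
    namespaces=["AWS/Lambda", "MyCompany/Checkout"],
)
# With boto3 and credentials available:
#   boto3.client("cloudwatch").put_metric_stream(**request)
print(request["IncludeFilters"])
```

Long-term retention is then a property of the S3 bucket's lifecycle configuration rather than of CloudWatch.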

Log Group Retention and KMS Encryption

By default, CloudWatch Logs retain data forever, which makes unbounded log storage one of the largest CloudWatch cost drivers in mature accounts. Set log group retention explicitly per use case (1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653 days, or never expire). For long-term archive beyond the log group retention period, ship logs to S3 via a subscription filter to Firehose and apply S3 lifecycle policies (Glacier Instant, Glacier Flexible, Deep Archive).

KMS encryption is configured per log group via the associate-kms-key API. The KMS key policy must allow logs.<region>.amazonaws.com to encrypt, and the policy condition should restrict to the specific log group ARN to prevent the confused-deputy problem. AWS-managed keys cannot be used for log groups — you must use a customer-managed key.
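A sketch of the key policy statement described above, generated for one log group. The region, account ID, and log group name are placeholders, and the action list is an assumption modeled on the statement AWS documents for CloudWatch Logs.

```python
import json

def logs_key_policy_statement(region: str, account_id: str, log_group: str) -> dict:
    """Key policy statement letting CloudWatch Logs use the key for one log group."""
    log_group_arn = f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}"
    return {
        "Effect": "Allow",
        "Principal": {"Service": f"logs.{region}.amazonaws.com"},
        "Action": [
            "kms:Encrypt*",
            "kms:Decrypt*",
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:Describe*",
        ],
        "Resource": "*",
        # Restrict use of the key to this log group (confused-deputy guard).
        "Condition": {
            "ArnEquals": {"kms:EncryptionContext:aws:logs:arn": log_group_arn}
        },
    }

statement = logs_key_policy_statement("us-east-1", "111122223333", "/aws/lambda/checkout")
print(json.dumps(statement, indent=2))
```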

A common DOP-C02 cost-optimization scenario reports a steadily growing CloudWatch bill from log groups created by Lambda functions years ago. The trap distractor is "delete old log streams individually". The right answer is to set log group retention (e.g., 30 days) so old events expire automatically, and to enforce retention with the AWS Config managed rule cw-loggroup-retention-period-check plus an SSM Automation runbook that remediates non-compliant log groups. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html
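Because retention accepts only a fixed set of values, a remediation script typically has to round a requested minimum up to the next permitted setting. A minimal sketch, assuming the list below matches the values the retention API accepts:

```python
from bisect import bisect_left

# Assumed set of retention values (in days) accepted for a log group.
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]

def smallest_allowed_retention(min_days: int) -> int:
    """Return the smallest permitted retention that is at least min_days."""
    i = bisect_left(ALLOWED_RETENTION_DAYS, min_days)
    if i == len(ALLOWED_RETENTION_DAYS):
        raise ValueError(f"{min_days} days exceeds the longest fixed retention")
    return ALLOWED_RETENTION_DAYS[i]

print(smallest_allowed_retention(45))
print(smallest_allowed_retention(30))
```

Rounding up rather than down keeps the result on the safe side of a compliance minimum.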

Contributor Insights — Top-N Without Cardinality Explosion

Contributor Insights is purpose-built for the question "who are the top contributors to this metric". Examples:

Top 10 source IPs hitting an ALB.
Top 10 customers consuming DynamoDB capacity.
Top 10 user agents triggering 500 errors.

You define a rule that points at a log group, identifies the contributor field (e.g., srcaddr), and Contributor Insights computes a sliding top-N without storing a per-contributor metric. The result is a graph showing the long-tail vs the heavy hitters. Use Contributor Insights instead of dimensions=customer_id whenever the cardinality is unbounded.
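A sketch of a rule definition for the top-customers case. The structure follows the Contributor Insights rule syntax for JSON logs as I understand it; the log group name and the customer_id field are hypothetical.

```python
import json

def top_customers_rule(log_group: str) -> dict:
    """Rule body ranking customers by number of log events."""
    return {
        "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
        "LogGroupNames": [log_group],
        "LogFormat": "JSON",
        "Contribution": {
            "Keys": ["$.customer_id"],  # the contributor field
            "Filters": [],              # no filtering: count every event
        },
        "AggregateOn": "Count",
    }

rule_body = json.dumps(top_customers_rule("/ecs/checkout"))
print(rule_body)
```

The serialized body is what a rule-creation request would carry as its definition; no per-customer metric is created by defining it.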

Cross-Account Observability

CloudWatch supports a monitoring account that can view metrics, logs, and X-Ray traces from up to 100,000 source accounts. The model uses an Observability Access Manager (OAM) sink in the monitoring account and an OAM link in each source account. This is the right pattern for centralized DevOps visibility across an Organizations OU. The exam loves it because it is a 2022-and-newer feature that older study material misses.

High-Frequency Exam Traps

Below are the corners that DOP-C02 returns to most often. Memorize the trap shape and the right answer.

Trap 1: CloudWatch Agent vs CloudWatch Logs Agent

The legacy CloudWatch Logs agent (awslogs) is deprecated. The exam still occasionally lists it as a distractor. Always pick the unified CloudWatch agent for both metrics and logs.

Trap 2: Metric Filter Cannot Backfill

A metric filter only emits metrics from log events ingested after the filter was created. Distractors say "create a metric filter to count errors from last week's logs" — wrong; the right answer is a Logs Insights query for historical data.

Trap 3: Logs Insights Cannot Replace Real-Time Alarms

Logs Insights is on-demand, not continuous. To alarm on log patterns in real time, you need a metric filter → metric → alarm, not a scheduled Logs Insights query.

Trap 4: VPC Flow Logs and Vended Logs Charging

VPC Flow Logs to CloudWatch Logs are charged at the vended-logs ingestion rate (cheaper than custom logs). The exam will test that you know vended logs (VPC Flow Logs, Route 53 query logs, Global Accelerator, AWS Network Firewall) are billed differently from your application logs.

Trap 5: Embedded Metric Format Requires the JSON Envelope

EMF requires the _aws envelope in the JSON log line. Plain JSON without the envelope is just a log event — no metric is created. The trap distractor is "log JSON to CloudWatch Logs and CloudWatch will create metrics automatically". False — you must use the EMF envelope.

Trap 6: Lambda Function Logs Default to No Retention

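Lambda creates /aws/lambda/<function> log groups with never expire retention. CDK and CloudFormation templates that create Lambdas should explicitly set retention: RetentionInDays on the AWS::Logs::LogGroup resource in CloudFormation, or the equivalent log retention setting in CDK.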

Trap 7: Subscription Filter Limit

Each log group supports two subscription filters. If a scenario says "ship logs to a SIEM and to a security analytics Lambda and to OpenSearch", you cannot do all three with subscription filters alone. The right answer is one filter to Kinesis Data Streams, then fan out from Kinesis to multiple consumers.
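The decision in this trap reduces to a small rule, sketched below as a toy planner. The consumer names are hypothetical labels, and the limit of two is taken from the text above.

```python
MAX_SUBSCRIPTION_FILTERS = 2  # per log group, as stated above

def plan_fanout(consumers: list[str]) -> dict:
    """Decide whether consumers can subscribe directly or need Kinesis in between."""
    if len(consumers) <= MAX_SUBSCRIPTION_FILTERS:
        return {"subscription_filters": list(consumers), "via_kinesis": []}
    # Too many consumers: one filter feeds a stream, consumers read the stream.
    return {
        "subscription_filters": ["kinesis-data-stream"],
        "via_kinesis": list(consumers),
    }

print(plan_fanout(["siem", "analytics-lambda"]))
print(plan_fanout(["siem", "analytics-lambda", "opensearch"]))
```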

ECS containers ship stdout/stderr to CloudWatch Logs via the awslogs log driver in the task definition. This is a Docker driver, not the CloudWatch agent. The agent is for OS-level metrics on the EC2 host (in EC2 launch type) or for sidecar metric collection (in Fargate via the AWS Distro for OpenTelemetry). Confusing these two ingestion paths is a high-frequency DOP-C02 trap. Reference: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html

DOP-C02 Exam Patterns and Worked Scenarios

Scenario 1: Memory Alarm on EC2 Auto Scaling Group

Stem: "The team needs to alarm when memory usage on an Auto Scaling Group exceeds 85 percent for five minutes." Wrong: enable detailed monitoring. Right: install the unified CloudWatch agent via SSM Distributor + State Manager, ship mem_used_percent to CWAgent namespace with an AutoScalingGroupName dimension, then create a CloudWatch alarm at 85 percent for 5 datapoints over 5 minutes.
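The alarm half of this answer can be sketched as a set of PutMetricAlarm parameters. The parameter names follow the CloudWatch API as exposed by boto3; the alarm name and Auto Scaling group name are hypothetical, and the average statistic is an assumption about how the team wants to aggregate across instances.

```python
def memory_alarm_request(asg_name: str) -> dict:
    """Alarm when average memory use stays above 85 percent for 5 minutes."""
    return {
        "AlarmName": f"{asg_name}-mem-high",
        "Namespace": "CWAgent",
        "MetricName": "mem_used_percent",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Average",
        "Period": 60,             # one datapoint per minute
        "EvaluationPeriods": 5,   # look at the last 5 datapoints
        "DatapointsToAlarm": 5,   # all 5 must breach
        "Threshold": 85.0,
        "ComparisonOperator": "GreaterThanThreshold",
    }

alarm = memory_alarm_request("web-asg")
# With boto3 and credentials available:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
print(alarm["AlarmName"])
```

The alarm only finds data if the agent publishes a series keyed by AutoScalingGroupName alone, which is why the dimension design on the agent side matters.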

Scenario 2: Find the Slow Endpoint at Incident Time

Stem: "On-call engineer needs to identify which API path has P95 latency above 2 seconds in the last hour." Wrong: scan S3 with Athena (the export path adds delay before the data is queryable). Right: a CloudWatch Logs Insights query that parses out the path and duration, computes stats pct(duration, 95) as p95 by path, and then filters on p95 > 2000. Add bin(5m) to the by clause to see the trend.

Scenario 3: Cost-Cutting on a Per-Customer Metric

Stem: "12,000 customers tagged on every metric; cost is 8x budget." Wrong: shorter retention. Right: stop the per-customer dimension, switch to EMF logs with customer_id as a log field, use Contributor Insights for top-N drill-down. Aggregate metric remains low cardinality; per-customer drill-down stays available via logs.

Scenario 4: 7-Year Metric Retention for Compliance

Stem: "Compliance requires 7 years of metric data retention." Wrong: increase CloudWatch retention. Right: CloudWatch metric stream to Firehose to S3 with lifecycle to Glacier; query historical data via Athena.

Scenario 5: Multi-Account Centralized Logs

Stem: "100 source accounts; security team needs single-pane log search." Right: CloudWatch cross-account observability with a monitoring account OAM sink and per-source-account OAM links; for log archive, ship via Firehose to a central S3 bucket and query with Athena.

Cost Optimization Playbook

CloudWatch is one of the top three line items in mature accounts. The DOP-C02 cost-optimization questions cluster around these levers:

Set log group retention — never leave at default. 30-90 days is typical for application logs; 1-7 days for verbose debug logs; longer only with explicit compliance need.
Eliminate high-cardinality dimensions — move to EMF + Contributor Insights.
Use vended-logs paths for VPC Flow Logs, Route 53 query logs (cheaper than custom logs).
Use metric streams for long-term archive instead of CloudWatch retention extension.
Use S3 Storage Lens + lifecycle for archived logs.
Sample debug logs at the application layer; do not ship every event to CloudWatch.
Use AWS Distro for OpenTelemetry to send metrics to Managed Prometheus instead of CloudWatch when cardinality is genuinely high.

FAQ

Q1: When should I use CloudWatch metrics vs CloudWatch Logs vs both via EMF?

Use metrics for low-cardinality numeric series you alarm on continuously (CPU, latency, error rate). Use logs for high-volume narrative events you query on demand (request traces, debug output). Use EMF when you want both: a structured log line with embedded metric values and dimensions. EMF is the DOP-C02 sweet spot for application telemetry from Lambda and Fargate because it solves cardinality cost and keeps drill-down.

Q2: How does Logs Insights compare to Athena for log queries?

Logs Insights queries CloudWatch Logs directly with second-level latency and a 7-day result retention. Athena queries logs that have been exported to S3, with Parquet conversion making it 10-100x cheaper at scale. Use Logs Insights for incident-time investigation (last few hours to days). Use Athena for long-term analytics, compliance reporting, and cost-sensitive bulk queries.

Q3: How do I monitor a Lambda function without inflating my metric bill?

Use EMF via the AWS Lambda Powertools library. Emit metrics as part of the structured log event; CloudWatch automatically extracts the metric. Tag high-cardinality fields like customer_id as log fields, not dimensions. Use Contributor Insights for top-N customer drill-down.

Q4: What is the difference between metric streams and subscription filters?

Metric streams ship metrics to Firehose; subscription filters ship log events to Lambda, Kinesis, Firehose, or OpenSearch. They are complementary — both can land data in S3 for archive but the data shape is different.

Q5: Can I encrypt CloudWatch Logs with my own KMS key?

Yes, with a customer-managed key (CMK). The KMS key policy must allow logs.<region>.amazonaws.com and ideally restrict via a kms:EncryptionContext:aws:logs:arn condition to the specific log group ARN. AWS-managed keys cannot encrypt log groups.

Q6: How do I retroactively analyze logs that have already been ingested?

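Use Logs Insights with a one-time query over any data still within the log group's retention period (the 7-day limit applies to how long query results are kept, not to how far back you can query), or export logs to S3 (via the CreateExportTask API for data already stored, or via a Firehose subscription filter for events going forward) and query with Athena.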

Q7: Why do my Lambda log groups never expire?

Lambda creates /aws/lambda/<function> log groups with never expire retention by default. Set RetentionInDays on the log group resource in your IaC, or use the AWS Config managed rule cw-loggroup-retention-period-check to detect non-compliant groups and remediate via SSM Automation.

Cross-References

CloudWatch alarms and EventBridge are covered in cloudwatch-alarms-eventbridge-integration for the alarming and remediation flow.
X-Ray distributed tracing is covered in xray-distributed-tracing for request-path observability.
CloudTrail and Config dashboards are covered in cloudtrail-config-audit-dashboards for governance auditing.
EventBridge auto-remediation runbooks are covered in eventbridge-auto-remediation-runbooks for closing the loop on alarms.
Deployment failure troubleshooting uses the same Logs Insights and metric patterns and is covered in deployment-failure-troubleshooting.
