
AWS X-Ray — Distributed Tracing and Application Performance

DOP-C02 deep dive on AWS X-Ray service map, segments, subsegments, sampling rules, trace context propagation, daemon vs SDK vs ADOT integration with Lambda, ECS, EKS, API Gateway, and CloudWatch ServiceLens.

AWS X-Ray distributed tracing answers the DevOps engineer's most painful incident question: where in this 17-microservice request chain did the latency or error originate? Logs alone cannot answer this — they live in service-local silos and lack the cross-service correlation. Metrics summarize but don't trace. X-Ray (and its modern OpenTelemetry-based successor pattern via ADOT) reconstructs the full request path as a trace composed of segments (one per service) and subsegments (one per outbound call inside a service), letting you see the service map, drill into the slow span, and correlate to logs and metrics. DOP-C02 expects you to know how to instrument every compute platform (Lambda, ECS, EKS, EC2, on-prem), pick the right sampling rule to bound cost, propagate trace context across SQS / SNS / EventBridge, and integrate X-Ray with CloudWatch ServiceLens for unified observability.

This guide assumes you understand the basic premise of distributed tracing (trace ID, parent-child span relationships) and have seen the X-Ray console at least once. It focuses on the DOP-C02 implementation depth: segments vs subsegments vs annotations vs metadata, sampling rules at the daemon and SDK levels, the X-Ray daemon model on EC2 and ECS (per-host on EC2, sidecar on Fargate) vs the AWS-managed daemon on Lambda, the X-Ray SDK in Java, Node.js, Python, .NET, Go, and Ruby, the AWS Distro for OpenTelemetry (ADOT) as the strategic successor for new instrumentation, trace context headers (X-Amzn-Trace-Id) and propagation across asynchronous boundaries, X-Ray Insights for anomaly detection in trace patterns, X-Ray Groups for filtering, KMS encryption of trace data, and ServiceLens, which fuses X-Ray traces with CloudWatch metrics and logs into one console. Domain 4.2 (Audit, monitor, and analyze logs and metrics to detect issues) and Domain 5.3 (Troubleshoot system and application failures) cover this material.

Why X-Ray Distributed Tracing Matters on DOP-C02

Distributed tracing is the only observability tool that natively spans services. CloudWatch metrics see one service at a time. CloudWatch Logs see one log group at a time. Without traces, "the order API is slow" becomes a multi-day investigation across 12 services. With X-Ray, the service map shows the slow node in one screen and the trace timeline shows which downstream call ate the budget. DOP-C02 weights this skill in Domain 4.2 (analyze logs and metrics to detect issues) and Domain 5.3 (troubleshoot failures). Multiple community study reports list "X-Ray service map" and "trace context propagation" as high-frequency exam topics.

The exam style here is integration-rich. A typical stem reads: "An API Gateway → Lambda → DynamoDB → SQS → Lambda → DynamoDB chain shows P99 latency of 4 seconds with no obvious culprit. Which two changes would help the team isolate the slow link?" The wrong answers cluster around "increase log verbosity" or "enable detailed CloudWatch monitoring". The right pair is enabling X-Ray on every Lambda and API Gateway plus propagating the trace context across the SQS message attribute. The exam tests whether you know which compute platforms need a daemon, which auto-instrument, and which require manual context propagation across queues and event buses.

  • Trace: the full record of a single request across all services, identified by a globally unique 35-character trace ID with embedded timestamp.
  • Segment: one service's slice of the trace; created by an SDK or by AWS automatically (API Gateway, Lambda, ALB).
  • Subsegment: a unit of work inside a segment — typically one outbound call (DynamoDB, S3, HTTP, custom) or a logical block.
  • Annotation: a key-value pair on a segment or subsegment that is indexed for filtering and search; max 50 per trace.
  • Metadata: arbitrary key-value pairs that are not indexed; bounded by the 64 KB limit on the whole segment document, for context the engineer might want during investigation.
  • Sampling rule: a per-service or per-resource rule that decides what fraction of requests to trace; X-Ray bills per trace recorded, so sampling controls cost.
  • X-Ray daemon: a UDP listener that buffers segment data and ships it to the X-Ray API; runs per host on EC2/ECS-on-EC2/on-prem and as a sidecar container on Fargate; not needed on Lambda (handled by AWS).
  • Trace context header: X-Amzn-Trace-Id: Root=...;Parent=...;Sampled=... carried in HTTP, message attributes, and SDK calls so child segments link back to the parent trace.
  • ADOT (AWS Distro for OpenTelemetry): AWS's supported OpenTelemetry distribution that exports traces to X-Ray and metrics to CloudWatch / Managed Prometheus; the strategic forward path.
  • CloudWatch ServiceLens: the console that correlates X-Ray traces with CloudWatch metrics and logs in a single view.
  • X-Ray Insights: ML-based anomaly detection over trace patterns that surfaces unusual error/latency clusters automatically.
  • X-Ray Group: a saved filter expression that scopes the service map and traces to a subset (e.g., only the checkout service in production).
  • Reference: https://docs.aws.amazon.com/xray/latest/devguide/xray-concepts.html
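The trace ID and trace context header defined above are easy to inspect by hand. The following is a minimal sketch, not SDK code: the helper names `new_trace_id` and `parse_trace_header` are hypothetical, and the layout assumed is the documented one (version `1`, 8 hex digits of epoch seconds, 24 random hex digits, which adds up to the 35 characters mentioned above).

```python
import os
import time


def new_trace_id(now=None):
    """Build a trace ID: version 1, epoch seconds as 8 hex digits, 24 random hex digits."""
    epoch = int(time.time() if now is None else now)
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def parse_trace_header(header):
    """Split 'Root=...;Parent=...;Sampled=...' into a dict of its fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key] = value
    return fields


# Build a header the way an upstream service would, then read it back.
trace_id = new_trace_id(now=1700000000)
header = f"Root={trace_id};Parent={os.urandom(8).hex()};Sampled=1"
parsed = parse_trace_header(header)

# The second dash-separated field of Root is the embedded start timestamp.
started_at = int(parsed["Root"].split("-")[1], 16)
```

Decoding `started_at` is occasionally useful during an incident: it tells you when the request entered the system without opening the console.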

Plain-Language Explanation: X-Ray Distributed Tracing

Distributed tracing is conceptually different from logging or metrics, and engineers who have not worked with it before sometimes mistake it for "fancy logs". Three analogies make the model click.

Analogy 1: The Hospital Patient Journey Map

A patient enters the ER with a complex symptom. They are seen by triage, sent to imaging, then to a specialist consult, then to a procedure room, then to recovery, then discharged. Each department writes its own notes (the CloudWatch Logs). Each department also samples vitals (the CloudWatch metrics). But what nobody has — until you build it — is the end-to-end patient journey timeline showing exactly when the patient arrived at each station, how long they waited, how long the work took, and which step delayed the discharge. That timeline is the X-Ray trace. Each department's slice is a segment. Inside the imaging department's slice, the time spent waiting for the MRI machine to free up is a subsegment. The patient's wristband is the trace context header — every department reads the wristband, links its slice to the journey, and writes its notes. Without the wristband, the radiologist cannot connect "patient X's scan" with "patient X in the ER" — they are isolated silos. The exam loves stems where the team has logs but cannot connect them: the answer is propagate the trace context across services and queues.

Analogy 2: The Restaurant Order Through the Kitchen

A complex restaurant has a hot line, a cold line, a pastry station, a wine cellar, and a delivery runner. An order ticket gets stamped at each station with arrival and departure times — that is the X-Ray trace timeline. The total customer wait is the trace duration. The service map is the kitchen flow chart showing which station feeds which: the customer table connects to the host stand, which connects to the kitchen prep, which connects to hot line and cold line and pastry, which all converge at the pass before going to the runner. When orders are slow, the manager looks at the service map to see which station has a red box (high error or latency) and pulls a few traces from that station to see exactly which sub-step (the deep fryer, the salad assembly, the saucier) caused the wait. Annotations on the order ticket are the searchable tags — table number, party size, VIP status, dietary requirements. Metadata is the chef's free-form notes that are not searchable but useful when investigating. Sampling is the policy of "we don't track every ticket in detail, just every 10th one and every ticket flagged complaint" — bounding the paperwork cost.

Analogy 3: The Postal Service Tracking a Package

A package travels from sender to receiver across multiple legs: pickup truck, regional sort facility, long-haul air, destination sort, last-mile van. Each leg scans the tracking number (the trace ID) into a database. The customer's tracking page shows the timeline: pickup at 09:14, arrived at sort facility at 11:02, departed at 14:30, etc. That is exactly an X-Ray trace. Service map is the route diagram. Sampling is "we record every package's high-level scans but only deep-track 1% to keep the database manageable, plus we deep-track all packages with priority flag". Insights are the postal service's anomaly detection that says "30% more delays in the Chicago hub today than usual" without anyone running queries. ServiceLens is the operations console that overlays package metrics (volume, delays) on the same map. The X-Ray daemon is the local scan terminal that batches scans and ships them up. On Lambda it is built into the platform; on EC2 you run it on each host, and on Fargate you run it as a sidecar container.

For DOP-C02 stems centered on "find the slow service in a chain", reach for the hospital patient journey analogy. For stems centered on "service map and dependency visualization", reach for the restaurant kitchen flow analogy. For stems centered on "sampling, annotations, and async propagation", reach for the postal package tracking analogy. Reference: https://docs.aws.amazon.com/xray/latest/devguide/xray-concepts.html

Trace Anatomy — Segments, Subsegments, Annotations, Metadata

A trace is a tree of segments and subsegments. The root segment is created when the request first enters an instrumented service. Downstream services inherit the trace ID and create child segments. Inside each segment, the SDK creates subsegments for outbound work — calls to DynamoDB, S3, HTTP services, or arbitrary code blocks the developer wraps with subsegment.begin()/end().

Annotations are searchable, metadata is not

Annotations are indexed key-value pairs (string, number, boolean) used in X-Ray console filter expressions: annotation.environment = "prod" AND annotation.user_id = "abc123". Limit: 50 per trace. Use annotations for the dimensions you expect to filter on at investigation time.

Metadata is arbitrary JSON, not searchable, and bounded by the 64 KB limit on the whole segment document. Use metadata for additional context an engineer might want to read once they have pulled up the trace: request bodies, response sizes, internal state.
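To make the annotation/metadata split concrete, here is a rough sketch of a segment document assembled as a plain dict. The field names (`name`, `id`, `trace_id`, `start_time`, `end_time`, `annotations`, `metadata`) follow the X-Ray segment document format; the helpers `check_annotations`, `build_segment`, and `segment_size`, and the sample values, are hypothetical.

```python
import json
import os
import time

MAX_SEGMENT_BYTES = 64 * 1024  # the 64 KB figure quoted above


def check_annotations(annotations):
    """Annotations must be flat: simple keys, and string/number/boolean values."""
    for key, value in annotations.items():
        if not key.replace("_", "").isalnum():
            raise ValueError(f"annotation key {key!r} must be alphanumeric or underscore")
        if not isinstance(value, (str, int, float, bool)):
            raise ValueError(f"annotation {key!r} must be a string, number, or boolean")


def build_segment(name, trace_id, annotations, metadata, duration=0.25):
    """Assemble a completed segment; metadata may be any JSON-serializable structure."""
    check_annotations(annotations)
    start = time.time()
    return {
        "name": name,
        "id": os.urandom(8).hex(),       # 16 hex digits
        "trace_id": trace_id,
        "start_time": start,
        "end_time": start + duration,
        "annotations": annotations,      # indexed, usable in filter expressions
        "metadata": metadata,            # stored, never indexed
    }


def segment_size(segment):
    """Size in bytes of the JSON-encoded segment document."""
    return len(json.dumps(segment).encode("utf-8"))


segment = build_segment(
    "orders-api",
    "1-6553f100-" + "0" * 24,
    annotations={"environment": "prod", "user_id": "abc123"},
    metadata={"request": {"items": 3, "coupon": None}},
)
fits = segment_size(segment) <= MAX_SEGMENT_BYTES
```

Nested structures belong in `metadata`; passing one as an annotation makes `check_annotations` raise, which mirrors why request bodies cannot be used as filter dimensions.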

Errors, faults, and throttles

Each segment carries error/fault/throttle flags. Error = client error (4xx HTTP, exception caught and re-thrown). Fault = server error (5xx HTTP, unhandled exception). Throttle = downstream throttling (429, ProvisionedThroughputExceededException). The service map colors each node by the share of calls in each state: green for successful calls, red for faults, yellow for errors, purple for throttles.
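The flag rules for HTTP responses can be sketched in a few lines. This is an illustration of the mapping described above, not SDK source; `classify_status` is a hypothetical name, and setting `error` alongside `throttle` for a 429 is an assumption based on 429 also being a 4xx response.

```python
def classify_status(status_code):
    """Map an HTTP status code to the error / fault / throttle flags on a segment."""
    flags = {"error": False, "fault": False, "throttle": False}
    if status_code == 429:
        # Throttling is a special case of a client error.
        flags["error"] = True
        flags["throttle"] = True
    elif 400 <= status_code < 500:
        flags["error"] = True
    elif 500 <= status_code < 600:
        flags["fault"] = True
    return flags


ok = classify_status(200)
not_found = classify_status(404)
throttled = classify_status(429)
crashed = classify_status(502)
```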

Sampling Rules — Cost Control with Visibility

X-Ray bills per trace recorded (with separate charges for traces retrieved and scanned). Sampling defines what fraction of requests get recorded. Two layers:

Default sampling (no rule configured)

The first request each second is recorded, plus 5 percent of any additional requests. This default is fine for dev but expensive for high-traffic prod: 5 percent of 10,000 RPS is roughly 500 traces recorded every second.
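The arithmetic behind that estimate is worth being able to redo for any traffic level. A minimal sketch, assuming steady traffic and a single sampling rule; `expected_traces_per_second` is a hypothetical helper, not part of any SDK.

```python
def expected_traces_per_second(rps, reservoir=1, fixed_rate=0.05):
    """Expected recorded traces per second under a reservoir + fixed-rate rule."""
    from_reservoir = min(rps, reservoir)              # always sampled
    from_rate = max(rps - reservoir, 0) * fixed_rate  # sampled by percentage
    return from_reservoir + from_rate


per_second = expected_traces_per_second(10_000)
per_month = per_second * 60 * 60 * 24 * 30  # 30-day month

# Lowering the fixed rate to 1 percent cuts the volume roughly fivefold.
per_second_low = expected_traces_per_second(10_000, fixed_rate=0.01)
```

At the default rule and 10,000 RPS this works out to about 500 traces per second, or roughly 1.3 billion traces in a 30-day month, which is why high-traffic entry points almost always need a custom rule.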

Custom sampling rules

Define rules in the X-Ray console or via the API. Each rule has:

  • Reservoir size: a fixed number of requests per second always sampled (e.g., 1).
  • Fixed rate: percentage of requests beyond the reservoir to sample (e.g., 0.05 = 5 percent).
  • Match criteria: service name, service type, host, HTTP method, URL path, resource ARN — the rule applies only to matching requests.
  • Priority: lower number wins when multiple rules match.

Use case: give /healthcheck a reservoir of 0 and a fixed rate of 0 percent (never trace), give /checkout a reservoir of 5 and a fixed rate of 100 percent (trace everything), and leave all other paths at a reservoir of 1 and a fixed rate of 5 percent.
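The three-rule use case can be modelled as data plus a priority lookup. This is a simplified sketch: the keys mirror the sampling rule fields (`RuleName`, `Priority`, `URLPath`, `ReservoirSize`, `FixedRate`), but the priority numbers and the `pick_rule` helper are illustrative, and `fnmatchcase` is only a stand-in for X-Ray's own wildcard matching.

```python
from fnmatch import fnmatchcase

RULES = [
    {"RuleName": "healthcheck", "Priority": 10, "URLPath": "/healthcheck",
     "ReservoirSize": 0, "FixedRate": 0.0},
    {"RuleName": "checkout", "Priority": 20, "URLPath": "/checkout*",
     "ReservoirSize": 5, "FixedRate": 1.0},
    {"RuleName": "everything-else", "Priority": 9000, "URLPath": "*",
     "ReservoirSize": 1, "FixedRate": 0.05},
]


def pick_rule(url_path, rules=RULES):
    """Return the matching rule with the lowest priority number, or None."""
    matching = [rule for rule in rules if fnmatchcase(url_path, rule["URLPath"])]
    if not matching:
        return None
    return min(matching, key=lambda rule: rule["Priority"])


health_rule = pick_rule("/healthcheck")
checkout_rule = pick_rule("/checkout/confirm")
other_rule = pick_rule("/products/42")
```

Note that `/healthcheck` also matches the catch-all rule; it is the lower priority number that makes the 0/0 rule win.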

When the upstream service samples a request, the trace context header carries Sampled=1. Every downstream service in the chain records its segment without making its own sampling decision. This means the upstream sampling rule controls the entire chain's trace volume, not just its own. Set sampling at the entry point (API Gateway, ALB, frontend service); downstream services don't need their own custom rules unless they receive traffic from multiple entry points. Reference: https://docs.aws.amazon.com/xray/latest/devguide/xray-console-sampling.html
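The "honor the upstream decision" behaviour looks roughly like this inside a downstream service. It is a sketch of the logic only, under the assumption that a missing or undecided `Sampled` field falls back to the local rule; `should_record` and its parameters are hypothetical names.

```python
import random


def should_record(incoming_header, local_fixed_rate=0.05, rng=random.random):
    """Follow the upstream Sampled flag; fall back to the local rate only if absent."""
    sampled = None
    for part in (incoming_header or "").split(";"):
        key, _, value = part.strip().partition("=")
        if key == "Sampled":
            sampled = value
    if sampled == "1":
        return True   # upstream sampled it: record regardless of local rules
    if sampled == "0":
        return False  # upstream skipped it: do not start recording mid-chain
    return rng() < local_fixed_rate  # no decision yet: this service is the entry point


upstream_yes = should_record("Root=1-6553f100-aaaa;Sampled=1", local_fixed_rate=0.0)
upstream_no = should_record("Root=1-6553f100-aaaa;Sampled=0", local_fixed_rate=1.0)
entry_point = should_record(None, local_fixed_rate=0.05, rng=lambda: 0.01)
```

The first two calls show why a downstream rule cannot override the entry point: the local rate is never consulted once the header carries a decision.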

X-Ray on Each Compute Platform

The instrumentation pattern differs by platform.

Lambda

Enable Active tracing on the function (CloudFormation: TracingConfig: { Mode: Active }). AWS automatically creates the segment for the invocation. The Lambda runtime has the X-Ray daemon embedded — no sidecar needed. Adding the X-Ray SDK in code (or AWS Lambda Powertools) creates subsegments for downstream AWS SDK calls and HTTP calls automatically.
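Even without the X-Ray SDK, a function with Active tracing can tie its logs to the trace, because Lambda exposes the current trace header in the `_X_AMZN_TRACE_ID` environment variable. A minimal sketch of a handler that logs the trace ID; the log field names and the response shape are hypothetical.

```python
import json
import os


def handler(event, context):
    """Log the trace ID of the current invocation so logs can be joined to the trace."""
    header = os.environ.get("_X_AMZN_TRACE_ID", "")
    fields = dict(part.split("=", 1) for part in header.split(";") if "=" in part)
    print(json.dumps({
        "message": "order received",
        "trace_id": fields.get("Root"),
        "sampled": fields.get("Sampled"),
    }))
    return {"statusCode": 200, "traceId": fields.get("Root")}
```

Logging the trace ID in structured form is what lets a later log query be filtered by trace.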

Fargate (ECS and EKS)

Run the X-Ray daemon as a sidecar container in the task definition, listening on UDP 2000 inside the task. The application container's SDK ships segments to the sidecar, which forwards to the X-Ray API. ADOT collector is the modern alternative.
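A sidecar layout for Fargate might look like the following task definition, expressed as the dict you would pass to the ECS RegisterTaskDefinition API. Treat it as a sketch: the family name, CPU and memory sizes, and the application image are placeholders, and the daemon image tag is an assumption.

```python
# Daemon sidecar: listens on UDP 2000 inside the task's network namespace.
XRAY_SIDECAR = {
    "name": "xray-daemon",
    "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
    "essential": False,
    "cpu": 32,
    "memoryReservation": 256,
    "portMappings": [{"containerPort": 2000, "protocol": "udp"}],
}


def fargate_task_definition(app_container):
    """Pair an application container with the daemon sidecar in one task."""
    return {
        "family": "orders-api",                 # placeholder name
        "networkMode": "awsvpc",
        "requiresCompatibilities": ["FARGATE"],
        "cpu": "256",
        "memory": "512",
        "containerDefinitions": [app_container, XRAY_SIDECAR],
    }


task_def = fargate_task_definition({
    "name": "app",
    "image": "example.invalid/orders-api:1.0",  # placeholder image
    "essential": True,
})
container_names = [c["name"] for c in task_def["containerDefinitions"]]
```

Because both containers share the task's network, the SDK's default daemon address of 127.0.0.1:2000 reaches the sidecar without extra configuration. The task role still needs permission to write to X-Ray.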

ECS on EC2

Run the daemon as an ECS daemon service (daemon scheduling strategy) so that each EC2 host gets one daemon task, shared by all containers on that instance via the host network, or run it as a sidecar in each task. The daemon service is more efficient at scale.

EKS

Run the X-Ray daemon as a Kubernetes DaemonSet, one pod per node. The newer pattern is the ADOT collector as a Kubernetes Deployment or DaemonSet, supporting both X-Ray and OpenTelemetry-compatible backends.

EC2 / on-premises

Install the X-Ray daemon on each host (yum/apt package or binary). Configure the SDK to send to UDP 127.0.0.1:2000.
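What the SDK sends to that UDP port is simple enough to reproduce by hand: a one-line JSON header, a newline, then the segment document. A minimal sketch using only the standard library; `build_daemon_packet` and `send_to_daemon` are hypothetical helper names, and the sample segment values are made up.

```python
import json
import socket

# Protocol header the daemon expects in front of every segment document.
DAEMON_HEADER = b'{"format": "json", "version": 1}\n'


def build_daemon_packet(segment):
    """Prefix the JSON-encoded segment with the daemon header line."""
    return DAEMON_HEADER + json.dumps(segment).encode("utf-8")


def send_to_daemon(segment, address=("127.0.0.1", 2000)):
    """Fire-and-forget UDP send; there is no acknowledgement from the daemon."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(build_daemon_packet(segment), address)


packet = build_daemon_packet({
    "name": "orders-api",
    "id": "70de5b6f19ff9a0a",
    "trace_id": "1-6553f100-000000000000000000000000",
    "start_time": 1700000000.0,
    "end_time": 1700000000.25,
})
```

UDP is the reason a stopped daemon does not break the application: the send simply goes nowhere, and the only symptom is missing traces.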

API Gateway

Toggle X-Ray tracing on the stage. API Gateway creates an upstream segment automatically; if the integration target is also instrumented, the trace continues.

Application Load Balancer

ALB injects the X-Amzn-Trace-Id header into incoming requests. ALB itself does not create a segment, but downstream services receiving the header can extract and propagate the trace context.

Trace Context Propagation — The Hardest Part

Synchronous HTTP and AWS SDK calls propagate context automatically when the SDK is instrumented. Asynchronous boundaries are where engineers lose context most often.

SQS

The producer carries the trace context in the AWSTraceHeader message system attribute; an instrumented AWS SDK sets it on SendMessage for you. The consumer SDK reads the attribute and continues the trace. If you put the trace ID in the message body instead, X-Ray does not link the traces.

SNS

Same as SQS — AWSTraceHeader message attribute.

EventBridge

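Pass the trace context with the event at PutEvents time, either in the entry's TraceHeader field (which the instrumented AWS SDK fills in for you) or inside the event detail. If you carry it in the detail, the receiving Lambda has to extract it manually.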

Step Functions

Step Functions natively integrates with X-Ray; enable on the state machine and traces include each state transition.

Kinesis

Encode the trace context in the record data or partition key — no native message attributes. Most teams use ADOT-OTel propagation here for native support.
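One way to carry context in the record data is a small JSON envelope around the payload. This is only a sketch of that idea: the envelope keys `trace_header` and `payload` are invented for illustration, and both producer and consumer have to agree on them.

```python
import json


def wrap_record(payload, trace_header):
    """Producer side: put the trace header next to the payload in the record data."""
    return json.dumps({"trace_header": trace_header, "payload": payload}).encode("utf-8")


def unwrap_record(data):
    """Consumer side: recover the payload and the trace header (None if absent)."""
    envelope = json.loads(data)
    return envelope["payload"], envelope.get("trace_header")


header = "Root=1-6553f100-000000000000000000000000;Parent=53995c3f42cd8ad8;Sampled=1"
data = wrap_record({"order_id": 42}, header)
payload, recovered = unwrap_record(data)
```

The consumer then starts its segment with the recovered `Root` and `Parent` values instead of generating a new trace ID.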

A common DOP-C02 stem: a request goes through API Gateway → Lambda → SQS → Lambda → DynamoDB but the X-Ray service map shows the chain ending at the first Lambda. The trap distractor is "enable Active tracing on the second Lambda". That alone does not fix it: the second Lambda will create a new root trace because it never received the context. The fix is to propagate AWSTraceHeader as an SQS message system attribute at SendMessage time and read it at consumer time. Reference: https://docs.aws.amazon.com/xray/latest/devguide/xray-services-sqs.html
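The fix can be sketched as the request and response shapes involved, without calling AWS. The dict returned by `send_message_kwargs` is what you would pass to an SQS SendMessage call (for example `boto3`'s `send_message(**kwargs)`); the helper names and the queue URL are hypothetical, and the consumer is assumed to have requested the `AWSTraceHeader` attribute when receiving.

```python
def send_message_kwargs(queue_url, body, trace_header):
    """Producer side: attach the trace context as a message system attribute."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageSystemAttributes": {
            "AWSTraceHeader": {"DataType": "String", "StringValue": trace_header},
        },
    }


def trace_header_from_message(message):
    """Consumer side: system attributes come back in the message's Attributes map."""
    return message.get("Attributes", {}).get("AWSTraceHeader")


header = "Root=1-6553f100-000000000000000000000000;Parent=53995c3f42cd8ad8;Sampled=1"
kwargs = send_message_kwargs(
    "https://sqs.us-east-1.amazonaws.com/123456789012/orders",  # placeholder queue
    '{"order_id": 42}',
    header,
)

# Shape of one received message, trimmed to the relevant keys.
received = {"Body": '{"order_id": 42}', "Attributes": {"AWSTraceHeader": header}}
continued = trace_header_from_message(received)
orphaned = trace_header_from_message({"Body": "{}"})
```

A `None` result on the consumer side is the signature of the broken chain in the stem: the consumer has nothing to link to and starts a new root trace.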

ADOT — AWS Distro for OpenTelemetry

The strategic direction for new instrumentation is OpenTelemetry, with the AWS Distro (ADOT) as the supported pipeline that exports to X-Ray for traces and to CloudWatch / Managed Prometheus for metrics. Why ADOT matters on DOP-C02:

  • Vendor neutrality: same instrumentation works for X-Ray, Datadog, Honeycomb, New Relic.
  • Unified pipeline: traces and metrics through the same collector.
  • OpenTelemetry semantic conventions: standardized attribute names across the industry.
  • Future-proof: AWS is investing in ADOT, and X-Ray-only SDKs are in maintenance mode.

ADOT runs as a sidecar (Fargate), DaemonSet (EKS), or Lambda layer. The exam tests that you know ADOT is the right pick for new greenfield work and for cross-vendor portability, while X-Ray SDK remains acceptable for AWS-only stacks.

CloudWatch ServiceLens — The Unified Console

ServiceLens fuses X-Ray service map, CloudWatch metrics, and CloudWatch Logs into one view. From a node on the service map, you can:

  • Drill into the segment timeline.
  • Jump to the related CloudWatch metrics for that service (latency, errors, requests).
  • Run a CloudWatch Logs Insights query scoped to the service's log group, filtered by trace ID.
  • See alarms in OK / ALARM state overlaid on the map.

ServiceLens is the on-call engineer's investigation console. The exam expects you to know it exists and is the answer when stems mention "single console for traces, metrics, and logs".

X-Ray Insights — Anomaly Detection on Trace Patterns

X-Ray Insights uses ML to find clusters of unusual error or latency patterns over time, surfacing them as insight events with a root-cause hypothesis (e.g., "DynamoDB throttling on Orders table caused 23 percent of OrderService errors in the last 30 minutes"). Insights are emitted as EventBridge events on the aws.xray source, so they can route to SNS, Lambda, or Incident Manager response plans. Enable Insights per X-Ray Group.

X-Ray Groups — Scoped Service Maps

A group is a saved filter expression. Examples:

  • service("checkout-api") AND annotation.environment = "prod".
  • responsetime > 5 OR fault = true.

Each group gets its own service map and Insights configuration; trace retention stays at the account-wide 30 days. Use groups to scope per-team, per-service, or per-environment views.

Encryption and Retention

X-Ray data is encrypted at rest by default with an AWS-managed key. You can configure a customer-managed KMS key at the account level (a single key applies to all traces in the Region). X-Ray retains traces for 30 days; long-term retention requires exporting them yourself, for example with a scheduled Lambda that reads traces through the X-Ray API (GetTraceSummaries, BatchGetTraces) and writes them to S3.

  • Trace retention: 30 days; cannot be extended.
  • Default sampling: 1 req/sec reservoir + 5% rate, applies if no custom rule.
  • Sampling decisions made at trace start propagate to all downstream segments.
  • Annotations: indexed, max 50 per trace, used for filter expressions.
  • Metadata: not indexed; bounded by the 64 KB segment document limit.
  • Daemon: required on EC2 / ECS-on-EC2 / on-prem, and as a sidecar on Fargate; not required on Lambda (built-in).
  • Trace context header: X-Amzn-Trace-Id: Root=1-{8 hex digits}-{24 hex digits};Parent={16 hex digits};Sampled=1. Reference: https://docs.aws.amazon.com/xray/latest/devguide/xray-concepts.html

High-Frequency Exam Traps

Trap 1: Lambda Active Tracing vs PassThrough Tracing

Active tracing lets the function make its own sampling decision and records a segment for every sampled invocation. PassThrough mode (the default, sometimes called passive tracing; rarely tested) only follows a sampling decision that an upstream service has already made, so the function never starts a trace on its own. When a distractor claims "passive tracing is sufficient", pick Active tracing.

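Trap 2: X-Ray Daemon on Fargate Must Run as a Sidecar

Fargate does not run the X-Ray daemon for you. Add the daemon (or the ADOT collector) as a sidecar container in the task definition and give the task role permission to write to X-Ray. Distractor: "Fargate has a built-in daemon, so only the SDK is needed". That is wrong; only Lambda manages the daemon for you.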

Trap 3: Trace Header Propagation Across SQS

Use the AWSTraceHeader message attribute, not the message body. Distractor: "embed trace ID in message body JSON" — does not auto-link in X-Ray.

Trap 4: Sampling Decision at the Entry Point

Setting custom sampling on a downstream service has no effect once the upstream service has marked the request Sampled=1. The entry point controls the entire chain.

Trap 5: Annotations Are Capped at 50

X-Ray indexes at most 50 annotations per trace. If you need more than 50 indexed fields, you cannot use annotations alone; put the rest in metadata and keep annotations for the fields you actually filter on.

Trap 6: X-Ray vs CloudWatch Logs Insights for Traces

X-Ray is for request-path latency and error analysis across services. Logs Insights is for log content queries within a log group. Do not pick Logs Insights when the stem asks for service-map and downstream-call timing.

Trap 7: 30-Day Retention Cannot Be Extended

For a long-term trace archive, export the traces yourself, for example with a scheduled Lambda that reads them through the X-Ray API and writes them to S3. Distractor: "configure X-Ray retention to 1 year". Retention is not configurable.

A DOP-C02 distractor offers Amazon QuickSight for "unified observability dashboards". QuickSight is for business intelligence over data warehouses, not for operational trace-and-metric correlation. The right answer for a DevOps single-pane view is CloudWatch ServiceLens (X-Ray + CloudWatch fused) plus optionally Amazon Managed Grafana for cross-source dashboards. Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ServiceLens.html

DOP-C02 Exam Patterns and Worked Scenarios

Scenario 1: Find the Slow Microservice

Stem: "API P99 is 4 seconds across a 6-service chain; team cannot tell which service is slow." Right: enable X-Ray on every service via Active tracing (Lambda), daemon (EC2/ECS), or ADOT (EKS); review the service map; pick the node with the highest latency and drill into traces.

Scenario 2: Trace a Request Across SQS Boundary

Stem: "Trace appears to end at the first Lambda; SQS consumer creates a new root trace." Right: at SendMessage time, populate the AWSTraceHeader message system attribute with the current trace context; the consumer SDK auto-extracts it.

Scenario 3: Cost Control on a High-Traffic API

Stem: "X-Ray bill is too high at 1000 RPS." Right: define a custom sampling rule with a small reservoir (1) and low fixed rate (1-5%); exclude /healthcheck entirely with a 0/0 rule.

Scenario 4: Detect Latency Anomalies Automatically

Stem: "Want automatic detection of unusual latency clusters without manual dashboard checking." Right: enable X-Ray Insights on the relevant Group; route insight events via EventBridge to SNS or Incident Manager.

Scenario 5: Cross-Vendor Telemetry

Stem: "Team uses Datadog for metrics and X-Ray for traces; wants single instrumentation library." Right: AWS Distro for OpenTelemetry (ADOT), with the collector exporting traces to X-Ray and metrics to Datadog.

FAQ

Q1: When should I use X-Ray vs CloudWatch Logs Insights?

X-Ray for request-path tracing across services — when the question is "which service or downstream call caused this latency". Logs Insights for log content queries within a service — when the question is "what was the error message" or "how many users hit this endpoint". They are complementary; ServiceLens fuses them.

Q2: How do I keep X-Ray cost down at high traffic?

Use a custom sampling rule with a small reservoir (1 req/sec) and low fixed rate (1-5%). Exclude /healthcheck and similar endpoints entirely. Sampling decisions propagate down the chain, so configuring the entry point covers the whole trace.

Q3: Should I use X-Ray SDK or ADOT for new code?

For new code, ADOT — it is the strategic direction, gives vendor neutrality, and supports both traces and metrics in one pipeline. For existing code with the X-Ray SDK working, no migration needed unless you need OTel features.

Q4: How do I propagate trace context across SQS?

Producer sets the AWSTraceHeader message attribute. Consumer's SDK extracts it. Without this attribute, the SQS consumer creates a new root trace, breaking the chain.

Q5: Is the X-Ray daemon required for Lambda?

No. Lambda's runtime includes the daemon. You only enable Active tracing on the function and (optionally) add the X-Ray SDK or AWS Lambda Powertools for subsegments.

Q6: How long does X-Ray retain traces?

30 days, not configurable. For longer retention, run a scheduled Lambda that reads traces through the X-Ray API and writes them to S3 with lifecycle policies.

Q7: Can X-Ray trace traffic between accounts?

Yes — the trace context header propagates naturally across HTTP and AWS SDK calls regardless of account. Each account stores its own segments. To view a unified service map across accounts, use CloudWatch cross-account observability to share X-Ray traces with a monitoring account.

Cross-References

  • CloudWatch metrics and Logs Insights are the metric and log layers ServiceLens fuses with X-Ray; see cloudwatch-metrics-logs-insights.
  • CloudWatch alarms and EventBridge consume X-Ray Insight events for automated escalation; see cloudwatch-alarms-eventbridge-integration.
  • Deployment failure troubleshooting uses X-Ray to correlate failed deployment requests with downstream slowness; see deployment-failure-troubleshooting.
  • CloudTrail and Config dashboards complement X-Ray for governance and audit visibility; see cloudtrail-config-audit-dashboards.
  • Incident Manager and Health consume X-Ray Insight events as response triggers; see systems-manager-incident-manager-health.
