Cloud Monitoring and Cloud Trace — GCP PCD Study Notes

Introduction to Cloud Monitoring and Cloud Trace

Cloud Monitoring and Cloud Trace are the two observability pillars the Professional Cloud Developer exam expects you to wire into every workload before you mark it production-ready. Cloud Monitoring collects time-series metrics, runs alerting policies, hosts dashboards, and tracks SLOs and error budgets. Cloud Trace ingests distributed spans, computes latency percentiles per RPC, and surfaces the slow request that is dragging your tail latency upward. Together with Cloud Profiler and the OpenTelemetry SDK, they form the developer-facing layer of Google Cloud's operations suite (formerly Stackdriver).

This study note walks through metric types, the MQL query language, dashboard structure, every alerting policy variant, uptime checks, SLO and error budget math, the Trace sampling model with its 30-day retention, the three Cloud Profiler profile types, OpenTelemetry wiring, custom metrics, the metrics scope replacement for Workspaces, and notification channel routing. Every section ties to a concrete GCP API, gcloud command, or numeric default that shows up on the PCD exam.

Plain English Explanation

Before diving into Cloud Monitoring and Cloud Trace internals, three concrete pictures make the moving parts click.

Think of Cloud Monitoring Like the Cockpit of a Commercial Airliner

A 787 cockpit has hundreds of gauges, but the pilot only stares at six during cruise: altitude, airspeed, attitude, heading, vertical speed, and fuel. The rest are wired into automated alerts that page the pilot only when something exceeds a threshold. Cloud Monitoring works the same way. You collect thousands of metrics from Compute Engine, Cloud Run, GKE, and your application code, but a healthy dashboard surfaces six to ten golden signals (request rate, latency p95, error rate, saturation, queue depth, business KPI). The rest live behind alerting policies that wake you only when a real threshold trips. Pile every metric onto a single dashboard and you have a cockpit with no pilot in it, because the human visual system stops absorbing detail past about a dozen elements.

Think of Cloud Trace Like a Receipt from a Multi-Stage Tax Filing

When a tax form passes through preparer, reviewer, manager, partner, then back to the client portal, each handoff stamps a timestamp on the cover sheet. Six weeks later when a client asks "why did this take so long," the partner reads the stamps and points at the reviewer who held the file for nine days. Cloud Trace stamps the same kind of receipt on every HTTP request. Each microservice that touches the request adds a span with start time, end time, and metadata. When tail latency creeps up, you open the Trace waterfall, see that the Spanner call took 1.8 seconds instead of 80 milliseconds, and you know exactly where to look. Without Trace, you are reading a tax return with no signatures and guessing whose fault the delay was.

Think of an SLO and Error Budget Like a Monthly Data Plan

Your phone plan promises 50 GB per month. If you stream 49 GB on day 28, you slow down on day 29. If you stream 30 GB by day 15, you have room to binge-watch on the flight. An SLO works identically. You promise the user "99.9 percent of requests succeed in 30 days," which budgets 43.2 minutes of allowed failure that month. Every 5xx response burns budget. When the burn rate hits the alerting policy threshold (typically 14.4x for fast burn, 6x for slow burn), an incident fires and the team stops shipping risky changes until the budget recovers. Without an SLO, every reliability conversation devolves into opinion. With one, it becomes arithmetic.

Core Concepts of Cloud Monitoring and Cloud Trace

The PCD exam expects fluency with the following objects and their precise semantics.

Metric Descriptor

A metric descriptor identifies a time series by its type name (for example compute.googleapis.com/instance/cpu/utilization), its value type (BOOL, INT64, DOUBLE, STRING, DISTRIBUTION), its kind (GAUGE, DELTA, CUMULATIVE), and its monitored resource. Cloud Monitoring ships thousands of built-in descriptors and lets you register custom descriptors under the custom.googleapis.com/ or workload.googleapis.com/ prefix.

Monitored Resource

A monitored resource is the labelled object the metric is recorded against, such as gce_instance, k8s_container, cloud_run_revision, or generic_task. Each resource type carries its own label set (project_id, instance_id, zone, etc.), which is how Monitoring scopes a single time series in a sea of millions.

Alerting Policy

An alerting policy ties one or more conditions to a notification channel set and a documentation block. Conditions fall into six families: metric threshold, metric absence, forecast, MQL, PromQL, and log-based. The policy opens an incident when conditions evaluate true and auto-closes when they recover.

Uptime Check

An uptime check is a synthetic HTTP, HTTPS, TCP, or ICMP probe issued from six fixed global locations (USA-OR, USA-VA, USA-IA, Europe, South America, Asia-Pacific). The default cadence is once per minute and the timeout is five seconds. Each check exposes a metric you can alert on.

Span and Trace

A span is a timed operation with a start timestamp, end timestamp, name, attributes, and a parent span ID. A trace is a directed tree of spans sharing a 16-byte trace ID. Cloud Trace retains spans for 30 days by default and supports up to 1000 spans per trace.

SLI, SLO, and Error Budget

An SLI is a measurable proxy for user happiness, typically request success rate or request latency. An SLO is the target value for that SLI over a rolling or calendar window (commonly 28 days). The error budget is the complement of the SLO: a 99.9 percent SLO leaves 0.1 percent of the window as allowed failure, which over 28 days (40,320 minutes) comes to 40.32 minutes, or 40 minutes and 19 seconds.
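The budget arithmetic is easy to verify by hand. The following is a minimal Python sketch; the function name error_budget_minutes is illustrative and not part of any Google Cloud library.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Return the minutes of total failure an SLO allows over a window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    window_minutes = window_days * 24 * 60
    # The budget is the share of the window NOT covered by the SLO target.
    return window_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    for days in (28, 30):
        minutes = error_budget_minutes(99.9, days)
        print(f"99.9 percent over {days} days: {minutes:.2f} minutes")
```

The same function reproduces both figures quoted in these notes: about 40.32 minutes for a 28-day window and 43.2 minutes for a 30-day window.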

Metrics Scope (Formerly Workspace)

A metrics scope is the project-bound visibility boundary for Cloud Monitoring. Each Google Cloud project starts with a metrics scope that contains itself. You add up to 375 monitored projects to a single scoping project, which lets one operations team observe many application projects without switching context.

Metric Types and the MQL Query Language

Knowing the metric kind and value type is half the battle when writing an alert that does not flap.

Metric Kinds: GAUGE, DELTA, CUMULATIVE

A GAUGE reports the current value at the sample time (CPU percent, queue depth). A DELTA reports the change since the previous sample over a fixed alignment window (HTTP requests per minute). A CUMULATIVE reports a monotonically increasing total since process start (bytes_sent_count). When you build a chart or alert in the console, Monitoring usually picks a rate aligner for CUMULATIVE metrics; in a hand-written query you must align explicitly, and forgetting to do so is a common reason a dashboard renders a raw climbing counter or a flat line instead of a rate.
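To make the CUMULATIVE-to-rate idea concrete, here is a small Python sketch of what a rate alignment does conceptually. It is not how Cloud Monitoring implements alignment; the function name and the restart-handling rule are assumptions made for illustration.

```python
from typing import List, Tuple


def cumulative_to_rate(samples: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Convert (timestamp_seconds, cumulative_total) samples to per-second rates.

    A drop in the counter is treated as a process restart, so the new value
    is counted from zero instead of producing a negative rate.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        increase = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, increase / elapsed))
    return rates
```

Plotting the raw totals would show an ever-rising line; plotting the output of this function shows the per-second rate that an operator actually wants to alert on.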

Value Types and the DISTRIBUTION Type

Most metrics are INT64 or DOUBLE, but latency metrics are typically DISTRIBUTION. A DISTRIBUTION carries a histogram of bucketed samples, which is how Monitoring can compute p50, p95, and p99 percentiles without storing every raw sample. The percentile() MQL function expects a DISTRIBUTION input; running it against a DOUBLE returns an error.
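The reason a histogram is enough to answer percentile questions can be shown in a few lines of Python. This is a sketch under a stated assumption, namely that samples are spread evenly inside each bucket; the bucket representation and the function name are hypothetical and do not mirror the Monitoring API's Distribution message.

```python
from typing import List, Tuple


def histogram_percentile(buckets: List[Tuple[float, float, int]], pct: float) -> float:
    """Estimate a percentile from (lower_bound, upper_bound, count) buckets.

    Uses linear interpolation inside the bucket that contains the target
    rank, so accuracy depends on how wide the buckets are.
    """
    total = sum(count for _, _, count in buckets)
    if total == 0:
        raise ValueError("histogram is empty")
    target = total * pct / 100
    seen = 0
    for lower, upper, count in buckets:
        if count and seen + count >= target:
            fraction = (target - seen) / count
            return lower + (upper - lower) * fraction
        seen += count
    return buckets[-1][1]


if __name__ == "__main__":
    latency_ms = [(0, 100, 50), (100, 200, 40), (200, 400, 10)]
    for p in (50, 95, 99):
        print(f"p{p} is roughly {histogram_percentile(latency_ms, p):.0f} ms")
```

Narrower buckets near the latency range you care about give tighter percentile estimates, which is worth remembering when defining a custom DISTRIBUTION metric.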

MQL Basics

Monitoring Query Language is the SQL-of-metrics. A minimal query looks like:

fetch cloud_run_revision::run.googleapis.com/request_count
| filter resource.service_name == 'checkout-api'
| align rate(1m)
| every 1m
| group_by [resource.location], sum(value.request_count)

The pipeline reads as fetch resource and metric, filter rows, align to a time window, sample every cadence, group by labels with an aggregation. MQL supports joins across metrics, which lets you compute ratios like error rate equals 5xx count over total count in a single query.

When to Pick PromQL Instead

Cloud Monitoring also accepts PromQL queries for managed Prometheus and for general alerting. Teams migrating from self-hosted Prometheus usually keep PromQL for muscle memory; greenfield GCP-native teams pick MQL for tighter integration with built-in metrics and label hierarchies.

Cloud Monitoring metrics ingestion is free for Google-built metrics. Custom metrics and Logs-based metrics over the free tier ($0.2580 per MB ingested in 2025) are billed, and chart sample queries past 1 million per project per month also incur cost. Plan dashboards to fetch what you display, not every metric in the project.

Dashboards and Visualisation

A dashboard turns a sea of metrics into a story your team can scan in 30 seconds.

Built-in Versus Custom Dashboards

Cloud Monitoring ships pre-built dashboards for every major Google service (Cloud Run, GKE, Cloud SQL, Pub/Sub). These appear automatically and require zero setup. Custom dashboards live in the dashboards API and can be created via UI, JSON, Terraform (google_monitoring_dashboard), or gcloud monitoring dashboards create --config-from-file=dashboard.json.

A dashboard widget is either an XY chart (line, stacked, bar), a scorecard (single big number with sparkline), a gauge, a heatmap (for DISTRIBUTION metrics), an alert chart (overlays incident history), a text widget for Markdown notes, or an embedded log panel. The mix-and-match matters: a scorecard for current error rate plus an XY chart of the trailing seven days conveys far more than two line charts.

Mosaic Layout and Section Headers

Dashboards now support mosaic layout, which packs widgets without the rigid 12-column grid of the legacy view, and section headers that group widgets into collapsible bands. Teams use sections to mirror the four golden signals (traffic, latency, errors, saturation) on every service dashboard.

Dashboards live inside a project and are visible to anyone with roles/monitoring.viewer. Store the JSON in Git so dashboard drift is reviewable, and use Terraform for promotion across dev and prod projects. A dashboard ID never changes, which lets you embed a stable URL in runbooks.

Alerting Policies in Depth

The PCD exam loves to ask which condition type fits which scenario.

Metric Threshold Condition

The standard variant: fire when metric.value > X for at least duration minutes. You configure the aligner (mean, max, sum, percentile), the reducer (mean across all instances, percentile across all instances), the threshold, and the duration window. A 5-minute duration prevents single-sample spikes from paging the team.
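The effect of the duration window can be sketched in Python. This is an illustration of the rule "above threshold for at least N seconds", not the evaluator Cloud Monitoring runs; the function name and the sample format are assumptions.

```python
from typing import List, Tuple


def sustained_breach(samples: List[Tuple[float, float]], threshold: float,
                     duration_s: float) -> bool:
    """Return True if the value stays above threshold for at least duration_s.

    samples are (timestamp_seconds, value) pairs in ascending time order.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # any recovery resets the window
    return False
```

A single 95 percent CPU sample surrounded by healthy samples returns False with a 300-second duration, while six consecutive minutes above the threshold return True.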

Metric Absence Condition

Fires when an expected metric stops reporting for the configured duration. Use this to detect an agent that crashed, a Cloud Run revision that scaled to zero unexpectedly, or a cron job that did not emit its heartbeat metric. Without absence conditions, a failed agent looks like "the metric is below threshold" forever.

Forecast Condition

Cloud Monitoring's forecast condition uses a built-in linear regression model to predict whether a metric will breach a threshold within a future window (typically 1 to 24 hours). The classic use case is disk fill-up: a forecast that says "disk will be 100 percent full in 6 hours" pages the team while there is still time to expand the disk.
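The disk fill-up case can be approximated with an ordinary least-squares line. The Python sketch below is purely illustrative and does not reproduce the model Cloud Monitoring runs; the function name, the sample format, and the assumption of an upper threshold are all hypothetical.

```python
from typing import List, Optional, Tuple


def seconds_until_threshold(samples: List[Tuple[float, float]],
                            threshold: float) -> Optional[float]:
    """Fit a least-squares line to (timestamp_seconds, value) samples and
    return the seconds from the last sample until the line reaches threshold.

    Returns None when the trend is flat or falling, and 0.0 when the fitted
    line is already at or past the threshold.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    fitted_now = slope * samples[-1][0] + intercept
    if fitted_now >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - fitted_now) / slope
```

A policy built on this idea would page when the returned value drops below the forecast window, for example below 6 hours expressed in seconds.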

MQL and PromQL Conditions

For anything more complex than a single-metric threshold, an MQL or PromQL condition lets you express arbitrary math across multiple time series. The exam favourite is the error rate ratio: 5xx_count / total_count > 0.01 for 5 minutes. A single MQL condition replaces what would otherwise require multi-condition policies and complex AND/OR logic.

Log-Based Alert

A log-based alert fires on the count of log entries matching a Cloud Logging query, not on a metric value. Use it for one-off signals like "we got a NullPointerException with this stack frame" that do not warrant a permanent log-based metric.

Always populate the alerting policy documentation field with a Markdown runbook link, the MQL query to diagnose, and the rollback command. An alert that wakes someone at 3 AM without a runbook is worse than no alert at all, because the on-call engineer wastes 20 minutes hunting for context that should have been one click away.
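As a sketch of what a documented policy looks like on the wire, the Python below assembles a threshold policy as a dictionary using field names from the Monitoring API's AlertPolicy resource. The helper name, the example filter, the runbook URL, and the channel name are placeholders invented for illustration.

```python
import json


def build_threshold_policy(display_name: str, metric_filter: str,
                           threshold: float, runbook_url: str,
                           channels: list) -> dict:
    """Assemble an AlertPolicy-shaped dict with a 5-minute threshold condition."""
    return {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [{
            "displayName": f"{display_name} condition",
            "conditionThreshold": {
                "filter": metric_filter,
                "comparison": "COMPARISON_GT",
                "thresholdValue": threshold,
                "duration": "300s",  # must hold for 5 minutes before firing
                "aggregations": [{
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_MEAN",
                }],
            },
        }],
        # The documentation block is what the on-call engineer sees first.
        "documentation": {
            "content": f"Runbook: {runbook_url}",
            "mimeType": "text/markdown",
        },
        "notificationChannels": channels,
    }


if __name__ == "__main__":
    policy = build_threshold_policy(
        "High CPU",
        'resource.type = "gce_instance" AND '
        'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        0.9,
        "https://example.com/runbooks/high-cpu",  # placeholder URL
        ["projects/PROJECT/notificationChannels/CHANNEL_ID"],  # placeholder
    )
    print(json.dumps(policy, indent=2))
```

Keeping the policy as generated JSON in version control makes the runbook link reviewable alongside the threshold it belongs to.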

Uptime Checks and Synthetic Monitoring

Real users on six continents do not care if your service is healthy from the data centre next door.

HTTP, HTTPS, TCP, and ICMP Checks

Cloud Monitoring uptime checks ship as HTTP (GET or POST), HTTPS, TCP (probes a port for connection success), or ICMP (ping). HTTPS checks can optionally validate the certificate chain and fire an alert when expiry is under 15 days away, which catches the renewal cron job that silently failed.

Six Global Probe Locations

Every uptime check fires from six fixed regions: USA-OR (Oregon), USA-VA (Virginia), USA-IA (Iowa), Europe (Belgium), South America (São Paulo), and Asia-Pacific (Singapore). The default success threshold is "at least 2 of 6 locations report success," which tolerates regional ISP weirdness without false alarms.

Synthetic Monitors for Multi-Step Flows

When a single GET against /health is not enough, Cloud Monitoring Synthetic Monitors run a Node.js or Python script (typically using Puppeteer or Playwright) that walks through login, add-to-cart, and checkout. Synthetic monitors execute every 1 to 60 minutes and emit a custom metric you can alert on.

Uptime Check Pricing and Quotas

Uptime checks are free up to 100 checks per project and $0.30 per check per month above that. Synthetic monitors invoke a Cloud Functions execution under the hood and bill accordingly. The PCD exam rarely tests the price but does test that uptime checks are single-request protocol probes and cannot perform login flows; that is synthetic monitors' job.

SLOs and Error Budgets

Service monitoring is the Cloud Monitoring sub-product that materialises SLI/SLO concepts as first-class objects.

Defining an SLI

You define an SLI on a Service object (Cloud Run service, GKE workload, Istio service, or a custom BasicService you register). The SLI is one of: request-based availability, request-based latency, windows-based availability, or custom MQL. A latency SLI says "good requests are those that complete under 300 ms"; an availability SLI says "good requests are those without a 5xx response."

Compliance Period and Burn Rate

An SLO specifies the goal (99.9 percent), the compliance period (rolling 28 days is standard), and the calendar versus rolling option. The burn rate is the multiple of the budget you are currently consuming. A burn rate of 1.0x means the budget will deplete exactly at period end; 14.4x means a 28-day budget depletes in just under 2 days (28 days / 14.4 is roughly 46.7 hours).

Multi-Window Multi-Burn-Rate Alerting

The SRE-recommended pattern fires a P1 alert on a 14.4x burn rate sustained for 1 hour (2 percent of a 30-day budget burned in 1 hour) and a P2 alert on a 6x burn rate sustained for 6 hours (5 percent of the budget burned in 6 hours). This catches both fast incidents (database down) and slow regressions (a deploy that quietly added 5 ms of latency) while filtering out single-minute spikes.
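The burn-rate figures follow from simple ratios, which the Python sketch below works through. The function names are illustrative; the formulas assume a constant burn rate and a window expressed in whole days.

```python
def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """Multiple of the sustainable error rate currently being consumed.

    slo_target is a fraction such as 0.999, so the allowed error ratio is
    1 - slo_target.
    """
    allowed = 1 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return observed_error_ratio / allowed


def budget_consumed(rate: float, sustained_hours: float, window_days: int) -> float:
    """Fraction of the whole error budget used by a burn rate held for some hours."""
    return rate * sustained_hours / (window_days * 24)


def hours_to_exhaustion(rate: float, window_days: int) -> float:
    """Hours until the full budget is gone at a constant burn rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return window_days * 24 / rate


if __name__ == "__main__":
    print("14.4x for 1 h on 30 days:", budget_consumed(14.4, 1, 30))
    print("6x for 6 h on 30 days:", budget_consumed(6, 6, 30))
    print("hours to exhaust 30-day budget at 14.4x:", hours_to_exhaustion(14.4, 30))
```

On a 30-day window, 14.4x for one hour uses 2 percent of the budget and 6x for six hours uses 5 percent, while a constant 14.4x burn empties the whole budget in about 50 hours.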

Error Budget Policy

The error budget policy is the team-level agreement that says what happens when budget runs out: feature freezes, mandatory postmortem, paging escalation. Cloud Monitoring tracks the budget but does not enforce the policy; that is on the team's engineering culture.

Burn rate is the ratio at which an SLO's error budget is being consumed relative to the rate that would exhaust it exactly at the end of the compliance window. A 1x burn rate exhausts the budget on schedule; a 14.4x burn rate exhausts a 28-day budget in under 2 days and consumes about 2 percent of a 30-day budget in a single hour, which is the SRE-recommended threshold for a P1 page.

Cloud Trace Internals

Trace is the latency-debugging companion to Monitoring.

Trace Context Propagation

A trace is identified by a 16-byte trace ID generated at the entry point and propagated through every downstream call via the W3C traceparent HTTP header (older code may still use X-Cloud-Trace-Context). Cloud Run, App Engine, and GKE Ingress inject the header automatically; your application must forward it on outbound calls or the trace breaks into disconnected fragments.
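Forwarding the header means keeping the trace ID and replacing the parent span ID with the ID of the current span. The Python sketch below handles only the version 00 layout of the W3C traceparent header; the function names are hypothetical, and in practice an OpenTelemetry propagator does this work for you.

```python
import re
import secrets
from typing import Dict, Optional

# version 00: "00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>"
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Dict[str, str]]:
    """Parse a version 00 traceparent header, or return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip().lower())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming: str) -> Optional[str]:
    """Build the header for an outbound call: same trace ID, fresh span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return None
    new_span_id = secrets.token_hex(8)  # 8 random bytes become 16 hex chars
    return f"00-{parsed['trace_id']}-{new_span_id}-{parsed['flags']}"
```

If the outbound request carries a header produced this way, the downstream service's spans join the same trace instead of starting a new root.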

Sampling

Cloud Trace's default sampling rate on App Engine standard is 0.1 requests per second per instance, which keeps overhead negligible. OpenTelemetry SDKs let you configure sampling with ParentBased(TraceIdRatioBased(0.01)) to sample 1 percent of requests while honouring upstream decisions. Always-on sampling is fine for low-traffic services but multiplies trace ingestion cost on busy ones.
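The decision logic behind a parent-based ratio sampler can be sketched without the SDK. The Python below illustrates the idea only: it compares the low 64 bits of the trace ID with a bound derived from the ratio, and lets an upstream decision win when one exists. The function names are hypothetical and this is not the OpenTelemetry SDK's code.

```python
from typing import Optional

TRACE_ID_LIMIT = 1 << 64


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall under ratio * 2**64.

    The decision depends only on the trace ID, so every service that uses
    the same ratio makes the same choice for the same trace.
    """
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


def parent_based_sampled(trace_id: int, ratio: float,
                         parent_sampled: Optional[bool]) -> bool:
    """Honour the caller's decision if there is one, else apply the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)
```

Passing parent_sampled=None models a root request arriving without trace context, which is the only case where the 1 percent ratio is actually consulted.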

30-Day Retention

Trace spans are retained for 30 days. There is no extended retention; if you need historical latency analysis beyond 30 days, export traces to BigQuery via the Trace export API or use the Trace ListTraces API in a scheduled Cloud Function. The exam often presents a distractor that claims 90 days or 365 days; the correct answer is 30.

Trace Analysis Reports

Cloud Trace ships built-in latency reports comparing two time windows (today vs yesterday, before vs after deploy). A latency regression report highlights endpoints whose p95 increased by more than a configurable threshold, which is the fastest way to confirm a deploy caused a slowdown.

Trace Pricing

Trace ingestion is free for the first 2.5 million spans per month and $0.20 per million above that. A naive OpenTelemetry deployment with sampling at 1.0 can blow through the free tier in a single afternoon on a busy service.

Cloud Profiler for Code-Level Performance

Where Trace shows which RPC is slow, Profiler shows which function inside that RPC burned the CPU.

CPU, Heap, and Contention Profiles

Cloud Profiler collects three profile types continuously: CPU time (which functions consume CPU cycles, as distinct from wall-clock time spent waiting), heap (which allocations dominate memory), and contention (which mutex locks are blocking goroutines or threads). Contention profiling is only available for Go and Java; Python ships CPU and heap only.

Agent Setup

You attach Profiler by linking the language agent: cloud.google.com/go/profiler for Go, google-cloud-profiler pip package for Python, the Java agent JAR for the JVM, or @google-cloud/profiler for Node.js. The agent samples roughly 10 seconds of every minute and ships profiles asynchronously, keeping overhead under 5 percent.

Flame Graph Interpretation

The Profiler UI renders flame graphs where the x-axis is total time consumed and the y-axis is the call stack. Wide bars mean expensive functions; tall stacks mean deep recursion. The "diff" view compares two time windows or two versions of a service, which makes a post-deploy regression jump off the screen.

Profiler Quotas

Profiler is free for all Google Cloud projects. The only practical limit is the 50 deployments per project default (a deployment is the tuple of service name, version, zone). Past that, file a quota increase or consolidate naming.

Three observability tools, three time horizons. Cloud Monitoring metrics retain for 6 weeks (most Google-built metrics) to 24 months (custom metrics). Cloud Trace spans retain for 30 days. Cloud Profiler profiles retain for 30 days. If the question asks about historical analysis past these windows, the answer is BigQuery export.

OpenTelemetry and Custom Metrics

The CNCF-standard SDK is the modern, vendor-neutral way to wire applications into Cloud Monitoring and Cloud Trace.

OpenTelemetry SDK Wiring

OpenTelemetry replaces the older OpenCensus libraries (which were sunset in 2023). You install the language-specific SDK plus the GCP exporter: opentelemetry-exporter-gcp-monitoring and opentelemetry-exporter-gcp-trace for Python, @google-cloud/opentelemetry-cloud-monitoring-exporter for Node.js. The exporter pushes spans to Cloud Trace and metrics to Cloud Monitoring via the same authenticated credentials your service already uses.

Auto-Instrumentation

Each language ships an auto-instrumentation agent that hooks common libraries (Flask, Django, requests, gRPC, JDBC, Express) without code changes. For Python, opentelemetry-instrument python app.py produces a fully traced service with zero edits. The exam favours auto-instrumentation as the right answer for "minimum developer effort" scenarios.

Custom Metrics via OpenTelemetry

Beyond auto-collected metrics, you can emit custom application metrics: orders per second, cache hit rate, queue depth. Define a Counter, UpDownCounter, Histogram, or ObservableGauge on a Meter, increment it in your code, and the exporter flushes it as workload.googleapis.com/your_metric_name in Cloud Monitoring.

Direct API Versus OpenTelemetry

You can also write custom metrics by sending a POST request to monitoring.googleapis.com/v3/projects/PROJECT/timeSeries (the projects.timeSeries.create method of the REST API), but this is verbose and tightly coupled. OpenTelemetry is the recommended path because the same instrumentation works across GCP, AWS, on-premises, and any future backend.
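To show why the direct API route is considered verbose, the Python sketch below builds the JSON body for a single gauge point using field names from the TimeSeries resource. The helper name, the metric type, and the project ID are placeholders; authentication and the HTTP call itself are deliberately left out.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Optional


def build_time_series_body(project_id: str, metric_type: str, value: float,
                           labels: Optional[Dict[str, str]] = None,
                           now: Optional[datetime] = None) -> dict:
    """Return a request body holding one GAUGE point for a custom metric."""
    now = now or datetime.now(timezone.utc)
    end_time = now.strftime("%Y-%m-%dT%H:%M:%SZ")  # RFC 3339 timestamp in UTC
    return {
        "timeSeries": [{
            "metric": {"type": metric_type, "labels": labels or {}},
            # "global" is the catch-all monitored resource type.
            "resource": {"type": "global", "labels": {"project_id": project_id}},
            "points": [{
                "interval": {"endTime": end_time},
                "value": {"doubleValue": float(value)},
            }],
        }]
    }


if __name__ == "__main__":
    body = build_time_series_body(
        "my-project",  # placeholder project ID
        "custom.googleapis.com/checkout/queue_depth",  # placeholder metric
        17,
        labels={"queue": "orders"},
    )
    print(json.dumps(body, indent=2))
```

Every field here has to be assembled by hand for each point, which is the coupling that an OpenTelemetry Meter hides behind a single instrument call.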

OpenTelemetry is now the official replacement for OpenCensus across Google Cloud. New code should never start with OpenCensus libraries even though they still function; the OpenCensus repo went read-only and security patches stopped in 2023. The PCD exam released in 2024 onwards uses OpenTelemetry as the canonical answer for instrumentation questions.

Metrics Scopes and Multi-Project Visibility

The Workspaces rebrand is the most-asked exam topic from this lesson.

From Workspaces to Metrics Scopes

Google renamed Workspaces to Metrics Scopes in 2022. The underlying model is identical: a single project (the scoping project) aggregates metrics from many monitored projects, giving an operations team a single pane of glass. Old API and SDK references to "Workspace" still work but are now deprecated terminology.

Scoping Project and Monitored Projects

The scoping project is the project where the Monitoring console is rooted. Monitored projects are the projects whose metrics, traces, and uptime checks become visible inside that scope. Unlike legacy Workspaces, where a project could belong to only one Workspace, a project can be monitored by more than one metrics scope, so two operations teams can each view it from their own scoping project.

Limits and Hierarchy

A scoping project supports up to 375 monitored projects. Adding a project requires roles/monitoring.editor on the scoping project and roles/monitoring.viewer on the monitored project. Folder and organisation hierarchies do not auto-populate a scope; you add each project explicitly via the API or gcloud monitoring metrics-scopes.

Cross-Project Alerts and Dashboards

Once projects are in the same scope, you can build dashboards and alerts that span them with a single label filter. This is how a platform team monitors 200 application projects from one Monitoring UI without granting blanket IAM at the org level.

Notification Channels and Incident Routing

A great alert that pages no one is a dead alert.

Supported Channel Types

Cloud Monitoring supports Email, SMS (US numbers only), Slack, PagerDuty, Webhook (HTTP POST to any URL), Pub/Sub, Google Chat, Microsoft Teams via webhook, and a handful of third-party integrations (Splunk On-Call, BigPanda). Each channel is a project-scoped resource verified via a confirmation email or OAuth grant.

Webhook and Pub/Sub Channels

Webhook channels POST a JSON incident payload to any URL, which is how teams plug Cloud Monitoring into custom incident-management or auto-remediation systems. Pub/Sub channels publish the same payload to a topic, letting a Cloud Function or Cloud Run service consume the event and take programmatic action (rollback, scale-up, ticket creation).

Severity-Tiered Routing

The most common pattern attaches different channel sets per policy severity: P1 to PagerDuty, P2 to PagerDuty and Slack, P3 to Slack only, P4 to an email digest. A single policy can list multiple channels; ordering does not matter because all channels receive the notification in parallel.

Snoozes and Maintenance Windows

A snooze is a time-bounded suppression that prevents specific policies (filtered by name pattern or labels) from creating new incidents during a window. Snooze before a planned deploy, a database migration, or any change you know will trip alerts. Snoozes do not delete history; they mute the channel layer for the configured period.

A common exam distractor pairs "uptime check" with "synthetic monitor" and asks which one tests a multi-step login flow. Uptime checks are stateless probes (HTTP/HTTPS/TCP/ICMP) and cannot fill forms or click buttons. Synthetic monitors run a Puppeteer or Playwright script and can. If the question mentions "user journey," "shopping cart," or "post-login API," the answer is synthetic monitor, not uptime check.

Common Pitfalls and Trade-offs

A CUMULATIVE metric viewed without rate() or delta() renders as either a constantly climbing line (raw counter) or a flat zero (after auto-align mishandles the kind). The MQL align rate(1m) or the chart's built-in "rate" aligner turns the counter into per-second values. Forgetting this is the most reported "my dashboard is broken" support ticket.

Sampling Too High in Cloud Trace

Setting the OpenTelemetry sampler to AlwaysOn on a service handling 1000 requests per second exports roughly 2.6 billion spans per 30-day month even at only one span per request (1000 x 86,400 x 30 = 2.592 billion). At $0.20 per million past the 2.5M free tier that is over $500 per month for span ingestion alone, and real traces with several spans each multiply the bill. Default to TraceIdRatioBased(0.01) (1 percent) and increase only when debugging an active incident.
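The cost estimate can be reproduced with a short Python sketch that uses the price and free tier quoted in these notes. The function name and the spans_per_request parameter are illustrative assumptions, and a 30-day month is assumed.

```python
FREE_SPANS_PER_MONTH = 2_500_000
PRICE_PER_MILLION_SPANS = 0.20  # USD, as quoted in these notes
SECONDS_PER_30_DAY_MONTH = 30 * 24 * 60 * 60


def monthly_trace_cost(rps: float, sample_ratio: float,
                       spans_per_request: float = 1.0) -> float:
    """Estimate monthly span ingestion cost in USD for a steady request rate."""
    spans = rps * SECONDS_PER_30_DAY_MONTH * spans_per_request * sample_ratio
    billable = max(0.0, spans - FREE_SPANS_PER_MONTH)
    return billable / 1_000_000 * PRICE_PER_MILLION_SPANS


if __name__ == "__main__":
    for ratio in (1.0, 0.01):
        cost = monthly_trace_cost(1000, ratio)
        print(f"sample ratio {ratio}: about ${cost:.2f} per month")
```

With one span per request, always-on sampling at 1000 RPS comes to roughly $518 per month, while 1 percent sampling comes to under $5.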

Alerting on Single-Sample Spikes

A condition like "CPU above 90 percent right now" fires on every transient spike during garbage collection or a burst of legitimate traffic. Always pair the threshold with a duration window of 5 minutes or more, which forces the spike to be sustained before the policy opens an incident.

Missing Trace Context Propagation

If your code uses a custom HTTP client that does not forward the traceparent header, traces split into multiple disconnected fragments and the waterfall view becomes useless. Verify propagation by inspecting outbound requests in Trace; if the second service shows up as a separate root trace, propagation is broken.

Confusing Cloud Profiler with Cloud Trace

The PCD exam often pairs these. Trace shows you which RPC is slow (the gRPC call to Spanner took 1.8 seconds). Profiler shows you which function inside that RPC is slow (the JSON serialiser burned 1.6 seconds of CPU on a 500 KB payload). You need both for full-stack performance debugging.

Best Practices

A short list distilled from Google Cloud Architecture Center and real PCD exam questions.

Instrument with OpenTelemetry from Day One

Adding instrumentation after the fact requires touching every service. Wire OpenTelemetry auto-instrumentation into the base container image, set the GCP exporter, and every new service gets metrics and traces for free.

One Service, One Dashboard, Four Golden Signals

Each service owns one dashboard rendering traffic (requests per second), errors (error rate percent), latency (p50, p95, p99), and saturation (CPU, memory, queue depth). Extra dashboards are fine for deep-dive context but the golden signals dashboard is the on-call landing page.

SLOs Before Alerts

Define the SLO first, then derive the alerting policy from the burn rate. Reverse order (alerts first, SLOs later) produces inconsistent thresholds and alert fatigue.

Document Every Alert

The documentation Markdown field is non-optional in any mature operation. Include the symptom, the impact, the diagnostic MQL query, the rollback command, and the escalation path.

Once a quarter, fire a test alert through the production channel chain and confirm PagerDuty escalation works, the runbook link resolves, and the on-call engineer can reach the rollback command in under 5 minutes. Untested incident response is a liability disguised as confidence.

Exam Tips

The PCD exam reuses a small set of question patterns for Cloud Monitoring and Trace.

Pattern: Distributed Tracing in Microservices

If the question mentions "latency in a microservices architecture" or "bottleneck across services," the answer is Cloud Trace with OpenTelemetry instrumentation, not Cloud Logging and not Cloud Profiler.

Pattern: Code-Level CPU Hot Spot

If the question mentions "which function consumes the most CPU" or "inefficient code path," the answer is Cloud Profiler, not Trace.

Pattern: Forecasting Slow-Burn Failures

If the question says "disk will run out in hours" or "alert before the system breaks," the answer is a forecast alerting condition, not a static threshold.

Pattern: Multi-Project Visibility

If the question asks how to view metrics from 50 projects in one dashboard, the answer is a metrics scope with the 50 projects added as monitored projects, not federated dashboards and not exporting to BigQuery.

Pattern: Custom Application Metrics

If the question asks about emitting business KPIs (orders, revenue, queue depth) into Cloud Monitoring, the answer is OpenTelemetry custom metrics, not Cloud Logging structured logs, not BigQuery streaming inserts.

For any "minimum overhead" question about Trace, the answer involves sampling. For any "complete visibility" question, the answer involves OpenTelemetry auto-instrumentation. Match the question's adjective to the mechanism and you will pick the right answer even when distractors look superficially correct.

Real-World Use Case: E-Commerce Checkout Latency

Consider a Cloud Run checkout service exposing /checkout that calls product inventory (Cloud Run), payment authorisation (third-party API), and order persistence (Spanner). After a deploy on Tuesday, p95 latency rose from 280 ms to 720 ms but no error rate alert fired.

The on-call engineer opens the Cloud Trace latency report, selects "since deploy" versus "previous 24 hours," and sees that the inventory RPC p95 jumped from 40 ms to 460 ms. Drilling into a single slow trace shows a Spanner query inside inventory now reads 12 partitions instead of 1. Cloud Profiler's diff view between the two deploys confirms a new LIKE '%term%' query bypassing the composite index. The fix is a one-line query rewrite, deployed within 30 minutes of the initial Trace investigation.

Without Trace, the team would have spent hours guessing which microservice regressed. Without Profiler, they would have known the inventory service was slow but not why. Without an MQL latency SLO, they would not have known until customers complained.

Frequently Asked Questions (FAQ)

What is the retention period for Cloud Trace spans?

Cloud Trace retains spans for 30 days. There is no extended retention option in the product itself. For longer historical analysis, export traces to BigQuery via the Trace API and a scheduled Cloud Function, or rely on the latency report's pre-aggregated metrics which Cloud Monitoring keeps for 24 months.

What is the difference between Cloud Monitoring and Cloud Logging?

Cloud Monitoring stores numeric time-series metrics and runs alerts on them. Cloud Logging stores text or JSON log entries from applications and infrastructure. Logging supports log-based metrics that extract numbers from log fields, which then flow into Monitoring. Choose Monitoring for "how many requests, how slow"; choose Logging for "what exactly happened in this request."

How do I send a custom application metric to Cloud Monitoring?

The recommended path is the OpenTelemetry SDK with the GCP exporter. Define a Counter or Histogram instrument on a Meter, increment it from your code, and the exporter flushes it as workload.googleapis.com/your_metric_name. The alternative is sending a POST request to monitoring.googleapis.com/v3/projects/PROJECT/timeSeries (the projects.timeSeries.create REST method) directly, but that is verbose and not portable to other backends.

Can a single alerting policy notify multiple notification channels?

Yes. An alerting policy can list any number of notification channels. All channels receive the notification in parallel when an incident opens. The common pattern is severity-tiered routing: a P1 policy notifies PagerDuty plus a Slack incident channel, a P2 policy notifies Slack only.

What is the difference between an uptime check and a synthetic monitor?

Uptime checks are stateless HTTP, HTTPS, TCP, or ICMP probes that run at a configurable interval of 1, 5, 10, or 15 minutes from six fixed global locations. They cannot perform multi-step workflows. Synthetic monitors run a Node.js or Python script (typically Puppeteer or Playwright) that walks through login, form fills, and assertion steps, which makes them suitable for testing critical user journeys end-to-end.

How does Cloud Profiler differ from Cloud Trace?

Cloud Trace measures latency between services and shows you which RPC is slow. Cloud Profiler measures CPU, heap, and contention inside a single process and shows you which function is slow. Trace answers "which microservice regressed"; Profiler answers "which line of code in that service regressed."

What replaced Workspaces in Cloud Monitoring?

Metrics scopes. The concept is identical: a scoping project aggregates metrics from up to 375 monitored projects, giving an operations team a single observability surface. The rename happened in 2022 and old "Workspace" API references still function but are deprecated.

How do I avoid blowing my Cloud Trace budget with too many spans?

Configure the OpenTelemetry sampler to ParentBased(TraceIdRatioBased(0.01)) for 1 percent sampling. Increase the rate temporarily during incident debugging using a feature flag or a Cloud Run environment variable, then drop back to 1 percent. AlwaysOn sampling on a 1000 RPS service costs over $500 per month in span ingestion.
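The "over $500" figure can be checked with a short calculation. This is a rough sketch with loud assumptions: one span per request, a 30-day month, a price of $0.20 per million ingested spans, and a free allotment of 2.5 million spans per month. Verify those pricing numbers against the current Cloud Trace price list before relying on them. The helper also reads the sampling ratio from OTEL_TRACES_SAMPLER_ARG, the standard OpenTelemetry environment variable that SDKs consult when OTEL_TRACES_SAMPLER is set to parentbased_traceidratio.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds in a 30-day month


def monthly_span_cost(rps, sample_rate, spans_per_request=1,
                      price_per_million=0.20, free_spans=2_500_000):
    """Estimate the monthly span ingestion cost in USD (assumed pricing)."""
    spans = rps * SECONDS_PER_MONTH * spans_per_request * sample_rate
    billable = max(0.0, spans - free_spans)
    return billable / 1_000_000 * price_per_million


def sample_rate_from_env(env, default=0.01):
    """Read the sampling ratio from the environment, clamped to [0, 1]."""
    raw = env.get("OTEL_TRACES_SAMPLER_ARG")
    try:
        rate = float(raw) if raw is not None else default
    except ValueError:
        rate = default
    return min(1.0, max(0.0, rate))


if __name__ == "__main__":
    import os

    rate = sample_rate_from_env(os.environ)
    print(f"sampling ratio in effect: {rate}")
    print(f"AlwaysOn at 1000 RPS: ${monthly_span_cost(1000, 1.0):.2f}")
    print(f"1% sampling at 1000 RPS: ${monthly_span_cost(1000, 0.01):.2f}")
```

Under these assumptions AlwaysOn comes to roughly $518 per month and 1 percent sampling to under $5. Real services usually emit several spans per request, so the gap is wider in practice.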

Cloud Logging and Error Reporting — pairs metrics-based observability with log and exception aggregation.
Cloud-Native Design Patterns — context for why golden signals and SLOs matter in microservice architectures.
Testing in the Cloud Environment — synthetic monitors as a continuous testing extension.
