examlab .net The most efficient path to the most valuable certifications.

Cloud Monitoring and Observability

Master Google Cloud Monitoring to gain visibility into your infrastructure and applications using metrics, dashboards, and MQL.

Introduction to Observability on GCP

Observability is more than just "monitoring." It is the ability to understand the internal state of a system based on its external outputs. In Google Cloud, Cloud Monitoring provides the foundation for observability by collecting metrics, events, and metadata from your Google Cloud resources and applications. A Professional Cloud Architect uses these tools to ensure reliability, performance, and cost-efficiency.

Plain-Language Explanation: Cloud Monitoring

Analogy 1 — The Car Dashboard

Monitoring is like the dashboard in your car. The speedometer (Metric) tells you how fast you are going. The fuel gauge tells you how much resource you have left. The "Check Engine" light (Alert) tells you something is wrong right now, even if you don't know exactly what it is yet.

Analogy 2 — The Health Monitor (Smartwatch)

Observability is like wearing a smartwatch that tracks your heart rate, sleep patterns, and oxygen levels. If your heart rate spikes, the watch doesn't just say "Danger"; it helps the doctor (Architect) look at the patterns over time to see if it happened during exercise or while resting, helping diagnose the root cause.

Analogy 3 — The Air Traffic Control Tower

Cloud Monitoring is the Air Traffic Control (ATC) tower. The controllers see every plane (Service) on their radar. They can see the altitude (Latency), speed (Throughput), and fuel levels (Resource Usage) of every aircraft. Without ATC, the entire airport (Cloud Infrastructure) would descend into chaos.

Metric: A numerical measurement of a system's attribute over time, such as CPU utilization, request count, or memory usage.


Core Components of Cloud Monitoring

  1. Metrics: Collected automatically for Google Cloud services (GCE, GKE, Cloud SQL). You can also send Custom Metrics from your code.
  2. Dashboards: Visual representations of your metrics. Use Custom Dashboards to group related metrics (e.g., all metrics for the "Payment Service").
  3. Uptime Checks: Periodically check if your application is reachable from global locations.
  4. MQL (Monitoring Query Language): A powerful, expressive language for querying and transforming metric data. It's similar to SQL but optimized for time-series data.

Google Cloud Managed Service for Prometheus (GMP)

For teams already using Prometheus for Kubernetes monitoring, GMP provides a fully managed, globally scalable Prometheus-compatible monitoring solution.

  • No Ops: You don't have to manage Prometheus servers or storage.
  • Hybrid: Monitor GKE, on-premises, and other clouds using the same Prometheus tools.

Observability Best Practices

  • Monitor the "Golden Signals": Latency, Traffic, Errors, and Saturation.
  • Log-based Metrics: Turn your logs into metrics (e.g., "Count the number of 404 errors in the logs") to bridge the gap between logging and monitoring.
  • Distributed Tracing: Use Cloud Trace to see how a single request travels through multiple microservices to find the exact source of latency.

Architect's Insight: For the PCA exam, if a scenario involves monitoring a "multi-cloud" or "hybrid" environment, Cloud Monitoring with the Ops Agent installed on external VMs is the primary solution.


Metrics Explorer and Monitoring Query Language (MQL)

Metrics Explorer is the entry point for ad-hoc metric investigation in Cloud Monitoring. It exposes a builder UI where you choose a resource type (e.g., gce_instance, k8s_container, https_lb_rule), a metric type (e.g., compute.googleapis.com/instance/cpu/utilization), filter labels, and an aligner (e.g., ALIGN_RATE, ALIGN_MEAN) plus a reducer (e.g., REDUCE_SUM by zone). For 80% of operational debugging, the builder is enough. The other 20% requires MQL.
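
To make the aligner and reducer concrete, here is a minimal Python sketch of the idea (plain Python, not the Cloud Monitoring API). The helper names align_mean and reduce_sum and the sample values are made up for illustration; they only mimic what ALIGN_MEAN over a 60-second period and REDUCE_SUM across series do conceptually.

```python
from collections import defaultdict

def align_mean(samples, period_s=60):
    """Bucket (timestamp_s, value) samples into fixed periods and average each bucket."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % period_s].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

def reduce_sum(aligned_series):
    """Sum several aligned series point by point (timestamps present in all of them)."""
    common = set.intersection(*(set(s) for s in aligned_series))
    return {ts: sum(s[ts] for s in aligned_series) for ts in sorted(common)}

# Hypothetical raw CPU samples for two instances; timestamps are irregular.
vm_a = [(0, 0.20), (30, 0.40), (60, 0.60)]
vm_b = [(10, 0.10), (70, 0.30), (100, 0.50)]

aligned = [align_mean(vm_a), align_mean(vm_b)]
zone_total = reduce_sum(aligned)
print(zone_total)
```

The raw samples of the two instances never share a timestamp, so they cannot be summed directly. Only after alignment do both series have points at 0 s and 60 s, which is why alignment comes before any cross-series aggregation.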

MQL Cheat Sheet for PCA Scenarios

fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| filter resource.zone =~ 'us-central1-.*'
| align mean_aligner(1m)
| every 1m
| group_by [resource.instance_id], mean(val())

Key MQL operations the exam likes to probe:

  • fetch — Selects the resource and metric type. Always the first verb.
  • align — Normalises raw samples to a fixed period (rate, delta, mean_aligner). Required before aggregation.
  • group_by — Equivalent to SQL GROUP BY plus aggregator (sum, mean, percentile).
  • join — Combines two time series to compute ratios (e.g., HTTP error rate = 5xx / total). This is the killer feature that the builder UI cannot replicate.
  • outer_join 0 — Fills missing data points with zero, critical for SLO denominators that drop to zero during quiet periods.
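
The join and zero-fill behaviour described above can be sketched in plain Python. This is only an illustration of the concept, not MQL itself: error_ratio is a hypothetical helper, the series are invented, and returning 0.0 when the total is zero is an assumption made for this sketch.

```python
def error_ratio(errors, totals):
    """Join two aligned series on timestamp; a missing error point counts as 0."""
    ratio = {}
    for ts, total in totals.items():
        err = errors.get(ts, 0)  # the zero-fill that an outer join with default 0 provides
        ratio[ts] = err / total if total else 0.0  # assumption: no traffic means ratio 0
    return ratio

# Hypothetical per-minute counts; the 5xx series has no points at 60 s and 120 s.
total_requests = {0: 200, 60: 180, 120: 0}
errors_5xx = {0: 4}

ratios = error_ratio(errors_5xx, total_requests)
print(ratios)
```

Without the zero-fill, an inner join would drop the 60 s and 120 s points entirely, and the ratio series would show gaps exactly when the service is healthy or idle.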

When to Reach for MQL

Pick MQL when you need: cross-metric ratios, percentile-on-percentile math, dynamic thresholds (e.g., compare today vs. last week), or anomaly detection logic embedded in an alert condition. The builder UI compiles down to MQL internally — you can hit "Show as MQL" to learn the equivalent expression for any panel you already built.

Cloud Monitoring metric retention is 6 weeks for standard 1-minute resolution data. If the PCA scenario requires multi-quarter trend analysis or compliance retention, you must export to BigQuery via scheduled exports or stream via Pub/Sub to long-term storage. Do not assume metrics live forever in Monitoring.


Dashboards: Cloud Monitoring vs. Grafana on GCP

Cloud Monitoring ships two native dashboard families plus first-class Grafana integration. Choosing between them is a recurring PCA scenario.

Native Cloud Monitoring Dashboards

  • Auto-created dashboards — Provisioned per service (Cloud SQL, GKE, Load Balancing). Zero-config baseline view.
  • Custom dashboards — Defined in the console or as JSON / Terraform (google_monitoring_dashboard). Supports XY charts, scorecards, heatmaps, log panels, and Markdown widgets.
  • Dashboard filters — Variables like ${project} or ${environment} let one dashboard serve dev/stage/prod.

Managed Service for Prometheus + Grafana

When the team already runs Grafana, point it at the Prometheus-compatible query endpoint that GMP exposes (https://monitoring.googleapis.com/v1/projects/PROJECT/location/global/prometheus/). Authenticate via the Prometheus data source sync sidecar or workload identity. Benefits:

  • Reuse existing Grafana dashboards (Kubernetes mixins, Node Exporter, etc.) without rewriting in MQL.
  • PromQL semantics (counters, histograms with histogram_quantile) are native.
  • Cloud Monitoring Grafana plugin can also query Cloud Monitoring directly using MQL or the builder.

Decision Heuristic

  • Pure-GCP shop, mixed metrics + logs in one pane → Cloud Monitoring custom dashboards.
  • Kubernetes-heavy, multi-cloud, existing Prometheus exporters → GMP + Grafana.
  • Need both PromQL and Cloud Monitoring metrics on the same dashboard → Grafana with both data sources.

Uptime Checks Deep Dive

Uptime Checks synthetically probe a public endpoint (HTTP, HTTPS, TCP) from six global locations by default: USA-Oregon, USA-Iowa, USA-Virginia, Europe, South America, Asia-Pacific. A check fails an individual location if the response is non-2xx, body doesn't match the optional content matcher, or the request times out.

Configuration Knobs

  • Frequency — 1, 5, 10, or 15 minutes. 1-minute interval costs more but catches transient outages faster.
  • Timeout — Up to 60 seconds.
  • Content matcher — String match, regex, JSON path, or "contains/does not contain". Catches "200 OK but the body says ERROR" bugs.
  • Authentication — Basic auth, OIDC token (for private services behind IAP), custom headers.
  • TLS validation — Disable for self-signed dev environments; keep on for production.

Internal Uptime Checks via Private Checkers

Public probers cannot reach private VPC services. Private uptime checks run from inside a service-producer VPC and probe internal load balancers or VMs. They require enabling the Service Directory and a managed checker resource. Use this for internal microservices that should never face the internet.

Alert Policy Wiring

An uptime check is just a metric (monitoring.googleapis.com/uptime_check/check_passed). Wire an alert policy with the rule "if at least 2 of 6 locations fail for 5 minutes" — this filters false positives from single-region network blips. For critical services tighten to "if any 1 location fails".
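
As a rough sketch of the "at least 2 of 6 locations fail for 5 minutes" rule, the logic looks something like the following. This is plain Python over invented data, not the alerting API; should_alert is a hypothetical helper, and it assumes one check result per location per minute, with a location counted as failing only if every check in the window failed.

```python
def should_alert(results_by_location, min_failed_locations=2, window=5):
    """results_by_location maps location -> list of booleans (True = check passed),
    one entry per minute, newest last."""
    failing = [
        loc for loc, results in results_by_location.items()
        if len(results) >= window and not any(results[-window:])
    ]
    return len(failing) >= min_failed_locations, failing

# Hypothetical results for the last five minutes from the six default regions.
checks = {
    "usa-oregon": [True, True, True, True, True],
    "usa-iowa": [True, True, True, True, True],
    "usa-virginia": [True, False, True, True, True],   # a single blip, not an outage
    "europe": [False, False, False, False, False],
    "south-america": [True, True, True, True, True],
    "asia-pacific": [False, False, False, False, False],
}

alert, failing = should_alert(checks)
print(alert, failing)
```

Setting min_failed_locations=1 gives the stricter "any 1 location fails" variant for critical services.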

Uptime check false positives frequently come from rate-limiting your own prober traffic. The 6 prober source IP ranges are published — whitelist them in Cloud Armor, WAF, and any upstream CDN. A common exam distractor is "uptime check fails but the service is fine" → the answer is usually firewall or rate-limit misconfiguration, not service downtime.


SLO and Error Budget Engineering

Cloud Monitoring has first-class Service Level Objective (SLO) support under the Services section. The workflow: define a Service, pick a Service Level Indicator (SLI), set the objective, and Monitoring auto-generates the Error Budget burn-rate metrics.

SLI Types Supported

  • Request-based SLI — good_request_count / total_request_count. Sources: Cloud Load Balancing, Istio/ASM, App Engine, Cloud Run, or any custom metric pair.
  • Windows-based SLI — Each 1-minute window is "good" or "bad" by a threshold (e.g., p99 latency < 200ms). Better for latency objectives.
  • Basic SLI — Pre-built for fully managed products (GAE, Cloud Run) where Google publishes availability and latency metrics automatically.

Burn Rate Alerts

The error budget for a 30-day window with a 99.9% SLO is 43.2 minutes. A burn-rate alert fires when the budget is being consumed faster than sustainable. Google recommends a multi-window, multi-burn-rate policy:

  • Fast burn — 14.4× burn rate over a 1-hour lookback (page on-call immediately, eats 2% of budget).
  • Slow burn — 6× burn rate over a 6-hour lookback (eats 5% of budget). A 1× burn rate over a 3-day lookback is the ticket-only tier, not a page.

This pattern surfaces both acute incidents and chronic degradation without paging on every 1-minute blip.
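
The arithmetic behind these numbers is short enough to check by hand. A minimal sketch, with hypothetical helper names: the error budget is (1 - SLO) times the window length, and the fraction of budget consumed by an alert tier is burn rate times lookback divided by the window.

```python
def error_budget_minutes(slo, window_days):
    """Allowed 'bad' minutes in the compliance window."""
    return (1 - slo) * window_days * 24 * 60

def budget_consumed(burn_rate, lookback_hours, window_days):
    """Fraction of the total error budget spent if burn_rate holds for the lookback."""
    return burn_rate * lookback_hours / (window_days * 24)

budget = error_budget_minutes(0.999, 30)   # about 43.2 minutes
fast = budget_consumed(14.4, 1, 30)        # about 0.02, i.e. 2% of the budget

print(round(budget, 1), round(fast, 3))
```

A burn rate of 1 means the budget runs out exactly at the end of the window, so any sustained value above 1 eventually breaches the SLO.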

Linking SLOs to Releases

Use the SLO API in CI/CD: query the current error budget remaining before promoting a release. If budget is below threshold, freeze deploys — a pattern from Google's own SRE playbook.


Custom Metrics and OpenTelemetry

When built-in metrics are insufficient (e.g., "queue depth in my custom worker"), publish custom metrics via the Cloud Monitoring API (projects.timeSeries.create) or, preferred for new code, the OpenTelemetry SDK with the GCP exporter.

Custom Metric Anatomy

  • Metric type — custom.googleapis.com/myapp/queue_depth (custom prefix is mandatory).
  • Metric kind — GAUGE (current value), DELTA (change over interval), or CUMULATIVE (monotonic counter).
  • Value type — INT64, DOUBLE, DISTRIBUTION (for histograms / latency percentiles).
  • Labels — Up to 10 user-defined labels (e.g., queue_name, region).
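
To show how these four pieces fit together, here is a sketch that builds a request body for projects.timeSeries.create using only the standard library. The field names follow the REST TimeSeries resource as I understand it, so verify them against the API reference before relying on this; build_time_series_body, the project ID, and the use of the "global" monitored resource are assumptions for illustration. The body is only printed, not sent.

```python
import json
from datetime import datetime, timezone

def build_time_series_body(project_id, value, queue_name):
    """Assemble one GAUGE / INT64 point for a custom queue-depth metric."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "timeSeries": [{
            "metric": {
                "type": "custom.googleapis.com/myapp/queue_depth",
                "labels": {"queue_name": queue_name},
            },
            # Assumption: the generic "global" monitored resource is good enough here.
            "resource": {"type": "global", "labels": {"project_id": project_id}},
            "metricKind": "GAUGE",
            "valueType": "INT64",
            "points": [{
                "interval": {"endTime": now},
                # 64-bit integers are carried as strings in the JSON encoding.
                "value": {"int64Value": str(value)},
            }],
        }]
    }

body = build_time_series_body("my-project", 42, "queue-orders")
print(json.dumps(body, indent=2))
```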

OpenTelemetry → Cloud Monitoring

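from opentelemetry import metrics
from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Export collected metrics to Cloud Monitoring on a periodic schedule.
reader = PeriodicExportingMetricReader(CloudMonitoringMetricsExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

# One meter per service; instruments created from it share that scope name.
meter = metrics.get_meter("checkout-service")

# Synchronous gauge (needs a recent opentelemetry-api release): records the
# current value, and the attributes become metric labels.
queue_depth = meter.create_gauge("queue_depth")
queue_depth.set(42, {"queue_name": "orders"})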

The periodic reader exports every 60 seconds by default. Ingestion beyond the free tier is billed by volume (listed at $0.2580 per MiB for the first pricing tier at the time of writing; check the current pricing page), so design your label cardinality carefully: each unique label combination creates a separate time series.

Cardinality Budget

Cloud Monitoring enforces a soft limit of 200,000 active time series per metric descriptor in a Metrics Scope. Avoid high-cardinality labels like user_id or request_id — those belong in Cloud Logging or Cloud Trace, not metrics.
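
A quick way to sanity-check a label design is to multiply the number of distinct values per label, since every combination can become its own time series. The sketch below uses invented label sets and a hypothetical series_count helper; the 200,000 figure is the limit quoted above.

```python
from math import prod

LIMIT = 200_000  # active time series per metric descriptor, as quoted above

def series_count(label_values):
    """Worst-case number of time series: the product of distinct values per label."""
    return prod(len(values) for values in label_values.values())

# Hypothetical label designs for the same queue_depth metric.
low = {"queue_name": range(5), "region": range(4)}
high = {"queue_name": range(5), "region": range(4), "user_id": range(50_000)}

print(series_count(low), series_count(high), series_count(high) > LIMIT)
```

Two bounded labels give 20 series; adding a user_id label with 50,000 values pushes the same metric to 1,000,000 series, five times the limit.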


Log-Based Metrics

Log-based metrics convert log entries into Cloud Monitoring time series. Two flavours:

Counter Metrics

Count log entries matching a filter. Example: count Cloud SQL error: deadlock entries to alert on contention spikes.

resource.type="cloudsql_database"
severity="ERROR"
textPayload:"deadlock"

Distribution Metrics

Extract a numeric value from each log entry (via regex or JSON path) and bucket it into a histogram. Example: extract response latency from an unstructured access log when you can't add Cloud Monitoring SDK code to the app.
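
Conceptually, a distribution metric does the following to each matching log line. The sketch is plain Python rather than the Logging configuration itself; the log format, the regular expression, and the bucket bounds are all assumptions chosen for illustration.

```python
import re
from bisect import bisect_right

LATENCY_RE = re.compile(r"latency=(\d+(?:\.\d+)?)ms")  # hypothetical access-log format
BOUNDS_MS = [50, 100, 250, 500, 1000]                  # hypothetical explicit bucket bounds

def histogram(lines):
    """Extract the latency from each line and count it into a bucket.
    Bucket 0 is below 50 ms; the last bucket is 1000 ms and above."""
    counts = [0] * (len(BOUNDS_MS) + 1)
    for line in lines:
        match = LATENCY_RE.search(line)
        if match is None:
            continue  # lines without a latency value are ignored
        counts[bisect_right(BOUNDS_MS, float(match.group(1)))] += 1
    return counts

logs = [
    "GET /cart 200 latency=42ms",
    "GET /checkout 200 latency=120.5ms",
    "POST /pay 504 latency=1800ms",
    "healthcheck ok",
    "GET /cart 200 latency=100ms",
]
counts = histogram(logs)
print(counts)
```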

Cost and Caveats

  • System log-based metrics (provided by Google) are free.
  • User-defined log-based metrics are billed per the number of log entries scanned and per metric data written.
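  • Log-based metrics are not retroactive: they count only entries that arrive after the metric is created, and the resulting time series follows metric retention, not log retention. User-defined log-based metrics are computed at ingestion, so a sink exclusion filter does not stop them from receiving data.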
  • Labels are extracted from log fields; same cardinality rules apply.

Cost-saving pattern: Combine a logging sink exclusion with a log-based metric. Exclude verbose-but-cheap logs from _Default (saves storage cost) but keep the log-based metric definition so you still get the count time series. The metric reads from the ingestion stream, not from stored logs, so exclusions do not break it.


Alerting Policy Conditions in Depth

Alert policies are the action layer of observability. A policy = one or more conditions + notification channels + documentation + optional auto-close duration.

Condition Types

  • Metric threshold — Most common. "CPU > 80% for 5 minutes across any GCE instance."
  • Metric absence — Fires when a metric stops reporting. Critical for "is my exporter alive?" checks.
  • Forecast — Predicts future breach based on linear regression of the past 1h–24h. Useful for disk-fill alerts ("project disk will be full in 4 hours").
  • Log-match — Fires immediately on a specific log entry (e.g., a security audit setIamPolicy event). Faster than going through a log-based metric.
  • MQL condition — Arbitrary MQL expression returning a boolean time series. Use for cross-metric ratios.
  • Uptime check failed — Specialised condition tied to an uptime check resource.
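
To illustrate the forecast idea for the disk-fill example, here is a least-squares extrapolation in plain Python. It is a sketch of linear-regression forecasting in general and makes no claim about how Cloud Monitoring computes its forecasts; hours_until_full and the utilization samples are hypothetical.

```python
def hours_until_full(samples, capacity=1.0):
    """samples: list of (hour, utilization fraction), at least two distinct hours.
    Fits a least-squares line and returns hours from the last sample until the
    line reaches capacity, or None if utilization is not growing."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    return (capacity - intercept) / slope - samples[-1][0]

# Hypothetical hourly disk utilization: growing four points per hour.
disk = [(0, 0.80), (1, 0.84), (2, 0.88), (3, 0.92)]
remaining = hours_until_full(disk)
print(round(remaining, 2))
```

A threshold condition at 95% would still be silent at the last sample here, while the extrapolation already shows about two hours of headroom left.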

Notification Channels

Cloud Monitoring supports Email, SMS, Slack, PagerDuty, Pub/Sub, Webhook, Mobile App. For incident-management integration use Pub/Sub → Cloud Function → custom routing logic (e.g., escalate via business hours rules).

Anti-Flapping Knobs

  • duration — How long the condition must be true before firing. 5 minutes is the default sweet spot.
  • auto_close — How long to wait after recovery before closing the incident (default 7 days; tune down to 30 minutes for transient issues).
  • combiner — AND, OR, or AND_WITH_MATCHING_RESOURCE, used when a policy has multiple conditions.
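
The effect of the duration setting can be sketched in a few lines. This is an illustration of the concept over invented samples, not the alerting engine; fires is a hypothetical helper and it assumes one aligned value per minute.

```python
def fires(values, threshold, duration_points):
    """True when the newest `duration_points` aligned values all exceed the threshold."""
    if len(values) < duration_points:
        return False
    return all(v > threshold for v in values[-duration_points:])

# Hypothetical per-minute CPU utilization; the dip to 0.60 resets the clock.
cpu = [0.70, 0.85, 0.90, 0.60, 0.82, 0.88, 0.91, 0.86, 0.93]

print(fires(cpu[:8], 0.80, 5))  # the dip is still inside the 5-minute window
print(fires(cpu, 0.80, 5))      # five consecutive minutes above 80%
```

With duration_points=1 the same data would have fired at the second sample and again after the dip, which is the flapping that a longer duration suppresses.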

PCA exam recall list — what fires fastest:

  1. Log-match condition (sub-minute, fires on first matching log entry).
  2. Uptime check (1-minute minimum frequency, but global alert latency adds ~1 min).
  3. Metric threshold with duration=0s (still ~1 minute due to alignment period).
  4. Forecast condition (deliberately slow, predicts hours ahead).

If the scenario says "alert within seconds of a specific event in logs," the answer is log-match condition, not log-based metric + threshold.


Cloud Trace and Cloud Profiler Integration

Metrics and logs answer "what" and "when." Cloud Trace and Cloud Profiler answer "where" and "why."

Cloud Trace

  • Distributed tracing service that ingests spans via the OpenTelemetry protocol (formerly Stackdriver Trace SDK / Zipkin).
  • Auto-instrumentation is available for App Engine, Cloud Run, Cloud Functions, GKE with the OpenTelemetry Collector sidecar, and the Cloud Load Balancer (sampled).
  • The Trace List UI plots latency over time and lets you drill into a single trace to see the waterfall of spans across services.
  • Trace ID is the joining key with logs — Cloud Logging auto-correlates entries that contain a trace field, so clicking a trace shows the matching logs inline.
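
For services that write JSON logs to stdout, the correlation relies on a trace field in each entry. Below is a minimal sketch using the structured-logging special fields as I understand them (logging.googleapis.com/trace and logging.googleapis.com/spanId); log_entry, the project ID, and the IDs are placeholders, and the exact field handling should be confirmed in the Cloud Logging documentation.

```python
import json

def log_entry(message, project_id, trace_id, span_id, severity="INFO"):
    """Build one structured log line that carries the trace it belongs to."""
    return json.dumps({
        "severity": severity,
        "message": message,
        "logging.googleapis.com/trace": f"projects/{project_id}/traces/{trace_id}",
        "logging.googleapis.com/spanId": span_id,
    })

# Placeholder IDs; in a real service these come from the active span context.
line = log_entry(
    "payment gateway timeout",
    project_id="my-project",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
    severity="ERROR",
)
print(line)
```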

Cloud Profiler

  • Continuous profiling of CPU, heap, contention (Go), and wall-clock time, sampled at 5% overhead.
  • Supports Go, Java, Node.js, Python, PHP, .NET (preview). Enable by adding the agent library + a service-account permission roles/cloudprofiler.agent.
  • The flame graph compares two time ranges or two service versions — the killer feature for finding regressions introduced by a deploy.

The Four-Pillar Pattern

A PCA-grade observability stack ties everything together:

  1. Cloud Monitoring → metrics + dashboards + alerts.
  2. Cloud Logging → structured logs with trace correlation.
  3. Cloud Trace → request-level latency breakdown.
  4. Cloud Profiler → code-level CPU/memory hotspots.

All four share the same resource labels and trace IDs, so a single click in an alert can take an SRE from "p99 latency breached SLO" to "this Go function on this version is eating CPU".

Cloud Trace sampling rate defaults to a small fraction (e.g., 0.1% on App Engine Standard) to control cost. If a PCA scenario complains "we cannot find the slow request in Trace," the answer is to increase the sampling rate via the OpenTelemetry SDK TraceIdRatioBased sampler, or to use always-on sampling for a specific low-traffic service. Cloud Trace pricing is per span ingested, so 100% sampling on a high-QPS service can be expensive.
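
The trade-off is easier to see with a sketch of how a trace-ID ratio sampler can decide. This is a standard-library illustration modelled on the general idea behind ratio-based sampling, not the OpenTelemetry SDK's code; should_sample and the simulated trace IDs are hypothetical.

```python
import random

MASK_64 = (1 << 64) - 1

def should_sample(trace_id, ratio):
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64.
    The decision depends only on the ID, so every service that sees the same
    trace ID reaches the same verdict."""
    bound = round(ratio * (1 << 64))
    return (trace_id & MASK_64) < bound

# Simulate 100,000 random 128-bit trace IDs at a 1% ratio.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
sampled_fraction = sum(should_sample(t, 0.01) for t in trace_ids) / len(trace_ids)
print(sampled_fraction)
```

At a 0.1% ratio a service handling a few requests per minute may go hours without a single sampled trace, which is why low-traffic services are the usual candidates for a higher ratio or always-on sampling.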


Metrics Scopes and Multi-Project Consolidation

The old "Workspace" concept has been renamed to Metrics Scope. A Metrics Scope is owned by a scoping project and lists up to 375 monitored projects whose metrics, dashboards, and alerts are visible from the scoping project.

Architectural Patterns

  • Per-environment scope — One scoping project per environment (dev/stage/prod), each scope contains the workload projects for that env.
  • Per-team scope — Platform/SRE team gets a scope listing all production projects across business units.
  • Org-wide rollup — A central observability project scopes all production projects; team scopes still exist for autonomy.

Permissions

  • roles/monitoring.viewer on the scoping project lets a user see all metrics across the listed projects — even without IAM on the workload projects.
  • To add a project to a scope you need monitoring.metricsScopes.link on the scoping project AND monitoring.metricsScopes.linkSource on the target project.

Limitations

  • A monitored project can belong to multiple scopes (good for shared services).
  • Alert policies are still scoped per-project — Metrics Scope only consolidates viewing, not alerting. For org-level alerts you must replicate policies via Terraform or Config Sync.
  • Logs are not part of Metrics Scope. For multi-project log consolidation use aggregated log sinks at the folder or org level pointing to a central log bucket or BigQuery dataset.

FAQ — Cloud Monitoring

Q1. What is the "Ops Agent"?

The Ops Agent is a single agent for Google Cloud's operations suite. It collects both metrics and logs from your Compute Engine instances. It replaces the older Monitoring and Logging agents.

Q2. How long is monitoring data retained?

Standard metrics are typically retained for 6 weeks. Some high-resolution metrics may have shorter retention periods. For long-term analysis, export metrics to BigQuery.

Q3. Can I monitor resources in other projects?

Yes. You can use Metrics Scopes to monitor multiple projects from a single "Scoping Project." This is essential for centralizing observability in a multi-project organization.

Q4. What is an "Uptime Check" vs. a "Health Check"?

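An Uptime Check probes your endpoint from outside (public checkers on the internet, or private checkers for VPC-internal services) and feeds alerting; a Health Check is used by load balancers to decide whether a backend receives traffic and by Managed Instance Groups to decide whether a VM should be recreated (autohealing).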

Q5. When should I use MQL instead of the basic query builder?

Use MQL when you need to perform complex operations, such as calculating the ratio of two different metrics (e.g., Error Rate = Errors / Total Requests) or performing cross-service joins.
