Google Cloud Operations Suite

What Is the Google Cloud Operations Suite?

The Google Cloud Operations Suite, now formally branded as Google Cloud Observability, is the unified telemetry, monitoring, and troubleshooting platform that ships with every Google Cloud project. It started life as Stackdriver, an independent monitoring company that Google acquired in 2014. After the acquisition the product kept the Stackdriver name for several years, was then renamed Cloud Operations Suite, and finally became Google Cloud Observability. For the Cloud Digital Leader (CDL) exam, all three names refer to the same family of products: Cloud Logging, Cloud Monitoring, Cloud Trace, Cloud Profiler, and Error Reporting.

The purpose of the Google Cloud Operations Suite is to answer one fundamental question that every business leader, SRE, and developer asks dozens of times per day: "Is my service healthy, and if not, why?" Without observability, a cloud workload is a black box. You see whether the bill went up, but you have no idea whether the latency spike yesterday cost customers, whether a memory leak is brewing in production, or whether your error rate is climbing toward an SLO breach.

The Google Cloud Operations Suite is designed for the business-leader perspective as much as the engineer perspective. CDL exam scenarios will ask things like "Which Google Cloud product provides centralized log storage with optional long-term retention in BigQuery?" or "An on-call team needs to be paged when the error rate exceeds 1% — which product configures alerting policies?" This study note maps every common CDL scenario to the right Google Cloud Observability product, explains the three pillars of observability (logs, metrics, traces), and grounds the concepts in real-world analogies that anyone in Taiwan can picture.

By the end of this chapter you will understand how the Google Cloud Operations Suite collects telemetry across compute services like Compute Engine, GKE, and Cloud Run, how alerting policies route incidents to on-call channels, how log sinks feed compliance archives, and how SRE teams use the suite to drive service-level objectives. You will also understand the cost levers of the Google Cloud Operations Suite — because Cloud Logging at scale is one of the most under-budgeted Google Cloud line items in practice.

The Three Pillars of Observability

Modern observability rests on three pillars of telemetry: logs, metrics, and traces. Each pillar answers a different question, and each one maps to a different product inside the Google Cloud Operations Suite.

Pillar 1 — Logs (What Happened?)

A log is a timestamped, structured (or unstructured) record of a discrete event. A log line says: "At 14:32:11 UTC, a user's login attempt 3 failed with HTTP 401 from IP 203.0.113.42." Logs are the forensic record: when something goes wrong, you read the logs to reconstruct exactly what happened. On the Google Cloud Operations Suite, logs live in Cloud Logging.

Pillar 2 — Metrics (How Often, How Much?)

A metric is a numeric measurement aggregated over time. A metric says: "Over the past minute, this service received 1,247 requests, with a 95th-percentile latency of 320 ms and 12 HTTP 500 responses." Metrics are the dashboards and alarms layer — they tell you whether the system is healthy in aggregate, and they trigger alerts when thresholds break. On the Google Cloud Operations Suite, metrics live in Cloud Monitoring.

Pillar 3 — Traces (Where Did the Request Go?)

A trace follows a single request as it travels through a distributed system. A trace says: "Request abc123 entered the API gateway, called the auth service (15 ms), then the catalog service (40 ms), then three database queries (210 ms total), and returned to the user in 295 ms." Traces are the distributed timeline — they tell you which downstream call is the bottleneck. On the Google Cloud Operations Suite, traces live in Cloud Trace.

The three pillars of observability are logs (what happened), metrics (how often / how much), and traces (request flow across services). On the Google Cloud Operations Suite they map respectively to Cloud Logging, Cloud Monitoring, and Cloud Trace. Memorize this triad — CDL exam questions about "which product handles X" almost always pivot on which pillar X belongs to. See https://cloud.google.com/products/operations for the canonical product family overview.

Plain English Explanation

Observability sounds abstract until you map it to something physical and familiar. The Google Cloud Operations Suite is not really a "tool" — it is a command and control room for your services, with dashboards, alarms, recorders, and runbooks. The three analogies below capture the same idea from different angles, and each one highlights a different facet of why observability matters.

Analogy 1 — The Hospital Patient Monitor

Imagine a patient lying in an ICU bed. Around the bed there are several devices that never stop recording: the heart rate monitor beeps every second, the blood pressure cuff inflates every fifteen minutes, the oxygen saturation clip glows red on the patient's finger, and the chart at the foot of the bed records every medication dose, every doctor visit, and every test result. None of these devices are optional — without them, the staff is blind, and a deteriorating patient could die before anyone notices.

Google Cloud Observability plays exactly the same role for your services. Cloud Monitoring is the heart rate monitor and the blood pressure cuff — it watches CPU, latency, request count, and error rate around the clock, and it beeps (sends an alert) the moment any vital sign crosses a danger line. Cloud Logging is the chart at the foot of the bed — it records every event, every error, every login attempt, so that when the doctor arrives, there is a full history to consult. Cloud Trace is the imaging scan that follows the contrast dye through the patient's bloodstream — it shows exactly where the blockage is in the distributed request flow.

Just as a hospital cannot operate safely without monitors, a Google Cloud workload cannot operate safely without the Google Cloud Operations Suite. And just as a hospital cannot afford to disable monitors to save battery cost, you cannot afford to turn off logs to save Google Cloud Operations Suite ingestion fees — you can only filter intelligently to keep what matters and drop what doesn't.

Analogy 2 — The Factory Control Room

A modern manufacturing plant in Taichung might run hundreds of machines simultaneously: injection moulders, packaging lines, conveyor belts, paint booths, and quality-inspection stations. The central control room has a wall of screens. One screen shows live throughput numbers per line. Another screen shows a heat map of which machines are running, idle, or in fault. A red rotating light spins above the door whenever a critical alarm fires. Operators sit at consoles and watch.

The Google Cloud Operations Suite is your cloud control room. Cloud Monitoring dashboards are the wall of screens, showing service-level metrics for every microservice in real time. Alerting policies are the red rotating lights — when error rate exceeds 1%, when latency p95 crosses 500 ms, when memory consumption climbs past 90%, an alarm fires and routes a notification to Email, Slack, PagerDuty, SMS, or any other configured notification channel. Cloud Logging is the printed log book that operators can leaf through to investigate any specific alarm.

The control room metaphor also captures uptime checks. The Google Cloud Operations Suite can periodically probe your public endpoint from multiple global locations, just like a factory quality inspector who walks the floor every hour to physically verify that every machine is producing output. If three out of five probe locations report failure, the factory has a real problem, not a localised network blip.

Analogy 3 — The Aircraft Black Box and Radar Tower

Every commercial aircraft carries two black boxes: the cockpit voice recorder and the flight data recorder. They capture everything — every dial reading, every radio call, every control input. After a crash, investigators reconstruct the final minutes of the flight from black-box data. Meanwhile, the radar tower tracks every aircraft in the airspace in real time, vectoring planes around bad weather and warning of collision risk.

Your Google Cloud workloads need both. Cloud Logging is the black box — it persistently records every event so that after an incident you can run a post-mortem and reconstruct exactly what happened. Cloud Monitoring is the radar tower — it watches the live position and health of every aircraft (service) in the fleet, and it routes warnings before the crash, not after. Cloud Trace adds a third dimension: the flight path of a single passenger through a multi-leg journey — from the boarding gate to the connecting flight to the baggage carousel. When one passenger arrives late, the trace tells you whether the delay was at the security checkpoint, the runway, or the customs queue.

The Google Cloud Operations Suite is what turns a fleet of opaque services into an instrumented, observable airline. And like an airline, the cost of telemetry is real — recording every conversation in the cockpit for ten years costs storage money — but the cost of not recording is catastrophic when an incident demands a forensic answer.

Cloud Logging — Centralized Log Aggregation

Cloud Logging is the log management service inside the Google Cloud Operations Suite. It ingests log data from every Google Cloud service automatically — Compute Engine, GKE, Cloud Run, App Engine, Cloud Functions, Cloud SQL, Load Balancers, Cloud Storage access logs, audit logs — and stores them in a unified, queryable, retention-managed repository.

How Logs Get Into Cloud Logging

Cloud Run, Cloud Functions, App Engine: Logs are captured automatically. Any output your app writes to stdout or stderr becomes a Cloud Logging entry, with no agent installation required.
GKE: The cluster runs the GKE Logging agent by default. Pod logs, system logs, and Kubernetes events all stream into Cloud Logging with cluster, namespace, and pod-name labels.
Compute Engine: Install the Ops Agent (the unified successor to the legacy Logging Agent and Monitoring Agent) on each VM. The Ops Agent collects both logs and metrics over the same channel.
External and hybrid sources: Use BindPlane or the OpenTelemetry Collector to forward logs from on-prem servers, AWS workloads, or third-party SaaS into Cloud Logging.
Custom application logs: Use the Cloud Logging client libraries (Java, Python, Node.js, Go, etc.) to write structured log entries with severity, labels, and JSON payloads.
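The stdout path described above can be sketched without any client library. This is a minimal illustration, assuming a managed runtime such as Cloud Run that parses one JSON object per stdout line and treats the `severity` and `message` keys specially; the helper names `format_entry` and `log` and the extra fields are hypothetical, chosen only for this example.

```python
import json
import sys


def format_entry(severity, message, **fields):
    """Build one JSON log line. `severity` and `message` are the keys the
    runtime is assumed to recognise; everything else is ordinary payload."""
    entry = {"severity": severity, "message": message}
    entry.update(fields)  # extra keys become part of the structured payload
    return json.dumps(entry)


def log(severity, message, **fields):
    # One JSON object per line on stdout; no agent or client library involved.
    print(format_entry(severity, message, **fields), file=sys.stdout, flush=True)


log("WARNING", "login failed", http_status=401, attempt=3, client_ip="203.0.113.42")
```

Writing structured fields rather than free text is what later makes filters, log-based metrics, and sinks practical, because each field can be matched on directly.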

Log-Based Metrics

Cloud Logging can derive metrics from log content. For example, if your access logs contain "status":500, you can create a log-based metric that counts every 500-status line per minute, and then build a Cloud Monitoring alert on that metric. This is how teams turn unstructured operational data into structured alarms.
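To make the idea concrete, the sketch below reproduces what a counter-style log-based metric does conceptually: it scans entries, keeps those with status 500, and counts them per minute. It is a local simulation, not the Cloud Logging API; the function name, the sample entries, and the `timestamp` field layout are assumptions for illustration.

```python
import json
from collections import Counter


def count_500s_per_minute(log_lines):
    """Count entries whose "status" is 500, bucketed by minute."""
    counts = Counter()
    for line in log_lines:
        entry = json.loads(line)
        if entry.get("status") == 500:
            minute = entry["timestamp"][:16]  # "2024-01-01T14:32" drops the seconds
            counts[minute] += 1
    return dict(counts)


sample = [
    '{"timestamp": "2024-01-01T14:32:11Z", "status": 500}',
    '{"timestamp": "2024-01-01T14:32:40Z", "status": 200}',
    '{"timestamp": "2024-01-01T14:32:59Z", "status": 500}',
    '{"timestamp": "2024-01-01T14:33:05Z", "status": 500}',
]
print(count_500s_per_minute(sample))
# {'2024-01-01T14:32': 2, '2024-01-01T14:33': 1}
```

The resulting per-minute series is exactly the kind of numeric signal an alerting policy can put a threshold on.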

Log Sinks — Routing Logs to Storage

A log sink is a Cloud Logging routing rule that copies matching log entries to a destination outside Cloud Logging:

BigQuery: For SQL analytics, compliance reporting, and long-term searchable retention.
Cloud Storage: For low-cost archival (Coldline or Archive class) and regulatory retention of seven or ten years.
Pub/Sub: For real-time streaming to a downstream SIEM (Chronicle, Splunk, Datadog) or to a custom data pipeline.

Log sinks are how enterprises balance operational searchability (keep recent logs in Cloud Logging for fast queries) with cost-effective compliance (route everything to Cloud Storage for cheap multi-year retention).
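The routing behaviour can be modelled in a few lines. The sketch below is a conceptual stand-in, assuming each sink is an independent filter-plus-destination pair so that one entry may be copied to several destinations; the `Sink` class, the sink names, and the entry fields (`log`, `severity`) are hypothetical and do not mirror the real sink configuration format.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Sink:
    name: str
    destination: str
    matches: Callable[[Dict], bool]  # stands in for the sink's inclusion filter


SINKS = [
    Sink("audit-archive", "cloud-storage", lambda e: e.get("log") == "audit"),
    Sink("analytics", "bigquery", lambda e: e.get("severity") in {"ERROR", "CRITICAL"}),
    Sink("siem-stream", "pubsub", lambda e: e.get("log") == "audit"),
]


def route(entry, sinks=SINKS):
    """Return every destination whose filter matches the entry."""
    return [s.destination for s in sinks if s.matches(entry)]


print(route({"log": "audit", "severity": "INFO"}))  # ['cloud-storage', 'pubsub']
print(route({"log": "app", "severity": "ERROR"}))   # ['bigquery']
print(route({"log": "app", "severity": "DEBUG"}))   # []
```

The point of the model is that sinks do not compete: an audit entry can go to cheap archival and to a streaming pipeline at the same time.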

Cloud Logging is NOT free at scale. The first 50 GiB per project per month is free, but every additional GiB costs roughly USD 0.50 in ingestion fees, and high-volume workloads (chatty Java applications, verbose Kubernetes debug logs) can drive thousands of dollars per month. Use log exclusion filters to drop low-value entries before ingestion, and route compliance-only logs directly to Cloud Storage via a sink instead of paying to keep them queryable in Cloud Logging. See https://cloud.google.com/logging/pricing for the current pricing schedule.

Cloud Monitoring — Metrics, Dashboards, Alerts, SLOs

Cloud Monitoring is the metrics-and-alerting service inside the Google Cloud Operations Suite. It collects time-series numeric data from every Google Cloud product, lets you build custom dashboards, and routes alerts to on-call channels when conditions breach.

Built-In Metrics

Every Google Cloud product exposes a set of system metrics out of the box: Compute Engine VM CPU and disk, GKE pod CPU and memory, Cloud Run request count and latency, Cloud SQL connection count, Pub/Sub backlog size, and so on. There is nothing to install or configure for these — they appear in Cloud Monitoring automatically.

Custom Metrics

For application-specific signals (e.g., "carts abandoned per minute"), you can publish custom metrics using the Cloud Monitoring API or by deriving them from logs (log-based metrics).

Dashboards

Cloud Monitoring dashboards are user-configurable wall displays. You assemble widgets — line charts, heat maps, scorecards, gauges — and arrange them to give a service team or a leadership team the at-a-glance view they need. Dashboards can be shared across projects, embedded in incident response runbooks, or used in regular operational review meetings.

Alerting Policies and Notification Channels

The alerting model in the Google Cloud Operations Suite has three building blocks:

Alerting policy: Defines a condition (e.g., "Cloud Run error rate above 1% for 5 minutes") plus the affected resources and the severity.
Notification channel: Specifies where the alert is delivered. Supported channels include Email, SMS, Slack, PagerDuty, Webhook, Pub/Sub, Google Chat, and the Google Cloud mobile app.
Incident: When the policy condition fires, Cloud Monitoring opens an incident record, notifies the channels, and tracks the incident until the condition resolves.
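The three building blocks above can be simulated as a small state machine. This is a rough sketch, not the Cloud Monitoring API: the `AlertingPolicy` class, its `observe` method, and the one-sample-per-minute assumption are all hypothetical, and notification channels are reduced to plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AlertingPolicy:
    threshold: float       # e.g. 0.01 for "error rate above 1%"
    duration_minutes: int  # condition must hold this long before firing
    channels: List[Callable[[str], None]] = field(default_factory=list)
    open_incident: bool = False
    _breach_streak: int = 0

    def observe(self, error_rate: float) -> Optional[str]:
        """Feed one per-minute sample; return 'opened', 'resolved', or None."""
        if error_rate > self.threshold:
            self._breach_streak += 1
        else:
            self._breach_streak = 0

        if not self.open_incident and self._breach_streak >= self.duration_minutes:
            self.open_incident = True
            self._notify("incident opened")
            return "opened"
        if self.open_incident and self._breach_streak == 0:
            self.open_incident = False
            self._notify("incident resolved")
            return "resolved"
        return None

    def _notify(self, message: str) -> None:
        for send in self.channels:  # every attached channel gets the message
            send(message)


sent = []
policy = AlertingPolicy(threshold=0.01, duration_minutes=5, channels=[sent.append])
events = [policy.observe(rate) for rate in [0.02] * 5 + [0.005]]
print(events)  # [None, None, None, None, 'opened', 'resolved']
print(sent)    # ['incident opened', 'incident resolved']
```

The duration window is what keeps a single bad minute from paging anyone; only a sustained breach opens an incident.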

Uptime Checks

Uptime checks are scheduled HTTPS, HTTP, or TCP probes against your public endpoints, executed from multiple Google Cloud regions worldwide. They tell you whether real users from Asia, Europe, and the Americas can actually reach your service. Uptime checks integrate with alerting policies so that you get paged when too many global locations fail.

SLO Monitoring

Cloud Monitoring includes service-level objective (SLO) monitoring for SRE teams. You define a service, pick service-level indicators (SLIs) — for example, "fraction of requests with latency below 300 ms" — and set an SLO target (e.g., 99.9% over a rolling 28-day window). Cloud Monitoring tracks error budget burn rate in real time and can alert when burn rate threatens the SLO. For the broader SLO and SRE practices, see /en/certs/gcp/cdl/topics/devops-and-sre-principles.
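The arithmetic behind an error budget is short enough to work through. The sketch below uses the 99.9% target and 28-day window from the paragraph above; the request counts and function names are made up for illustration, and burn rate is taken here as the observed bad fraction divided by the allowed bad fraction.

```python
def error_budget_requests(slo_target, total_requests):
    """Requests allowed to miss the SLI within the window."""
    return (1 - slo_target) * total_requests


def allowed_downtime_minutes(slo_target, window_days):
    """Equivalent budget expressed as minutes of total outage."""
    return (1 - slo_target) * window_days * 24 * 60


def burn_rate(bad_requests, total_requests, slo_target):
    """1.0 means the budget runs out exactly at the end of the window;
    2.0 means it runs out in half the window."""
    observed_bad_fraction = bad_requests / total_requests
    allowed_bad_fraction = 1 - slo_target
    return observed_bad_fraction / allowed_bad_fraction


print(round(error_budget_requests(0.999, 1_000_000)))  # 1000
print(round(allowed_downtime_minutes(0.999, 28), 2))   # 40.32
print(round(burn_rate(2_000, 1_000_000, 0.999), 2))    # 2.0
```

A 99.9% target therefore tolerates roughly 40 minutes of full outage per 28 days, which is why burn-rate alerts matter more than raw error counts.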

The alerting policy → notification channel → incident model is one of the most exam-relevant patterns in the Google Cloud Operations Suite. CDL scenarios about "how do we page the on-call engineer when error rate spikes?" map to this three-step model: create an alerting policy in Cloud Monitoring, attach a notification channel (typically PagerDuty or Slack for on-call workflows), and Cloud Monitoring then manages the incident lifecycle automatically. See https://cloud.google.com/monitoring/alerts for the official alerting documentation.

Cloud Trace — Distributed Tracing for Latency Analysis

Cloud Trace is the distributed tracing service inside the Google Cloud Operations Suite. In a microservices architecture, a single user request might traverse a dozen backend services. When the request is slow, the question "where did the time go?" becomes hard to answer with logs alone. Cloud Trace solves this by tagging every request with a trace ID, propagating that ID across service boundaries (via HTTP headers), and reconstructing the full span tree in the Google Cloud Operations Suite UI.

How Tracing Works

Each service in the request path emits a span — a record of how long that service spent handling its portion of the request, plus any tags (database query, downstream service name, status code). Spans are joined by the shared trace ID. The end result is a flame chart showing exactly which span was the bottleneck.
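A toy version of this shows how a shared trace ID turns separate span records into an answer. The sketch assumes a flat list of spans (real traces form a parent-child tree) and reuses the timings from the Pillar 3 example; the `Span` class and service names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: int


def slowest_span(spans, trace_id):
    """Join spans by trace ID and return the one that took longest."""
    in_trace = [s for s in spans if s.trace_id == trace_id]
    return max(in_trace, key=lambda s: s.duration_ms)


spans = [
    Span("abc123", "auth-service", 15),
    Span("abc123", "catalog-service", 40),
    Span("abc123", "database-queries", 210),
    Span("zzz999", "auth-service", 900),  # different request, must be ignored
]

print(slowest_span(spans, "abc123").name)  # database-queries
```

Without the shared ID, the 900 ms span from another request would look like the culprit; with it, the database queries stand out as the bottleneck for request abc123.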

Auto-Instrumentation

App Engine Standard: Automatic tracing for every request.
Cloud Run, Cloud Functions: Automatic trace ID propagation; install the OpenTelemetry SDK to emit application spans.
GKE and Compute Engine: Use OpenTelemetry instrumentation libraries to emit spans to Cloud Trace.

When to Use Cloud Trace

Diagnosing latency regressions after a deploy.
Identifying which downstream call is the bottleneck.
Optimising end-to-end user experience by removing redundant or slow service hops.

Cloud Profiler — Continuous CPU and Memory Profiling

Cloud Profiler is the always-on production profiler inside the Google Cloud Operations Suite. Traditional profilers (gperftools, pprof, async-profiler) typically require a developer to attach to a running process and capture a one-off snapshot, which is risky and easy to avoid doing in production. Cloud Profiler is different: it runs continuously, samples with low overhead (on the order of 1% CPU, though the exact figure depends on the workload and language agent), and aggregates flame graphs over time.

What Cloud Profiler Measures

CPU time: Which functions are burning the most CPU?
Heap allocation: Which functions allocate the most memory?
Contention: Which locks cause goroutines / threads to wait?
Wall-clock time: Where do requests actually spend their time end-to-end?

Cloud Profiler supports Go, Java, Node.js, and Python. Integration is lightweight: you add the language's profiling agent and start it with the application. For SRE teams running large fleets, Cloud Profiler is the secret weapon for rightsizing. A typical win looks like discovering that 60% of CPU is burned in a single inefficient JSON parser, fixing it, and cutting the compute bill substantially.

Error Reporting — Automatic Exception Aggregation

Error Reporting is the exception-aggregation service inside the Google Cloud Operations Suite. When your application throws an exception (Java stack trace, Python traceback, Go panic, Node.js Error), Error Reporting automatically:

Parses the stack trace from Cloud Logging.
Groups identical errors together (same stack signature) so you see one entry per error, not one per occurrence.
Counts occurrences over time and tracks first-seen and last-seen timestamps.
Notifies the team when a new error type appears in production.

This means engineers see "20 distinct error types this week, ranked by frequency" instead of "200,000 individual stack traces" — turning a flood of noise into a triaged action list.
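The grouping step can be illustrated with a fingerprint over the exception type and stack frames. This is only a sketch of the general idea; the source does not specify how Error Reporting computes its groups, so the `signature` function and the sample frames are assumptions.

```python
import hashlib
from collections import Counter


def signature(exception_type, frames):
    """Stable fingerprint built from the exception type and its frames."""
    key = exception_type + "|" + "|".join(frames)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


def group_errors(occurrences):
    """Map each distinct signature to its occurrence count."""
    return Counter(signature(etype, frames) for etype, frames in occurrences)


occurrences = [
    ("ValueError", ["checkout.py:42", "cart.py:17"]),
    ("ValueError", ["checkout.py:42", "cart.py:17"]),  # same stack, same group
    ("KeyError", ["profile.py:8"]),
]

groups = group_errors(occurrences)
print(len(groups))           # 2
print(max(groups.values()))  # 2
```

Three occurrences collapse into two groups, which is the same reduction that turns 200,000 stack traces into a short ranked list.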

What Happened to Cloud Debugger?

Cloud Debugger was a Google Cloud Operations Suite product that let developers set breakpoints in live production code and capture variable state without stopping the process. Google shut down Cloud Debugger in May 2023. The replacement strategy is:

Use Error Reporting to track production exceptions.
Use Cloud Logging with rich structured payloads to capture the runtime context that a debugger would have shown.
Use Cloud Trace and Cloud Profiler for performance investigation.
Use local debugging for actual breakpoint workflows.

CDL exam questions may still mention Cloud Debugger as a retired product; recognise the name, note that its responsibilities are now covered by Cloud Logging and Error Reporting (with Cloud Trace and Cloud Profiler for performance work), and move on.

Telemetry Across Compute Options

Different Google Cloud compute services emit telemetry into the Google Cloud Operations Suite by default in subtly different ways. Knowing the defaults is a CDL-favourite question.

Cloud Run

Every stdout / stderr line becomes a Cloud Logging entry automatically. Request count, request latency, container instance count, and container CPU appear in Cloud Monitoring with zero configuration.

GKE

Pod logs are captured by the GKE Logging agent. Pod CPU and memory are captured by the GKE Monitoring agent. Both ship by default in new clusters.

Compute Engine

VMs do not automatically ship application logs or detailed system metrics. You must install the Ops Agent on each VM. Without the Ops Agent, you only see hypervisor-level CPU, disk, and network metrics.

App Engine, Cloud Functions

Fully managed — logs, metrics, and traces are emitted automatically with no agent.

Multi-Cloud and Hybrid

For workloads outside Google Cloud (on-prem, AWS, Azure), deploy the OpenTelemetry Collector or BindPlane to forward telemetry to the Google Cloud Operations Suite. This is how enterprises operate a single observability pane across hybrid estates. For the underlying compute options, see /en/certs/gcp/cdl/topics/google-cloud-compute-options.

The default observability footprint of a service is itself a selection criterion when choosing compute. Cloud Run and App Engine give you logs, metrics, and traces with zero agent installation; Compute Engine requires you to install and maintain the Ops Agent on every VM. If "minimising operational burden of observability" is a stated requirement in a CDL scenario, the right answer biases toward serverless compute options that integrate with Google Cloud Operations Suite out of the box. See https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent for Ops Agent installation details.

Pricing and Cost Optimization for Observability

The Google Cloud Operations Suite has a per-service pricing model with generous free tiers — but those free tiers run out fast at scale, and CDL exam scenarios on cost optimisation often hinge on which lever to pull.

Log sink: a Cloud Logging routing rule that sends matching log entries (selected by an inclusion filter, optionally narrowed by exclusion filters) to a destination. Common sink targets are Cloud Storage (cheap long-term archival, typically Coldline or Archive class), BigQuery (analytical querying with SQL), Pub/Sub (streaming to a third-party SIEM such as Splunk or Chronicle SIEM), or another Cloud Logging bucket. Sinks are the primary cost-control lever, with one caveat: a sink copies entries, so the saving only materialises when the same entries are also excluded from the default Cloud Logging bucket. Routed to Cloud Storage and excluded from Cloud Logging, high-volume, low-value logs stop incurring the per-GiB Cloud Logging charge while compliance retention is preserved. How much this saves depends on what share of total volume those logs represent; for chatty workloads it can be well over half of the Cloud Logging bill.

The single most-tested cost optimisation pattern in the Google Cloud Operations Suite is "use log sinks and log exclusion filters to control Cloud Logging spend." Beyond the 50 GiB free tier per project per month, Cloud Logging is roughly USD 0.50 per additional GiB ingested, which adds up fast on chatty Kubernetes workloads or audit-heavy environments. The right answer to "our Cloud Logging bill is too high" is rarely "reduce logging"; it is to route high-volume application logs to Cloud Storage via a sink, exclude them from Cloud Logging, and keep only security-relevant logs in Cloud Logging for short-term querying. Pair this with Cloud Storage lifecycle rules to move archives to Archive class after 90 days for the cheapest long-term retention.

Cloud Logging Pricing

Free tier: First 50 GiB ingested per project per month.
Beyond free tier: Roughly USD 0.50 per additional GiB ingested.
Retention: First 30 days included; longer retention costs extra unless you sink to Cloud Storage.
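The figures above translate into a simple estimate. The sketch below uses the approximate numbers quoted in this section (50 GiB free per project per month, about USD 0.50 per additional GiB) and ignores retention charges; the workload volumes are invented for illustration, and current prices should be checked against the pricing page.

```python
FREE_GIB_PER_PROJECT = 50  # monthly free allotment quoted in this section
USD_PER_GIB = 0.50         # approximate rate quoted in this section


def monthly_logging_cost(ingested_gib):
    """Estimated monthly ingestion charge for one project, in USD."""
    billable_gib = max(0, ingested_gib - FREE_GIB_PER_PROJECT)
    return billable_gib * USD_PER_GIB


print(monthly_logging_cost(30))           # 0.0    (inside the free tier)
print(monthly_logging_cost(2050))         # 1000.0
print(monthly_logging_cost(2050 - 1230))  # 385.0  (after excluding 1,230 GiB of debug logs)
```

In this hypothetical case, excluding 60% of the volume cuts the bill by more than 60%, because the free tier covers a larger share of what remains.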

Cloud Monitoring Pricing

Free tier: All Google Cloud system metrics are free.
Custom metrics and log-based metrics: Free up to 150 MiB per month, then a small per-MiB charge.
API reads: Free up to 1 million calls per month.

Cloud Trace, Cloud Profiler, Error Reporting

Cloud Trace: First 2.5 million spans ingested per month free.
Cloud Profiler: Free.
Error Reporting: Free (uses Cloud Logging as the underlying storage).

Cost Levers

Log exclusion filters: Drop debug-level logs at ingestion time.
Bucket retention tiers: Use shorter retention for noisy logs.
Sinks to Cloud Storage: Archive logs cheaply for compliance instead of paying Cloud Logging retention.
Sampling for Cloud Trace: Sample 1 in 1000 requests in high-traffic services.
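The sampling lever can be sketched as a deterministic decision on the trace ID, so that every service handling the same request reaches the same verdict. This is one common approach, shown under assumptions: the hashing scheme and the `should_sample` name are illustrative, not how Cloud Trace or any particular SDK implements sampling.

```python
import hashlib


def should_sample(trace_id, one_in=1000):
    """Keep roughly 1 in `one_in` traces, chosen by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % one_in
    return bucket == 0


kept = sum(should_sample(f"trace-{i}") for i in range(100_000))
print(kept)  # close to 100 for 100,000 traces at 1 in 1000
```

Hashing the ID rather than rolling a random number per service avoids traces with missing spans, since a request is either fully sampled or fully skipped.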

For the broader Google Cloud cost-management toolkit, see /en/certs/gcp/cdl/topics/cost-management-tools.

How Google Cloud Operations Suite Supports SRE Practices

The Google Cloud Operations Suite is the practical foundation for Site Reliability Engineering (SRE) practices popularised by Google. The mapping is direct:

SLI / SLO / error budget: Defined and tracked in Cloud Monitoring SLO.
Alerting on symptoms not causes: Built with Cloud Monitoring alerting policies tied to user-facing SLIs.
Postmortems and blameless retrospectives: Driven from Cloud Logging forensic data and Cloud Trace request timelines.
Continuous improvement: Driven from Cloud Profiler flame graphs that identify hot paths in production code.
Toil reduction: Driven by automating runbook responses to incidents produced by Cloud Monitoring alerting policies.

For a deeper treatment of DevOps and SRE principles in the CDL context, see /en/certs/gcp/cdl/topics/devops-and-sre-principles.

Security and Audit Logging

Google Cloud automatically generates Cloud Audit Logs in every project — these are the immutable record of "who did what, where, and when." There are four types:

Admin Activity logs: Always on, no charge, capture configuration changes (e.g., VM created, IAM policy changed).
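Data Access logs: Opt-in for most services (BigQuery is the notable exception, with Data Access logs enabled by default); capture data reads and writes (e.g., who read which Cloud Storage object or queried which BigQuery table).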
System Event logs: Always on, no charge, capture Google-initiated events (e.g., live migration of a VM).
Policy Denied logs: Capture access attempts that Google Cloud denies because of a security policy violation, most commonly a VPC Service Controls perimeter.

Cloud Audit Logs are stored in Cloud Logging and routed via log sinks into BigQuery or Cloud Storage for compliance retention. CDL scenarios about "regulator demands a five-year record of who accessed customer data" map to Data Access logs sunk into Cloud Storage Archive class.

Integration with the Broader Google Cloud Ecosystem

The Google Cloud Operations Suite does not stand alone — it integrates with virtually every other Google Cloud product:

IAM: Controls who can read logs, configure alerts, and acknowledge incidents.
BigQuery: Receives log sinks for SQL analytics over operational data.
Pub/Sub: Streams logs to downstream systems in real time.
Cloud Storage: Stores long-term log archives at cheap Coldline / Archive tier.
Chronicle and Security Command Center: Consume Cloud Logging audit data for security operations.
Looker: Visualises log-based analytics for executive dashboards.
VPC Service Controls: Prevents log exfiltration by enforcing perimeter controls around Cloud Logging.

This tight integration is a common reason enterprises keep the Google Cloud Operations Suite as the first stop for Google-native workloads instead of replacing it outright with a third-party tool: duplicating the data elsewhere adds friction and cost that often outweigh the marginal benefit of a separate product.

Frequently Asked Questions

Q: What is the difference between Stackdriver and Google Cloud Operations Suite?

A: They are the same product family under different names. "Stackdriver" was the original brand (from the acquisition in 2014). It was renamed "Cloud Operations Suite," and is now formally branded "Google Cloud Observability." All three names refer to Cloud Logging, Cloud Monitoring, Cloud Trace, Cloud Profiler, and Error Reporting collectively. CDL exam questions may use any of the three names interchangeably.

Q: Do I have to install agents to get observability on Cloud Run or App Engine?

A: No. Cloud Run, Cloud Functions, App Engine Standard, and App Engine Flexible automatically ship logs and system metrics to the Google Cloud Operations Suite. You only need agents on Compute Engine VMs — install the Ops Agent to capture application logs and detailed system metrics. GKE installs its logging and monitoring agents automatically on new clusters.

Q: How are alerts routed to my on-call team?

A: Create an alerting policy in Cloud Monitoring that defines the condition (e.g., "error rate above 1% for 5 minutes"). Attach one or more notification channels to the policy — supported channels include Email, SMS, Slack, PagerDuty, Webhook, Pub/Sub, Google Chat, and the Google Cloud mobile app. When the condition fires, Cloud Monitoring opens an incident and notifies all configured channels. Once the condition clears, the incident auto-resolves and a "resolved" notification is sent.

Q: How do I keep log retention costs under control?

A: Use three levers. First, log exclusion filters in Cloud Logging drop low-value entries at ingestion (e.g., debug-level logs in production, health-check probes). Second, log sinks route compliance-only logs to Cloud Storage (Coldline or Archive class) for cheap multi-year retention instead of paying Cloud Logging's per-GiB retention rate. Third, configure shorter bucket retention for chatty workloads so noisy logs age out after 7 or 14 days while audit logs are sunk to long-term storage.

Q: What replaced Cloud Debugger?

A: Cloud Debugger was shut down in May 2023. The replacement strategy combines Cloud Logging (with rich structured payloads that capture runtime context), Error Reporting (for grouped exception tracking), Cloud Trace (for distributed latency analysis), and Cloud Profiler (for CPU and memory profiling). For step-through breakpoint debugging, use local IDE debuggers against staging or test environments.

Q: Can the Google Cloud Operations Suite ingest telemetry from AWS or on-prem workloads?

A: Yes. Deploy the OpenTelemetry Collector or BindPlane in the source environment and configure it to forward logs, metrics, and traces to Google Cloud. This is how multi-cloud and hybrid enterprises operate a single observability plane across heterogeneous estates. The Google Cloud Operations Suite's metric and log models are OpenTelemetry-compatible.

Summary

The Google Cloud Operations Suite — branded as Google Cloud Observability and historically known as Stackdriver — is the unified telemetry, monitoring, and troubleshooting platform built into Google Cloud:

Cloud Logging — centralized log aggregation, log-based metrics, and log sinks to BigQuery, Pub/Sub, or Cloud Storage for analytics and compliance retention.
Cloud Monitoring — system and custom metrics, dashboards, alerting policies, notification channels, uptime checks, and SLO tracking.
Cloud Trace — distributed tracing for latency analysis across microservices, with OpenTelemetry-based auto-instrumentation.
Cloud Profiler — continuous, low-overhead CPU and memory profiling in production for Go, Java, Python, and Node.js.
Error Reporting — automatic exception aggregation and grouping from Cloud Logging payloads.

The three pillars of observability — logs, metrics, traces — map directly onto Cloud Logging, Cloud Monitoring, and Cloud Trace respectively. Cloud Run, App Engine, and Cloud Functions ship telemetry into the suite with zero agent installation; Compute Engine needs the Ops Agent; GKE installs agents automatically. The alerting model is policy → notification channel → incident, with PagerDuty, Slack, Email, and SMS as the typical on-call destinations.

As a Cloud Digital Leader, your job is to recognise which product solves which observability problem, articulate the cost levers that keep Cloud Logging from blowing up the budget, and explain how the suite supports modern SRE practices like SLOs, error budgets, and blameless postmortems. Master this framework and you will confidently answer any observability-related question on the CDL exam.
