Introduction to Cloud Logging and Error Reporting
Observability is a core pillar of the Google Cloud Professional Cloud Developer (PCD) exam. Cloud Logging is a fully managed service that ingests, stores, searches, analyzes, and exports log entries from Google Cloud, Anthos, AWS, on-premises, and arbitrary application code. Error Reporting sits on top of Cloud Logging and automatically aggregates and groups crashes and exceptions in your running applications, sending notifications when a new error signature is detected.
The PCD exam expects you to know not just what these services do, but the exact knobs you need to turn for production-grade deployments: which jsonPayload keys are special, which severity strings the API accepts, how _Default and _Required buckets differ, when to route to BigQuery versus Pub/Sub, how Error Reporting groups stack traces, and how to keep cost under control with exclusion filters and short-retention buckets.
Cloud Logging is billed on ingestion (USD per GiB written) and storage beyond the included retention, so every architecture decision in this topic is simultaneously a cost decision. Error Reporting, in contrast, is free for the analysis layer — but only if the underlying log entries it reads from _Default are still within their retention window.
Structured Logging with jsonPayload
Plain-text logs (the textPayload field) are searchable but hostile to filtering. Structured logging sends a JSON object that Cloud Logging parses into the jsonPayload field, which is then queryable with the Logging Query Language (LQL).
Special jsonPayload keys
Cloud Logging promotes a handful of well-known JSON keys to top-level LogEntry fields when the agent or runtime detects them. Memorize these — they appear in scenario questions:
- severity → promoted to LogEntry.severity
- message → used as the headline in Logs Explorer
- httpRequest → promoted to LogEntry.httpRequest (status, latency, userAgent)
- logging.googleapis.com/trace → links the entry to a Cloud Trace span
- logging.googleapis.com/spanId → ties the entry to a specific span
- logging.googleapis.com/trace_sampled → boolean for sampling decisions
- logging.googleapis.com/labels → user labels on the entry
- logging.googleapis.com/insertId → enforces idempotency on retries
Writing structured logs from app code
From Cloud Run, App Engine, Cloud Functions, and GKE with the Ops Agent or Logging Agent, anything you print to stdout/stderr as a single-line JSON object is auto-parsed. There is no SDK requirement — print(json.dumps({...})) is sufficient. If you need batching, retries, and explicit resource attribution, use the google-cloud-logging client library, which writes via the entries.write REST method.
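A minimal sketch of the stdout approach, assuming a runtime that auto-parses single-line JSON as described above. The helper names build_entry and log_structured and the sample fields order_id and remaining are illustrative only and do not belong to any Google library.

```python
import json
import sys


def build_entry(severity, message, **fields):
    """Return a dict shaped for Cloud Logging's structured-log parser."""
    entry = {"severity": severity, "message": message}
    entry.update(fields)  # extra keys remain ordinary jsonPayload fields
    return entry


def log_structured(severity, message, stream=sys.stdout, **fields):
    """Write one entry as a single-line JSON object and return that line."""
    line = json.dumps(build_entry(severity, message, **fields))
    stream.write(line + "\n")
    stream.flush()
    return line


if __name__ == "__main__":
    log_structured("WARNING", "inventory low", order_id="A-1001", remaining=3)
```

json.dumps escapes embedded newlines, so a multi-line message still arrives as one line and therefore as one log entry.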
jsonPayload — The LogEntry field that holds parsed JSON content. Mutually exclusive with textPayload and protoPayload. Searchable with jsonPayload.field_name="value" in LQL and indexable by log-based metrics.
Severity Levels and Filtering
Cloud Logging accepts exactly nine severity values, defined by google.logging.type.LogSeverity: DEFAULT, DEBUG, INFO, NOTICE, WARNING, ERROR, CRITICAL, ALERT, EMERGENCY. Anything else gets coerced to DEFAULT.
Mapping language conventions to GCP severity
- Python: logging.INFO → INFO; logging.ERROR → ERROR; logging.CRITICAL → CRITICAL
- Java: java.util.logging.Level.WARNING → WARNING; SEVERE → ERROR
- Node.js: console.error → ERROR; console.warn → WARNING; console.log → INFO
- Go: log.Fatal writes a single CRITICAL entry before exit
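For the Python row, one way to make the mapping explicit is a formatter that emits the severity key itself instead of relying on runtime defaults. This is a sketch using only the standard library; to_severity and CloudJsonFormatter are hypothetical names, and the threshold mapping is one reasonable choice rather than the behavior of the official client library.

```python
import json
import logging


def to_severity(levelno):
    """Map a numeric Python logging level to a Cloud Logging severity string."""
    if levelno >= logging.CRITICAL:
        return "CRITICAL"
    if levelno >= logging.ERROR:
        return "ERROR"
    if levelno >= logging.WARNING:
        return "WARNING"
    if levelno >= logging.INFO:
        return "INFO"
    if levelno >= logging.DEBUG:
        return "DEBUG"
    return "DEFAULT"


class CloudJsonFormatter(logging.Formatter):
    """Render each record as single-line JSON with an explicit severity."""

    def format(self, record):
        return json.dumps(
            {"severity": to_severity(record.levelno), "message": record.getMessage()}
        )


def make_logger(name="app"):
    handler = logging.StreamHandler()  # StreamHandler writes to stderr by default
    handler.setFormatter(CloudJsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger
```

Calling make_logger().error("payment failed") would then print a JSON line whose severity is ERROR regardless of which stream carries it.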
LQL filters and the severity ladder
LQL severity filters are inclusive of higher levels. severity >= WARNING returns WARNING, ERROR, CRITICAL, ALERT, and EMERGENCY — useful for alerting policies that should fire on anything more severe than the threshold. Combine with resource filters:
resource.type="cloud_run_revision"
resource.labels.service_name="checkout-api"
severity >= ERROR
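The inclusive comparison can be modelled in a few lines. The numeric ranks below are the values of the google.logging.type.LogSeverity enum; the function names are illustrative, and treating unknown strings as DEFAULT follows the coercion rule stated earlier in this section.

```python
# Numeric values of the google.logging.type.LogSeverity enum.
SEVERITY_RANK = {
    "DEFAULT": 0, "DEBUG": 100, "INFO": 200, "NOTICE": 300, "WARNING": 400,
    "ERROR": 500, "CRITICAL": 600, "ALERT": 700, "EMERGENCY": 800,
}


def at_least(entry_severity, threshold):
    """Mimic `severity >= THRESHOLD`; unknown strings are treated as DEFAULT."""
    rank = SEVERITY_RANK.get(entry_severity, SEVERITY_RANK["DEFAULT"])
    return rank >= SEVERITY_RANK[threshold]


def matching(threshold):
    """List every severity name that a `severity >= threshold` filter returns."""
    return [name for name in SEVERITY_RANK if at_least(name, threshold)]
```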
Cloud Run, Cloud Functions, and App Engine auto-promote stderr to ERROR and stdout to INFO only when the line is not valid JSON. If your JSON includes "severity": "ERROR" but you write it to stdout, Cloud Logging still classifies it as ERROR. Prefer explicit severity keys to avoid confusion when porting between runtimes.
Trace Correlation with logging.googleapis.com/trace
The single most useful observability technique on Google Cloud is correlating a log line with the full distributed trace that produced it. Cloud Run, App Engine, and Cloud Functions automatically inject the trace context as the X-Cloud-Trace-Context HTTP header.
The trace key format
Set logging.googleapis.com/trace to the full resource name: projects/PROJECT_ID/traces/TRACE_ID. Just the bare TRACE_ID does not link in the Logs Explorer side panel. The agent and most client libraries assemble the resource path for you when you provide a trace_id.
Pairing with spanId
Adding logging.googleapis.com/spanId (a 16-character hex string) anchors the entry to a specific span inside the trace, so the Trace UI can render "logs for this span" without showing entries from sibling RPCs in the same request. For OpenTelemetry-instrumented apps, the OTel exporter for Cloud Logging populates both fields automatically.
Propagation rules
- Inbound: read X-Cloud-Trace-Context (TRACE_ID/SPAN_ID;o=TRACE_TRUE)
- Outbound: forward the same header when calling downstream services
- gRPC: use the grpc-trace-bin binary metadata or the W3C traceparent header
On Cloud Run, the platform writes its own request log entry per HTTP request with the trace context populated. If your application also writes its own structured log without logging.googleapis.com/trace, you will see two unrelated entries in Logs Explorer for the same request. Always propagate the trace ID into your jsonPayload.
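A sketch of the inbound step: parse the header and produce the special keys to merge into a structured log entry. It assumes the header's SPAN_ID is a decimal number (as the Cloud Trace documentation describes), which is why it is converted to the 16-character hex form the log entry expects. The function name trace_fields and the strict regular expression are illustrative choices.

```python
import re

# TRACE_ID is 32 hex characters; SPAN_ID in this header is assumed to be decimal.
_HEADER_RE = re.compile(r"^([0-9a-fA-F]{32})(?:/(\d+))?(?:;o=([01]))?$")


def trace_fields(header, project_id):
    """Turn an X-Cloud-Trace-Context value into special jsonPayload keys."""
    match = _HEADER_RE.match(header.strip()) if header else None
    if not match:
        return {}  # missing or malformed header: log without correlation
    trace_id, span_id, sampled = match.groups()
    fields = {
        "logging.googleapis.com/trace":
            f"projects/{project_id}/traces/{trace_id.lower()}",
    }
    if span_id is not None:
        # The log entry's spanId is a 16-character hex string.
        fields["logging.googleapis.com/spanId"] = format(int(span_id), "016x")
    if sampled is not None:
        fields["logging.googleapis.com/trace_sampled"] = sampled == "1"
    return fields
```

Merging the returned dict into every entry written during the request is what ties the application's own entries to the platform's request log.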
Log Buckets: _Default, _Required, and Custom
Every Google Cloud project has two system buckets that cannot be deleted: _Required and _Default. You can create up to 100 additional log buckets per project for fine-grained retention and CMEK policies.
_Required bucket
Holds Admin Activity audit logs, System Event audit logs, and Access Transparency logs. Retention is fixed at 400 days and cannot be configured, exported away, or disabled. Storage in _Required is free.
_Default bucket
Holds every entry that is not stored in _Required; the _Default sink routes these entries here unless you disable or filter that sink. Default retention is 30 days, configurable from 1 day to 3650 days (10 years). Data Access audit logs land here once you enable them, unless you route them elsewhere.
Custom buckets
Created with gcloud logging buckets create LOG_BUCKET_NAME --location=LOCATION --retention-days=DAYS. Custom buckets let you:
- Apply CMEK with --cmek-kms-key-name
- Choose a specific region (data residency)
- Set retention from 1 day to 3650 days independently per bucket
- Enable Log Analytics (BigQuery-backed querying) with --enable-analytics
_Required = 400 days, free, admin/audit only, immutable. _Default = 30 days default, configurable 1–3650, charged after the included retention. Custom buckets = up to 100 per project, support CMEK and Log Analytics.
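If bucket creation is scripted, a small helper can enforce the retention range before the command runs. This sketch only assembles the argument list from the flags named above; bucket_create_args is a hypothetical helper and the bucket name in the usage note is made up.

```python
def bucket_create_args(name, location, retention_days,
                       cmek_key=None, enable_analytics=False):
    """Build the argument list for `gcloud logging buckets create`."""
    if not 1 <= retention_days <= 3650:
        raise ValueError("retention_days must be between 1 and 3650")
    args = [
        "gcloud", "logging", "buckets", "create", name,
        f"--location={location}",
        f"--retention-days={retention_days}",
    ]
    if cmek_key:
        args.append(f"--cmek-kms-key-name={cmek_key}")
    if enable_analytics:
        args.append("--enable-analytics")
    return args
```

For example, bucket_create_args("audit-archive", "europe-west1", 2555, enable_analytics=True) yields a list that can be passed to subprocess.run(args, check=True).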
Log Sinks: Routing to BigQuery, GCS, Pub/Sub, and Splunk
A sink is a routing rule defined at the project, folder, billing-account, or organization level. Every matched log entry is delivered to the sink's destination in addition to (or, with an exclusion, instead of) the originating bucket.
Sink destinations
- Cloud Logging bucket (same or different project): the canonical "send security logs to a central project" pattern
- BigQuery dataset: ad-hoc SQL analysis; tables are auto-created with _YYYYMMDD sharding or partitioned tables (recommended)
- Cloud Storage bucket: long-term archival; hourly file batches in LOG_NAME/YYYY/MM/DD/HH00/ prefix
- Pub/Sub topic: real-time streaming to Splunk via Dataflow's Pub/Sub to Splunk template, third-party SIEMs, or custom consumers
Writer identity
Every sink has a writer identity, a service account whose address is reported in the sink's writerIdentity field (visible with gcloud logging sinks describe SINK_NAME). For org-level sinks it is a unique p-PROJECT_NUMBER-LOCATION@... identity. Logs do not flow until you grant this identity write permission on the destination: roles/bigquery.dataEditor, roles/storage.objectCreator, or roles/pubsub.publisher.
Aggregated sinks
At the folder or organization level, an aggregated sink with --include-children pulls logs from every project below it, enabling a single centralized security-logs project for the whole org.
The Splunk integration is not a native Cloud Logging sink. The correct pattern is: Cloud Logging sink → Pub/Sub topic → Dataflow Pub/Sub to Splunk template → Splunk HEC endpoint. Expect this multi-hop architecture to appear in PCD scenarios about hybrid SIEM ingestion.
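The routing semantics can be pictured as each sink deciding independently whether to deliver an entry. The model below is a simplified illustration, not the Log Router's implementation; the Sink class, the predicate functions, and the destination strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Entry = Dict[str, object]
Predicate = Callable[[Entry], bool]


@dataclass
class Sink:
    name: str
    destination: str
    include: Optional[Predicate] = None  # None means "match everything"
    exclusions: List[Predicate] = field(default_factory=list)
    disabled: bool = False

    def accepts(self, entry: Entry) -> bool:
        if self.disabled:
            return False
        if self.include is not None and not self.include(entry):
            return False
        return not any(excluded(entry) for excluded in self.exclusions)


def route(entry: Entry, sinks: List[Sink]) -> List[str]:
    """Each sink decides on its own, so one entry can reach many destinations."""
    return [sink.destination for sink in sinks if sink.accepts(entry)]
```

In this model an exclusion on one sink never affects delivery through another sink, which is why an entry dropped from _Default can still reach BigQuery or Pub/Sub.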
Log-based Metrics: Counter and Distribution
Log-based metrics convert log content into time-series data that Cloud Monitoring can chart and alert on. Two flavors exist.
Counter metrics
Increment by one for every log entry matching a filter. Free for system metrics, charged per chargeable label for user-defined ones. Useful for:
- "How many 500 responses per minute" → filter httpRequest.status >= 500
- "Number of failed logins" → filter jsonPayload.event="login_failed"
Created with gcloud logging metrics create LOGIN_FAILURES --description="..." --log-filter='resource.type="cloud_run_revision" AND jsonPayload.event="login_failed"'.
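Conceptually, a counter metric is a filter plus a per-interval tally. The sketch below imitates that locally on a list of entry dicts; it is an illustration of the idea, not how Cloud Monitoring stores the time series, and the entry shape (an ISO 8601 timestamp string and an httpRequest dict) is an assumption.

```python
from collections import Counter
from datetime import datetime


def count_per_minute(entries, predicate):
    """Count matching entries in one-minute buckets keyed by 'YYYY-MM-DDTHH:MM'."""
    counts = Counter()
    for entry in entries:
        if predicate(entry):
            stamp = datetime.fromisoformat(entry["timestamp"])
            counts[stamp.strftime("%Y-%m-%dT%H:%M")] += 1
    return dict(counts)


def is_server_error(entry):
    """Local stand-in for the filter httpRequest.status >= 500."""
    return entry.get("httpRequest", {}).get("status", 0) >= 500
```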
Distribution metrics
Extract a numeric value from each matching entry and build a histogram. Required when you care about percentiles, not just counts. Useful for:
- p95 request latency from httpRequest.latency
- Distribution of payload sizes from jsonPayload.bytes_processed
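You specify a value extractor (e.g. EXTRACT(jsonPayload.latency_ms)) and bucket boundaries (linear, exponential, or explicit). These correspond to the valueExtractor and bucketOptions fields of the LogMetric resource, which gcloud logging metrics create accepts through --config-from-file. Labels can be extracted with the labelExtractors field to slice the histogram by tenant, region, or endpoint.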
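To see what exponential bucket boundaries do to extracted values, here is a small local model. It follows the layout used by the Cloud Monitoring distribution type (an underflow bucket, finite buckets whose upper bound is scale × growth_factor^i, and an overflow bucket); the function names and the sample values are illustrative.

```python
def exponential_bucket_index(value, num_finite_buckets, growth_factor, scale):
    """Return the histogram bucket index for `value`.

    Bucket 0 is the underflow bucket (value < scale). Finite bucket i covers
    [scale * growth_factor**(i - 1), scale * growth_factor**i). The last
    bucket, num_finite_buckets + 1, collects everything larger.
    """
    if value < scale:
        return 0
    upper = scale
    for index in range(1, num_finite_buckets + 1):
        upper *= growth_factor
        if value < upper:
            return index
    return num_finite_buckets + 1


def histogram(values, num_finite_buckets, growth_factor, scale):
    """Tally values into underflow, finite, and overflow buckets."""
    counts = [0] * (num_finite_buckets + 2)
    for value in values:
        counts[exponential_bucket_index(value, num_finite_buckets, growth_factor, scale)] += 1
    return counts
```

With four finite buckets, growth factor 2, and scale 1, the boundaries are 1, 2, 4, 8, and 16, so latencies cluster into progressively wider ranges.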
Quotas and gotchas
- Maximum of 500 log-based metrics per project
- Maximum of 10 user-defined labels per metric, 30 chargeable label values per metric per minute before throttling
- Not backfilled: metrics start counting from creation time onward, not retroactively
Error Reporting: Auto-Grouping by Stack Trace Signature
Error Reporting reads from _Default and any custom bucket where you have left ingestion enabled, then groups stack traces into error groups that share a common fingerprint.
How grouping works
The grouper extracts the exception type and the top frames of the stack trace, normalizing away dynamic content (memory addresses, line numbers in vendored code, anonymous lambda names) to produce a signature. All entries with the same signature collapse into one ErrorGroup resource, surfaced in the console as a single card with first-seen, last-seen, total count, and affected service versions.
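The grouping idea can be approximated locally. The sketch below is not Error Reporting's actual algorithm, which is not specified here; it only illustrates normalizing dynamic content and hashing the exception type with the top frames. The regular expressions, the top_n default, and the function name are all assumptions.

```python
import hashlib
import re

_HEX_ADDR = re.compile(r"0x[0-9a-fA-F]+")
_LINE_NO = re.compile(r"line \d+")


def fingerprint(exception_type, frames, top_n=3):
    """Hash the exception type plus the normalized top frames.

    `frames` is expected innermost-first, so frames[:top_n] are the frames
    closest to where the exception was raised.
    """
    normalized = []
    for frame in frames[:top_n]:
        frame = _HEX_ADDR.sub("0xADDR", frame)   # drop memory addresses
        frame = _LINE_NO.sub("line N", frame)    # drop line numbers
        normalized.append(frame.strip())
    key = "\n".join([exception_type] + normalized)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Two traces that differ only in line numbers or object addresses produce the same value, while a different exception type produces a different one.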
Supported runtimes
Auto-detection works for Java, Python, Node.js, Go, .NET, Ruby, and PHP when the stack trace is written in the platform's idiomatic format. For other languages, format the trace yourself and include it in jsonPayload.message along with @type = type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent.
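A sketch of building such an entry in Python from a caught exception. The @type string is the one given above; the serviceContext block and the default service and version values are illustrative additions, and error_entry is a hypothetical helper.

```python
import json
import traceback

REPORTED_ERROR_TYPE = (
    "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent"
)


def error_entry(exc, service="checkout-api", version="v1"):
    """Build a structured log entry that carries a formatted stack trace."""
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return {
        "@type": REPORTED_ERROR_TYPE,
        "severity": "ERROR",
        "message": trace,  # the full trace, newlines included
        "serviceContext": {"service": service, "version": version},
    }


if __name__ == "__main__":
    try:
        {}["missing"]
    except KeyError as err:
        print(json.dumps(error_entry(err)))  # one line; newlines are escaped
```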
Notifications
Each project can configure email notifications for new error groups (the first time a signature is seen) and resolved-then-reopened groups. Mobile push notifications via the Google Cloud Console app are also supported. Integration with PagerDuty or Slack is achieved indirectly: create an alerting policy on a log-based metric, such as the system metric logging.googleapis.com/log_entry_count filtered to severity ERROR and above, or on a custom metric, then route via Notification Channels.
A common exam trap is suggesting "create an alerting policy in Error Reporting" — Error Reporting itself does not expose Cloud Monitoring alerting policies. You alert on log-based metrics derived from the same logs, or on Error Reporting's email/mobile notifications. Picking the "Error Reporting alerting policy" distractor is wrong.
Cloud Logging API for Application Instrumentation
For workloads outside auto-instrumented runtimes (on-prem servers, batch jobs, mobile telemetry), use the Cloud Logging API directly.
entries.write
POST https://logging.googleapis.com/v2/entries:write accepts up to 1,000 entries or 10 MB per request, whichever is smaller. Each entry must include a logName (projects/PROJECT_ID/logs/LOG_ID), a resource (with type and labels), and exactly one of textPayload, jsonPayload, or protoPayload.
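A sketch of assembling request bodies under those rules. It validates the one-payload requirement and splits on the entry count quoted above, but it does not check the size limit or send anything; make_entry and write_requests are hypothetical helpers, and the global resource type is just one valid choice.

```python
import json
from urllib.parse import quote

PAYLOAD_KEYS = ("textPayload", "jsonPayload", "protoPayload")
MAX_ENTRIES_PER_REQUEST = 1000  # entry limit quoted in the text above


def make_entry(project_id, log_id, resource, **payload):
    """Build one LogEntry dict with exactly one payload field."""
    present = [key for key in PAYLOAD_KEYS if key in payload]
    if len(present) != 1 or len(payload) != 1:
        raise ValueError("provide exactly one of " + ", ".join(PAYLOAD_KEYS))
    entry = {
        # LOG_ID must be URL-encoded, e.g. "/" becomes "%2F".
        "logName": f"projects/{project_id}/logs/{quote(log_id, safe='')}",
        "resource": resource,
    }
    entry.update(payload)
    return entry


def write_requests(entries):
    """Yield JSON request bodies, each holding at most 1,000 entries."""
    for start in range(0, len(entries), MAX_ENTRIES_PER_REQUEST):
        chunk = entries[start:start + MAX_ENTRIES_PER_REQUEST]
        yield json.dumps({"entries": chunk})
```

Each yielded string would be the body of one authenticated POST to the entries:write URL.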
Client library batching
The official client libraries (google-cloud-logging for Python, @google-cloud/logging for Node.js, cloud.google.com/go/logging in Go) wrap entries.write with:
- In-process batching with configurable delay_threshold and batch_size
- Background flushing on a worker goroutine/thread
- Resource auto-detection on Cloud Run, GKE, GCE, App Engine
- Built-in retry with exponential backoff on UNAVAILABLE and DEADLINE_EXCEEDED
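If you call the API without a client library, you need to reproduce the retry behavior yourself. The sketch below shows exponential backoff with jitter on the two retryable codes; it is not the client libraries' code. RpcError is a hypothetical exception type, and the delay parameters are arbitrary defaults.

```python
import random
import time

RETRYABLE = {"UNAVAILABLE", "DEADLINE_EXCEEDED"}


class RpcError(Exception):
    """Hypothetical error type carrying a gRPC-style status code name."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call send(), retrying retryable failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RpcError as err:
            last_attempt = attempt == max_attempts - 1
            if err.code not in RETRYABLE or last_attempt:
                raise  # permanent error, or retries exhausted
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries
```

Passing sleep as a parameter keeps the function easy to test without real waiting.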
Authentication
Service accounts need roles/logging.logWriter. On Cloud Run, App Engine, Cloud Functions, GKE Workload Identity, and Compute Engine, the platform default service account already has this role unless you removed it. For external machines, use a service-account key or Workload Identity Federation (preferred) — never embed long-lived keys.
Retention Configuration and Cost Control
Cloud Logging cost has two components: ingestion (first 50 GiB/month free per project, then USD per GiB) and storage beyond the included retention window (_Default includes 30 days free).
Configuring retention per bucket
gcloud logging buckets update _Default \
--location=global \
--retention-days=7
Reducing _Default retention from 30 to 7 days has no effect on ingestion charges — it only affects storage charges that kick in after the first 30 days, so dropping below 30 saves nothing. To cut cost, exclude logs before they are ingested.
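The arithmetic behind that advice, as a sketch. Only the 50 GiB free allotment comes from the text; the price per GiB is a parameter you supply, and the volumes in the usage note are made-up examples.

```python
FREE_INGESTION_GIB = 50  # monthly free allotment quoted above


def billable_ingestion_gib(ingested_gib, excluded_gib=0.0):
    """GiB billed after exclusions and the free allotment are subtracted."""
    kept = max(0.0, ingested_gib - excluded_gib)
    return max(0.0, kept - FREE_INGESTION_GIB)


def monthly_ingestion_cost(ingested_gib, price_per_gib, excluded_gib=0.0):
    """Ingestion cost for one month at a caller-supplied price per GiB."""
    return billable_ingestion_gib(ingested_gib, excluded_gib) * price_per_gib
```

A project writing 300 GiB per month is billed for 250 GiB; excluding 120 GiB of health-check noise brings that down to 130 GiB, whereas shortening retention changes neither number.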
Locked retention for compliance
Add --locked to a bucket update to make retention immutable: nobody, including project owners, can shorten retention or delete the bucket until every entry has aged past its retention. This satisfies WORM compliance for HIPAA, PCI-DSS, and SEC 17a-4.
Locked buckets cannot be deleted while they contain entries. Plan capacity accordingly — at 100 buckets per project, you can quickly box yourself in. Use a dedicated long-retention archive bucket per regulated workload rather than locking every bucket "just in case".
CMEK on Log Buckets
By default, Cloud Logging encrypts data at rest with Google-managed keys. For regulated workloads, use Customer-Managed Encryption Keys (CMEK) stored in Cloud KMS.
Two scopes for CMEK
- Project default key: applied at the project or folder level via gcloud logging cmek-settings update --kms-key-name=.... Any new log bucket created in the project inherits this key.
- Per-bucket key: set at bucket creation with --cmek-kms-key-name. Cannot be added to an existing bucket; you must create a new bucket and route logs there.
IAM requirements
Grant the Cloud Logging service account roles/cloudkms.cryptoKeyEncrypterDecrypter on the KMS key. The service account address is reported by gcloud logging settings describe --project=PROJECT_ID in its kmsServiceAccountId field. Without this grant, log ingestion to the CMEK-protected bucket fails silently and entries are dropped.
Regional alignment
The KMS keyring must be in the same region as the log bucket, or in a multi-region that contains it. A us-central1 bucket cannot use a europe-west1 key.
Exclusion Filters
Exclusion filters live on the _Default sink (and any custom sink) and drop matching entries before ingestion is billed. This is the single biggest lever for Cloud Logging cost control.
Common exclusions
- Health-check probes: httpRequest.requestUrl=~"/healthz$" AND httpRequest.status=200
- Verbose Kubernetes audit at the RequestReceived stage: protoPayload.@type="type.googleapis.com/google.cloud.audit.AuditLog" AND protoPayload.metadata.stage="RequestReceived"
- Successful Cloud SQL slow-query logs below a latency threshold
- Repetitive load-balancer entries for static-asset URLs
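Before deploying an exclusion, it can help to dry-run its logic against sample entries. The sketch below approximates the health-check filter with Python's re module (the query language uses RE2, which behaves the same for this simple pattern); the entry shape and function names are assumptions.

```python
import re

HEALTHZ = re.compile(r"/healthz$")


def is_excluded(entry):
    """Approximate httpRequest.requestUrl=~"/healthz$" AND httpRequest.status=200."""
    request = entry.get("httpRequest", {})
    url = request.get("requestUrl", "")
    # search() is unanchored at the start, matching how =~ finds a substring.
    return bool(HEALTHZ.search(url)) and request.get("status") == 200


def apply_exclusion(entries):
    """Return only the entries the exclusion would let through."""
    return [entry for entry in entries if not is_excluded(entry)]
```

Note that a failing probe (status 503) and a URL with a query string after /healthz both survive the filter, which is usually what you want when debugging.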
Setting exclusions
gcloud logging sinks update _Default \
--add-exclusion=name=skip-healthz,filter='httpRequest.requestUrl=~"/healthz$"'
You can attach up to 50 exclusion filters per sink. Each exclusion has a disabled flag for temporary debugging: set it to true to suspend the exclusion, reproduce an incident with full logs, then set it back to false so the exclusion applies again.
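Sink exclusions do not hide entries from project-level user-defined log-based metrics, which are computed when the Logging API receives an entry, before any sink routes or drops it. If you create a metric that counts health-check 200s, then add an exclusion that drops them from _Default, the metric keeps counting even though the entries are no longer stored. Exclude the truly worthless, but keep at least a 1% sample in a bucket if you need to inspect the raw entries behind an SLO denominator.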
Log Analytics and BigQuery Integration
Log Analytics is a Cloud Logging feature that lets you query log buckets with standard SQL without first exporting them to BigQuery. Enable it per bucket at creation time with --enable-analytics; existing buckets can be upgraded but the operation is irreversible.
Querying log buckets with SQL
Once analytics is enabled, the bucket's entries are exposed through a view named _AllLogs, addressed as PROJECT_ID.LOCATION.BUCKET_ID._AllLogs. Creating a linked BigQuery dataset additionally makes the same data queryable from BigQuery itself. You can run queries like:
SELECT
  timestamp,
  resource.labels.service_name AS service,
  JSON_VALUE(json_payload, '$.user_id') AS user_id,
  severity
FROM `PROJECT_ID.LOCATION.BUCKET_ID._AllLogs`
WHERE severity = 'ERROR'
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY timestamp DESC
LIMIT 100;
This sidesteps the need for a BigQuery sink for ad-hoc investigations, but does not replace a sink when you need to join logs with non-log tables. For long-running joins against fact tables, a BigQuery sink with partitioned tables remains cheaper and faster.
When to choose each path
- Ad-hoc one-off SQL on the last 30 days of logs → Log Analytics
- Joining logs with order, customer, or product tables → BigQuery sink
- Reports retained beyond bucket retention → BigQuery sink with partition expiration aligned to compliance window
- Cross-project consolidated dashboards → aggregated sink at org level into one BigQuery dataset
Audit Logging Categories
The PCD exam expects fluency in the four Google Cloud audit log categories, all delivered via Cloud Logging.
Admin Activity
Always-on, free, written to _Required for 400 days. Captures every IAM change, resource create/update/delete, and configuration mutation. Cannot be disabled. Use these logs to answer "who changed the firewall rule at 03:14?".
Data Access
Off by default for most services (always-on for BigQuery DATA_READ). Three sub-types: ADMIN_READ, DATA_READ, DATA_WRITE. Enable selectively per service via the project's IAM & Admin > Audit Logs page, because DATA_READ on chatty services like Cloud Storage will quickly dominate your ingestion bill.
System Event
Always-on, free, captures GCP-initiated actions like live migration or auto-renewal of TLS certificates. Useful for post-incident reconstruction.
Policy Denied
Records when an IAM policy or VPC Service Controls rule denies a request. Crucial for diagnosing "why does this service account get 403 when I think I gave it permission?" — the denied log entry tells you which condition or perimeter blocked the call.
Enabling DATA_READ on production Cloud Storage buckets is the single most common surprise on Cloud Logging bills. A bucket serving public assets can generate millions of read entries per day. Enable it only on sensitive buckets, route to a dedicated long-retention custom bucket, and exclude unauthenticated reads if not required for compliance.
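When processing exported entries, the audit category can be read off the log name. The sketch below maps the four audit log IDs to the categories above; audit_category is a hypothetical helper, and it assumes log names use the URL-encoded form such as cloudaudit.googleapis.com%2Factivity.

```python
from urllib.parse import unquote

AUDIT_LOG_IDS = {
    "cloudaudit.googleapis.com/activity": "Admin Activity",
    "cloudaudit.googleapis.com/data_access": "Data Access",
    "cloudaudit.googleapis.com/system_event": "System Event",
    "cloudaudit.googleapis.com/policy": "Policy Denied",
}


def audit_category(log_name):
    """Return the audit category for a logName, or None for non-audit logs."""
    _, separator, log_id = log_name.partition("/logs/")
    if not separator:
        return None
    return AUDIT_LOG_IDS.get(unquote(log_id))
```

Grouping a day's exported entries by this value gives a quick view of which category dominates ingestion before you decide what to exclude.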
Plain English Explanation
Analogy 1: The Black Box Flight Recorder
Cloud Logging is like the black box on a passenger jet. Every sensor reading, every cockpit announcement, every status flag is written to a tamper-resistant recorder. When the investigators arrive, they replay the recording millisecond by millisecond. Log buckets are the different black boxes you might install — _Required is the bolted-down regulatory recorder you legally cannot remove, _Default is your everyday voice recorder, and custom buckets are extra recorders you install in specific cabins (services) with their own retention rules.
Analogy 2: The Hospital Triage Nurse
Error Reporting is the triage nurse standing at the emergency-room door. Cloud Logging is the paperwork archive that holds every patient's full file forever, including coughs and headaches. The triage nurse ignores routine paperwork and only flags the genuinely serious cases — broken bones, chest pains. She also recognizes repeat visitors: if the same patient with the same symptom comes back, she does not start a new file, she pulls the existing one and adds another visit. That fingerprint-matching is exactly how Error Reporting groups stack traces, and the "fast track" lane the nurse sends serious cases to is the email/mobile notification that pages the on-call developer.
Analogy 3: The Postal Sorting Office
Log sinks behave like a postal sorting office. Every letter (log entry) arrives at one central conveyor (_Default). At the sorting station, rules decide where to forward each letter: BigQuery is the research department that wants every letter cataloged for SQL queries, Cloud Storage is the long-term archive vault that gets hourly batches, Pub/Sub is the high-speed pneumatic tube that fires letters to Splunk in real time. The exclusion filter is the office manager throwing the junk mail straight into the recycling bin so nobody pays to process it, which is exactly why you exclude health-check logs before they are ingested.
Frequently Asked Questions
Q1: Can Cloud Logging handle logs from on-premises servers?
Yes. Install the Ops Agent (the unified successor to the legacy Logging Agent and Monitoring Agent) on physical or virtual servers, configure it with a service-account key or Workload Identity Federation, and it streams entries to Cloud Logging just like a GCE VM. For custom apps without an agent, call the entries.write REST endpoint directly or use the google-cloud-logging client libraries, which wrap it.
Q2: Is Error Reporting real-time?
Yes, with a small ingestion delay. Once a stack trace lands in _Default, Error Reporting typically detects and groups it within seconds to a minute. New-error notifications are sent via email immediately after the first matching entry is grouped, and the Google Cloud Console mobile app can push notifications to your phone.
Q3: How do I keep audit logs for seven years for SOX compliance?
Extending _Required is not possible; its retention is fixed at 400 days. Two options remain. (1) Create an aggregated sink at the organization level filtering logName:"cloudaudit.googleapis.com", and send to a Cloud Storage bucket with a 7-year retention policy and bucket lock, or to a Cloud Logging bucket with --retention-days=2555 --locked. (2) Route to BigQuery for ad-hoc audit queries; BigQuery datasets have their own retention controls.
Q4: My Cloud Logging bill is dominated by Kubernetes container logs. What now?
Apply exclusion filters on _Default to drop noisy namespaces (resource.labels.namespace_name=~"^(kube-system|istio-system)$"), drop verbose info logs from sidecars, and keep Kubernetes audit entries only at the ResponseComplete stage rather than RequestReceived. If you still need the data, send a sampled (e.g. 10%) stream to a low-cost Cloud Storage sink and exclude the rest entirely.
Q5: How do I get logs and traces side by side?
Inject logging.googleapis.com/trace = projects/PROJECT_ID/traces/TRACE_ID into every log entry. On Cloud Run, App Engine, and Cloud Functions, the platform extracts TRACE_ID from the X-Cloud-Trace-Context request header — propagate it into your structured log payload, then open any log entry in Logs Explorer and click "Show in context" or jump straight to the Trace UI. Add logging.googleapis.com/spanId for span-level correlation.
Q6: Are log-based metrics retroactive?
No. A log-based metric begins recording at creation time and only counts entries written after that moment. If you need historical analysis of an event that has not been measured yet, route the logs to BigQuery via a sink and run a COUNT(*) query against the historical partitions.