Cloud Logging and Error Reporting

Introduction to Cloud Logging and Error Reporting

Observability is a core pillar of the Google Cloud Professional Cloud Developer (PCD) exam. Cloud Logging is a fully managed service that ingests, stores, searches, analyzes, and exports log entries from Google Cloud, Anthos, AWS, on-premises, and arbitrary application code. Error Reporting sits on top of Cloud Logging and automatically aggregates and groups crashes and exceptions in your running applications, sending notifications when a new error signature is detected.

The PCD exam expects you to know not just what these services do, but the exact knobs you need to turn for production-grade deployments: which jsonPayload keys are special, which severity strings the API accepts, how _Default and _Required buckets differ, when to route to BigQuery versus Pub/Sub, how Error Reporting groups stack traces, and how to keep cost under control with exclusion filters and short-retention buckets.

Cloud Logging is billed on ingestion (USD per GiB written) and storage beyond the included retention, so every architecture decision in this topic is simultaneously a cost decision. Error Reporting, in contrast, is free for the analysis layer — but only if the underlying log entries it reads from _Default are still within their retention window.

Structured Logging with jsonPayload

Plain-text logs (the textPayload field) are searchable but hostile to filtering. Structured logging sends a JSON object that Cloud Logging parses into the jsonPayload field, which is then queryable with the Logging Query Language (LQL).

Special jsonPayload keys

Cloud Logging promotes a handful of well-known JSON keys to top-level LogEntry fields when the agent or runtime detects them. Memorize these — they appear in scenario questions:

severity → promoted to LogEntry.severity
message → used as the headline in Logs Explorer
httpRequest → promoted to LogEntry.httpRequest (status, latency, userAgent)
logging.googleapis.com/trace → links the entry to a Cloud Trace span
logging.googleapis.com/spanId → ties the entry to a specific span
logging.googleapis.com/trace_sampled → boolean for sampling decisions
logging.googleapis.com/labels → user labels on the entry
logging.googleapis.com/insertId → lets Cloud Logging deduplicate entries that are written again on retries

Writing structured logs from app code

From Cloud Run, App Engine, Cloud Functions, and GKE, anything you print to stdout/stderr as a single-line JSON object is auto-parsed; on Compute Engine VMs, the Ops Agent or legacy Logging Agent does the same for the log files it is configured to parse. There is no SDK requirement: print(json.dumps({...})) is sufficient. If you need batching, retries, and explicit resource attribution, use the google-cloud-logging client library, which writes via the entries.write API method.
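A minimal sketch of the "no SDK" approach, using only the standard library. The helper names format_log_line and log, and the order_id field, are illustrative assumptions rather than anything Cloud Logging defines; only the severity, message, and logging.googleapis.com/trace keys come from the list of special keys.

```python
import json
import sys


def format_log_line(message, severity="INFO", trace=None, **fields):
    """Return one single-line JSON string that uses the special keys."""
    entry = {"severity": severity, "message": message, **fields}
    if trace is not None:
        # Expected form: projects/PROJECT_ID/traces/TRACE_ID
        entry["logging.googleapis.com/trace"] = trace
    # json.dumps escapes newlines inside strings, so the entry stays on one line.
    return json.dumps(entry)


def log(message, severity="INFO", trace=None, **fields):
    """Print the entry to stdout, where the runtime's log collection reads it."""
    print(format_log_line(message, severity, trace, **fields),
          file=sys.stdout, flush=True)


# Hypothetical application event with one extra searchable field.
log("order placed", severity="NOTICE", order_id="A-1001")
```

Extra keyword arguments such as order_id stay in jsonPayload, so they remain filterable as jsonPayload.order_id="A-1001".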

jsonPayload — The LogEntry field that holds parsed JSON content. Mutually exclusive with textPayload and protoPayload. Searchable with jsonPayload.field_name="value" in LQL and indexable by log-based metrics.

Severity Levels and Filtering

Cloud Logging accepts exactly nine severity values, defined by google.logging.type.LogSeverity: DEFAULT, DEBUG, INFO, NOTICE, WARNING, ERROR, CRITICAL, ALERT, EMERGENCY. Anything else gets coerced to DEFAULT.

Mapping language conventions to GCP severity

Python logging.INFO → INFO; logging.ERROR → ERROR; logging.CRITICAL → CRITICAL
Java java.util.logging.Level.WARNING → WARNING; SEVERE → ERROR
Node.js console.error → ERROR; console.warn → WARNING; console.log → INFO
Go log.Fatal writes a single CRITICAL entry before exit
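The Python row of the mapping above could be applied with a small formatter like the one below. This is a sketch under assumptions: gcp_severity and JsonFormatter are hypothetical names, and levels between the standard ones are rounded down to the nearest standard level, which is a design choice made here rather than documented behavior.

```python
import json
import logging

# Standard Python levels, highest first, paired with Cloud Logging severities.
_PY_TO_GCP = [
    (logging.CRITICAL, "CRITICAL"),
    (logging.ERROR, "ERROR"),
    (logging.WARNING, "WARNING"),
    (logging.INFO, "INFO"),
    (logging.DEBUG, "DEBUG"),
]


def gcp_severity(levelno):
    """Map a Python level number to a Cloud Logging severity string."""
    for threshold, name in _PY_TO_GCP:
        if levelno >= threshold:
            return name
    return "DEFAULT"  # anything below DEBUG, including NOTSET


class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object with a severity key."""

    def format(self, record):
        return json.dumps({
            "severity": gcp_severity(record.levelno),
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.warning("disk usage at %d%%", 91)
```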

LQL filters and the severity ladder

LQL severity filters are inclusive of higher levels. severity >= WARNING returns WARNING, ERROR, CRITICAL, ALERT, and EMERGENCY, which is useful for alerting policies that should fire on anything at or above the threshold. Combine with resource filters:

resource.type="cloud_run_revision"
resource.labels.service_name="checkout-api"
severity >= ERROR
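The ordering behind a comparison such as severity >= ERROR can be sketched with the numeric values of the LogSeverity enum. The helper names below are hypothetical; treating an unknown string as DEFAULT follows the coercion rule stated earlier.

```python
# Numeric values of google.logging.type.LogSeverity, lowest to highest.
LOG_SEVERITY = {
    "DEFAULT": 0,
    "DEBUG": 100,
    "INFO": 200,
    "NOTICE": 300,
    "WARNING": 400,
    "ERROR": 500,
    "CRITICAL": 600,
    "ALERT": 700,
    "EMERGENCY": 800,
}


def matches_at_least(entry_severity, threshold):
    """Emulate `severity >= threshold`; unknown strings count as DEFAULT."""
    return LOG_SEVERITY.get(entry_severity, 0) >= LOG_SEVERITY[threshold]


def severities_at_least(threshold):
    """List every severity that a `severity >= threshold` filter returns."""
    floor = LOG_SEVERITY[threshold]
    return [name for name, value in LOG_SEVERITY.items() if value >= floor]


print(severities_at_least("WARNING"))
# ['WARNING', 'ERROR', 'CRITICAL', 'ALERT', 'EMERGENCY']
```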

Cloud Run, Cloud Functions, and App Engine auto-promote stderr to ERROR and stdout to INFO only when the line is not valid JSON. If your JSON includes "severity": "ERROR" but you write it to stdout, Cloud Logging still classifies it as ERROR. Prefer explicit severity keys to avoid confusion when porting between runtimes.

Trace Correlation with logging.googleapis.com/trace

The single most useful observability technique on Google Cloud is correlating a log line with the full distributed trace that produced it. Cloud Run, App Engine, and Cloud Functions automatically inject the trace context as the X-Cloud-Trace-Context HTTP header.

The trace key format

Set logging.googleapis.com/trace to the full resource name: projects/PROJECT_ID/traces/TRACE_ID. Just the bare TRACE_ID does not link in the Logs Explorer side panel. The agent and most client libraries assemble the resource path for you when you provide a trace_id.

Pairing with spanId

Adding logging.googleapis.com/spanId (a 16-character hex string) anchors the entry to a specific span inside the trace, so the Trace UI can render "logs for this span" without showing entries from sibling RPCs in the same request. For OpenTelemetry-instrumented apps, the OTel exporter for Cloud Logging populates both fields automatically.

Propagation rules

Inbound: read X-Cloud-Trace-Context (TRACE_ID/SPAN_ID;o=TRACE_TRUE)
Outbound: forward the same header when calling downstream services
gRPC: use the grpc-trace-bin binary metadata or the W3C traceparent header
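The inbound rule can be sketched as a small parser that turns the header into the structured-log keys. One detail worth knowing: the SPAN_ID in X-Cloud-Trace-Context is a decimal number, while logging.googleapis.com/spanId expects 16 hex characters, so a conversion is needed. The function name trace_fields and the sample header value are assumptions made for illustration.

```python
import re

# TRACE_ID is 32 hex characters; SPAN_ID (decimal) and the options are optional.
_HEADER_RE = re.compile(
    r"^(?P<trace>[0-9a-fA-F]{32})(?:/(?P<span>\d+))?(?:;o=(?P<opt>\d))?$"
)


def trace_fields(header, project_id):
    """Convert an X-Cloud-Trace-Context value into structured-log keys."""
    match = _HEADER_RE.match(header.strip())
    if match is None:
        return {}  # malformed header: log without trace correlation
    fields = {
        "logging.googleapis.com/trace":
            f"projects/{project_id}/traces/{match['trace'].lower()}",
    }
    if match["span"] is not None:
        # Decimal span ID in the header -> 16-character hex in the log entry.
        fields["logging.googleapis.com/spanId"] = format(int(match["span"]), "016x")
    if match["opt"] is not None:
        fields["logging.googleapis.com/trace_sampled"] = match["opt"] == "1"
    return fields


print(trace_fields("105445aa7843bc8bf206b12000100000/1;o=1", "my-project"))
```

Merging the returned dictionary into each JSON log line is enough to make the application's entries share a trace with the platform's request log.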

On Cloud Run, the platform writes its own request log entry per HTTP request with the trace context populated. If your application also writes its own structured log without logging.googleapis.com/trace, you will see two unrelated entries in Logs Explorer for the same request. Always propagate the trace ID into your jsonPayload.

Log Buckets: _Default, _Required, and Custom

Every Google Cloud project has two system buckets that cannot be deleted: _Required and _Default. You can create up to 100 additional log buckets per project for fine-grained retention and CMEK policies.

_Required bucket

Holds Admin Activity audit logs, System Event audit logs, and Access Transparency logs. Retention is fixed at 400 days; you cannot change it, exclude or redirect these logs away from the bucket, or disable the bucket. Storage in _Required is free.

_Default bucket

Receives every entry that is not stored in _Required, through the project's _Default sink, unless you disable that sink or add exclusion filters to it. Default retention is 30 days, configurable from 1 day to 3650 days (10 years). Data Access audit logs, once you enable them, land here unless you route them elsewhere.

Custom buckets

Created with gcloud logging buckets create LOG_BUCKET_NAME --location=LOCATION --retention-days=DAYS. Custom buckets let you:

Apply CMEK with --cmek-kms-key-name
Choose a specific region (data residency)
Set retention from 1 day to 3650 days independently per bucket
Enable Log Analytics (BigQuery-backed querying) with --enable-analytics

_Required = 400 days, free, admin/audit only, immutable. _Default = 30 days default, configurable 1–3650, charged after the included retention. Custom buckets = up to 100 per project, support CMEK and Log Analytics.

Log Sinks: Routing to BigQuery, GCS, Pub/Sub, and Splunk

A sink is a routing rule defined at the project, folder, billing-account, or organization level. Every matched log entry is delivered to the sink's destination in addition to (or, with an exclusion, instead of) the originating bucket.

Sink destinations

Cloud Logging bucket (same or different project): the canonical "send security logs to a central project" pattern
BigQuery dataset: ad-hoc SQL analysis; tables are auto-created with _YYYYMMDD sharding or partitioned tables (recommended)
Cloud Storage bucket: long-term archival; entries are written as hourly file batches under a LOG_ID/YYYY/MM/DD/ prefix
Pub/Sub topic: real-time streaming to Splunk via Dataflow's Pub/Sub to Splunk template, third-party SIEMs, or custom consumers

Writer identity

Every sink has a writer identity: a Google-managed service account whose address appears in the sink's writerIdentity field, for example in the output of gcloud logging sinks describe SINK_NAME (for org-level sinks, a unique p-PROJECT_NUMBER-LOCATION@... identity). You must grant this identity write permission on the destination (roles/bigquery.dataEditor, roles/storage.objectCreator, or roles/pubsub.publisher) before logs flow.

Aggregated sinks

At the folder or organization level, an aggregated sink with --include-children pulls logs from every project below it, enabling a single centralized security-logs project for the whole org.

The Splunk integration is not a native Cloud Logging sink. The correct pattern is: Cloud Logging sink → Pub/Sub topic → Dataflow Pub/Sub to Splunk template → Splunk HEC endpoint. Expect this multi-hop architecture to appear in PCD scenarios about hybrid SIEM ingestion.

Log-based Metrics: Counter and Distribution

Log-based metrics convert log content into time-series data that Cloud Monitoring can chart and alert on. Two flavors exist.

Counter metrics

Increment by one for every log entry matching a filter. Free for system metrics, charged per chargeable label for user-defined ones. Useful for:

"How many 500 responses per minute" → filter httpRequest.status >= 500
"Number of failed logins" → filter jsonPayload.event="login_failed"

Created with gcloud logging metrics create LOGIN_FAILURES --description="..." --log-filter='resource.type="cloud_run_revision" AND jsonPayload.event="login_failed"'.

Distribution metrics

Extract a numeric value from each matching entry and build a histogram. Required when you care about percentiles, not just counts. Useful for:

p95 request latency from httpRequest.latency
Distribution of payload sizes from jsonPayload.bytes_processed

In the metric definition you specify a valueExtractor (e.g. EXTRACT(jsonPayload.latency_ms)) and bucket boundaries (linear, exponential, or explicit); with gcloud, these fields are supplied in a file passed to --config-from-file. Labels can be extracted with labelExtractors to slice the histogram by tenant, region, or endpoint.
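To make the exponential bucket layout concrete, the sketch below computes the boundaries from the three parameters of an exponential bucket definition (number of finite buckets, growth factor, scale), where boundary i is scale × growth_factor^i. The function names and the sample latency values are illustrative assumptions.

```python
import bisect


def exponential_bounds(num_finite_buckets, growth_factor, scale):
    """Return the bucket boundaries: scale * growth_factor**i for i = 0..N."""
    if num_finite_buckets < 1 or growth_factor <= 1 or scale <= 0:
        raise ValueError("need num_finite_buckets >= 1, growth_factor > 1, scale > 0")
    return [scale * growth_factor ** i for i in range(num_finite_buckets + 1)]


def bucket_index(value, bounds):
    """Index 0 is the underflow bucket; len(bounds) is the overflow bucket."""
    # Lower bounds are inclusive, so a value equal to a boundary moves up.
    return bisect.bisect_right(bounds, value)


bounds = exponential_bounds(num_finite_buckets=4, growth_factor=2, scale=10)
print(bounds)                      # [10, 20, 40, 80, 160]
print(bucket_index(35, bounds))    # 2, i.e. the bucket covering [20, 40)
```

Exponential layouts suit latency-style values because they give fine resolution near zero while still covering a long tail with few buckets.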

Quotas and gotchas

Maximum of 500 log-based metrics per project
Maximum of 10 user-defined labels per metric, 30 chargeable label values per metric per minute before throttling
Not backfilled: metrics start counting from creation time onward, not retroactively

Error Reporting: Auto-Grouping by Stack Trace Signature

Error Reporting reads from _Default and any custom bucket where you have left ingestion enabled, then groups stack traces into error groups that share a common fingerprint.

How grouping works

The grouper extracts the exception type and the top frames of the stack trace, normalizing away dynamic content (memory addresses, line numbers in vendored code, anonymous lambda names) to produce a signature. All entries with the same signature collapse into one ErrorGroup resource, surfaced in the console as a single card with first-seen, last-seen, total count, and affected service versions.
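The idea of a normalized signature can be illustrated with a toy function. This is not Error Reporting's actual grouping algorithm, which the text above describes only in outline; the regular expressions, the top_n cutoff, and the frame format are all assumptions, and this toy strips every line number rather than only those in vendored code.

```python
import hashlib
import re


def signature(exception_type, frames, top_n=3):
    """Toy fingerprint: exception type plus normalized top stack frames."""
    normalized = []
    for frame in frames[:top_n]:
        frame = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", frame)    # memory addresses
        frame = re.sub(r":\d+", ":LINE", frame)                 # line numbers
        frame = re.sub(r"<lambda[^>]*>", "<lambda>", frame)    # lambda names
        normalized.append(frame)
    key = exception_type + "|" + "|".join(normalized)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


first = signature("ValueError", ["app.py:42 in handler", "lib.py:10 in call at 0x7ffabc"])
second = signature("ValueError", ["app.py:57 in handler", "lib.py:12 in call at 0x7ff000"])
print(first == second)  # True: same group despite different lines and addresses
```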

Supported runtimes

Auto-detection works for Java, Python, Node.js, Go, .NET, Ruby, and PHP when the stack trace is written in the platform's idiomatic format. For other languages, format the trace yourself and include it in jsonPayload.message along with @type = type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent.
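A hedged sketch of the manual path: format the trace yourself and emit it as a structured log line carrying the @type marker. The helper name error_event and the service and version values are hypothetical, and the serviceContext block is an addition beyond what the paragraph above mentions.

```python
import json
import traceback

REPORTED_ERROR_TYPE = (
    "type.googleapis.com/google.devtools.clouderrorreporting"
    ".v1beta1.ReportedErrorEvent"
)


def error_event(exc, service, version):
    """Serialize an exception as a single-line ReportedErrorEvent-style entry."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return json.dumps({
        "severity": "ERROR",
        "@type": REPORTED_ERROR_TYPE,
        "message": stack,  # the full stack trace goes in the message field
        # Assumption: serviceContext lets the console show service and version.
        "serviceContext": {"service": service, "version": version},
    })


try:
    {}["missing"]
except KeyError as exc:
    print(error_event(exc, service="checkout-api", version="v1"))
```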

Notifications

Each project can configure email notifications for new error groups (the first time a signature is seen) and resolved-then-reopened groups. Mobile push notifications via the Google Cloud Console app are also supported. Integration with PagerDuty or Slack is achieved indirectly: create an alerting policy on a system log-based metric such as logging.googleapis.com/log_entry_count (filtered on its severity label) or on a custom metric, then route via Notification Channels.

A common exam trap is suggesting "create an alerting policy in Error Reporting" — Error Reporting itself does not expose Cloud Monitoring alerting policies. You alert on log-based metrics derived from the same logs, or on Error Reporting's email/mobile notifications. Picking the "Error Reporting alerting policy" distractor is wrong.

Cloud Logging API for Application Instrumentation

For workloads outside auto-instrumented runtimes (on-prem servers, batch jobs, mobile telemetry), use the Cloud Logging API directly.

entries.write

POST https://logging.googleapis.com/v2/entries:write accepts up to 1,000 entries or 10 MB per request, whichever is smaller. Each entry must include a logName (projects/PROJECT_ID/logs/LOG_ID), a resource (with type and labels), and exactly one of textPayload, jsonPayload, or protoPayload.
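The request shape described above can be sketched without any client library. The code only builds and validates the JSON body; it does not send it. The helper names, the global resource type, and the sample log ID are illustrative assumptions, and the 1,000-entry cap simply mirrors the figure quoted in the paragraph above.

```python
import json
from urllib.parse import quote

PAYLOAD_KEYS = ("textPayload", "jsonPayload", "protoPayload")
MAX_ENTRIES = 1000  # per-request figure quoted in the text above


def make_entry(project_id, log_id, resource, **payload):
    """Build one LogEntry dict; exactly one payload keyword must be passed."""
    if len(payload) != 1 or next(iter(payload)) not in PAYLOAD_KEYS:
        raise ValueError(f"pass exactly one of {PAYLOAD_KEYS}")
    entry = {
        # The LOG_ID part of logName must be URL-encoded.
        "logName": f"projects/{project_id}/logs/{quote(log_id, safe='')}",
        "resource": resource,
    }
    entry.update(payload)
    return entry


def write_request_body(entries):
    """Serialize the body for POST .../v2/entries:write."""
    if len(entries) > MAX_ENTRIES:
        raise ValueError("too many entries for one request; split the batch")
    return json.dumps({"entries": entries})


resource = {"type": "global", "labels": {"project_id": "my-project"}}
entry = make_entry("my-project", "batch/nightly", resource,
                   jsonPayload={"message": "job finished", "rows": 1200})
print(write_request_body([entry]))
```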

Client library batching

The official client libraries (google-cloud-logging for Python, @google-cloud/logging for Node.js, cloud.google.com/go/logging in Go) wrap entries.write with:

In-process batching with configurable delay_threshold and batch_size
Background flushing on a worker goroutine/thread
Resource auto-detection on Cloud Run, GKE, GCE, App Engine
Built-in retry with exponential backoff on UNAVAILABLE and DEADLINE_EXCEEDED

Authentication

Service accounts need roles/logging.logWriter. On Cloud Run, App Engine, Cloud Functions, GKE Workload Identity, and Compute Engine, the platform default service account already has this role unless you removed it. For external machines, use a service-account key or Workload Identity Federation (preferred) — never embed long-lived keys.

Retention Configuration and Cost Control

Cloud Logging cost has two components: ingestion (first 50 GiB/month free per project, then USD per GiB) and storage beyond the included retention window (_Default includes 30 days free).

Configuring retention per bucket

gcloud logging buckets update _Default \
  --location=global \
  --retention-days=7

Reducing _Default retention from 30 to 7 days has no effect on ingestion charges — it only affects storage charges that kick in after the first 30 days, so dropping below 30 saves nothing. To cut cost, exclude logs before they are ingested.
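The arithmetic behind that advice can be sketched as follows. The 50 GiB free allotment comes from the text above; the price per GiB, the 400 GiB monthly volume, and the 60% exclusion rate are placeholder assumptions, so check current pricing before relying on the numbers.

```python
def monthly_ingestion_cost(ingested_gib, price_per_gib, free_gib=50.0):
    """Ingestion charge for one project-month after the free allotment."""
    return max(0.0, ingested_gib - free_gib) * price_per_gib


PRICE = 0.50  # hypothetical USD per GiB, for illustration only

before = monthly_ingestion_cost(400, PRICE)               # nothing excluded
after = monthly_ingestion_cost(400 * (1 - 0.60), PRICE)   # 60% excluded
print(before, after)
```

Retention does not appear anywhere in the function, which is the point: only the volume that gets ingested moves this part of the bill.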

Locked retention for compliance

Add --locked to a bucket update to make retention immutable: nobody, including project owners, can shorten retention or delete the bucket until every entry has aged past its retention. This helps meet WORM retention requirements under HIPAA, PCI-DSS, and SEC 17a-4.

Locked buckets cannot be deleted while they contain entries. Plan capacity accordingly — at 100 buckets per project, you can quickly box yourself in. Use a dedicated long-retention archive bucket per regulated workload rather than locking every bucket "just in case".

CMEK on Log Buckets

By default, Cloud Logging encrypts data at rest with Google-managed keys. For regulated workloads, use Customer-Managed Encryption Keys (CMEK) stored in Cloud KMS.

Two scopes for CMEK

Project default key — applied at the project or folder level via gcloud logging cmek-settings update --kms-key-name=.... Any new log bucket created in the project inherits this key.
Per-bucket key — set at bucket creation with --cmek-kms-key-name. Cannot be added to an existing bucket; you must create a new bucket and route logs there.

IAM requirements

Grant the Cloud Logging service account roles/cloudkms.cryptoKeyEncrypterDecrypter on the KMS key. Look up the account's exact address rather than guessing it: gcloud logging settings describe reports it in the kmsServiceAccountId field. Without this grant, log ingestion to the CMEK-protected bucket fails silently and entries are dropped.

Regional alignment

The KMS keyring must be in the same region as the log bucket, or in a multi-region that contains it. A us-central1 bucket cannot use a europe-west1 key.

Exclusion Filters

Exclusion filters live on the _Default sink (and any custom sink) and drop matching entries before ingestion is billed. This is the single biggest lever for Cloud Logging cost control.

Common exclusions

Health-check probes: httpRequest.requestUrl=~"/healthz$" AND httpRequest.status=200
Verbose Kubernetes audit at the RequestReceived stage: protoPayload.@type="type.googleapis.com/google.cloud.audit.AuditLog" AND protoPayload.metadata.stage="RequestReceived"
Successful Cloud SQL slow-query logs below a latency threshold
Repetitive load-balancer entries for static-asset URLs
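Before attaching an exclusion, it can help to dry-run its logic against sample entries. The sketch below approximates the health-check filter from the list above in plain Python; it assumes Python's re module behaves like the query language's regex engine for this simple pattern, and the function name and sample entries are hypothetical.

```python
import re

_HEALTHZ = re.compile(r"/healthz$")


def is_excluded(entry):
    """Approximate: httpRequest.requestUrl=~"/healthz$" AND httpRequest.status=200."""
    request = entry.get("httpRequest", {})
    url_matches = _HEALTHZ.search(request.get("requestUrl", "")) is not None
    return url_matches and request.get("status") == 200


samples = [
    {"httpRequest": {"requestUrl": "https://api.example.com/healthz", "status": 200}},
    {"httpRequest": {"requestUrl": "https://api.example.com/healthz", "status": 503}},
    {"httpRequest": {"requestUrl": "https://api.example.com/orders", "status": 200}},
]
kept = [s for s in samples if not is_excluded(s)]
print(len(kept))  # 2: the failing probe and the real request survive
```

Keeping the status condition in the filter means failed health checks are still logged, which is usually what you want during an incident.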

Setting exclusions

gcloud logging sinks update _Default \
  --add-exclusion=name=skip-healthz,filter='httpRequest.requestUrl=~"/healthz$"'

You can attach up to 50 exclusion filters per sink. Each exclusion has a disabled flag for temporary debugging: set it to true to pause the exclusion, reproduce an incident with full logs, then set it back to false so the exclusion applies again.

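Exclusions interact differently with the two kinds of log-based metrics. System-defined metrics are calculated only from entries that are actually ingested, so excluded health-check 200s disappear from them. User-defined log-based metrics at the project level are calculated from both included and excluded entries, so a metric that counts health-check 200s keeps counting after you add the exclusion. Plan exclusions and metrics deliberately: exclude the truly worthless, and define a user-defined metric (or keep at least a 1% sample) if you need a denominator for SLOs.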

Log Analytics and BigQuery Integration

Log Analytics is a Cloud Logging feature that lets you query log buckets with standard SQL without first exporting them to BigQuery. Enable it per bucket at creation time with --enable-analytics; existing buckets can be upgraded but the operation is irreversible.

Querying log buckets with SQL

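Once analytics is enabled, the bucket's _AllLogs view can be queried from the Log Analytics page using the path PROJECT_ID.LOCATION.BUCKET_ID._AllLogs; creating a linked BigQuery dataset additionally exposes the same data to BigQuery itself. You can run queries like: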

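-- Most recent ERROR entries from the last hour in one log bucket.
SELECT
  timestamp,
  -- resource.labels and json_payload are JSON columns, so extract scalars.
  JSON_VALUE(resource.labels.service_name) AS service,
  JSON_VALUE(json_payload, '$.user_id') AS user_id,
  severity
FROM `PROJECT_ID.LOCATION.BUCKET_ID._AllLogs`
WHERE severity = 'ERROR'
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY timestamp DESC
LIMIT 100;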

This sidesteps the need for a BigQuery sink for ad-hoc investigations, but does not replace a sink when you need to join logs with non-log tables. For long-running joins against fact tables, a BigQuery sink with partitioned tables remains cheaper and faster.

When to choose each path

Ad-hoc one-off SQL on the last 30 days of logs → Log Analytics
Joining logs with order, customer, or product tables → BigQuery sink
Reports retained beyond bucket retention → BigQuery sink with partition expiration aligned to compliance window
Cross-project consolidated dashboards → aggregated sink at org level into one BigQuery dataset

Audit Logging Categories

The PCD exam expects fluency in the four Google Cloud audit log categories, all delivered via Cloud Logging.

Admin Activity

Always-on, free, written to _Required for 400 days. Captures every IAM change, resource create/update/delete, and configuration mutation. Cannot be disabled. Use these logs to answer "who changed the firewall rule at 03:14?".

Data Access

Off by default for most services (always-on for BigQuery DATA_READ). Three sub-types: ADMIN_READ, DATA_READ, DATA_WRITE. Enable selectively per service via the project's IAM & Admin > Audit Logs page, because DATA_READ on chatty services like Cloud Storage will quickly dominate your ingestion bill.

System Event

Always-on, free, captures GCP-initiated actions like live migration or auto-renewal of TLS certificates. Useful for post-incident reconstruction.

Policy Denied

Records when an IAM policy or VPC Service Controls rule denies a request. Crucial for diagnosing "why does this service account get 403 when I think I gave it permission?" — the denied log entry tells you which condition or perimeter blocked the call.

Enabling DATA_READ on production Cloud Storage buckets is the single most common surprise on Cloud Logging bills. A bucket serving public assets can generate millions of read entries per day. Enable it only on sensitive buckets, route to a dedicated long-retention custom bucket, and exclude unauthenticated reads if not required for compliance.

Plain English Explanation

Analogy 1: The Black Box Flight Recorder

Cloud Logging is like the black box on a passenger jet. Every sensor reading, every cockpit announcement, every status flag is written to a tamper-resistant recorder. When the investigators arrive, they replay the recording millisecond by millisecond. Log buckets are the different black boxes you might install — _Required is the bolted-down regulatory recorder you legally cannot remove, _Default is your everyday voice recorder, and custom buckets are extra recorders you install in specific cabins (services) with their own retention rules.

Analogy 2: The Hospital Triage Nurse

Error Reporting is the triage nurse standing at the emergency-room door. Cloud Logging is the paperwork archive that holds every patient's full file forever, including coughs and headaches. The triage nurse ignores routine paperwork and only flags the genuinely serious cases — broken bones, chest pains. She also recognizes repeat visitors: if the same patient with the same symptom comes back, she does not start a new file, she pulls the existing one and adds another visit. That fingerprint-matching is exactly how Error Reporting groups stack traces, and the "fast track" lane the nurse sends serious cases to is the email/mobile notification that pages the on-call developer.

Analogy 3: The Postal Sorting Office

Log sinks behave like a postal sorting office. Every letter (log entry) arrives at one central conveyor (_Default). At the sorting station, rules decide where to forward each letter: BigQuery is the research department that wants every letter cataloged for SQL queries, Cloud Storage is the long-term archive vault that gets hourly batches, Pub/Sub is the high-speed pneumatic tube that fires letters to Splunk in real time. The exclusion filter is the office manager throwing the junk mail straight into the recycling bin so nobody pays to process it, which is exactly why you exclude health-check logs before they are ingested.

Frequently Asked Questions

Q1: Can Cloud Logging handle logs from on-premises servers?

Yes. Install the Ops Agent (the unified successor to the legacy Logging Agent and Monitoring Agent) on physical or virtual servers, configure it with a service-account key or Workload Identity Federation, and it streams entries to Cloud Logging just like a GCE VM. For custom apps without an agent, call the entries.write API, either directly or through the google-cloud-logging client libraries.

Q2: Is Error Reporting real-time?

Yes, with a small ingestion delay. Once a stack trace lands in _Default, Error Reporting typically detects and groups it within seconds to a minute. New-error notifications are sent via email immediately after the first matching entry is grouped, and the Google Cloud Console mobile app can push notifications to your phone.

Q3: How do I keep audit logs for seven years for SOX compliance?

Increasing retention on _Required is not possible; it is fixed at 400 days. Instead, create an aggregated sink at the organization level filtering logName:"cloudaudit.googleapis.com", then choose a destination. (1) For retention, send to a Cloud Storage bucket with a 7-year retention policy and bucket lock, or to a Cloud Logging bucket created with --retention-days=2555 and then locked with gcloud logging buckets update --locked. (2) For ad-hoc audit queries, route to BigQuery; BigQuery datasets have their own retention controls.

Q4: My Cloud Logging bill is dominated by Kubernetes container logs. What now?

Apply exclusion filters on _Default to drop noisy namespaces (resource.labels.namespace_name=~"^(kube-system|istio-system)$"), drop verbose info logs from sidecars, and downgrade the audit-log stage from RequestReceived to only ResponseComplete. If you still need the data, send a sampled (e.g. 10%) stream to a low-cost Cloud Storage sink and exclude the rest entirely.

Q5: How do I get logs and traces side by side?

Inject logging.googleapis.com/trace = projects/PROJECT_ID/traces/TRACE_ID into every log entry. On Cloud Run, App Engine, and Cloud Functions, the platform extracts TRACE_ID from the X-Cloud-Trace-Context request header — propagate it into your structured log payload, then open any log entry in Logs Explorer and click "Show in context" or jump straight to the Trace UI. Add logging.googleapis.com/spanId for span-level correlation.

Q6: Are log-based metrics retroactive?

No. A log-based metric begins recording at creation time and only counts entries written after that moment. If you need historical analysis of an event that has not been measured yet, route the logs to BigQuery via a sink and run a COUNT(*) query against the historical partitions.
