Introduction to Cloud Logging
In any distributed system, logs are the fundamental record of "what happened." Cloud Logging is a fully managed service that allows you to store, search, analyze, monitor, and alert on log data and events from Google Cloud and other sources. Coupled with Error Reporting, which automatically groups crashes and exceptions, it provides a powerful toolkit for real-time troubleshooting and long-term auditing.
Plain-Language Explanation: Logging and Error Reporting
Analogy 1 — The Black Box Flight Recorder
Cloud Logging is the Black Box on an airplane. It records every button pressed, every engine temperature change, and every communication. If the plane lands safely, you might never look at it. If something goes wrong, it is the first place investigators look to understand the sequence of events.
Analogy 2 — The Medical Record
Logging is like a patient's medical record, documenting every visit, every symptom, and every medication. Error Reporting is like an Emergency Room (ER) triage system. It identifies that 50 people have arrived with the same symptoms (Grouped Exception) and alerts the staff that there might be a food poisoning outbreak (App Bug) rather than 50 unrelated accidents.
Analogy 3 — The Library Archive
Logging is the Library Archive. Thousands of newspapers (Logs) are delivered every day. The Log Router is the librarian who decides which newspapers are kept in the main room (Log Bucket), which are sent to the basement for long-term storage (Coldline GCS), and which are sent to the research lab for analysis (BigQuery).
A sink is a configuration that instructs Cloud Logging to export certain logs to destinations like BigQuery, Cloud Storage, or Pub/Sub.
Log Routing and Storage
Cloud Logging uses a "Router" to handle incoming logs:
- Ingestion: Logs arrive from GCE, GKE, or via API.
- Exclusion Filters: Discard "noisy" logs (like successful health checks) to save costs.
- Sinks: Export logs to:
  - Cloud Storage: Long-term, low-cost compliance (Retention).
  - BigQuery: Complex SQL analysis and security forensics.
  - Pub/Sub: Real-time streaming to external SIEMs or automated remediation scripts.
- Log Buckets: The default storage for search within the Logs Explorer.
Log Router Sinks Deep Dive — BigQuery, GCS, Pub/Sub, Splunk
The Log Router is the fork in the road for every log entry: every line gets evaluated against active sinks, and zero, one, or many sinks may match. Understanding each sink target is critical for PCA scenarios involving compliance, analytics, and SIEM integration.
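The "zero, one, or many sinks may match" behaviour can be sketched with a small in-memory model. This is only an illustration of the routing logic: the `Sink` class, the lambda filters, and the destination strings are hypothetical stand-ins, not the Cloud Logging API or its filter language.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

LogEntry = Dict[str, str]
Predicate = Callable[[LogEntry], bool]


@dataclass
class Sink:
    """Hypothetical model of a sink: one inclusion filter, many exclusions."""
    name: str
    destination: str
    inclusion: Predicate
    exclusions: List[Predicate] = field(default_factory=list)

    def accepts(self, entry: LogEntry) -> bool:
        # Routed only if the inclusion filter matches and no exclusion matches.
        if not self.inclusion(entry):
            return False
        return not any(excluded(entry) for excluded in self.exclusions)


def route(entry: LogEntry, sinks: List[Sink]) -> List[str]:
    """Return the destination of every sink that accepts the entry."""
    return [sink.destination for sink in sinks if sink.accepts(entry)]


sinks = [
    Sink("_Default", "log-bucket:_Default",
         inclusion=lambda e: True,
         exclusions=[lambda e: e.get("path") == "/healthz"]),
    Sink("errors-to-bq", "bigquery:logs_dataset",
         inclusion=lambda e: e.get("severity") == "ERROR"),
]

error_entry = {"severity": "ERROR", "path": "/checkout"}
health_entry = {"severity": "INFO", "path": "/healthz"}

print(route(error_entry, sinks))   # matched by both sinks
print(route(health_entry, sinks))  # matched by none, so it is dropped
```

Each sink is evaluated independently, which is why one entry can land in several destinations while another lands nowhere.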
BigQuery Sink
Choose when you need SQL analytics, joins, or BI dashboards on logs.
- Partitioned tables (recommended): one partitioned table per log stream, partitioned by timestamp. Use --use-partitioned-tables when creating the sink.
- Schema evolution: Cloud Logging auto-adds new columns when jsonPayload keys change; existing rows keep their old schema.
- Latency: typically 60-120 seconds end-to-end.
Cloud Storage Sink
Choose for archival, WORM compliance, or 7-year audit retention at the lowest cost.
- Logs land as hourly JSON files under gs://<bucket>/<log-id>/YYYY/MM/DD/, with the hour range encoded in each file name.
- Pair with Bucket Lock + Retention Policy to satisfy SEC/HIPAA immutability requirements.
- Use lifecycle rules to transition to Nearline → Coldline → Archive.
Pub/Sub Sink
Choose for real-time fan-out to Dataflow, Splunk, Chronicle, or third-party SIEM.
- Each log entry becomes one Pub/Sub message; downstream consumers must handle at-least-once delivery.
- Common pattern: Log Router → Pub/Sub → Dataflow → BigQuery + Chronicle.
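Because delivery is at-least-once, a consumer may see the same log entry twice. One common way to cope is to deduplicate on the entry's insertId. The sketch below assumes each message has already been decoded into a dict carrying an `insertId` key; the `Deduplicator` class and `process` function are hypothetical names, and a real pipeline would more likely rely on Dataflow or a shared store than on process-local memory.

```python
from collections import OrderedDict
from typing import Dict, List


class Deduplicator:
    """Remember recently seen insertId values, bounded in size."""

    def __init__(self, max_ids: int = 10_000) -> None:
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def is_new(self, insert_id: str) -> bool:
        if insert_id in self._seen:
            return False
        self._seen[insert_id] = None
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # evict the oldest id
        return True


def process(messages: List[Dict], dedup: Deduplicator) -> List[Dict]:
    """Keep only the first copy of each log entry."""
    return [m for m in messages if dedup.is_new(m["insertId"])]


batch = [{"insertId": "a"}, {"insertId": "b"}, {"insertId": "a"}]
unique = process(batch, Deduplicator())
print(len(unique))
```

The bounded window trades memory for a small risk of letting through a very late duplicate, so downstream writes should still be idempotent where possible.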
Splunk via Pub/Sub + Dataflow Template
GCP provides a Google-maintained Pub/Sub-to-Splunk Dataflow template (HEC) that streams logs into Splunk HTTP Event Collector. It handles batching, retries, and dead-letter queues; this is the documented path for centralizing GCP logs into an existing on-prem Splunk SIEM without writing custom code.
Sink IAM gotcha: Every Log Router sink writes as a writer service account (the sink's writer identity) that Cloud Logging creates. After creating the sink, grant that identity roles/bigquery.dataEditor, roles/storage.objectCreator, or roles/pubsub.publisher on the destination, whichever matches the sink type. Without the grant, entries never reach the destination, and because nothing fails at the log source the gap is easy to miss. Use gcloud logging sinks describe SINK_NAME to read the writer identity.
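As a quick aid, the role the writer identity needs can be derived from the sink's destination string. The helper below is a hypothetical convenience function, not part of any Google library; it assumes the destination uses the usual service-host prefix for each destination type.

```python
# Destination prefixes as they appear in a sink's destination field,
# mapped to the role named in the text above.
ROLE_BY_PREFIX = {
    "bigquery.googleapis.com/": "roles/bigquery.dataEditor",
    "storage.googleapis.com/": "roles/storage.objectCreator",
    "pubsub.googleapis.com/": "roles/pubsub.publisher",
}


def required_role(destination: str) -> str:
    """Return the role the sink's writer identity needs on the destination."""
    for prefix, role in ROLE_BY_PREFIX.items():
        if destination.startswith(prefix):
            return role
    raise ValueError(f"unrecognized sink destination: {destination}")


print(required_role("pubsub.googleapis.com/projects/my-proj/topics/logs"))
```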
Cloud Error Reporting
Error Reporting automatically analyzes logs for common exception patterns (e.g., Stack Traces in Java, Python, or Go).
- Auto-Grouping: Similar errors are grouped together so you see "50 occurrences" instead of 50 individual log lines.
- Alerting: Get notified immediately when a new type of error appears in production.
- Link to Code: If your source code is in Cloud Source Repositories, Error Reporting can link the stack trace directly to the line of code.
Logging Best Practices
- Structured Logging: Always log in JSON format. This allows you to filter by specific fields (e.g., jsonPayload.user_id) rather than doing slow string searches.
- Retention Policies: Set different retention periods for different logs. Audit logs might need 7 years, while debug logs only need 30 days.
- Data Masking: Redact PII (Personally Identifiable Information) before it is stored, for example by routing logs through the Log Router to a Pub/Sub and Dataflow pipeline that calls Cloud DLP.
Architect's Insight: For the PCA exam, if you need to perform "real-time analysis of log data for security threats," the correct architecture is Log Router -> Pub/Sub -> Dataflow -> BigQuery.
Log-Based Metrics — Counter vs Distribution
Log-based metrics turn log content into time-series metrics that Cloud Monitoring can chart and alert on. There are two flavors with very different costs and use cases.
Counter Metrics
- Count how many log entries match a filter.
- Example filter: severity=ERROR AND resource.type="cloud_run_revision" → produces a 1-D integer metric.
- Use for SLO error budgets, "spike in 5xx" alerts, or "count of failed login attempts."
Distribution Metrics
- Extract a numeric value from each log (e.g., jsonPayload.latency_ms) and produce a histogram with percentiles (p50/p95/p99).
- Use for latency tracking from logs when you cannot emit OpenTelemetry metrics directly.
System vs User-Defined
- System log-based metrics (e.g., logging.googleapis.com/byte_count) are free and pre-built.
- User-defined log-based metrics are billed per metric ingestion; keep cardinality low, because every label combination (especially high-cardinality labels like user_id) multiplies cost.
Labels and Extraction
You can promote jsonPayload fields into metric labels using regex or field extractors. Be careful: a label like request_id will create one time series per request and explode your bill.
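The cardinality effect is easy to see by counting distinct label combinations, since each combination becomes its own time series. The function and the sample entries below are hypothetical and only illustrate the arithmetic.

```python
from typing import Dict, List


def count_time_series(entries: List[Dict[str, str]], label_keys: List[str]) -> int:
    """Each distinct combination of label values is one time series."""
    return len({tuple(e.get(k) for k in label_keys) for e in entries})


# 1,000 failed requests, each with its own request_id.
entries = [{"status": "500", "request_id": f"r{i}"} for i in range(1000)]

low = count_time_series(entries, ["status"])
high = count_time_series(entries, ["status", "request_id"])
print(low, high)
```

With only `status` as a label the metric stays at one series; adding `request_id` turns the same traffic into a thousand.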
Distribution metric trap: PCA scenarios often ask "how do I track p99 latency when my legacy app only writes text logs?" The answer is a distribution log-based metric with a regex extractor, NOT a counter metric. Counter only tells you how many; distribution gives you the value distribution and percentiles.
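To make the counter versus distribution contrast concrete, here is a local sketch of what a regex extractor does to text logs. The log format (`took 123ms`) and all names are assumptions for illustration, and the nearest-rank percentile stands in for the bucketed histogram that a real distribution metric stores, so real percentiles are approximations.

```python
import math
import re
from typing import List

# Hypothetical legacy text format: "... took 123ms"
LATENCY_RE = re.compile(r"took (\d+)ms")


def extract_latencies(lines: List[str]) -> List[int]:
    """Pull the numeric latency out of every line that has one."""
    values = []
    for line in lines:
        match = LATENCY_RE.search(line)
        if match:
            values.append(int(match.group(1)))
    return values


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


lines = [
    "GET /cart took 120ms",
    "GET /cart took 80ms",
    "GET /pay took 950ms",
    "healthcheck ok",
    "GET /cart took 100ms",
]

latencies = extract_latencies(lines)
counter = len(latencies)          # all a counter metric can tell you
p50 = percentile(latencies, 50)   # what a distribution adds
p99 = percentile(latencies, 99)
print(counter, p50, p99)
```

The counter reports four matching lines; only the extracted values reveal that one request was roughly ten times slower than the median.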
Retention Buckets — Custom vs Default
Cloud Logging stores log entries in log buckets at the project or folder level. Two buckets are auto-created in every project:
| Bucket | Default Retention | Purpose |
|---|---|---|
| _Default | 30 days (configurable) | Everything not routed to _Required, including Data Access and Policy Denied audit logs |
| _Required | 400 days (fixed) | Admin Activity + System Event audit logs |
Custom Buckets
Create your own log bucket when you need:
- Custom retention from 1 day to 3650 days (10 years).
- CMEK (Customer-Managed Encryption Keys) — only configurable on user-created buckets.
- Regional pinning: store EU logs in europe-west1 for GDPR.
Locked Retention
Apply a retention lock to a bucket and the configured retention becomes immutable — no human, including project owners, can shorten it. This is the GCP equivalent of WORM compliance for log data and is required for SOX/PCI scenarios.
Sink-to-Bucket Pattern
Use Log Router sinks to route specific log streams into custom buckets, e.g., security logs → 7-year locked bucket, application debug logs → 7-day cheap bucket. This is the cheapest way to satisfy mixed retention requirements without exporting to GCS.
Exclusion Filters and Cost Optimization
Cloud Logging ingestion is billed at $0.50 per GiB after the 50 GiB free tier per project per month. For chatty workloads (GKE, load balancers, Cloud SQL slow logs), this dominates the observability bill. Exclusion filters are the primary lever.
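The pricing rule above reduces to a one-line calculation. The sketch uses the figures quoted in this section ($0.50 per GiB, 50 GiB free per project per month); treat them as inputs to check against the current price list rather than constants.

```python
FREE_TIER_GIB = 50.0   # free allotment per project per month, as quoted above
PRICE_PER_GIB = 0.50   # USD per GiB beyond the free tier, as quoted above


def monthly_ingestion_cost(ingested_gib: float) -> float:
    """Estimated monthly ingestion charge in USD for one project."""
    billable = max(0.0, ingested_gib - FREE_TIER_GIB)
    return billable * PRICE_PER_GIB


print(monthly_ingestion_cost(30))     # inside the free tier
print(monthly_ingestion_cost(1050))   # 1,000 billable GiB
```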
Where to Apply
Exclusion filters are configured per sink (including the _Default sink). An entry excluded by a sink is not written to that sink's destination. If no other sink routes it, the entry is never stored, is not billed, and cannot be recovered.
Typical Patterns
- Drop successful health checks: resource.type="http_load_balancer" AND httpRequest.status<400 AND httpRequest.requestUrl=~"/healthz"
- Drop verbose sidecar logs on GKE: resource.type="k8s_container" AND severity<WARNING AND resource.labels.container_name="istio-proxy"
- Keep roughly 10% of INFO logs: severity=INFO AND sample(insertId, 0.90). An exclusion filter drops what it matches, so matching 90% for exclusion leaves about 10% stored.
Sampling vs Exclusion
- Exclusion drops 100% of matched logs.
- Sampling (via sample() inside the filter) keeps a percentage, useful when you still need spot-checks.
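The idea behind deterministic sampling can be sketched locally: hash a stable identifier and select the entry when the hash falls below the requested fraction. This is an illustration of the concept only; the hash that Cloud Logging's own sample() uses is not specified here, and the `sample` function below is a hypothetical stand-in.

```python
import hashlib


def sample(insert_id: str, fraction: float) -> bool:
    """Deterministically select roughly `fraction` of ids."""
    digest = hashlib.sha256(insert_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(fraction * 2**64)


insert_ids = [f"entry-{i}" for i in range(10_000)]
matched = sum(sample(i, 0.10) for i in insert_ids)
print(f"matched {matched} of {len(insert_ids)}")
```

Because the decision depends only on the id, the same entry is always selected or always skipped. Keep in mind that inside an exclusion filter the selected fraction is the part that gets dropped.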
Cost vs Visibility Trade-off
Aggressive exclusion can hide incidents. Always pair exclusion with log-based metrics computed before exclusion (metrics are computed on all ingested logs regardless of bucket routing), so you still see counters even when you drop the raw entries.
Quick win: Open the Logs Storage page in the Console to see top log producers by volume. In most GCP projects, 60-80% of ingestion comes from 2-3 noisy log streams (LB access logs, GKE system, VPC Flow). Excluding just those three usually cuts the bill in half.
Error Reporting Grouping and Deduplication
Error Reporting's value is grouping — turning 50,000 stack traces into 12 actionable error groups. Understanding the grouping algorithm helps you write loggable errors that group correctly.
Grouping Signature
For each error, Error Reporting computes a signature from:
- Exception type (e.g., NullPointerException, ValueError).
- Top frames of the stack trace (typically 3-5 frames closest to the throw).
- Service + version (errors in v1.2.3 of checkout-svc group separately from v1.2.4).
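A toy version of such a signature, built from the three inputs listed above, might look like the following. It is a hypothetical illustration of the idea; the actual algorithm Error Reporting uses is not public in this form, and the function name, frame strings, and hash choice are all assumptions.

```python
import hashlib
from typing import Sequence


def grouping_signature(
    exception_type: str,
    frames: Sequence[str],
    service: str,
    version: str,
    top_n: int = 3,
) -> str:
    """Hash the exception type, the top frames, and the service context."""
    parts = [service, version, exception_type, *frames[:top_n]]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()[:16]


frames = ["parse_amount", "handle_payment", "serve_request", "main"]
sig_a = grouping_signature("ValueError", frames, "checkout-svc", "v1.2.3")
# Same top three frames, different outer frame: same group.
sig_b = grouping_signature("ValueError", frames[:3] + ["worker"], "checkout-svc", "v1.2.3")
# Same error in a newer version: separate group.
sig_c = grouping_signature("ValueError", frames, "checkout-svc", "v1.2.4")
print(sig_a == sig_b, sig_a == sig_c)
```

Only the frames closest to the throw feed the hash, which is why errors raised from the same place group together even when they were reached through different callers.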
Auto-Detected Languages
Error Reporting parses stack traces natively for Java, Python, Go, Node.js, Ruby, PHP, .NET, and C#. For other languages, log a structured entry with @type: "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent" and Error Reporting will ingest it.
Manual Reporting via API
Languages without native support can call the Error Reporting API directly, or simply write a log entry with the ReportedErrorEvent proto in jsonPayload. The latter is the recommended pattern because it works with any agent.
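A minimal sketch of the log-entry route: build a JSON object carrying the @type marker, the stack trace as the message, and a serviceContext, then print it as one line to stdout. The `build_error_event` helper and the service name and version are hypothetical; the sketch assumes an agent that parses JSON lines into jsonPayload.

```python
import json
import traceback

REPORTED_ERROR_TYPE = (
    "type.googleapis.com/google.devtools.clouderrorreporting"
    ".v1beta1.ReportedErrorEvent"
)


def build_error_event(exc: BaseException, service: str, version: str) -> dict:
    """Shape an exception as a structured error entry."""
    stack = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return {
        "severity": "ERROR",
        "@type": REPORTED_ERROR_TYPE,
        "message": stack,  # the full stack trace text
        "serviceContext": {"service": service, "version": version},
    }


try:
    int("not-a-number")
except ValueError as err:
    event = build_error_event(err, "checkout-svc", "v1.2.3")
    print(json.dumps(event))
```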
Resolution Workflow
Each group has a state: Open → Acknowledged → Resolved → Muted. Resolving an error tells Error Reporting to re-alert if it reappears (regression detection), so resolution is meaningful — don't bulk-resolve.
Why Groups Sometimes Split Unexpectedly
If a redeploy changes the line numbers in your stack trace (e.g., refactor), the same logical bug may create a new group. Use error_message filters and the resolution timeline view to merge them mentally; there is no manual merge UI.
Cloud Logging API and Structured Logging with jsonPayload
The Cloud Logging API accepts log entries via entries.write. The most important payload decision for production-grade logging is jsonPayload versus textPayload.
textPayload vs jsonPayload
- textPayload: a single string. Searches are full-text scans, which are slow and expensive.
- jsonPayload: a structured object. Each key becomes a queryable field: jsonPayload.user_id="u123" is a direct field comparison rather than a substring scan.
Reserved Top-Level Fields
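In structured JSON written through a logging agent, certain keys are promoted out of jsonPayload into top-level LogEntry fields:
- severity → LogEntry.severity (controls coloring + alerts).
- message → shown as the summary line in Logs Explorer.
- logging.googleapis.com/trace → LogEntry.trace (links the log to a Cloud Trace span).
- logging.googleapis.com/spanId → LogEntry.spanId (enables per-span log filtering).
- httpRequest → LogEntry.httpRequest (populates the HTTP request panel).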
Trace Correlation Example
{
  "severity": "ERROR",
  "message": "Payment timeout",
  "logging.googleapis.com/trace": "projects/my-proj/traces/abc123",
  "logging.googleapis.com/spanId": "span-456",
  "order_id": "o-789",
  "vendor": "stripe",
  "timeout_ms": 30000
}
This entry will appear inline in the Cloud Trace waterfall for trace abc123 — invaluable for debugging distributed systems.
Ops Agent Auto-Parsing
On GKE, the built-in logging agent parses JSON written to container stdout/stderr into jsonPayload. On GCE, the Ops Agent can do the same for the log files it tails once a parse_json processor is configured. For the common case, just write one JSON object per line; no SDK is needed.
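Writing one JSON object per line needs nothing beyond the standard library. The formatter below is a minimal sketch: `JsonFormatter` and the `extra={"fields": ...}` convention are choices made for this example, not part of any Google library, and it assumes an agent downstream that parses each line into jsonPayload.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,  # e.g. INFO, WARNING, ERROR
            "message": record.getMessage(),
        }
        # Extra structured fields, passed via extra={"fields": {...}}
        # (a convention of this sketch only).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output via the root logger

logger.error(
    "Payment timeout",
    extra={"fields": {"order_id": "o-789", "timeout_ms": 30000}},
)
```

Keeping the business fields at the top level of the object means they show up as jsonPayload.order_id and jsonPayload.timeout_ms and can be filtered on directly.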
Client Libraries
For complex use cases (custom monitored resources, large payloads), use the official Logging client libraries (google-cloud-logging in Python, @google-cloud/logging in Node.js). They batch writes and handle retries automatically.
Audit Logs — Admin Activity vs Data Access vs System Event
Cloud Audit Logs are the non-negotiable forensic record of who did what in your GCP environment. There are four distinct audit log types, each with different defaults and cost implications.
1. Admin Activity
- What: API calls that modify metadata or configuration (create VM, change IAM, delete bucket).
- Default: Always on, cannot be disabled, free.
- Retention: 400 days in the _Required bucket.
2. Data Access
- What: API calls that read or write user data (storage.objects.get, bigquery.jobs.create reading a table).
- Default: Disabled for most services (BigQuery is the exception: its Data Access logs are on by default).
- Cost: Billed at normal Logging ingestion rates; can balloon costs by 10-100x.
- Granularity: Three sub-types: ADMIN_READ, DATA_READ, DATA_WRITE. Enable selectively per service.
3. System Event
- What: GCP-initiated actions (live migration, host maintenance, automatic key rotation).
- Default: Always on, free.
- Use case: Correlate workload disruptions with platform events.
4. Policy Denied
- What: Records when a request is denied by VPC Service Controls, Org Policy, or IAM Conditions.
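- Default: Always on and cannot be disabled. Unlike Admin Activity and System Event logs, these are billed like other logs in _Default; use an exclusion filter if you do not need them stored.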
- Use case: Debug "why was this request blocked?" — invaluable during VPC-SC perimeter rollout.
Enabling Data Access Logs
Enable per service through the auditConfigs section of the IAM policy (for example with gcloud projects set-iam-policy or gcloud organizations set-iam-policy) or in the Console. PCA scenarios often hint at this: "audit who reads sensitive BigQuery tables" → enable DATA_READ on BigQuery, route to a locked bucket.
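The audit configuration is a small structure inside the IAM policy: a service name plus the log types to enable. The helper below sketches how such a fragment could be assembled; `audit_config` is a hypothetical function, and in practice the fragment would be merged into the policy fetched with get-iam-policy rather than written on its own, so that existing bindings are preserved.

```python
import json
from typing import Dict, List

VALID_LOG_TYPES = {"ADMIN_READ", "DATA_READ", "DATA_WRITE"}


def audit_config(service: str, log_types: List[str]) -> Dict:
    """Build one auditConfigs entry for a single service."""
    unknown = set(log_types) - VALID_LOG_TYPES
    if unknown:
        raise ValueError(f"unknown log types: {sorted(unknown)}")
    return {
        "service": service,
        "auditLogConfigs": [{"logType": t} for t in log_types],
    }


policy_fragment = {
    "auditConfigs": [audit_config("bigquery.googleapis.com", ["DATA_READ"])]
}
print(json.dumps(policy_fragment, indent=2))
```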
PCA scenario heuristic: If the question asks about immutable, tamper-proof, 7-year retention of who-did-what for compliance (SOX/HIPAA/PCI), the answer is Admin Activity + Data Access audit logs → sink to a locked log bucket (or GCS bucket with Bucket Lock) with the configured retention. Never rely on the built-in buckets alone: _Required is fixed at 400 days and _Default keeps 30 days unless reconfigured, both below most compliance requirements.
Log Analytics on BigQuery
Log Analytics is a feature that runs BigQuery SQL directly against Cloud Logging buckets — without exporting. Released GA in 2023, it removes the need for a separate BigQuery sink for many analytics use cases.
How It Works
- Upgrade a log bucket to be Log Analytics-enabled (one-way operation).
- The bucket's logs become queryable via a synthetic BigQuery dataset ({project}.global._Default._AllLogs).
- You can also link the bucket as a BigQuery dataset and join logs against your own BigQuery tables.
Capabilities
- Standard SQL with full BigQuery functions (window functions, regex, geo).
- Schema: each log entry is a row with timestamp, severity, resource, json_payload, proto_payload.
- Queries run against the bucket's underlying storage: no storage duplication, no extra ingestion fees.
Cost Model
- No charge for the analytics feature itself; you pay only standard BigQuery query bytes-scanned if you link to a BigQuery dataset.
- For queries inside the Log Analytics UI, scans against the bucket are included.
When to Choose Log Analytics vs BigQuery Sink
| Need | Choice |
|---|---|
| One-off SQL exploration of logs | Log Analytics |
| Join logs with billing / inventory / external tables | BigQuery Sink (or Log Analytics with linked dataset) |
| Long-term archive + dashboards | BigQuery Sink with partitioning |
| Quick SQL during incidents without setup | Log Analytics |
Example Query
SELECT
TIMESTAMP_TRUNC(timestamp, MINUTE) AS minute,
COUNT(*) AS errors
FROM `my-proj.global._Default._AllLogs`
WHERE severity = 'ERROR'
AND resource.type = 'cloud_run_revision'
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY minute
ORDER BY minute DESC;
Sensitive Data Redaction in Logs
Application logs are the #1 source of PII leakage in GCP environments — developers log request bodies, headers, and DB rows without thinking. Three defenses, layered.
1. Pre-Ingestion Scrubbing (Best)
Scrub at the source in your application code. Use a logging interceptor that strips fields like password, ssn, credit_card before calling the Logging API. This is the only way to guarantee secrets never enter GCP at all.
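A source-side scrubber can be as simple as a recursive walk over the payload before it is logged. The sketch below assumes payloads are plain dicts and lists; `scrub` and the `SENSITIVE_KEYS` set are hypothetical, and key-name matching will not catch secrets embedded in free-text values.

```python
from typing import Any

# Field names treated as sensitive in this sketch.
SENSITIVE_KEYS = {"password", "ssn", "credit_card"}


def scrub(value: Any) -> Any:
    """Return a copy with sensitive fields replaced, recursing into containers."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


request_body = {
    "user": "u123",
    "password": "hunter2",
    "payment": {"credit_card": "4111111111111111", "currency": "USD"},
}
safe_body = scrub(request_body)
print(safe_body)
```

Calling the scrubber in one shared logging wrapper, rather than at each call site, keeps the rule from being forgotten in new code paths.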
2. Log Router with Cloud DLP
For data already flowing through Logs Router, pipe via Pub/Sub → Dataflow with the DLP Inspect template → BigQuery/GCS. Dataflow calls Cloud DLP with infoType detectors (EMAIL_ADDRESS, US_SOCIAL_SECURITY_NUMBER, CREDIT_CARD_NUMBER) and either redacts (replace with [REDACTED]), masks (***-**-1234), or tokenizes (format-preserving encryption).
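The difference between redacting and masking can be shown with a plain regular expression. This is a local stand-in for illustration only: it does not call Cloud DLP, it recognizes just one SSN layout, and the function names are hypothetical.

```python
import re

# Matches the 123-45-6789 layout and captures the last four digits.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def redact_ssn(text: str) -> str:
    """Replace the whole value."""
    return SSN_RE.sub("[REDACTED]", text)


def mask_ssn(text: str) -> str:
    """Hide all but the last four digits."""
    return SSN_RE.sub(r"***-**-\1", text)


line = "lookup failed for ssn=123-45-6789"
print(redact_ssn(line))
print(mask_ssn(line))
```

Masking keeps enough of the value for support staff to confirm which record was involved, while redaction removes it entirely.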
3. Field-Level Access Control on Log Buckets
Even when raw PII makes it into a bucket, use Logging field-level IAM + bucket views to expose a redacted view to most users and keep the full bucket gated behind a small security group with roles/logging.privateLogViewer.
Data Access Logs Themselves Are Sensitive
Ironically, Data Access logs include the resource name being accessed — sometimes that name itself is sensitive (e.g., storage.objects.get on gs://my-bucket/patient-12345/scan.dcm). Plan retention and access control for the audit log bucket with the same rigor as the data it audits.
Anti-Patterns to Avoid
- Logging entire HTTP request bodies (use header-only logging in prod).
- Including JWTs or session tokens in Authorization log lines.
- Dumping SQL query parameters with WHERE ssn='123-45-6789'; parameterize before logging.
Memorize the 4 audit log types: Admin Activity (always on, free, modifies config), Data Access (off by default except BigQuery, billed, reads/writes data), System Event (always on, free, GCP-initiated), Policy Denied (always on, billed, blocked-by-policy). Only Data Access requires explicit enablement; Data Access and Policy Denied logs count toward ingestion charges.
FAQ — Cloud Logging
Q1. What is the difference between "Admin Activity" and "Data Access" logs?
Admin Activity logs record API calls that modify resources (e.g., creating a VM) and are always free. Data Access logs record API calls that read or write data (e.g., reading a GCS object) and can be very voluminous and expensive; they are disabled by default.
Q2. How do I reduce Logging costs?
Use Exclusion Filters to drop logs you don't need, and only enable Data Access logs for critical resources. Also, consider routing logs to Cloud Storage if you only need them for compliance and not for active searching.
Q3. Can Cloud Logging handle logs from on-premises servers?
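Yes. On-premises hosts can write to the Logging API, either directly or through a log collector such as Fluent Bit or BindPlane; the Ops Agent itself targets Compute Engine VMs. You can centralize all your logs in GCP for a "single pane of glass" view.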
Q4. What is a "Log-based Alert"?
It is an alert that triggers when a specific pattern appears in your logs (e.g., the word "CRITICAL" or a specific error code). Use these for events that don't have a numerical metric.
Q5. How does Error Reporting know a new error is "new"?
It maintains a signature of the stack trace. If a stack trace arrives that doesn't match any existing signature, it is flagged as a new error.