Cloud Logging and Error Reporting — GCP PCA Study Notes

Introduction to Cloud Logging

In any distributed system, logs are the fundamental record of "what happened." Cloud Logging is a fully managed service that allows you to store, search, analyze, monitor, and alert on log data and events from Google Cloud and other sources. Coupled with Error Reporting, which automatically groups crashes and exceptions, it provides a powerful toolkit for real-time troubleshooting and long-term auditing.

Plain-Language Explanation: Logging and Error Reporting

Analogy 1 — The Black Box Flight Recorder

Cloud Logging is the Black Box on an airplane. It records every button pressed, every engine temperature change, and every communication. If the plane lands safely, you might never look at it. If something goes wrong, it is the first place investigators look to understand the sequence of events.

Analogy 2 — The Medical Record

Logging is like a patient's medical record, documenting every visit, every symptom, and every medication. Error Reporting is like an Emergency Room (ER) triage system. It identifies that 50 people have arrived with the same symptoms (Grouped Exception) and alerts the staff that there might be a food poisoning outbreak (App Bug) rather than 50 unrelated accidents.

Analogy 3 — The Library Archive

Logging is the Library Archive. Thousands of newspapers (Logs) are delivered every day. The Log Router is the librarian who decides which newspapers are kept in the main room (Log Bucket), which are sent to the basement for long-term storage (Coldline GCS), and which are sent to the research lab for analysis (BigQuery).

Sink: A configuration that instructs Cloud Logging to export certain logs to destinations like BigQuery, Cloud Storage, or Pub/Sub.

Log Routing and Storage

Cloud Logging uses a "Router" to handle incoming logs:

Ingestion: Logs arrive from GCE, GKE, or via API.
Exclusion Filters: Discard "noisy" logs (like successful health checks) to save costs.
Sinks: Export logs to:
- Cloud Storage: Long-term, low-cost compliance (Retention).
- BigQuery: Complex SQL analysis and security forensics.
- Pub/Sub: Real-time streaming to external SIEMs or automated remediation scripts.
- Log Buckets: The default storage for search within the Logs Explorer.

Log Router Sinks Deep Dive — BigQuery, GCS, Pub/Sub, Splunk

The Log Router is the fork in the road for every log entry: every line gets evaluated against active sinks, and zero, one, or many sinks may match. Understanding each sink target is critical for PCA scenarios involving compliance, analytics, and SIEM integration.
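The "zero, one, or many sinks" behavior can be pictured with a small simulation. This is only a sketch under stated assumptions: the Sink class, the predicate functions, and the destination strings are hypothetical stand-ins, not the Cloud Logging API or its filter language.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

LogEntry = Dict[str, str]
Predicate = Callable[[LogEntry], bool]


@dataclass
class Sink:
    """Hypothetical stand-in for a Log Router sink."""
    name: str
    destination: str
    inclusion: Predicate                      # entry must match to be routed
    exclusions: List[Predicate] = field(default_factory=list)

    def accepts(self, entry: LogEntry) -> bool:
        # An entry is routed only if it matches the inclusion filter
        # and matches none of the sink's exclusion filters.
        if not self.inclusion(entry):
            return False
        return not any(excluded(entry) for excluded in self.exclusions)


def route(entry: LogEntry, sinks: List[Sink]) -> List[str]:
    """Return every destination whose sink accepts the entry."""
    return [s.destination for s in sinks if s.accepts(entry)]


def is_health_check(entry: LogEntry) -> bool:
    return entry.get("path") == "/healthz"


sinks = [
    Sink("_Default", "log-bucket:_Default", lambda e: True, [is_health_check]),
    Sink("security", "bigquery:sec_logs", lambda e: e.get("severity") == "ERROR"),
]

print(route({"severity": "ERROR", "path": "/pay"}, sinks))     # two destinations
print(route({"severity": "INFO", "path": "/healthz"}, sinks))  # no destination
```

Each sink decides independently, which is why an exclusion on one sink does not stop another sink from exporting the same entry.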

BigQuery Sink

Choose when you need SQL analytics, joins, or BI dashboards on logs.

Partitioned tables (recommended): one partitioned table per log stream, partitioned by timestamp. Use --use-partitioned-tables when creating the sink.
Schema evolution: Cloud Logging auto-adds new columns when jsonPayload keys change; existing rows keep their old schema.
Latency: typically 60-120 seconds end-to-end.

Cloud Storage Sink

Choose for archival, WORM compliance, or 7-year audit retention at the lowest cost.

Logs land as hourly JSON files under gs://<bucket>/<log-id>/YYYY/MM/DD/, with the hour range encoded in each file name.
Pair with Bucket Lock + Retention Policy to satisfy SEC/HIPAA immutability requirements.
Use lifecycle rules to transition to Nearline → Coldline → Archive.

Pub/Sub Sink

Choose for real-time fan-out to Dataflow, Splunk, Chronicle, or third-party SIEM.

Each log entry becomes one Pub/Sub message; downstream consumers must handle at-least-once delivery.
Common pattern: Logs Router → Pub/Sub → Dataflow → BigQuery + Chronicle.
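Handling at-least-once delivery usually means making the consumer idempotent. A minimal sketch, assuming each decoded message is a dict that carries the log entry's insertId field; the function name and the in-memory set are illustrative choices, and a real consumer would bound that set (for example with a time window or an external store).

```python
from typing import Dict, Iterable, List


def dedupe_by_insert_id(messages: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first copy of each log entry, keyed on its insertId."""
    seen = set()
    unique = []
    for msg in messages:
        key = msg["insertId"]
        if key in seen:
            continue  # redelivered duplicate, skip it
        seen.add(key)
        unique.append(msg)
    return unique


batch = [
    {"insertId": "a1", "message": "login failed"},
    {"insertId": "b2", "message": "login ok"},
    {"insertId": "a1", "message": "login failed"},  # redelivery of the first message
]
print(len(dedupe_by_insert_id(batch)))
```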

Splunk via Pub/Sub + Dataflow Template

GCP provides a Google-maintained Pub/Sub-to-Splunk Dataflow template (HEC) that streams logs into Splunk HTTP Event Collector. It handles batching, retries, and dead-letter queues; this is the documented path for centralizing GCP logs into an existing on-prem Splunk SIEM without writing custom code.

Sink IAM gotcha: Every Log Router sink runs as a writer service account that Cloud Logging auto-creates. After creating the sink, you must grant that service account roles/bigquery.dataEditor, roles/storage.objectCreator, or roles/pubsub.publisher on the destination. Without the grant, sink creation still succeeds but no entries arrive; the failure surfaces only in sink error metrics and notifications, which is easy to miss. Use gcloud logging sinks describe SINK_NAME to read the writer identity.
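As a memory aid, the destination-to-role pairing can be written as a lookup. This is a sketch, not a Google API: the required_role helper is hypothetical, and it assumes sink destinations are written in the usual form that starts with the service hostname (for example storage.googleapis.com/BUCKET).

```python
# Role each destination type needs, as listed in the notes above.
REQUIRED_ROLE_BY_SERVICE = {
    "bigquery.googleapis.com": "roles/bigquery.dataEditor",
    "storage.googleapis.com": "roles/storage.objectCreator",
    "pubsub.googleapis.com": "roles/pubsub.publisher",
}


def required_role(destination: str) -> str:
    """Return the role the sink's writer identity needs on the destination."""
    service = destination.split("/", 1)[0]
    try:
        return REQUIRED_ROLE_BY_SERVICE[service]
    except KeyError:
        raise ValueError(f"no role mapping for destination {destination!r}") from None


print(required_role("storage.googleapis.com/my-archive-bucket"))
```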

Cloud Error Reporting

Error Reporting automatically analyzes logs for common exception patterns (e.g., Stack Traces in Java, Python, or Go).

Auto-Grouping: Similar errors are grouped together so you see "50 occurrences" instead of 50 individual log lines.
Alerting: Get notified immediately when a new type of error appears in production.
Link to Code: If your source code is in Cloud Source Repositories, Error Reporting can link the stack trace directly to the line of code.

Logging Best Practices

Structured Logging: Always log in JSON format. This allows you to filter by specific fields (e.g., jsonPayload.user_id) rather than doing slow string searches.
Retention Policies: Set different retention periods for different logs. Audit logs might need 7 years, while debug logs only need 30 days.
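Data Masking: Redact PII (Personally Identifiable Information) before it is stored. The Log Router filters and routes entries but does not rewrite them, so redact in the application or in a Pub/Sub → Dataflow pipeline.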

Architect's Insight: For the PCA exam, if you need to perform "real-time analysis of log data for security threats," the correct architecture is Logs Router -> Pub/Sub -> Dataflow -> BigQuery.

Log-Based Metrics — Counter vs Distribution

Log-based metrics turn log content into time-series metrics that Cloud Monitoring can chart and alert on. There are two flavors with very different costs and use cases.

Counter Metrics

Count how many log entries match a filter.
Example filter: severity=ERROR AND resource.type="cloud_run_revision" → produces an integer count of matching entries per time interval.
Use for SLO error budgets, "spike in 5xx" alerts, or "count of failed login attempts."

Distribution Metrics

Extract a numeric value from each log (e.g., jsonPayload.latency_ms) and produce a histogram with percentiles (p50/p95/p99).
Use for latency tracking from logs when you cannot emit OpenTelemetry metrics directly.

System vs User-Defined

System log-based metrics (e.g., logging.googleapis.com/byte_count) are free and pre-built.
User-defined log-based metrics are billed per metric ingestion; keep cardinality low — every label combination (especially high-cardinality labels like user_id) multiplies cost.

Labels and Extraction

You can promote jsonPayload fields into metric labels using regex or field extractors. Be careful: a label like request_id will create one time series per request and explode your bill.
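The cardinality warning is simple multiplication. A rough sketch, assuming every label combination actually occurs (an upper bound); the label names and counts are made-up examples.

```python
from math import prod
from typing import Dict


def time_series_count(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


low = time_series_count({"status": 5, "region": 4})
high = time_series_count({"status": 5, "region": 4, "request_id": 1_000_000})
print(low, high)  # 20 versus 20000000
```

Adding one unbounded label multiplies the series count by the number of distinct values it takes, which is why IDs belong in the log entry and not in metric labels.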

Distribution metric trap: PCA scenarios often ask "how do I track p99 latency when my legacy app only writes text logs?" The answer is a distribution log-based metric with a regex extractor, NOT a counter metric. Counter only tells you how many; distribution gives you the value distribution and percentiles.
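The counter-versus-distribution difference can be seen on a handful of text log lines. This is an illustration only: the latency=NNNms format and the regex are assumptions about a hypothetical legacy app, and real distribution metrics store histogram buckets, so their percentiles are estimates rather than the exact nearest-rank values computed here.

```python
import math
import re
from typing import List

LATENCY_RE = re.compile(r"latency=(\d+)ms")  # hypothetical legacy log format


def extract_latencies(lines: List[str]) -> List[int]:
    """Pull the numeric latency out of each line that has one."""
    values = []
    for line in lines:
        match = LATENCY_RE.search(line)
        if match:
            values.append(int(match.group(1)))
    return values


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


logs = [
    "GET /cart latency=120ms",
    "GET /cart latency=80ms",
    "startup complete",
    "GET /pay latency=950ms",
    "GET /pay latency=100ms",
]
latencies = extract_latencies(logs)
print("counter view:", len(latencies))            # how many entries matched
print("distribution view p50:", percentile(latencies, 50))
print("distribution view p99:", percentile(latencies, 99))
```

The counter only reports that four lines matched; the extracted values are what make the slow 950 ms request visible.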

Retention Buckets — Custom vs Default

Cloud Logging stores log entries in log buckets at the project or folder level. Two buckets are auto-created in every project:

Bucket	Default Retention	Purpose
`_Default`	30 days	All logs not stored in `_Required`, including Data Access and Policy Denied audit logs
`_Required`	400 days (immutable)	Admin Activity + System Event audit logs

Custom Buckets

Create your own log bucket when you need:

Custom retention from 1 day to 3650 days (10 years).
CMEK (Customer-Managed Encryption Keys) — only configurable on user-created buckets.
Regional pinning — store EU logs in europe-west1 for GDPR.

Locked Retention

Apply a retention lock to a bucket and the configured retention becomes immutable — no human, including project owners, can shorten it. This is the GCP equivalent of WORM compliance for log data and is required for SOX/PCI scenarios.

Sink-to-Bucket Pattern

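Use Log Router sinks to route specific log streams into custom buckets, e.g., security logs → 7-year locked bucket, application debug logs → 7-day cheap bucket. This is often the simplest way to satisfy mixed retention requirements without exporting to GCS, though retention beyond the default period carries its own storage charge, so compare it against GCS archival for very long periods.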

Exclusion Filters and Cost Optimization

Cloud Logging ingestion is billed at $0.50 per GiB after the 50 GiB free tier per project per month. For chatty workloads (GKE, load balancers, Cloud SQL slow logs), this dominates the observability bill. Exclusion filters are the primary lever.
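The pricing statement above reduces to a one-line formula. A worked sketch using only the two figures quoted in these notes; the monthly volumes are made-up inputs.

```python
FREE_TIER_GIB = 50.0      # free allotment per project per month, from the notes
PRICE_PER_GIB_USD = 0.50  # ingestion price per GiB, from the notes


def monthly_ingestion_cost(ingested_gib: float) -> float:
    """Cost in USD for one project's monthly log ingestion."""
    billable = max(0.0, ingested_gib - FREE_TIER_GIB)
    return billable * PRICE_PER_GIB_USD


print(monthly_ingestion_cost(2000.0))  # (2000 - 50) * 0.50 = 975.0
print(monthly_ingestion_cost(800.0))   # after excluding 60% of volume: 375.0
print(monthly_ingestion_cost(30.0))    # under the free tier: 0.0
```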

Where to Apply

Exclusion filters are configured per sink (including the built-in _Default sink). An entry excluded by a sink is never written to that sink's destination; if no other sink matches it, it is never stored, is not billed for ingestion, and is unrecoverable.

Typical Patterns

Drop successful health checks: resource.type="http_load_balancer" AND httpRequest.status<400 AND httpRequest.requestUrl=~"/healthz"
Drop verbose sidecar logs on GKE: resource.type="k8s_container" AND severity<WARNING AND resource.labels.container_name="istio-proxy"
Keep roughly 10% of INFO logs: severity=INFO AND sample(insertId, 0.90). Because this is an exclusion filter, the matched 90% is what gets dropped; sample() decides from a hash of insertId, so the choice is consistent for a given entry.

Sampling vs Exclusion

Exclusion drops 100% of matched logs.
Sampling (via sample() inside the filter) keeps a percentage — useful when you still need spot-checks.
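Hash-based sampling is what makes the decision repeatable for a given entry. The sketch below imitates the idea only: Cloud Logging's actual hash is not documented here, and SHA-256 plus the keep_sampled helper are assumptions chosen for illustration.

```python
import hashlib


def keep_sampled(insert_id: str, fraction: float) -> bool:
    """Deterministically select about `fraction` of entries by hashing a stable field."""
    digest = hashlib.sha256(insert_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform integer in [0, 2**64)
    return bucket < int(fraction * 2**64)


ids = [f"entry-{i}" for i in range(10_000)]
kept = sum(keep_sampled(i, 0.10) for i in ids)
print(f"kept {kept} of {len(ids)} entries")
```

Because the same insertId always hashes to the same bucket, every sink that applies the same fraction makes the same keep-or-drop choice for an entry.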

Cost vs Visibility Trade-off

Aggressive exclusion can hide incidents. Pair exclusions with user-defined log-based metrics: these are computed on entries as the Logging API receives them, before sink exclusions take effect, so you still see counters even when you drop the raw entries.

Quick win: Open the Logs Storage page in the Console to see top log producers by volume. In most GCP projects, 60-80% of ingestion comes from 2-3 noisy log streams (LB access logs, GKE system, VPC Flow). Excluding just those three usually cuts the bill in half.

Error Reporting Grouping and Deduplication

Error Reporting's value is grouping — turning 50,000 stack traces into 12 actionable error groups. Understanding the grouping algorithm helps you write loggable errors that group correctly.

Grouping Signature

For each error, Error Reporting computes a signature from:

Exception type (e.g., NullPointerException, ValueError).
Top frames of the stack trace (typically 3-5 frames closest to the throw).
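Service + version are stored with each occurrence as filterable context rather than hashed into the signature, so the same stack trace from v1.2.3 and v1.2.4 of checkout-svc typically lands in one group that you can filter by version.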
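Grouping on exception type plus the frames nearest the raise can be imitated in a few lines. This is not Error Reporting's actual algorithm, only a sketch of the idea; error_signature, the choice of three frames, and the SHA-256 hash are all assumptions.

```python
import hashlib
import traceback


def error_signature(exc: BaseException, top_frames: int = 3) -> str:
    """Hash the exception type with the frames closest to the raise site."""
    frames = traceback.extract_tb(exc.__traceback__)
    nearest = frames[-top_frames:]  # extract_tb lists the oldest call first
    # Line numbers are left out on purpose so small edits do not change the hash.
    parts = [type(exc).__name__] + [f"{f.filename}:{f.name}" for f in nearest]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:12]


def parse_amount(raw):
    return int(raw)


def capture(fn, arg):
    """Run fn(arg) and return the exception it raises, if any."""
    try:
        fn(arg)
    except Exception as exc:
        return exc
    return None


sig_a = error_signature(capture(parse_amount, "abc"))   # ValueError
sig_b = error_signature(capture(parse_amount, "xyz"))   # ValueError, different message
sig_c = error_signature(capture(parse_amount, None))    # TypeError
print(sig_a == sig_b, sig_a == sig_c)
```

The two ValueErrors share a signature even though their messages differ, while the TypeError from the same function forms its own group.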

Auto-Detected Languages

Error Reporting parses stack traces natively for Java, Python, Go, Node.js, Ruby, PHP, and C#/.NET. For other languages, log a structured entry with @type: "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent" and Error Reporting will ingest it.

Manual Reporting via API

Languages without native support can call the Error Reporting API directly, or simply write a log entry with the ReportedErrorEvent proto in jsonPayload. The latter is the recommended pattern because it works with any agent.
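A structured error entry of this kind might be assembled as follows. Treat it as a sketch: the @type value is the one quoted above, message and serviceContext are fields of ReportedErrorEvent, and the helper name plus the service and version strings are placeholders.

```python
import json
import traceback

REPORTED_ERROR_TYPE = (
    "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent"
)


def reported_error_entry(exc: BaseException, service: str, version: str) -> dict:
    """Build a JSON-serializable log payload describing one exception."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return {
        "severity": "ERROR",
        "@type": REPORTED_ERROR_TYPE,
        "message": stack,  # the full stack trace goes in the message field
        "serviceContext": {"service": service, "version": version},
    }


try:
    1 / 0
except ZeroDivisionError as exc:
    entry = reported_error_entry(exc, "checkout-svc", "1.2.3")

print(json.dumps(entry))  # one JSON object per line on stdout
```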

Resolution Workflow

Each group has one of four states: Open, Acknowledged, Resolved, or Muted. A resolved group reopens and alerts again if the error reappears (regression detection), so resolution is meaningful; don't bulk-resolve.

Why Groups Sometimes Split Unexpectedly

If a redeploy changes the line numbers in your stack trace (e.g., refactor), the same logical bug may create a new group. Use error_message filters and the resolution timeline view to merge them mentally; there is no manual merge UI.

Cloud Logging API and Structured Logging with jsonPayload

The Cloud Logging API accepts log entries via entries.write. The single most important choice for production-grade logging is whether each entry carries a jsonPayload or a textPayload.

textPayload vs jsonPayload

textPayload: a single string. Searches are full-text scans — slow and expensive.
jsonPayload: a structured object. Each key becomes a queryable field, so jsonPayload.user_id="u123" is an exact field comparison instead of a substring search.

Reserved Top-Level Fields

When an agent parses structured JSON from stdout, certain keys are lifted out of jsonPayload into top-level LogEntry fields:

severity → LogEntry.severity (controls coloring + alerts).
message → shown as the summary line in Logs Explorer.
logging.googleapis.com/trace → LogEntry.trace (links the log to a Cloud Trace span).
logging.googleapis.com/spanId → LogEntry.spanId (enables per-span log filtering).
httpRequest → LogEntry.httpRequest (populates the HTTP request panel).

Trace Correlation Example

{
  "severity": "ERROR",
  "message": "Payment timeout",
  "logging.googleapis.com/trace": "projects/my-proj/traces/abc123",
  "logging.googleapis.com/spanId": "span-456",
  "order_id": "o-789",
  "vendor": "stripe",
  "timeout_ms": 30000
}

This entry will appear inline in the Cloud Trace waterfall for trace abc123 — invaluable for debugging distributed systems.
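In practice the trace and span values usually come from an incoming request header. A hedged sketch, assuming the X-Cloud-Trace-Context header format TRACE_ID/SPAN_ID;o=OPTIONS with a 32-character hex trace ID and a decimal span ID; trace_fields is a hypothetical helper, and the span is re-encoded as 16 hex characters on the assumption that this is the form the spanId field expects.

```python
import re
from typing import Dict

# TRACE_ID (32 hex chars), optional /SPAN_ID (decimal), optional ;o=FLAG
HEADER_RE = re.compile(r"^([0-9a-f]{32})(?:/(\d+))?(?:;o=(\d))?$")


def trace_fields(header: str, project_id: str) -> Dict[str, str]:
    """Turn an X-Cloud-Trace-Context header into structured-logging keys."""
    match = HEADER_RE.match(header.strip().lower())
    if not match:
        return {}  # malformed header: log without trace correlation
    trace_id, span_decimal, _options = match.groups()
    fields = {
        "logging.googleapis.com/trace": f"projects/{project_id}/traces/{trace_id}",
    }
    if span_decimal is not None:
        fields["logging.googleapis.com/spanId"] = format(int(span_decimal), "016x")
    return fields


print(trace_fields("105445aa7843bc8bf206b12000100000/1;o=1", "my-proj"))
```

Merging the returned keys into each structured log line is what ties the entry to its trace.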

Ops Agent Auto-Parsing

On GKE, the built-in logging agent parses single-line JSON written to container stdout/stderr into jsonPayload, so no SDK is needed for the common case: just write JSON to stdout. On GCE, the Ops Agent can do the same for application log files once a JSON parsing processor is configured for them.
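Writing JSON to stdout needs nothing beyond the standard library. A minimal sketch: JsonFormatter, build_logger, and the "fields" key passed through extra are naming choices made for this example, not part of any Google library.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,   # DEBUG/INFO/WARNING/ERROR/CRITICAL
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Structured fields supplied via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines from the root logger
    return logger


log = build_logger("checkout")
log.error("Payment timeout", extra={"fields": {"order_id": "o-789", "timeout_ms": 30000}})
```

Each call emits a single line, which matters because agents parse stdout one line at a time.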

Client Libraries

For complex use cases (custom monitored resources, large payloads), use the official Logging client libraries (google-cloud-logging in Python, @google-cloud/logging in Node.js). They batch writes and handle retries automatically.

Audit Logs — Admin Activity vs Data Access vs System Event

Cloud Audit Logs are the non-negotiable forensic record of who did what in your GCP environment. There are four distinct audit log types, each with different defaults and cost implications.

1. Admin Activity

What: API calls that modify metadata or configuration (create VM, change IAM, delete bucket).
Default: Always on, cannot be disabled, free.
Retention: 400 days in _Required bucket.

2. Data Access

What: API calls that read or write user data (storage.objects.get, bigquery.jobs.create reading a table).
Default: Disabled for most services. BigQuery is the exception: its Data Access logs are on by default.
Cost: Billed at normal Logging ingestion rates; can balloon costs by 10-100x.
Granularity: Three sub-types — ADMIN_READ, DATA_READ, DATA_WRITE. Enable selectively per service.

3. System Event

What: GCP-initiated actions (live migration, host maintenance, automatic key rotation).
Default: Always on, free.
Use case: Correlate workload disruptions with platform events.

4. Policy Denied

What: Records when a request is denied by VPC Service Controls, Org Policy, or IAM Conditions.
Default: Generated by default and cannot be disabled, but billable; add an exclusion filter if you do not want them stored.
Use case: Debug "why was this request blocked?" — invaluable during VPC-SC perimeter rollout.

Enabling Data Access Logs

Enable per service through the auditConfigs section of the IAM policy (gcloud projects set-iam-policy, or the folder and organization equivalents) or in the Console. PCA scenarios often hint at this: "audit who reads sensitive BigQuery tables" → make sure DATA_READ logs for BigQuery are captured and route them to a locked bucket.
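The audit configuration is a small JSON fragment inside the IAM policy. A sketch of building one, assuming the auditConfigs / auditLogConfigs / logType field names of the IAM policy format; the helper function is hypothetical and Cloud Storage is used purely as an example service.

```python
import json
from typing import Dict, List

ALLOWED_LOG_TYPES = {"ADMIN_READ", "DATA_READ", "DATA_WRITE"}


def data_access_audit_config(service: str, log_types: List[str]) -> Dict:
    """Build one auditConfigs item enabling the given Data Access sub-types."""
    unknown = set(log_types) - ALLOWED_LOG_TYPES
    if unknown:
        raise ValueError(f"unknown log types: {sorted(unknown)}")
    return {
        "service": service,
        "auditLogConfigs": [{"logType": t} for t in log_types],
    }


policy_fragment = {
    "auditConfigs": [
        data_access_audit_config("storage.googleapis.com", ["DATA_READ"]),
    ]
}
print(json.dumps(policy_fragment, indent=2))
```

The fragment would be merged into the policy fetched with get-iam-policy before writing it back, so that existing bindings are preserved.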

PCA scenario heuristic: If the question asks about immutable, tamper-proof, 7-year retention of who-did-what for compliance (SOX/HIPAA/PCI), the answer is Admin Activity + Data Access audit logs → sink to a locked log bucket (or GCS bucket with Bucket Lock) with the configured retention. Never rely on the built-in buckets alone: _Required is fixed at 400 days and _Default starts at 30 days, both below most compliance requirements.

Log Analytics on BigQuery

Log Analytics is a feature that runs BigQuery SQL directly against Cloud Logging buckets — without exporting. Released GA in 2023, it removes the need for a separate BigQuery sink for many analytics use cases.

How It Works

Upgrade a log bucket to be Log Analytics-enabled (one-way operation).
The bucket's logs become queryable via a synthetic BigQuery dataset ({project}.global._Default._AllLogs).
You can also link the bucket as a BigQuery dataset and join logs against your own BigQuery tables.

Capabilities

Standard SQL with full BigQuery functions (window functions, regex, geo).
Schema: each log entry is a row with timestamp, severity, resource, json_payload, proto_payload.
Queries run against the bucket's underlying storage — no storage duplication, no extra ingestion fees.

Cost Model

No charge for the analytics feature itself; you pay only standard BigQuery query bytes-scanned if you link to a BigQuery dataset.
For queries inside the Log Analytics UI, scans against the bucket are included.

When to Choose Log Analytics vs BigQuery Sink

Need	Choice
One-off SQL exploration of logs	Log Analytics
Join logs with billing / inventory / external tables	BigQuery Sink (or Log Analytics with linked dataset)
Long-term archive + dashboards	BigQuery Sink with partitioning
Quick SQL during incidents without setup	Log Analytics

Example Query

SELECT
  TIMESTAMP_TRUNC(timestamp, MINUTE) AS minute,
  COUNT(*) AS errors
FROM `my-proj.global._Default._AllLogs`
WHERE severity = 'ERROR'
  AND resource.type = 'cloud_run_revision'
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY minute
ORDER BY minute DESC;

Sensitive Data Redaction in Logs

Application logs are the #1 source of PII leakage in GCP environments — developers log request bodies, headers, and DB rows without thinking. Three defenses, layered.

1. Pre-Ingestion Scrubbing (Best)

Scrub at the source in your application code. Use a logging interceptor that strips fields like password, ssn, credit_card before calling the Logging API. This is the only way to guarantee secrets never enter GCP at all.
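A source-side scrubber can be as small as a recursive key filter. A minimal sketch, assuming log payloads are plain dicts and lists; the key list mirrors the fields named above, and matching on key names alone will miss secrets embedded in free text.

```python
from typing import Any

SENSITIVE_KEYS = {"password", "ssn", "credit_card"}
REDACTED = "[REDACTED]"


def scrub(value: Any) -> Any:
    """Return a copy of the payload with sensitive keys masked at any depth."""
    if isinstance(value, dict):
        return {
            key: REDACTED if key.lower() in SENSITIVE_KEYS else scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


payload = {
    "user": "u123",
    "password": "hunter2",
    "cards": [{"credit_card": "4111111111111111", "brand": "visa"}],
}
print(scrub(payload))
```

The function returns a new structure, so the original request data stays intact for the application while only the masked copy reaches the logger.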

2. Log Router with Cloud DLP

For data already flowing through Logs Router, route via Pub/Sub → a Dataflow pipeline that calls Cloud DLP (now part of Sensitive Data Protection) → BigQuery/GCS. The pipeline applies infoType detectors (EMAIL_ADDRESS, US_SOCIAL_SECURITY_NUMBER, CREDIT_CARD_NUMBER) and either redacts (replace with [REDACTED]), masks (***-**-1234), or tokenizes (format-preserving encryption).

3. Field-Level Access Control on Log Buckets

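Even when raw PII makes it into a bucket, configure field-level access control (restricted fields) on the bucket and use log views to scope what most users can read. Only a small security group should hold the role that unlocks restricted fields (roles/logging.fieldAccessor), alongside roles/logging.privateLogViewer for Data Access logs.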

Data Access Logs Themselves Are Sensitive

Ironically, Data Access logs include the resource name being accessed — sometimes that name itself is sensitive (e.g., storage.objects.get on gs://my-bucket/patient-12345/scan.dcm). Plan retention and access control for the audit log bucket with the same rigor as the data it audits.

Anti-Patterns to Avoid

Logging entire HTTP request bodies (use header-only logging in prod).
Including JWTs or session tokens in Authorization log lines.
Dumping SQL query parameters with WHERE ssn='123-45-6789' — parameterize before logging.

Memorize the 4 audit log types: Admin Activity (always on, free, modifies config), Data Access (off by default except BigQuery, billed, reads/writes data), System Event (always on, free, GCP-initiated), Policy Denied (on by default, billed, blocked-by-policy; can be excluded but not disabled). Admin Activity and System Event are free; Data Access and Policy Denied are billed, and only Data Access requires explicit enablement.

FAQ — Cloud Logging

Q1. What is the difference between "Admin Activity" and "Data Access" logs?

Admin Activity logs record API calls that modify resources (e.g., creating a VM) and are always on and free. Data Access logs record API calls that read or write data (e.g., reading a GCS object) and can be very voluminous and expensive; they are disabled by default for every service except BigQuery.

Q2. How do I reduce Logging costs?

Use Exclusion Filters to drop logs you don't need, and only enable Data Access logs for critical resources. Also, consider routing logs to Cloud Storage if you only need them for compliance and not for active searching.

Q3. Can Cloud Logging handle logs from on-premises servers?

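Yes. Send entries through the Cloud Logging API, either directly, via the client libraries, or from a log forwarder such as Fluent Bit or BindPlane. The Ops Agent is built for Compute Engine VMs, so check its support matrix before planning to run it on-premises. Either way, you can centralize all your logs in GCP for a "single pane of glass" view.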

Q4. What is a "Log-based Alert"?

It is an alert that triggers when a specific pattern appears in your logs (e.g., the word "CRITICAL" or a specific error code). Use these for events that don't have a numerical metric.

Q5. How does Error Reporting know a new error is "new"?

It maintains a signature of the stack trace. If a stack trace arrives that doesn't match any existing signature, it is flagged as a new error.
