Introduction to Troubleshooting in the Cloud
For a Professional Cloud Architect, troubleshooting is not about "guessing." It is a systematic process of deduction using the data provided by the cloud platform's observability suite. In a distributed, microservices-based environment, "where" the problem is can be just as difficult to find as "what" the problem is.
This guide outlines a structured approach to identifying, isolating, and resolving issues across the GCP stack.
The Systematic Troubleshooting Workflow
- Define the Problem: What is the expected behavior vs. the actual behavior? Who is affected? When did it start?
- Gather Evidence: Use Cloud Logging and Cloud Monitoring to see if there are spikes in errors, latency, or resource exhaustion.
- Isolate the Root Cause: Is it the Network (VPC)? The Identity (IAM)? The Compute (VM/Pod)? Or the Application Code?
- Formulate and Test a Hypothesis: "If I change the IAM role, will the 403 error disappear?"
- Implement the Fix and Verify: Apply the change and monitor the logs to confirm the error rate returns to its pre-incident baseline.
- Post-Mortem: Document why it happened and how to prevent it (e.g., adding an alert or changing an architecture).
Plain-Language Explanation
Analogy 1 — Emergency Room Triage (Severity & Resource Routing)
Walk into an ER and a triage nurse sorts patients by acuity: chest pain goes ahead of a sprained ankle. GCP troubleshooting is identical. Error Reporting acts as the triage nurse, grouping incoming exceptions by root cause and surfacing the highest-volume, newest, or fastest-growing issues first. Cloud Monitoring SLO burn-rate alerts are the "code blue" pages that pull on-call engineers in immediately, while log-based metrics with severity>=ERROR filters are the chart at the foot of the bed. Without this triage layer, every paper cut and every heart attack would land in the same queue, and the on-call SRE would burn out chasing the loudest, not the most damaging, signal.
Analogy 2 — The Car Dashboard (Metrics vs Logs vs Traces)
Your car's dashboard shows speed, RPM, fuel, engine temperature at a glance — that is Cloud Monitoring: numerical, real-time, threshold-based. When a warning light flashes ("low oil pressure"), you pull over and consult the owner's manual + service history — that is Cloud Logging: the textual narrative of what the engine did. If the car has an intermittent stutter, you plug in an OBD-II scanner that records every sensor reading during a test drive — that is Cloud Trace + Cloud Profiler: per-request and per-function granularity. The architect's mistake is using only one instrument: staring at the dashboard while ignoring the manual, or reading 10,000 log lines without checking whether CPU is pinned at 100%.
Analogy 3 — The Detective Case File (Root Cause Investigation)
A detective doesn't solve a case by reading the entire city's CCTV archive. They start with the crime scene (the failing endpoint), interview witnesses (kubectl describe pod, gcloud logging read), check forensics (VPC Flow Logs, audit logs), and build a timeline (Trace spans ordered by start time). The suspect lineup is the recent change set — a deploy 12 minutes before the incident, a firewall rule edit, an IAM grant. The detective's golden rule is "follow the evidence, not the hunch"; the architect's equivalent is "follow the trace ID, not the loudest dashboard." Every escalation to L3 support should arrive with a complete case file: timestamps, request IDs, expected vs actual behaviour, and the hypotheses already ruled out.
Three-instrument rule for every PCA scenario: Monitoring answers "what is broken right now," Logging answers "why did it break," Trace answers "where in the call chain did it break." Map every exam keyword to one of these three before choosing a tool.
Plain-Language Analogies for Troubleshooting
Analogy 1 — The Doctor and the Medical Lab (Observability Tools)
When you feel sick, you go to a Doctor (The Architect). The doctor doesn't just guess what's wrong. They look at your Vitals (Cloud Monitoring) to see if you have a fever. They look at your Medical History (Cloud Logging) to see what you ate yesterday. If they need to see exactly how blood is flowing through your veins, they order an X-ray or MRI (Cloud Trace).
Analogy 2 — The Missing Parcel (Tracing and Logging)
If you order a package and it doesn't arrive, you don't just call the post office and say "It's missing." You use a Tracking Number (Trace ID). You see it left the warehouse (Service A), arrived at the sorting center (Service B), but never left the delivery truck (Service C). Cloud Trace tells you exactly which "truck" in your microservices chain lost the package.
Analogy 3 — The Silent Library (Error Reporting)
Imagine a library where people are constantly whispering, but suddenly someone screams. Error Reporting is like a librarian who ignores the whispers but immediately writes down exactly what the person screamed, which book they were holding, and what page they were on. It groups similar "screams" together so you know which shelf (code) needs fixing first.
The GCP Observability Suite (Operations Suite)
1. Cloud Logging (The Historian)
The central repository for all logs.
- Logs Explorer: Use power queries to filter by resource.type, severity, or a specific textPayload.
- Log-Based Metrics: Turn a recurring text string (e.g., "Out of Memory") into a numerical chart you can alert on.
2. Cloud Monitoring (The Sentry)
Focuses on metrics (CPU, Memory, Latency, Throughput).
- Dashboards: Visual representation of system health.
- Alerting Policies: Notify SREs via PagerDuty, Email, or Slack when a threshold is breached (e.g., CPU > 80% for 5 minutes).
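The "CPU > 80% for 5 minutes" condition can be pictured as a check over the most recent samples. The sketch below only illustrates the idea and is not how Cloud Monitoring evaluates policies; the function name, the one-sample-per-minute assumption, and the readings are hypothetical.

```python
from typing import Sequence


def breaches_for_duration(samples: Sequence[float], threshold: float, duration: int) -> bool:
    """Return True if the last `duration` consecutive samples all exceed `threshold`.

    Assumes one sample per minute, so duration=5 stands for "5 minutes".
    """
    if duration <= 0 or len(samples) < duration:
        return False
    return all(value > threshold for value in samples[-duration:])


# Hypothetical CPU utilisation readings (percent), one per minute.
brief_spike = [40, 95, 42, 41, 39, 40]
sustained = [40, 82, 85, 90, 88, 91]

print(breaches_for_duration(brief_spike, 80, 5))  # False: the spike lasted one minute
print(breaches_for_duration(sustained, 80, 5))    # True: five consecutive readings above 80
```

Requiring the whole window to breach is what keeps a one-minute spike from paging anyone.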
3. Cloud Trace (The Detective)
Follows a single request as it travels through multiple services.
- Best for finding Latency Bottlenecks. If a page takes 10 seconds to load, Trace shows that Service D took 9 of those seconds.
4. Cloud Profiler (The Performance Tuner)
Analyzes CPU and Memory consumption at the Code Level. It tells you exactly which function or line of code is hogging resources.
Common GCP Issues and Where to Look
| Issue Type | Symptom | Tools to Use |
|---|---|---|
| Permission Denied | 403 Errors, "Operation not permitted." | Cloud Logging (Audit Logs), Policy Simulator. |
| Network Latency | Slow response times, timeouts. | Cloud Trace, VPC Flow Logs, Connectivity Tests. |
| Resource Exhaustion | 503 Errors, VM crashes, OOM. | Cloud Monitoring, Cloud Profiler. |
| Intermittent Connectivity | Flaky connections between services. | VPC Flow Logs, Firewall Insights. |
| Data Inconsistency | Stale data, wrong results. | Cloud Logging (Data Access Logs), Database-specific logs. |
Architect's Tool: Use the Network Intelligence Center (Connectivity Tests) to troubleshoot VPC issues. It can tell you if a packet is being dropped by a Firewall rule or a missing Route without you having to send actual traffic.
Cloud Logging Filter Syntax for Rapid Triage
The Logs Explorer's Logging Query Language (LQL) is the architect's first scalpel. Memorising five filter patterns cuts mean-time-to-detect (MTTD) dramatically:
Core triage filters
- Single resource by name: resource.type="gce_instance" AND resource.labels.instance_id="1234567890" (pins the view to a specific VM).
- Severity floor: severity>=ERROR (drops INFO/DEBUG noise); combine with timestamp>="2026-05-12T00:00:00Z" to bound the window.
- HTTP error class: httpRequest.status>=500 AND httpRequest.requestUrl=~"/checkout/.*" (regex match on the path of a failing endpoint).
- Audit log forensics: protoPayload.methodName="storage.objects.delete" AND protoPayload.authenticationInfo.principalEmail:"@external.com" (finds deletions by external identities; the = operator does not expand * as a wildcard, so use the : substring operator or a =~ regex).
- Correlation by trace: trace="projects/my-proj/traces/abc123" (pulls every log line for one request across all services).
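The filter patterns above are plain comparisons joined with AND, so they are easy to assemble in a script. The helper below is a minimal sketch of that; the function name and its parameters are hypothetical, and it only builds the string without calling any Google Cloud API.

```python
from typing import Optional


def build_triage_filter(resource_type: str, min_severity: str = "ERROR",
                        since: Optional[str] = None, trace: Optional[str] = None) -> str:
    """Join individual comparisons with AND, the way the triage filters are combined."""
    clauses = [f'resource.type="{resource_type}"', f"severity>={min_severity}"]
    if since:
        clauses.append(f'timestamp>="{since}"')
    if trace:
        clauses.append(f'trace="{trace}"')
    return " AND ".join(clauses)


print(build_triage_filter("gce_instance", since="2026-05-12T00:00:00Z"))
# resource.type="gce_instance" AND severity>=ERROR AND timestamp>="2026-05-12T00:00:00Z"
```

The resulting string can be pasted into Logs Explorer or passed to gcloud logging read.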
Operational patterns
- Pin a query in Logs Explorer so the on-call team opens directly into the right view; saved queries beat tribal Slack snippets.
- Promote a recurring filter to a log-based metric (counter or distribution) and graph it in Cloud Monitoring. Spike on OOMKilled text? Now it is an alertable signal, not a needle in the haystack.
- Use logName="projects/PROJECT/logs/cloudaudit.googleapis.com%2Factivity" to scope to Admin Activity audit logs only; these are free and always on, whereas Data Access logs cost money and must be enabled per service.
- For ingest cost control, write an exclusion filter on the _Default sink for resource.type="k8s_container" AND severity<WARNING, but never exclude the cloudaudit.googleapis.com log stream; you will lose forensic evidence.
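To see what a counter-type log-based metric does conceptually, the sketch below turns matching log lines into a per-minute count that could be charted or alerted on. It is an illustration only, not the Cloud Logging implementation; the function name and the sample entries are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_matches_per_minute(entries: Iterable[Tuple[str, str]], needle: str) -> Dict[str, int]:
    """Count log lines containing `needle`, bucketed by minute.

    Each entry is (RFC 3339 timestamp, text payload); the minute bucket is the
    first 16 characters of the timestamp, e.g. "2026-05-12T10:01".
    """
    buckets: Counter = Counter()
    for timestamp, text in entries:
        if needle in text:
            buckets[timestamp[:16]] += 1
    return dict(buckets)


# Hypothetical log lines.
logs = [
    ("2026-05-12T10:01:05Z", "container checkout OOMKilled"),
    ("2026-05-12T10:01:40Z", "container checkout OOMKilled"),
    ("2026-05-12T10:02:10Z", "request served in 120ms"),
    ("2026-05-12T10:03:00Z", "container cart OOMKilled"),
]
print(count_matches_per_minute(logs, "OOMKilled"))
# {'2026-05-12T10:01': 2, '2026-05-12T10:03': 1}
```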
Candidates often forget that Data Access audit logs for most services, including Cloud Storage, are disabled by default; BigQuery is the exception, where they are always on. If a post-mortem asks "who read this PII bucket?" and the logs were never enabled, the answer is "we cannot know." Enable Data Access logs at the org level via IAM & Admin > Audit Logs before an incident, not after.
Cloud Trace Span Analysis for Latency Forensics
When a request takes 8 seconds instead of 200ms, Cloud Trace reconstructs the per-service waterfall. Each request is a trace; each unit of work (HTTP call, RPC, DB query) is a span. The architect reads three signals from a trace:
What to read in a span
- Span duration vs parent duration — if a child span is 7.9s inside an 8s parent, the bottleneck is downstream of the parent. If many small children sum to 8s, you have N+1 query syndrome.
- Span attributes: http.status_code, http.method, db.statement, peer.service. The presence of db.statement="SELECT ..." repeated 500 times in one trace is the ORM lazy-load smell.
- Time gaps between spans: gaps with no instrumented work usually mean uninstrumented middleware, a GC pause, or a cold start (for example, a new Cloud Run instance spinning up).
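The first two signals can be checked mechanically once span data is exported. The following is a rough sketch under stated assumptions: the Span class, the 80% share, and the repeat threshold of 10 are all hypothetical choices for illustration and are not part of Cloud Trace.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    name: str
    duration_ms: float
    db_statement: Optional[str] = None


def dominant_child(parent_ms: float, children: List[Span], share: float = 0.8) -> Optional[Span]:
    """Return the child that accounts for at least `share` of the parent's duration, if any."""
    if not children:
        return None
    slowest = max(children, key=lambda s: s.duration_ms)
    return slowest if slowest.duration_ms >= share * parent_ms else None


def repeated_statements(children: List[Span], min_repeats: int = 10) -> List[Tuple[str, int]]:
    """List db.statement values repeated at least `min_repeats` times (a hint of N+1 queries)."""
    counts = Counter(s.db_statement for s in children if s.db_statement)
    return [(stmt, n) for stmt, n in counts.most_common() if n >= min_repeats]


# Case 1: one slow downstream call inside an 8 s parent.
slow_call = [Span("auth", 50), Span("inventory", 7900)]
print(dominant_child(8000, slow_call).name)  # inventory

# Case 2: 500 small queries that add up to 8 s.
n_plus_one = [Span("db", 16, "SELECT ...") for _ in range(500)]
print(dominant_child(8000, n_plus_one))      # None: no single child dominates
print(repeated_statements(n_plus_one))       # [('SELECT ...', 500)]
```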
Practical sampling and propagation
- On auto-instrumented platforms such as App Engine and Cloud Run, Trace samples by default at about 0.1 requests/sec per instance to keep cost down. With OpenTelemetry you control the rate yourself: set OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.05 (5%) for high-QPS services.
- Context propagation uses the W3C traceparent header (or the legacy X-Cloud-Trace-Context). If a service is missing from the waterfall, it almost always failed to forward the header, typically a one-line bug in the HTTP client.
- Run Trace Insights (auto-generated latency analysis) on a service: it flags p99 regressions vs baseline without you setting thresholds manually.
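Forwarding the header simply means keeping the trace ID and swapping in a new span ID on each outgoing call. Instrumentation libraries normally do this for you; the sketch below only shows the mechanics for a version-00 W3C traceparent value. The function name is hypothetical, header keys are assumed to be lower-case, and the validation is simplified.

```python
import re
import secrets
from typing import Dict, Optional

# version "00", 32-hex trace-id, 16-hex parent-id, 2-hex trace-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def outgoing_headers(incoming: Dict[str, str], new_span_id: Optional[str] = None) -> Dict[str, str]:
    """Build headers for a downstream call that stay in the caller's trace."""
    match = _TRACEPARENT.match(incoming.get("traceparent", ""))
    if match is None:
        return {}  # nothing valid to propagate
    trace_id, _parent_id, flags = match.groups()
    span_id = new_span_id or secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(outgoing_headers(incoming, new_span_id="b7ad6b7169203331"))
# {'traceparent': '00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01'}
```

A service that builds its downstream request without this step starts a fresh trace, which is why it vanishes from the waterfall.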
When Trace is the wrong tool
If the request never reaches your service (DNS failure, GFE blocked, firewall drop), no span exists. Pivot to VPC Flow Logs + Connectivity Tests. Trace is for in-process and inter-service latency, not for "the packet never arrived."
Cloud Profiler for CPU and Memory Hotspots
Cloud Profiler is statistical, continuous, and low-overhead (~1% CPU). Where Trace tells you which service is slow, Profiler tells you which line of code burns the cycles. It supports Go, Java, Node.js, and Python on GCE, GKE, Cloud Run, and App Engine.
Profile types
- CPU time — wall-clock vs on-CPU. A function with 2s CPU time but 8s wall time is blocking on I/O, not crunching numbers.
- Heap / Heap allocation — find memory leaks and excessive short-lived allocations (GC pressure).
- Contention (Go): mutex contention hotspots; the classic culprit is a single lock-guarded map that becomes a global bottleneck.
- Threads (Go): goroutine counts by call stack; a steadily growing count points to blocked or leaked goroutines.
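The CPU-time versus wall-time distinction is easy to reproduce locally. This small sketch uses only the Python standard library, not Cloud Profiler; the helper names are hypothetical, and time.sleep stands in for a blocking I/O call.

```python
import time


def cpu_and_wall(fn):
    """Run fn once and return (cpu_seconds, wall_seconds) for the current process."""
    cpu_start, wall_start = time.process_time(), time.perf_counter()
    fn()
    return time.process_time() - cpu_start, time.perf_counter() - wall_start


def io_bound():
    time.sleep(0.2)  # stands in for waiting on a network call or a disk read


cpu, wall = cpu_and_wall(io_bound)
print(f"cpu={cpu:.3f}s wall={wall:.3f}s")  # wall is near 0.2 s, cpu is far smaller
```

A function that shows this gap in a profile is waiting, not computing, so optimising its arithmetic will not help.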
Reading a flame graph
- Width = sample count = time/allocations spent. Cloud Profiler draws the root frame at the top and callees beneath it, so a wide frame with little or nothing below it is your hotspot.
- Use the diff view to compare two time ranges (before/after a deploy). A new wide stack appearing post-deploy is a regression candidate.
- Filter by deployment label (version=v123) to compare canary vs stable in the same chart.
On the PCA exam, the trigger phrase "identify which function or line of code is slow" maps to Cloud Profiler. The trigger "identify which service in the call chain is slow" maps to Cloud Trace. They are complementary, not interchangeable — Trace narrows from N services to 1; Profiler narrows from 1 service to 1 function.
Network Intelligence Center: Connectivity Tests and Performance Dashboard
Network Intelligence Center (NIC) is the architect's networking war room. Four sub-modules matter for troubleshooting:
Connectivity Tests (the killer feature)
Connectivity Tests perform static configuration analysis — they walk the VPC graph (routes, firewall rules, peering, Cloud NAT, hybrid VPN/Interconnect) and tell you whether a packet from source=VM-A to destination=10.20.30.40:443 would be delivered. No traffic is sent. The output is a step-by-step trace: "matched route X, evaluated firewall rule Y (allow), exited via Cloud NAT Z." When a packet would be dropped, NIC names the exact rule or missing route.
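The idea of static configuration analysis can be illustrated with a toy model: evaluate the rules against a hypothetical packet instead of sending one. The sketch below covers only ingress firewall rules matched on source range and port; routes, tags, protocols, and peering are ignored, and the Rule class and rule names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rule:
    name: str
    priority: int      # lower number is evaluated first, as in VPC firewall rules
    action: str        # "allow" or "deny"
    source_range: str  # CIDR
    port: int


def evaluate_ingress(rules: List[Rule], src_ip: str, port: int) -> Tuple[str, str]:
    """Return (verdict, rule name) for a hypothetical packet without sending any traffic."""
    src = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port == port and src in ipaddress.ip_network(rule.source_range):
            return rule.action, rule.name
    return "deny", "implied-deny-ingress"


rules = [
    Rule("allow-https-internal", 1000, "allow", "10.0.0.0/8", 443),
    Rule("deny-bad-subnet", 900, "deny", "10.99.0.0/16", 443),
]
print(evaluate_ingress(rules, "10.1.2.3", 443))     # ('allow', 'allow-https-internal')
print(evaluate_ingress(rules, "10.99.1.5", 443))    # ('deny', 'deny-bad-subnet')
print(evaluate_ingress(rules, "192.168.1.1", 443))  # ('deny', 'implied-deny-ingress')
```

Like the real tool, the function names the rule that decided the outcome, which is the part that makes the result actionable.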
Use Connectivity Tests:
- After every firewall change (run as a Terraform post-deploy check).
- When debugging on-prem-to-GCP via Cloud VPN / Interconnect — NIC understands BGP routes and Cloud Router exchanges.
- For Private Google Access / Private Service Connect debugging — it knows which subnets have PGA enabled.
Performance Dashboard
Shows packet loss and median RTT between GCP zones/regions and from your project to the internet, which removes the need to deploy iperf VMs.
Firewall Insights
Flags shadowed rules (a higher-priority rule masks a lower one), overly permissive rules (0.0.0.0/0 on port 22), and unused rules based on hit counts over a configurable observation period (six weeks by default).
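A shadowed rule is one that can never be the first match. The sketch below shows one simplified way to detect that, assuming rules differ only in priority, source range, and port; it is not the Firewall Insights algorithm, and the FirewallRule type and rule names are hypothetical.

```python
import ipaddress
from typing import List, NamedTuple, Tuple


class FirewallRule(NamedTuple):
    name: str
    priority: int      # lower number wins
    source_range: str  # CIDR
    port: int


def find_shadowed(rules: List[FirewallRule]) -> List[Tuple[str, str]]:
    """Return (shadowed rule, shadowing rule) pairs.

    A rule counts as shadowed here when a strictly higher-priority rule on the
    same port covers its entire source range.
    """
    ordered = sorted(rules, key=lambda r: r.priority)
    pairs = []
    for i, low in enumerate(ordered):
        low_net = ipaddress.ip_network(low.source_range)
        for high in ordered[:i]:
            covers = low_net.subnet_of(ipaddress.ip_network(high.source_range))
            if high.priority < low.priority and high.port == low.port and covers:
                pairs.append((low.name, high.name))
                break
    return pairs


rules = [
    FirewallRule("allow-all-ssh", 100, "0.0.0.0/0", 22),
    FirewallRule("allow-office-ssh", 1000, "203.0.113.0/24", 22),
    FirewallRule("allow-https", 1000, "0.0.0.0/0", 443),
]
print(find_shadowed(rules))  # [('allow-office-ssh', 'allow-all-ssh')]
```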
Network Topology
Visual graph of inter-VM traffic — useful to spot a service that is unexpectedly chatting with an external IP (data exfiltration hint).
Error Reporting: Root-Cause-First Grouping
Error Reporting automatically deduplicates exceptions by stack trace fingerprint (language-aware: it ignores stack frame line numbers but groups by function chain). One real customer scenario produces one Error Reporting group even if 50,000 users hit it.
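Grouping by function chain while ignoring line numbers can be sketched in a few lines. This is an illustration of the idea only, not Error Reporting's actual fingerprinting; it assumes Python-style traceback frames, and the function names and sample traces are hypothetical.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List

# Matches Python traceback frame lines such as: File "app.py", line 42, in charge
_FRAME = re.compile(r'File "(?P<file>[^"]+)", line \d+, in (?P<func>\S+)')


def fingerprint(stack_trace: str) -> str:
    """Hash the chain of (file, function) frames, ignoring line numbers."""
    frames = [f"{m['file']}:{m['func']}" for m in _FRAME.finditer(stack_trace)]
    return hashlib.sha256("|".join(frames).encode()).hexdigest()[:12]


def group_errors(traces: List[str]) -> Dict[str, List[str]]:
    """Bucket stack traces that share the same fingerprint."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for trace in traces:
        groups[fingerprint(trace)].append(trace)
    return dict(groups)


before_deploy = 'File "app.py", line 42, in charge\nFile "db.py", line 10, in query\nTimeoutError'
after_deploy = 'File "app.py", line 57, in charge\nFile "db.py", line 12, in query\nTimeoutError'
other_bug = 'File "app.py", line 90, in refund\nKeyError'

print(len(group_errors([before_deploy, after_deploy, other_bug])))  # 2
```

Because line numbers are excluded, the same bug keeps its group even after a deploy shifts the code by a few lines.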
Triage workflow
- Sort by "Occurrences (last hour)" to find currently raging fires.
- Sort by "First seen" to find regressions introduced by today's deploy.
- Sort by "Affected users" to prioritise customer impact over noisy bots.
- Click into a group → see the resolution status (open / acknowledged / resolved / muted), assigned engineer, and linked Cloud Logging entries.
Integration with on-call
- Create a Cloud Monitoring alert on the log-based metric error_reporting.googleapis.com/error_count filtered by service. New error group spike → PagerDuty page.
- The resolved-then-recurring transition is automatic: if you mark a group resolved and it occurs again, Error Reporting reopens it and can re-page on-call. This catches incomplete fixes.
- Use mute (not resolve) for known third-party noise (e.g., scraper bots triggering 404s).
Error Reporting requires either automatic instrumentation (App Engine, Cloud Functions, Cloud Run all wire it up by default) or explicit logging in the required format. For GKE / GCE workloads, the easiest path is to ensure exceptions are written to stderr with full stack traces and shipped via the Ops Agent — Error Reporting then parses them automatically.
GKE Pod Debugging: The kubectl Playbook
When a Pod is failing, run these commands in order. Each answers a specific question:
Step 1 — Is the Pod healthy at all?
kubectl get pod <pod> -n <ns> -o wide
# READY 0/1, STATUS CrashLoopBackOff, RESTARTS 17 → application crash loop
# STATUS Pending → unschedulable; check events
# STATUS ImagePullBackOff → registry/auth issue
Step 2 — Why is it in that state?
kubectl describe pod <pod> -n <ns>
Read the Events section at the bottom. Look for:
- FailedScheduling: 0/3 nodes are available: insufficient cpu → resize the node pool, lower requests, or enable autoscaling.
- Liveness probe failed: HTTP probe failed with statuscode: 503 → the probe path is wrong or the app is slow to warm up.
- OOMKilled (exit code 137) → raise the memory limit or fix the leak (use Cloud Profiler heap mode).
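Exit code 137 is not arbitrary: on Linux, a code above 128 means the process was killed by signal (code - 128), and 137 - 128 = 9 is SIGKILL, which is what the kernel sends on an out-of-memory kill. A minimal decoder, assuming a POSIX system (the function name is hypothetical):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code; codes above 128 mean 'killed by signal (code - 128)'."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = f"signal {number}"
        return f"terminated by {name}"
    return f"application error (exit status {code})"


print(describe_exit_code(137))  # terminated by SIGKILL
print(describe_exit_code(143))  # terminated by SIGTERM
print(describe_exit_code(1))    # application error (exit status 1)
```

A 143 (SIGTERM) usually means the container was asked to stop, for example during a rollout, rather than crashing on its own.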
Step 3 — What did the app actually log?
kubectl logs <pod> -c <container> --previous # logs from the crashed container
kubectl logs -f <pod> --tail=200 # stream current logs
For multi-container Pods, always pass -c. The --previous flag is essential for CrashLoopBackOff — without it you see the boot of the next attempt, not the failure of the last.
Step 4 — Get inside a running but misbehaving container
kubectl exec -it <pod> -- /bin/sh
If the image is distroless (no shell), use ephemeral debug containers:
kubectl debug -it <pod> --image=busybox:1.36 --target=<container>
This attaches an ephemeral container that shares the Pod's network namespace and, with --target, the target container's process namespace, without rebuilding the image. GKE 1.23+ supports it natively.
Step 5 — Pivot to Cloud Logging
Logs Explorer query: resource.type="k8s_container" AND resource.labels.pod_name="<pod>" AND severity>=WARNING. Logs persist after the Pod is gone — kubectl logs does not.
VPC Flow Logs Analysis for Network Forensics
VPC Flow Logs record sampled metadata (5-tuple + bytes + RTT + action) for traffic on a subnet. They are the single most underrated tool for security and connectivity post-mortems.
Enabling intelligently
- Enable per-subnet, not per-VPC; flow logs cost money proportional to traffic.
- Tune aggregation interval (5s default) and sample rate (0.5 default = 50%). For low-traffic subnets, raise sample to 1.0; for chatty subnets, drop to 0.1.
- Enable metadata fields such as src_instance, dest_instance, and src_location/dest_location to get human-readable analysis without IP-to-resource lookups.
Forensic queries (BigQuery sink)
Export flow logs to BigQuery via Log Router for SQL analysis:
-- Top external destinations by bytes sent in the last hour
SELECT
  jsonPayload.connection.dest_ip AS dest_ip,
  SUM(CAST(jsonPayload.bytes_sent AS INT64)) AS bytes
FROM `proj.flow_logs.compute_googleapis_com_vpc_flows_*`
WHERE jsonPayload.reporter = "SRC"
  -- dest_location.country holds ISO 3166-1 alpha-3 codes, e.g. "usa", "twn"
  AND jsonPayload.dest_location.country NOT IN ("usa", "twn")
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY dest_ip
ORDER BY bytes DESC
LIMIT 20;
This surfaces data exfiltration candidates in minutes. Pair with Firewall Rules Logging (separate feature) to confirm which rule allowed/denied each flow.
What Flow Logs cannot do
They sample. A single dropped packet may not appear. For deterministic answers, use Connectivity Tests (config analysis) or Packet Mirroring (full PCAP to an analysis VM).
Set up a scheduled BigQuery query on flow logs that emails a daily top-talker report. After two weeks you will have a behavioural baseline; any new top-10 entrant is a candidate for investigation without you watching a dashboard. Combine with Cloud Monitoring alerts on compute.googleapis.com/firewall/dropped_packets_count for active blocking signals.
Audit Log Forensics: Admin Activity, Data Access, System Events
GCP Cloud Audit Logs answer the "who did what, when, from where" question.
Cloud Audit Log: an immutable record emitted by Google Cloud services capturing administrative and data-plane API calls. Each entry contains protoPayload (the API call detail), authenticationInfo (who), requestMetadata (from where), and authorizationInfo (what permission was checked). It is separate from application logs written by your own code.
Four streams:
- Admin Activity: always on, free, 400-day retention. Captures setIamPolicy, create, delete, update on resources. First place to look for "who deleted the bucket."
- Data Access: off by default (except for BigQuery). Records get, list, read operations. Enable for PII-bearing services; expect significant log volume.
- System Event: Google-initiated actions (live migration, automatic instance recovery). Useful when explaining "the VM rebooted itself."
- Policy Denied: logged only when a security policy (for example, a VPC Service Controls perimeter) denies an API call. Diagnoses "why did Terraform fail to create this resource."
Investigation pattern
protoPayload.authenticationInfo.principalEmail answers who. protoPayload.methodName answers what. protoPayload.requestMetadata.callerIp answers from where. protoPayload.authorizationInfo[].granted answers was the action permitted.
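When audit entries are exported as JSON, the four questions map directly onto dictionary lookups. The sketch below assumes an entry shaped like the fields named above; the function name, the sample entry, and the service-account check are hypothetical illustrations, not an official client API.

```python
from typing import Any, Dict


def summarise_audit_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the who / what / from-where / permitted fields out of one audit log entry."""
    payload = entry.get("protoPayload", {})
    principal = payload.get("authenticationInfo", {}).get("principalEmail", "")
    checks = payload.get("authorizationInfo", [])
    return {
        "who": principal,
        "what": payload.get("methodName", ""),
        "from_where": payload.get("requestMetadata", {}).get("callerIp", ""),
        # An entry can carry several permission checks; require all of them to be granted.
        "permitted": bool(checks) and all(c.get("granted", False) for c in checks),
        # Covers user-managed (...iam.gserviceaccount.com) and default service accounts.
        "is_service_account": principal.endswith("gserviceaccount.com"),
    }


# Hypothetical entry for illustration.
entry = {
    "protoPayload": {
        "methodName": "storage.objects.delete",
        "authenticationInfo": {"principalEmail": "ci-bot@my-proj.iam.gserviceaccount.com"},
        "requestMetadata": {"callerIp": "198.51.100.7"},
        "authorizationInfo": [{"permission": "storage.objects.delete", "granted": True}],
    }
}
print(summarise_audit_entry(entry))
```

Flattening entries this way makes it straightforward to sort a sweep by caller IP or to separate service accounts from human users.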
Service account key compromises are routinely missed because investigators look at human user emails first. Always include protoPayload.authenticationInfo.principalEmail=~".*@.*\\.iam\\.gserviceaccount\\.com" in your initial sweep; a leaked service account key calling from an unexpected callerIp is the canary for credential exfiltration.
Export critical audit logs to a separate, locked-down project with a Logging sink the moment a security event is suspected — this prevents an attacker with project-level admin from deleting evidence.
Recommender API: Surfacing Hidden Issues Proactively
The Recommender API is the cloud equivalent of a quarterly health check — it analyses usage patterns and emits insights + recommendations across security, cost, performance, manageability, and reliability dimensions. Architects who only consult it during incidents miss its real value.
High-signal recommenders
- google.iam.policy.Recommender: IAM role rightsizing; replaces overly broad roles with least-privilege custom roles based on actual API usage in the past 90 days.
- google.compute.instance.IdleResourceRecommender: flags VMs running at <3% CPU for weeks; classic cost recovery.
- google.compute.disk.IdleResourceRecommender: unattached persistent disks billed silently.
- google.compute.firewall.Recommender: shadowed and overly permissive firewall rules (overlaps with Firewall Insights).
- google.cloudsql.instance.OutOfDiskRecommender: predicts Cloud SQL disk exhaustion before the database hangs.
- google.gke.cluster.Recommender: GKE upgrades, version-skew warnings, deprecated APIs in the cluster.
Operational integration
- Pull recommendations via gcloud recommender recommendations list --recommender=google.iam.policy.Recommender --location=global and feed them into a weekly engineering review.
- Set up Active Assist alerts to email or Pub/Sub-publish new high-priority recommendations.
- Treat the Security Health Analytics findings in Security Command Center as a parallel stream — between SCC and Recommender, most hidden misconfigurations surface without manual audits.
Support Case Escalation: Choosing the Right Tier
GCP support tiers (Basic, Standard, Enhanced, Premium) differ on response SLO, channels, and Technical Account Manager (TAM) access. Architects must know when and how to escalate a case so the right tier is engaged.
Case priority and response targets (Premium)
- P1 — Critical impact, service unusable: 15-minute response, 24/7. Reserve for production outages with no workaround.
- P2 — High impact, degraded: 1-hour response, 24/7.
- P3 — Medium impact: 4 business hours.
- P4 — Low impact / question: 8 business hours.
Filing a great case
Include in the case body, not as follow-up comments:
- Project ID, region/zone, resource name(s).
- Exact timestamps in UTC with timezone explicit.
- Request IDs / Trace IDs / Operation IDs — Google support can pull internal logs only with these IDs.
- What you have already ruled out — IAM, quota, recent changes. Saves L1 from re-running your work.
- Business impact — "Checkout is down for 100% of EU users" justifies P1 escalation.
Escalation paths
- Re-escalate within the case: ask explicitly for a P-level bump if impact has grown.
- TAM (Premium only): page your TAM for outages — they fast-track to the product team and own internal coordination.
- Customer Care Manager for billing/contract issues.
- Google Cloud status dashboard (status.cloud.google.com): confirm a regional incident isn't already disclosed before you spend hours debugging your own code.
For PCA scenarios, the right answer is rarely "open a support case first." The exam expects you to demonstrate observability-driven diagnosis (Logging + Trace + Monitoring + Connectivity Tests) before involving Google support. Support escalation is the answer when the issue is clearly on Google's side (regional outage, suspected platform bug, Quota increase request that requires human approval).
FAQ — Troubleshooting Strategies
Q1. What is the difference between a Log and a Metric?
A Log is a record of a specific event ("User X logged in"). A Metric is a numerical measurement over time ("There are 500 users logged in right now"). Logs are for "Why," Metrics are for "What."
Q2. How do I troubleshoot a GKE Pod that keeps restarting (CrashLoopBackOff)?
- Run kubectl describe pod to see the exit code.
- Check Cloud Logging with a filter for that specific pod name to see the application's stderr/stdout logs before it crashed.
- Check Error Reporting to see if a specific exception is being thrown.
Q3. Why can't I see any logs in the Logs Explorer?
Check three things:
- Does the service account have the roles/logging.logWriter role?
- Are you looking at the correct time range?
- Is there an Exclusion Filter in the Log Router that is dropping the logs to save money?
Q4. When should I use Cloud Trace instead of Cloud Logging?
Use Cloud Trace when you have a performance issue across multiple microservices. Use Cloud Logging when you have a specific error or crash within a single service.
Q5. What are "VPC Flow Logs"?
They record a sample of all network traffic (IPs, Ports, Bytes) moving through your VPC. They are essential for troubleshooting "Who is talking to whom" and identifying unauthorized network attempts.
Final Architect Tip
On the PCA exam, if a question is about "Latency," the answer is usually Cloud Trace. If it's about "Resource usage/Alerting," look for Cloud Monitoring. For "Audit/Security/Error details," use Cloud Logging. Always prefer the most surgical tool (e.g., Connectivity Tests for networking) rather than just "looking at all logs."