Microservices Architecture on Google Cloud

Design and operate cloud-native microservices on GCP with DDD bounded contexts, OpenAPI/gRPC contracts, Anthos Service Mesh, Pub/Sub events, Saga patterns, and polyglot persistence on Spanner, Firestore, and Bigtable for the PCD exam.

Introduction to Microservices Architecture

Microservices decompose a monolith into small, independently deployable services that communicate over network APIs. Each service owns its data, ships on its own release train, and can be written in a different language. Google Cloud's runtime fabric (GKE, Cloud Run, Cloud Run for Anthos), data layer (Spanner, Firestore, Bigtable, BigQuery), and integration plane (Pub/Sub, Eventarc, API Gateway, Anthos Service Mesh, Service Directory) make it the canonical platform for this style.

For the Professional Cloud Developer (PCD) exam, you need to reason about how the pieces fit: how to slice services correctly, how to keep contracts honest, how identity flows across hops, how transactions span databases, and how observability survives the fan-out. This study note expands each axis with concrete services, API field names, and gcloud invocations you will see on the exam.

A microservice is an independently deployable unit of business capability that owns its data store, exposes a versioned contract (REST/OpenAPI or gRPC/Protobuf), authenticates callers via short-lived credentials (OIDC ID tokens or mTLS), and emits events to an asynchronous bus such as Pub/Sub.

Service Boundaries via Domain-Driven Design

Slicing a system into services is the single most consequential decision. Too coarse and you rebuild a monolith. Too fine and every business operation becomes a distributed transaction.

Bounded Contexts and the Ubiquitous Language

A bounded context in Domain-Driven Design (DDD) is the explicit boundary within which a particular domain model is defined and applicable. In an e-commerce platform the word Order in the Checkout context (price, line items, tax) is a different concept from Order in the Fulfillment context (warehouse picks, courier ID). Mapping each bounded context onto one (and only one) microservice keeps the model coherent and prevents "shared kernel" coupling.

Aggregates and Service Ownership

Inside a context, identify aggregates: clusters of entities mutated together with strong consistency boundaries (e.g., Order + OrderLine). The aggregate root is the only entry point for writes. A service should own one aggregate; if two services routinely need to write to the same aggregate, they are probably one service.

Anti-Corruption Layer (ACL)

When a new microservice has to talk to a legacy monolith or a third-party SaaS, build an anti-corruption layer: a small translation service that exposes a clean modern contract internally while shielding the rest of the platform from the legacy schema. On GCP this is often deployed as a Cloud Run service in front of an Apigee proxy.

Run an EventStorming workshop before drawing service boxes. Mapping domain events (OrderPlaced, PaymentAuthorized, ShipmentDispatched) reveals the natural seams. Cutting on events tends to align with bounded contexts; cutting on nouns (User, Product) creates anemic CRUD services that distribute the monolith.

API Contracts: OpenAPI and gRPC

Contracts are how services agree to talk without reading each other's source code.

OpenAPI for REST

External-facing and partner APIs are typically REST/JSON described by an OpenAPI document. GCP's API Gateway consumes an OpenAPI document (historically OpenAPI 2.0, so confirm which specification versions your gateway accepts) with the x-google-backend extension to wire each path to a Cloud Run, Cloud Functions, or App Engine backend. Versioning lives in the URL (/v1/orders) or a header (Accept: application/vnd.example.v2+json).

gRPC and Protocol Buffers

For high-volume internal traffic, gRPC over HTTP/2 with Protobuf is preferred: binary encoding can cut payloads substantially versus JSON (60-80% is often quoted, though the saving depends on message shape), and streaming RPCs are first-class. Cloud Run, GKE, and Anthos Service Mesh all support gRPC natively, including bidirectional streaming. Use proto3 syntax, keep field numbers 1-15 for frequently set fields (their tag encodes in a single byte), and never reuse the number of a removed field; mark it as reserved instead.
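The single-byte claim for field numbers 1-15 follows from how Protobuf encodes a field tag: the value (field_number << 3) | wire_type written as a base-128 varint. The sketch below reimplements that arithmetic by hand purely for illustration; the function names are hypothetical and a real service would rely on the generated Protobuf code instead.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only covers non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def field_tag(field_number: int, wire_type: int) -> bytes:
    """Return the encoded tag for a field: (field_number << 3) | wire_type."""
    if not 0 <= wire_type <= 7:
        raise ValueError("wire type must fit in three bits")
    return encode_varint((field_number << 3) | wire_type)


if __name__ == "__main__":
    # Print how many bytes the tag costs for a few field numbers.
    for number in (1, 15, 16, 2047, 2048):
        print(number, len(field_tag(number, wire_type=0)))
```

Field number 16 is the first whose tag spills into a second byte, which is why rarely set fields are the ones to push past 15.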

Contract Evolution and Backward Compatibility

Both OpenAPI and Protobuf allow additive evolution: new optional fields, new enum values, new RPC methods. Forbidden moves include changing a field's type, changing or reusing a field's tag number, or making an optional field required. Run buf breaking (for Protobuf) or an OpenAPI diff tool such as oasdiff in CI to detect breaking changes before deploy.

Returning HTTP 200 OK with {"error": "..."} in the body defeats every gateway, retry library, and SLO tool you will ever use. Map domain failures to proper status codes (400, 404, 409, 422) or, for gRPC, to canonical google.rpc.Code values like INVALID_ARGUMENT and FAILED_PRECONDITION.
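One way to keep status codes honest is a single lookup table that every handler goes through. The sketch below is a minimal illustration: the DomainError names and the to_http_response helper are hypothetical, while the HTTP and google.rpc.Code pairings follow the canonical mapping documented alongside google.rpc.Code.

```python
from enum import Enum
from typing import Dict, Tuple


class DomainError(Enum):
    """Hypothetical domain failures for an order service."""
    VALIDATION_FAILED = "validation_failed"
    ORDER_NOT_FOUND = "order_not_found"
    ORDER_ALREADY_EXISTS = "order_already_exists"
    ORDER_NOT_PAYABLE = "order_not_payable"


# Domain failure -> (HTTP status, google.rpc.Code name).
ERROR_MAP: Dict[DomainError, Tuple[int, str]] = {
    DomainError.VALIDATION_FAILED: (400, "INVALID_ARGUMENT"),
    DomainError.ORDER_NOT_FOUND: (404, "NOT_FOUND"),
    DomainError.ORDER_ALREADY_EXISTS: (409, "ALREADY_EXISTS"),
    DomainError.ORDER_NOT_PAYABLE: (400, "FAILED_PRECONDITION"),
}


def to_http_response(error: DomainError, message: str) -> Tuple[int, dict]:
    """Build a (status, body) pair so the failure is visible in the status line."""
    status, rpc_code = ERROR_MAP[error]
    body = {"error": {"code": status, "status": rpc_code, "message": message}}
    return status, body


if __name__ == "__main__":
    print(to_http_response(DomainError.ORDER_NOT_FOUND, "order 42 does not exist"))
```

Because the status is derived from the table rather than set ad hoc, gateways and retry libraries see the same failure class that the body describes.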

Service-to-Service Authentication

In a microservices mesh every internal hop is an authenticated call. GCP gives you two complementary primitives.

OIDC ID Tokens via the Metadata Server

When service A (running on Cloud Run, GKE with Workload Identity, GCE, or Cloud Functions) calls service B, A requests an OIDC ID token from the local metadata server at http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=https://service-b-xyz.a.run.app with the header Metadata-Flavor: Google. The returned JWT is signed by Google, identifies A's service account (the unique account ID in the sub claim and, where included, the account email in the email claim), and is bound to the audience URL of B. On Cloud Run the platform verifies the signature against Google's public JWKS before the request reaches B's container, and authorizes the caller via the IAM role roles/run.invoker.
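The metadata-server exchange can be sketched with the standard library alone. This is a minimal illustration that only works on a Google Cloud runtime where metadata.google.internal resolves; the helper names are hypothetical and the audience URL is the placeholder used in this note.

```python
import urllib.parse
import urllib.request

METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def build_identity_request(audience: str) -> urllib.request.Request:
    """Prepare the metadata-server request for an ID token bound to `audience`."""
    query = urllib.parse.urlencode({"audience": audience})
    return urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},  # required by the metadata server
    )


def fetch_id_token(audience: str, timeout: float = 2.0) -> str:
    """Fetch the signed JWT; only succeeds where the metadata server is reachable."""
    request = build_identity_request(audience)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def build_authorized_request(url: str, id_token: str) -> urllib.request.Request:
    """Attach the ID token as a bearer credential for the call to service B."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {id_token}"})


if __name__ == "__main__":
    service_b = "https://service-b-xyz.a.run.app"  # placeholder URL from this note
    token = fetch_id_token(audience=service_b)
    with urllib.request.urlopen(build_authorized_request(service_b, token)) as resp:
        print(resp.status)
```

In production code the google-auth library offers equivalent helpers with token caching, which avoids a metadata round-trip on every call.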

mTLS Inside Anthos Service Mesh

Within an Anthos Service Mesh (ASM) cluster, sidecars automatically wrap every connection in mutual TLS using SPIFFE identities of the form spiffe://PROJECT.svc.id.goog/ns/NAMESPACE/sa/SERVICE_ACCOUNT. Authorization policies (AuthorizationPolicy CRD) can then state things like "only the checkout SA may call /payments".

Workload Identity Federation

For services running outside GCP (on-prem, AWS, GitHub Actions) that need to call GCP APIs, configure Workload Identity Federation so they trade their native OIDC token for a short-lived Google access token via sts.googleapis.com. No long-lived service-account keys ever leave Google.

The audience claim on an OIDC ID token must exactly match the callee's URL (Cloud Run service URL, or your custom audience for a load-balanced backend). A wildcard or mismatched audience returns HTTP 401 Unauthorized with WWW-Authenticate: Bearer error="invalid_token".

Service Mesh: Anthos Service Mesh and Istio

A service mesh moves cross-cutting concerns — mTLS, retries, traffic shifting, telemetry — out of every microservice and into a sidecar proxy.

Data Plane and Control Plane

ASM is Google's managed distribution of Istio. The data plane is an Envoy sidecar injected into every Pod; the control plane (istiod, managed by Google in ASM) ships configuration via xDS. Every L7 request enters and leaves a Pod through Envoy, which is where mTLS termination, request logging, and traffic policy live.

Traffic Management Primitives

  • VirtualService defines routing rules: weighted splits (90/10 canary), header-based routing, fault injection.
  • DestinationRule defines subsets (v1, v2) and connection-pool limits.
  • Gateway exposes the mesh externally, typically backed by a Google Cloud HTTP(S) Load Balancer.

Resilience Defaults

ASM lets you declare timeouts (spec.http[].timeout: 2s on a VirtualService), retries (spec.http[].retries.attempts: 3 with retryOn: 5xx,reset,connect-failure), and circuit breakers (trafficPolicy.outlierDetection.consecutive5xxErrors: 5 on a DestinationRule). Pair these with a PeerAuthentication in the mesh root namespace set to mtls.mode: STRICT to require mTLS mesh-wide.

Roll out ASM canaries with a VirtualService weighted split rather than a Kubernetes Service selector swap. Weighted splits give you per-request granularity (e.g., 1% → 5% → 25% → 100%) and let you bake in automated rollback on a Cloud Monitoring SLO burn.

Saga Pattern for Distributed Transactions

ACID across services is impractical. The Saga pattern replaces a single distributed transaction with a sequence of local transactions, each paired with a compensating action that semantically undoes it if a later step fails.

Choreography Sagas with Pub/Sub

In a choreography saga each service reacts to events and publishes new ones. Place an order: Orders publishes OrderPlaced → Payments charges and publishes PaymentAuthorized → Inventory reserves stock and publishes StockReserved → Shipping schedules pickup. If Inventory fails it publishes StockReservationFailed; Payments subscribes and issues a refund (compensation). No central coordinator; each service knows only its inputs and outputs.
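The order flow above can be sketched with an in-memory bus standing in for Pub/Sub. Everything here is illustrative: InMemoryBus, wire_saga, and the PickupScheduled event are hypothetical names, and delivery is synchronous, which real Pub/Sub is not.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBus:
    """Stand-in for Pub/Sub: every handler subscribed to an event type receives it."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.log: List[str] = []  # event types in publish order

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.log.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


def wire_saga(bus: InMemoryBus, stock_available: bool) -> None:
    """Each service knows only the event it consumes and the event it emits."""

    def payments_charge(order: dict) -> None:
        bus.publish("PaymentAuthorized", order)

    def inventory_reserve(order: dict) -> None:
        outcome = "StockReserved" if stock_available else "StockReservationFailed"
        bus.publish(outcome, order)

    def shipping_schedule(order: dict) -> None:
        bus.publish("PickupScheduled", order)

    def payments_refund(order: dict) -> None:
        # Compensation: a new business event, not a rollback of the charge.
        bus.publish("PaymentRefunded", order)

    bus.subscribe("OrderPlaced", payments_charge)
    bus.subscribe("PaymentAuthorized", inventory_reserve)
    bus.subscribe("StockReserved", shipping_schedule)
    bus.subscribe("StockReservationFailed", payments_refund)


if __name__ == "__main__":
    for available in (True, False):
        bus = InMemoryBus()
        wire_saga(bus, stock_available=available)
        bus.publish("OrderPlaced", {"order_id": "o-1"})
        print(available, bus.log)
```

Note that no function calls another service directly; removing Shipping would leave Orders, Payments, and Inventory untouched.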

Orchestration Sagas with Workflows

When the flow is complex or needs auditability, use Cloud Workflows as an explicit orchestrator. A YAML workflow makes synchronous HTTP calls to each service, catches errors, and explicitly invokes compensation steps. The whole saga becomes a single resource you can inspect, rerun, and trace.

Idempotency Keys

Sagas retry. Without idempotency, a compensating refund could fire twice. Every mutating endpoint must accept an Idempotency-Key header (UUID v4) and deduplicate at the data layer — typically a Firestore document or Spanner row keyed by the idempotency key with a 24-72h TTL.
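A minimal sketch of key-based deduplication follows, using an in-memory dictionary where a real service would use a Firestore document or Spanner row. The IdempotencyStore class and its run_once method are hypothetical, and the sketch is not safe under concurrent requests; a production version needs a transactional insert on the key.

```python
import time
from typing import Callable, Dict, Tuple


class IdempotencyStore:
    """Remembers the result of an operation per idempotency key until the TTL expires."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}  # key -> (expiry, result)

    def run_once(self, key: str, operation: Callable[[], object]) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # replay: return the stored result, skip the side effect
        result = operation()
        self._entries[key] = (now + self._ttl, result)
        return result


if __name__ == "__main__":
    store = IdempotencyStore(ttl_seconds=24 * 3600)
    refunds = []

    def issue_refund() -> str:
        refunds.append("refund")
        return "refund-accepted"

    key = "3f0c1d1e-6a57-4c57-9c0b-2b1f0a3a9f11"  # example UUID sent by the caller
    print(store.run_once(key, issue_refund), store.run_once(key, issue_refund))
    print(len(refunds))
```

The second call returns the stored response without running the refund again, which is the behaviour a retrying saga step depends on.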

A compensating action is not a rollback — it is a new business transaction that semantically reverses the prior one. A shipment cannot be un-shipped; it must be returned. Model compensations explicitly in your domain (PaymentRefunded, StockReleased, OrderCancelled) instead of pretending the original event never happened.

Event-Driven Architecture with Pub/Sub

Pub/Sub is the asynchronous spine of GCP microservices. It decouples producers from consumers, absorbs traffic spikes, and survives consumer outages.

Topics, Subscriptions, and Delivery Semantics

Producers publish to a topic; consumers attach subscriptions (pull or push). Each subscription is an independent fan-out: 10 subscriptions each receive 100% of messages. Delivery is at-least-once, and ordering keys give in-order delivery per key when message ordering is enabled on the subscription. Exactly-once delivery is opt-in on pull subscriptions (enable_exactly_once_delivery: true) and trades higher latency for stronger semantics.

Dead-Letter Topics and Retries

Configure a dead-letter topic (deadLetterPolicy.deadLetterTopic) plus maxDeliveryAttempts: 5. Failed messages land in a separate topic for inspection rather than poisoning the main subscription. Exponential backoff is set via retryPolicy.minimumBackoff: 10s and maximumBackoff: 600s.
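The effect of maxDeliveryAttempts can be modelled in a few lines. This is a simplified simulation, not the Pub/Sub client API: the function name is hypothetical, backoff between attempts is ignored, and the real service counts delivery attempts on a best-effort basis.

```python
from typing import Callable, Iterable, List, Tuple


def deliver_with_dead_letter(
    messages: Iterable[str],
    handler: Callable[[str], None],
    max_delivery_attempts: int = 5,
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Redeliver each message until the handler succeeds or attempts run out.

    Returns (processed, dead_lettered): processed holds (message, attempts used),
    dead_lettered holds messages that would be forwarded to the dead-letter topic.
    """
    processed: List[Tuple[str, int]] = []
    dead_lettered: List[str] = []
    for message in messages:
        for attempt in range(1, max_delivery_attempts + 1):
            try:
                handler(message)
            except Exception:
                continue  # no ack: the message would be redelivered
            processed.append((message, attempt))
            break
        else:
            # Loop ended without a successful ack: park the message for inspection.
            dead_lettered.append(message)
    return processed, dead_lettered


if __name__ == "__main__":
    def handler(message: str) -> None:
        if message == "poison":
            raise ValueError("cannot parse message")

    print(deliver_with_dead_letter(["ok", "poison", "also-ok"], handler))
```

The poison message stops blocking the subscription after five attempts, while healthy messages around it are processed normally.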

Schema Validation

Attach a Pub/Sub schema (Avro or Protobuf) to a topic so the broker rejects malformed publishes at the edge. This prevents one bad producer from corrupting every downstream subscriber.

"Fire and forget" publishing without acknowledgement deadlines tuned to your workload guarantees duplicate processing storms. Default ack deadline is 10 seconds; if your consumer routinely takes 30 seconds you must extend with modify_ack_deadline or set ackDeadlineSeconds: 60 on the subscription. Otherwise Pub/Sub redelivers while you are still processing.

Distributed Tracing and Context Propagation

A single user click can fan out to 20 services. Without correlated traces you cannot debug latency or errors.

Cloud Trace and OpenTelemetry

Instrument with the OpenTelemetry SDK (Java, Go, Python, Node) and export to Cloud Trace. Each span carries a trace_id (16 bytes), span_id (8 bytes), parent span_id, service name, and arbitrary attributes. The Trace UI stitches the spans into a flame graph keyed by trace_id.

W3C Trace Context Headers

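Propagation across HTTP and gRPC hops uses the W3C standard header traceparent: 00-<trace_id>-<span_id>-<flags>, optionally tracestate for vendor-specific data. Every microservice must (a) extract traceparent on ingress, (b) attach it to outgoing requests, (c) re-inject it on Pub/Sub publishes via the message attribute googclient_traceparent. ASM's Envoy sidecars can start spans and add trace headers at the edge of each Pod, but they cannot correlate an inbound request with the outbound calls it triggers, so application code (or an OpenTelemetry instrumentation library) still has to forward the headers. Outside the mesh, all propagation is handled by the SDK or by hand.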

Trace Sampling

100% sampling is expensive and noisy. Use tail-based sampling via the OpenTelemetry Collector to keep all traces with errors plus 1% of healthy traces, or probabilistic sampling at the SDK with 0.05 (5%).

The W3C traceparent header has exactly four hyphen-separated fields: version, trace-id (32 hex chars), parent-id (16 hex chars), flags (2 hex chars). Drop or rewrite any of those and the trace breaks at that hop.
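The four-field layout can be checked and propagated with a small parser. The sketch below handles only the version 00 format described above; TraceContext, parse_traceparent, and child_traceparent are hypothetical helper names, and an OpenTelemetry propagator would normally do this work.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-flags(2 hex), lowercase only.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    ctx = TraceContext(*match.groups())
    # The spec forbids version ff and all-zero trace or parent identifiers.
    if ctx.version == "ff" or set(ctx.trace_id) == {"0"} or set(ctx.parent_id) == {"0"}:
        return None
    return ctx


def child_traceparent(incoming: TraceContext) -> str:
    """Build the header for an outgoing call: same trace, fresh span id."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{incoming.version}-{incoming.trace_id}-{new_span_id}-{incoming.flags}"


if __name__ == "__main__":
    inbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    context = parse_traceparent(inbound)
    if context is not None:
        print(child_traceparent(context))
```

The trace-id is carried over unchanged while the parent-id is replaced, which is what lets the Trace UI attach the downstream span to its caller.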

API Gateway Patterns

External clients should never call individual microservices directly. An API gateway terminates TLS, authenticates users, enforces quotas, and routes to backends.

API Gateway vs Apigee vs ESPv2

GCP offers three layered options:

  • Google Cloud API Gateway — managed, OpenAPI-driven, low cost, ideal for serverless backends (Cloud Run, Cloud Functions). One config (gcloud api-gateway api-configs create) deploys an immutable revision; switch traffic by promoting the gateway.
  • Cloud Endpoints with ESPv2 — Envoy-based proxy you deploy yourself (sidecar on GKE or as a Cloud Run service). More control, more ops burden.
  • Apigee X — enterprise full-lifecycle API management: developer portal, monetization, complex policies (OAuth, JWT, transforms, quotas). The exam reaches for Apigee whenever the question mentions partner onboarding, monetization, or analytics.

Edge Concerns Belong at the Gateway

Authentication (Firebase Auth JWT, API keys, OAuth2), rate limiting (x-google-quota), CORS, request/response transformation, and request logging should live at the gateway, not duplicated in every backend service.

Backend-for-Frontend (BFF)

For mobile/web clients, deploy a thin BFF service per channel that aggregates calls to several internal microservices and shapes the payload for that client. This keeps a mobile client from making a dozen round-trips over a slow network.

Pin gateway clients to a specific OpenAPI revision via the x-google-api-name and config-id. When you roll a new config, traffic shifts only after you update the gateway, giving you a clean rollback point if a new contract breaks consumers.

Service Discovery with Service Directory

Hard-coding hostnames is brittle. Service Directory is GCP's managed service registry.

Namespaces, Services, and Endpoints

A namespace (per-region) contains services; each service has endpoints (address, port, metadata annotations). Clients resolve via gRPC's xds:/// URI scheme, the REST API (projects/*/locations/*/namespaces/*/services/*), or a private DNS zone that Service Directory auto-populates.

Integration with GKE and Anthos

ASM can synchronize Kubernetes Service objects into Service Directory, exposing in-cluster workloads to clients outside the cluster (Cloud Run, GCE VMs, on-prem via Cloud Interconnect). Conversely, external endpoints registered in Service Directory appear inside the mesh as ServiceEntry resources.

IAM and Network Reach

Service Directory enforces IAM at lookup time (roles/servicedirectory.viewer) so unauthenticated clients cannot enumerate your topology. Combined with VPC Service Controls, your registry stays inside the perimeter.

Service Directory is a fully managed service registry that gives a single source of truth for service location, health, and metadata across GCP, on-prem, and other clouds. Unlike DNS it stores structured metadata and is gRPC-aware.

Polyglot Persistence per Bounded Context

A core microservices tenet: each service owns its data. The corollary is polyglot persistence — different services pick different databases based on access patterns.

Spanner for Transactional Cores

Order management, payments, and inventory need strong consistency, multi-row transactions, and global reach. Cloud Spanner provides external consistency (TrueTime), SQL, and horizontal scale (up to petabytes). Use interleaved tables to colocate OrderLine under Order for join-free reads.

Firestore for User-Facing Read Models

A UserProfile service or a real-time chat read model fits Firestore in Native mode: document model, mobile/web SDKs with offline sync, real-time snapshot listeners for live updates, and per-document access control through Firebase Security Rules evaluated against the Firebase Auth identity.

Bigtable for Time-Series and IoT

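A DeviceTelemetry or ClickStream service ingesting 1M writes/sec belongs on Bigtable: single-digit-ms latency, row-key-sorted storage, and seamless export to BigQuery for analytics. Design row keys as device_id#reverse_timestamp so writes spread across devices and each device's newest readings sort first; a key that starts with a timestamp funnels all current writes onto one node and creates a hotspot.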
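A row key builder for time-series data might look like the sketch below. It assumes the device-first layout that Bigtable's schema guidance recommends for time series (identifier first, reversed timestamp second); the function name, the separator, and the MAX_MILLIS ceiling are illustrative choices, not part of any Bigtable API.

```python
MAX_MILLIS = 2**63 - 1  # illustrative ceiling used to reverse the timestamp


def telemetry_row_key(device_id: str, timestamp_ms: int) -> str:
    """Build 'device_id#reverse_timestamp' so a device's newest rows sort first."""
    if "#" in device_id:
        raise ValueError("device_id must not contain the '#' separator")
    if not 0 <= timestamp_ms <= MAX_MILLIS:
        raise ValueError("timestamp_ms out of range")
    reverse_ts = MAX_MILLIS - timestamp_ms
    # Zero-pad to 19 digits so lexicographic order matches numeric order.
    return f"{device_id}#{reverse_ts:019d}"


if __name__ == "__main__":
    keys = [
        telemetry_row_key("sensor-a", 1_700_000_000_000),
        telemetry_row_key("sensor-a", 1_700_000_060_000),
        telemetry_row_key("sensor-b", 1_700_000_000_000),
    ]
    for key in sorted(keys):
        print(key)
```

A prefix scan on "sensor-a#" then returns that device's readings newest first, while writes from different devices land in different parts of the key space.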

Memorystore for Hot Caches

Session state, short-TTL deduplication keys, and rate-limit counters belong in Memorystore for Redis (or Valkey). Latency is sub-millisecond, but durability is not guaranteed, so treat it as a cache, not the source of truth.

BigQuery as the Analytical Sink

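Operational stores stream to BigQuery through change-data-capture pipelines: Datastream for supported relational sources, Spanner change streams read by Dataflow, the Stream Firestore to BigQuery extension for Firestore, or Pub/Sub → Dataflow for event feeds. Analysts query the resulting datasets in BigQuery; microservices never see analytical load.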

Never share a database across microservices, even when "it would be easy". Shared databases reintroduce schema coupling, blocking migrations, and cross-team release deadlocks — the very pathologies microservices exist to escape. If two services need the same data, one owns it and the other reads via an API or an event-sourced read model.

Resilience Patterns

Network partitions, slow dependencies, and partial failures are normal. Build for them.

Timeouts, Retries, and Jitter

Default to a 1-2 second timeout on internal hops. Retry only idempotent operations, cap at 3 attempts, and use exponential backoff with jitter (base * 2^attempt + rand(0, base)) to prevent retry storms. ASM retries.perTryTimeout keeps per-attempt budgets honest.
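The backoff formula above translates almost directly into code. In this sketch the cap parameter, the choice of ConnectionError as the retryable failure, and the function names are assumptions added for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(
    attempt: int,
    base: float = 0.1,
    cap: float = 2.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Delay before the next try: base * 2^attempt + rand(0, base), capped."""
    source = rng if rng is not None else random
    delay = base * (2 ** attempt) + source.uniform(0, base)
    return min(delay, cap)


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an idempotent operation on ConnectionError, at most max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            sleep(backoff_delay(attempt, base=base))
    raise ValueError("max_attempts must be at least 1")


if __name__ == "__main__":
    outcomes = iter([ConnectionError("reset"), ConnectionError("reset"), "ok"])

    def flaky_call() -> str:
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    print(call_with_retries(flaky_call))
```

The random term is what keeps many clients that failed at the same instant from retrying at the same instant.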

Circuit Breakers and Bulkheads

A circuit breaker (Envoy outlierDetection) trips after consecutive failures, stops calling the broken dependency, and gives it room to recover. Bulkheads isolate resources: separate connection pools per dependency so a slow downstream cannot exhaust the entire thread pool.
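The trip, wait, and probe cycle of a circuit breaker can be sketched at the application level as follows. This illustrates the general pattern, not Envoy's outlier detection (which ejects individual hosts from a load-balancing pool); the class names and default thresholds are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; dependency is being rested")
            # Reset window elapsed: fall through and let one trial call probe it.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed probe
            raise
        # Success closes the circuit and clears the consecutive-failure count.
        self._failures = 0
        self._opened_at = None
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_after=5.0)

    def broken_dependency() -> str:
        raise ConnectionError("recommendations service is down")

    for _ in range(3):
        try:
            breaker.call(broken_dependency)
        except (ConnectionError, CircuitOpenError) as exc:
            print(type(exc).__name__)
```

Once the breaker is open, callers fail fast and can serve a fallback instead of queueing behind a dependency that is already struggling.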

Graceful Degradation

When a recommendation service is down, the product page should still render with a generic "popular items" list, not a 500. Design every read dependency with a fallback (@Fallback in Java, circuitbreaker middleware in Go).

Set the gateway timeout (e.g., API Gateway default 60s) lower than the load balancer's idle timeout (default 10 min). Mismatched timeouts cause clients to hang on already-cancelled requests and corrupt your retry budget.

CI/CD and Independent Deployability

Microservices only deliver speed if each service can deploy independently.

Per-Service Pipelines

Each repo (or each path in a monorepo) has its own Cloud Build trigger, its own image in Artifact Registry, and its own Cloud Deploy delivery pipeline. A change to Payments must never force a redeploy of Inventory.

Progressive Delivery

Use Cloud Deploy with ASM traffic-split rendering or use Cloud Run revisions (gcloud run services update-traffic SERVICE --to-revisions=v2=10,v1=90, where SERVICE is the service name and v1 and v2 stand for revision names) to shift traffic 10% → 50% → 100% gated on Cloud Monitoring SLO checks.

Schema Migrations Without Downtime

Use expand-and-contract migrations: deploy a version that writes to both the old and new columns, backfill existing rows, deploy a version that reads from the new column, and only then remove the old column. Never make a contract change and a schema change in the same release.

Branching a "shared library" repo that every microservice imports recreates monolith coupling at build time. Either accept short-term duplication or publish the library as a versioned package in Artifact Registry so services upgrade on their own schedule.

Plain English Explanation

Analogy 1: Specialty Shops vs. One Mega-Supermarket

A monolith is a single mega-supermarket — bakery, butcher, pharmacy, electronics all under one roof and one power circuit. A blown fuse closes the entire store. Microservices are a high street of specialty shops. The bakery shutting for renovation does not stop the florist from selling roses. Each shopkeeper picks their own equipment (database), opening hours (release cadence), and currency (programming language). The downside? Customers now have to walk between shops, which is exactly what network calls in a distributed system feel like.

Analogy 2: A Restaurant Kitchen Brigade

A great kitchen is a microservices system. The garde manger handles cold dishes, the saucier owns sauces, the pâtissier owns desserts. They speak through clear tickets ("API contracts"), pass items via the pass ("Pub/Sub topic"), and a head chef ("API gateway") coordinates outgoing plates with the dining room. If the pastry chef calls in sick, mains still flow. A monolithic kitchen with one person doing everything collapses the moment that person sneezes, and that one person is also the ceiling on throughput.

Analogy 3: LEGO City vs. a Plastic Model Kit

A plastic model kit (monolith) is glued together: one wrong cut and the whole thing is ruined. A LEGO city (microservices) is composed of swappable blocks. You can replace a fire station (one service) with a hospital, as long as the studs (API contracts) match. The price is more bricks to keep track of, dedicated baseplates (clusters), and you must label your bins (Service Directory) so children (developers) can find the right pieces fast.

Frequently Asked Questions (FAQs)

Q1: When should I prefer Cloud Run over GKE for microservices?

A1: Cloud Run wins for stateless HTTP/gRPC services where per-request scaling, scale-to-zero, and zero ops overhead matter most. Pick GKE (or Cloud Run for Anthos) when you need DaemonSets, custom sidecars beyond the mesh, GPUs, persistent volumes, or batch workloads. Many teams run both: Cloud Run for the edge BFFs and Cloud Run jobs, GKE for the stateful core.

Q2: How do I implement a Saga without a central orchestrator?

A2: Use a choreography saga with Pub/Sub. Each step is a service that subscribes to the prior event and publishes its outcome. Compensation is just another subscription on a *Failed event. Add idempotency keys so retries are safe, and ship a separate "Saga Monitor" service that ingests all events into BigQuery so you can audit end-to-end flow when something goes wrong.

Q3: What is the difference between API Gateway and a service mesh?

A3: API Gateway is the north-south door — external clients enter the platform through it (auth, rate limiting, API keys). A service mesh handles east-west traffic — service-to-service inside the platform (mTLS, retries, traffic shifting, observability). They are complementary; a typical architecture has API Gateway or Apigee at the edge and Anthos Service Mesh inside.

Q4: How do I propagate user identity across many internal services?

A4: Two layers. (1) The gateway validates the end-user's Firebase Auth / OAuth2 token and forwards a verified X-End-User-Id header plus a JWT in X-Forwarded-Authorization. (2) Service-to-service calls add their own OIDC ID token from the metadata server. The callee thus has both who the human is (from header) and who the calling service is (from Authorization: Bearer), enabling fine-grained authorization.

Q5: How small should a microservice be?

A5: Small enough that one team (≤ 8 people) can fully own it, large enough that one business capability lives inside a single bounded context. "Two-pizza team" is a heuristic, not a rule. If a single user-facing feature requires touching 5 services, those services are too small. If a single team owns 20 services with similar release schedules, you have over-decomposed.

Q6: When does Pub/Sub make sense versus a synchronous gRPC call?

A6: Use Pub/Sub when the producer does not need the consumer's result immediately, when multiple consumers must see the same event, or when load spikes need absorbing. Use gRPC when the caller blocks on a result (login, payment authorization) and the SLA requires < 200 ms response. A common hybrid: synchronous "command" RPC up front, asynchronous "domain event" published on success for everything else to react to.
