Introduction to Environment Management
Effective environment management is a cornerstone of operational excellence for a Professional Cloud Architect. On Google Cloud, environments (Dev, Test, Staging, Prod) must be architected to ensure isolation, reproducibility, and security. The goal is to minimize "works on my machine" syndrome by maintaining high environment parity while controlling costs and securing sensitive production data.
Plain-Language Explanation: Environment Management
Analogy 1 — The Michelin Star Kitchen
Think of Development as the Chef's home kitchen, where they experiment with new flavors. Testing is the private tasting room, where they perfect the dish. Staging is the soft opening, where everything is exactly like the real restaurant but for invited guests. Production is the Main Dining Room on a Saturday night. You never try a brand-new recipe for the first time in the main dining room!
Analogy 2 — The Movie Set
Development is the rehearsal room. Testing is the CGI and sound editing studio. Staging is the private screening for the producers. Production is the worldwide theatrical release. If you find a boom mic in the shot during the rehearsal, it's fine; if it's in the theatrical release, it's a disaster.
Analogy 3 — The Pilot Plant
In chemical engineering, you don't build a massive factory immediately. You build a lab scale (Dev), then a pilot plant (Staging) to see if the process scales up. Only when the pilot plant works do you build the full-scale refinery (Production).
Environment parity is the practice of keeping development, staging, and production environments as similar as possible so that code behaves consistently across the entire delivery pipeline.
Designing the Resource Hierarchy for Environments
The most robust way to isolate environments on GCP is at the Project level within a Folder structure.
- Project Isolation: Each environment should have its own project (e.g., app-dev, app-staging, app-prod). This provides the strongest security boundary.
- Shared VPC: Use a Shared VPC to manage networking centrally while allowing individual environment projects to use subnets dedicated to them.
- Folder-Level Policies: Apply Organization Policies at the folder level (e.g., a "Development" folder) to enforce different rules (like allowing external IPs in Dev but not in Prod).
Environment Promotion Workflows
Moving code and infrastructure from one environment to the next should be automated via CI/CD.
- Artifact Promotion: Never rebuild your container or binary for different environments. Build once in Dev/Build phase, store in Artifact Registry, and deploy the same image to Staging and then Prod.
- Configuration as Code: Use environment-specific configuration files or secret managers to inject environment-specific values (like DB connection strings) into the immutable artifact.
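The build-once, configure-per-environment idea above can be sketched as follows. This is only an illustration: the `app-build` registry project, the all-zero digest, the `ENV_CONFIG` values, and the `render_deploy` helper are hypothetical names, not part of any Google Cloud API.

```python
from dataclasses import dataclass

# Hypothetical per-environment settings; real values would come from
# config files or a secret manager rather than source code.
ENV_CONFIG = {
    "dev": {"project": "app-dev", "min_instances": 0},
    "staging": {"project": "app-staging", "min_instances": 1},
    "prod": {"project": "app-prod", "min_instances": 2},
}


@dataclass(frozen=True)
class Deployment:
    env: str
    project: str
    image: str
    min_instances: int


def render_deploy(env: str, image: str) -> Deployment:
    """Combine one immutable image reference with environment-specific config."""
    if "@sha256:" not in image:
        # Tags can be moved; a digest pins the exact bytes that were tested.
        raise ValueError("promote by digest, not by mutable tag")
    cfg = ENV_CONFIG[env]
    return Deployment(env, cfg["project"], image, cfg["min_instances"])


# Placeholder digest for illustration only.
IMAGE = "us-central1-docker.pkg.dev/app-build/app/web@sha256:" + "0" * 64
plan = [render_deploy(env, IMAGE) for env in ("dev", "staging", "prod")]
```

Every entry in `plan` carries the same image reference; only the project and scaling settings differ, which is the property the promotion workflow is meant to guarantee.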
Architect's Insight: For the PCA exam, always advocate for Project-level isolation over simple labeling or shared projects for different environments. This ensures that a misconfiguration in Dev (like a broad firewall rule) cannot affect Prod.
Managing Test Data and Sandboxes
- Data Masking: When using production data for testing, always use Sensitive Data Protection (DLP) to mask PII.
- Synthetic Data: Prefer generating synthetic data that mimics production volume and variety without the security risk.
- Sandboxes: Provide "Playground" projects for developers that have strict budget alerts and auto-cleanup scripts to encourage innovation without runaway costs.
Dev/Staging/Prod Project Separation Patterns
The PCA exam consistently rewards designs that map each environment to a dedicated GCP project under a folder structure rooted at the Organization node. The canonical hierarchy is Organization → Environments folder → {dev, nonprod, prod} sub-folders → workload projects. Each environment folder gets its own billing budgets (or a separate Cloud Billing account where chargeback requires it; billing subaccounts are a reseller feature), its own break-glass groups in Cloud Identity, and its own set of Organization Policy bindings.
Why project-level (not label-level) separation
- Quotas are per-project. A runaway Cloud Run revision in dev cannot exhaust the prod Compute Engine CPU quota when each lives in its own project.
- IAM blast radius shrinks. Granting roles/owner on a dev project never leaks into prod data in BigQuery or Cloud SQL.
- Logs and audit trails stay clean. Cloud Audit Logs are aggregated by project, so security review on app-prod is not polluted by developer experiments.
Folder-level guardrails to apply
| Policy | dev folder | prod folder |
|---|---|---|
| compute.vmExternalIpAccess | Allow (subset) | Deny |
| iam.allowedPolicyMemberDomains | Allow contractors | Restrict to corp domain |
| storage.uniformBucketLevelAccess | Enforce | Enforce |
| compute.requireOsLogin | Enforce | Enforce |
| sql.restrictPublicIp | Allow | Enforce |
Use Terraform with a google_folder resource and google_org_policy_policy resources so the guardrails are version-controlled. Pair this with folder-level aggregated sinks in Cloud Logging that route prod audit events to a dedicated BigQuery dataset for compliance retention (often 7 years), while dev logs can expire after 30 days to control cost.
For PCA scenarios that mention "a developer accidentally deleted a database" or "non-prod IAM granted to prod", the correct answer is almost always dedicated projects under environment folders with Organization Policy guardrails — never tags, labels, or namespaces inside a shared project.
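One way to keep the guardrail table honest is to compare it against what a folder actually enforces. The sketch below assumes the set of enforced constraint names has already been exported by some other tool; `REQUIRED` and `missing_guardrails` are hypothetical names, and only the rows marked Enforce, Deny, or Restrict are treated as required.

```python
# Constraints each folder must restrict, transcribed from the table above.
REQUIRED = {
    "dev": {
        "storage.uniformBucketLevelAccess",
        "compute.requireOsLogin",
    },
    "prod": {
        "compute.vmExternalIpAccess",
        "iam.allowedPolicyMemberDomains",
        "storage.uniformBucketLevelAccess",
        "compute.requireOsLogin",
        "sql.restrictPublicIp",
    },
}


def missing_guardrails(folder: str, enforced: set) -> list:
    """Return required constraints that the folder does not enforce, sorted."""
    return sorted(REQUIRED[folder] - set(enforced))


# Example: a prod folder where only two constraints were applied.
gaps = missing_guardrails(
    "prod", {"compute.requireOsLogin", "storage.uniformBucketLevelAccess"}
)
```

A non-empty result is a signal to fix the Terraform definition rather than to patch the folder by hand.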
Shared VPC for Environment Network Isolation
A common architecture is one host project per environment (e.g., net-host-prod, net-host-nonprod) with workload projects attached as service projects. This centralizes network admin in the network team while letting application teams retain IAM on compute and data services.
Reference layout
- net-host-prod (host project) owns the vpc-prod VPC with subnets subnet-prod-asia-east1 and subnet-prod-us-central1.
- app-prod-* service projects consume subnets via roles/compute.networkUser granted on specific subnets only, never on the whole host project.
- net-host-nonprod hosts the vpc-nonprod VPC; VPC Peering between prod and nonprod is deliberately not configured, so a misconfigured nonprod workload cannot reach the prod database tier.
Firewall policy hierarchy
- Hierarchical firewall policies at the folder level deny all traffic from 0.0.0.0/0 except for ingress on tcp:443 behind the global Load Balancer.
- Network firewall policies at the VPC level allow east-west traffic only between tagged service accounts (e.g., [email protected] to [email protected]).
- Cross-environment traffic must traverse a documented egress path, typically through Cloud NAT plus Private Service Connect endpoints, never via direct VPC peering between dev and prod.
Why not a single Shared VPC for all environments
A single VPC means a single CIDR plan, single firewall surface, and a single set of routes. One misconfigured route in dev can blackhole prod traffic. Separate Shared VPCs per environment add 5 minutes of Terraform but remove an entire class of cross-env incidents.
IAM Scoping Per Environment
Identity sprawl is the most common source of environment leakage. The pattern that scores on PCA: groups bound at the lowest necessary scope, service accounts isolated per environment, and no human holding standing privileged access to prod.
Group + scope matrix
- gcp-app-developers@corp → bound to roles/editor on the dev folder only.
- gcp-app-sre@corp → bound to roles/run.developer + roles/logging.viewer on the prod folder, with break-glass access via Privileged Access Manager (PAM) for roles/run.admin.
- gcp-app-deployer@corp (CI/CD only, no humans) → bound to deploy-specific roles on each environment.
Service account hygiene
- Each environment has its own service account namespace: [email protected] is distinct from [email protected].
- Disable service account key creation via the iam.disableServiceAccountKeyCreation Organization Policy on the prod folder.
- Use Workload Identity Federation for CI/CD instead of long-lived JSON keys: GitHub Actions or GitLab CI assume the deployer SA via OIDC.
- Cross-project SA impersonation must be explicit: a dev SA cannot impersonate a prod SA because roles/iam.serviceAccountTokenCreator is never granted across the env boundary.
Audit checklist for PCA
- Run gcloud asset search-all-iam-policies --scope=folders/PROD_FOLDER_ID --query="memberTypes:user" quarterly to find any direct user bindings that should be group bindings.
- Use IAM Recommender (part of Policy Intelligence, alongside Policy Analyzer) to surface unused permissions, and accept the recommendations on dev first to validate the pattern before applying to prod.
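The quarterly audit for direct user bindings can be post-processed with a few lines of code. This sketch assumes the search results were saved as JSON and that each result carries a `resource` string and a `policy.bindings` list of `role` and `members`; the sample data and the `direct_user_bindings` helper are hypothetical.

```python
def direct_user_bindings(results: list) -> list:
    """Return (resource, role, member) for every binding that names a user directly."""
    findings = []
    for item in results:
        resource = item.get("resource", "")
        for binding in item.get("policy", {}).get("bindings", []):
            for member in binding.get("members", []):
                # Groups and service accounts are expected; individual users are not.
                if member.startswith("user:"):
                    findings.append((resource, binding.get("role", ""), member))
    return findings


# Hypothetical sample in the assumed result shape.
sample = [
    {
        "resource": "//cloudresourcemanager.googleapis.com/projects/app-prod",
        "policy": {
            "bindings": [
                {"role": "roles/run.developer", "members": ["group:sre@example.com"]},
                {"role": "roles/viewer", "members": ["user:alice@example.com"]},
            ]
        },
    }
]
findings = direct_user_bindings(sample)
```

Each finding is a candidate for replacement by a group binding at the same scope.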
Granting roles/iam.serviceAccountUser on a prod service account to the dev-developers group is a silent prod breach — the developer can launch a Compute Engine VM as the prod SA and read prod data. Always scope roles/iam.serviceAccountUser to the specific project where the SA is allowed to be attached.
Sandbox Folders for Experimentation
Sandboxes are the safety valve that prevents shadow IT. The PCA-recommended pattern is an Organization → Sandbox folder, isolated from the main Environments folder, with aggressive guardrails.
Sandbox folder configuration
- Auto-cleanup: A scheduled Cloud Function triggered by Cloud Scheduler lists projects (the equivalent of gcloud projects list --filter="parent.id=SANDBOX_FOLDER AND createTime<-P30D") and deletes any project older than 30 days unless labeled keep=true.
- Budget cap: Each sandbox project is created from a Project Factory Terraform module that auto-attaches a Cloud Billing budget of $50/month with a Pub/Sub trigger that disables billing at 100%.
- No prod data: An Organization Policy gcp.restrictServiceUsage denies bigquerydatatransfer.googleapis.com and datastream.googleapis.com in the Sandbox folder, preventing accidental pipelines from prod.
- Public IP allowed, but logged: Unlike prod, sandboxes can have external IPs for quick demos, but VPC Flow Logs stream to a security SIEM so security can spot exfil patterns.
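The auto-cleanup rule (older than 30 days, unless marked keep=true) is simple enough to express as a pure function. This is a sketch of the selection logic only: it assumes project records shaped like a JSON project listing, with `projectId`, an RFC 3339 `createTime`, and optional `labels`; the `expired_sandboxes` helper and the sample projects are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)


def expired_sandboxes(projects: list, now: datetime) -> list:
    """Return IDs of sandbox projects older than MAX_AGE and not marked keep=true."""
    doomed = []
    for project in projects:
        if project.get("labels", {}).get("keep") == "true":
            continue  # explicitly exempted by its owner
        # Assumes timestamps like 2024-01-01T00:00:00Z (UTC, no fractional part).
        created = datetime.fromisoformat(project["createTime"].replace("Z", "+00:00"))
        if now - created > MAX_AGE:
            doomed.append(project["projectId"])
    return doomed


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
sample = [
    {"projectId": "sbx-old", "createTime": "2024-01-01T00:00:00Z"},
    {"projectId": "sbx-kept", "createTime": "2024-01-01T00:00:00Z", "labels": {"keep": "true"}},
    {"projectId": "sbx-new", "createTime": "2024-02-20T00:00:00Z"},
]
to_delete = expired_sandboxes(sample, now)
```

Keeping the selection separate from the deletion call makes it easy to run the job in a report-only mode before letting it delete anything.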
Self-service workflow
Developers request a sandbox via a Backstage or internal portal that calls a Cloud Build trigger. The build runs the project-factory Terraform, assigns the requester as roles/owner on that one project (never on the folder), and posts the project ID to Slack. This shifts experimentation out of the dev environment (where it pollutes shared state) into an ephemeral, capped space.
Terraform Workspaces vs Directory-per-Environment
Terraform offers two patterns for multi-environment IaC and the PCA exam tests when each is appropriate.
Workspaces (CLI workspaces or Terraform Cloud workspaces)
# main.tf - shared
resource "google_compute_network" "vpc" {
  name    = "vpc-${terraform.workspace}"
  project = var.project_ids[terraform.workspace]
}
- Single codebase, state file per workspace.
- Pro: Minimal duplication when environments are truly identical in shape.
- Con: Easy to apply to the wrong workspace: terraform workspace select prod is one typo away from disaster.
Directory-per-environment (recommended for PCA)
infra/
  modules/
    network/
    gke-cluster/
  envs/
    dev/main.tf      (calls modules with dev vars)
    staging/main.tf  (calls modules with staging vars)
    prod/main.tf     (calls modules with prod vars, separate backend)
- Each env has its own backend bucket: gs://tf-state-prod, gs://tf-state-dev, with bucket-level IAM that prevents the dev pipeline SA from touching prod state.
- Promotion = PR that bumps a module version pin (source = "git::...//modules/gke-cluster?ref=v1.4.2") in envs/staging/main.tf, then a follow-up PR in envs/prod/main.tf.
- This pattern aligns with Cloud Build triggers filtered by changed paths: only changes under envs/prod/ trigger the prod plan/apply pipeline, gated on manual approval.
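The path-filter behaviour described above can be illustrated with a small function that maps a commit's changed files to the environment pipelines that should run. It assumes the infra/envs/<env>/ layout shown earlier; `affected_envs` is a hypothetical helper, not a Cloud Build feature.

```python
from pathlib import PurePosixPath


def affected_envs(changed_files: list) -> list:
    """Return the environments whose directory contains at least one changed file."""
    envs = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        # Expect infra/envs/<env>/<file...>; anything else triggers nothing.
        if len(parts) >= 4 and parts[0] == "infra" and parts[1] == "envs":
            envs.add(parts[2])
    return sorted(envs)


changed = [
    "infra/envs/staging/main.tf",
    "infra/modules/gke-cluster/main.tf",
    "README.md",
]
pipelines = affected_envs(changed)
```

Note that a change under modules/ triggers no environment on its own here. That matches the promotion model in the list above, where an environment only picks up module changes when its version pin is bumped in a separate PR.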
For mixed teams, Terraform Cloud workspaces backed by VCS combine the best of both: directory-per-env in Git, with workspace-level run gates and audit logs.
Environment-Specific Secret Manager Layout
Secrets must never be shared across environments — a leaked dev secret should not unlock prod. Secret Manager in GCP supports this through per-project secrets plus IAM at the secret level.
Naming and project layout
- Secret name stays stable across environments: db-password, stripe-api-key, oauth-client-secret.
- Each environment has its own project hosting those secrets: secrets-dev, secrets-staging, secrets-prod.
- The application reads projects/secrets-${ENV}/secrets/db-password/versions/latest, so the only thing that changes is the project prefix injected at deploy time.
Access controls per environment
- [email protected] gets roles/secretmanager.secretAccessor on projects/secrets-prod only.
- roles/secretmanager.admin is never granted to humans on the prod secrets project; rotation happens via a Cloud Scheduler + Cloud Function that calls Secret Manager and the upstream provider API.
- CMEK encryption is enabled on prod secrets using a Cloud KMS key in a separate kms-prod project, so the key's custodians can disable or revoke it independently of Secret Manager and make the payloads unreadable. CMEK alone does not stop a principal who already holds secretmanager.versions.access, because Secret Manager's service agent performs the decryption, so secret-level IAM remains the primary control.
Rotation and versioning
- A rotation schedule is configured via the rotation field on the secret resource. Secret Manager then publishes a SECRET_ROTATE notification to the secret's Pub/Sub topic, and a Cloud Function subscribed to that topic performs the actual rotation.
- Old versions are disabled (not destroyed) for 30 days to support rollback.
- For PCA: when a question describes "a secret was leaked in dev logs and we need to ensure prod is unaffected", the answer leverages the per-environment project separation — no rotation needed in prod because the dev secret never had prod access.
Inject the env-specific secret project ID at deploy time via Cloud Build substitution variables (for example _SECRET_PROJECT=secrets-$_ENV, matching the secrets-dev, secrets-staging, secrets-prod naming above). Your application code stays environment-agnostic and reads from projects/${SECRET_PROJECT}/secrets/..., which leaves one less branch to maintain.
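On the application side, staying environment-agnostic amounts to building the secret's resource name from an injected project ID. A minimal sketch, assuming the project ID arrives in a SECRET_PROJECT environment variable; the `secret_version_name` helper is hypothetical and only builds the name, leaving the actual Secret Manager call to the client library.

```python
import os
from typing import Mapping, Optional


def secret_version_name(
    secret_id: str,
    version: str = "latest",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Build a Secret Manager version resource name from the injected project ID."""
    env = os.environ if env is None else env
    # Fail loudly if the deploy step forgot to inject the project ID;
    # silently falling back to a default could read another environment's secret.
    project = env["SECRET_PROJECT"]
    return f"projects/{project}/secrets/{secret_id}/versions/{version}"


name = secret_version_name("db-password", env={"SECRET_PROJECT": "secrets-prod"})
```

The same code yields projects/secrets-dev/... in dev and projects/secrets-prod/... in prod without any branching on the environment name.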
Cloud Build Substitutions for Per-Environment Pipelines
A single cloudbuild.yaml parameterized with substitutions lets you maintain one pipeline definition across dev/staging/prod, with the per-env values supplied by the trigger.
Substitution-driven pipeline
# cloudbuild.yaml
substitutions:
  _ENV: dev
  _PROJECT_ID: app-dev
  _REGION: us-central1
  _MIN_INSTANCES: '0'
  _CMEK_KEY: ''
steps:
  - id: build
    name: gcr.io/cloud-builders/docker
    args: ['build', '-t', 'us-central1-docker.pkg.dev/$_PROJECT_ID/app/web:$SHORT_SHA', '.']
  # Push before deploying so Cloud Run can pull the image.
  - id: push
    name: gcr.io/cloud-builders/docker
    args: ['push', 'us-central1-docker.pkg.dev/$_PROJECT_ID/app/web:$SHORT_SHA']
  - id: deploy
    name: gcr.io/google.com/cloudsdktool/cloud-sdk
    entrypoint: gcloud
    args:
      - run
      - deploy
      - web-$_ENV
      - --image=us-central1-docker.pkg.dev/$_PROJECT_ID/app/web:$SHORT_SHA
      - --region=$_REGION
      - --min-instances=$_MIN_INSTANCES
      - --project=$_PROJECT_ID
Trigger configuration per environment
- dev trigger: Fires on push to main. Substitutions: _ENV=dev, _PROJECT_ID=app-dev, _MIN_INSTANCES=0.
- staging trigger: Fires on tag matching the regex ^staging-.*$. Substitutions: _ENV=staging, _PROJECT_ID=app-staging, _MIN_INSTANCES=1.
- prod trigger: Fires on tag matching the regex ^v[0-9]+\.[0-9]+\.[0-9]+$ (dots escaped so that only real version tags match). Substitutions: _ENV=prod, _PROJECT_ID=app-prod, _MIN_INSTANCES=2, _CMEK_KEY=projects/kms-prod/.... Requires manual approval (approval_config.approval_required: true).
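The trigger routing above can be checked offline before it is wired into Cloud Build. The sketch below assumes the prod pattern is meant to match semantic-version tags with literal dots and the staging pattern is meant to match any tag starting with staging-; `target_env` and both compiled patterns are hypothetical names for illustration.

```python
import re
from typing import Optional

# Dots are escaped so that "." cannot match an arbitrary character.
PROD_TAG = re.compile(r"v[0-9]+\.[0-9]+\.[0-9]+")
STAGING_TAG = re.compile(r"staging-.+")


def target_env(ref: str) -> Optional[str]:
    """Map a Git ref to the environment whose trigger should fire, if any."""
    if ref == "refs/heads/main":
        return "dev"
    if ref.startswith("refs/tags/"):
        tag = ref[len("refs/tags/"):]
        if PROD_TAG.fullmatch(tag):
            return "prod"
        if STAGING_TAG.fullmatch(tag):
            return "staging"
    return None


# An unescaped pattern accepts tags that are not versions at all.
loose_match = re.fullmatch(r"v[0-9]+.[0-9]+.[0-9]+", "v1x4x2") is not None
```

With the unescaped pattern, a tag such as v1x4x2 would start a prod deployment, which is why the stricter form is worth the extra backslashes.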
Each Cloud Build trigger runs as its own service account. The prod trigger SA gets roles/run.admin only on app-prod; the dev trigger SA gets the same role only on app-dev. Never reuse one builder SA across environments — that single account becomes the highest-value target in the entire org.
Build-time vs runtime substitution
Only image tags, region, and resource names should differ at build time. Application config (feature flags, log levels) should come from runtime sources (Secret Manager, Firestore, App Config equivalent via Cloud Storage) so a config change does not require a redeploy. The image SHA promoted from dev to prod must be byte-identical.
GKE Config Sync for Per-Environment Policies
For organizations running GKE across environments, Config Sync (part of Anthos Config Management) provides GitOps-driven policy and config delivery per cluster.
Repository structure
config-sync/
  base/                        (common Kustomize bases: namespaces, network policies)
  overlays/
    dev/
      patches/
        deny-internet.yaml     (less restrictive)
        resource-quotas.yaml   (smaller quotas)
    staging/
    prod/
      patches/
        deny-internet.yaml     (deny all egress except allowlist)
        resource-quotas.yaml   (production-sized)
        pod-security.yaml      (restricted PSS)
Each GKE cluster (gke-dev, gke-staging, gke-prod) has its RootSync resource pointing to a different overlay path in the same Git repo, ensuring all clusters track a single source of truth while applying environment-appropriate constraints.
Policy Controller (Gatekeeper) per env
Policy Controller enforces constraints via OPA Gatekeeper templates:
- dev: warn mode on most constraints, so developers see violations but deploys aren't blocked.
- staging: deny mode on critical constraints (no privileged containers, no hostNetwork) but warn-only on image registry restrictions.
- prod: deny mode on all constraints, including K8sAllowedRepos that restricts images to *-docker.pkg.dev/PROD_PROJECT/*.
Drift detection and remediation
Config Sync continuously reconciles cluster state against Git. If a developer manually edits a prod NetworkPolicy via kubectl edit, Config Sync reverts it within seconds and emits a metric to Cloud Monitoring that triggers a PagerDuty alert. This makes prod effectively immutable outside the Git workflow — the same property you want from your IaC layer, applied at the Kubernetes API layer.
Ephemeral Preview Environments via Cloud Run
Per-PR preview environments dramatically improve developer feedback loops. Cloud Run is the GCP service of choice because of its fast cold starts (typically a few seconds or less, depending on image size and runtime), scale-to-zero billing, and stable URLs per service and per tagged revision.
Implementation pattern
- A GitHub Actions workflow triggered on pull_request builds the container, tags it pr-${PR_NUMBER}-${SHA}, and pushes to Artifact Registry.
- The workflow runs gcloud run deploy preview-pr-${PR_NUMBER} --image=... --region=us-central1, creating a new Cloud Run service per PR that serves the freshly deployed revision.
- Cloud Run returns a stable URL for the service, such as https://preview-pr-123-xyz-uc.a.run.app, which the workflow posts as a PR comment.
- When the PR closes (merged or abandoned), a follow-up workflow runs gcloud run services delete preview-pr-${PR_NUMBER}.
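The naming and command construction in the steps above can be kept in one place so that the deploy and cleanup workflows never disagree. A minimal sketch: `preview_plan`, the `repo` argument, and the example repository path are hypothetical, and the commands are returned as argument lists instead of being executed.

```python
import re


def preview_plan(pr_number: int, sha: str, repo: str, region: str = "us-central1") -> dict:
    """Derive service name, image tag, and gcloud argument lists for one PR preview."""
    if pr_number <= 0:
        raise ValueError("pr_number must be positive")
    if not re.fullmatch(r"[0-9a-f]{7,40}", sha):
        raise ValueError("sha must be a lowercase hex commit hash")
    service = f"preview-pr-{pr_number}"
    image = f"{repo}:pr-{pr_number}-{sha}"
    return {
        "service": service,
        "image": image,
        "deploy": [
            "gcloud", "run", "deploy", service,
            f"--image={image}", f"--region={region}",
            "--no-allow-unauthenticated",
        ],
        # --quiet skips the interactive confirmation in CI.
        "delete": [
            "gcloud", "run", "services", "delete", service,
            f"--region={region}", "--quiet",
        ],
    }


plan = preview_plan(123, "0a1b2c3", "us-central1-docker.pkg.dev/app-preview/app/web")
```

Because the service name depends only on the PR number, repeated pushes to the same PR update one service, and the close event can delete it without looking anything up.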
Cost and security guardrails
- All preview services live in a dedicated app-preview project (not dev, not staging).
- --min-instances=0 and --max-instances=3 keep idle cost at zero and active cost capped.
- The preview SA has read-only access to a sanitized copy of staging data in a separate BigQuery dataset, never to production.
- Cloud Run authentication is enabled (--no-allow-unauthenticated) and developers authenticate via gcloud auth print-identity-token or via Identity-Aware Proxy bound to the corp domain.
Why not GKE for previews
A GKE namespace per PR works but adds scheduling latency (pulling images, scheduling pods) and namespace cleanup overhead. Cloud Run gives you a fresh URL in under 30 seconds, scales to zero between developer interactions, and removes the need to manage cluster resources. Reserve GKE-based previews for workloads with sticky state (databases, stateful sets) that Cloud Run cannot host.
Data Anonymization for Staging Environments
Staging must mirror production realistically without becoming a compliance liability. The combination of Sensitive Data Protection (DLP), Dataflow, and Cloud KMS provides a repeatable anonymization pipeline.
Pipeline architecture
- A scheduled query with an EXPORT DATA statement (scheduled queries run on the BigQuery Data Transfer Service) snapshots prod tables nightly to a staging-bound Cloud Storage bucket, using a service account that has read-only access on prod.
- A Dataflow template (the Google-provided Stream_DLP_GCS_Text_to_BigQuery template) reads the snapshot, calls the DLP API with a de-identification template, and writes to the staging BigQuery dataset.
- The DLP config uses:
  - Format-preserving encryption (FPE) with FFX on email and phone fields so referential integrity is preserved (the same prod email always maps to the same staging email). FFX works over a fixed alphabet, so values with mixed characters such as emails may need DLP's deterministic encryption instead.
  - Cryptographic hashing on SSN/national-ID fields, with the key in Cloud KMS in a separate kms-deid project.
  - Bucketing on numeric fields like salary (< 50000, 50000-100000, > 100000) to retain statistical properties without exact values.
  - Date shift on timestamps with an offset that is consistent per entity (keyed on a context field such as a customer ID) so time-series analysis still works.
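To make the transformations above concrete, here is a standard-library illustration of the ideas behind them. It is not the DLP API and not format-preserving encryption: HMAC-SHA256 stands in for keyed, consistent pseudonymization, and the key, function names, and 30-day shift window are assumptions chosen for the example.

```python
import hashlib
import hmac
from datetime import date, timedelta


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed one-way token: the same input and key always give the same output."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def salary_bucket(salary: float) -> str:
    """Replace an exact salary with the coarse range it falls into."""
    if salary < 50000:
        return "< 50000"
    if salary <= 100000:
        return "50000-100000"
    return "> 100000"


def shift_date(d: date, entity_id: str, key: bytes, max_days: int = 30) -> date:
    """Shift a date by an offset in [-max_days, max_days] derived from the entity ID."""
    digest = hmac.new(key, entity_id.encode("utf-8"), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)


KEY = b"demo-key-not-for-production"  # in practice the key would live in Cloud KMS
token_a = pseudonymize("alice@example.com", KEY)
token_b = pseudonymize("alice@example.com", KEY)
first = shift_date(date(2024, 1, 10), "customer-42", KEY)
second = shift_date(date(2024, 1, 25), "customer-42", KEY)
```

Because both dates for customer-42 move by the same offset, the 15-day gap between them survives the shift, which is what keeps time-series analysis meaningful on the anonymized data.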
Re-identification controls
- The FPE key is stored as a KMS-wrapped key, and only the deidentification SA can use the Cloud KMS key that wraps it. Re-identification requires explicit IAM grants that trigger a Cloud Audit Log alert.
- Staging access is gated by a separate group (gcp-app-staging-readers@corp) that is not granted to product engineers by default. Even though the data is anonymized, the principle of least privilege still applies.
Validation before promotion
Before staging data goes live, a Cloud Run job runs k-anonymity and l-diversity checks (using DLP's RiskAnalysisJob) to confirm the anonymized dataset meets the org's privacy thresholds (e.g., k≥5). If the job fails, the staging refresh aborts and an alert fires — preventing accidentally weak anonymization from reaching engineering eyes.
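The k-anonymity gate can be understood with a few lines of code: k is the size of the smallest group of rows that share the same quasi-identifier values. The sketch below is a plain in-memory illustration, not a DLP risk analysis job; the column names, sample rows, and helper names are hypothetical.

```python
from collections import Counter


def k_anonymity(rows: list, quasi_identifiers: list) -> int:
    """Return the size of the smallest equivalence class over the quasi-identifiers."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def refresh_allowed(rows: list, quasi_identifiers: list, k_min: int = 5) -> bool:
    """Gate the staging refresh on the privacy threshold (k >= k_min)."""
    return k_anonymity(rows, quasi_identifiers) >= k_min


rows = (
    [{"zip3": "941", "age_band": "30-39"}] * 6
    + [{"zip3": "100", "age_band": "40-49"}] * 2
)
k = k_anonymity(rows, ["zip3", "age_band"])
```

Here the second group has only two members, so k is 2 and a threshold of 5 would abort the refresh until those values are generalized further.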
For PCA scenarios on test data: DLP + Dataflow is the GCP-native answer for anonymization at scale, with format-preserving encryption for fields that need referential integrity, cryptographic hashing for one-way pseudonymization, and bucketing/date shift for statistical fields. Keep the deidentification KMS key in a separate project to enforce separation of duties.
FAQ — Environment Management
Q1. How do we ensure parity between Dev and Prod?
Use Infrastructure as Code (Terraform). Define your infrastructure once as modules and use variables to deploy it to multiple projects. This keeps the network, firewall, and service configurations consistent, with any differences between environments limited to explicit, reviewed variables.
Q2. Should Dev and Prod be in the same VPC?
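No. Keep them in separate VPCs, ideally a separate Shared VPC host project per environment with no peering between them. Placing both in one Shared VPC on different subnets, isolated only by firewall rules, is a weaker fallback because the environments still share routes and a firewall surface.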
Q3. How do we control costs in Dev/Test?
Use Cloud Scheduler to shut down non-production VMs during off-hours, use Preemptible/Spot VMs, and set aggressive Cloud Billing Budgets.
Q4. What is the role of the Staging environment?
Staging is for Final Validation. It should be a 1:1 replica of Production, including data volume (masked) and traffic patterns, to catch performance issues and integration bugs.
Q5. Can we use the same Service Accounts across environments?
Absolutely not. Each environment must have its own unique Service Accounts with permissions scoped only to that environment's resources.