Introduction to Governance and Change Management
For a Professional Cloud Architect, governance is the "set of rules" that ensures the cloud environment remains secure, compliant, and cost-effective as it grows. Change management is the "process" by which those rules are updated and new infrastructure is deployed without causing outages.
In the cloud, governance is enforced through code and automated policies, rather than just PDF documents and manual approvals. Effective governance balances agility (the ability to move fast) with control (the ability to minimize risk).
Plain English Explanation
Analogy 1 — Hospital Surgical Consent Form (Change Advisory Board)
A hospital never lets a surgeon cut a patient based on a verbal "trust me." Before any operation, a consent form is signed, the procedure is logged, and a second physician co-signs high-risk cases. In GCP terms, this is the Change Advisory Board (CAB) plus a Pull Request: the terraform plan output is the "X-ray," peer review is the "second physician," and the merge commit is the signed consent form. If something goes wrong, the audit log (Cloud Audit Logs) tells you exactly who approved what, when.
Analogy 2 — Civil Aviation Authority Takeoff Clearance (Change Windows)
No airliner takes off whenever the captain feels like it. The CAA tower assigns specific takeoff slots, weather windows, and runway permissions. Production deployments work the same way: you define change windows (e.g., Tue–Thu 10:00–16:00, never on Friday afternoons), block freezes during peak shopping season, and require a "tower clearance" from on-call before a Cloud Deploy rollout proceeds. Org Policy constraints like constraints/compute.requireOsLogin enforce the equivalent of "no aircraft without a transponder."
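A change window like the one above can be expressed as a small check that a pipeline runs before it starts a rollout. The following is a minimal sketch, not a Cloud Deploy feature: the constant names, the `in_change_window` function, and the freeze dates are all hypothetical and chosen only to mirror the Tue–Thu 10:00–16:00 example.

```python
from datetime import date, datetime, time

# Hypothetical policy values that mirror the example in the text.
ALLOWED_WEEKDAYS = {1, 2, 3}      # datetime.weekday(): Monday is 0, so Tue=1, Wed=2, Thu=3
WINDOW_START = time(10, 0)
WINDOW_END = time(16, 0)
# Assumed freeze period (inclusive dates), for illustration only.
FREEZES = [(date(2024, 11, 25), date(2024, 12, 2))]


def in_change_window(moment: datetime) -> bool:
    """Return True if a deploy may start at `moment` (local time)."""
    # A freeze always wins over the regular weekly window.
    for start, end in FREEZES:
        if start <= moment.date() <= end:
            return False
    if moment.weekday() not in ALLOWED_WEEKDAYS:
        return False
    # The end of the window is exclusive: nothing starts at 16:00 sharp.
    return WINDOW_START <= moment.time() < WINDOW_END
```

A pipeline step could call `in_change_window(datetime.now())` and stop the rollout when it returns False, leaving the on-call "tower clearance" as a separate manual approval.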
Analogy 3 — Bank Vault Two-Person Key Rule (Separation of Duties)
A bank vault requires two different keys held by two different people: the manager alone cannot open it, and the teller alone cannot open it. In GCP, this maps to separation of duties via IAM: the person who writes the Terraform code (roles/editor on a sandbox) is NOT the same person who has roles/owner on the production project. IAM Deny policies limit what even a compromised admin account can do unilaterally, and Access Approval applies the same two-key idea to the provider: Google personnel cannot access your customer data until someone in your organization explicitly approves the request. Two keys, two humans, one vault.
The Pillar of Governance: Resource Hierarchy
The GCP Resource Hierarchy is the foundation of governance. It allows for the inheritance of policies and permissions.
- Organization: The root node (e.g., company.com). Essential for centralized billing and Org Policies.
- Folders: Used to group projects by department (e.g., "Finance"), environment (e.g., "Prod"), or business unit. Folders can be nested up to 10 levels deep.
- Projects: The base unit for enabling services, managing APIs, and billing. Resources MUST belong to a project.
- Resources: The actual VMs, Buckets, and Databases.
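Architect's Insight: IAM allow-policy inheritance is additive. If a user is granted Storage Admin at the Folder level, they have it for all projects within that folder, and a lower level cannot take it away with an allow policy. Restrictions work differently: IAM Deny policies are evaluated before allow policies and override them, while Org Policies constrain what can be configured regardless of which IAM roles a user holds.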
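The inheritance rule can be illustrated with a toy model of the hierarchy. This is a simplified sketch, not the IAM evaluation algorithm: real deny policies act on permissions rather than whole roles and support exceptions and conditions, and the `Node` class and `effective_roles` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One level of the hierarchy: organization, folder, or project."""
    name: str
    parent: Optional["Node"] = None
    allow: dict = field(default_factory=dict)  # member -> set of granted roles
    deny: dict = field(default_factory=dict)   # member -> set of denied roles (simplification)


def effective_roles(node: Node, member: str) -> set:
    """Union the grants from the node up to the root, then subtract denies."""
    allowed, denied = set(), set()
    current = node
    while current is not None:
        allowed |= current.allow.get(member, set())
        denied |= current.deny.get(member, set())
        current = current.parent
    return allowed - denied


org = Node("company.com")
finance = Node("Finance", parent=org, allow={"alice": {"roles/storage.admin"}})
project = Node("fin-prod", parent=finance, allow={"alice": {"roles/viewer"}})
```

Here `effective_roles(project, "alice")` contains both the folder-level and project-level roles. Adding a deny entry for `roles/storage.admin` on `org` removes that role everywhere below, which is the "deny overrides allow" behavior in miniature.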
Plain-Language Analogies for Governance
Analogy 1 — The Tree and its Fruit (Resource Hierarchy)
Think of your GCP environment as a Tree. The Organization is the Trunk. Folders are the Branches. Projects are the Twigs, and Resources (VMs, Data) are the Fruit. If you spray "Pesticide" (An Organization Policy) on the Trunk, it flows out to every branch, twig, and fruit. You don't have to spray every piece of fruit individually. This is Inheritance.
Analogy 2 — The Standardized Blueprint (Organization Policies)
Governance is like a City Building Code. The code says "All buildings must have fire sprinklers" and "No building can be taller than 5 stories." Organization Policies are these building codes. They prevent a project owner (a builder) from doing something dangerous, like opening a database to the entire public internet, even if they have the technical permission to do so.
Analogy 3 — The Library's Checkout System (Change Management)
Traditional change management is a line of people waiting for a Librarian (The Change Advisory Board) to stamp their book. Cloud Change Management is like a Self-Service Kiosk with an Automated Scanner. You scan your book (Your Code), and if it's "Safe" and you are "Authorized," the gate opens automatically. If you try to take a restricted book, the scanner beeps and blocks you immediately.
Analogy 4 — The Guardrails on a Mountain Road
Governance shouldn't be a brick wall that stops traffic; it should be the Guardrails on a winding mountain road. The guardrails allow drivers (Developers) to drive at high speeds with confidence, knowing that even if they make a small mistake, they won't go over the cliff.
Advanced Governance: Tags and Conditional Policies
GCP Tags (different from labels) are a powerful governance tool. Tags are managed at the Organization or Project level and can be used to conditionally apply policies.
- Conditional IAM: Grant "Instance Admin" ONLY if the resource has the tag env: dev.
- Conditional Org Policies: Disable external IPs ONLY for projects tagged compliance: pci.
- Network firewall policies: Use secure tags to define which resources can communicate with each other. (VPC Service Controls perimeters are scoped to projects and services rather than to tags.)
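The idea of a tag-conditional grant can be sketched in a few lines. This is an illustrative model only: real IAM conditions are written in CEL and evaluated by Google Cloud, and the binding dictionaries, the `when` key, and both function names below are hypothetical.

```python
def binding_applies(resource_tags: dict, required_tags: dict) -> bool:
    """True when every required tag key/value pair is present on the resource."""
    return all(resource_tags.get(key) == value for key, value in required_tags.items())


def granted_roles(resource_tags: dict, bindings: list) -> set:
    """Collect the roles whose (optional) tag condition matches the resource."""
    return {
        binding["role"]
        for binding in bindings
        if binding_applies(resource_tags, binding.get("when", {}))
    }


# Hypothetical bindings: one conditional on env=dev, one unconditional.
BINDINGS = [
    {"role": "roles/compute.instanceAdmin.v1", "when": {"env": "dev"}},
    {"role": "roles/viewer"},
]
```

With these bindings, a resource tagged `env: dev` yields both roles, while a resource tagged `env: prod` yields only `roles/viewer`. The same pattern (policy applies only where the tag matches) is what a conditional Org Policy does for projects tagged `compliance: pci`.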
Change Advisory Board (CAB): In a cloud-native context, the CAB is no longer a weekly meeting that approves individual gcloud commands. It is a governing body that approves the automation rules — the Org Policy constraints, the required peer-review count on GitHub, and the terraform plan gates in Cloud Build — that the CI/CD pipeline enforces on every commit.
Governance in Networking: Shared VPC
Governance of network resources is critical for security.
- Host Project: Contains the VPC, Subnets, and Cloud NAT. Managed by the "Network Admin."
- Service Projects: Contain the application resources (VMs, GKE). Managed by "Project Admins."
- Benefit: Centralizes control over IP space, firewalls, and hybrid connectivity while allowing developers to manage their own compute resources.
FinOps and Cost Governance
Governance isn't just about security; it's about money.
- Labels: Use labels (key-value pairs) for cost allocation (e.g., cost-center: marketing).
- Budgets and Alerts: Set at the Billing Account or Project level to notify stakeholders when spending hits 50%, 90%, or 100% of the threshold.
- Quotas: Prevent "runaway costs" by limiting the number of expensive resources (like TPUs or A100 GPUs) a project can create.
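The budget alert thresholds described above amount to simple arithmetic on spend versus budget. The sketch below shows that arithmetic only; it does not call the Cloud Billing API, and the `crossed_thresholds` function is a hypothetical helper.

```python
def crossed_thresholds(budget: float, spend: float, thresholds=(0.5, 0.9, 1.0)) -> list:
    """Return the alert thresholds (as fractions of budget) that spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]
```

For a 1,000 budget with 920 spent, the 50% and 90% thresholds have been reached but the 100% one has not, so stakeholders would already have received two notifications.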
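Audit log retention is a compliance gate, not a default. Cloud Audit Logs Admin Activity logs are retained for 400 days at no charge, but Data Access logs are disabled by default for most services (BigQuery is the notable exception) and, once enabled, are kept for only 30 days in the default log bucket. For SOX / HIPAA / PCI DSS workloads you MUST explicitly enable Data Access logs in the IAM audit configuration (auditConfigs), ideally at the organization or folder level so that projects inherit it, and route them to a Cloud Logging bucket with a locked retention period that matches your regulatory requirement (e.g., 7 years), or sink them to a Cloud Storage bucket with Bucket Lock for immutability.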
Make terraform plan the contract, not terraform apply. In your Cloud Build pipeline, run terraform plan -out=tfplan on the PR branch and post the diff as a GitHub comment. Reviewers approve the plan output, not the source code alone — this catches drift, surprise destroys (-/+ destroy and re-create), and IAM grants that the code reader missed. Then terraform apply tfplan runs only after merge, using the exact same plan that was reviewed.
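One way to make the plan reviewable is to convert the saved plan to JSON with `terraform show -json tfplan` and flag destructive actions automatically. The sketch below assumes that JSON is passed in as a string and relies on the `resource_changes[].change.actions` field of Terraform's JSON plan representation; the `risky_changes` function and the sample resource addresses are hypothetical.

```python
import json


def risky_changes(plan_json: str) -> list:
    """Return (address, kind) pairs for resources the plan destroys or replaces."""
    plan = json.loads(plan_json)
    findings = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        if "delete" in actions:
            # A replace shows up as both "delete" and "create" in the actions list.
            kind = "replace" if "create" in actions else "destroy"
            findings.append((resource_change.get("address"), kind))
    return findings


# Hand-written sample in the shape of `terraform show -json` output (illustrative only).
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "google_storage_bucket.logs", "change": {"actions": ["update"]}},
        {"address": "google_sql_database_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "google_compute_instance.old", "change": {"actions": ["delete"]}},
    ]
})
```

A CI step could fail the build, or require an extra approver, whenever `risky_changes` returns a non-empty list, so that a surprise replace never depends on a reviewer noticing it in a long diff.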
Change Management: The Shift to GitOps
In a modern cloud environment, Infrastructure as Code (IaC) and GitOps are the primary drivers of change management.
- Version Control (Git): Every change is a "Pull Request" (PR).
- Peer Review: Another architect must approve the code, ensuring architectural alignment.
- Automated Testing (Linting/Validation): Tools like terraform plan and checkov verify the change before it's applied.
- Blue/Green or Canary Deployments: Change management for applications to minimize user impact.
- Rollback: The ability to "Revert" a commit to restore the last known good state.
A rollback runbook must exist BEFORE the deploy, not after the incident. Every production change should ship with: (a) the exact git revert <sha> or terraform apply command to undo it, (b) the maximum acceptable rollback window (e.g., "must revert within 30 min or schema migration becomes irreversible"), and (c) a Cloud Deploy rollback target or a re-tagged GAR image. For data changes, capture a bq snapshot or Spanner backup before the change — once the new schema is live, "git revert" alone won't bring your data back.
Change windows are NOT a substitute for canary deploys. A common PCA-exam wrong answer is "schedule all production changes for the Sunday 02:00 window." On GCP, the right answer is almost always gradual rollout via Cloud Deploy / GKE / Cloud Run revision traffic splitting — 1% → 10% → 50% → 100% — with automated rollback on SLO burn. The Sunday window only limits blast-radius timing; canaries limit blast-radius size. Use both, but never rely on the window alone.
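The gradual rollout with automated rollback can be sketched as a loop over traffic stages. This is a conceptual model under stated assumptions, not the Cloud Deploy or Cloud Run API: `error_rate_at` stands in for whatever monitoring query reports the error rate at a given traffic percentage, and the 1% error budget is an arbitrary example value.

```python
STAGES = (1, 10, 50, 100)  # traffic percentages from the text


def run_canary(error_rate_at, slo_error_budget=0.01, stages=STAGES):
    """Advance through the stages; roll back at the first stage that breaches the SLO.

    error_rate_at: callable taking a traffic percentage and returning the
    observed error rate (0.0 to 1.0) at that stage. Hypothetical hook.
    """
    for percent in stages:
        if error_rate_at(percent) > slo_error_budget:
            return ("rolled_back", percent)
    return ("promoted", stages[-1])
```

If a regression only shows up once 10% of traffic reaches the new revision, the rollout stops there and 90% of users never see it. A change window alone would have exposed everyone at once, just at a quieter hour.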
Architect's Rule: Never make manual changes in the Google Cloud Console for production environments. This is known as "Click-Ops" and it leads to Configuration Drift, which is the enemy of governance. Use Terraform or Config Connector to manage state.
Compliance as Code
Maintain continuous compliance rather than doing periodic audits.
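- Config Connector: Manage GCP resources using Kubernetes CRDs, so that resources are declared with kubectl and a controller continuously reconciles them to the declared state.
- Forseti Security (Open Source): Inventory, Scan, and Enforce policies across your GCP footprint. The project has since been archived, so treat it as a legacy option.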
- GCP Security Command Center (SCC): Real-time monitoring of misconfigurations and vulnerabilities.
FAQ — Change Management and Governance
Q1. What is "Configuration Drift"?
Drift occurs when the actual state of your cloud resources no longer matches the state defined in your Terraform code. This usually happens because someone made a manual "emergency" change in the console and forgot to update the code. Drift detection tools are essential for maintaining governance.
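At its core, drift detection is a comparison between declared and observed configuration. The sketch below shows that comparison on plain dictionaries; it is not how Terraform implements it (Terraform refreshes state from the provider APIs), and the `detect_drift` function and the sample attributes are hypothetical.

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Return the attributes whose declared and actual values differ.

    A key missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(set(declared) | set(actual)):
        want, have = declared.get(key), actual.get(key)
        if want != have:
            drift[key] = {"declared": want, "actual": have}
    return drift


# Hypothetical firewall rule: someone opened the source range in the console.
DECLARED = {"source_ranges": ["10.0.0.0/8"], "port": 443}
ACTUAL = {"source_ranges": ["0.0.0.0/0"], "port": 443}
```

Here `detect_drift(DECLARED, ACTUAL)` reports only `source_ranges`, which is exactly the kind of manual "emergency" change that should either be reverted or written back into the code.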
Q2. How do I handle "Emergency Changes"?
Even in an emergency, use IaC if possible. If you MUST use the console, document the change immediately and use terraform plan to identify the drift. Update the code to match the manual change (or vice versa) as soon as the crisis is over.
Q3. Can a Project Owner override an Organization Policy?
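No, not with the Owner role alone. Organization Policies are set at a higher level in the hierarchy and inherited downward, and changing them requires the Organization Policy Administrator role (roles/orgpolicy.policyAdmin), which project-level roles such as Owner do not include. This provides guardrails that hold even if a project-level account is compromised.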
Q4. What is the "Change Advisory Board" (CAB) in a DevOps world?
The CAB evolves from a "Gatekeeper" (approving every ticket) to a "Governing Body" (approving the Process and the Automation). They define the rules that the CI/CD pipeline enforces automatically.
Q5. How do I govern "Cloud Sprawl"?
Use Quotas to limit project creation. Implement a "Project Request" workflow where developers must provide a cost center and a sunset date before a project is provisioned via automation.
Q6. Difference between Labels and Tags?
Labels are metadata for filtering and billing (e.g., team: web). Tags are strongly typed resources that can be used in IAM and Org Policy conditions to enforce security (e.g., environment: prod).
Final Architect Tip
On the PCA exam, look for questions about "Standardization" or "Enforcing constraints across many projects." The answer is almost always Organization Policies. If the question is about "Managing many environments safely," the answer involves IaC (Terraform) and CI/CD. Always design for Inheritance—set policies at the highest level possible to reduce administrative overhead and ensure no "policy gaps" exist in your organization.
PCA governance memory hooks: (1) Org Policies = guardrails (deny external IPs, deny public buckets, require OS Login); (2) IAM = identity-based access; (3) Tags = conditional policy targeting; (4) Labels = billing/filtering only; (5) Audit Logs = who did what when (Admin Activity = free 400d, Data Access = opt-in); (6) Cloud Deploy + Binary Authorization = signed, gradual rollout; (7) Bucket Lock + Retention Policy = immutable audit trail. If the question says "prevent" or "enforce," think Org Policy. If it says "investigate" or "prove," think Cloud Audit Logs + BigQuery sink.