examlab .net The most efficient path to the most valuable certifications.

Cost Optimization for Data Processing


Practical GCP PDE study guide on cost optimization architectures: BigQuery slots vs on-demand, partitioning, materialized views, BI Engine, Dataflow, Dataproc, storage tiers, egress, CUDs, and FinOps.


Introduction to Cost Optimization Architectures

Cost optimization architectures decide whether a data platform stays cheap to grow or becomes a quarterly finance review headline. On GCP the lever is rarely a single setting. It is the combination of how you store data, how you schedule compute, and how you bill for time you do not use. This note walks through the choices a Professional Data Engineer makes from raw bytes to dashboards, with every dollar accounted for.

Cost optimization architectures live and die at the design phase. Retrofitting partitioning, slot reservations, or storage classes onto a running pipeline is harder than picking them on day one. Treat cost as a first-class non-functional requirement, alongside latency and durability. See https://cloud.google.com/architecture/framework/cost-optimization

Plain English Explanation

Before diving into pricing pages, it helps to anchor cost optimization architectures to things you already understand. Three analogies cover the mental model: a utility bill at home, a gym membership, and the cold-storage room at the back of a restaurant.

The Electricity Bill Analogy

Imagine your house has two ways to pay for electricity. Plan A charges you per kilowatt-hour, with no monthly fee. Plan B charges a flat fee for unlimited usage up to a cap, with overage at a discount. If you only run a laptop and a fridge, Plan A wins. If you mine crypto in the basement, Plan B wins by a mile.

BigQuery on-demand pricing is Plan A. Slot reservations are Plan B. The decision is not which plan is better in the abstract. It is which plan matches your actual workload pattern. Most teams misjudge this because they look at a single peak day instead of the steady-state average.

The Gym Membership Analogy

A gym sells day passes for fifteen dollars and annual memberships for four hundred dollars. If you go twenty-seven or more days a year, the membership wins, because twenty-seven day passes already cost 405 dollars. If you go to the gym in January and quit in February, day passes were the right call.

This is exactly how committed use discounts work in cost optimization architectures. You pre-commit to one or three years of compute spend in exchange for a discount of twenty-five to fifty-seven percent. The trap is the same as the gym: people overestimate how much they will use the resources next year and end up paying for slots they never burn.

The Restaurant Cold-Storage Analogy

A restaurant has three places to keep ingredients. The line cooler at the prep station is reachable in seconds but small and expensive. The walk-in fridge in the back holds a week of supplies and costs less per pound. The freezer in the basement holds bulk meat for months and costs almost nothing per pound, but takes thirty minutes to thaw.

Cloud Storage classes follow the same gradient: Standard is the line cooler, Nearline is the walk-in, and Coldline and Archive are progressively deeper freezers. The question is not which class is cheapest. It is which class matches the access pattern, because thawing frozen data has a real retrieval fee that can erase storage savings if you guess wrong.

Core Concepts of Cost Optimization Architectures

Cost optimization architectures rest on five pillars that show up on every PDE exam scenario. Understand each one in isolation and then in combination.

The first pillar is decoupled storage and compute. BigQuery, Dataproc on GCS, and Dataflow on Pub/Sub all separate where data sits from what processes it. This lets you scale each axis independently, which is the single largest cost lever in cost optimization architectures.

The second pillar is right-sizing. Picking n2-standard-4 when n2-standard-2 would do, or running ten Dataflow workers when five suffice, doubles your bill for no business value. Right-sizing requires actual usage telemetry, not guesses.

The third pillar is temporal pricing. Spot VMs, preemptible workers, scheduled queries off-peak, and storage class transitions all exploit the fact that GCP charges different rates at different times or for different SLAs. Cost optimization architectures use these aggressively.

The fourth pillar is commitment. Committed use discounts, slot reservations, and BigQuery editions trade flexibility for price. Get this right and you save thirty to seventy percent. Get it wrong and you pay for capacity you never touch.

The fifth pillar is observability. Without billing exports to BigQuery, budget alerts, and anomaly detection, you only learn about cost problems on the first of the month. Cost optimization architectures embed cost feedback loops into engineering work.

A BigQuery slot is a virtual compute unit, bundling CPU, memory, and shuffle resources, that BigQuery uses to execute queries. Queries consume slots while running. On-demand pricing meters bytes scanned. Capacity pricing meters slot-hours. https://cloud.google.com/bigquery/docs/slots

Architecture and Design Patterns

A few recurring patterns show up in well-designed cost optimization architectures on GCP.

The tiered storage pattern moves data through hot, warm, cold, and archival tiers based on access age. Raw clickstream lands in Standard, ages to Nearline at thirty days, Coldline at ninety days, and Archive at one year. Lifecycle policies handle the transitions automatically. The pattern works because access frequency typically follows a power-law decay, where ninety percent of reads target the most recent ten percent of data.

The hybrid pricing pattern mixes on-demand BigQuery for ad-hoc analyst queries with a slot reservation for predictable production ETL. The reservation absorbs the steady workload. On-demand soaks up the spikes. This is the most common BigQuery cost optimization pattern in mid-size organizations because it gets the best of both billing models.

The preemptible-heavy compute pattern runs Dataproc clusters with two regular workers and twenty preemptible workers. Spark and Hadoop are designed to handle node loss, so preemption is recoverable. With preemptible workers priced at roughly twenty to thirty percent of regular VMs, the cluster's VM cost comes to roughly a quarter to a third of an all-regular configuration. Cost optimization architectures lean on this pattern for batch ETL where job latency tolerates a few restarts.

The materialized acceleration pattern layers materialized views and BI Engine on top of BigQuery base tables. Dashboard queries hit the cache. ETL refreshes the materialization once per hour. The pattern dramatically reduces bytes scanned at the cost of a small storage footprint and refresh compute.

The streaming-with-shuffle-offload pattern uses Dataflow Streaming Engine for streaming jobs, and Dataflow Shuffle for batch jobs, to move state and shuffle data off the worker VMs. This shrinks worker disk and memory needs, which directly cuts the per-hour worker cost. Cost optimization architectures for streaming pipelines almost always enable Streaming Engine by default.

When sketching a cost optimization architecture diagram, draw the cost meter next to each arrow. BigQuery scan = bytes. Dataflow worker = vCPU-hours. GCS retrieval = operations plus egress. Making the meters visible on the diagram prevents the design review from missing a hidden cost line. https://cloud.google.com/architecture/framework/cost-optimization

GCP Service Deep Dive

Each managed service has its own cost knobs. The PDE exam expects you to recall them under pressure.

BigQuery: On-Demand vs Slot Reservations

On-demand pricing charges per tebibyte (TiB) scanned, currently around 6.25 USD per TiB in most regions. There is no upfront cost and no commitment. The model rewards small, well-pruned queries against partitioned and clustered tables.

Slot reservations under BigQuery editions (Standard, Enterprise, Enterprise Plus) charge per slot-hour. You buy a baseline of slots and optionally enable autoscaling to a max ceiling. The reservation runs every query from the projects assigned to it, regardless of bytes scanned. Heavy ETL workloads and analytics teams that scan many petabytes per month break even at roughly four to five hundred slots of constant usage.

The decision flow for cost optimization architectures: if monthly on-demand spend exceeds the cost of a 500-slot Enterprise reservation, switch. If usage is bursty and unpredictable, stay on-demand or use the autoscale feature. If you need predictable monthly bills for finance, reserve.
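The decision flow above can be sketched as a small break-even calculation. This is a rough illustration, not pricing guidance: the 6.25 USD per TiB on-demand rate comes from this note, while the slot-hour rate (`slot_hour_usd`), the 730-hour month, and all function names are assumptions to replace with figures from the current pricing page for your region and edition.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def reservation_monthly_cost(slots: int, slot_hour_usd: float) -> float:
    """Cost of keeping `slots` reserved for a full month."""
    return slots * slot_hour_usd * HOURS_PER_MONTH


def on_demand_monthly_cost(tib_scanned: float, usd_per_tib: float = 6.25) -> float:
    """On-demand cost for one month's scanned volume."""
    return tib_scanned * usd_per_tib


def break_even_tib(slots: int, slot_hour_usd: float, usd_per_tib: float = 6.25) -> float:
    """Monthly TiB scanned at which on-demand spend equals the reservation."""
    return reservation_monthly_cost(slots, slot_hour_usd) / usd_per_tib


def recommend(tib_scanned: float, slots: int, slot_hour_usd: float) -> str:
    """Compare the two billing models for the same month of work."""
    on_demand = on_demand_monthly_cost(tib_scanned)
    reserved = reservation_monthly_cost(slots, slot_hour_usd)
    return "reserve" if on_demand > reserved else "stay on-demand"


# Placeholder slot-hour rate, for illustration only.
example_rate = 0.06
print(break_even_tib(500, example_rate), "TiB per month at break-even")
print(recommend(tib_scanned=5000, slots=500, slot_hour_usd=example_rate))
```

The comparison only holds if the chosen slot count is actually enough capacity for the workload; a reservation that is cheaper on paper but too small simply slows the queries down.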

Slot reservations also unlock features like cross-region replication and column-level security depending on the edition tier. Cost optimization architectures sometimes pick reservations for the features rather than purely the price.

Partitioning and Clustering for Cost Reduction

Partitioning splits a table into segments by date, integer range, or ingestion time. Queries with a WHERE clause on the partition column scan only the relevant partitions. Take a 10 TB daily-partitioned table holding about 1,000 days of evenly distributed data: a query with a single-day filter scans only about 10 GB, cutting scan cost by 99.9 percent.
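The pruning arithmetic is easy to check with a few lines of code. The sketch below assumes equally sized partitions and treats 10 TB as 10,000 GB; the function names are illustrative, and the 1,000-partition figure is simply what the 10 TB to 10 GB example implies.

```python
def pruned_scan_gb(table_gb: float, total_partitions: int, partitions_hit: int) -> float:
    """GB scanned when only `partitions_hit` of equally sized partitions are read."""
    if total_partitions <= 0:
        raise ValueError("total_partitions must be positive")
    if not 0 <= partitions_hit <= total_partitions:
        raise ValueError("partitions_hit must be between 0 and total_partitions")
    return table_gb * partitions_hit / total_partitions


def savings_pct(table_gb: float, scanned_gb: float) -> float:
    """Percentage of the full-table scan that pruning avoided."""
    return 100.0 * (1 - scanned_gb / table_gb)


scanned = pruned_scan_gb(table_gb=10_000, total_partitions=1_000, partitions_hit=1)
print(scanned, "GB scanned;", savings_pct(10_000, scanned), "percent saved")
```

Real tables are rarely uniform, so a skewed day can scan far more than the average partition size suggests.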

Clustering sorts data within each partition by up to four columns. Queries filtering or aggregating on cluster columns benefit from block-level pruning. Clustering does not save as dramatically as partitioning but compounds with it. Cost optimization architectures almost always combine both.

Common pitfalls: partitioning by a high-cardinality column creates thousands of tiny partitions, which actually increases metadata overhead and can hurt performance. Clustering only helps if the workload filters on cluster columns. Picking cluster keys without query log analysis is guesswork.

Materialized Views and BI Engine

Materialized views precompute and store query results. BigQuery automatically uses them when an incoming query matches the view definition, even partially. Refresh happens incrementally and is billed as scan cost. Cost optimization architectures use materialized views for repeated aggregations like daily revenue rollups, where the dashboard hits the view rather than the multi-terabyte fact table.

BI Engine is an in-memory accelerator for BigQuery queries that come from Looker, Looker Studio, or Tableau. You allocate a memory reservation in GiB. Queries that fit in the reservation return in sub-second latency and do not consume on-demand scan bytes. The reservation has a flat monthly cost.

The combined pattern: BI Engine for sub-second dashboards, materialized views for slightly older but still common queries, raw partitioned tables for ad-hoc deep dives. This three-tier acceleration is a hallmark of mature cost optimization architectures.

Dataflow: Streaming Engine and Autoscaling

Dataflow charges per worker vCPU-hour, memory-hour, persistent disk, and shuffle data. Streaming Engine moves state out of worker VMs into a managed backend. Workers become smaller and cheaper. Autoscaling adjusts worker count based on backlog and CPU. Together they typically cut streaming pipeline cost by twenty to forty percent compared to legacy non-Streaming-Engine runs.

For batch jobs, Dataflow Shuffle Service does the same trick. Shuffle data lives off-worker, so workers can shrink. Combine with FlexRS (Flexible Resource Scheduling) and you get up to forty percent savings on batch by allowing the job to start within six hours instead of immediately.

Streaming pipelines without Streaming Engine require the worker boot disk to hold state. People crank up persistent disk to 400 GB per worker and then wonder why disk dominates the bill. Always enable Streaming Engine for production streaming. https://cloud.google.com/dataflow/docs/streaming-engine
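As a rough sketch of how these Dataflow settings are usually switched on, the helper below assembles pipeline arguments for a streaming or batch job. The flag names are believed to match the Apache Beam Python SDK pipeline options, but treat them as assumptions and check them against the current Dataflow documentation; the helper function itself is hypothetical.

```python
def dataflow_cost_flags(streaming: bool, max_workers: int) -> list[str]:
    """Build cost-related Dataflow pipeline arguments (assumed Beam Python flag names)."""
    flags = [
        f"--max_num_workers={max_workers}",          # ceiling for autoscaling
        "--autoscaling_algorithm=THROUGHPUT_BASED",  # keep autoscaling enabled
    ]
    if streaming:
        # Streaming Engine moves state off the worker disks.
        flags += ["--streaming", "--enable_streaming_engine"]
    else:
        # FlexRS trades a delayed start for cheaper batch execution.
        flags.append("--flexrs_goal=COST_OPTIMIZED")
    return flags


print(dataflow_cost_flags(streaming=True, max_workers=20))
print(dataflow_cost_flags(streaming=False, max_workers=50))
```

The resulting list would be passed to the pipeline alongside the usual project, region, and runner arguments.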

Dataproc: Preemptible Workers and Autoscaling

Dataproc clusters have a primary worker pool (regular VMs) and a secondary worker pool. Secondary workers can be standard, spot, or preemptible. Preemptible and spot workers cost roughly twenty to thirty percent of regular pricing but can be reclaimed by GCP at any time with thirty seconds of warning.

The rule of thumb: keep enough primary workers to hold HDFS data and complete the job if all secondaries vanish. Add secondary preemptible workers for elastic compute. Spark handles preemption gracefully because it re-runs lost tasks.
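To see how the worker mix drives the bill, the blended VM cost of a cluster can be expressed relative to an all-regular cluster of the same size. This is a simplified sketch: it assumes identical machine types in both pools, ignores the Dataproc service fee and disks, and ignores any extra runtime caused by preemptions. The function name and parameters are illustrative.

```python
def blended_cost_ratio(primary: int, secondary: int, secondary_price_ratio: float) -> float:
    """Cluster VM cost relative to the same node count at regular pricing.

    secondary_price_ratio is the secondary worker price as a fraction of the
    regular price, for example 0.2 to 0.3 for the range quoted above.
    """
    total = primary + secondary
    if total == 0:
        raise ValueError("cluster has no workers")
    return (primary + secondary * secondary_price_ratio) / total


for ratio in (0.2, 0.3):
    print(ratio, blended_cost_ratio(primary=2, secondary=20, secondary_price_ratio=ratio))
```

The blended ratio always sits between the secondary price ratio and 1.0, moving toward the former as the secondary pool grows.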

Dataproc autoscaling policies define min and max worker counts plus scaling rules based on YARN pending memory. A well-tuned policy keeps the cluster small during quiet periods and grows for big jobs. Combined with preemptible secondaries, autoscaling can cut Dataproc spend by sixty to eighty percent versus a fixed cluster sized for peak.

Cloud Storage Class Transitions

Cloud Storage offers four classes with falling storage prices and rising retrieval prices as you move from Standard to Archive. Standard is for frequently accessed data with no minimum storage duration. Nearline is for data accessed less than once per month, with a thirty-day minimum. Coldline is for data accessed less than once per quarter, with a ninety-day minimum. Archive is for data accessed less than once per year, with a 365-day minimum.

Lifecycle policies move objects automatically based on age, storage class, or other conditions. A typical lifecycle: Standard for the first thirty days, Nearline until day ninety, Coldline until the one-year mark, Archive thereafter. Delete after seven years for compliance.
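A lifecycle like this is typically written as a JSON configuration attached to the bucket. The sketch below builds one in Python, assuming object-age thresholds of 30, 90, and 365 days for the transitions and 7 × 365 days for deletion (leap days ignored). The helper name is illustrative, and the thresholds should be tuned to real access logs.

```python
import json


def transition(storage_class: str, age_days: int, from_classes: list[str]) -> dict:
    """One lifecycle rule: move objects older than age_days into storage_class."""
    return {
        "action": {"type": "SetStorageClass", "storageClass": storage_class},
        "condition": {"age": age_days, "matchesStorageClass": from_classes},
    }


lifecycle = {
    "rule": [
        transition("NEARLINE", 30, ["STANDARD"]),
        transition("COLDLINE", 90, ["NEARLINE"]),
        transition("ARCHIVE", 365, ["COLDLINE"]),
        # Assumed retention: delete once objects are seven years old.
        {"action": {"type": "Delete"}, "condition": {"age": 7 * 365}},
    ]
}

print(json.dumps(lifecycle, indent=2))
```

Saved to a file, a configuration of this shape can be applied with something like `gcloud storage buckets update gs://BUCKET --lifecycle-file=FILE`, where BUCKET and FILE are placeholders.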

Early deletion before the minimum storage duration triggers a prorated charge as if you stored the object for the full minimum. Moving a one-day-old file to Archive and deleting it the next day still costs you 365 days of Archive storage. Lifecycle rules with the wrong age threshold are an expensive mistake. https://cloud.google.com/storage/docs/lifecycle
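The minimum-duration rule reduces to a one-line formula: billable days are the larger of the days actually stored and the class minimum. A minimal sketch, using the minimums listed in this note and hypothetical function names:

```python
# Minimum storage durations in days, as listed in this note.
MIN_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def billable_days(storage_class: str, days_stored: int) -> int:
    """Days of storage billed once the object is deleted."""
    return max(days_stored, MIN_DAYS[storage_class])


def early_deletion_days(storage_class: str, days_stored: int) -> int:
    """Extra days charged because the object left before its minimum."""
    return billable_days(storage_class, days_stored) - days_stored


print(billable_days("ARCHIVE", 1), early_deletion_days("ARCHIVE", 1))
print(billable_days("NEARLINE", 45), early_deletion_days("NEARLINE", 45))
```

Running the same check over a planned lifecycle rule before deploying it is a cheap way to catch an age threshold that would trigger early deletion on every object.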

Egress Cost

Egress is the silent killer in cost optimization architectures. Data leaving GCP to the internet costs roughly 8 to 23 cents per GB depending on destination. Cross-region replication, multi-region buckets read from a single region, and BigQuery exports to another cloud all incur egress.
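A back-of-the-envelope estimate makes the size of the egress line concrete. The sketch below assumes a flat per-GB rate and a 30-day month, which ignores tiered pricing and free allowances; the function name is illustrative.

```python
def monthly_egress_usd(gb_per_day: float, usd_per_gb: float, days: int = 30) -> float:
    """Flat-rate estimate of monthly egress spend."""
    return gb_per_day * days * usd_per_gb


# The low and high ends of the per-GB range quoted above.
for rate in (0.08, 0.23):
    print(rate, monthly_egress_usd(gb_per_day=500, usd_per_gb=rate))
```

Putting this number next to the storage and compute estimates in a design review is often enough to justify co-locating compute with the data.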

Mitigation: keep compute and storage in the same region. Use multi-region buckets only when the workload genuinely reads from multiple regions. Use Private Service Connect for partner connections to avoid public-internet egress rates. Use BigQuery Omni or BigLake to query data in place rather than exporting it.

Committed Use Discounts and Reservations

Committed use discounts (CUDs) apply to Compute Engine, Cloud SQL, GKE, Memorystore, and other services. You commit to a baseline spend or capacity for one or three years. Resource-based CUDs discount up to fifty-seven percent on specific machine types in specific regions. Spend-based CUDs discount up to twenty-eight percent on a flexible spend amount across regions and machine families.

Cost optimization architectures pick CUDs based on confidence in the workload shape. Resource-based CUDs save more but require workload stability. Spend-based CUDs save less but tolerate workload changes. A common hybrid: resource-based for the baseline production fleet, spend-based for the rest.

BigQuery slot commitments work similarly. You commit to a slot count for one or three years and get a discount versus the pay-as-you-go slot-hour price.

Cloud Billing Tools and FinOps

Billing exports send detailed billing data into BigQuery, where you can analyze cost by label, project, service, or SKU. Standard exports include daily summaries. Detailed exports include resource-level breakdowns. Both are free to enable and form the foundation of any FinOps practice.

Budget alerts notify project owners or arbitrary email lists when spend crosses thresholds. They do not stop spending unless wired into a Pub/Sub topic that triggers automation. Cost optimization architectures use budgets as early warning, not as enforcement.
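Wiring a budget to automation usually means a small function subscribed to the budget's Pub/Sub topic. The sketch below assumes the notification arrives as a base64-encoded JSON payload in `event["data"]` with `costAmount`, `budgetAmount`, and `budgetDisplayName` fields, which is how budget notifications are generally documented. The enforcement step is a hypothetical placeholder rather than a real API call.

```python
import base64
import json


def parse_budget_notification(event: dict) -> dict:
    """Decode the Pub/Sub message body into a dict."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


def should_enforce(notification: dict, enforce_at: float = 1.0) -> bool:
    """True once actual cost reaches enforce_at times the budget amount."""
    budget = notification["budgetAmount"]
    return budget > 0 and notification["costAmount"] / budget >= enforce_at


def handle_budget_event(event: dict, context=None) -> str:
    """Entry point: decide whether to trigger enforcement."""
    notification = parse_budget_notification(event)
    if should_enforce(notification):
        # Hypothetical hook: scale down, pause schedulers, or page the owner here.
        return f"ENFORCE {notification['budgetDisplayName']}"
    return "OK"
```

Keeping the decision logic in a pure function like `should_enforce` makes it testable without deploying anything.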

Recommender surfaces idle VM, idle persistent disk, idle IP, and rightsize recommendations automatically. It also flags BigQuery slot reservation tuning. Reviewing Recommender weekly catches drift before it shows up on the bill.

Common Pitfalls and Trade-offs

Several patterns recur across cost optimization architectures audits.

SELECT star against unpartitioned tables. Analysts run SELECT * FROM events against a 100 TB table because they want to peek at the data. The query costs over six hundred dollars. Partitioning, clustering, and table previews exist to prevent this, but governance has to enforce it.

Over-provisioned slot reservations. A team buys 2,000 slots because the CFO wanted predictability. Actual usage averages 300. The other 1,700 slots cost roughly forty thousand dollars per month for nothing. Always size reservations to the median, not the peak, and let on-demand or autoscale absorb spikes.

Preemptible workers on long-running jobs. A four-hour Dataproc job with mostly preemptible workers loses workers repeatedly and ends up running six hours. The savings evaporate. Cost optimization architectures use preemptibles for jobs under one hour or for elastic add-on capacity, not as the entire fleet on long jobs.

Streaming pipelines that never scale down. A Dataflow streaming job scales up for a holiday and stays at high worker count for months. Autoscaling does scale down, but only once the backlog and worker CPU utilization actually drop. Audit streaming worker counts monthly.

Multi-region buckets used as default. Multi-region storage costs more per GB than single-region, with the exact premium depending on location and storage class. Teams pick multi-region without needing the geographic redundancy. Cost optimization architectures default to single-region storage and upgrade only with a documented requirement.

Egress through the public internet for partner data exchange. Sending a daily 500 GB file to a partner cloud through public egress is about 15 TB per month, which costs well over a thousand dollars per month at 8 cents per GB or more. Private Service Connect, Interconnect, or partner-specific solutions cut that by half or more.

A common review trap: the team optimized BigQuery scans down to a few terabytes per month but never noticed the seven-figure egress bill from cross-region replication. Cost optimization architectures look at the whole bill, not just the favorite line item. https://cloud.google.com/network-connectivity/docs/network-connectivity-center

Best Practices

These practices appear in every well-run GCP data platform.

  • Label every resource with team, environment, cost_center, and workload keys. Billing exports without labels make chargeback impossible.
  • Enable detailed billing export to BigQuery on day one of a project, even before any workloads run. Backfilling is not possible.
  • Run a quarterly slot reservation review. Compare actual usage to committed slots. Adjust commitments at renewal.
  • Set lifecycle policies on every GCS bucket, even short-lived ones. Default to Standard for thirty days, Nearline for ninety, Coldline thereafter. Tune based on access logs.
  • Use materialized views and BI Engine before scaling slot reservations. Acceleration is almost always cheaper than more compute.
  • Audit Recommender output monthly. Idle VMs and oversized disks accumulate silently.
  • Test preemptible behavior in staging. Make sure the job recovers from worker loss before pushing to production.
  • Document the cost model in the same repo as the code. New engineers should see the cost contract alongside the API contract.

Real-World Use Case

A mid-size ad-tech company runs roughly three petabytes of clickstream data through GCP. Before applying cost optimization architectures, the monthly bill was 180,000 USD across BigQuery, Dataflow, Dataproc, and storage. The team ran a six-week optimization effort.

For BigQuery, they switched from on-demand to a 1,500-slot Enterprise reservation with autoscaling to 3,000. Most ETL jobs ran within the baseline. Analyst queries occasionally pushed into autoscale. The new BigQuery bill landed at 42,000 USD per month, down from 95,000 USD.

They added partitioning by event_date and clustering by user_id and event_type to the main events table. Average query scan dropped from 800 GB to 12 GB. They built three materialized views for the most common dashboard aggregations. Dashboard latency dropped from 14 seconds to under 1 second, with BI Engine handling the very hot aggregations.

For Dataflow, they enabled Streaming Engine on all four production streaming pipelines. Worker disk dropped from 400 GB to 30 GB per worker. They also moved batch jobs to FlexRS where the six-hour latency was acceptable. Combined Dataflow savings: 18,000 USD per month.

For Cloud Storage, they applied lifecycle policies. Raw events landed in Standard, moved to Nearline after thirty days, Coldline after ninety, Archive after one year. The storage bill fell from 22,000 USD to 9,000 USD per month.

For compute, they applied three-year resource-based CUDs to the steady-state Dataproc primary worker fleet and added preemptible secondary workers for batch elastic capacity. Dataproc dropped from 35,000 USD to 14,000 USD.

Total monthly bill after the effort: roughly 75,000 USD, which is the original 180,000 USD minus the 105,000 USD of itemized savings above. Annual savings: roughly 1.26 million USD. The optimization paid for itself in the first month and the team built a quarterly review process to keep it from drifting.

Exam Tips

The PDE exam tests cost optimization architectures heavily because real customers ask about it constantly. Know these flashcards cold.

  • BigQuery on-demand: per TiB scanned. Slot reservations: per slot-hour. Break-even is roughly 400 to 500 sustained slots.
  • Partitioning is the single biggest BigQuery cost lever. Clustering compounds. Always combine.
  • BI Engine fits in-memory aggregations. Materialized views fit precomputed rollups. Both reduce on-demand scan cost.
  • Dataflow Streaming Engine cuts worker disk and enables better autoscaling. Always enable for streaming production.
  • Dataproc primary workers must satisfy HDFS replication. Secondary preemptible workers add elastic compute at one-fifth cost.
  • Cloud Storage minimum storage durations: Nearline 30 days, Coldline 90 days, Archive 365 days. Early deletion is billed as if the object had stayed for the full minimum.
  • Lifecycle rules transition objects between classes automatically based on age, storage class, or other conditions.
  • Egress to internet costs 8 to 23 cents per GB depending on destination. Cross-region replication and multi-region buckets incur it silently.
  • CUDs: resource-based discount up to 57 percent on specific machine types. Spend-based discount up to 28 percent flexibly.
  • Billing exports go to BigQuery. Standard or detailed. Free to enable. Foundation of FinOps.
  • Budget alerts notify; they do not enforce unless wired to Pub/Sub plus automation.
  • Recommender surfaces idle resources and rightsizing automatically.

The cost optimization decision tree on the exam usually goes: predictable workload? Reserve. Bursty workload? On-demand or autoscale. Long-term commitment confidence? Three-year CUD. Cold data? Lifecycle to Archive. Egress involved? Move compute closer to data. https://cloud.google.com/architecture/framework/cost-optimization

Frequently Asked Questions (FAQ)

When should I switch from BigQuery on-demand to slot reservations?

The break-even depends on your region and edition tier. As a rough rule, if your monthly on-demand bill exceeds the cost of a 500-slot Enterprise reservation in your region, reservations save money. For most regions that is around the 30,000 to 40,000 USD per month on-demand mark. Cost optimization architectures often start with a smaller reservation and use autoscale to handle spikes, then grow the baseline as steady-state usage rises. You can also mix: keep on-demand for a sandbox project and put production ETL on a reservation.

How do I pick the right Cloud Storage class for a bucket?

Match the class to access frequency. Standard for data read more than once a month. Nearline for data read once a month or less, like monthly reports. Coldline for data read once a quarter or less, like compliance archives. Archive for data you might never read but must retain, like legal holds. Use lifecycle policies to transition automatically rather than picking one class for the bucket lifetime. Watch the minimum storage duration carefully because early deletion of cold data costs as much as keeping it the full minimum.

Are preemptible workers safe for production Dataproc jobs?

Yes, if the job tolerates restarts. Spark and Hadoop reschedule lost tasks automatically, so a preemption typically extends job runtime by a few percent rather than failing the job. Cost optimization architectures keep enough non-preemptible primary workers to hold HDFS data and use preemptible secondaries for elastic compute. Avoid all-preemptible clusters for jobs longer than two hours because cumulative preemption probability rises. Test the failure mode in staging before relying on the pattern in production.

What is the difference between materialized views and BI Engine?

Materialized views store precomputed query results in BigQuery storage and refresh incrementally. They work for any query that matches the view definition, whichever client issues the SQL. BI Engine is an in-memory accelerator that caches query results in RAM for visualization tools like Looker Studio. Both reduce on-demand scan cost, but BI Engine returns sub-second results for dashboards while materialized views handle precomputed aggregations across all clients. Most cost optimization architectures use both: materialized views for the heavy precomputation and BI Engine for the dashboard hot path.

How do committed use discounts interact with autoscaling?

CUDs apply to a baseline of usage. Anything above the baseline runs at on-demand rates. If your workload autoscales between 10 and 50 VMs, commit to roughly the floor (10 VMs) so the commitment never goes unused. The autoscaled portion costs more per hour but only runs when needed. Resource-based CUDs lock to specific machine types in specific regions, so they fit steady production fleets. Spend-based CUDs are more flexible and pair better with workloads whose machine type might change. Cost optimization architectures often layer both: resource-based for the steady core, spend-based for the variable layer.
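The commit-to-the-floor advice can be checked against a usage trace. The sketch below is a simplified model under stated assumptions: one machine type, a single hourly on-demand rate, a commitment billed every hour whether used or not, and usage above the commitment billed on demand. Function names and sample numbers are illustrative.

```python
def total_cost(hourly_vms: list[int], committed_vms: int,
               on_demand_rate: float, discount: float) -> float:
    """Cost of a usage trace given a commitment level."""
    hours = len(hourly_vms)
    committed = committed_vms * on_demand_rate * (1 - discount) * hours
    overflow = sum(max(v - committed_vms, 0) for v in hourly_vms) * on_demand_rate
    return committed + overflow


def best_commitment(hourly_vms: list[int], on_demand_rate: float, discount: float) -> int:
    """Commitment level (in VMs) that minimizes cost for a non-empty trace."""
    candidates = range(0, max(hourly_vms) + 1)
    return min(candidates,
               key=lambda c: total_cost(hourly_vms, c, on_demand_rate, discount))


# Half the day at 10 VMs, half at 50 VMs; rate and discounts are placeholders.
trace = [10] * 12 + [50] * 12
print(best_commitment(trace, on_demand_rate=1.0, discount=0.30))
print(best_commitment(trace, on_demand_rate=1.0, discount=0.57))
```

In this model a committed VM pays off once it is busy for more than one minus the discount of the time, so a deep discount can justify committing above the floor even when part of the commitment sits idle.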

How do I detect cost anomalies before the end of the month?

Enable billing export to BigQuery and run scheduled queries that compare daily spend against rolling averages. Set budget alerts at 50, 75, 90, and 100 percent of expected monthly spend so you get warning before overrun. Use Recommender to flag idle resources weekly. For BigQuery specifically, audit the INFORMATION_SCHEMA.JOBS view for queries that scan more than expected. Cost optimization architectures bake these into a weekly review rather than a monthly post-mortem. The earlier you catch a runaway pipeline, the cheaper the fix.
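The rolling-average comparison is simple enough to prototype before turning it into a scheduled query. A minimal sketch in plain Python, assuming a list of daily spend totals already pulled from the billing export; the seven-day window and 1.5x threshold are arbitrary starting points, not recommendations.

```python
from statistics import mean


def spend_anomalies(daily_spend: list[float], window: int = 7,
                    threshold: float = 1.5) -> list[int]:
    """Indexes of days whose spend exceeds threshold times the trailing average."""
    flagged = []
    for i in range(window, len(daily_spend)):
        baseline = mean(daily_spend[i - window:i])  # the previous `window` days
        if baseline > 0 and daily_spend[i] > threshold * baseline:
            flagged.append(i)
    return flagged


history = [100.0] * 7 + [105.0, 400.0, 100.0]
print(spend_anomalies(history))
```

Note that a large spike inflates the baseline for the following days, so a production version might use a median or exclude flagged days from the window.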

Does multi-region storage cost more than single-region?

Yes. Multi-region storage carries a per-GB premium over single-region, and the exact gap depends on location and storage class. Multi-region also has different egress patterns because reads from outside the multi-region area incur cross-region charges. Pick multi-region only when you have a documented need for geographic redundancy or low-latency reads from multiple regions. Most analytical workloads serve a single team in a single region, so single-region with backup to a second region is cheaper than a true multi-region bucket. Cost optimization architectures default to single-region and require a justification for multi-region.
