examlab .net The most efficient path to the most valuable certifications.
In this note ≈ 20 min

Monitoring BigQuery Performance and Cost

3,850 words · ≈ 20 min read ·

A practical PDE study guide on monitoring BigQuery usage and costs: INFORMATION_SCHEMA jobs views, audit log sinks, billing exports, slot reservations, max_bytes_billed, query labels, and anomaly alerts.

Do 20 practice questions → Free · No signup · PDE

Introduction to Monitoring BigQuery Usage and Costs

Monitoring BigQuery Usage and Costs is the daily discipline that keeps a data warehouse from quietly turning into a finance incident. Anyone preparing for the GCP Professional Data Engineer exam needs a working mental model of how BigQuery charges land in a bill, where the raw evidence lives, and which guardrails stop a single bad query from eating an entire team's budget before lunch.

This guide walks through the views, exports, alerts, and pricing levers an engineer uses in a real production project. The focus is practical: which table to query when a manager asks "who spent that $900 last night," which knob to turn when a streaming job suddenly doubles in cost, and which exam-flavored trap usually shows up around reservations and on-demand billing.

白話文解釋(Plain English Explanation)

Think of BigQuery Like a Taxi Meter on Every Query

When you hail a taxi, the meter starts ticking the moment the driver pulls away. Distance and waiting time both add to the fare. BigQuery on-demand pricing works the same way, but the "distance" is bytes scanned and the "meter" is invisible unless you check it. A query that touches a 4 TB table without partition filters runs the meter the entire trip, even if you only wanted yesterday's rows. The job of the data engineer is to mount a visible meter on the dashboard so the driver, the passenger, and the dispatcher can all see the fare climbing in real time. INFORMATION_SCHEMA.JOBS_BY_PROJECT is that meter receipt; the audit log is the GPS trail; the billing export is the monthly invoice the company actually pays.

Reservations Are Like a Gym Membership Versus Day Passes

On-demand pricing is the day pass: you pay every time you walk in, no commitment. BigQuery Editions with slot reservations are the annual gym membership: a flat monthly fee whether you visit ten times or a hundred. Day passes are great for occasional users; they are a disaster for a company with two hundred analysts running dashboards every morning. The trick is figuring out the break-even point, and that decision has to come from real usage data, not a hunch. If the membership sits idle on weekends, you can downgrade. If everyone shows up at 9 AM and waits for a treadmill, you need autoscaling slots, not more day passes. Monitoring tells you which side of that line you sit on.

Query Labels Work Like Restaurant Bill Splitting

Imagine ten coworkers ordering tapas at one table. The waiter brings one ticket. Without labels, finance just sees "BigQuery: $42,000 this month" and has no way to tell marketing, fraud, and the data science team apart. Add labels like team=marketing or pipeline=fraud_scoring to every job, and suddenly the billing export can group costs by who actually ordered the calamari. This is chargeback, and it transforms cost monitoring from a blame game into a management tool. The team that owns the spend is the team that learns to optimize.

Core Concepts of Monitoring BigQuery Usage and Costs

A few primitives keep showing up in every cost conversation, and the exam expects fluency in all of them.

Bytes processed is the on-demand billing unit. Every job logs total_bytes_processed and total_bytes_billed. The two differ when caching kicks in or when max_bytes_billed aborts a job before completion.

Slot-milliseconds (total_slot_ms) measures how much compute capacity a job actually consumed. On Editions pricing, this is the number that drives autoscaling decisions and chargeback. A query can scan few bytes but burn enormous slot-ms if it does heavy joins or window functions on already-loaded data.

Reservations and assignments decide which workloads share which pool of slots. A reservation buys a baseline of slot capacity; an assignment maps a project, folder, or organization to that reservation for one of three job types: QUERY, PIPELINE, or BACKGROUND.

Audit logs record every administrative and data-access event. Unlike INFORMATION_SCHEMA, they retain history for a configurable period and can be exported to long-term storage.

Billing export is the authoritative cost record. INFORMATION_SCHEMA tells you about jobs; the billing export tells you about money. Both views matter, and they answer different questions.

Architecture and Design Patterns

A mature monitoring stack for BigQuery usually has four layers stacked on top of each other.

The first layer is near-real-time visibility through INFORMATION_SCHEMA. Scheduled queries pull from JOBS_BY_PROJECT, JOBS_BY_USER, and JOBS_BY_ORGANIZATION every fifteen minutes, materializing aggregates into a small bq_cost_facts table. Dashboards in Looker Studio read from this fact table instead of hammering the system view directly.

The second layer is historical and cross-project audit via Cloud Logging sinks. A log sink filters on protoPayload.serviceName="bigquery.googleapis.com" and routes the matching entries into a dedicated audit_logs dataset, often partitioned by ingestion time and clustered by principalEmail. Retention is set explicitly because logs can grow fast.

The third layer is the billing export itself. Standard usage cost data and detailed usage cost data both flow into a separate billing project, partitioned by usage_start_time. The detailed export includes resource-level breakdowns that reveal which dataset or job ID drove a specific charge.

The fourth layer is proactive alerting. Cloud Monitoring policies fire when slot utilization stays above 90 percent for an hour, or when a single user crosses a daily bytes-billed threshold. Budget alerts in Cloud Billing trigger Pub/Sub notifications wired to Slack or PagerDuty.

The four layers answer different time horizons: INFORMATION_SCHEMA for "what happened in the last 180 days," audit logs for "what happened all year," billing export for "what did finance get charged," and Cloud Monitoring for "what is on fire right now." Mixing them up costs an exam point and an evening of debugging. See: https://cloud.google.com/bigquery/docs/information-schema-jobs

GCP Service Deep Dive

INFORMATION_SCHEMA.JOBS_BY_PROJECT and JOBS_BY_USER

The INFORMATION_SCHEMA.JOBS_BY_PROJECT view is the workhorse of cost analysis. Each row represents one job submitted in the project over the last 180 days. Useful columns include creation_time, user_email, job_type, statement_type, total_bytes_billed, total_slot_ms, query, labels, and error_result.

A typical hunting query for the top-spending users last week looks like this:

SELECT
  user_email,
  COUNT(*) AS jobs,
  SUM(total_bytes_billed) / POW(1024, 4) AS tb_billed,
  SUM(total_slot_ms) / 1000 / 60 AS slot_minutes
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
                        AND CURRENT_TIMESTAMP()
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY user_email
ORDER BY tb_billed DESC
LIMIT 20;

JOBS_BY_USER shows only the calling user's jobs and is what individual analysts should query when asked to investigate their own behavior. JOBS_BY_ORGANIZATION requires the BigQuery Resource Admin role and aggregates across every project under an organization node, which is how a central FinOps team builds a single pane of glass.

BigQuery Audit Logs Export

INFORMATION_SCHEMA caps at 180 days. For year-over-year analysis, regulatory retention, or forensic work after a credential leak, audit logs are the answer. Three log types matter: Admin Activity (always on, free), Data Access (must be enabled per service, charged after the free tier), and System Event.

A log sink with the filter resource.type="bigquery_resource" AND protoPayload.metadata.jobChange.job.jobStats.queryStats.totalBilledBytes > 1099511627776 routes any query that scanned more than one terabyte into a dedicated archive dataset. From there, scheduled queries summarize the events into a long-horizon cost timeline that survives the 180-day INFORMATION_SCHEMA window.

Pin the sink destination to a separate billing account or project. If audit logs land in the same project they monitor, a runaway logging cost can compound the very incident the logs are meant to reveal. See: https://cloud.google.com/bigquery/docs/reference/auditlogs

Cloud Billing Export to BigQuery

Two flavors exist: standard usage cost data and detailed usage cost data. Standard gives daily totals per SKU; detailed adds resource-level breakdowns including the BigQuery resource.name field. For chargeback, always pick the detailed export.

The export creates a dataset with tables like gcp_billing_export_resource_v1_<BILLING_ACCOUNT_ID>. A common pattern joins this against INFORMATION_SCHEMA on labels to attribute spend back to teams. The billing export is the only source of truth that finance accepts; everything else is an estimate.

Dataset Access Logs

Read access on tables is captured by Data Access audit logs when enabled. This matters for two reasons: cost, because Data Access logs themselves cost money to ingest, and compliance, because some sensitive datasets require an audit trail of every SELECT. The pattern is to enable Data Access logging only on datasets tagged with pii=true or compliance=hipaa and leave the rest at default.

Query Cost Analysis: Slot-ms vs Bytes Processed

The two metrics measure different things and confusing them is a classic exam trap.

total_bytes_billed is the on-demand pricing unit. If the project pays per-query, this column drives the bill. Caching, partition pruning, and clustering all reduce it.

total_slot_ms is the Editions pricing unit. If the project bought reserved slots, the bill is fixed and slot-ms instead measures pressure on the reservation. A query with low bytes processed but high slot-ms (think recursive CTEs over a small table) is cheap on-demand but expensive on reservations.

A useful diagnostic is the ratio: total_slot_ms / total_bytes_billed. Outliers in either direction reveal interesting workloads. High ratio means CPU-bound; low ratio means I/O-bound; the optimization strategy differs.

max_bytes_billed Safety Limit

The maximumBytesBilled job configuration option (or --maximum_bytes_billed in the bq CLI) tells BigQuery to fail the query before execution if the dry-run estimate exceeds the limit. No partial billing applies. This is the single most effective guardrail against accidental terabyte scans.

The limit can be set at three levels: per-query, per-user via custom quotas, or project-wide via defaultJobConfig on a script. Best practice is to enforce a project default of 100 GB and require explicit overrides for known heavy jobs.

PDE questions that ask "prevent a single bad query from scanning a petabyte" expect maximumBytesBilled plus a per-user custom quota on QueryUsagePerUserPerDay, not a Cloud Monitoring alert. Alerts fire after the bytes are already billed; max_bytes_billed rejects the job in the dry-run phase so no charge applies. The two controls complement each other but only the cap is preventive. See: https://cloud.google.com/bigquery/docs/best-practices-costs

The cumulative CPU time consumed by every parallel slot working on a job, summed across all workers. A 10-second query running on 1,000 slots logs total_slot_ms = 10,000,000. It is the Editions pricing pressure metric exposed in INFORMATION_SCHEMA.JOBS_BY_PROJECT and JOBS_TIMELINE_BY_PROJECT, and is independent of total_bytes_billed.

A common mistake is setting max_bytes_billed only at the user-script level and forgetting that scheduled queries, BI tool connections, and external integrations bypass that script entirely. Set the cap at the project or organization policy level for true coverage. See: https://cloud.google.com/bigquery/docs/best-practices-costs

Query Labels for Chargeback

Labels are key-value pairs attached to a job at submit time. They appear in INFORMATION_SCHEMA, audit logs, and the detailed billing export. The standard taxonomy includes team, environment, pipeline, and cost_center. A label policy enforced through a wrapper or a CI pipeline ensures every production job carries the right tags.

Without labels, the billing export can only tell you "Project X spent $50,000 on BigQuery." With labels, the same export tells you "the marketing attribution pipeline cost $12,000 and the fraud detection daily refresh cost $8,500." That granularity is what makes optimization conversations productive instead of accusatory.

Reservations Utilization

A reservation is a fixed pool of slots. Utilization is the percentage of those slots actually consumed over a window. INFORMATION_SCHEMA exposes reservation metrics through views like RESERVATIONS_TIMELINE and JOBS_TIMELINE_BY_PROJECT.

Two failure modes show up regularly. First, chronic underutilization: a reservation sized for peak load sits idle 80 percent of the time. The fix is autoscaling slots, which add capacity above a small baseline only when queries demand it. Second, chronic saturation: queries queue because every slot is busy. The fix is either increasing the baseline or moving low-priority workloads to a separate reservation with a different assignment.

BigQuery Slot Estimator

The slot estimator is a console tool that recommends a reservation size based on past query history. It models how the same workload would have performed at different baseline-plus-autoscaling configurations and shows the trade-off between cost and query latency.

The estimator is most useful when migrating from on-demand to Editions or when planning capacity for a new analytics team. It does not replace ongoing monitoring; the workload that ran last quarter is rarely the workload running next quarter.

BigQuery Cost Recommender

Active Assist publishes BigQuery-specific recommendations through the Recommender API. Common categories include unused capacity recommendations (suggesting smaller reservations), idle slot suggestions, and partition or cluster recommendations on tables that frequently scan more than necessary.

These recommendations land in the Cloud console under the project's Recommendations Hub. They can also be programmatically pulled and routed into ticketing systems so the optimization backlog stays visible to the team that owns the cost.

Anomaly Alerts

Cloud Billing supports anomaly detection on cost trends. When daily spend deviates significantly from the recent baseline, Billing fires a Pub/Sub notification. Combined with Cloud Monitoring alerting policies on slot utilization, query failure rates, and bytes-billed-per-user thresholds, the team gets warned before a runaway dashboard or a leaky service account drains the month's budget overnight.

A typical alerting policy uses a metric absence or threshold condition on bigquery.googleapis.com/slots/total_allocated_for_project and routes to PagerDuty for severity-1 incidents and Slack for everything else.

Common Pitfalls and Trade-offs

The first pitfall is trusting cached query costs. A repeated query against unchanged data returns from cache and shows zero bytes billed. This makes a workload look cheaper in development than it really is in production where caches miss constantly.

The second is partition pruning that silently breaks. A query with WHERE event_date = CURRENT_DATE() prunes correctly. The same query with WHERE event_date = (SELECT MAX(event_date) FROM events) does not, because the inner query forces a full scan first. INFORMATION_SCHEMA reveals this through unexpectedly high total_bytes_processed.

The third is mixing on-demand and reservations in one project. The pricing model is per-project per-region, but a project assigned to a reservation can still fall back to on-demand for jobs that exceed the reservation's idle-slot capacity, depending on configuration. The bill becomes harder to reason about.

The fourth is ignoring streaming insert costs. The legacy streaming inserts API charges per gigabyte ingested. Pipelines that switched to the Storage Write API got a price cut, but old code paths often linger and continue to bill at the higher rate.

The fifth is overusing Data Access logs. Enabling them organization-wide can multiply Cloud Logging costs by five or ten times. The fix is targeted enablement on sensitive datasets only.

Best Practices

  • Enforce a project-level max_bytes_billed default of 100 GB and require explicit overrides through a tracked exception process.
  • Tag every production job with team, pipeline, and environment labels and validate the tags in CI before code merges.
  • Schedule a daily INFORMATION_SCHEMA aggregation into a small fact table; have dashboards read the fact table, not the system view.
  • Export Admin Activity audit logs to a long-retention dataset and Data Access logs only for compliance-flagged datasets.
  • Run the slot estimator quarterly and compare its recommendation against actual reservation utilization before resizing.
  • Wire Cloud Billing anomaly detection to Pub/Sub and forward to the on-call channel; do not rely on email-only alerts.
  • Review the Recommender output monthly and convert actionable items into tickets with owners.
  • Document the chargeback methodology so finance, engineering, and product all read the same numbers the same way.

Real-World Use Case

A mid-sized fintech with 250 engineers ran 4 PB of monthly BigQuery scans across three product lines. Their bill grew 40 percent quarter over quarter with no obvious cause. The data platform team built a monitoring stack in two sprints.

Sprint one focused on visibility. A scheduled query aggregated INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION into a cost_facts_daily table partitioned by date and clustered by team_label. A Looker Studio dashboard surfaced the top ten spending pipelines. Within a week, the team discovered that a single fraud-scoring pipeline accounted for 38 percent of total spend, driven by a join that scanned a non-partitioned dimension table on every run.

Sprint two focused on guardrails. The team rolled out a project-level max_bytes_billed default of 200 GB, enforced label policies through a Cloud Build wrapper, and migrated three high-volume workloads from on-demand to a 500-slot Enterprise reservation with autoscaling. They wired Billing anomaly detection to PagerDuty.

Three months later, total BigQuery spend dropped 28 percent. Just as importantly, the team could now explain every cost line to finance. The CFO stopped asking "why is BigQuery so expensive" and started asking "should we expand the fraud team's reservation."

Exam Tips

The PDE exam tests whether a candidate can pick the right tool for a specific cost question. Memorize this mapping.

If the question asks "how do I see jobs from the last 90 days," answer INFORMATION_SCHEMA.JOBS_BY_PROJECT (or JOBS_BY_ORGANIZATION for cross-project).

If the question asks "how do I retain query history beyond 180 days," answer export BigQuery audit logs to a Cloud Logging sink with a BigQuery destination.

If the question asks "how do I see actual dollars charged per dataset," answer detailed Cloud Billing export to BigQuery.

If the question asks "how do I prevent a single bad query from scanning a petabyte," answer set max_bytes_billed (or maximumBytesBilled in the API) on the job, project, or via custom quota.

If the question asks "how do I attribute cost to teams sharing one project," answer query labels combined with the detailed billing export.

If the question asks "how do I decide between on-demand and reservations," answer use the slot estimator on historical workload, then validate with reservation utilization metrics.

If the question asks "how do I get notified when spending spikes unexpectedly," answer Cloud Billing anomaly detection routed through Pub/Sub.

INFORMATION_SCHEMA = jobs and slots, 180-day window. Audit logs = full retention via sinks. Billing export = actual dollars. Reservations + slot estimator = capacity planning. max_bytes_billed = the brake pedal. Labels = chargeback. See: https://cloud.google.com/bigquery/docs/best-practices-costs

Frequently Asked Questions (FAQ)

How long does INFORMATION_SCHEMA.JOBS_BY_PROJECT retain history?

The view exposes 180 days of job metadata. For longer retention, export BigQuery audit logs through a Cloud Logging sink to a dedicated BigQuery dataset. The sink can be filtered by resource.type="bigquery_resource" and partitioned by ingestion time for efficient long-horizon queries.

What is the difference between total_bytes_processed and total_bytes_billed?

total_bytes_processed is what the query engine actually read. total_bytes_billed is what the on-demand pricing model charges, rounded up to the 10 MB minimum per table. The two diverge when caching applies (billed is zero), when max_bytes_billed aborts the job, or when reservations are in use (billed reflects on-demand fallback only).

Does max_bytes_billed prevent any charges if a query fails partway through?

If the query is rejected before execution because the dry-run estimate exceeds the cap, no bytes are billed. If the query starts and then fails for another reason, partial bytes can still appear in the bill. The cap is a pre-execution gate, not an in-flight kill switch.

How do I attribute BigQuery costs to specific teams when they share one project?

Apply labels to every job at submit time, then join the detailed Cloud Billing export against those labels. Common label keys include team, pipeline, environment, and cost_center. Without labels, billing data only resolves to the project level, which makes per-team chargeback impossible.

When should I move from on-demand pricing to BigQuery Editions reservations?

Use the slot estimator in the console to simulate how historical workload would have run on different reservation sizes. The break-even depends on query patterns, not just total spend. If on-demand spend exceeds roughly 2,000 dollars per month per region with consistent daily usage, reservations usually win. Sporadic or seasonal workloads often stay cheaper on-demand.

Can I monitor BigQuery costs across multiple projects from one place?

Yes. INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION aggregates jobs across every project under an organization node, provided the caller has the BigQuery Resource Admin role. For dollar-level visibility, use a single Cloud Billing export that captures all projects under the billing account and join on labels for team-level breakdowns.

What metric should I alert on to catch slot saturation?

The Cloud Monitoring metric bigquery.googleapis.com/slots/allocated_for_project paired with bigquery.googleapis.com/query/execution_times reveals when queries queue. An alert policy that fires when allocated slots stay above 95 percent of reservation capacity for more than 30 minutes catches sustained saturation without triggering on transient spikes.

Further Reading

Official sources

More PDE topics