examlab .net The most efficient path to the most valuable certifications.
In this note ≈ 21 min

BigQuery Advanced SQL and UDFs

4,120 words · ≈ 21 min read ·

Master BigQuery SQL transformations and UDFs for the GCP PDE exam: SQL and JavaScript UDFs, remote functions, table functions, JSON, ARRAY, MERGE, and scripting.

Do 20 practice questions → Free · No signup · PDE

Introduction to BigQuery SQL Transformations and UDFs

BigQuery SQL transformations and UDFs sit at the heart of every serious analytics workload on Google Cloud. Standard SQL alone covers a lot, but real pipelines hit cases where you need custom math, third-party API enrichment, JSON parsing, or repeating logic across dozens of dashboards. That is where user-defined functions, table functions, remote functions, and procedural scripting come in. The Professional Data Engineer exam expects you to know which tool to reach for, what it costs in slot-time, and where the sharp edges live.

白話文解釋(Plain English Explanation)

Think of UDFs as a Restaurant's House Sauces

A busy restaurant kitchen does not reinvent its signature sauce every time a customer orders the steak. The chef makes a big batch of the sauce, labels the jar, and every cook on the line ladles from the same jar. A SQL UDF in BigQuery is exactly that jar of sauce. You define the recipe once (CREATE FUNCTION ds.clean_phone(s STRING) AS ...), and every analyst in the company calls ds.clean_phone(phone) instead of hand-rolling their own regex. When the recipe changes, you change one jar and the whole menu updates.

A JavaScript UDF is the equivalent of a sous-chef who knows a fancy technique the regular line cooks do not. It costs more (slower, sandboxed runtime), but it can do things the standard SQL functions cannot, like decoding a custom binary format. You bring it out only when the dish demands it.

Think of Remote Functions as Phoning the Specialist

If a customer asks for something the kitchen genuinely cannot make, the maitre d' picks up the phone and calls a nearby bakery for a specialty cake. BigQuery remote functions work the same way. The query reaches a Cloud Run or Cloud Functions endpoint, the endpoint does its work (calls a translation API, runs a model, hits a payment gateway), and the result comes back to be folded into the query. It is slower than in-process work, you pay per invocation, but it unlocks behavior that simply does not exist inside the warehouse.

Think of Table Functions as a Pre-Booked Tour

A travel agency offers a "Kyoto in three days" tour. You give them the dates and the number of people, and you receive a complete itinerary table. A BigQuery table function takes a few parameters and returns a whole table you can SELECT FROM. Instead of teaching every analyst to write the same complicated multi-CTE join for "active customers in region X over period Y," you wrap it once: SELECT * FROM ds.active_customers('APAC', '2026-01-01', '2026-03-31'). Same parameters in, same shape of table out.

Think of Scripting as a Recipe Card With Steps

A SQL UDF is one ingredient. A script is the whole recipe card with numbered steps: chop, sauté, simmer, plate. BigQuery scripting (BEGIN ... END, DECLARE, IF, LOOP, EXECUTE IMMEDIATE) lets you orchestrate a sequence of statements with shared variables and control flow inside one job. It does not replace Cloud Composer for cross-system orchestration, but for "run these eight queries in order with a temp table in between," it is the cleanest path.

Core Concepts of BigQuery SQL Transformations and UDFs

User-defined functions in BigQuery come in three execution flavors and two persistence flavors. The execution flavors are SQL UDFs (logic written in standard SQL), JavaScript UDFs (logic in JS, run in a V8 sandbox), and remote functions (logic in any language, run in your Cloud Run or Cloud Functions service). The persistence flavors are temporary (lifetime of one script or session, declared with CREATE TEMP FUNCTION) and persistent (stored in a dataset, shared across users, declared with CREATE FUNCTION).

Table functions, sometimes called table-valued functions or TVFs, are a separate construct. They look like a parameterized view: input arguments produce a result set you can join or filter further. The parser sees them as a table, not a scalar.

Procedural language wraps statements in a script. Inside a script you can DECLARE variables, run loops, branch with IF and CASE, raise errors, and execute dynamic SQL with EXECUTE IMMEDIATE. Stored procedures (CREATE PROCEDURE) are scripts saved to a dataset for reuse and parameterized invocation.

Window functions are not UDFs at all but are usually grouped with "advanced SQL" because they enable analytic patterns that would otherwise need self-joins. ROW_NUMBER, RANK, LAG, LEAD, SUM() OVER, PERCENTILE_CONT and friends partition the rowset, order it, and apply a windowed calculation without collapsing rows.

JSON handling in BigQuery has matured into two layers: legacy JSON_EXTRACT* string functions and the native JSON type with dot-access (json_col.user.id) plus JSON_VALUE, JSON_QUERY, LAX_* casting, and JSON_OBJECT. The native type stores data in a parsed columnar form, which makes path lookups much cheaper than re-parsing strings on every query.

ARRAY and STRUCT are first-class. UNNEST flattens arrays into rows, ARRAY_AGG rolls rows back into arrays, STRUCT() builds nested records inline, and SELECT AS STRUCT projects a row into a struct. Combined with repeated fields, these give you a denormalized model that often outperforms star-schema joins.

MERGE (sometimes called UPSERT) reconciles a source table against a target table in one atomic statement, supporting WHEN MATCHED, WHEN NOT MATCHED BY TARGET, and WHEN NOT MATCHED BY SOURCE branches. It is the standard pattern for slowly-changing dimensions and for applying CDC streams.

A user-defined function is a routine written in SQL or JavaScript, stored either in a dataset or for a session, that returns a scalar value when called inside a query. See https://cloud.google.com/bigquery/docs/user-defined-functions

A remote function is a SQL-callable function whose body runs in a Cloud Run or Cloud Functions HTTPS endpoint registered through a CONNECTION resource. See https://cloud.google.com/bigquery/docs/remote-functions

Architecture and Design Patterns

A typical BigQuery transformation layer has three tiers. The bronze tier holds raw ingest tables, often partitioned by ingestion date, with minimal schema enforcement. The silver tier is cleaned and conformed: phone numbers normalized, currencies converted, timestamps anchored to UTC, dedup applied. The gold tier is business-ready: aggregated marts, slowly-changing dimensions, and curated facts joined to conformed dimensions.

UDFs are the glue between tiers. SQL UDFs do the cleaning logic in silver: clean_email, parse_address, normalize_currency. They are pure functions over column values and they push down efficiently because the planner can inline them. JavaScript UDFs sit in places where SQL is awkward: bit manipulation, custom hashing, parsing a vendor's quirky text format. Remote functions live at the edges where you must call out to a model server, a lookup service, or a third-party enrichment API.

Table functions belong in the gold tier as parameterized "data products." A table function ds.daily_revenue(region STRING, start DATE, end DATE) lets a BI dashboard pass three parameters and receive a consistent shape. The logic stays in one place; the dashboards stay simple.

Stored procedures and scripts orchestrate multi-step transforms when you do not want to push the orchestration to Composer or Workflows. A nightly silver-to-gold load can be one stored procedure that begins a transaction, runs three MERGE statements, computes a quality metric, raises an error if the metric falls below threshold, and commits.

For idempotent loads, the canonical pattern is MERGE keyed on a natural or surrogate key, with WHEN MATCHED updating tracking columns and WHEN NOT MATCHED BY TARGET inserting new rows. Combine this with table partitioning on event date so each MERGE only touches recent partitions.

Inline SQL UDFs are nearly free because the optimizer expands them into the calling query. JavaScript and remote UDFs cannot be inlined, so they evaluate row by row and dominate query latency. Reach for them only when SQL truly cannot do the job. Reference: https://cloud.google.com/bigquery/docs/user-defined-functions

GCP Service Deep Dive

SQL UDFs: Persistent vs Temporary

A persistent SQL UDF lives in a dataset and is created once:

CREATE OR REPLACE FUNCTION analytics.clean_phone(raw STRING)
RETURNS STRING
AS (
  REGEXP_REPLACE(LOWER(TRIM(raw)), r'[^0-9+]', '')
);

Anyone with bigquery.routines.get on the dataset can call analytics.clean_phone(phone) from any query. Schema and dataset-level access controls apply, including row-level and column-level security on the underlying tables.

A temporary UDF lives only for the duration of the script or session:

CREATE TEMP FUNCTION pct_change(a FLOAT64, b FLOAT64) AS (
  SAFE_DIVIDE(b - a, a)
);

SELECT pct_change(prev_revenue, revenue) FROM ...;

Temporary UDFs are great for ad-hoc analysis and for keeping a multi-statement script self-contained. They cannot be referenced from another query later because they vanish when the session ends.

SQL UDFs are pure expressions. They cannot run another SELECT, cannot write to tables, and cannot have side effects. That restriction is what allows BigQuery to inline them with no overhead.

JavaScript UDFs

A JS UDF is declared with LANGUAGE js and a function body:

CREATE TEMP FUNCTION decode_payload(b BYTES)
RETURNS STRING
LANGUAGE js
AS r"""
  const text = String.fromCharCode.apply(null, new Uint8Array(b));
  return text.normalize('NFC');
""";

You can attach external libraries via OPTIONS(library=["gs://my-bucket/lodash.min.js"]). The runtime is a V8 sandbox: no network, no filesystem, no require of arbitrary modules. Memory and CPU per row are bounded; long-running JS UDFs throw UDF out of memory or UDF timed out.

Type mappings are explicit. SQL INT64 is JS Number, but values beyond 2^53 lose precision unless you map to STRING. BYTES arrives as a Uint8Array. TIMESTAMP arrives as ISO strings. Always sanity-check the mapping when porting JS code from elsewhere.

Remote Functions

Remote functions execute in your Cloud Run service or Cloud Functions endpoint. The control flow:

  1. Create a BigQuery CONNECTION of type CLOUD_RESOURCE.
  2. Grant the connection's service account the roles/run.invoker role on your service.
  3. Deploy a service that accepts a JSON POST containing a batch of rows and returns a JSON array of results.
  4. Register the function: CREATE FUNCTION ds.translate(text STRING) RETURNS STRING REMOTE WITH CONNECTION ... OPTIONS (endpoint="https://...").

BigQuery batches rows (default ~1000 per request, tunable with max_batching_rows) and parallelizes calls across slots. Each invocation costs you the underlying Cloud Run / Cloud Functions billing plus extra slot-time waiting on the response. Latency-sensitive queries should pre-warm the service or set minimum instances.

Remote functions are not transactional. If your endpoint fails halfway, BigQuery retries, which means the endpoint must be idempotent. Charging a credit card or sending an email from a remote function is a bug waiting to happen. Use them for read-only enrichment, not side effects. Reference: https://cloud.google.com/bigquery/docs/remote-functions

Table Functions

A table function (TVF) is a parameterized query that returns a table:

CREATE OR REPLACE TABLE FUNCTION analytics.recent_orders(
  region STRING, days INT64
)
AS (
  SELECT order_id, customer_id, total
  FROM analytics.orders
  WHERE region_code = region
    AND order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL days DAY)
);

Calling it: SELECT * FROM analytics.recent_orders('APAC', 30) WHERE total > 1000. The optimizer can push predicates into the function body, so the WHERE total > 1000 is applied during, not after, the function executes.

Table functions cannot mutate data, cannot return different schemas based on input, and cannot recurse. Treat them as parameterized views with one extra layer of abstraction.

JSON Functions

Two generations coexist. The first uses string functions over STRING-typed columns:

  • JSON_EXTRACT(json_str, '$.user.id') returns a JSON string.
  • JSON_EXTRACT_SCALAR(json_str, '$.user.id') returns an unquoted scalar.
  • JSON_EXTRACT_ARRAY(json_str, '$.tags') returns an array of JSON strings.

The second generation uses the native JSON type:

SELECT
  payload.user.id AS user_id,
  JSON_VALUE(payload, '$.event.name') AS event_name,
  JSON_QUERY_ARRAY(payload, '$.tags') AS tags
FROM events;

The native JSON type is stored in a parsed columnar form. Path access is roughly an order of magnitude cheaper than re-parsing strings on each query, and the storage encoding compresses better for sparse fields. Use it for anything you query repeatedly. Keep STRING only for raw archival blobs you rarely touch.

ARRAY and STRUCT Operations

Repeated and nested fields let you keep one-to-many relationships in a single row. UNNEST is the verb that flattens:

SELECT order_id, item.sku, item.qty
FROM orders, UNNEST(items) AS item;

ARRAY_AGG reverses the operation:

SELECT customer_id,
  ARRAY_AGG(STRUCT(order_id, total) ORDER BY order_date DESC LIMIT 5) AS recent
FROM orders
GROUP BY customer_id;

Filtering inside an array uses ARRAY subqueries: ARRAY(SELECT x FROM UNNEST(items) AS x WHERE x.qty > 0). Existence checks use EXISTS(SELECT 1 FROM UNNEST(items) AS i WHERE i.sku = 'SKU1').

Struct field access is dot notation: address.city. Building structs inline: STRUCT(name AS first_name, age AS years). Projection of an entire row to a struct: SELECT AS STRUCT col1, col2 FROM t.

Window Functions

Window functions calculate over a set of rows related to the current row without collapsing them. The frame is OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN ...).

Ranking: ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, NTILE. Use ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) to dedupe to the latest event per user.

Navigation: LAG, LEAD, FIRST_VALUE, LAST_VALUE, NTH_VALUE. LAG(price) OVER (PARTITION BY symbol ORDER BY ts) gives the previous tick for a moving change.

Aggregate windows: SUM(amount) OVER (PARTITION BY account ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) produces a running balance. Switch the frame to BETWEEN 6 PRECEDING AND CURRENT ROW for a 7-day rolling sum if your rows are daily.

MERGE Statement

MERGE is the upsert primitive:

MERGE analytics.dim_customer T
USING staging.customer_changes S
ON T.customer_id = S.customer_id
WHEN MATCHED AND T.updated_at < S.updated_at THEN
  UPDATE SET name = S.name, email = S.email, updated_at = S.updated_at
WHEN NOT MATCHED BY TARGET THEN
  INSERT (customer_id, name, email, updated_at)
  VALUES (S.customer_id, S.name, S.email, S.updated_at)
WHEN NOT MATCHED BY SOURCE AND T.is_active = TRUE THEN
  UPDATE SET is_active = FALSE, deactivated_at = CURRENT_TIMESTAMP();

Three branches cover the full SCD-1 and soft-delete pattern. Always include a partition predicate on the target side (AND T.event_date BETWEEN ... AND ...) so BigQuery scans only the partitions that can possibly match. Without that, MERGE becomes a full table scan and bills accordingly.

Scripting and Procedural Language

Scripts wrap multiple statements in BEGIN ... END and add variables and control flow:

DECLARE max_date DATE;
DECLARE row_count INT64 DEFAULT 0;

BEGIN
  SET max_date = (SELECT MAX(event_date) FROM raw.events);

  IF max_date < CURRENT_DATE() - 1 THEN
    RAISE USING MESSAGE = 'Stale data: latest is ' || CAST(max_date AS STRING);
  END IF;

  MERGE analytics.facts T USING raw.events S ON T.id = S.id
  WHEN NOT MATCHED THEN INSERT ROW;

  SET row_count = @@row_count;
  SELECT row_count AS rows_inserted;
EXCEPTION WHEN ERROR THEN
  INSERT INTO ops.errors VALUES (CURRENT_TIMESTAMP(), @@error.message);
  RAISE;
END;

EXECUTE IMMEDIATE runs dynamic SQL: useful for templating across partitions or for tools that generate SQL on the fly. Stored procedures wrap a script as a callable routine: CREATE PROCEDURE ds.run_daily() then CALL ds.run_daily(). Procedures can have IN, OUT, and INOUT parameters and can return result sets to the caller.

Use BEGIN TRANSACTION; ... COMMIT TRANSACTION; inside a script for multi-statement atomicity over the same set of tables. If any statement fails, ROLLBACK TRANSACTION undoes the lot. This is how you build correct SCD loads without leaning on external orchestration. Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/transactions

Common Pitfalls and Trade-offs

JavaScript UDFs feel powerful but quietly throttle queries. A query that processes a billion rows through a JS UDF can take ten times longer than the same query in pure SQL because each row crosses the SQL-to-V8 boundary. Always benchmark on a representative slice before promoting to production.

Persistent UDFs create a hidden dependency graph. If a downstream pipeline calls analytics.clean_phone and someone changes the regex without warning, every dashboard shifts. Treat persistent UDFs like public APIs: version them (clean_phone_v2), document them, and announce changes.

Remote functions add a network hop and a billing dimension. A remote function that costs 2 ms per call and is invoked on 100 million rows is more than two days of compute time, parallelism notwithstanding. Audit usage in INFORMATION_SCHEMA.JOBS_BY_PROJECT before letting them run unattended.

MERGE without a partition predicate scans the whole target table. On a multi-terabyte table that is a five-figure mistake per run. Always include the partition column in the ON clause or in an AND filter on the target.

JSON STRING columns look fine until you query them millions of times. Migrate to the JSON type early; the migration is painful at year five and trivial at month three.

Window functions on huge unbounded partitions can spill to disk and slow queries. Add PARTITION BY to a high-cardinality column when you can: ranking events per user is fast, ranking events globally is slow.

Scripts mask billing. A script that runs ten queries shows ten child jobs, each with its own slot-time and bytes-billed. The parent script row in INFORMATION_SCHEMA.JOBS aggregates them, but it is easy to underestimate cost from a tidy-looking procedure.

A common interview trap: someone "optimizes" a query by wrapping its WHERE clause logic in a JavaScript UDF for "reusability." The result is a 50x slowdown because the JS UDF disables predicate pushdown to partitions and clusters. SQL UDFs preserve pushdown; JS UDFs do not. Reference: https://cloud.google.com/bigquery/docs/user-defined-functions

Best Practices

  • Default to SQL UDFs. Move to JavaScript only for logic SQL truly cannot express, and to remote functions only for external services.
  • Version persistent routines (clean_phone_v2, dim_customer_load_v3). Treat them as public contracts.
  • Always include the target's partition column in MERGE predicates. Verify the bytes-billed before scheduling a recurring run.
  • Use the native JSON type for any column queried more than a handful of times. Keep STRING only for raw archival or pass-through.
  • Batch remote function inputs aggressively. Tune max_batching_rows to the latency profile of your service. Set Cloud Run minimum instances if cold starts hurt.
  • Wrap multi-statement loads in BEGIN TRANSACTION blocks for atomicity. Prefer explicit transactions over hoping nothing fails between statements.
  • Profile JS UDFs on a 1% sample first. If the UDF is the bottleneck, see if a SQL rewrite or a precomputed lookup table replaces it.

SQL UDFs inline. JavaScript UDFs run in V8 row-by-row. Remote functions cross the network in batches. Choose the cheapest option that does the job.

Real-World Use Case

A mid-sized fintech, around 400 employees and 8 million customer accounts, runs a fraud-scoring pipeline on BigQuery. Card transactions stream in through Pub/Sub and Dataflow into a partitioned transactions table at roughly 50 GB per day.

The silver layer cleans inputs with three SQL UDFs: clean_merchant_name normalizes merchant strings using regex; to_iso_currency converts amount + local currency into USD using a daily rates table; risk_country flags ISO country codes against a watchlist. All three are persistent, stored in silver_routines, owned by the data platform team, and versioned.

The scoring layer calls a remote function ml.score_transaction that hits a Cloud Run service running an XGBoost model. Batches of 500 rows go out per request; the service auto-scales to 50 minimum instances during business hours to avoid cold starts. A nightly batch backfill processes the previous day's transactions in about 20 minutes; an hourly micro-batch handles same-day risk monitoring.

A stored procedure gold.run_fraud_marts() runs every hour. It opens a transaction, executes a MERGE to update the fraud-score dimension, runs a window function over the last 24 hours per customer to compute velocity features, and refreshes a materialized view that powers the analyst dashboard. If the hourly run fails, the EXCEPTION block writes to ops.errors and a Cloud Monitoring alert fires.

Analysts query a table function gold.high_risk_recent('US', 7) to pull the past week's high-risk events for a region. The function applies the right joins, filters, and SCD-2 logic in one place; analysts see a flat result set.

The whole platform uses around 1200 slot-hours per day for transformations, with the remote function accounting for roughly 30% of cost. A quarterly review prunes unused UDFs, deprecates v1 routines, and revisits the JS-vs-SQL split to keep slot consumption bounded.

Exam Tips

The PDE exam tests judgment, not syntax memorization. Expect questions of the shape "you have requirement X, which transformation construct fits best."

  • A scenario asks for "reusable cleaning logic across many queries with no performance overhead." That is a SQL UDF, persistent, in a shared dataset.
  • A scenario describes "calling a Hugging Face model from inside a query." That is a remote function with a Cloud Run endpoint and a CONNECTION.
  • A scenario describes "processing arrays of struct fields per row to compute a custom score with branching logic per element." Lean toward SQL with UNNEST first; only pick JavaScript UDF if SQL cannot express the logic naturally.
  • A scenario describes "running 10 SQL statements in order, with rollback on failure, scheduled nightly." That is a stored procedure with BEGIN TRANSACTION ... COMMIT TRANSACTION and EXCEPTION WHEN ERROR. Cloud Composer is overkill if everything runs inside BigQuery.
  • A scenario about "applying CDC inserts, updates, and deletes from a staging table." That is MERGE with all three WHEN branches, partition-predicate-pruned.
  • A scenario about "deduping events by user, keeping the latest." That is ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) filtered to row 1.
  • A scenario where "queries on a JSON string column are slow and getting slower." Migrate to the native JSON type.
  • A scenario warning that "a remote function is invoked billions of times." Push back: precompute a lookup table or use a SQL UDF if at all possible. Cost question, not capability question.

Watch for distractor answers that suggest "use a Dataflow job" when the work is pure transformation already inside BigQuery. The exam rewards keeping work in the warehouse when possible.

PDE exam scenarios distinguishing UDF types almost always reduce to: SQL UDF for performance and reuse, JavaScript UDF for logic SQL cannot express, remote function for external services or models. Memorize that triage. Reference: https://cloud.google.com/bigquery/docs/user-defined-functions

Frequently Asked Questions (FAQ)

When should I use a JavaScript UDF instead of a SQL UDF?

Only when the logic genuinely cannot be expressed in SQL. Examples: bit-twiddling, custom hashing algorithms, parsing a vendor's binary format, normalizing Unicode in a non-trivial way. For everything else, SQL UDFs are dramatically faster because the optimizer inlines them, while JS UDFs cross into a V8 sandbox per row and disable some pushdown optimizations.

Can a remote function be transactional with the rest of the query?

No. The remote endpoint runs outside BigQuery, with no transactional guarantees back to the SQL engine. BigQuery may also retry batches if a request fails, so your endpoint must be idempotent. Treat remote functions as read-only enrichment. Do not perform side effects like sending emails or charging cards from inside one.

What is the difference between a table function and a view?

A view is a saved query with no parameters; a table function is a saved query with parameters that returns a table. Table functions accept arguments at call time, while views are static. Use views for fixed projections and table functions when callers need to pass region, date range, or other inputs.

How does MERGE compare to INSERT plus UPDATE plus DELETE?

MERGE performs all three operations atomically against the target in one statement, with one scan of the source. Doing them separately requires three statements, three scans, and explicit transaction management. MERGE is faster, cheaper, and the canonical pattern for SCD and CDC. Always include a partition filter on the target to keep cost bounded.

Should I store JSON as STRING or as the native JSON type?

For anything queried more than occasionally, use the native JSON type. It stores values in a parsed columnar layout, so path access is roughly an order of magnitude cheaper than re-parsing strings each time. Keep STRING only for raw, write-once archival blobs you rarely touch.

Are temporary UDFs visible across multi-query scripts?

Yes, but only within the same script or session. A CREATE TEMP FUNCTION declared at the top of a script is visible to every statement in that script. The moment the script ends or the session closes, the function is gone. Persistent UDFs (CREATE FUNCTION) live in a dataset and survive across queries and users with the right permissions.

What happens when a stored procedure raises an exception inside a transaction?

If you have an explicit BEGIN TRANSACTION and an unhandled exception fires, BigQuery rolls back the transaction. With an EXCEPTION WHEN ERROR THEN block, you can log the failure, run cleanup, and decide whether to RAISE (which propagates and rolls back) or swallow the error (commits whatever was done so far). Always rethrow unless you have a specific recovery in mind.

Further Reading

Official sources

More PDE topics