BigQuery Omni Multi-cloud Analytics

Deep guide to BigQuery Omni multi-cloud analytics on AWS S3 and Azure Blob — BigLake connections, cross-cloud queries, EXPORT DATA, residency, pricing, and PDE exam patterns.

Introduction to BigQuery Omni Multi-Cloud Analytics

BigQuery Omni multi-cloud analytics extends the BigQuery query engine into AWS and Azure so you can analyze data sitting in S3 or Blob Storage without copying it to Google Cloud. For a Professional Data Engineer, this is the canonical answer whenever the exam describes a workload pinned to another cloud by regulation, gravity, or contract. You keep one SQL surface, one IAM model on the GCP side, and one set of BI tools, while the bytes never cross a cloud boundary unless you ask them to.

The service is not a connector or a data-transfer pipeline. It is a real query engine — Dremel slots — running inside Google-managed clusters that live in AWS and Azure regions. Understanding that placement is half the exam: the slots are remote, the metadata is in BigQuery, and a thin layer called the cross-cloud transfer service is what carries result rows back when you ask for them.

Plain English Explanation

Think of it as a Visiting Professor, Not a Library Loan

A library loan moves the book to you. A visiting professor flies to the campus where the book lives, reads it on site, and walks back with notes. BigQuery Omni multi-cloud analytics behaves like the visiting professor. Your S3 bucket never ships its contents to Google; instead, a BigQuery worker spins up inside AWS, scans the Parquet files locally, and brings home a small summary. The travel cost is the professor's plane ticket — a few kilobytes of result rows — not the entire library.

Think of it as a Food Truck Parked Next to the Farm

If you want a salad made from a particular farm's lettuce, you have two options. Truck the lettuce to a downtown restaurant, or park a food truck next to the farm. The farm-side truck wastes no refrigeration, no fuel, and no time. BigQuery Omni multi-cloud analytics is the food truck model for analytics. The compute follows the data. The only thing leaving the farm is the finished salad — your aggregated query result.

Think of it as a Diplomatic Embassy

An embassy is sovereign territory of one country physically located inside another. Google operates what is essentially a BigQuery embassy inside AWS us-east-1 and Azure eastus2. Inside that embassy, Google rules apply: BigQuery slots, BigQuery SQL, BigQuery IAM. Outside the gate, AWS or Azure rules apply to the underlying storage. The two governments coordinate through a treaty — the BigLake connection — that grants the embassy permission to read documents in the host country's archives.

Core Concepts of BigQuery Omni Multi-Cloud Analytics

A handful of building blocks show up in nearly every exam scenario.

BigLake Connection

The connection is a first-class BigQuery resource that holds the cross-cloud identity. For AWS, it stores an IAM role ARN and the trust policy needed for AWS Security Token Service to vend temporary credentials. For Azure, it holds an application registration and federated identity credential. Without a connection, BigQuery has no legal way to read S3 or Blob Storage.

BigLake Table

A BigLake table is the queryable object that wraps remote files. It carries a schema, a URI pattern such as s3://lake-prod/clickstream/dt=*/*.parquet, partition definitions, and a reference to the connection. A BigLake table on Omni gives you fine-grained access control — column masking, row-level security, and policy tags — even though the storage lives outside Google Cloud.
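As a rough illustration of how those pieces fit together, the sketch below assembles a CREATE EXTERNAL TABLE statement as a string. The table name, connection ID, and bucket path are hypothetical placeholders, and the example assumes a single trailing wildcard with Hive-style partition detection rather than the multi-wildcard pattern quoted above; check the DDL reference before relying on the exact option set.

```python
def biglake_table_ddl(table: str, connection: str, uri: str, hive_prefix: str) -> str:
    """Return a CREATE EXTERNAL TABLE statement for Parquet files on S3.

    All four arguments are illustrative; nothing here talks to BigQuery.
    """
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        "WITH PARTITION COLUMNS\n"                    # infer partition keys from key=value prefixes
        f"WITH CONNECTION `{connection}`\n"           # connection is referenced as location.connection_id
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{uri}'],\n"
        f"  hive_partition_uri_prefix = '{hive_prefix}'\n"
        ")"
    )


# Hypothetical names for illustration only.
ddl = biglake_table_ddl(
    table="aws_dataset.clickstream",
    connection="aws-us-east-1.s3-lake-conn",
    uri="s3://lake-prod/clickstream/*",
    hive_prefix="s3://lake-prod/clickstream",
)
print(ddl)
```

The schema is left out so that it can be auto-detected from the Parquet files; an explicit column list could be added between the table name and the WITH clauses.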

Omni Region

BigQuery Omni regions are named to mirror their host: aws-us-east-1, aws-us-west-2, aws-eu-west-1, aws-ap-northeast-2, azure-eastus2, azure-westeurope, and a growing list. A dataset created in an Omni region can only contain BigLake tables pointing at storage in that exact region. Cross-region reads are not allowed inside a single query.

Cross-Cloud Transfer

The LOAD DATA and CREATE TABLE AS SELECT statements with an Omni source and a Google Cloud destination invoke the cross-cloud transfer service, which streams compressed result rows from the Omni region back to a Google-managed BigQuery region. This is the legal way to bring data home for joining with native datasets.

Materialized Result Cache

Query results from Omni are cached in the Omni region for 24 hours just like native BigQuery, so dashboard refreshes that hit the same SQL within a day are essentially free.

Definition (BigLake connection): A BigQuery resource that stores the cross-cloud identity (AWS IAM role or Azure service principal) needed for the Omni query engine to authenticate against S3 or Blob Storage. Required before you can create any BigLake table on remote storage. See https://cloud.google.com/bigquery/docs/omni-aws-create-connection

Architecture and Design Patterns

The deployment topology matters because the exam loves to test where each component physically lives.

Control Plane Stays in Google Cloud

Job submission, query parsing, billing, audit logging, IAM resolution, and the result cache metadata all live in Google Cloud. When you point the BigQuery console at an aws-us-east-1 dataset, you are still talking to the BigQuery API at bigquery.googleapis.com. The control plane then dispatches the query plan to the Omni cluster in AWS.

Data Plane Runs in the Host Cloud

The actual slots — the workers that read Parquet, evaluate predicates, and shuffle intermediate data — run in a Google-managed VPC inside AWS or Azure. These workers read S3 or Blob Storage over private cloud networking, not over the public internet, which is what keeps egress charges off your AWS bill.

Pattern 1: Federate, Then Aggregate, Then Bring Home

The most common production pattern is a nightly job that runs on Omni, aggregates a billion-row clickstream into a million-row daily summary, and uses CREATE TABLE ... AS SELECT with a destination table in a native US dataset to ship the summary home. Joins against CRM data, marketing spend, or finance happen in native BigQuery. Bytes scanned in Omni get billed once; the summary lives in cheap native storage from then on.

Pattern 2: Live BI on Remote Lake

Looker or Tableau points directly at the Omni dataset. Users get sub-minute freshness on data that has never been copied. This pattern is right when residency rules forbid copying or when the source data refreshes too fast for nightly transfer to keep up.

Pattern 3: Hub-and-Spoke Across Three Clouds

A single Google Cloud project becomes the analytical hub. Omni datasets in aws-us-east-1, azure-eastus2, and a native US multi-region all live side by side. Each spoke holds raw data; the hub holds curated marts. Cross-cloud transfer moves only the small results.

A single SQL query in BigQuery Omni cannot read from two different Omni regions in one statement, and it cannot join an Omni table with a native BigQuery table directly. Use cross-cloud transfer or scheduled queries to materialize results in a region where the join can happen. See https://cloud.google.com/bigquery/docs/omni-introduction

GCP Service Deep Dive

BigQuery Omni on AWS

For AWS, the connection setup is a four-step dance. You create a BigLake connection, BigQuery hands you a Google service account identifier, you create an AWS IAM role whose trust policy allows that Google identity to assume it, and you attach an S3 read policy to the role. From then on, CREATE EXTERNAL TABLE ... WITH CONNECTION and BigLake table syntax work as expected.
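The two AWS-side documents in that sequence can be sketched as plain JSON. The example below is a minimal sketch, assuming the role is assumed through web identity federation against accounts.google.com and that read access means s3:ListBucket plus s3:GetObject; the identity string and bucket name are hypothetical, and the exact policy your organization needs may differ.

```python
import json


def trust_policy(google_identity: str) -> dict:
    """Trust policy letting one Google identity assume the AWS role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": "accounts.google.com"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                # Only the identity BigQuery reports for this connection may assume the role.
                "Condition": {
                    "StringEquals": {"accounts.google.com:sub": google_identity}
                },
            }
        ],
    }


def s3_read_policy(bucket: str) -> dict:
    """Read-only policy: list the bucket and fetch its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:ListBucket"],
             "Resource": [f"arn:aws:s3:::{bucket}"]},
            {"Effect": "Allow", "Action": ["s3:GetObject"],
             "Resource": [f"arn:aws:s3:::{bucket}/*"]},
        ],
    }


# Hypothetical identity and bucket, for illustration only.
trust = trust_policy("000000000000000000000")
read = s3_read_policy("lake-prod")
print(json.dumps(trust, indent=2))
print(json.dumps(read, indent=2))
```

Keeping the trust policy and the permission policy separate mirrors the steps above: the first controls who may assume the role, the second controls what the role may read.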

S3 buckets must live in a region that has a matching Omni region. As of 2026, supported pairings include aws-us-east-1, aws-us-west-2, aws-eu-west-1, and aws-ap-northeast-2. Bucket policies should restrict access to the role ARN that BigQuery assumes, plus an aws:SourceAccount condition so Google's role can only act on behalf of your project.

BigQuery Omni on Azure

Azure uses workload identity federation. You register an application in Microsoft Entra ID, add a federated identity credential pointing back to Google's STS issuer, and grant that application the Storage Blob Data Reader role on the target container. The Omni regions on Azure are fewer — azure-eastus2 and azure-westeurope are the workhorses — so plan storage placement before promising stakeholders a region.

BigQuery Connection Service

The same bq mk --connection command creates AWS, Azure, Cloud SQL, Spanner, and Spark connections. For Omni, you specify --connection_type=AWS or --connection_type=AZURE and supply the role ARN or tenant identifier. Connections are regional resources, which is why the connection itself lives in the Omni region, not in US or EU multi-region.
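For the AWS case, the command line might be assembled as in the sketch below. It only builds the argument list and does not run anything. The --iam_role_id flag name is an assumption drawn from the connection documentation as I recall it, and the role ARN and connection ID are hypothetical, so verify against bq help mk before use.

```python
import shlex


def bq_mk_aws_connection(connection_id: str, location: str, role_arn: str) -> list:
    """Build (but do not execute) a bq command that creates an AWS connection."""
    return [
        "bq", "mk", "--connection",
        "--connection_type=AWS",
        f"--iam_role_id={role_arn}",   # assumed flag name; confirm with `bq help mk`
        f"--location={location}",      # the Omni region, not US or EU
        connection_id,
    ]


# Hypothetical values for illustration only.
cmd = bq_mk_aws_connection(
    connection_id="s3-lake-conn",
    location="aws-us-east-1",
    role_arn="arn:aws:iam::123456789012:role/bq-omni-read",
)
print(shlex.join(cmd))
```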

Cross-Cloud Transfer Mechanics

CREATE TABLE us_dataset.daily_summary AS SELECT ... FROM aws_dataset.clickstream WHERE dt = CURRENT_DATE() triggers two phases. The Omni slots execute the SELECT inside AWS, materialize the result to a managed staging area in S3, and then the transfer service streams that staging file across the cloud boundary into native BigQuery storage. You are billed for slot usage in Omni, plus a per-byte cross-cloud transfer fee on the result size, plus normal BigQuery storage on the destination.
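A parameterized version of that statement could be generated as follows. This is a sketch only: the page column and views alias are invented for illustration, dt is assumed to be a DATE partition column, and the point is simply that the aggregation sits inside the SELECT so that only summary rows cross the cloud boundary.

```python
from datetime import date


def daily_summary_ctas(dest_table: str, source_table: str, day: date) -> str:
    """Build a CTAS whose SELECT runs in Omni and whose result lands in native BigQuery."""
    return (
        f"CREATE TABLE `{dest_table}` AS\n"
        "SELECT dt, page, COUNT(*) AS views\n"      # hypothetical columns
        f"FROM `{source_table}`\n"
        f"WHERE dt = DATE '{day.isoformat()}'\n"    # prune to a single partition
        "GROUP BY dt, page"
    )


sql = daily_summary_ctas(
    dest_table="us_dataset.daily_summary",
    source_table="aws_dataset.clickstream",
    day=date(2026, 5, 12),
)
print(sql)
```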

Exam scenarios that mention "data must remain in AWS S3 or Azure Blob Storage" expect BigQuery Omni, not Storage Transfer Service or Dataflow ingestion. The Dremel slots run inside a Google-managed VPC in aws-us-east-1 or azure-eastus2, read S3 or Blob Storage over private cloud networking, and return only result rows via the cross-cloud transfer service — keeping bulk-scan egress off your AWS or Azure bill. See https://cloud.google.com/bigquery/docs/omni-introduction

Federated Queries vs. Omni vs. External Tables

Three terms collide on the exam. Federated queries (the EXTERNAL_QUERY function) push SQL down to Cloud SQL or Spanner inside Google Cloud. External tables read GCS, Drive, or Bigtable from a native BigQuery region. Omni runs the BigQuery engine itself inside AWS or Azure. If a question mentions S3 or Blob Storage and the goal is analytical SQL at scale, the answer is Omni — never federated query, never plain external table.

For predicate pushdown to work on S3 Parquet, the BigLake table must include explicit partition columns matching your S3 prefix layout (for example dt=2026-05-12/). Without partition definitions, every query scans the whole bucket and your slot bill explodes. Hive-style partitioning is detected automatically when you set hive_partition_uri_prefix. See https://cloud.google.com/bigquery/docs/omni-aws-create-external-table
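To make the effect of partition pruning concrete, the sketch below lists the Hive-style prefixes a date-range filter would need to touch. It is a conceptual model, not how the engine is implemented, and the bucket path is hypothetical.

```python
from datetime import date, timedelta


def partition_prefixes(base: str, start: date, end: date) -> list:
    """List the dt=YYYY-MM-DD/ prefixes covered by an inclusive date range."""
    days = (end - start).days
    return [
        f"{base}/dt={(start + timedelta(days=i)).isoformat()}/"
        for i in range(days + 1)
    ]


# A three-day filter touches three prefixes, however many years the bucket holds.
prefixes = partition_prefixes(
    "s3://lake-prod/clickstream", date(2026, 5, 10), date(2026, 5, 12)
)
for p in prefixes:
    print(p)
```

Without partition columns on the table, the engine has no way to map a WHERE clause on dt to this short list, so every prefix under the base path is a candidate for scanning.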

Common Pitfalls and Trade-offs

Region Mismatch is the Number One Failure Mode

Creating a connection in aws-us-east-1 and then trying to point it at a bucket in us-west-2 returns an opaque permission error. The fix is always to align dataset region, connection region, and bucket region to the same physical AWS region.
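A pre-flight check along these lines can catch the mismatch before any DDL runs. It is a minimal sketch that assumes Omni locations follow the cloud-prefix naming shown earlier (aws-us-east-1, azure-eastus2) and that you already know the bucket's host-cloud region; the function names are hypothetical.

```python
def host_region(omni_location: str) -> tuple:
    """Split an Omni location such as 'aws-us-east-1' into (cloud, host region)."""
    cloud, _, region = omni_location.partition("-")
    if cloud not in ("aws", "azure") or not region:
        raise ValueError(f"not an Omni location: {omni_location!r}")
    return cloud, region


def check_alignment(dataset_location: str, connection_location: str, bucket_region: str) -> list:
    """Return a list of human-readable problems; an empty list means aligned."""
    problems = []
    if dataset_location != connection_location:
        problems.append(
            f"dataset is in {dataset_location} but connection is in {connection_location}"
        )
    _, region = host_region(dataset_location)
    if region != bucket_region:
        problems.append(
            f"dataset expects storage in {region} but bucket is in {bucket_region}"
        )
    return problems


print(check_alignment("aws-us-east-1", "aws-us-east-1", "us-west-2"))
```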

You Cannot Join Omni and Native in One Query

This trips up engineers who are used to BigQuery's "everything is just a table" feel. An Omni table from AWS and a native table from US cannot appear in the same FROM clause. The workaround is materialization — bring one side home first, then join.

Cross-Cloud Transfer is Not Free

Transferring results back to Google Cloud carries a per-GB charge that is similar to AWS egress for the same volume. Engineers sometimes assume Omni eliminates all egress; what it eliminates is the bulk scan egress, not the result transfer. Aggregate aggressively before hitting CREATE TABLE AS SELECT.

Slot Reservations are Region-Pinned

Omni slot commitments are bought per Omni region. Slots purchased for aws-us-east-1 cannot run a query in azure-eastus2 or in a native US dataset. Capacity planning for Omni is therefore tighter than for native BigQuery, and on-demand pricing for Omni queries is also region-scoped.

Not Every BigQuery Feature Lands in Omni at Launch

Some features arrive in native BigQuery first and reach Omni months later. Notable historical gaps have included DML on certain table types, specific BQML model types, and some geo functions. When the exam offers a choice between an experimental feature and a standard pattern, prefer the pattern.

Result Set Size Limits

Cross-cloud transfer has size guardrails on a per-job basis. Trying to transfer a multi-terabyte result in one statement will fail. Chunk the transfer by date partition or by hash bucket to stay inside the limits.
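Chunking by date can be sketched as below. The helper splits an inclusive date range into fixed-size windows and renders a filter for each; the window size is an arbitrary illustration, since the actual per-job limit is not stated here, and dt is assumed to be a DATE partition column.

```python
from datetime import date, timedelta


def date_chunks(start: date, end: date, days_per_chunk: int) -> list:
    """Split [start, end] into consecutive inclusive windows of at most days_per_chunk days."""
    if days_per_chunk < 1:
        raise ValueError("days_per_chunk must be at least 1")
    chunks = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=days_per_chunk - 1), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return chunks


def chunk_filter(chunk: tuple) -> str:
    """Render one window as a WHERE clause on the dt partition column."""
    lo, hi = chunk
    return f"WHERE dt BETWEEN DATE '{lo.isoformat()}' AND DATE '{hi.isoformat()}'"


chunks = date_chunks(date(2026, 5, 1), date(2026, 5, 10), 4)
for c in chunks:
    print(chunk_filter(c))
```

Each filter would then be dropped into its own transfer statement, so a failure in one window can be retried without redoing the others.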

A common architecture mistake is to build a Looker dashboard on an Omni table and discover that every page refresh costs real money because the result cache scope is per-user, per-region. Either materialize a daily summary into native BigQuery for BI, or buy a slot reservation in the Omni region so query cost is fixed. See https://cloud.google.com/bigquery/docs/omni-introduction

Best Practices

  • Place BigLake connections, datasets, and storage buckets in the same physical region from day one — retrofitting is painful because you cannot move a dataset across regions.
  • Always define partitions on BigLake tables that mirror the prefix structure of S3 or Blob Storage, and rely on Hive partitioning detection when prefixes follow key=value/ conventions.
  • Convert raw CSV or JSON in S3 to Parquet or ORC before pointing Omni at it; columnar formats cut bytes scanned by an order of magnitude.
  • Use BigLake row-level and column-level access policies to enforce sensitive-data controls in one place rather than mirroring them inside AWS Lake Formation.
  • For repeated analytical workloads, buy Omni slot reservations rather than paying on-demand; reservations also unlock predictable performance for BI tools.
  • Aggregate inside Omni before transferring results back; treat cross-cloud transfer as expensive, even when the per-GB rate looks small.
  • Tag every Omni job with a labels map identifying the cost center so the cross-cloud transfer line item on the invoice is attributable.
  • Pin BI workloads to materialized native tables refreshed on a schedule, leaving Omni for ad-hoc analytical queries and daily ETL.
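The labeling practice in the list above can be sketched as a REST-style job body. This assumes the jobs.insert shape with configuration.labels and configuration.query; the validation regex is a deliberately conservative approximation of the label rules, and the label keys and values are hypothetical.

```python
import re

# Conservative approximation: lowercase letters, digits, underscores, dashes, max 63 chars.
_KEY_RE = re.compile(r"^[a-z][a-z0-9_-]{0,62}$")
_VALUE_RE = re.compile(r"^[a-z0-9_-]{0,63}$")


def labeled_query_job(sql: str, labels: dict) -> dict:
    """Build a query job body that carries cost-attribution labels."""
    for key, value in labels.items():
        if not _KEY_RE.match(key):
            raise ValueError(f"invalid label key: {key!r}")
        if not _VALUE_RE.match(value):
            raise ValueError(f"invalid label value: {value!r}")
    return {
        "configuration": {
            "labels": dict(labels),
            "query": {"query": sql, "useLegacySql": False},
        }
    }


# Hypothetical cost-center labels.
job = labeled_query_job(
    "SELECT COUNT(*) FROM `aws_dataset.clickstream`",
    {"cost_center": "fin-042", "pipeline": "nightly-rollup"},
)
print(job["configuration"]["labels"])
```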

Real-World Use Case

A European insurance company has thirty years of policy history sitting in an on-prem mainframe, two years of claims telemetry being written into Azure Blob Storage in azure-westeurope (a regulatory requirement of the Swiss FINMA framework), and three years of marketing-funnel events flowing into AWS S3 in aws-eu-west-1 from their advertising stack. They standardized BI on Looker connected to BigQuery in the EU multi-region.

Their architecture uses BigQuery Omni as the legal bridge. An Omni dataset in azure-westeurope reads claims telemetry through a BigLake table that enforces column masking on Swiss insured-person identifiers. A second Omni dataset in aws-eu-west-1 reads marketing events. Each night, two scheduled queries roll up the previous day's data into roughly one-thousandth of its raw size and use cross-cloud transfer to push the result into the native EU dataset. Looker only ever queries the native dataset.

The compliance team is happy because raw claims data has never left Azure West Europe. The finance team is happy because the cross-cloud transfer line item runs about €400 a month versus the €30,000 a month they were quoted for nightly bulk export. The data team is happy because they write standard BigQuery SQL for everything, and because BigLake gives them a single audit trail for cross-cloud access.

Exam Tips

The PDE exam frames BigQuery Omni multi-cloud analytics in a few predictable ways. Watch for the following signals:

  • "Data must remain in AWS / Azure for compliance" — Omni is almost always the answer; Storage Transfer Service is a distractor because it copies the data out.
  • "Petabyte-scale dataset in S3 needs ad-hoc SQL" — Omni, not external tables on GCS, because the data is not on GCS.
  • "Join an S3 dataset with a CRM table in BigQuery US" — answer involves a two-step pattern: aggregate in Omni, transfer to native, then join.
  • "Cheapest way to run a one-off query against 50 TB in Blob Storage without copying" — Omni on-demand, with proper partition predicates so bytes scanned stay small.
  • "Single dashboard across three clouds" — Omni hub-and-spoke with summaries materialized to native BigQuery.
  • "Cross-cloud query failed with permission error" — diagnostic answer is region mismatch between connection and bucket, or a missing trust policy on the AWS IAM role.

Memorize the Omni region naming convention. Memorize that the connection holds the cross-cloud identity. Memorize that joining Omni with native requires materialization. Those three facts cover most exam questions.

Three commit-to-memory facts for BigQuery Omni: (1) The query engine runs inside AWS or Azure; only result rows return to Google Cloud. (2) An Omni dataset's region (e.g. aws-us-east-1) must match the storage bucket's region exactly. (3) Joining Omni and native BigQuery in one query is not allowed — use cross-cloud transfer first. See https://cloud.google.com/bigquery/docs/omni-introduction

Pricing Model

Omni billing has three meters that differ from native BigQuery, and one that behaves the same.

Omni Slot Pricing

On-demand Omni queries are billed per TB scanned at a rate roughly 25-30% higher than native BigQuery on-demand, reflecting the cost of running Google-managed infrastructure inside AWS or Azure. Editions pricing — Standard, Enterprise, Enterprise Plus — is also available, with reservations purchased per Omni region. There is no autoscaling across Omni regions; each region has its own pool.
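The uplift is easy to turn into a back-of-the-envelope estimate. In the sketch below the native per-TB rate is a placeholder you must replace with the current price sheet value; only the 25-30% range comes from the text above.

```python
def omni_on_demand_cost(tb_scanned: float, native_rate_per_tb: float, uplift: float) -> float:
    """Estimate an Omni on-demand bill as the native rate plus a fractional uplift."""
    if not 0 <= uplift:
        raise ValueError("uplift must be non-negative")
    return tb_scanned * native_rate_per_tb * (1 + uplift)


# ASSUMPTION: 6.25 per TB is an illustrative native rate, not a quoted price.
NATIVE_RATE = 6.25
low = omni_on_demand_cost(10, NATIVE_RATE, 0.25)
high = omni_on_demand_cost(10, NATIVE_RATE, 0.30)
print(f"10 TB scanned: between {low:.2f} and {high:.2f} at the assumed native rate")
```

Because the charge scales with bytes scanned, partition pruning and columnar formats reduce this number directly, which is why they matter more on Omni than the uplift itself.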

Cross-Cloud Transfer

Bytes returned by cross-cloud transfer or by EXPORT DATA to a Google Cloud destination are billed per GB, with rates set per source cloud. AWS-to-Google transfers and Azure-to-Google transfers each carry their own rate sheet. Result rows from interactive queries that fit under a small threshold are free.

Storage Stays in the Source Cloud

You pay AWS or Azure directly for S3 or Blob Storage. Google does not bill for the underlying bytes, only for the slots that scan them and the rows transferred home. This is the opposite of native BigQuery, where storage and compute are both Google line items.

Result Cache Behaves the Same

Queries served from the 24-hour result cache are free, exactly as in native BigQuery. This is why dashboard scenarios benefit from materializing results into native datasets where the cache hit rate is highest and slot pricing is cheaper.

Performance Trade-offs vs. Native BigQuery

Native BigQuery reads Capacitor-format files from Colossus over Jupiter networking inside a single Google data center. Omni reads Parquet from S3 or Blob Storage over the host cloud's regional network. The two systems are not the same, and a few performance realities flow from that.

Cold scans on Parquet in S3 typically run two to four times slower per byte than equivalent native BigQuery scans on managed storage. Predicate pushdown, partition pruning, and columnar projection still work, so well-tuned queries hide most of the gap, but a SELECT * on a poorly partitioned bucket will feel sluggish.

Concurrency limits are tighter in Omni than in native BigQuery. A single Omni region's slot pool is sized for production analytical workloads, not for hundreds of concurrent BI users hammering refresh. This is another reason to materialize hot dashboards into native tables.

Latency for the first byte of a result set is higher than native BigQuery because the cross-cloud transfer service must spin up before delivering rows. For interactive notebook work, this manifests as a noticeable pause on the first query of a session.

Joins benefit from being inside Omni rather than spanning the cross-cloud boundary. A join of two 1 TB tables inside aws-us-east-1 is fine; the same join with one table on Omni and one in native BigQuery is impossible in a single query.

Use Cases Worth Memorizing

  • Regulated data in another cloud. Healthcare, financial services, and public-sector workloads where residency rules forbid moving data into Google Cloud, but the analytics team standardizes on BigQuery.
  • Acquisition integration. A Google Cloud company acquires an AWS-native startup and wants unified BI without a six-month migration project.
  • Cost-driven offload. A team running expensive Athena workloads wants BigQuery's slot economics without rewriting upstream pipelines that already write to S3.
  • Multi-cloud disaster recovery analytics. Audit and forensic queries need to run against logs replicated to a second cloud, without provisioning a separate analytics stack there.
  • Vendor-neutral data lakehouse. Teams who want one SQL surface over Parquet in S3, Parquet in Blob Storage, and Iceberg in GCS use Omni plus native BigQuery to get there.

Frequently Asked Questions

Does BigQuery Omni copy my data into Google Cloud?

No. The query engine runs inside AWS or Azure and reads the source files locally. Only the result rows of your SQL — typically a tiny fraction of the scanned data — are returned to Google Cloud, and only when your statement explicitly requests the result client-side or via cross-cloud transfer.

Can I join an S3 table on Omni with a native BigQuery table in one query?

Not directly. A single SQL statement cannot mix an Omni source with a native BigQuery source. The standard pattern is to aggregate inside Omni, materialize the result into native BigQuery using CREATE TABLE ... AS SELECT (which triggers cross-cloud transfer), and then join the materialized result with native data.

Which file formats perform best with BigQuery Omni?

Parquet and ORC give the best performance because Omni can leverage column projection, predicate pushdown, and partition pruning at the storage layer. Avro is supported and reasonable. CSV and JSON work but cost significantly more in slot time because every row must be parsed in full.

How is BigQuery Omni different from a federated query?

Federated queries push a SQL fragment to a remote Google Cloud database (Cloud SQL, Spanner) and return results into BigQuery. Omni runs the BigQuery engine itself inside AWS or Azure to read object storage there. Federated queries are for transactional sources inside Google Cloud; Omni is for analytical object storage in other clouds.

What happens if my BigQuery dataset region does not match my S3 bucket region?

Queries fail with a permission or location error. The connection, the dataset, and the bucket must all live in the same physical region. Plan region placement before creating any connection because datasets cannot be moved between regions after creation.

Is BigQuery Omni more expensive than native BigQuery?

On-demand Omni queries cost roughly 25-30% more per TB scanned than native BigQuery, plus you pay cross-cloud transfer fees on any results you bring back to Google Cloud. For workloads that would otherwise incur AWS or Azure egress on the entire scan volume, Omni is dramatically cheaper overall. For workloads small enough that egress was never a concern, native BigQuery on copied data may still win on raw query cost.

Can I use BigQuery ML on Omni tables?

A subset of BQML models can train and predict on Omni tables. Linear regression, logistic regression, and several other classical models are supported. Newer model types and Vertex AI integrations sometimes require native BigQuery tables, so check the supported features matrix when designing an ML workflow.
