examlab .net The most efficient path to the most valuable certifications.
In this note ≈ 20 min

BigLake: Unified Storage and Security

3,850 words · ≈ 20 min read ·

Master BigLake unified storage and security for the GCP PDE exam: BigLake tables, Iceberg integration, object tables, fine-grained access, and multi-cloud queries across S3 and ADLS.

Do 20 practice questions → Free · No signup · PDE

Introduction to BigLake Unified Storage and Security

BigLake is the storage engine that lets BigQuery treat files in Cloud Storage, Amazon S3, and Azure Data Lake Storage as if they were native warehouse tables, with the same governance, the same SQL, and the same row- and column-level security. For the Professional Data Engineer exam, BigLake shows up whenever a question mixes the words "data lake," "Parquet," "open format," "multi-cloud," or "fine-grained access" in the same sentence.

The core promise is simple. You stop choosing between a warehouse and a lake. One engine, one security surface, many storage backends.

白話文解釋(Plain English Explanation)

BigLake is one of those services that sounds abstract until you map it onto something physical. Three analogies usually unstick the concept for engineers who already know BigQuery and Cloud Storage but have never deployed BigLake in production.

Think of BigLake as a Hotel Front Desk for Your Data

A large hotel has rooms in several wings: the original 1950s building, a 1980s tower, and a brand-new annex across the street. Guests do not care which wing they sleep in. They walk up to one front desk, show one ID, get one key card, and the card opens whichever room the hotel assigns them. Housekeeping schedules, security policies, and billing all flow through that desk.

BigLake is the front desk. The wings are GCS, S3, and ADLS. The key card is the BigLake connection. Whether your Parquet files live in gs://lake-prod/orders/ or s3://eu-archive/orders/, the analyst opens BigQuery, runs SELECT, and the front desk handles routing, authentication, and policy enforcement. The analyst never touches storage IAM directly.

Think of Iceberg Integration as a Library Catalog Card

Before computers, librarians wrote one card per book and filed those cards by title, author, and subject. The cards were small, fast to flip through, and updated whenever a book was added or removed. The actual books still sat on shelves. The catalog just told you which shelf, which row, and whether the book was checked out.

Apache Iceberg metadata files play the same role for Parquet data on object storage. They track which files belong to a table snapshot, which files are deleted, and which version is current. BigLake plus the BigLake metastore reads those Iceberg cards directly, so when Spark commits a new Iceberg snapshot, BigQuery sees the new rows on the next query without anyone running a refresh job.

Think of Fine-Grained Security as a Theme Park Wristband

At a theme park, you wear a single wristband. Some wristbands open every ride, some skip the kids' area, and some include a backstage tour. The wristband is checked at every gate, automatically, without you flashing a separate ticket each time. Staff at the gate cannot see your purchase history; they just see green or red.

BigLake row-level and column-level policies are wristbands for data. Policy tags decide which columns a user can see in cleartext, which get masked to ***, and which are hidden completely. The user runs the same query as everyone else; the gate (BigLake) returns a different view per wristband. A finance analyst sees full account numbers, a marketing analyst sees the hashed version, a contractor sees nothing in that column at all.

Core Concepts of BigLake Unified Storage and Security

A handful of building blocks carry most of the weight. Learn these by name and you can read any BigLake architecture diagram quickly.

BigLake Tables

A BigLake table is an external table with a security and performance upgrade. It points at files in Cloud Storage, S3, or ADLS, and it adds three things standard external tables cannot do: fine-grained access control, metadata caching, and connection-based delegated access. Supported file formats include Parquet, ORC, Avro, CSV, JSON, and increasingly Iceberg, Hudi, and Delta Lake.

BigLake Iceberg Tables and Managed Iceberg

Two flavors exist. The first is read-only Iceberg support: BigQuery reads an Iceberg table created and written by Spark, Flink, or Trino. The second is BigQuery-managed Iceberg, where BigQuery itself owns the writes, generates Iceberg metadata, and stores Parquet in your bucket. The managed flavor gives you DML, streaming inserts, and time travel while keeping the data open enough for Spark to read.

Object Tables

Object tables index unstructured files such as PDFs, images, audio, and video as rows in BigQuery. Each row carries the URI, size, content type, and metadata; the file body stays in GCS. You can then call ML.GENERATE_TEXT or ML.PROCESS_DOCUMENT against the table to run Vertex AI models over millions of files in parallel without writing a Dataflow job.

BigLake Metastore

The BigLake metastore is a serverless, Iceberg-compatible metadata service. Spark on Dataproc, Spark on Serverless, and BigQuery all read and write through it, so an Iceberg table created by Spark immediately shows up in BigQuery and vice versa. It replaces the older Dataproc Metastore (Hive Metastore) for new lakehouse builds.

Cloud Resource Connections

A connection is a GCP resource that holds a Google-managed service account. The service account is what actually reads from S3, ADLS, or GCS. Users get permission on the connection, not on the bucket. This indirection is the whole reason BigLake security works: when a user queries a BigLake table, BigQuery checks their permission on the table, then uses the connection's service account to fetch bytes.

A security pattern where end users hold no IAM permission on the underlying storage. They only have bigquery.connections.use on a connection resource, and the connection's service account holds storage.objectViewer on the bucket. This is the foundation of BigLake fine-grained security and the reason ACLs and row-level policies can be enforced on object storage. See https://cloud.google.com/bigquery/docs/biglake-security-concepts

Architecture and Design Patterns

The same primitives compose into several reference architectures. Pick one based on whether your writers are Spark, BigQuery, or both, and whether your data sits in one cloud or several.

Pattern 1: BigQuery as the Single SQL Surface Over a Lake

Files land in gs://lake/raw/ from Pub/Sub and Dataflow. A scheduled job promotes cleaned data into gs://lake/curated/ as Parquet partitioned by date. A BigLake table sits over curated/. Analysts query through BigQuery; data scientists read the same files from a Spark notebook through the BigQuery Storage Read API. Row- and column-level policies live on the BigLake table once and apply to both engines.

Pattern 2: Open Lakehouse with Iceberg and BigLake Metastore

Spark on Serverless writes Iceberg tables to GCS. The BigLake metastore tracks the schema. BigQuery reads Iceberg tables through BigLake without a copy, and BigQuery can also write to managed Iceberg tables that Spark can subsequently read. This is the standard "open table format" answer when an exam scenario insists on no vendor lock-in.

Pattern 3: Multi-Cloud Federated Query via BigQuery Omni

A BigLake table on top of S3 is created in a BigQuery dataset that lives in the AWS region, not the Google region. Compute runs inside AWS through BigQuery Omni and pushes results back to a Google region for joining with native BigQuery tables. The user writes one query; Omni handles cross-cloud transport. The same pattern applies to Azure Blob Storage in Azure regions.

Pattern 4: AI-Ready Object Lake

Object tables index every PDF and image in a GCS bucket. Vertex AI remote models are registered in BigQuery. A single SQL statement passes file URIs to Gemini and stores the structured output back into a native BigQuery table. Row-level policies on the object table prevent contractors from running models against confidential documents.

For multi-cloud scenarios where the question states data must remain in S3 or Azure Blob Storage for residency reasons, the BigLake table dataset must be created in the matching AWS or Azure region and queried through BigQuery Omni — not in a Google region. Omni runs the compute inside the foreign cloud and only ships aggregated results back, so the raw bytes never leave the source region. Column-level security via Data Catalog policy tags still applies across clouds through the same BigLake table definition. Reference: https://cloud.google.com/bigquery/docs/biglake-intro

GCP Service Deep Dive

BigLake is not a standalone product page; it is a feature spread across BigQuery, Cloud Storage, IAM, Data Catalog, and Vertex AI. Knowing which knob lives where saves time on troubleshooting questions.

BigQuery: The Query Surface

BigQuery is where you create the table, write the SQL, and apply policies. The DDL looks like a regular CREATE EXTERNAL TABLE statement plus a WITH CONNECTION clause and an optional OPTIONS(metadata_cache_mode = 'AUTOMATIC', max_staleness = INTERVAL 1 HOUR). The query engine itself, Dremel, does not care that the data is external; it pushes filters down through the storage API.

Cloud Storage, S3, and ADLS: The Bytes

The actual files stay in object storage. For S3 and ADLS, you set up cross-cloud connections that wrap an AWS IAM role or an Azure app registration. The trust relationship is configured once per connection. Users never see those credentials.

IAM: Who Can Do What

Two layers of IAM apply. First, who can use the connection (bigquery.connections.user on the connection resource). Second, who can query the table (bigquery.dataViewer on the dataset or table). Plus the policy tags from Data Catalog control column-level access. The connection's own service account needs read permission on the bucket, but this is invisible to end users.

Data Catalog and Policy Tags

Column-level security uses policy tags from Data Catalog. You build a taxonomy such as pii > high > national_id, attach the tag to a BigLake table column, then grant dataCatalog.categoryFineGrainedReader to specific principals. Anyone without the role gets either a permission error or a masked value depending on the table's data masking rule.

Vertex AI: The Inference Layer

Object tables become useful when paired with CREATE MODEL ... REMOTE WITH CONNECTION to register a Vertex AI endpoint inside BigQuery. The connection's service account calls Vertex AI; the user only needs SQL permissions. This is how Gemini, embeddings models, and Vision models become callable from SELECT statements.

The connection's service account is the only identity that ever talks to S3, ADLS, or even GCS for BigLake reads. End-user IAM on the bucket is irrelevant for BigLake queries. If you want users to bypass BigLake and read raw files directly, you must grant them bucket IAM separately, which usually defeats the point. Reference: https://cloud.google.com/bigquery/docs/biglake-security-concepts

Common Pitfalls and Trade-offs

Most BigLake support tickets cluster around a small set of mistakes. The exam likes to embed these as wrong-answer distractors.

Granting Bucket IAM to End Users

If an analyst has roles/storage.objectViewer on the bucket directly, they can use gsutil cp or a Spark job to read files without ever passing through BigLake. Row- and column-level policies will not apply. The fix is to remove direct bucket IAM and force everyone through the BigLake table.

Forgetting Metadata Cache Refresh

Metadata caching dramatically speeds up queries on partitioned external data, but a stale cache can hide newly arrived files. Setting max_staleness to 0 defeats caching; setting it to several days means new partitions are invisible. A 30-minute to 4-hour window suits most batch pipelines. For Iceberg tables managed by BigQuery, the cache refreshes on every commit.

Mixing Read-Only Iceberg with BigQuery DML

A BigLake Iceberg table created by Spark is read-only inside BigQuery. Trying to INSERT or UPDATE returns an error. If you need DML, switch to a BigQuery-managed Iceberg table, which BigQuery owns and writes.

Cross-Region Connection Headaches

A connection is regional. A BigLake table in us-central1 cannot use a connection in europe-west1. Multi-region datasets need multi-region connections. Mismatches show up as cryptic permission errors that look like IAM problems but are actually region mismatches.

Object Tables Over Buckets with Versioning

If your GCS bucket has object versioning enabled and you delete a row in BigQuery's object table, the underlying file (and its previous versions) still exists. Object tables index live objects only, so you may end up paying for storage you forgot about.

Do not assume that turning on Cloud Storage uniform bucket-level access automatically secures your data when BigLake is in play. Uniform access controls who can call the GCS API directly; it does nothing about queries arriving through a BigLake connection. The connection's service account still needs read permission, and that is the back door you must protect with VPC Service Controls. Reference: https://cloud.google.com/bigquery/docs/biglake-intro

Best Practices

These are the patterns that keep BigLake deployments healthy at scale.

  • Use one connection per environment (dev, staging, prod) rather than one per table. Easier to audit, easier to rotate, fewer service accounts to track.
  • Wrap the bucket in a VPC Service Controls perimeter and add the connection's service account as an ingress rule. This is the only practical way to stop data exfiltration through a stolen credential.
  • Define policy tag taxonomies before you create BigLake tables. Retrofitting tags onto a table that already has hundreds of columns is painful.
  • Prefer Iceberg over plain Parquet directories when you need DML, time travel, or schema evolution. The metadata overhead is worth it.
  • Set max_staleness based on your downstream SLA, not on a default. If a dashboard refreshes hourly, a 1-hour staleness is free performance.
  • Use BigLake metastore for new lakehouse builds; reserve Dataproc Metastore (Hive Metastore) for migrations that genuinely need Hive compatibility.
  • Audit access through BigQuery audit logs in Cloud Logging, not through GCS access logs. Queries route through the connection, so the GCS logs only show the service account, not the human user.

Real-World Use Case

A European retail bank with roughly 4,000 employees runs a hybrid lake. Transaction data sits in Cloud Storage in europe-west4. Customer data sits in Amazon S3 in eu-west-1 because of a vendor contract that predates the GCP migration. Marketing data is in Azure Blob Storage in westeurope because the campaign tool only writes there.

Before BigLake, three separate teams maintained three ETL pipelines that copied data into a single BigQuery dataset every night. Latency was 18 to 24 hours. Compliance complained that customer PII existed in four places at once.

The new architecture removed every copy job. Three BigLake tables were created in BigQuery, one per cloud, each backed by a connection. The S3 and ADLS tables run through BigQuery Omni so compute happens inside AWS and Azure for data residency reasons; only aggregated results cross into Google Cloud. Policy tags on customer columns enforce that contractors see masked values. Iceberg metadata managed by the BigLake metastore lets the data science team read the same tables from Spark on Dataproc Serverless without a separate copy.

The bank deleted 47 ETL jobs, cut their Dataflow bill by 38%, and brought query freshness from 18 hours to under 5 minutes. The compliance team got a single audit log in Cloud Logging instead of three siloed log streams.

When you migrate from copy-based federation to BigLake, kill the old copy jobs in stages. Run both in parallel for a sprint, compare row counts daily, and cut over only after a full reconciliation. Surprise schema drift in the source files is the most common cause of post-migration outages. Reference: https://cloud.google.com/bigquery/docs/biglake-intro

Exam Tips

PDE questions about BigLake follow a predictable shape. Recognize the shape, pick the right answer in 20 seconds.

If the scenario says "data must stay in S3 for compliance but analysts need SQL," the answer is BigLake plus BigQuery Omni. Wrong answers will offer Storage Transfer Service or Dataflow copies.

If the scenario says "fine-grained column access on Parquet files in GCS," the answer is a BigLake table with policy tags. Plain external tables do not support column-level security.

If the scenario says "Spark and BigQuery must read the same Iceberg table without copies," the answer is BigLake plus the BigLake metastore. Hive Metastore is the distractor for legacy systems.

If the scenario describes "millions of PDFs in GCS that need to be summarized by an LLM," the answer is an object table plus a remote Vertex AI model. Custom Dataflow pipelines are the distractor.

If the scenario asks about credentials for cross-cloud reads, the answer is a Cloud Resource Connection holding a service account with a federation trust to the foreign cloud's IAM. Direct AWS access keys stored in Secret Manager are wrong.

Three things only BigLake tables can do that plain external tables cannot: row-level security, column-level security with policy tags, and metadata caching with controlled staleness. If a question asks for any of those over object storage, BigLake is the answer. Reference: https://cloud.google.com/bigquery/docs/biglake-intro

Frequently Asked Questions (FAQ)

Is BigLake a separate product I need to enable?

No. BigLake is a feature of BigQuery, billed under BigQuery. You enable it implicitly by creating a Cloud Resource Connection and a BigLake table. There is no separate API to turn on, no separate quota to request, and no extra line item on the invoice for the BigLake feature itself. You still pay for bytes scanned or slot time, plus standard egress when reading across clouds.

How does BigLake differ from a regular external table?

A regular external table gives you SQL over files but stops there: no row-level security, no column-level masking, no metadata caching, and end users need direct IAM on the bucket. A BigLake table adds all four. The DDL difference is one line, the WITH CONNECTION clause, but the security model is fundamentally different because end users no longer need bucket access.

When should I choose Iceberg over plain Parquet for my BigLake tables?

Choose Iceberg when you need DML (UPDATE, DELETE, MERGE), time travel, schema evolution without rewriting files, or atomic multi-file commits. Choose plain partitioned Parquet when your data is append-only, your partitions never change, and you want the simplest possible file layout. For a brand-new lakehouse with mixed Spark and BigQuery writers, Iceberg with BigLake metastore is usually the right default.

Can I use BigLake to query data in another organization's S3 bucket?

Yes, if you can establish a trust between a BigQuery connection's service account and an AWS IAM role that the other organization grants you access to. The connection holds the federation configuration; the AWS role grants s3:GetObject on the bucket. Cross-account S3 access works the same way it would for any AWS-to-AWS federation, with the BigQuery connection acting as the principal.

What happens to BigLake queries during a regional outage?

Queries fail in the affected region. BigLake tables are regional resources, and so are the connections behind them. For multi-region datasets, BigQuery handles the failover for the metadata layer, but the underlying object storage outage still blocks reads. The mitigation is dual-region buckets for critical data plus a documented runbook to repoint connections to a secondary region. BigQuery Omni adds another dependency on the foreign cloud's region staying healthy.

Do object tables work with audio and video files?

Yes. Object tables index any file type because they store URIs and metadata, not file bodies. The interesting work happens at query time when you call a remote Vertex AI model. For audio, you can pass the URI to Speech-to-Text models. For video, you can pass it to Video Intelligence or to Gemini's multimodal endpoint. The BigLake security model still applies: row-level policies on the object table can hide certain files from certain users.

Further Reading

Official sources

More PDE topics