Cloud Storage Data Lake Architecture

A practical GCP Professional Data Engineer guide to Cloud Storage data lake design: bronze/silver/gold zones, Hive partitioning, file formats, lifecycle, BigLake, Dataplex, IAM, CMEK, and VPC-SC.

Introduction to Cloud Storage Data Lake Design

A Cloud Storage data lake design on Google Cloud is the choice every analytics team makes before they realise they have made it. The bucket layout you sketch on day one ends up shaping permissions, billing, query performance, and disaster recovery for years. This study note walks through the moving parts a Professional Data Engineer is expected to recognise on the exam and to defend on a real architecture review: zones, file formats, partitioning, lifecycle, BigLake, Dataplex, IAM, encryption, perimeters, and ingestion patterns.

The exam does not test trivia about the GCS console. It tests whether you can pick the right tier of bucket, the right key layout, and the right governance boundary for a workload that will outlive the engineer who built it. That is the lens used throughout this Cloud Storage data lake design guide.

Plain English Explanation

A good Cloud Storage data lake design is easier to understand if you stop thinking about it as software for a minute and picture it as a physical place you have visited.

Think of the lake as a working kitchen with three counters

A restaurant kitchen has three workspaces. The receiving counter is where raw boxes arrive from suppliers, still in their original packaging, sometimes mislabeled, sometimes wet. The prep counter is where vegetables are washed, meat is portioned, sauces are reduced. The pass is where finished plates leave for the dining room, plated and ready to serve.

Bronze, silver, and gold zones in a Cloud Storage data lake design are the same three counters. Bronze takes whatever shows up. Silver cleans and conforms it. Gold serves curated, business-ready datasets. You never serve customers from the receiving counter, and you never store unwashed produce on the pass. The same rule keeps a Cloud Storage data lake design healthy.

Think of partitions as a public library catalogue

A library does not put every book on one giant shelf. It splits by floor, then by section, then by call number. When you want every cookbook published in 2024, the librarian walks you to one shelf, not the entire building. Hive-style partitioning in a Cloud Storage data lake design works the same way: event_date=2024-11-04/region=apac/ is a call number that BigLake and BigQuery use to skip every other shelf.

Without partitions, a query has to scan the whole library to answer a single question. With partitions, it walks to one corner. The cost difference shows up on your invoice.

Think of storage classes as a parking garage

A downtown parking garage charges more for the spaces near the elevator and less for the spaces six floors up. If you are running into the building for ten minutes, you pay for the convenient spot. If you are leaving the car for a month while you travel, you take the cheap spot at the top.

Standard, Nearline, Coldline, and Archive classes in a Cloud Storage data lake design are the same gradient. Hot transactional data sits near the elevator. Last quarter's audit logs go to the top floor. Object Lifecycle Management is the attendant who quietly moves your car upstairs while you are away, so you never overpay for a spot you stopped using.

Core Concepts of Cloud Storage Data Lake Design

A Cloud Storage data lake design is built from a small set of primitives. The skill is composing them into something that scales without becoming a swamp.

Buckets, prefixes, and the "folder" illusion

GCS has no folders. It has buckets and object names. A name like silver/orders/event_date=2024-11-04/part-000.parquet looks hierarchical because the console renders the slashes as a tree, but every object is flat under the bucket. This matters because IAM and lifecycle rules historically worked at the bucket level, not the prefix level. A modern Cloud Storage data lake design either splits zones into separate buckets or uses managed folders for prefix-scoped IAM where supported.

Zones: bronze, silver, gold

Most teams settle on three zones. Bronze (sometimes called raw or landing) holds immutable, append-only copies of source data in its native format. Silver holds cleaned, deduplicated, schema-conformed data, usually in Parquet. Gold holds business-ready, joined, aggregated tables that downstream tools and dashboards consume directly. A common Cloud Storage data lake design rule: data only moves downstream, never upstream, so gold can always be rebuilt from silver, and silver from bronze.

Naming conventions that survive a reorg

Bucket names are global. Object prefixes are not, but they are still load-bearing. A reliable convention looks like <company>-<env>-<zone>-<region> for buckets and <domain>/<dataset>/<partition_keys>/<file> for objects. Avoid timestamps in bucket names. Avoid PII in prefixes. Avoid sequential or monotonically increasing prefixes at the start of the key, because that pattern hotspots GCS auto-scaling under heavy write throughput.
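The convention above can be sketched as two small helpers. This is a minimal illustration, not an official tool: the function names are hypothetical, and the regular expression checks only a simplified subset of the real bucket naming rules (lowercase letters, digits, and dashes, 3 to 63 characters).

```python
import re

# Simplified check: lowercase letters, digits, dashes; 3-63 chars;
# must start and end with a letter or digit. (Assumption: stricter than GCS itself.)
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def bucket_name(company: str, env: str, zone: str, region: str) -> str:
    """Build <company>-<env>-<zone>-<region> and validate the result."""
    name = "-".join(part.lower() for part in (company, env, zone, region))
    if not BUCKET_RE.match(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return name


def object_key(domain: str, dataset: str, partitions: dict[str, str], filename: str) -> str:
    """Build <domain>/<dataset>/<key=value>/.../<file> in partition-key order."""
    parts = [domain, dataset]
    parts += [f"{key}={value}" for key, value in partitions.items()]
    parts.append(filename)
    return "/".join(parts)


print(bucket_name("acme", "prod", "silver", "us"))
print(object_key("sales", "orders",
                 {"event_date": "2024-11-04", "region": "apac"},
                 "part-00000.parquet"))
```

Generating names from one function rather than by hand keeps the convention consistent across teams and makes a later rename a one-line change.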

Hive partition path layout

BigQuery, BigLake, Dataproc, and Dataflow all recognise Hive-style paths of the form key=value. A canonical Cloud Storage data lake design layout for an orders table:

gs://acme-prod-silver-us/sales/orders/
  event_date=2024-11-04/region=apac/part-00000.parquet
  event_date=2024-11-04/region=emea/part-00000.parquet
  event_date=2024-11-05/region=apac/part-00000.parquet

The partition keys you choose are the keys queries will filter on most often. Date is almost always one of them. A second key like region or tenant_id is fine if cardinality stays bounded; explode the cardinality and you create millions of tiny partitions that hurt more than they help.

Hive partitioning: a directory naming convention where each folder encodes one column as key=value. Query engines parse the path, prune partitions that do not match the filter, and only read matching files. See BigQuery hive-partitioned loads.
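To make partition pruning concrete, here is a minimal sketch of what a query engine conceptually does with key=value paths. The helper names are hypothetical, and real engines do this against file listings and table metadata rather than a Python list.

```python
def parse_partitions(object_name: str) -> dict[str, str]:
    """Extract key=value path segments from an object name."""
    parts = {}
    for segment in object_name.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune(object_names: list[str], **filters: str) -> list[str]:
    """Keep only objects whose partition values match every filter."""
    kept = []
    for name in object_names:
        parts = parse_partitions(name)
        if all(parts.get(key) == value for key, value in filters.items()):
            kept.append(name)
    return kept


objects = [
    "sales/orders/event_date=2024-11-04/region=apac/part-00000.parquet",
    "sales/orders/event_date=2024-11-04/region=emea/part-00000.parquet",
    "sales/orders/event_date=2024-11-05/region=apac/part-00000.parquet",
]

# Only the first object survives a filter on both partition keys.
print(prune(objects, event_date="2024-11-04", region="apac"))
```

The files that are pruned are never opened, which is where the scan-cost saving comes from.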

Architecture and Design Patterns

A Cloud Storage data lake design is more than a directory tree. It is a layered architecture with explicit data contracts between layers.

The medallion pattern on GCS

Bronze writes are append-only. Producers (Pub/Sub subscribers, Datastream sinks, Storage Transfer jobs) drop files as they arrive. Schema enforcement is minimal because you would rather store a malformed record than lose it. A common pattern keeps bronze in the source's native format: JSON Lines for event streams, Avro for CDC payloads, CSV for legacy exports.

Silver is the transformation layer. Dataflow, Dataproc, or BigQuery scheduled queries read bronze, deduplicate, fix types, and write Parquet partitioned by event date. Silver is the source of truth for analysts. Gold is the consumption layer, often materialised as BigQuery tables or as BigLake external tables on highly curated GCS objects.

Bucket-per-zone vs prefix-per-zone

Two camps exist. Bucket-per-zone gives you clean IAM separation, separate lifecycle policies, separate CMEK keys per zone, and easy cost attribution by bucket. Prefix-per-zone is simpler to set up and migrate but couples policies and pricing across zones. For any production Cloud Storage data lake design, bucket-per-zone wins on governance every time.

Region choice and the dual-region option

Single-region buckets are cheapest and have the lowest write latency, but lose availability when the region goes down. Multi-region buckets give you a 99.95% availability SLA and serve global readers, but cost more and add cross-region write semantics to think about. Dual-region buckets are the middle ground: replication between two named regions, optional turbo replication for a predictable RPO, and clear data residency. For most analytics workloads, a regional bucket co-located with your BigQuery dataset is the right call. For globally consumed gold data, dual-region pays off.

Co-locate your GCS buckets and your BigQuery datasets in the same region. Cross-region external table reads incur egress and dramatically slow down BigLake queries. Reference: BigQuery locations.

Ingestion patterns

Batch ingestion uses Storage Transfer Service for SaaS sources, Transfer Appliance for petabyte-scale physical migrations, and gcloud storage cp for ad-hoc loads. Streaming ingestion uses Pub/Sub with a Cloud Storage subscription that writes Avro or text files on a time or size trigger, or a Dataflow job that micro-batches into bronze. CDC ingestion uses Datastream to land Avro or JSON files in bronze, with a downstream Dataflow template merging into silver.

The ingestion choice you make is structural. A Cloud Storage data lake design that uses Pub/Sub-to-GCS subscriptions with five-minute file rolls behaves very differently from one that uses Dataflow streaming with windowed writes. The former is cheaper and simpler. The latter is more flexible and lets you enforce schemas at ingest.

GCP Service Deep Dive

The Cloud Storage data lake design exam questions almost always involve more than GCS itself. They pull in BigLake, Dataplex, BigQuery, and IAM.

File format trade-offs

Choosing the file format is one of the highest-leverage decisions in a Cloud Storage data lake design. Five formats matter on the exam.

Parquet is the default for analytics layers (silver and gold). Columnar storage means BigLake and BigQuery only scan the columns referenced by a query. Snappy or ZSTD compression is standard. Predicate pushdown works on filter columns. Use Parquet for any table read by analytical queries.

ORC is similar to Parquet, more common in Hive and Spark ecosystems. BigLake supports it. If you are migrating from on-prem Hadoop and your existing pipelines emit ORC, keep it; otherwise, Parquet is the safer default on GCP.

Avro is row-based, with a self-describing schema embedded in every file. It is excellent for ingestion (CDC, event streams) because schema evolution is well-defined and writes are cheap. It is poor for analytical scans because every column gets read. Use Avro in bronze, convert to Parquet for silver.

JSON Lines is human-readable, tolerant of schema drift, and verbose on disk. Use it for raw event streams in bronze when downstream consumers need to reparse, or for small, infrequent extracts. Avoid it for any large analytical table.

CSV is the format you accept rather than choose. It has no native types, no nested structures, no compression by default, and ambiguous quoting rules. Land CSV in bronze if a partner sends it, then convert to Parquet immediately and never let analysts query CSV directly.

Parquet plus Snappy plus partitioning is the default analytical layout in a Cloud Storage data lake design. Anything else needs a justification. See BigQuery file format guidance.

Object Lifecycle Management and storage class transitions

OLM rules act on age, storage class, version state, custom time, and prefix. A standard Cloud Storage data lake design lifecycle for bronze:

  • After 30 days, transition Standard to Nearline.
  • After 90 days, transition Nearline to Coldline.
  • After 365 days, transition Coldline to Archive.
  • After 7 years, delete (or apply a retention lock if compliance forbids deletion).

Silver and gold typically stay in Standard because they are queried regularly. Bronze accumulates and benefits most from tiering. Lifecycle rules are evaluated once per day; they are not real-time. Transitions are one-way without re-uploading: you can move Standard to Archive automatically, but moving Archive back to Standard requires a rewrite.
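The four bronze rules can be written down as a lifecycle configuration. The sketch below builds it as a Python dictionary in the JSON shape used by Cloud Storage lifecycle configurations; the function name is hypothetical, and seven years is approximated as 7 × 365 days, which is an assumption rather than a compliance-grade calculation.

```python
import json

DAYS_PER_YEAR = 365  # assumption: leap days ignored


def bronze_lifecycle() -> dict:
    """Lifecycle configuration mirroring the four bronze rules described above."""

    def transition(age_days: int, from_class: str, to_class: str) -> dict:
        # matchesStorageClass keeps each rule from touching objects in other classes.
        return {
            "action": {"type": "SetStorageClass", "storageClass": to_class},
            "condition": {"age": age_days, "matchesStorageClass": [from_class]},
        }

    return {
        "rule": [
            transition(30, "STANDARD", "NEARLINE"),
            transition(90, "NEARLINE", "COLDLINE"),
            transition(365, "COLDLINE", "ARCHIVE"),
            {"action": {"type": "Delete"}, "condition": {"age": 7 * DAYS_PER_YEAR}},
        ]
    }


print(json.dumps(bronze_lifecycle(), indent=2))
```

Keeping the policy in code means it can be reviewed and versioned like any other pipeline change before it is applied to the bronze bucket.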

Storage class access cadence on the PDE exam: Standard for hot data, Nearline for monthly access (30-day minimum storage), Coldline for quarterly access (90-day minimum), Archive for yearly access (365-day minimum). A canonical bronze OLM chain is Standard to Nearline at 30 days, Nearline to Coldline at 90 days, Coldline to Archive at 365 days, with partitions laid out as event_date=YYYY-MM-DD/region=.../ so prefix-scoped rules can target specific paths and BigLake on Parquet/ORC can prune by date and region. Dataplex zones (raw and curated) map onto these bronze/silver/gold tiers on top of the same GCS layout.

Lifecycle rules apply to every object in the bucket unless you scope them with prefix or matchesStorageClass conditions. Putting silver and bronze in the same bucket and writing a "move to Coldline after 30 days" rule will quietly demote your hot silver tables. Bucket-per-zone avoids this entire class of bug.

BigLake on GCS

BigLake tables are the bridge between a GCS-based Cloud Storage data lake design and BigQuery's governance model. A BigLake table points at a GCS prefix with Parquet, ORC, Avro, JSON, or CSV files. Unlike legacy BigQuery external tables, BigLake supports row- and column-level security and dynamic data masking, and it uses a connection-based service account so end users do not need direct GCS IAM grants.

The practical consequence: you can lock down the underlying bucket so only one service account can read it, and let analysts query the BigLake table with their normal BigQuery permissions. This is the standard governance pattern on the PDE exam.
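As an illustration of that pattern, the sketch below renders a BigQuery DDL statement for a BigLake table over a Hive-partitioned Parquet prefix. The helper name, table name, and connection ID are hypothetical; the statement follows the general CREATE EXTERNAL TABLE ... WITH CONNECTION form, and should be checked against the current BigQuery DDL reference before use.

```python
def biglake_ddl(table: str, connection: str, gcs_prefix: str) -> str:
    """Render a CREATE EXTERNAL TABLE statement for Hive-partitioned Parquet."""
    prefix = gcs_prefix.rstrip("/")
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        "WITH PARTITION COLUMNS\n"          # infer partition columns from key=value paths
        f"WITH CONNECTION `{connection}`\n"  # the connection's service account reads GCS
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{prefix}/*'],\n"
        f"  hive_partition_uri_prefix = '{prefix}'\n"
        ");"
    )


# Hypothetical project, dataset, and connection names.
print(biglake_ddl(
    table="acme-analytics.sales.orders",
    connection="acme-analytics.us.lake-conn",
    gcs_prefix="gs://acme-prod-silver-us/sales/orders/",
))
```

Only the connection's service account needs read access on the bucket; analysts are granted BigQuery permissions on the resulting table.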

BigLake also supports Iceberg tables on GCS, which add table-level transactions, schema evolution, and time travel on top of Parquet files. Iceberg is the direction the ecosystem is moving for open lakehouse architectures.

Dataplex zones and lakes

Dataplex sits a level above GCS and BigQuery. You define a "lake" that spans buckets and datasets, then organise data into "zones" (raw, curated) inside the lake. Dataplex automatically discovers files in GCS, registers them as tables in a Dataproc Metastore and BigQuery, profiles data quality, and enforces tag-based policies.

For the exam, remember the mapping: Dataplex lakes equal logical data domains, Dataplex zones equal bronze/silver/gold tiers, and Dataplex assets are the actual GCS buckets or BigQuery datasets. Dataplex does not replace your Cloud Storage data lake design; it adds metadata, discovery, and governance on top of it.

IAM hierarchy

Cloud Storage IAM works at four levels: organisation, folder, project, and bucket. Roles inherit downward. A typical Cloud Storage data lake design grants:

  • Data engineering team: roles/storage.admin at the project level for the data lake project.
  • Ingestion service accounts: roles/storage.objectCreator on bronze buckets only.
  • Transformation service accounts: roles/storage.objectViewer on bronze, roles/storage.objectAdmin on silver.
  • Analyst groups: no direct GCS access; they query through BigLake.
  • Auditors: roles/storage.objectViewer plus roles/logging.viewer at the project level.

Use uniform bucket-level access on every bucket. Object ACLs are a legacy mechanism that fragments your security posture and makes audits painful.
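One way to keep these grants reviewable is to hold them as data and check them in tests. The sketch below models only bucket-level bindings for the bronze and silver buckets; the principals and helper names are hypothetical, and roles inherited from the project, folder, or organisation are deliberately not modelled.

```python
# Hypothetical principals; the roles are the ones listed in the grants above.
INGEST = "serviceAccount:ingest@example-lake.iam.gserviceaccount.com"
TRANSFORM = "serviceAccount:transform@example-lake.iam.gserviceaccount.com"
ANALYSTS = "group:analysts@example.com"

BUCKET_BINDINGS = {
    "acme-prod-bronze-us": {
        "roles/storage.objectCreator": [INGEST],
        "roles/storage.objectViewer": [TRANSFORM],
    },
    "acme-prod-silver-us": {
        "roles/storage.objectAdmin": [TRANSFORM],
    },
}


def roles_for(member: str, bucket: str) -> set[str]:
    """Roles bound directly on the bucket for this member."""
    bindings = BUCKET_BINDINGS.get(bucket, {})
    return {role for role, members in bindings.items() if member in members}


def buckets_granted_to(member: str) -> list[str]:
    """Buckets where the member holds at least one bucket-level role."""
    return sorted(b for b in BUCKET_BINDINGS if roles_for(member, b))


print(buckets_granted_to(TRANSFORM))  # bronze (read) and silver (write)
print(buckets_granted_to(ANALYSTS))   # empty: analysts go through BigLake
```

A check like `buckets_granted_to(ANALYSTS) == []` turns "analysts have no direct GCS access" from a convention into something a CI job can enforce.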

Encryption with CMEK

Every object in GCS is encrypted at rest by default with Google-managed keys. CMEK (Customer-Managed Encryption Keys) lets you supply your own Cloud KMS key, which gives you the ability to revoke access by disabling or destroying the key, plus a clean audit trail of every encrypt/decrypt operation in Cloud Audit Logs.

A Cloud Storage data lake design that uses CMEK typically creates one key ring per zone, with separate keys per bucket. Key rings are bound to a location, so the key location must match the bucket location. The Cloud Storage service agent for your project needs roles/cloudkms.cryptoKeyEncrypterDecrypter on the key. Without that grant, writes to the bucket fail, and the cause only becomes clear from the Cloud KMS permission error in the response and logs.
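The location constraint is easy to check before a bucket is created. The sketch below builds a Cloud KMS key resource name and compares its location with the bucket's; the helper names and the project, key ring, and key identifiers are hypothetical.

```python
def kms_key_name(project: str, location: str, key_ring: str, key: str) -> str:
    """Build the fully qualified Cloud KMS key resource name."""
    return f"projects/{project}/locations/{location}/keyRings/{key_ring}/cryptoKeys/{key}"


def key_location(key_name: str) -> str:
    """Extract the location segment from a key resource name."""
    parts = key_name.split("/")
    return parts[parts.index("locations") + 1]


def check_cmek_location(bucket_location: str, key_name: str) -> None:
    """Raise if the key and the bucket are in different locations."""
    if bucket_location.lower() != key_location(key_name).lower():
        raise ValueError(
            f"bucket location {bucket_location!r} does not match "
            f"key location {key_location(key_name)!r}"
        )


# Hypothetical identifiers: one key ring per zone, one key per bucket.
key = kms_key_name("acme-security", "us-central1", "silver", "acme-prod-silver-us")
check_cmek_location("US-CENTRAL1", key)  # passes: same location, case-insensitive
print(key)
```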

CSEK (Customer-Supplied) is rarely the right answer on the exam. It puts the key entirely in your hands, which sounds appealing until you lose it and your data becomes unrecoverable. CMEK with Cloud HSM-backed keys is the standard high-assurance choice.

VPC Service Controls perimeter

VPC-SC creates a security perimeter around GCS, BigQuery, and other API surfaces. Inside the perimeter, services can talk to each other freely. Crossing the perimeter requires an ingress or egress rule, even for users with valid IAM. This blocks the exfiltration path where a compromised credential is used from an unauthorised network to copy data out of GCS.

A production Cloud Storage data lake design wraps every bucket holding sensitive data in a VPC-SC perimeter that allows access only from your corporate VPN, your CI/CD project, and named on-prem CIDR ranges. Combined with private Google access, this means GCS API calls never traverse the public internet.

VPC-SC is the only control that prevents a stolen service account key from being used to download buckets from a coffee shop Wi-Fi. IAM alone does not give you that property. Reference: VPC Service Controls overview.

Common Pitfalls and Trade-offs

Real Cloud Storage data lake design rollouts trip over the same handful of mistakes.

The first is the small-files problem. Streaming ingestion that flushes every few seconds produces millions of one-kilobyte Parquet files. Each file carries metadata overhead, BigLake spends more time listing than reading, and Dataflow shuffle costs explode. Target file sizes between 128 MB and 1 GB. Compact small files with a scheduled Dataflow or BigQuery MERGE job.
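A compaction job first has to decide which small files to merge together. The sketch below is one simple way to plan that: a greedy grouping by name order toward an assumed 512 MB target, which sits inside the 128 MB to 1 GB range. The function name and target are illustrative, and the actual rewrite would be done by the processing engine.

```python
MB = 1024 * 1024
TARGET_BYTES = 512 * MB  # assumed target inside the 128 MB to 1 GB range


def plan_compaction(file_sizes: dict[str, int], target: int = TARGET_BYTES) -> list[list[str]]:
    """Greedily group files so each group's total stays at or under `target` bytes.

    A single file larger than `target` ends up in a group of its own.
    """
    groups: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for name, size in sorted(file_sizes.items()):
        if current and current_bytes + size > target:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# Twelve 100 MB files collapse into three output files (5 + 5 + 2 inputs).
files = {f"part-{i:03d}.parquet": 100 * MB for i in range(12)}
print([len(group) for group in plan_compaction(files)])
```

Each group would be planned within a single partition so that compaction never mixes rows from different partition values.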

The second is partition explosion. Partitioning by event_timestamp instead of event_date creates one partition per second, which is 86,400 partitions a day and roughly 31.5 million tiny partition prefixes a year, enough to crush any query planner. Pick partition keys with bounded cardinality, ideally under a few thousand distinct values per dataset.
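The arithmetic behind partition explosion is worth doing once. The sketch below compares daily and per-second partitioning over a year, assuming a hypothetical second key (region) with four values; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def partitions_per_year(values_per_day: int, secondary_cardinality: int = 1,
                        days: int = 365) -> int:
    """Upper bound on partitions created in a year for a time key plus one other key."""
    return values_per_day * days * secondary_cardinality


# Assumption: four regions as the second partition key.
by_date = partitions_per_year(1, secondary_cardinality=4)
by_second = partitions_per_year(SECONDS_PER_DAY, secondary_cardinality=4)

print(f"event_date + region:      {by_date:>12,}")    # 1,460
print(f"event_timestamp + region: {by_second:>12,}")  # 126,144,000
```

The daily layout stays in the low thousands per year, while the per-second layout reaches nine digits with only four regions.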

The third is putting raw PII in bronze without protection. Bronze is the most permissive zone by design, and engineers often forget that ingesting an unredacted CSV of customer records into a permissively scoped bucket has now given anyone with roles/storage.admin on the project access to that PII. Either de-identify at ingest with Cloud DLP, or encrypt the bronze bucket with a CMEK key whose access is restricted to the de-identification job.

The fourth is forgetting cross-region egress. A BigQuery US dataset reading from a GCS EU bucket via a BigLake external table will work, but it will be slow and you will pay egress per query. Always check region alignment.

The fifth is over-relying on bucket-level IAM as data grows. Once you have hundreds of teams sharing a lake, granular access controls in BigLake or Dataplex tag policies scale better than dozens of buckets each with their own IAM policy.

Best Practices

A short, opinionated list of what consistently works in a Cloud Storage data lake design.

  • One bucket per zone per region, with descriptive names that include environment.
  • Uniform bucket-level access on every bucket; no object ACLs.
  • Hive-style partitioning with date as the primary partition key.
  • Parquet with Snappy compression for silver and gold; native format in bronze.
  • Lifecycle rules scoped by prefix or matchesStorageClass; never blanket rules on shared buckets.
  • CMEK on any bucket holding sensitive or regulated data.
  • BigLake tables instead of legacy external tables when analysts need access.
  • Dataplex for catalog, profiling, and policy enforcement at scale.
  • VPC-SC perimeter around the lake project for any production workload.
  • Audit logs (Data Access logs) enabled on bronze and silver for forensic capability.

Real-World Use Case

A mid-size retail company runs an analytics platform feeding dashboards for 200 stores and a feature store for demand forecasting. Their Cloud Storage data lake design uses three buckets per region: acme-prod-bronze-us, acme-prod-silver-us, acme-prod-gold-us, all in us-central1 to co-locate with their BigQuery dataset.

Datastream lands MySQL CDC into bronze as Avro, partitioned by event_date and source_table. A Dataflow streaming job reads bronze, deduplicates by primary key plus operation timestamp, and writes Parquet into silver partitioned by event_date and country. Daily BigQuery scheduled queries materialise gold tables for finance and merchandising.

Bronze has a lifecycle rule that transitions Standard to Nearline at 60 days and to Archive at 18 months, with a 7-year retention lock for tax compliance. Silver and gold stay in Standard. CMEK keys live in Cloud HSM, one key per bucket, with rotation every 90 days.

Analysts never touch GCS directly. They query BigLake tables defined over silver and gold prefixes, with row-level security restricting each store manager to their own store's data. The entire data lake project sits inside a VPC-SC perimeter that allows access only from the company VPN and the production Dataflow worker subnets. Total monthly cost for 80 TB of bronze, 20 TB of silver, and 5 TB of gold runs about a third of what their previous on-prem Hadoop cluster cost, with no operations team required.

Exam Tips

The PDE exam tests Cloud Storage data lake design through scenario questions, not definitions. Watch for these patterns.

When the scenario mentions petabyte-scale migration from on-prem with limited bandwidth, the answer involves Transfer Appliance, not gsutil. When it mentions ongoing daily SaaS imports, the answer is Storage Transfer Service. When it mentions CDC from a relational database, the answer is Datastream into bronze.

If a question asks how to give analysts SQL access to GCS files without giving them GCS IAM, the answer is BigLake, not legacy external tables. If it asks how to enforce row-level security on GCS-backed tables, the answer is BigLake. If it asks for a unified catalog and policy plane across GCS and BigQuery, the answer is Dataplex.

Storage class questions usually have a clear keyword: "accessed monthly" maps to Nearline, "quarterly" to Coldline, "yearly" to Archive. Lifecycle questions test whether you remember that rules act once per day and only forward through tiers.
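The keyword mapping can be expressed as a rule of thumb: pick the coldest class whose minimum storage duration fits the gap between reads. This is a sketch of the exam heuristic, not a pricing model, and the function name is hypothetical.

```python
# (storage class, minimum storage duration in days), coldest first
CLASSES = [("ARCHIVE", 365), ("COLDLINE", 90), ("NEARLINE", 30), ("STANDARD", 0)]


def storage_class_for(days_between_reads: int) -> str:
    """Pick the coldest class whose minimum storage duration fits the access gap."""
    for name, min_days in CLASSES:
        if days_between_reads >= min_days:
            return name
    return "STANDARD"  # fallback for negative inputs


for label, gap in [("weekly", 7), ("monthly", 30), ("quarterly", 90), ("yearly", 365)]:
    print(f"{label:>9}: {storage_class_for(gap)}")
```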

Encryption questions distinguish CMEK (the key lives in Cloud KMS, where you control rotation and access) from CSEK (you supply the raw key with each request, and Google does not store it). The default safe answer for "regulatory requirement to control encryption keys" is CMEK with Cloud HSM. CSEK appears as a distractor.

VPC-SC questions show up as "prevent data exfiltration" or "even with valid credentials, block access from outside the corporate network." IAM alone is the wrong answer; VPC-SC plus Private Google Access is the right one.

On the exam, when a question combines GCS with BigQuery and asks for the lowest-cost analytical query path, BigLake external tables on Parquet with Hive partitioning is almost always correct. See BigLake introduction.

Frequently Asked Questions

What is the difference between a Cloud Storage data lake and a Dataplex lake?

A Cloud Storage data lake design is the physical layout of buckets, prefixes, files, and IAM. A Dataplex lake is a logical construct that groups one or more GCS buckets and BigQuery datasets under a unified catalog, with zones, asset registration, data profiling, and tag-based policies. Dataplex sits on top of your existing GCS layout; it does not replace it. Most teams build the GCS structure first, then add Dataplex when governance complexity outgrows manual processes.

Should I use one bucket or multiple buckets for my data lake?

Multiple. Use one bucket per zone (bronze, silver, gold), per environment (dev, staging, prod), and per region. This gives you clean IAM boundaries, separate lifecycle policies, isolated CMEK keys, and clear cost attribution. The marginal complexity of more buckets is much smaller than the marginal pain of trying to apply different policies to different prefixes inside one bucket.

When should I pick Avro over Parquet for files in GCS?

Use Avro when the workload is write-heavy, schema evolves frequently, and consumers usually read whole records (CDC payloads, event streams, queue offloads). Use Parquet when the workload is read-heavy, queries scan a subset of columns, and you want predicate pushdown (analytical tables, BigLake sources). A common Cloud Storage data lake design pattern is Avro in bronze, Parquet in silver and gold.

How does BigLake differ from legacy BigQuery external tables?

Legacy external tables require every querying user to have direct GCS IAM read access, and they do not support row- or column-level security. BigLake tables use a connection with its own service account that holds GCS access, so users only need BigQuery permissions on the table. BigLake also supports row-level security, column-level security, dynamic masking, and integrates with Dataplex policies. For any production Cloud Storage data lake design with analyst access, BigLake is the right choice.

Can VPC Service Controls block access between projects in the same organisation?

Yes. A VPC-SC perimeter is drawn around specific projects rather than the whole organisation. A bucket in your data lake project can be perimeter-protected so even another project in the same organisation, with valid IAM, cannot access it without an explicit ingress rule. This is why VPC-SC is the standard control for sensitive data lakes: IAM decides who is allowed, VPC-SC decides where requests may come from, and you need both.

How do I prevent the small-files problem in a streaming Cloud Storage data lake design?

Configure your streaming sink to roll files based on size and time, with a target between 128 MB and 1 GB. For Pub/Sub Cloud Storage subscriptions, set a generous file batch size and time window. For Dataflow streaming writes, use windowing plus Beam's FileIO.write() with appropriate sharding. Schedule a daily compaction job that reads small files in silver, rewrites them as larger Parquet files, and updates BigLake metadata.
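Choosing the roll window is mostly arithmetic: file size is roughly throughput multiplied by how long the sink buffers. The sketch below estimates a window per writer, assuming a 256 MB target and a one-hour cap on latency; both values and the function names are assumptions, and it does not correspond to any specific Pub/Sub or Dataflow setting.

```python
import math

MB = 1024 * 1024
SECONDS_PER_DAY = 24 * 60 * 60


def roll_seconds(throughput_bytes_per_s: float, target_bytes: int = 256 * MB,
                 max_wait_s: int = 3600) -> int:
    """Seconds one writer should buffer to emit files of roughly `target_bytes`."""
    if throughput_bytes_per_s <= 0:
        return max_wait_s
    return min(max_wait_s, math.ceil(target_bytes / throughput_bytes_per_s))


def files_per_day(roll_s: int) -> int:
    """Files one writer emits per day at a given roll interval."""
    return math.ceil(SECONDS_PER_DAY / roll_s)


fast = roll_seconds(1 * MB)     # 1 MB/s reaches 256 MB in 256 seconds
slow = roll_seconds(10 * 1024)  # 10 KB/s hits the one-hour cap first
print(fast, files_per_day(fast))
print(slow, files_per_day(slow))
```

When the cap is what triggers the roll, as in the slow case, the files will still be small, which is the signal that a downstream compaction job is needed.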

What is the right storage class for bronze data that is rarely re-read?

Standard for the first 30 to 60 days while late-arriving silver jobs may need it, then Nearline for 90 days, then Coldline up to a year, then Archive for long-term retention. Encode this with a single lifecycle policy on the bronze bucket. Avoid putting bronze directly into Archive; if a silver job fails and you need to reprocess, retrieving Archive data costs money and adds latency.
