Dataplex: Data Governance and Catalog

Practical guide to Dataplex data governance on GCP: Universal Catalog, policy tags, attribute store, business glossary, AutoDQ, profiling, lineage, IAM, and VPC-SC for the PDE exam.

Introduction to Dataplex Data Governance

Dataplex data governance is the layer Google Cloud uses to make a sprawling estate of BigQuery datasets, Cloud Storage buckets, and BigLake tables behave like a single, well-managed warehouse. Instead of dropping yet another tool on top of your data, it turns the metadata you already have into a control plane: a place to search, classify, secure, profile, and trace everything from a hand-written CSV to a petabyte-scale fact table.

For a Professional Data Engineer, Dataplex data governance shows up under several guises across the exam. You will see questions on Universal Catalog (the rebrand of Data Catalog), on policy tags wrapped around sensitive columns, on AutoDQ scans wired into Cloud Composer, and on lineage graphs that explain why a dashboard suddenly shows yesterday's numbers. They are all the same product, looked at from different angles.

This note walks through the full surface area: the catalog, taxonomies, business glossary, attribute store, automated harvest, data quality and profiling scans, lineage emission, IAM, and the perimeter story with VPC Service Controls. The goal is for you to leave with both a working mental model and the specific knobs the PDE exam likes to test.

Plain English Explanation

Three analogies, each from a different world, will help you keep Dataplex data governance straight when the exam throws a wall of acronyms at you.

The City Library With a Restricted Reading Room

A modern public library does not just store books. It shelves them by Dewey code, prints a catalog card for each one, locks the rare manuscripts behind glass, and asks you to show ID before you can even touch certain medical journals. The librarian on duty can also tell you exactly which earlier book inspired the new bestseller you are holding.

Dataplex data governance plays every one of those roles for your data. Universal Catalog is the card catalog. Policy tags are the lock on the rare manuscripts. The business glossary is the index at the back of every book. Lineage is the librarian explaining the citation chain. The data itself never leaves Cloud Storage or BigQuery, the same way the books never leave the library building.

The Airport Customs and Baggage Hall

Think about how a busy international airport handles arriving passengers. Bags come off planes from dozens of cities, get scanned, sorted onto carousels, and then either pass through the green channel or get pulled aside for inspection. Customs officers tag suspicious items, log every search, and can rebuild a chain of custody if something goes missing.

Dataplex behaves like the customs hall for incoming data. Discovery scans pull each new file off the carousel and log what is inside. Profiling is the X-ray showing the shape of the contents. AutoDQ is the secondary inspection that flags forbidden items, and lineage is the chain-of-custody form that lets you trace any record back to the gate it arrived from.

The Hospital Medical Records Room

A hospital has thousands of patient charts. Some are routine, some contain genetic data, some are tied to active research consent forms. The hospital does not move these charts into a vault every time someone wants to read one. It simply controls who can pull a folder, what they are allowed to see when they open it, and writes down every access in a log book.

That is the model Dataplex uses. The data stays where it is. Policy tags decide whether a clinician sees the full social security number or just the last four digits. Audit logs record every query. The business glossary is the staff manual that says "this column called pt_dob means patient date of birth, and it is PHI under HIPAA."

Core Concepts of Dataplex Data Governance

Dataplex bundles several long-standing GCP services under one roof. Knowing the vocabulary saves you on multiple-choice questions.

Universal Catalog is the search-and-metadata surface formerly branded as Data Catalog. Every BigQuery table, BigLake table, Cloud Storage fileset, Pub/Sub topic, and Vertex AI model in your projects can show up here. You search with a Gmail-style query language, filter by system or tag, and click through to the underlying asset.

Entries and entry groups are how the catalog stores objects. An entry is a single asset; an entry group is a logical bucket that holds related entries (for example, all entries from a custom on-prem PostgreSQL importer).

Aspects are the modern, typed metadata attached to entries. They replaced the older "tags" concept and live inside aspect types, which behave like reusable schemas. An aspect type called "data_classification" might require a confidentiality level and a retention period, and you attach an instance of it to each entry that needs governing.

Taxonomies and policy tags sit one layer below. A taxonomy is a tree of policy tags ("PII" with children "PII::High", "PII::Medium"). You attach a policy tag to a BigQuery column and BigQuery enforces who can see that column.

Business glossary holds the human definitions: "monthly_active_users", "customer_lifetime_value", "fiscal_q3". Glossary terms link to entries so a marketing analyst can search the business word and land on the right BigQuery view.

Attribute store is the typed registry behind aspects, allowing you to declare custom attributes (for example, "data_steward_email" must be a valid email) that get reused everywhere.

Data profile scans sample columns and emit statistics: null ratio, distinct values, min/max, top values. Data quality scans (AutoDQ) evaluate rules — uniqueness, ranges, referential integrity, custom SQL — and produce pass/fail metrics with row-level diagnostics.

Lineage captures source-to-sink relationships emitted by BigQuery, Dataform, Dataflow, Dataproc Spark, Cloud Composer, and the Lineage API for custom workloads.

Policy tag: a label attached to a BigQuery column that points at an entry in a taxonomy. BigQuery checks the IAM bindings on the policy tag at query time to decide whether the caller is allowed to read that column. Defined under Data Catalog Policy Tag Manager. See https://cloud.google.com/bigquery/docs/column-level-security-intro

Architecture and Design Patterns

A working Dataplex deployment usually has four layers stacked on top of your raw storage.

At the bottom sit the actual data services: BigQuery datasets, Cloud Storage buckets, BigLake tables federating Iceberg or Hudi files, and any Pub/Sub topics carrying streaming events. Dataplex never owns these. It only references them.

Above that is the organization layer of lakes, zones, and assets. A lake is a logical container scoped to a project and region. Within a lake you create zones tagged either "raw" or "curated". You then attach assets, each of which points at a Cloud Storage bucket or BigQuery dataset. Discovery jobs run inside this layer and surface the schema of every file or table they find.
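To keep the lake, zone, and asset hierarchy straight, it can help to see it as a plain data model. The sketch below is only an in-memory illustration in Python, not the Dataplex API: the class names mirror the concepts in the paragraph above, and the lake, project, bucket, and dataset names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class ZoneType(Enum):
    RAW = "raw"
    CURATED = "curated"


@dataclass
class Asset:
    name: str
    resource: str  # the bucket or dataset the asset points at (never a copy)


@dataclass
class Zone:
    name: str
    zone_type: ZoneType
    assets: list[Asset] = field(default_factory=list)


@dataclass
class Lake:
    name: str
    project: str
    region: str
    zones: list[Zone] = field(default_factory=list)

    def resources(self, zone_type: ZoneType) -> list[str]:
        """Return the storage resources referenced by zones of one type."""
        return [
            asset.resource
            for zone in self.zones
            if zone.zone_type is zone_type
            for asset in zone.assets
        ]


# Hypothetical names, for illustration only.
claims_lake = Lake(
    name="claims",
    project="example-project",
    region="europe-west4",
    zones=[
        Zone("landing", ZoneType.RAW, [Asset("docs", "gs://example-claims-landing")]),
        Zone("reporting", ZoneType.CURATED, [Asset("marts", "example-project.claims_marts")]),
    ],
)
```

Calling claims_lake.resources(ZoneType.RAW) returns only the landing bucket, which matches the idea that a zone is a label over existing storage rather than a place data gets copied into.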

Above that is the catalog layer: entries, aspects, taxonomies, glossary terms, and lineage records. This is where governance metadata lives and where Universal Catalog reads from when an analyst types a search query.

The top is the policy layer: IAM roles on lakes and zones, IAM bindings on policy tags, organization policies that constrain what regions Dataplex can run in, and VPC Service Controls perimeters that wrap everything to prevent metadata or query results from leaving an approved network boundary.

Two design patterns dominate the PDE exam. The first is the medallion-style raw → curated zone split, where ingestion lands in a raw zone with permissive schema-on-read and Dataflow promotes cleansed data into a curated zone with strict schemas and AutoDQ rules. The second is the federated mesh pattern, where each business domain owns its own lake but central platform engineering manages the global taxonomy and glossary, giving you decentralized ownership with centralized governance.

Dataplex zones do not move or copy data. Promoting an asset from raw to curated is a logical relabel plus the discovery and quality rules you choose to apply. Plan separate Cloud Storage prefixes or BigQuery datasets ahead of time, because the zone boundary is purely a metadata construct. See https://cloud.google.com/dataplex/docs/introduction

GCP Service Deep Dive

Universal Catalog (formerly Data Catalog)

Universal Catalog is the search front door. Behind it is the same metadata graph that powered Data Catalog plus the new typed aspect model. Searches use predicates such as system=bigquery, type=dataset, tag:pii=true, or free text against descriptions. Results return entries you can pin, share, or open in BigQuery directly.

The catalog harvests three kinds of metadata. Technical metadata comes from the source system: schema, partition keys, table size, last-modified time. Operational metadata comes from the discovery service and from query logs: row counts, freshness, last-accessed user. Business metadata is what you attach yourself: aspects, glossary links, tags.

Aspect types are the modern way to model business metadata. Each aspect type has a JSON schema with strongly typed fields. When you attach an aspect to an entry, the values must validate against that schema. This is a real upgrade from the old tag template system, which only supported a flat list of primitive fields.
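The idea of validating an aspect against its aspect type can be sketched in a few lines of Python. This is a toy validator under stated assumptions, not the real aspect type schema format: the DATA_CLASSIFICATION structure, its field names, and the validate_aspect helper are all hypothetical.

```python
from typing import Any

# Hypothetical aspect type: field name -> (expected Python type, required?)
DATA_CLASSIFICATION = {
    "confidentiality": (str, True),
    "retention_days": (int, True),
    "notes": (str, False),
}


def validate_aspect(schema: dict[str, tuple[type, bool]], values: dict[str, Any]) -> list[str]:
    """Return a list of validation errors; an empty list means the aspect is valid."""
    errors = []
    for name, (expected_type, required) in schema.items():
        if name not in values:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(values[name], expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}")
    # Reject fields the aspect type does not declare.
    for name in values:
        if name not in schema:
            errors.append(f"unknown field: {name}")
    return errors
```

An aspect with both required fields validates cleanly, while one missing retention_days is rejected. That is the behavior you rely on when you make ownership or classification fields mandatory.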

Policy Tags and Taxonomies

Taxonomies live in Data Catalog Policy Tag Manager and are scoped to a single GCP region. You design a tree such as confidentiality > restricted > pii_high. On each policy tag you bind the role Data Catalog Fine-Grained Reader to the principals who are allowed to see columns wearing that tag.

Once a tag is applied to a BigQuery column, every query that touches that column is checked. Users without the role get either an access-denied error (default) or, when you turn on column-level access control with masking, they get a transformed value such as NULL, a SHA256 hash, or the default value for the type.
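The deny-versus-mask behavior is easier to remember as a decision function. The following Python sketch is a simplified model of the check described above, not BigQuery's implementation: the column-to-tag mapping, the principals, and the masking option names ("nullify", "sha256", "default") are assumptions chosen for illustration, and real masked output formats may differ.

```python
import hashlib

# Hypothetical state: which policy tag each column wears, and which
# principals hold the Fine-Grained Reader role on each tag.
COLUMN_TAGS = {"ssn": "pii_high", "email": "pii_low"}
FINE_GRAINED_READERS = {
    "pii_high": {"auditor@example.com"},
    "pii_low": {"auditor@example.com", "analyst@example.com"},
}


def read_column(principal, column, value, masking=None):
    """Return what the principal sees for one column value."""
    tag = COLUMN_TAGS.get(column)
    # Untagged columns and authorized readers see the raw value.
    if tag is None or principal in FINE_GRAINED_READERS.get(tag, set()):
        return value
    # Default behavior without masking: the query is rejected.
    if masking is None:
        raise PermissionError(f"{principal} may not read column {column}")
    if masking == "nullify":
        return None
    if masking == "sha256":
        return hashlib.sha256(str(value).encode()).hexdigest()
    if masking == "default":
        return type(value)()  # e.g. "" for str, 0 for int
    raise ValueError(f"unknown masking rule: {masking}")
```

Note that the decision depends only on the binding on the policy tag, not on any dataset-level permission the principal may hold.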

A common pitfall: policy tags are regional. If you copy a BigQuery table from us to eu, the tags do not follow. You must recreate the taxonomy in the new region and reapply the tags, or your column-level security silently disappears.

Attribute Store and Aspect Types

The attribute store is where reusable custom fields live. You declare an aspect type once, version it, and reference it from any entry across any project under the same organization. This means a single change to your "data_steward_email" definition propagates without you hand-editing every entry.

Aspect types support inheritance and required fields. You can mark certain fields mandatory so that an entry cannot be saved without them. Combine this with Cloud Asset Inventory feeds and you can build automated checks that block any new BigQuery dataset from being created unless it carries a valid steward aspect.

Business Glossary

The glossary holds business terms, their definitions, and relationships such as synonyms or "is a kind of". A term like "active customer" can be linked to several BigQuery views that implement it differently for marketing and finance. The glossary makes it explicit that those two definitions exist and points users at the canonical one.

Glossary terms can be organized into categories with their own approval workflows. Steward roles ensure that only nominated users can publish a term, while everyone else can suggest edits.

Automated Metadata Harvest

Discovery jobs run on a schedule (default hourly) and crawl Cloud Storage prefixes and BigQuery datasets that you have registered as assets. For Cloud Storage they infer schema from CSV, JSON, Avro, Parquet, and ORC files, detect Hive-style partitions, and publish a BigLake or BigQuery external table so the data is queryable.
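Hive-style partition detection is the part of discovery that most often surprises people, so here is a minimal Python sketch of the idea. It assumes the common key=value directory convention and is not Dataplex's actual parser; the bucket and path are hypothetical.

```python
def parse_hive_partitions(object_path: str) -> dict[str, str]:
    """Extract key=value directory segments from an object path."""
    partitions = {}
    # Skip the final segment: it is the file name, not a partition directory.
    for segment in object_path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            partitions[key] = value
    return partitions


# Hypothetical object path.
example = parse_hive_partitions(
    "gs://example-bucket/sales/dt=2024-01-01/region=eu/part-0000.parquet"
)
```

Here example holds the dt and region keys, which is the kind of information discovery would surface as partition columns on the published table.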

Discovery emits events to Cloud Logging that you can route to Pub/Sub. A common pattern is to subscribe a Cloud Function to schema-change events so a downstream Dataform model is regenerated whenever new columns appear.

AutoDQ and Custom Data Quality Rules

AutoDQ is the auto-data-quality scan service. You point a scan at a BigQuery table or BigLake table, choose a sample (full table, percentage, or row count), and define rules. Built-in rule types cover null check, uniqueness, set membership, regex match, range check, statistic check (mean, median, stddev), referential integrity (foreign-key style), and table-level row-count delta.

For anything else, you write a custom SQL rule. The SQL must return zero rows for a pass, or one row per failing record. AutoDQ writes the failed-row sample to a results table you configure, so debugging a failed run is as simple as a SELECT.
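The "zero rows means pass" contract can be tried locally. The sketch below uses Python's built-in sqlite3 as a stand-in for BigQuery, so the SQL dialect and the evaluate_rule helper are assumptions; the table and rule are hypothetical. It only illustrates the semantics: the rule query selects the failing records.

```python
import sqlite3

# In-memory stand-in for a scan target (hypothetical table and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id INTEGER, amount REAL, paid_amount REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [(1, 100.0, 80.0), (2, 50.0, 75.0), (3, 200.0, 200.0)],
)

# Custom rule: a claim must never pay out more than its amount.
# The query returns one row per violating record.
RULE_SQL = """
SELECT claim_id, amount, paid_amount
FROM claims
WHERE paid_amount > amount
"""


def evaluate_rule(connection, rule_sql):
    """Return (passed, failing_rows); the rule passes when no rows come back."""
    failing = connection.execute(rule_sql).fetchall()
    return len(failing) == 0, failing


passed, failing_rows = evaluate_rule(conn, RULE_SQL)
```

In this data set claim 2 violates the rule, so passed is False and failing_rows contains that single record, mirroring what you would find in the configured results table.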

Scan results emit a structured Cloud Logging entry, an optional notification to Pub/Sub, and a metric to Cloud Monitoring. The PDE exam likes the pattern of routing the Pub/Sub notification into Cloud Composer to halt downstream DAGs when quality drops below a threshold.
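The halt-on-threshold pattern reduces to a small decision function sitting between the notification and the orchestrator. The Python sketch below is hedged heavily: the message shape (a JSON object with a "rules" list carrying "pass_ratio" values) is a hypothetical stand-in, not the real notification schema, and should_halt is an illustrative helper rather than a Composer or Pub/Sub API.

```python
import json


def should_halt(message_data: bytes, min_pass_ratio: float = 0.99) -> bool:
    """Decide whether downstream work should stop, given one scan notification.

    Assumes a hypothetical payload: {"rules": [{"name": ..., "pass_ratio": ...}]}.
    """
    payload = json.loads(message_data)
    ratios = [rule["pass_ratio"] for rule in payload.get("rules", [])]
    # Halt when the worst-performing rule falls below the threshold.
    return bool(ratios) and min(ratios) < min_pass_ratio


# Hypothetical notification body.
sample = json.dumps(
    {
        "table": "claims_curated",
        "rules": [
            {"name": "claim_id_unique", "pass_ratio": 1.0},
            {"name": "amount_in_range", "pass_ratio": 0.97},
        ],
    }
).encode()
```

With the default threshold of 0.99 the sample triggers a halt because one rule passed only 97% of rows. Keying on the worst rule rather than the average is a design choice that keeps a single broken contract from being hidden by many healthy ones.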

For tables that change shape often, use AutoDQ recommendations. Run a profiling scan first, then click "Generate rule recommendations". The service proposes baseline rules from the profile statistics, which you can then accept, edit, or discard before activating them. This shortens the cold-start time from hours to minutes. See https://cloud.google.com/dataplex/docs/auto-data-quality-overview

Data Profiling Scans

Profiling scans answer "what does this data look like" without you writing any SQL. For each column the scan reports null ratio, distinct count, min, max, average, standard deviation, top N values, and quartiles for numeric columns. Results land in a results table and are also surfaced inline in the BigQuery UI as a "Data profile" tab.
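To make the profile statistics concrete, here is a small Python sketch that computes the same kinds of numbers over an in-memory column. It uses only the standard library and is an approximation of the concept, not the service's algorithm; in particular, the population standard deviation and the default quantile method are assumptions.

```python
import statistics
from collections import Counter


def profile_column(values, top_n=3):
    """Compute basic profile statistics for one column of Python values."""
    non_null = [v for v in values if v is not None]
    profile = {
        "null_ratio": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct_count": len(set(non_null)),
        "top_values": Counter(non_null).most_common(top_n),
    }
    # Numeric-only statistics, skipped for strings and other types.
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        profile["min"] = min(non_null)
        profile["max"] = max(non_null)
        profile["mean"] = statistics.fmean(non_null)
        profile["stddev"] = statistics.pstdev(non_null)
        if len(non_null) >= 2:
            profile["quartiles"] = statistics.quantiles(non_null, n=4)
    return profile


# Hypothetical column with one null.
amounts = [10, 20, 20, None, 40]
amount_profile = profile_column(amounts)
```

For this column the null ratio is 0.2, there are three distinct values, and the mean of the non-null values is 22.5. Statistics like these are what rule recommendations start from, for example proposing a range check bounded by the observed min and max.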

Profiling is the foundation for several downstream features. AutoDQ rule recommendations read profile output. Data Insights, the Gemini-backed natural-language query feature, uses profile statistics to suggest example queries. Even cost monitoring uses profile snapshots to estimate the impact of clustering recommendations.

Lineage Emission

Lineage is captured automatically for BigQuery jobs, Dataform executions, Cloud Composer tasks that wrap supported operators, Dataflow templated pipelines, and Dataproc Spark jobs running with the lineage agent enabled. Each captured event records source datasets, target dataset, the user or service account that ran the job, and a job ID you can click into for the full SQL or Spark plan.

For custom workloads you call the Data Lineage API directly. You report a Process (the long-lived job definition), a Run (one execution), and one or more LineageEvent entries each carrying a list of source and target FQNs. This is how third-party schedulers such as Airflow on-prem or Argo can show up in the same lineage graph as your native GCP jobs.
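The Process, Run, and LineageEvent structure can be modeled in memory to see how a lineage graph is assembled and walked. The Python sketch below is not the Data Lineage API client: the dataclasses only mirror the three concepts named above, the upstream helper is hypothetical, and the FQN strings are illustrative placeholders that may not match the real naming formats.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    source: str  # fully qualified name of the input
    target: str  # fully qualified name of the output


@dataclass
class LineageEvent:
    links: list[Link]


@dataclass
class Run:
    name: str
    events: list[LineageEvent] = field(default_factory=list)


@dataclass
class Process:
    name: str
    runs: list[Run] = field(default_factory=list)


def upstream(processes, target_fqn):
    """Return every FQN that feeds target_fqn, directly or transitively."""
    parents = {}
    for process in processes:
        for run in process.runs:
            for event in run.events:
                for link in event.links:
                    parents.setdefault(link.target, set()).add(link.source)
    seen, stack = set(), [target_fqn]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical pipeline: landing files -> staging table -> reporting mart.
nightly = Process(
    name="claims_nightly",
    runs=[
        Run(
            name="run-001",
            events=[
                LineageEvent([Link("custom:landing/policy_docs", "bigquery:example-project.staging.claims")]),
                LineageEvent([Link("bigquery:example-project.staging.claims", "bigquery:example-project.marts.claim_severity")]),
            ],
        )
    ],
)
sources = upstream([nightly], "bigquery:example-project.marts.claim_severity")
```

Walking upstream from the mart returns both the staging table and the landing location, which is the same traversal the lineage UI performs when you trace a dashboard column back to its origin.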

The lineage UI shows a directed graph with column-level edges where the underlying job exposed them (BigQuery does, most others currently do not). You can pivot from any node to its catalog entry, its policy tags, and its quality history.

Memorize the Dataplex hierarchy and APIs the PDE exam loves to test: a lake contains zones (raw or curated), which contain assets that point at Cloud Storage buckets or BigQuery datasets. Discovery runs hourly by default, lineage records are retained 30 days unless exported, and the three APIs that must travel together inside a VPC-SC perimeter are dataplex.googleapis.com, datacatalog.googleapis.com, and datalineage.googleapis.com. AutoDQ built-in rule types cover null check, uniqueness, set membership, regex, range, statistic, referential integrity, and row-count delta — anything else is a custom SQL rule that must return zero rows to pass.

Common Pitfalls and Trade-offs

The first trap is regional taxonomy duplication. Teams build a beautiful PII taxonomy in us-central1, then promote a dataset to europe-west4 and discover that none of the column protection followed. There is no automatic cross-region replication for taxonomies. You either re-deploy via Terraform or accept that EU data needs its own parallel taxonomy.

A second is discovery cost on hot Cloud Storage buckets. Discovery scans list objects, read file headers, and parse schemas. If you point a scan at a bucket receiving thousands of small JSON files per minute, the scan will run constantly, list-cost charges will spike, and downstream BigLake table refreshes can lag. Filter the prefix tightly or switch to event-driven Pub/Sub notifications instead of recurring discovery.

Third, AutoDQ on full-table mode against billing-by-bytes-scanned tables can surprise you on cost. A daily 100% scan of a 50 TB table will scan 50 TB per run. Use the row sampling option, or add a partition filter so the scan only evaluates yesterday's slice.
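The cost gap between a full scan and a partition-filtered scan is simple arithmetic, sketched here in Python. The price per TiB is an illustrative placeholder parameter, not a quoted rate, and the partition size is an assumption; substitute the figures from your own billing model.

```python
TIB = 2**40  # bytes in one tebibyte


def monthly_scan_cost(bytes_per_run: int, runs_per_month: int, price_per_tib: float) -> float:
    """Estimate monthly cost for a scan billed by bytes scanned."""
    return (bytes_per_run / TIB) * runs_per_month * price_per_tib


ASSUMED_PRICE_PER_TIB = 6.25  # placeholder value, check current pricing

# Daily 100% scan of a 50 TiB table.
full_table = monthly_scan_cost(50 * TIB, 30, ASSUMED_PRICE_PER_TIB)

# Daily scan restricted to yesterday's partition, assumed to be 0.5 TiB.
latest_partition = monthly_scan_cost(TIB // 2, 30, ASSUMED_PRICE_PER_TIB)
```

Under these assumptions the full-table schedule costs one hundred times more per month than the partition-filtered one, because the ratio depends only on bytes scanned per run (50 TiB versus 0.5 TiB) and not on the price itself.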

Fourth, lineage gaps for Spark on Dataproc. The native Spark lineage agent supports a subset of Spark connectors. If you run a custom RDD job that reads from a third-party source, lineage will not be emitted unless you call the Lineage API yourself.

Fifth, policy tag sprawl. Teams sometimes create one taxonomy per project. After a year you have hundreds of overlapping "PII" taxonomies and no consistent IAM story. Centralize taxonomies in a governance project and grant other projects the Policy Tag Reader role so they can attach but not redefine.

Granting roles/datacatalog.categoryFineGrainedReader at the project or dataset level gives the user access to every policy-tagged column in scope. That is almost never what you want. Always bind the role at the individual policy tag, not higher. Audit existing bindings regularly with gcloud data-catalog taxonomies get-iam-policy. See https://cloud.google.com/bigquery/docs/column-level-security-intro

Best Practices

  • One taxonomy per regulatory domain, not per project. Centralize PII, financial, and health taxonomies in a dedicated governance project. Other projects reference them through IAM, never re-create them.
  • Mandatory aspect types for ownership. Define a data_ownership aspect type with required fields for steward, owner, and contact, and enforce its presence with a Cloud Asset Inventory feed that alerts on missing aspects.
  • Profile first, rule second. Run a profiling scan on every new curated table before writing AutoDQ rules. Use the recommendations as a starting point and tighten manually.
  • Sample large tables. For tables over a terabyte, set AutoDQ to a fixed row sample (for example 100,000) or to a partition filter so cost and run time stay predictable.
  • Wire AutoDQ failures into Composer. Send the Pub/Sub notification to a Cloud Function that pauses or fails downstream DAGs. Quality breaks should stop the pipeline, not just log a warning.
  • Lineage everywhere via the API. For any custom job that reads or writes data, emit a lineage event. The cost is negligible and the operational value during incident response is enormous.
  • Wrap the catalog with VPC-SC. If your data lives inside a perimeter, the catalog and lineage APIs must also be inside the perimeter so an exfiltration attempt cannot read metadata to plan a deeper attack.

Real-World Use Case

A mid-size European insurer runs a customer-360 platform on GCP. They hold roughly 800 BigQuery tables across six business domains (claims, underwriting, marketing, finance, fraud, HR), plus 40 TB of Cloud Storage holding scanned policy documents and call center recordings. Compliance requires GDPR-grade access control on anything containing a customer identifier, and Solvency II quarterly reporting depends on three regulator-facing dashboards.

The platform team builds two governance projects. The first holds the central taxonomy with three top-level policy tags (pii_high, pii_low, regulatory) and a business glossary of about 200 terms. The second hosts Universal Catalog aspect types: data_ownership, regulatory_classification, retention_policy, pipeline_freshness_sla.

Each domain owns its own lake. Discovery runs hourly on the Cloud Storage zones and registers BigLake tables automatically. AutoDQ scans run nightly on every curated table with rules generated from profile recommendations and tuned by the data steward in each domain. Failures publish to a central Pub/Sub topic, which Cloud Composer subscribes to in order to halt the next morning's reporting DAG.

Lineage is enabled across BigQuery, Dataform, and Composer. When the regulator asks a question like "what is the source of the claim_severity column on dashboard X?", an analyst clicks one column in Looker, jumps to the catalog entry, and walks the lineage graph back to the raw policy document landing zone in Cloud Storage. What used to be a two-week investigation now takes ten minutes.

VPC Service Controls wrap the entire estate. Catalog API, Lineage API, BigQuery, Cloud Storage, and Dataplex are all inside one perimeter with a tightly scoped ingress rule for the corporate IdP. Even an over-privileged service account cannot exfiltrate metadata to a personal GCP project.

Exam Tips

The PDE exam tests Dataplex governance through scenario questions where the wrong answers usually look right at first glance. A few patterns to watch.

When the question mentions column-level security on BigQuery, the answer almost always involves policy tags from a Data Catalog taxonomy plus the Fine-Grained Reader role. Authorized views and row-level access policies solve different problems.

When the question asks how to automatically classify PII columns, look for the combination of Cloud DLP (now part of Sensitive Data Protection) inspection plus Dataplex aspect updates or policy tag application. DLP alone identifies sensitive data; Dataplex enforces what to do about it.

When the question wants automated metadata without manual ingestion code, discovery jobs on lakes or BigLake automatic refresh are the answer rather than a custom Cloud Function.

When the question describes failed records visible to engineers, that is AutoDQ with a results table configured. Profiling does not flag failures, only statistics.

When multi-cloud governance comes up (S3 or Azure Blob), look for BigLake tables registered as Dataplex assets. Universal Catalog can index them through BigLake, and policy tags work on BigLake columns the same way they do on native BigQuery tables.

When the prompt wants lineage across Composer DAGs that call custom Python code, the answer is the Lineage API, not built-in Composer lineage. Built-in lineage only covers operators that expose source and target metadata.

When the prompt mentions a VPC-SC perimeter blocking access to the catalog, the fix is to add datacatalog.googleapis.com and datalineage.googleapis.com to the perimeter's allowed services list, not to grant additional IAM roles.

For the PDE exam, remember that Dataplex Universal Catalog is the rebranded Data Catalog. Older study materials and even some current GCP UI labels still say "Data Catalog". The APIs (datacatalog.googleapis.com, dataplex.googleapis.com) and IAM role names have not changed. See https://cloud.google.com/dataplex/docs/catalog-overview

IAM for Dataplex Governance

IAM in Dataplex is layered. The lake, zone, and asset resources each accept IAM bindings; permissions inherit downward unless overridden. Discovery results, AutoDQ scans, and lineage records each have their own permissions but generally inherit from the parent lake.

Several role names show up regularly on the exam. roles/dataplex.admin grants full control over lakes and content. roles/dataplex.dataReader lets a user query data in a zone without managing it. roles/dataplex.metadataReader gives catalog search rights without data access. roles/datacatalog.categoryFineGrainedReader is the policy-tag-bound role that unlocks individual BigQuery columns. roles/datalineage.viewer and roles/datalineage.editor control read and write on lineage records.

The principle to remember: data access lives on BigQuery and Cloud Storage IAM, metadata access lives on Dataplex and Data Catalog IAM, and fine-grained column access lives on the policy tag itself. A user can have full BigQuery Data Viewer on a dataset and still see NULL in a column because the policy tag IAM denies them.

Service accounts deserve special care. The Dataplex service agent needs read access to discovery targets and write access to BigLake tables it creates. AutoDQ scans run as the user-defined service account you select; that account needs BigQuery Data Viewer on the scan target and BigQuery Data Editor on the results dataset. Forgetting the second one is the most common reason scans fail silently.

Integration with VPC Service Controls

VPC-SC is the perimeter product that prevents data and metadata from leaving an approved boundary. Dataplex governance integrates with VPC-SC at three layers.

First, the Dataplex API itself (dataplex.googleapis.com) is a supported VPC-SC service. Adding it to a perimeter blocks metadata reads and lake management calls from clients outside the perimeter, even with valid IAM credentials.

Second, the Data Catalog API (datacatalog.googleapis.com) and the Data Lineage API (datalineage.googleapis.com) are separately supported. You normally add all three together, because catalog and lineage queries from outside the perimeter would otherwise leak schema and pipeline structure to attackers planning a data exfiltration.

Third, the enforcement targets matter. BigQuery, Cloud Storage, BigLake, and Pub/Sub must also be inside the perimeter. Dataplex is the metadata layer; the perimeter still has to wrap the underlying data services to actually contain the data itself.

A common mistake is to put Dataplex inside a perimeter but leave BigQuery outside. Discovery jobs running inside the perimeter need to call BigQuery; the call crosses the boundary and fails. The fix is either to bring BigQuery into the perimeter or to use a perimeter bridge. The exam loves to test this exact misconfiguration.

When designing perimeters for Dataplex, always include dataplex.googleapis.com, datacatalog.googleapis.com, datalineage.googleapis.com, bigquery.googleapis.com, and storage.googleapis.com together. Missing any one of them creates either a broken pipeline or a silent metadata leak. See https://cloud.google.com/dataplex/docs/introduction
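A perimeter review can be automated with a simple set difference. The Python sketch below checks a list of restricted services against the five services named above; the missing_services helper and the sample perimeter are hypothetical, and in practice the input would come from whatever you use to export perimeter configuration.

```python
# The five services this note says must sit inside the perimeter together.
REQUIRED_SERVICES = frozenset(
    {
        "dataplex.googleapis.com",
        "datacatalog.googleapis.com",
        "datalineage.googleapis.com",
        "bigquery.googleapis.com",
        "storage.googleapis.com",
    }
)


def missing_services(restricted_services) -> list[str]:
    """Return required services absent from the perimeter, sorted for stable output."""
    return sorted(REQUIRED_SERVICES - set(restricted_services))


# Hypothetical perimeter that forgot BigQuery.
gaps = missing_services(
    [
        "dataplex.googleapis.com",
        "datacatalog.googleapis.com",
        "datalineage.googleapis.com",
        "storage.googleapis.com",
    ]
)
```

Here gaps contains bigquery.googleapis.com, the exact misconfiguration in which discovery inside the perimeter fails when it calls out to BigQuery.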

Operational Patterns and Cost Considerations

A working Dataplex deployment generates ongoing cost in three buckets: discovery scans (charged per task usage hour), AutoDQ and profiling scans (charged on the BigQuery slots or on-demand bytes they consume), and the underlying storage and query bills you would have anyway.

Discovery is usually the cheapest. A typical lake with 50 assets and hourly discovery runs at a small fixed monthly cost. The trap is enabling discovery on a high-write Cloud Storage bucket, where the API list and read costs against the bucket itself dominate.

AutoDQ and profiling cost scales with the data scanned. The cheap pattern is to scan only the latest partition of a partitioned table, sampled to a fixed row count. The expensive pattern is full-table profiling on a daily schedule against a multi-terabyte fact table. Expect at least one PDE question to compare these two and ask you to identify the cheap option.

Universal Catalog itself is free for entries, search, and aspects up to generous monthly limits. You only start paying when you call the Catalog API at very high QPS or store unusually large numbers of aspect instances. Lineage is also free for the supported integrations, billed only when you call the Lineage API for custom workloads beyond the free quota.

A practical operational pattern: keep a small Cloud Monitoring dashboard with three charts: discovery task hours per lake, AutoDQ pass rate per critical table, and lineage event count per pipeline. The first catches runaway scans, the second catches data quality regressions, and the third catches pipelines that have stopped emitting lineage (often a sign they have silently failed).

Frequently Asked Questions (FAQ)

Is Dataplex Universal Catalog the same as the old Data Catalog?

Yes. In late 2024 Google rebranded Data Catalog as Dataplex Universal Catalog and folded it under the Dataplex product family. The underlying APIs (datacatalog.googleapis.com) and IAM roles kept their names, so existing scripts and Terraform modules continue to work. Newer features such as typed aspect types and the modernized search experience only exist on the Universal Catalog surface.

How do policy tags differ from BigQuery row-level access policies?

Policy tags control which columns a user can see; row-level access policies control which rows a user can see. Both can apply at the same time. Policy tags live in a Data Catalog taxonomy and are managed centrally, while row-level policies are per-table SQL predicates managed by the table owner. Use policy tags when the column is sensitive everywhere it appears and use row-level policies when the same column is fine for some rows but not others.

When should I use AutoDQ versus writing my own SQL checks in Dataform or dbt?

AutoDQ wins when you want a managed scheduler, results tables, lineage integration, and a UI that non-engineers can read. Writing checks in Dataform or dbt wins when the rules are tightly coupled to the transformation logic and you want them in the same pull request as the SQL. Many teams use both: AutoDQ for cross-pipeline contracts (uniqueness on customer_id, freshness SLAs), Dataform tests for transformation-internal invariants.

Can Dataplex govern data outside Google Cloud?

Indirectly. BigLake tables can federate over Amazon S3 and Azure Blob Storage, and BigLake tables register as Dataplex entries. So your S3 data shows up in Universal Catalog, can wear policy tags, and can be scanned by AutoDQ. The catalog also supports custom entry sources through the Catalog API for assets that have no GCP integration at all, such as on-prem PostgreSQL or Snowflake; you import metadata only, not data.

What happens to lineage if a BigQuery job is deleted?

Lineage records are independent of the underlying jobs. Deleting a BigQuery job from the job history does not delete the lineage event already captured. Lineage records are retained for 30 days by default. If you need longer retention for audit, export lineage events to BigQuery via the Lineage API and store them in a long-lived dataset.

How do I prevent discovery from creating BigLake tables I do not want?

Discovery can be configured per asset to publish, not publish, or publish only to a specific dataset. You can also exclude file patterns at the asset level so log files or temporary outputs are ignored. For full control, disable automatic publication and use the discovery output as metadata only, then create BigLake tables manually through Terraform.
