Introduction to Data Lineage and Impact Analysis
Data Lineage and Impact Analysis on Google Cloud answers two questions every regulator, auditor, and on-call engineer asks at 2 a.m.: where did this number come from, and what breaks if I change this column? On GCP, the canonical surface is Dataplex Universal Catalog (the renamed home of Data Catalog). It captures lineage automatically from BigQuery, Dataflow, Dataproc, Dataform, Vertex AI Pipelines, and Cloud Composer, and it exposes a Data Lineage API that lets you push events from custom workloads via OpenLineage. This study guide walks through how the system actually behaves in production, where it silently fails, and what the PDE exam is most likely to test on Data Lineage and Impact Analysis.
Plain English Explanation
Data Lineage and Impact Analysis is like a kitchen recipe trail
Picture a busy restaurant kitchen. A diner sends back a tomato bisque because it tastes off. The head chef does not start tasting every pot on the line. She walks backwards from the bowl: which pot, which stockpot before that, which crate of tomatoes, which supplier crate stamp. That backward walk is lineage. Now flip it. Tomorrow morning the supplier calls and says one crate was contaminated. The chef walks forward from that crate: which prep bin, which pots, which dishes left the pass, which tables. That forward walk is impact analysis. Data Lineage and Impact Analysis on GCP gives you both walks, and the kitchen is your data warehouse.
Think of policy tags as colored stickers on ingredients
Every ingredient at a Michelin restaurant has a colored sticker: red for allergens, blue for halal, yellow for organic. The sticker travels with the ingredient as it gets chopped, mixed, and plated. If a red sticker ends up on a dish that was sold as allergen-free, the kitchen has a problem. In BigQuery, policy tags from Dataplex Universal Catalog work the same way. Tag an email column as PII once, and the tag follows the column through views, materialized views, and downstream tables. When an analyst joins it into a public dashboard, the lineage graph plus policy tag tells you exactly which dashboards leak PII.
Impact analysis is a power-grid blackout map
Utility operators do not memorize every house wired to every transformer. They keep a graph. When a transformer trips, the dispatcher pulls up the map, sees the 1,400 homes downstream, and routes crews accordingly. When a substation needs maintenance at 3 a.m., they run the same query in reverse to schedule notifications. Replace transformer with staging.orders_raw and homes with the 47 BigQuery views, 3 Dataflow jobs, 2 Vertex AI feature sets, and 1 Looker dashboard that depend on it. Hit the impact analysis endpoint and you get the same blackout map for data.
Core Concepts of Data Lineage and Impact Analysis
Lineage on GCP is a directed acyclic graph stored by the Data Lineage API. It has three primitive entities you should memorize.
A Process is the thing that moves or transforms data: a BigQuery query, a Dataflow job, a Dataform action, or a Cloud Composer task. Each Process has an identifier and a fully qualified name (FQN) that tells the catalog what kind of system produced it.
A Run is a single execution of a Process. The same scheduled query running daily produces one new Run per execution. Runs carry start time, end time, and state.
A Lineage Event ties a Run to its inputs and outputs. Inputs and outputs are referenced by FQN, for example bigquery:my-project.sales.orders_raw. Events are the unit of write; the graph is materialized by the API when you query it.
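The three entities can be pictured as plain records. The sketch below is a mental model only, not the API's actual schema: the class names mirror the terms above, but every field name and the sample identifiers are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class Process:
    """The thing that moves or transforms data, e.g. a scheduled query."""
    name: str            # hypothetical identifier
    display_name: str


@dataclass
class Run:
    """One execution of a Process."""
    process: Process
    start_time: datetime
    end_time: datetime
    state: str           # illustrative value such as "COMPLETED"


@dataclass
class LineageEvent:
    """Ties a Run to the FQNs it read (inputs) and wrote (outputs)."""
    run: Run
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


def edges(event: LineageEvent) -> List[Tuple[str, str]]:
    """Expand one event into (source FQN, target FQN) pairs."""
    return [(src, dst) for src in event.inputs for dst in event.outputs]


daily = Process(name="processes/daily-orders", display_name="Daily orders rollup")
run = Run(daily, datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 5), "COMPLETED")
event = LineageEvent(
    run,
    inputs=["bigquery:my-project.sales.orders_raw"],
    outputs=["bigquery:my-project.sales.orders_daily"],
)
print(edges(event))
```

The same Process would accumulate one Run per day, and each Run would carry its own event, which is why the graph is built from events rather than from Processes.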
Two query patterns dominate Data Lineage and Impact Analysis. Upstream search asks "what does this table depend on?" Downstream search asks "what depends on this table?" Both are calls to searchLinks on the Data Lineage API: pass the asset as the target to find its upstream sources, or as the source to find its downstream consumers. Each call returns only direct links, one hop away, so deep traversals require client-side recursion.
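Because a single search covers only a limited slice of the graph, multi-hop questions need a loop on the client. The following is a minimal sketch of that loop as a breadth-first traversal. The search_links callable is a stand-in for a real API call, and the edge list, function names, and max_hops default are all hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical in-memory edge list standing in for API responses.
EDGES = [
    ("bigquery:p.raw.orders", "bigquery:p.staging.orders"),
    ("bigquery:p.staging.orders", "bigquery:p.marts.revenue"),
    ("bigquery:p.staging.orders", "bigquery:p.marts.churn"),
    ("bigquery:p.marts.revenue", "bigquery:p.reports.board_kpis"),
]


def fake_search_links(fqn: str, downstream: bool) -> List[str]:
    """Return the direct neighbors of fqn in the requested direction."""
    if downstream:
        return [dst for src, dst in EDGES if src == fqn]
    return [src for src, dst in EDGES if dst == fqn]


def traverse(
    start: str,
    downstream: bool,
    search_links: Callable[[str, bool], List[str]],
    max_hops: int = 10,
) -> Dict[str, int]:
    """Breadth-first walk; returns each reachable FQN with its hop count."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        hops = seen[current]
        if hops >= max_hops:
            continue  # stop expanding once the hop budget is spent
        for neighbor in search_links(current, downstream):
            if neighbor not in seen:  # guards against revisiting shared nodes
                seen[neighbor] = hops + 1
                queue.append(neighbor)
    del seen[start]
    return seen


print(traverse("bigquery:p.raw.orders", True, fake_search_links))
print(traverse("bigquery:p.reports.board_kpis", False, fake_search_links))
```

Tracking visited nodes keeps the call count proportional to the number of assets rather than the number of paths, which matters once several jobs fan into the same table.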
Fully Qualified Name (FQN): a canonical string that uniquely identifies a data asset across services. For a BigQuery table it is formatted as bigquery:<project>.<dataset>.<table>; other systems use their own prefix and path. The FQN is how the Data Lineage API stitches BigQuery tables, Dataflow jobs, Pub/Sub topics, Cloud Storage objects, and Vertex AI artifacts into one graph. See https://cloud.google.com/data-catalog/docs/fully-qualified-names
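A small helper pair makes the BigQuery form of the FQN concrete. This is a sketch for the bigquery: prefix only, assuming project, dataset, and table names contain no dots; the function and type names are hypothetical, and other systems would need their own parsing rules.

```python
from typing import NamedTuple


class BigQueryTable(NamedTuple):
    project: str
    dataset: str
    table: str


def build_bigquery_fqn(project: str, dataset: str, table: str) -> str:
    """Compose the FQN string for a BigQuery table."""
    return f"bigquery:{project}.{dataset}.{table}"


def parse_bigquery_fqn(fqn: str) -> BigQueryTable:
    """Split a BigQuery FQN back into its three parts."""
    system, separator, path = fqn.partition(":")
    if system != "bigquery" or not separator:
        raise ValueError(f"not a BigQuery FQN: {fqn!r}")
    parts = path.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.table, got {path!r}")
    return BigQueryTable(*parts)


fqn = build_bigquery_fqn("my-project", "sales", "orders_raw")
print(fqn)
print(parse_bigquery_fqn(fqn))
```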
Architecture and Design Patterns
The reference architecture has four layers. At the bottom, producers emit lineage. BigQuery does this automatically for every job whose project has the Data Lineage API enabled. Dataflow emits lineage for templated and Apache Beam jobs starting from SDK 2.50. Dataproc Serverless and Dataproc on Compute Engine emit lineage when the Spark Lineage init action or the Spark BigQuery connector is configured. Dataform emits lineage for every action it executes. Custom workloads emit via the OpenLineage HTTP transport pointed at the GCP OpenLineage endpoint.
The Data Lineage API sits in the middle. It owns persistence, deduplication of events from the same Run, and the search graph. The same API serves the Dataplex Universal Catalog UI, the BigQuery console lineage tab, and any third-party tool you wire up.
On top sits Dataplex Universal Catalog itself. It joins lineage with technical metadata (schemas, partitions), business metadata (glossary terms, descriptions), and governance metadata (policy tags, IAM, data quality scores). The catalog is what humans actually look at.
Finally, consumers pull from the Lineage API or BigQuery INFORMATION_SCHEMA.JOBS for compliance reports, change-impact tickets, and incident response runbooks. This is where downstream impact analysis lives.
The dominant design pattern is the passive collection pattern: turn on the API, let producers emit, query when needed. The alternative is the push-with-OpenLineage pattern, used when you have a non-GCP transform (an on-prem dbt project, Airflow running on AWS, a Snowflake cross-cloud share) that you still want to appear in the same lineage graph.
Lineage is only auto-captured for BigQuery jobs after you enable the Data Lineage API on the project that runs the job, not the project that owns the data. A query in analytics-prod reading from data-platform records lineage only if analytics-prod has the API on. This is the single most common reason teams report missing lineage edges. See https://cloud.google.com/bigquery/docs/about-lineage
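The rule is easy to audit mechanically. The sketch below flags jobs whose running project lacks the API, regardless of where the data lives. The job records and the set of enabled projects are hypothetical inputs that you would assemble from your own inventory.

```python
from typing import Dict, Iterable, List, Set


def jobs_missing_lineage(jobs: Iterable[Dict[str, str]], enabled: Set[str]) -> List[str]:
    """Return job IDs whose *running* project does not have the API enabled."""
    return [job["job_id"] for job in jobs if job["job_project"] not in enabled]


jobs = [
    {"job_id": "job_a", "job_project": "analytics-prod", "data_project": "data-platform"},
    {"job_id": "job_b", "job_project": "data-platform", "data_project": "data-platform"},
]

# Only the data-owning project has the API on, so the analytics-prod job is a gap.
print(jobs_missing_lineage(jobs, enabled={"data-platform"}))
```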
GCP Service Deep Dive
BigQuery query lineage
BigQuery captures lineage for SELECT, INSERT, MERGE, CREATE TABLE AS SELECT, and load jobs. Each job becomes a Process plus a Run. Inputs are the tables the planner reads, outputs are the destination table or the materialized result. Wildcards in FROM resolve to each matched table as a separate input edge, so a query against events_* with 365 daily shards creates 365 input edges. This is normal, not a bug, but it inflates graph size for downstream analysis.
The lineage tab in the BigQuery console shows a visual graph. Behind it, you can also pull edges with SQL via INFORMATION_SCHEMA.JOBS_BY_PROJECT, joining referenced_tables and destination_table. For programmatic use, prefer the Data Lineage API because it handles authorized views and routines that INFORMATION_SCHEMA truncates.
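The SQL route can be sketched as a query builder. It assumes the referenced_tables and destination_table columns of INFORMATION_SCHEMA.JOBS_BY_PROJECT and a region qualifier such as region-us. The function name, the default window, and the filters are illustrative choices, not a recommended production query.

```python
import re

REGION_PATTERN = re.compile(r"^region-[a-z0-9-]+$")


def build_edge_query(region: str = "region-us", days: int = 7) -> str:
    """Build SQL that lists (source, target) table pairs per finished job."""
    if not REGION_PATTERN.match(region):
        raise ValueError(f"unexpected region qualifier: {region!r}")
    if days <= 0:
        raise ValueError("days must be positive")
    return f"""
SELECT
  job_id,
  CONCAT(ref.project_id, '.', ref.dataset_id, '.', ref.table_id) AS source_table,
  CONCAT(destination_table.project_id, '.', destination_table.dataset_id, '.',
         destination_table.table_id) AS target_table
FROM `{region}`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(referenced_tables) AS ref
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days} DAY)
  AND state = 'DONE'
  AND error_result IS NULL
  AND destination_table.table_id IS NOT NULL
""".strip()


print(build_edge_query(days=30))
```

Each row is one input edge, so a job that reads five tables yields five rows. Ad-hoc queries write to anonymous destination tables, which you may want to filter out before treating the result as a graph.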
Streaming inserts and the BigQuery Storage Write API also emit lineage when the writer identifies itself with a requestId. Naked tabledata.insertAll calls without metadata do not, which is one reason teams migrating to the Storage Write API see their lineage graphs suddenly fill in.
Dataplex Universal Catalog and policy tags
Dataplex Universal Catalog is the renamed and merged home of what used to be Data Catalog plus Dataplex. It hosts the policy tag taxonomy: a hierarchical set of tags like PII > High > Email or PCI > Cardholder > PAN. You attach a tag to a BigQuery column. Column-level access control then enforces that only principals with the Fine-Grained Reader role on that tag can read the column.
The reason this matters for Data Lineage and Impact Analysis: policy tags propagate through views. When you create a view that selects a tagged column, queries against the view are subject to the same tag. The lineage graph is what tells you which downstream views and materialized tables touched the tagged column, so when you add a new tag to an existing column, you can run downstream impact analysis to enumerate everything that just gained access restrictions.
Business glossary
The business glossary is a separate Dataplex Universal Catalog feature that defines terms (Customer, Active User, Net Revenue) once and links them to physical assets. Glossary terms appear in the same UI as lineage, so an analyst clicking on revenue_dashboard sees both the lineage graph and the canonical definition of Net Revenue it is supposed to represent. This closes a long-standing gap where lineage was technical and definitions lived in Confluence.
Dataflow and Dataproc lineage emission
Dataflow emits lineage automatically for jobs using Apache Beam SDK 2.50 or later when the job reads or writes BigQuery, Cloud Storage, Pub/Sub, or Bigtable through the standard IO connectors. The job's job ID becomes the Run ID. Custom IO connectors do not emit; you must wrap them with OpenLineage events.
Dataproc Serverless emits Spark lineage when you set the spark.openlineage.transport.type=gcplineage property. Dataproc on Compute Engine needs the Spark Lineage init action plus the same property. PySpark, Spark SQL, and Scala Spark all emit. Hive on Dataproc does not emit natively; you would push events from a Cloud Composer hook.
Dataform lineage
Dataform is the SQL workflow layer that compiles .sqlx files into BigQuery jobs. It produces lineage for free because the jobs it runs are BigQuery jobs, but it also adds action-level lineage that shows the Dataform action graph (the dependency tree the developer wrote) overlaid on the physical table graph. This is helpful for impact analysis at the source-controlled layer rather than the materialized-table layer.
OpenLineage integration
OpenLineage is the open-source standard for lineage events. GCP exposes a managed endpoint that accepts OpenLineage events and writes them to the Data Lineage API. You point your existing OpenLineage emitters (the Airflow OpenLineage provider, the dbt-openlineage adapter, the Spark OpenLineage listener) at the GCP endpoint, and your non-GCP workloads start showing up next to BigQuery jobs in the same graph. This is how multi-cloud and hybrid customers achieve a single pane of glass without abandoning OpenLineage tooling they already operate.
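An OpenLineage run event is a JSON document, so building one needs nothing beyond the standard library. The sketch below follows the top-level field names of the OpenLineage RunEvent schema. The producer URL, the schema version in schemaURL, and the sample job and dataset names are assumptions to replace with values that match your emitter.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import List, Optional, Tuple

Dataset = Tuple[str, str]  # (namespace, name)


def build_run_event(
    job_namespace: str,
    job_name: str,
    inputs: List[Dataset],
    outputs: List[Dataset],
    event_type: str = "COMPLETE",
    run_id: Optional[str] = None,
    event_time: Optional[datetime] = None,
) -> dict:
    """Assemble a minimal OpenLineage RunEvent as a plain dict."""
    when = event_time or datetime.now(timezone.utc)
    return {
        "eventType": event_type,
        "eventTime": when.isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Hypothetical producer URI; identify your own emitter here.
        "producer": "https://example.com/hypothetical-emitter",
        # Assumed spec version; use the one your tooling targets.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = build_run_event(
    job_namespace="onprem-airflow",
    job_name="nightly_orders_export",
    inputs=[("postgres://erp-db:5432", "erp.public.orders")],
    outputs=[("bigquery", "my-project.sales.orders_raw")],
    run_id="3f5e83fa-3480-44ff-99c5-ff943904e5e8",
    event_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(json.dumps(event, indent=2))
```

Reusing the same runId for the START and COMPLETE events of one execution is what lets the receiver treat them as a single Run.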
Data Lineage API
The Data Lineage API has three core RPCs to remember. ProcessOpenLineageRunEvent accepts an OpenLineage payload and converts it to native entities. SearchLinks returns upstream or downstream edges for a given FQN. BatchSearchLinkProcesses returns the Processes that produced a set of edges, which you call after SearchLinks to enrich the result. Quotas are generous for read but rate-limited for write; bulk backfills require throttling.
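The three RPCs map onto three REST calls. The sketch below only builds the request descriptions and sends nothing, so it can be read without credentials. The URL shapes and body field names (source, target, fullyQualifiedName, links) are assumptions based on the public REST reference and should be verified against it before use.

```python
from typing import Dict, List

BASE_URL = "https://datalineage.googleapis.com/v1"


def _location(project: str, location: str) -> str:
    return f"projects/{project}/locations/{location}"


def search_links_request(project: str, location: str, fqn: str, downstream: bool) -> Dict:
    """Downstream: the asset is the link source. Upstream: it is the target."""
    side = "source" if downstream else "target"
    return {
        "method": "POST",
        "url": f"{BASE_URL}/{_location(project, location)}:searchLinks",
        "body": {side: {"fullyQualifiedName": fqn}},
    }


def batch_search_link_processes_request(project: str, location: str, links: List[str]) -> Dict:
    """Enrich links returned by searchLinks with the Processes behind them."""
    return {
        "method": "POST",
        "url": f"{BASE_URL}/{_location(project, location)}:batchSearchLinkProcesses",
        "body": {"links": list(links)},
    }


def process_open_lineage_request(project: str, location: str, event: Dict) -> Dict:
    """Push one OpenLineage run event; the event itself is the request body."""
    return {
        "method": "POST",
        "url": f"{BASE_URL}/{_location(project, location)}:processOpenLineageRunEvent",
        "body": event,
    }


req = search_links_request("my-project", "us", "bigquery:my-project.sales.orders_raw", True)
print(req["url"])
print(req["body"])
```

Keeping request construction separate from transport makes it straightforward to add throttling around the write path during a bulk backfill.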
Vertex AI ML pipeline integration
Vertex AI Pipelines (Kubeflow Pipelines on Vertex) emits lineage for every component that reads from BigQuery, Cloud Storage, or the Vertex Feature Store. The lineage graph then connects training data to feature views to trained models to batch prediction outputs. When a regulator asks "which model produced this credit decision and what data was it trained on," the answer is one graph traversal. Vertex AI Model Registry entries are first-class FQNs in the lineage graph, so models appear as nodes alongside tables.
Turn on the Data Lineage API in every project that runs jobs with gcloud services enable datalineage.googleapis.com, and bake that command into your project-creation automation, because APIs are enabled per project rather than inherited from the organization. Per-project enablement is the most common cause of lineage gaps when a new project gets spun up by a team that forgot the runbook. See https://cloud.google.com/dataplex/docs/lineage-rest-api
Common Pitfalls and Trade-offs
The first pitfall is assuming lineage is retroactive. It is not. Enable the API today, you get lineage from today. Jobs that ran yesterday are invisible. For backfill, you have two options: re-run the jobs (cheap if Dataform-driven, expensive if ad-hoc), or scrape INFORMATION_SCHEMA.JOBS going back six months and push synthetic OpenLineage events. Most teams accept the gap.
The second is wildcard explosion. A query against events_* over 3 years creates over a thousand edges. The lineage UI handles it, but client-side downstream traversals time out. Aggregate by parent dataset before traversing.
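One way to aggregate before traversing is to collapse date-sharded tables into a single family node. The sketch below assumes shards end in an eight-digit _YYYYMMDD suffix, and the sample table names are hypothetical. Grouping by parent dataset would be a coarser variant of the same idea.

```python
import re
from collections import Counter
from datetime import date, timedelta
from typing import Dict, Iterable

SHARD_SUFFIX = re.compile(r"_\d{8}$")


def collapse_shards(fqns: Iterable[str]) -> Dict[str, int]:
    """Map each FQN to its shard family and count the members of each family."""
    counts: Counter = Counter()
    for fqn in fqns:
        counts[SHARD_SUFFIX.sub("_*", fqn)] += 1
    return dict(counts)


# Three years of daily shards (2021 through 2023) plus one ordinary table.
start = date(2021, 1, 1)
inputs = [
    f"bigquery:p.analytics.events_{(start + timedelta(days=i)):%Y%m%d}"
    for i in range(3 * 365)
]
inputs.append("bigquery:p.analytics.users")

print(collapse_shards(inputs))
```

After collapsing, a traversal expands two nodes instead of 1,096, and the per-family count is still available if someone needs to know how many shards were read.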
The third is service account attribution. Lineage records the principal that ran the job. If 200 dbt models run under one service account, the graph cannot tell you which human authored the change. Push the dbt invocation user via OpenLineage event metadata, or read it from the Cloud Audit Logs join.
The fourth is cross-region lineage latency. Lineage events from a us-central1 BigQuery job appear in the multi-region US graph within seconds, but a europe-west4 job to a us-multi-region table takes longer to settle. Compliance reports that run within 60 seconds of a job completing can miss edges.
The fifth is the sql_string trap. Lineage stores the SQL text that generated the job. If your query embeds secrets via string concatenation (it should not, but it happens), those secrets are now in the lineage record, which is readable by anyone with roles/datalineage.viewer. Treat lineage as a read-restricted surface.
The Data Lineage API does not capture lineage for queries that use EXTERNAL_QUERY against Cloud SQL or Spanner. The federation layer hides the source from the BigQuery planner, so the upstream edge is missing. If you need lineage across federated sources, schedule a Dataflow job that materializes the external data into BigQuery and let lineage capture that step instead. See https://cloud.google.com/bigquery/docs/about-lineage#limitations
Best Practices
Enable the Data Lineage API as part of project creation (for example in a project factory or Terraform module) so every new project has it from day one; API enablement is per project and is not inherited from the organization. The cost of enablement is zero; the cost of forgetting is six months of missing graph.
Standardize FQNs for non-GCP assets. If your Snowflake instance gets emitted as snowflake://account in one OpenLineage emitter and snowflake:account.warehouse in another, you end up with two disconnected sub-graphs. Pick one convention, write it in a runbook, enforce it in the OpenLineage emitter config.
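Enforcing the convention can be as simple as passing every emitted name through one normalizer. The sketch below picks a hypothetical target form, snowflake:<dot-separated path> in lowercase. Both the target form and the lowercasing are assumptions, and the point is only that all emitters share the same function.

```python
def normalize_snowflake_fqn(raw: str) -> str:
    """Rewrite either observed spelling into one canonical form."""
    text = raw.strip().lower()
    if text.startswith("snowflake://"):
        path = text[len("snowflake://"):]
    elif text.startswith("snowflake:"):
        path = text[len("snowflake:"):]
    else:
        raise ValueError(f"not a Snowflake name: {raw!r}")
    # Treat URL-style slashes and dots as the same separator.
    path = path.strip("/").replace("/", ".")
    if not path:
        raise ValueError(f"empty path in {raw!r}")
    return f"snowflake:{path}"


print(normalize_snowflake_fqn("snowflake://ACME/analytics/orders"))
print(normalize_snowflake_fqn("snowflake:acme.analytics.orders"))
```

Both inputs resolve to the same string, so events from the two emitters land on one node instead of two disconnected ones.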
Combine lineage with Cloud Audit Logs in the same BigQuery dataset. Lineage tells you the graph; audit logs tell you who and when. Joining them on job_id gives you a complete change-impact record that satisfies SOC 2 and ISO 27001 evidence requirements.
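The join itself is a plain key lookup. In practice it would be a SQL join inside BigQuery, but the logic can be sketched in a few lines of Python. The record shapes, field names, and sample values below are hypothetical.

```python
from typing import Dict, List, Optional

lineage_rows = [
    {"job_id": "job_1", "source": "bigquery:p.raw.orders", "target": "bigquery:p.staging.orders"},
    {"job_id": "job_2", "source": "bigquery:p.staging.orders", "target": "bigquery:p.marts.revenue"},
]
audit_rows = [
    {"job_id": "job_1", "principal": "etl-sa@p.iam.gserviceaccount.com", "timestamp": "2024-01-01T02:00:00Z"},
]


def join_on_job_id(lineage: List[Dict], audit: List[Dict]) -> List[Dict[str, Optional[str]]]:
    """Left join: keep every edge, attach who and when if an audit entry exists."""
    audit_by_job = {row["job_id"]: row for row in audit}
    joined = []
    for edge in lineage:
        entry = audit_by_job.get(edge["job_id"], {})
        joined.append({
            **edge,
            "principal": entry.get("principal"),
            "timestamp": entry.get("timestamp"),
        })
    return joined


for row in join_on_job_id(lineage_rows, audit_rows):
    print(row)
```

A left join is deliberate here: an edge with no matching audit entry is itself a finding worth surfacing in an evidence packet.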
Tag glossary terms with policy tags. When the business glossary entry for "Customer Email" is linked to the PII.High.Email policy tag, an engineer creating a new column can search the glossary, find the term, and inherit the tag automatically. This collapses three lookup steps into one.
Use Dataform to generate lineage you can predict. Ad-hoc queries create messy graphs; a Dataform-driven warehouse creates a graph that mirrors the dependency tree in source control. The reproducibility makes impact analysis trustworthy.
Schedule a weekly downstream impact report for sensitive tables. A scheduled query against the Data Lineage API for every table with a PII.High policy tag, output to a Looker dashboard, gives the data governance team a moving picture of where sensitive data flows.
Test impact analysis as part of CI. Before merging a Dataform pull request that drops a column, the CI job calls the Data Lineage API for downstream consumers and posts the list as a PR comment. Reviewers see the blast radius before approving.
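A CI step along these lines could be sketched as follows. The find_downstream callable stands in for a real lineage lookup, and the comment format, function names, and sample assets are hypothetical.

```python
from typing import Callable, List


def format_pr_comment(table_fqn: str, column: str, consumers: List[str]) -> str:
    """Render the blast radius as a short Markdown-style comment."""
    if not consumers:
        return f"No downstream consumers found for {table_fqn}; dropping `{column}` looks safe."
    lines = [f"Dropping `{column}` from {table_fqn} affects {len(consumers)} downstream asset(s):"]
    lines += [f"- {name}" for name in sorted(consumers)]
    return "\n".join(lines)


def impact_comment(
    table_fqn: str,
    column: str,
    find_downstream: Callable[[str], List[str]],
) -> str:
    """Look up consumers for the changed table and build the comment text."""
    return format_pr_comment(table_fqn, column, find_downstream(table_fqn))


def fake_find_downstream(fqn: str) -> List[str]:
    # Stand-in for a lineage query; returns fixed sample data.
    return ["bigquery:p.marts.revenue", "bigquery:p.marts.churn"]


print(impact_comment("bigquery:p.staging.orders", "discount_code", fake_find_downstream))
```

Posting the text is left to whatever the CI system provides. Keeping the lookup injectable means the formatting can be unit-tested without touching the API.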
The Data Lineage API exposes lineage from BigQuery, Dataflow, Dataproc, Dataform, Cloud Composer, and Vertex AI Pipelines automatically. Custom and non-GCP sources push events via OpenLineage. The graph is queried via SearchLinks for both upstream (where did this come from) and downstream (what depends on this) traversals. Policy tags propagate through views, and the lineage graph tells you which downstream assets a tag now applies to.
Real-World Use Case
A mid-sized European insurance carrier runs about 600 BigQuery scheduled queries, 40 Dataform projects, 15 Dataflow streaming pipelines, and 8 Vertex AI training pipelines. They had two governance problems. First, when a regulator asked "which model decided this claim and what training data was used," the answer took two analysts a full week of grep-through-git-and-ask-people. Second, when they merged two business units, they could not predict which of the 600 scheduled queries would break if they renamed policy_id to contract_id in the source warehouse.
They enabled the Data Lineage API across all 47 projects, configured Dataflow jobs to emit via SDK 2.55, set the OpenLineage Spark listener on Dataproc, and routed their on-prem Airflow DAGs through the OpenLineage HTTP transport pointed at the GCP endpoint. Dataform was already wired in by default. Vertex AI Pipelines were upgraded to the latest SDK so model nodes appeared in the graph.
The first regulator query that came in after rollout was answered in 12 minutes by one analyst clicking through the Vertex AI Model Registry node and traversing upstream to the training feature view, then to the source BigQuery tables, then to the Dataflow ingestion jobs, then to the Pub/Sub topics from the policy administration system. Total graph traversal: 7 hops, 142 nodes, all visible in one screen.
For the column rename, they wrote a 60-line Python script that called SearchLinks downstream from the source table, filtered for jobs whose sql_string matched policy_id, and produced a CSV of 89 affected scheduled queries with owners pulled from job labels. The migration, originally estimated at three months, took six weeks because the team knew exactly what to fix and who to coordinate with.
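The filtering and CSV steps of such a script might look like the sketch below. It is not the carrier's actual script: the job records are hypothetical sample data standing in for the downstream search results, and the field names and the "unknown" owner fallback are assumptions.

```python
import csv
import io
import re
from typing import Dict, List

JOBS = [
    {"name": "claims_daily", "sql": "SELECT policy_id, amount FROM src.claims",
     "labels": {"owner": "claims-team"}},
    {"name": "renewals", "sql": "SELECT old_policy_id FROM src.renewals",
     "labels": {"owner": "sales-team"}},
    {"name": "premium_rollup", "sql": "SELECT p.policy_id FROM src.policies p",
     "labels": {}},
]


def affected_jobs(jobs: List[Dict], column: str) -> List[Dict[str, str]]:
    """Keep jobs whose SQL mentions the column as a whole word."""
    pattern = re.compile(rf"\b{re.escape(column)}\b", re.IGNORECASE)
    return [
        {"job": job["name"], "owner": job["labels"].get("owner", "unknown")}
        for job in jobs
        if pattern.search(job["sql"])
    ]


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the rows with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["job", "owner"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(affected_jobs(JOBS, "policy_id")))
```

The word-boundary match keeps old_policy_id out of the result, which matters when the goal is a list people will act on. Text matching can still miss references hidden behind SELECT * or views, so the output is a starting point rather than proof of completeness.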
The annual cost of the Data Lineage API for that volume is zero (the API is free at typical usage; costs accrue only on the queries you run against it via BigQuery). The cost of not having it was measured in regulator response time and migration risk.
Exam Tips
The PDE exam tests Data Lineage and Impact Analysis through scenario questions, not configuration recall. Memorize the entity model (Process, Run, Lineage Event) and the FQN format because distractors will swap them around.
When a question describes a hybrid setup with on-prem Spark or a third-party warehouse, the answer is OpenLineage to the GCP endpoint, not building a custom collector. Custom collectors are a trap answer.
If a question mentions column-level access control plus traceability, the answer pairs Dataplex Universal Catalog policy tags with Data Lineage API. Either alone is incomplete.
For Vertex AI scenarios where a regulator asks for training data provenance, the answer is the lineage graph that connects Vertex AI Model Registry to BigQuery and Cloud Storage via Vertex AI Pipelines. Manual model cards are a partial answer at best.
If a question asks about lineage for federated queries (EXTERNAL_QUERY to Cloud SQL or Spanner), the correct answer is to materialize the data into BigQuery first via a managed pipeline, because federation breaks lineage capture.
Watch for the BigQuery wildcard scenario. Questions that mention thousands of input edges in the lineage graph are testing whether you know wildcard expansion creates one edge per matched table. The fix is to query at the parent dataset level or use a partitioned table instead of date-sharded tables.
Cloud Composer scenarios that test "how do I add lineage to my custom Python operator" should point you at the OpenLineage Airflow provider, not the Composer-specific Lineage backend (which is the legacy approach and is being phased out).
Dataplex Universal Catalog is the rebranded and merged product that replaces Data Catalog plus the original Dataplex governance plane. PDE questions written before the rename will say "Data Catalog" and questions written after will say "Dataplex Universal Catalog." Both refer to the same lineage and policy tag system today. Do not pick "Data Catalog API" as a distinct answer when "Dataplex Universal Catalog" is also offered; the latter is the current name. See https://cloud.google.com/dataplex/docs/introduction
Frequently Asked Questions (FAQ)
How does BigQuery automatically capture lineage without my code changes?
The Data Lineage API hooks into the BigQuery job submission path. When the API is enabled on the project that submits the job, every successful job emits a Lineage Event whose inputs are the tables the planner read and whose outputs are the destination table. There is nothing for you to instrument. The same applies to scheduled queries, BigQuery ML training jobs, and load jobs. The only requirement is API enablement and the appropriate IAM role (roles/datalineage.viewer) on the consumer side.
What is the difference between upstream and downstream impact analysis?
Upstream answers "where did this data come from." You start at a table and traverse backward through the Process graph to find the sources, transformations, and ingestion jobs that produced it. This is the regulator-question pattern: how did this number get here. Downstream answers "what depends on this data." You start at a table and traverse forward to find every job, view, model, and dashboard that consumes it. This is the change-impact pattern: what breaks if I touch this. The Data Lineage API exposes both via the same SearchLinks RPC: set the asset as the target to look upstream, or as the source to look downstream.
Do I need OpenLineage if I only use GCP services?
No. BigQuery, Dataflow (SDK 2.50+), Dataproc with the Spark OpenLineage listener, Dataform, Cloud Composer (with the OpenLineage provider), and Vertex AI Pipelines emit lineage natively to the Data Lineage API. OpenLineage becomes necessary only when you have non-GCP workloads (on-prem Spark, AWS Glue, Snowflake, dbt Core outside Dataform, third-party orchestrators) that you want to appear in the same graph. For those, you point the existing OpenLineage emitter at the GCP endpoint and the events flow into the same Data Lineage API store.
How long is lineage data retained?
Lineage events are retained for 30 days by default in the active store, after which they age out. For longer retention, export the lineage graph to BigQuery via a scheduled job and query it from there. Many regulated customers keep a rolling 7-year archive in a BigQuery dataset partitioned by event date, which is cheap because the data is small relative to typical warehouse volumes. The 30-day window is intentional: it keeps the live graph fast for interactive queries while pushing historical analysis to BigQuery where it belongs.
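A scheduled export needs to know which event dates are still inside the window and which to prioritize. The sketch below assumes the 30-day figure stated above and treats an event as expired once 30 full days have passed. The function names and sample dates are hypothetical.

```python
from datetime import date
from typing import Iterable, List, Set

RETENTION_DAYS = 30  # window stated in the text above


def days_left(event_date: date, today: date) -> int:
    """Days remaining before events from event_date age out."""
    return RETENTION_DAYS - (today - event_date).days


def partitions_to_export(
    event_dates: Iterable[date], today: date, already_exported: Set[date]
) -> List[date]:
    """Unexported, unexpired event dates, most urgent (oldest) first."""
    pending = {
        d for d in event_dates
        if d not in already_exported and days_left(d, today) > 0
    }
    return sorted(pending, key=lambda d: days_left(d, today))


today = date(2024, 3, 31)
dates = [date(2024, 3, 30), date(2024, 3, 5), date(2024, 2, 20)]
print(partitions_to_export(dates, today, already_exported=set()))
```

With these inputs the February date is already outside the window and is skipped, while the early March date comes first because it has only a few days left.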
Can policy tags themselves be tracked through lineage?
Indirectly. Policy tags are attached to columns, not events. The lineage graph tells you which downstream tables and views read a tagged column, and from there you can infer policy tag exposure. Dataplex Universal Catalog also exposes a separate audit log of tag attachment and detachment events via Cloud Audit Logs, so you can answer "when did this column gain its PII tag and who applied it" by joining audit logs with the lineage graph. Combining the two is the standard pattern for SOC 2 evidence packets.
Does lineage capture transformations inside a Dataflow job?
It captures the inputs and outputs of the job (which sources it read, which sinks it wrote) but not intra-job transformations. The DAG of ParDo and GroupByKey operations is internal to the Beam runner. If you need finer granularity, use Dataform for SQL transformations (every action is a separate lineage node) or split a monolithic Dataflow pipeline into smaller pipelines that each materialize an intermediate result. The trade-off is graph richness versus pipeline complexity.
How does Vertex AI integrate with the lineage graph?
Vertex AI Pipelines emit lineage for every component that reads or writes a tracked artifact (BigQuery table, Cloud Storage object, Vertex Feature Store entity, Vertex AI Model Registry entry). Models in the Model Registry appear as first-class nodes with FQNs like vertexai:projects/.../models/model-id. Batch prediction outputs are tracked as their own nodes. The result is an end-to-end graph from raw ingestion through training data, feature engineering, model artifact, and prediction output. This is what makes Vertex AI lineage acceptable for regulated ML use cases like credit scoring and insurance underwriting.
Related Topics
- Data Sovereignty and Compliance Design — region pinning and audit obligations that lineage evidence supports
- PII De-identification with DLP — tagging and masking patterns that pair with policy tags
- Storage Security and IAM Best Practices — the IAM model that gates the Data Lineage API itself
- BigQuery Data Modeling and Clustering — how table design choices affect lineage graph shape
Further Reading
- About data lineage in Dataplex Universal Catalog — official conceptual overview
- Get lineage information for BigQuery — the BigQuery-specific behavior, limits, and IAM model
- Data Lineage API reference — the REST endpoints for SearchLinks, ProcessOpenLineageRunEvent, and friends
- OpenLineage integration with Google Cloud — how to point OpenLineage emitters at the GCP endpoint
- Dataplex Universal Catalog overview — the unified catalog and governance product page