Vertex AI Pipelines and Orchestration

Introduction to Vertex AI Pipelines and Orchestration

Vertex AI Pipelines is Google Cloud's serverless runner for machine learning workflows defined with the Kubeflow Pipelines (KFP) v2 SDK or the TFX SDK. You write your ML steps as Python components, compile the graph into a JSON or YAML specification, and submit a pipeline job. Google manages the underlying compute, the metadata database, and the artifact lineage graph. You pay per pipeline execution plus the resources each component consumes.

Most data engineering teams adopt Vertex AI Pipelines once their ML projects outgrow notebooks. A notebook is fine for one experiment. Once you need to retrain weekly, version inputs, and roll back a bad model, hand-managed scripts collapse under their own weight. Vertex AI Pipelines gives you a single place to define, run, schedule, and audit those workflows, and it integrates natively with Model Registry, Feature Store, and Model Monitoring without you wiring REST calls by hand.

Vertex AI Pipelines uses Kubeflow Pipelines v2 as its primary SDK. KFP v1 is deprecated. The v2 SDK introduces typed artifacts, stronger component contracts, and the Dataset, Model, Metrics, and ClassificationMetrics types. Always start new projects on v2. See https://cloud.google.com/vertex-ai/docs/pipelines/migrate-kfp-v1-to-v2

Plain English Explanation

Vertex AI Pipelines as a restaurant kitchen ticket system

Walk into a busy restaurant kitchen. The expediter at the pass holds a ticket: "table 12, two steaks medium-rare, one risotto, fries on the side." That ticket lists the dishes, who cooks each one, and the order they hit the pass. The grill chef and the sauté chef can work in parallel, but the fries cook only after the oil is hot, and the plates leave together so the table is served at once. If a steak burns, the expediter calls for a re-fire without redoing the risotto.

A pipeline definition is the ticket. Each component is a station — grill, sauté, fryer. The Vertex AI Pipelines runner is the expediter who looks at the ticket, schedules each station, parallelises what it can, blocks what depends on prior steps, and re-runs only what failed. Cached components are like a cold appetiser already on the pass: if the order matches a previous one, send it straight out instead of cooking again.

Components as Lego bricks with stud-and-hole contracts

A KFP v2 component is a Lego brick with specific studs on top and holes on the bottom. The studs are the outputs (a Model, a Dataset, a float), the holes are the inputs. You can only snap two bricks together when the studs match the holes. The compiler refuses to ship a pipeline where you try to plug a str output into a Dataset input.

This is why typed artifacts matter. Instead of passing file paths around as strings, you pass Output[Model] and Input[Dataset]. The runner handles where the file actually lives in Cloud Storage, signs URLs, and records the lineage edge in the Vertex ML Metadata graph. You stop debugging "wrong path" errors and start debugging actual model logic.

Caching as an open-book exam with a sticky-note answer key

Imagine an open-book exam where you can keep a stack of sticky notes from previous attempts. If the question is identical to one you already solved — same wording, same allowed references, same instructions — you peel off the old sticky note and submit the answer instantly. If even one word changes, you have to solve from scratch.

Pipeline caching works the same way. Vertex AI Pipelines hashes the component spec, the input parameter values, and the input artifact identifiers. If a previous successful execution in the same project matches that hash, the runner skips the work and reuses the prior outputs. Change the component image tag, change a parameter, change an upstream artifact, and the cache invalidates. This is why immutable image tags are critical: tagging your component image as :latest will silently break caching because the digest changes without the tag changing.

Pipeline run: a single execution of a compiled pipeline. Each pipeline run is a PipelineJob resource with a unique resource name, a state (PIPELINE_STATE_RUNNING, PIPELINE_STATE_SUCCEEDED, PIPELINE_STATE_FAILED, or PIPELINE_STATE_CANCELLED), and an associated PipelineRunContext in Vertex ML Metadata that anchors all artifacts and executions produced by the run. https://cloud.google.com/vertex-ai/docs/pipelines/run-pipeline

Core Concepts of Vertex AI Pipelines and Orchestration

Components and their flavours

A component is the smallest unit of work in a Vertex AI Pipelines graph. KFP v2 supports three flavours:

Lightweight Python components — defined with the @component decorator. The decorator inspects the function signature, builds a container spec, and compiles the function body to run inside a base Python image. Best for quick logic.
Containerized Python components — same decorator but built into your own image with kfp component build. Best when you have package dependencies that take too long to install on every cold start.
Container components — pure container specs where you bring your own image and define the command. Use this when the workload is not Python at all (a Spark submit, a Java jar, a custom CUDA binary).

Each component declares typed inputs and outputs. Parameters are small JSON-serialisable values (int, float, str, bool, list, dict). Artifacts are larger data structures with backing files in Cloud Storage. Artifact types include Dataset, Model, Metrics, ClassificationMetrics, SlicedClassificationMetrics, HTML, and Markdown.

Pipelines and DAG structure

A pipeline is a Python function decorated with @dsl.pipeline. Inside, you call components and chain their outputs to subsequent inputs. The compiler walks those calls and builds a directed acyclic graph. There are no for-each loops in the classical Python sense at compile time. Instead you use dsl.ParallelFor to fan out work over a list that may be known only at runtime, and dsl.Collected to fan back in.

The compiled artifact is an Intermediate Representation (IR) JSON file (filename usually pipeline.json or pipeline.yaml). This IR is what you submit to the Vertex AI Pipelines API. The IR is portable: the same IR can run on any KFP v2 backend, including OSS Kubeflow on a self-managed cluster.

Conditional execution and dynamic control flow

KFP v2 ships the following control-flow primitives:

dsl.If(condition) — runs a branch only when a runtime condition is true. Common pattern: only deploy if the new model's accuracy beats the champion's.
dsl.Elif(condition) and dsl.Else() — added in KFP 2.3 to express full if/elif/else without nested If blocks.
dsl.ParallelFor(items) — fan-out over a list output from an upstream component, useful for hyperparameter sweeps or per-region training.
dsl.ExitHandler(exit_task) — guarantee a cleanup component runs whether the pipeline succeeded or failed. Use this for Slack notifications and BigQuery audit row inserts.

Wrap deployment steps in dsl.If(eval_metrics.outputs["accuracy"] > 0.85). This single guard saves you from pushing a regressed model to production after a bad data day. The compared value has to be a parameter output of the evaluation component (for example a float named accuracy), not a Metrics artifact. Because the gate consumes that output, the deploy steps depend on the evaluation task in the compiled DAG. https://cloud.google.com/vertex-ai/docs/pipelines/control-flow

Artifacts and Vertex ML Metadata

Every artifact a component produces is registered as a node in the Vertex ML Metadata graph. The runner records which execution produced each artifact and which executions consumed it. This produces a free lineage trail: given a model in production, you can walk the graph backwards to the exact training dataset, preprocessing parameters, and source pipeline run.

The metadata store is namespaced per region. Pipelines submitted to us-central1 write to the us-central1 metadata store. Cross-region lineage queries are not supported, which is something to plan around if you run multi-region MLOps.
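The backwards walk through the lineage graph can be pictured with a toy in-memory model. This is only a sketch of the idea, not the Vertex ML Metadata API: the PRODUCED_BY mapping, the artifact names, and the upstream_trail helper are all hypothetical.

```python
from collections import deque

# Hypothetical lineage records: artifact -> (producing execution, input artifacts).
PRODUCED_BY = {
    "model:rf-v3": ("train", ["dataset:clean"]),
    "dataset:clean": ("preprocess", ["dataset:raw"]),
    "dataset:raw": ("ingest", []),
}


def upstream_trail(artifact: str) -> list:
    """Walk backwards from an artifact, listing executions and inputs nearest first."""
    trail = []
    seen = set()
    queue = deque([artifact])
    while queue:
        current = queue.popleft()
        if current in seen or current not in PRODUCED_BY:
            continue
        seen.add(current)
        execution, inputs = PRODUCED_BY[current]
        trail.append(execution)   # the step that produced this artifact
        trail.extend(inputs)      # the artifacts that step consumed
        queue.extend(inputs)      # keep walking further upstream
    return trail


print(upstream_trail("model:rf-v3"))
```

Starting from the production model, the trail reaches the training step, the cleaned dataset, the preprocessing step, and finally the raw ingest, which is the kind of audit path the real metadata graph gives you.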

Architecture and Design Patterns

The standard training-evaluation-deployment pattern

The canonical Vertex AI Pipelines layout has six stages:

Ingest — pull raw data from BigQuery, Cloud Storage, or a Pub/Sub backfill into a Dataset artifact.
Validate — run TFDV or a custom schema check to fail fast on bad data.
Transform — apply feature engineering, optionally writing features to Vertex AI Feature Store for online serving.
Train — submit a CustomTrainingJobOp (a Google Cloud Pipeline Component) and emit a Model artifact.
Evaluate — score the candidate model against a holdout set and a baseline; emit ClassificationMetrics.
Deploy (conditional) — if metrics pass, register in Model Registry and deploy to an Endpoint.

Each stage produces typed artifacts that the next stage consumes. The lineage graph that results is your audit document for any compliance review.

The champion-challenger continuous training pattern

For continuous training (CT), you augment the standard pattern with a comparison gate. The new model is the challenger; the model currently aliased as prod in the Model Registry is the champion. The pipeline pulls the champion's metrics from the registry, runs the challenger against the same evaluation set, and only registers the challenger as the new champion if it wins on the chosen metric. This gate prevents silent regressions during automated retraining cycles.
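The comparison gate itself is small. A minimal sketch of the decision logic, assuming both models were scored on the same evaluation set and that higher is better for the chosen metric; EvalResult, should_promote, and the min_gain margin are illustrative names, not part of any Vertex AI SDK:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Metric values for one model on the shared evaluation set."""
    model_name: str
    metrics: dict


def should_promote(challenger: EvalResult, champion: EvalResult,
                   metric: str = "accuracy", min_gain: float = 0.0) -> bool:
    """Return True only if the challenger beats the champion by more than min_gain."""
    return challenger.metrics[metric] - champion.metrics[metric] > min_gain


# Hypothetical scores for illustration only.
champion = EvalResult("recsys-v7", {"accuracy": 0.91})
challenger = EvalResult("recsys-v8", {"accuracy": 0.93})

print(should_promote(challenger, champion))                 # challenger wins outright
print(should_promote(challenger, champion, min_gain=0.05))  # fails a stricter margin
```

A non-zero min_gain is one way to avoid swapping champions on noise-level differences; whether you need it depends on how stable your evaluation metric is from run to run.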

The fan-out hyperparameter sweep pattern

Use dsl.ParallelFor over a list of hyperparameter dicts to launch parallel training jobs. Each iteration emits a Model and Metrics artifact. A downstream dsl.Collected consumer ranks them and selects the winner. Vertex AI Vizier is a more sophisticated alternative when you need Bayesian optimisation, but for grid or random search, ParallelFor is enough and keeps everything inside a single pipeline run.
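For a grid search, the list of hyperparameter dicts can be generated rather than typed by hand. A small sketch using only the standard library; build_grid and the search space values are hypothetical, and the resulting list is what you would hand to dsl.ParallelFor as its items:

```python
from itertools import product


def build_grid(search_space: dict) -> list:
    """Expand {name: [values]} into one dict per hyperparameter combination."""
    names = list(search_space)
    combos = product(*(search_space[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical search space: 2 x 3 = 6 training runs.
search_space = {"n_estimators": [100, 300], "max_depth": [4, 8, 16]}
grid = build_grid(search_space)

for params in grid:
    print(params)
```

Grid size grows multiplicatively with every added hyperparameter, so it is worth printing len(grid) before submitting the pipeline.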

The maximum number of concurrent tasks in a single pipeline run is bounded by the project's Compute Engine and Vertex AI quotas, not by the SDK. Before launching a 200-way ParallelFor, check your custom-model-training-cpus and GPU quotas in us-central1 (or your chosen region). https://cloud.google.com/vertex-ai/docs/quotas

GCP Service Deep Dive

The KFP v2 SDK in practice

A minimal Vertex AI Pipelines workflow looks like this:

from kfp import dsl, compiler
from google.cloud import aiplatform

# gcsfs lets pandas read gs:// URIs; pyarrow backs the Parquet writer.
@dsl.component(
    base_image="python:3.11",
    packages_to_install=["pandas==2.2.0", "pyarrow", "gcsfs"],
)
def preprocess(input_uri: str, output_data: dsl.Output[dsl.Dataset]):
    import pandas as pd
    df = pd.read_csv(input_uri)
    # Drop incomplete rows and write to the artifact path the runner assigns.
    df.dropna().to_parquet(output_data.path)

# pandas and pyarrow are needed to read the Parquet artifact; joblib ships with scikit-learn.
@dsl.component(
    base_image="python:3.11",
    packages_to_install=["scikit-learn==1.4.0", "pandas==2.2.0", "pyarrow"],
)
def train(data: dsl.Input[dsl.Dataset], model: dsl.Output[dsl.Model]) -> float:
    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    df = pd.read_parquet(data.path)
    features, labels = df.drop("y", axis=1), df["y"]
    clf = RandomForestClassifier().fit(features, labels)
    joblib.dump(clf, model.path)
    # Training-set accuracy, returned as a parameter output.
    return clf.score(features, labels)

@dsl.pipeline(name="rf-train")
def pipeline(input_uri: str):
    p = preprocess(input_uri=input_uri)
    train(data=p.outputs["output_data"])

# Compile once, then submit the compiled spec as a pipeline job.
compiler.Compiler().compile(pipeline, "pipeline.json")
aiplatform.PipelineJob(
    display_name="rf-train",
    template_path="pipeline.json",
    parameter_values={"input_uri": "gs://my-bucket/data.csv"},
    pipeline_root="gs://my-pipeline-root",
).submit(service_account="[email protected]")

pipeline_root is the Cloud Storage prefix where artifacts are written. service_account is the identity each component runs as. Set both explicitly in production.

Google Cloud Pipeline Components (GCPC)

Most teams do not write training code from scratch. They use Google Cloud Pipeline Components — a library of pre-built components that wrap Vertex AI APIs. The most useful ones for the PDE exam:

CustomTrainingJobOp — submits a Vertex AI custom training job and returns a Model artifact.
AutoMLTabularTrainingJobRunOp — trains an AutoML tabular model end-to-end.
ModelUploadOp — registers a Model artifact into the Model Registry.
ModelDeployOp and EndpointCreateOp — deploy to an online prediction endpoint.
ModelBatchPredictOp — submits a batch prediction job.
BigqueryQueryJobOp — runs a BQ query as a pipeline step, useful for feature extraction.

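These components are versioned independently of the KFP SDK, so pin the google-cloud-pipeline-components package version in your requirements and import from the stable v1 namespace, for example: from google_cloud_pipeline_components.v1.custom_job import CustomTrainingJobOp.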

Caching mechanics in detail

Cache hits in Vertex AI Pipelines depend on a hash of:

The fully resolved container image digest (not the tag).
The component name and signature.
The literal values of every parameter input.
The URIs of every artifact input.
The names and types of every output.

Cache scope is per-project and per-location. A cache entry from us-central1 does not satisfy a lookup in europe-west4. The cache lifetime is unlimited — entries persist until you explicitly delete them via the metadata API.
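To make the hashing idea concrete, here is a toy cache key built from the ingredients listed above. It is only a mental model: the service's real key format is not documented here, and cache_key plus every value passed to it are hypothetical.

```python
import hashlib
import json


def cache_key(image_digest: str, component_name: str, parameters: dict,
              input_artifact_uris: list, output_types: dict) -> str:
    """Hash the cache-relevant fields into one hex string (illustrative only)."""
    payload = {
        "image": image_digest,
        "component": component_name,
        "parameters": parameters,
        "inputs": input_artifact_uris,
        "outputs": output_types,
    }
    # sort_keys gives a canonical form, so dict ordering never changes the key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = {
    "image_digest": "sha256:abc",
    "component_name": "train",
    "parameters": {"lr": 0.1, "epochs": 5},
    "input_artifact_uris": ["gs://my-bucket/data.parquet"],
    "output_types": {"model": "Model"},
}

same_key = cache_key(**base) == cache_key(**{**base, "parameters": {"epochs": 5, "lr": 0.1}})
changed_key = cache_key(**base) != cache_key(**{**base, "parameters": {"lr": 0.2, "epochs": 5}})
print(same_key, changed_key)
```

Reordering the parameters leaves the key unchanged, while changing a single value produces a different key and therefore a cache miss.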

You can disable caching at three levels: per task with task.set_caching_options(enable_caching=False), per pipeline submission with enable_caching=False on PipelineJob, or per execution by changing any input. The most common production pattern is to leave caching on by default and disable it only on the deployment step, since you usually want the deploy to actually run even when nothing changed.

Tagging your component image as python:3.11 instead of pinning to a digest like python:3.11@sha256:abc... will cause confusing cache misses. Docker Hub may push a new build of python:3.11 overnight, the digest changes, and your "unchanged" pipeline reruns every step. Pin digests in production. https://cloud.google.com/vertex-ai/docs/pipelines/configure-caching

Scheduling pipeline runs

Vertex AI Pipelines has a native Scheduler API, introduced to replace the older Cloud Scheduler + Cloud Functions pattern. You create a Schedule resource that holds the pipeline template, parameter values, cron expression, and concurrency policy.

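from google.cloud import aiplatform

# job is an unsubmitted aiplatform.PipelineJob, built as in the SDK example above.
aiplatform.PipelineJobSchedule(
    pipeline_job=job,
    display_name="weekly-retrain",
).create(
    cron="TZ=America/Los_Angeles 0 2 * * MON",  # Mondays at 02:00 Pacific
    max_concurrent_run_count=1,                 # never overlap runs
    max_run_count=52,                           # stop after one year of weekly runs
)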

max_concurrent_run_count=1 is the safe default: it prevents a new Monday run from starting while the previous week's run is still in progress. For event-driven triggers (a file arrives in a bucket), Eventarc or Cloud Functions calling PipelineJob.submit() is still the right pattern, because the Scheduler API is cron-only.
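The event-driven path can be sketched as a small handler. This assumes the Cloud Storage event payload carries bucket and name fields and that the pipeline takes an input_uri parameter as in the earlier example; the function names and the display name are hypothetical, and error handling is left out.

```python
def parameter_values_from_event(event_data: dict) -> dict:
    """Map a Cloud Storage object event payload to pipeline parameters."""
    return {"input_uri": f"gs://{event_data['bucket']}/{event_data['name']}"}


def submit_retrain(event_data: dict, template_path: str,
                   pipeline_root: str, service_account: str) -> None:
    """Submit one pipeline run for the object that just arrived."""
    # Imported here so the pure helper above can be tested without the SDK installed.
    from google.cloud import aiplatform

    job = aiplatform.PipelineJob(
        display_name="rf-train-on-arrival",  # hypothetical display name
        template_path=template_path,
        parameter_values=parameter_values_from_event(event_data),
        pipeline_root=pipeline_root,
    )
    job.submit(service_account=service_account)


# Example payload shape for a newly finalised object.
print(parameter_values_from_event({"bucket": "my-bucket", "name": "incoming/data.csv"}))
```

Keeping the payload-to-parameters mapping in its own function makes the trigger easy to unit test; only submit_retrain needs real credentials.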

Running TFX pipelines on Vertex

TFX pipelines are written with the TFX SDK and compiled with the KubeflowV2DagRunner. The output is the same IR JSON that the KFP SDK produces, so Vertex AI Pipelines runs both. TFX gives you opinionated components — ExampleGen, StatisticsGen, SchemaGen, Transform, Trainer, Evaluator, Pusher — that handle TensorFlow-specific concerns like TFRecord I/O and TensorFlow Transform graph baking.

Pick TFX when your team is deep in TensorFlow and you want strong data validation out of the box. Pick KFP v2 directly when your stack is mixed (PyTorch, scikit-learn, XGBoost) or when you want the lighter dependency footprint.

IAM and the runtime service account

Each pipeline job runs as a service account that you specify on submission. If you do not specify one, the Compute Engine default service account is used — which is almost never what you want in production.

The runtime service account needs:

roles/aiplatform.user on the project — to call Vertex AI APIs.
roles/storage.objectAdmin on the pipeline root bucket — to read and write artifacts.
roles/iam.serviceAccountUser on the runtime service account: this one is granted to the user or build identity that submits the pipeline, not to the runtime account itself. Without it the submission fails with a permission error.
Service-specific roles for any GCP service the components touch — roles/bigquery.dataEditor for BQ writes, roles/secretmanager.secretAccessor for credentials, etc.

The submitter (the user or CI service account that calls PipelineJob.submit()) needs roles/aiplatform.user and roles/iam.serviceAccountUser on the runtime SA. This two-identity model is a frequent source of permission errors during the first pipeline run from a new CI/CD setup.

Integration with Model Registry, Feature Store, and Model Monitoring

The Model Registry is where pipelines persist trained models for downstream serving. After training, call ModelUploadOp (or aiplatform.Model.upload from inside a custom component). Pass parent_model to add a new version to an existing model resource, and use version_aliases=["challenger"] or ["default"] to mark the version's role. Endpoints can deploy by alias, so promoting a challenger to champion is a single alias swap rather than an endpoint redeployment.

Vertex AI Feature Store integrates two ways. In the legacy "Feature Store (Legacy)" model, pipelines call FeaturestoreOnlineServingServiceClient to fetch features at training time; in the BigQuery-backed Feature Store, the pipeline reads directly from BQ feature views. Either way, the pipeline records the feature set version as part of the lineage graph, which is essential for reproducibility — you must be able to retrain with exactly the feature values that produced the original model.

Model Monitoring connects on the serving side. Once a model from your pipeline is deployed to an Endpoint, enable Model Monitoring with the training data as the baseline. When monitoring detects drift or skew above your threshold, it publishes to Pub/Sub. A Cloud Function subscribed to that topic calls PipelineJob.submit() to launch a retraining run with the most recent data. This is the closed-loop CT (continuous training) pattern.

Vertex AI Pipelines vs Cloud Composer vs Cloud Workflows

Choosing the right orchestrator is a recurring PDE exam topic. Here is how the three official Google Cloud orchestrators compare for ML workflows specifically:

Vertex AI Pipelines is purpose-built for ML. It has typed ML artifacts, a lineage graph aware of models and datasets, deep integration with Vertex services, and serverless execution. The cost model is per-execution plus per-component compute. There is no always-on cluster. Use it when the workflow is fundamentally about training, evaluating, or scoring models.

Cloud Composer is managed Apache Airflow. It runs an always-on GKE cluster (Composer 2 uses Autopilot), so there is a steady hourly bill regardless of usage. Airflow excels at heterogeneous data engineering DAGs that touch dozens of services on complex schedules: nightly ETL across BigQuery, Dataflow, and on-prem SFTP, with backfills, SLAs, and operator parameter sweeps. Composer can trigger Vertex AI Pipelines through the Vertex AI operators in the Airflow Google provider package, which is a common hybrid pattern: Composer schedules the day's data work, then triggers a Vertex AI Pipeline for the ML portion.

Cloud Workflows is a serverless YAML/JSON orchestrator built for short-lived API call chaining. It excels at glue: receive a Pub/Sub event, call three REST APIs in sequence with retries, branch on the response. It is not designed for long-running ML jobs (executions are limited to 1 year but each step has timeouts that make it impractical for hour-long training jobs). Use Workflows to coordinate the outside of an ML system — accept a webhook, validate it, then submit a Vertex AI Pipeline run.

The decision tree:

Workflow is mostly model training, evaluation, deployment? Vertex AI Pipelines.
Workflow is broad data engineering with ML as one stage among many, runs on a complex cron? Cloud Composer.
Workflow is API glue, short tasks, event-driven? Cloud Workflows.

Cost shape matters as much as feature fit. Cloud Composer 2 has a baseline of roughly $300–$400/month for a small environment even when idle. Vertex AI Pipelines has zero baseline cost — you pay only when a pipeline runs. For teams running ML workflows a few times per day, Vertex AI Pipelines is dramatically cheaper. For teams running 200 DAG types every 15 minutes, Composer's amortised cost is competitive. https://cloud.google.com/composer/pricing

Common Pitfalls and Trade-offs

Mutable image tags break caching silently. Mentioned above, but worth repeating because it is the single most common source of "why is my pipeline slow today" tickets. Pin to digests or to immutable semver tags published by your own CI.

Pipeline root bucket location must match the pipeline location. Submitting a pipeline to us-central1 with a pipeline_root in europe-west4 causes silent cross-region egress charges and slow component startup. Co-locate them, and use a single-region bucket.

Default service account everywhere. Submitting without specifying service_account uses the Compute Engine default SA, which has Editor-equivalent permissions in many projects. This is fine in a sandbox and dangerous in production. Always create a dedicated pipelines-runtime-sa per environment.

Forgetting enable_caching=False on non-deterministic components. If your component reads "today's data" via CURRENT_DATE() in a BQ query, the component itself looks identical across runs — same code, same parameters — but the underlying data has changed. The cache will return yesterday's output. Either disable caching on that component or pass the date as an explicit parameter.
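Passing the date explicitly can be as simple as building the query from a parameter. A minimal sketch, assuming a trusted table name and an event_date column; build_extract_query and the table are hypothetical:

```python
from datetime import date


def build_extract_query(table: str, run_date: date) -> str:
    """Build a query whose text changes whenever the run date changes."""
    # The date appears literally in the query, so it becomes part of the
    # component's inputs instead of being hidden behind CURRENT_DATE().
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE event_date = DATE '{run_date.isoformat()}'"
    )


monday = build_extract_query("my-project.events.clicks", date(2024, 5, 6))
tuesday = build_extract_query("my-project.events.clicks", date(2024, 5, 7))
print(monday)
print(monday != tuesday)
```

Because the two queries differ as strings, a component that receives the query (or the date) as a parameter sees different inputs on different days and cannot be served a stale cached result.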

Over-using ParallelFor without quota planning. ParallelFor happily fans out 500 ways. Each iteration consumes Compute Engine quota. Hitting a quota mid-run leaves your pipeline half-done in a confusing state. Right-size with parallelism=10 to cap concurrent iterations.
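The quota arithmetic is worth doing before you submit. A rough worst-case sketch, assuming every concurrent iteration launches one training job on a 16-vCPU machine such as n1-highmem-16; peak_vcpus is a hypothetical helper, not an SDK function:

```python
from typing import Optional


def peak_vcpus(iterations: int, vcpus_per_job: int,
               parallelism: Optional[int] = None) -> int:
    """Worst-case vCPUs in use at once for a fan-out of training jobs."""
    concurrent = iterations if parallelism is None else min(iterations, parallelism)
    return concurrent * vcpus_per_job


print(peak_vcpus(500, 16))                  # uncapped fan-out: 8000
print(peak_vcpus(500, 16, parallelism=10))  # capped at 10 concurrent: 160
```

Compare the result with the training CPU quota in your region; if the uncapped number exceeds it, set parallelism on the ParallelFor rather than relying on retries.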

Component cold starts on every invocation. Lightweight Python components install packages_to_install at the start of every component execution. On a 30-step pipeline, those pip install calls add up to many minutes of wall time. For frequently-run pipelines, build a custom base image with dependencies baked in.

Do not store secrets as pipeline parameters. Parameters are visible in the pipeline UI, in Cloud Logging, and in the metadata store forever. Fetch secrets from Secret Manager inside the component using the runtime service account's roles/secretmanager.secretAccessor permission. https://cloud.google.com/secret-manager/docs/access-control
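Fetching the secret inside the component might look like the sketch below. It assumes the google-cloud-secret-manager client library is installed in the component image; the helper names, project ID, and secret ID are hypothetical.

```python
def secret_version_name(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Build the full resource name Secret Manager expects."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


def fetch_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Read a secret value using the runtime service account's credentials."""
    # Imported here so the name helper can be used without the client library.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(
        name=secret_version_name(project_id, secret_id, version)
    )
    return response.payload.data.decode("utf-8")


print(secret_version_name("my-project", "tracking-api-token"))
```

Pass only the secret's ID as a pipeline parameter if you need to vary it per environment; the value itself never appears in the pipeline spec, the UI, or the metadata store.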

Best Practices

Pin everything. Pin the KFP SDK version, the GCPC version, and every component image to an immutable tag or digest. This is the foundation of reproducible pipelines.
One service account per environment. dev-pipelines-sa, staging-pipelines-sa, prod-pipelines-sa. Grant minimum permissions, never use the Compute default.
Separate the pipeline definition from its parameters. Compile once into pipeline.json, parameterise per environment when calling submit(). This lets the same artifact ship from dev to prod.
Always emit Metrics artifacts from your evaluation step. They show up in the pipeline UI as comparable tables. Also return the gating metric as a parameter output so that a downstream dsl.If gate can compare it.
Use ExitHandler for cleanup and notifications. A Slack ping on failure with a deep link to the pipeline UI cuts MTTR significantly.
Schedule via the Scheduler API, not Cloud Scheduler + Cloud Function. The native scheduler handles concurrency and gives you a single resource to monitor.
Deploy via aliases, not endpoint reconfiguration. version_aliases=["default"] plus serving_container_image_uri swap is faster, safer, and trivially revertible.
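The "compile once, parameterise per environment" practice from the list above can be sketched as a lookup table plus one helper. The environment names follow the service-account naming used here; the bucket names, the ENVIRONMENTS mapping, and job_kwargs are hypothetical.

```python
# Hypothetical per-environment settings; only these values differ between runs.
ENVIRONMENTS = {
    "dev": {
        "pipeline_root": "gs://dev-pipeline-root",
        "input_uri": "gs://dev-data/sample.csv",
    },
    "prod": {
        "pipeline_root": "gs://prod-pipeline-root",
        "input_uri": "gs://prod-data/full.csv",
    },
}


def job_kwargs(env: str, template_path: str = "pipeline.json") -> dict:
    """Build PipelineJob keyword arguments for one environment."""
    cfg = ENVIRONMENTS[env]
    return {
        "display_name": f"rf-train-{env}",
        "template_path": template_path,  # the same compiled file everywhere
        "parameter_values": {"input_uri": cfg["input_uri"]},
        "pipeline_root": cfg["pipeline_root"],
    }


print(job_kwargs("dev"))
print(job_kwargs("prod"))
```

You would then call aiplatform.PipelineJob(**job_kwargs("prod")).submit(service_account=...) with the matching runtime service account, so the compiled pipeline.json is identical across environments and only the arguments change.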

Real-World Use Case

A direct-to-consumer apparel brand with about 300 employees and 4 million monthly active users runs a daily product recommendation retraining job on Vertex AI Pipelines. Their pipeline has nine components:

BigqueryQueryJobOp extracts the previous 90 days of clickstream and purchase events.
A custom validate_data component runs TFDV against the previous day's schema and fails the run if anomalies exceed a threshold.
transform_features joins the clickstream with product metadata from BigQuery, writes the result to Cloud Storage, and updates a BigQuery feature view.
dsl.ParallelFor fans out into three training variants (collaborative filtering, two-tower, and a ranking baseline) with parallelism=3.
Each training variant produces a Model and Metrics artifact via CustomTrainingJobOp on n1-highmem-16 with two T4 GPUs.
compare_models reads the metrics, picks the best, and emits a ClassificationMetrics for the winner.
dsl.If(winner.outputs["recall_at_10"] > 0.42) gates deployment.
ModelUploadOp registers the winning model with parent_model set to the existing recommendation model and version_aliases=["challenger"].
An ExitHandler posts the run summary to Slack and inserts an audit row into BigQuery.

The pipeline is scheduled daily at 03:00 PT with max_concurrent_run_count=1. End-to-end runtime is about 90 minutes thanks to caching on the data validation step (which usually hits cache because the schema rarely changes). The pipeline runs as recsys-pipelines-prod-sa@, which has roles/aiplatform.user, roles/bigquery.dataEditor on a single dataset, roles/storage.objectAdmin on gs://recsys-pipeline-root, and roles/secretmanager.secretAccessor on a single secret holding the experiment-tracking API token.

A separate Model Monitoring job watches the production endpoint for prediction drift. When drift exceeds the configured threshold, it publishes to Pub/Sub. A Cloud Function submits an off-schedule pipeline run with enable_caching=False to force a full retrain on the latest data. This closes the loop: the pipeline trains, monitoring watches, monitoring triggers retraining when needed.

Exam Tips

The PDE exam asks orchestration questions in two flavours: "which service?" and "fix this broken pipeline." For the first, anchor on these heuristics:

ML-only workflow with model training, evaluation, deployment? Vertex AI Pipelines.
Mixed data engineering with complex schedules and dozens of operators? Cloud Composer.
Short API glue with event triggers? Cloud Workflows.
Single batch job submission on cron? Cloud Scheduler calling the service directly, no orchestrator needed.

For the "fix this" flavour, common scenarios to recognise:

"Pipeline keeps re-running unchanged steps" → mutable image tags or non-deterministic inputs invalidating cache.
"Pipeline fails with permission denied on submission" → submitter lacks roles/iam.serviceAccountUser on the runtime SA.
"Pipeline runs but cannot write artifacts" → runtime SA lacks roles/storage.objectAdmin on the pipeline root bucket.
"Need to retrain when new data arrives in GCS" → Eventarc or Cloud Function trigger calling PipelineJob.submit(), not the Scheduler API.
"Need to gate deployment on accuracy" → dsl.If reading the evaluation component's metrics output.
"Need automatic retraining when production model degrades" → Model Monitoring → Pub/Sub → Cloud Function → PipelineJob.submit().

Memorise the KFP v2 control-flow primitives: If, Elif, Else, ParallelFor, Collected, ExitHandler. Memorise the Google Cloud Pipeline Components for Vertex AI: CustomTrainingJobOp, ModelUploadOp, ModelDeployOp, EndpointCreateOp, ModelBatchPredictOp. Know that TFX pipelines compile to the same IR and run on the same runner. Know that the Scheduler API is the native cron mechanism and that max_concurrent_run_count controls overlap.

KFP v2 cheat sheet for the PDE exam: typed artifacts are Dataset, Model, Metrics, ClassificationMetrics, SlicedClassificationMetrics, HTML, Markdown. Control flow is dsl.If / dsl.Elif / dsl.Else / dsl.ParallelFor / dsl.Collected / dsl.ExitHandler. The cache key hashes the container image digest (not tag), component signature, parameter literals, and input artifact URIs — scope is per-project and per-location. The native PipelineJobSchedule accepts a cron string and max_concurrent_run_count (commonly set to 1); use Eventarc + PipelineJob.submit() for event-driven triggers instead. Lineage lives in Vertex ML Metadata, anchored on a PipelineRunContext per run. https://cloud.google.com/vertex-ai/docs/pipelines/configure-caching

On the exam, if a question mentions "lineage" or "reproducibility" alongside ML, the answer almost always involves Vertex AI Pipelines and Vertex ML Metadata, not Composer. Composer logs DAG runs but does not produce a typed artifact lineage graph. https://cloud.google.com/vertex-ai/docs/ml-metadata/introduction

Frequently Asked Questions (FAQ)

Can a Vertex AI Pipeline call another Vertex AI Pipeline?

Yes, with two patterns. The clean pattern is to use the sub-pipeline as a component: KFP v2 lets you call one @dsl.pipeline function inside another, or load a compiled pipeline with components.load_component_from_file and use it as a step. The pragmatic pattern is a custom component that calls aiplatform.PipelineJob(...).submit() directly. The first preserves lineage; the second loses lineage between the parent and child runs but is more flexible for late-binding parameters.

How do I pass large data between components?

Always as artifacts, never as parameters. Parameters are stored inline in the pipeline IR and metadata, with a hard limit on size. Artifacts live in Cloud Storage under pipeline_root and are passed as Input[Dataset] / Output[Dataset] (or Model, Metrics, etc.). The runner only passes the URI between components, so the size of the actual file is bounded only by GCS, not by the pipeline runtime.

Does Vertex AI Pipelines support VPC Service Controls?

Yes. Vertex AI Pipelines is on the supported services list for VPC Service Controls. You also need to ensure the pipeline root bucket, the Artifact Registry repository holding component images, and any other GCP services the components call are inside the same perimeter. To keep training traffic on a private network, attach the underlying custom training jobs to your VPC, for example through the network setting on the custom job.

What is the difference between pipeline_root and pipeline_job.gcs_output_directory?

pipeline_root is set at pipeline definition or job submission time and is the parent prefix for all artifacts. The runner creates a job-specific subdirectory under it (something like pipeline_root/<job-id>/) that is the actual gcs_output_directory for that run. You usually only set pipeline_root; the per-job directory is managed automatically.

How does Vertex AI Pipelines handle long-running components, like a training job that takes 12 hours?

The runner does not impose a hard timeout on individual components — the limit is the underlying execution backend's. For a CustomTrainingJobOp, the limit is the Vertex AI custom training job timeout, configurable up to 28 days. For a Python component running directly in the pipeline runner, the practical ceiling is around 7 days. For training that takes hours, always submit it via CustomTrainingJobOp rather than running training inside a @component — you get better resource scaling and fault tolerance.

Can I use Vertex AI Pipelines to orchestrate non-ML work, like a nightly BigQuery transformation?

Technically yes — you can compose a pipeline of BigqueryQueryJobOp calls and schedule it. Practically, Cloud Composer or Dataform is a better fit for pure SQL/ETL workflows. Use Vertex AI Pipelines when ML artifacts and lineage are part of the picture; use Composer when the workflow is dozens of heterogeneous data engineering tasks.

Related Topics

Feature Store and Feature Engineering: how to centralise features that your pipelines read and write.
Model Deployment and Monitoring on Vertex AI: what happens after ModelDeployOp.
Cloud Composer / Airflow Orchestration: the alternative orchestrator for general data engineering.
