Introduction to Vertex AI
For a Professional Cloud Architect, Vertex AI is not just a tool for data scientists; it is a unified platform designed to orchestrate the entire Machine Learning (ML) lifecycle. It integrates formerly disparate services (AI Platform, AutoML, etc.) into a single API and UI, enabling MLOps (Machine Learning Operations) at scale.
The goal of Vertex AI is to move ML from "experimental notebooks" to "production-grade pipelines."
Vertex AI is Google Cloud's unified machine learning platform. It provides tools for every stage of the ML workflow, from data preparation and training to deployment, monitoring, and governance. Reference: https://cloud.google.com/vertex-ai/docs/start/introduction-unified-platform
Plain-Language Explanation: Vertex AI Workflows
Vertex AI is like a professional film production studio.
Analogy 1 — The Film Studio (Unified Platform)
In the old days, you had to go to one place to write the script, another to film, and a third to edit. Vertex AI is the Integrated Film Studio. It has the writing rooms (Workbench), the sound stages (Training), the film archives (Model Registry), and the distribution network (Endpoints). Everything is in one building, sharing the same security and logistics.
Analogy 2 — The Assembly Line (Vertex AI Pipelines)
Vertex AI Pipelines are like an Automated Car Assembly Line. Instead of one person manually building a car from start to finish, you define a sequence of steps: "Attach wheels" -> "Install engine" -> "Paint." If you find a better way to paint, you just swap that one machine in the line. The pipeline ensures that every model is built exactly the same way every time, with a full record of every part used.
Analogy 3 — The Shared Tool Shed (Feature Store)
Feature Store is like a Shared Tool Shed for a neighborhood. Instead of every house buying its own lawnmower (calculating the same data features like "average customer spend"), everyone contributes their best tools to the shed. When someone needs a lawnmower, they just grab it from the shed. This ensures everyone uses the same high-quality tools and saves money on duplicates.
The ML Lifecycle on Vertex AI
Architecting an ML solution involves several distinct phases:
1. Data Preparation and Feature Engineering
- Vertex AI Workbench: Managed Jupyter notebooks for exploration. Integrated with BigQuery and Cloud Storage.
- Vertex AI Feature Store: A centralized repository to store, share, and serve ML features. It solves the problem of "training-serving skew" by ensuring that the same feature values used for training are available for real-time inference.
2. Model Training
- AutoML: For teams without deep ML expertise. You provide the data, and Google finds the best model architecture automatically.
- Custom Training: For data scientists who want full control. You provide the training code (Python, R, etc.) and a container image.
- Distributed Training: Vertex AI can run a training job across multiple GPUs or TPUs, using the worker pools and machine types you specify for the job.
3. Model Management
- Vertex AI Model Registry: A central repository to manage model versions, metadata, and lineage.
- Model Evaluation: Automated tools to visualize model performance metrics (Precision-Recall curves, Confusion Matrices).
4. Model Deployment and Serving
- Vertex AI Prediction (Online): Deploy models to Endpoints for low-latency, real-time predictions.
- Batch Prediction: For large datasets where immediate response isn't needed. More cost-effective for offline processing.
Vertex AI Core Components Deep Dive
The unified platform groups roughly twelve first-class services under a single aiplatform.googleapis.com API. Architects should know what each component owns and where the boundaries are.
Workbench (Managed Notebooks)
Vertex AI Workbench replaces the legacy AI Platform Notebooks. It offers two flavors: Managed Notebooks (Google manages the runtime, supports idle shutdown and serverless Spark) and User-Managed Notebooks (you own the Compute Engine VM, install custom drivers). Workbench instances integrate natively with BigQuery via the %%bigquery magic and can write directly to Cloud Storage. Use Workbench for exploration and prototyping, not for production training.
Pipelines (KFP / TFX orchestration)
Vertex AI Pipelines runs Kubeflow Pipelines v2 YAML or TFX definitions on a serverless backend. Each step is a containerized component. Pipelines automatically writes artifact lineage to Vertex ML Metadata.
Experiments and TensorBoard
Vertex AI Experiments tracks hyperparameters, metrics, and artifacts across runs. Vertex AI TensorBoard provides hosted TensorBoard dashboards with IAM-controlled sharing — no need to expose local TensorBoard to teammates.
Feature Store (online + offline)
Two generations exist: the legacy Feature Store (Bigtable backend, deprecated for new projects) and the new Feature Store on BigQuery (offline store = BigQuery table, online store = optimized Bigtable-backed lookup). Architects choosing today should pick the BigQuery-backed version.
Model Registry
The Vertex AI Model Registry is the source of truth for trained models. It supports version aliases (default, staging, production), Model Evaluation attachments, and one-click deployment to Endpoints. Integrates with Artifact Registry for the underlying container.
Endpoints (online prediction)
Endpoints front one or more deployed model versions and support traffic splitting (e.g., 90% v1 / 10% v2 for canary), autoscaling on CPU/GPU utilization, and Private Service Connect for VPC-only ingress.
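The canary split described above can be illustrated with a small simulation. This is a minimal sketch in plain Python, not the Vertex AI SDK: the function names pick_version and simulate are hypothetical, and the routing logic is an assumption about how a percentage split behaves, not a description of the Endpoint's internals.

```python
import random
from collections import Counter


def pick_version(traffic_split, rng):
    """Choose one deployed model version according to percentage weights."""
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    versions = list(traffic_split)
    weights = [traffic_split[v] for v in versions]
    return rng.choices(versions, weights=weights, k=1)[0]


def simulate(traffic_split, n_requests=10_000, seed=0):
    """Route n_requests and count how many land on each version."""
    rng = random.Random(seed)
    return Counter(pick_version(traffic_split, rng) for _ in range(n_requests))


# 90% of requests go to v1, roughly 10% to the canary v2.
counts = simulate({"v1": 90, "v2": 10})
print(dict(counts))
```

With a split like this, only about one request in ten exercises the new version, so a regression in v2 has a bounded blast radius while metrics accumulate.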
Vertex AI Studio for Generative AI
Vertex AI Studio is the console-based playground and SDK surface for foundation models — Gemini family (gemini-2.5-pro, gemini-2.5-flash), Imagen for images, Veo for video, and Chirp for speech. It is the GenAI counterpart to AutoML for classical ML.
Key Capabilities
- Prompt Design & Save: Iterate on prompts in the UI, save versions to a project, and export to Python/Node SDK code.
- Model Tuning: Supervised fine-tuning (SFT) on Gemini variants using a small labeled dataset stored in Cloud Storage JSONL format. (RLHF tuning was offered for earlier PaLM-era text models rather than for Gemini.) Tuning jobs produce a tuned model endpoint in the same project.
- Grounding: Attach a Vertex AI Search datastore or Google Search results so the model returns citations alongside answers. Critical for reducing hallucination in regulated industries.
- Safety Filters: Configurable thresholds for harassment, hate speech, sexually explicit, and dangerous content categories.
When to Use Studio vs Custom Training
Studio is the right tool when the base model already understands your domain and you just need prompt engineering or light tuning. Custom training (PyTorch / JAX on Vertex AI Training) is required only when you genuinely need a new architecture or pre-training from scratch — rare for application teams.
For PCA scenarios mentioning "build a chatbot," "summarize documents," or "extract entities from PDFs," the answer is almost always Vertex AI Studio + Gemini + grounding via Vertex AI Search, not custom training. Custom training is a distractor in 9 out of 10 GenAI exam questions.
Model Monitoring: Skew and Drift
Vertex AI Model Monitoring is the production-side safety net that detects when a deployed model's inputs or outputs have changed enough to harm prediction quality. It runs as a scheduled job against the prediction logs of an Endpoint and emails alerts when thresholds are breached.
Training-Serving Skew
Skew compares the statistical distribution of features at serving time against the training dataset baseline. Configure by pointing Monitoring at the training dataset URI (BigQuery table or Cloud Storage CSV/TFRecord). Trigger: when serving distribution diverges from training distribution beyond the configured L-infinity (categorical) or Jensen-Shannon (numerical) threshold.
Typical cause: a feature engineering pipeline change between training notebook and serving service (e.g., training applies log-transform but serving does not).
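To make the two distance measures concrete, here is a minimal sketch in plain Python of L-infinity distance for a categorical feature and Jensen-Shannon divergence for a binned numerical feature. It is not the Monitoring service's implementation: the helper names, the bin edges, and the sample data are all assumptions chosen for illustration. The numerical example mimics the log-transform mismatch described above.

```python
import math
from collections import Counter


def categorical_dist(values):
    """Relative frequency of each category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def l_infinity(p, q):
    """Largest absolute difference in probability across all categories."""
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def histogram(values, edges):
    """Normalized histogram; values outside the edges are clamped to the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2, so the result lies in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Categorical feature: channel mix shifts from 80/20 to 50/50.
train_cat = categorical_dist(["web"] * 8 + ["app"] * 2)
serve_cat = categorical_dist(["web"] * 5 + ["app"] * 5)
cat_skew = l_infinity(train_cat, serve_cat)

# Numerical feature: training saw log10(value), serving sends the raw value.
raw = [3, 30, 300, 3000]
edges = [0, 1, 2, 3, 4]  # hypothetical bin edges
train_num = histogram([math.log10(v) for v in raw], edges)
serve_num = histogram(raw, edges)
num_skew = js_divergence(train_num, serve_num)

THRESHOLD = 0.3  # hypothetical alerting threshold
print(cat_skew >= THRESHOLD, num_skew >= THRESHOLD)
```

The missing log-transform pushes every serving value into the last bin, which is exactly the kind of shift a distribution-distance check catches even when each individual request looks valid.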
Prediction Drift
Drift compares today's serving distribution to yesterday's (or last week's) serving distribution. No training baseline needed. Trigger: when the world changes — new product catalog, seasonal user behavior, COVID-style market shock.
Output Attribution Drift
For tabular models with Explainable AI enabled, Monitoring also tracks shifts in feature attribution rankings. If customer_age was the top driver last month and account_balance is now top, that signals a structural change worth investigating.
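The ranking comparison can be sketched in a few lines of plain Python. This is an illustrative assumption about how one might detect a rank change, not the service's algorithm; the attribution values and the tenure_months feature are made up, and the function names are hypothetical.

```python
def top_features(attributions, n=3):
    """Feature names ordered by absolute attribution, largest first."""
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [name for name, _ in ranked[:n]]


def rank_changes(baseline, current, n=3):
    """List (position, old_feature, new_feature) wherever the top-n ranking differs."""
    before = top_features(baseline, n)
    after = top_features(current, n)
    return [(pos + 1, b, a) for pos, (b, a) in enumerate(zip(before, after)) if b != a]


# Hypothetical mean attributions for last month and this month.
last_month = {"customer_age": 0.42, "account_balance": 0.31, "tenure_months": 0.12}
this_month = {"customer_age": 0.25, "account_balance": 0.47, "tenure_months": 0.11}

changes = rank_changes(last_month, this_month)
print(changes)
```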
Operational Setup
- Sampling rate: 10-100% of prediction requests written to a BigQuery log table.
- Monitoring frequency: hourly to daily windows.
- Alerts route through Cloud Logging -> Pub/Sub -> on-call.
AutoML vs Custom Training: Decision Framework
Both paths land models in the same Model Registry and serve from the same Endpoints. The choice is about who writes the model code.
AutoML (Google writes the model)
- Supported data types: tabular, image (classification/object detection), text (classification/entity extraction), video.
- Training time: typically 1-6 hours for tabular, longer for image.
- Cost model: node-hour billing for training; same Endpoint billing for serving.
- Strengths: zero ML code, automatic hyperparameter search, automatic feature engineering for tabular.
- Weaknesses: black-box architecture, limited control over loss function, no streaming/online learning.
Custom Training (You write the model)
- Container-based: provide a Docker image (or use Google's pre-built TensorFlow/PyTorch/scikit-learn containers).
- Supports single-replica, distributed multi-worker, distributed with parameter servers, and reduction servers for all-reduce GPU training.
- Hyperparameter tuning service uses Vertex AI Vizier under the hood, which defaults to Bayesian optimization.
- Strengths: full control, custom architectures, multi-modal, RL.
- Weaknesses: you debug your own OOMs, you size your own machines.
Hybrid Path
Use AutoML to get a baseline in a day. If it meets the SLA, ship it. If not, port the training pipeline to custom training using the same dataset — both paths share Vertex AI Datasets as the input contract.
A common exam trap: a scenario says "the team has no ML expertise but needs forecasting on retail sales." The correct answer is AutoML Tabular Forecasting, not BigQuery ML and not custom training. Watch for BQML distractors that mention "SQL-only" — BQML is great but does not produce a managed Endpoint with traffic splitting.
Batch vs Online Prediction Architectures
Both run from the same Model Registry artifact, but the cost and operational shape are very different.
Online Prediction (Endpoints)
- Latency target: typically 50-300 ms p95.
- Billed by node-hours for the minimum replica count plus autoscaled overflow.
- Always-on: even at zero traffic, you pay for min_replica_count.
- Supports traffic split between model versions (canary, A/B).
- Supports Private Endpoints via Private Service Connect for VPC-only access.
- Best for: synchronous user-facing predictions (recommendations, fraud scoring at checkout).
Batch Prediction (BatchPredictionJob)
- Input: BigQuery table or Cloud Storage JSONL/CSV.
- Output: BigQuery table or Cloud Storage.
- Billed per node-hour of the job's compute; scales to zero when complete.
- No SLA on latency — minutes to hours depending on data size.
- Best for: nightly scoring of an entire customer base, churn lists, ETL-style enrichment.
Decision Heuristic
If a request is triggered by a user action and the answer is needed inside that request, use Online. If predictions are consumed by a downstream pipeline or dashboard with no user waiting, use Batch and save 70-90% on cost. Many production systems run both: nightly batch for the long tail, online for active sessions.
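The cost gap comes from node-hours: an Endpoint bills for its minimum replicas around the clock, while a batch job bills only while it runs. The arithmetic below is a sketch with made-up inputs; the price, replica counts, and job duration are all hypothetical, and it assumes both paths use the same machine type so the per-node-hour price is identical.

```python
def monthly_node_hours_online(min_replica_count, hours_per_day=24, days=30):
    """Node-hours billed for an always-on endpoint at its minimum replica count."""
    return min_replica_count * hours_per_day * days


def monthly_node_hours_batch(nodes, hours_per_run, runs_per_month=30):
    """Node-hours billed for a recurring batch job that releases compute when done."""
    return nodes * hours_per_run * runs_per_month


def savings_pct(online_hours, batch_hours, price_per_node_hour):
    """Percentage saved by batch relative to online at the same node price."""
    online_cost = online_hours * price_per_node_hour
    batch_cost = batch_hours * price_per_node_hour
    return 100.0 * (online_cost - batch_cost) / online_cost


PRICE = 0.75  # hypothetical price per node-hour, not a published rate

online = monthly_node_hours_online(min_replica_count=2)    # 2 * 24 * 30 = 1440
batch = monthly_node_hours_batch(nodes=4, hours_per_run=2)  # 4 * 2 * 30 = 240
print(round(savings_pct(online, batch, PRICE), 1))          # 83.3
```

Because the same price multiplies both sides, the saving depends only on the ratio of node-hours; autoscaled overflow on the online side would widen the gap further.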
Vector Search and Vertex AI Search
These two services are often confused on the exam. They solve overlapping but distinct problems.
Vertex AI Vector Search (formerly Matching Engine)
A managed approximate nearest neighbor (ANN) index built on Google's internal ScaNN library. You bring your own embeddings (typically generated by text-embedding-004 or textembedding-gecko@003).
- Input: vectors of dimension 1-1408, in JSONL format on Cloud Storage.
- Index types: Tree-AH (approximate search, low latency at scale) and Brute Force (exact results; suited to small datasets or as a recall baseline).
- Query latency: ~5-50 ms for billions of vectors.
- Use case: building blocks for RAG, recommendation, semantic search where you control the embedding model and ranking.
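As an illustration of the JSONL input mentioned above, the sketch below serializes embeddings one JSON object per line with an id and an embedding field. The to_jsonl helper and the sample vectors are hypothetical, and the dimension check is an added safeguard rather than something the source prescribes; confirm the exact record schema against the current Vector Search documentation before building an index.

```python
import json


def to_jsonl(records):
    """Serialize (id, vector) pairs as one JSON object per line."""
    records = list(records)
    dims = {len(vector) for _, vector in records}
    if len(dims) > 1:
        raise ValueError(f"all vectors must share one dimension, got {sorted(dims)}")
    lines = [
        json.dumps({"id": str(rec_id), "embedding": [float(x) for x in vector]})
        for rec_id, vector in records
    ]
    return "\n".join(lines) + "\n"


# Tiny 3-dimensional vectors for readability; real embeddings are much wider.
payload = to_jsonl([("doc-1", [0.1, 0.2, 0.3]), ("doc-2", [0.0, 0.5, 0.9])])
print(payload, end="")
```

The resulting text would be written to a file in Cloud Storage and referenced when the index is created.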
Vertex AI Search (formerly Discovery Engine / Gen App Builder)
A higher-level managed search-as-a-service. You upload documents (PDF/HTML/Cloud Storage/BigQuery) and get a search REST API with built-in semantic ranking, summarization, and out-of-the-box grounding for Gemini.
- No embedding work required — Google handles chunking, embedding, and indexing.
- Built-in answer generation with citations.
- Use case: enterprise search, customer support chatbots, internal knowledge bases when you want the fastest path to production.
Choosing Between Them
| Need | Service |
|---|---|
| Lowest-level vector primitives | Vector Search |
| Out-of-the-box document Q&A | Vertex AI Search |
| Custom embedding model (e.g., domain-tuned) | Vector Search |
| Don't want to manage chunking/indexing | Vertex AI Search |
Agent Builder and RAG Architecture
Vertex AI Agent Builder is the umbrella product for building grounded GenAI applications on GCP. It bundles Vertex AI Search, Conversational Agents (the rebranded Dialogflow CX), and the Agent Development Kit (ADK) for code-first agent workflows.
RAG Reference Architecture on Vertex AI
A typical Retrieval-Augmented Generation pipeline:
- Ingestion: Documents land in Cloud Storage. A Cloud Run service or Dataflow job extracts text (Document AI for PDFs).
- Chunking & Embedding: Text is split into ~500-1000 token chunks. Each chunk is embedded via text-embedding-004.
- Indexing: Embeddings and chunk metadata stream into Vertex AI Vector Search.
- Query Path: User question -> embed query -> ANN lookup -> top-k chunks -> stuff into Gemini prompt -> return answer with citations.
- Evaluation: Vertex AI Evaluation Service scores answers on groundedness, fluency, and safety.
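The query path above can be sketched end to end with stand-ins for each managed piece. Everything here is an assumption for illustration: embed is a toy bag-of-words counter in place of a real embedding model, retrieve is an exhaustive cosine scan in place of an ANN index, chunking splits on words rather than tokens, and the sample documents are invented. The final prompt would be sent to the LLM, which this sketch does not call.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=12):
    """Split text into fixed-size word windows (a stand-in for token-based chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, index, k=2):
    """Exhaustive nearest-neighbor scan; an ANN index would replace this at scale."""
    q = embed(query)
    scored = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]


def build_prompt(question, chunks):
    """Prepend numbered context chunks so the model can cite them."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


docs = [
    "Refunds are issued within 14 days of a return request.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by chat on weekdays.",
]
index = [(c, embed(c)) for d in docs for c in chunk(d)]

question = "How long does shipping take?"
prompt = build_prompt(question, retrieve(question, index, k=1))
print(prompt)
```

Swapping embed for a real embedding call and retrieve for a Vector Search query keeps the same shape: the retrieval step only decides which chunks reach the prompt.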
Managed RAG Shortcut
If the team prefers managed over custom, Vertex AI Search collapses the chunking, embedding, indexing, and query steps into a single API call. The custom RAG path matters when you need a domain-tuned embedding model, custom chunking logic (e.g., respecting legal clause boundaries), or non-text modalities.
Agent Patterns
- Single-agent + tools: One Gemini agent with function-calling tools (BigQuery query, internal API).
- Multi-agent orchestration: ADK's SequentialAgent, ParallelAgent, and LoopAgent compose specialized agents (research agent -> writer agent -> reviewer agent).
RAG = embed user query + retrieve top-k chunks from Vector Search + prepend chunks to LLM prompt as context. Vertex AI Search is "managed RAG"; Vector Search is "DIY RAG." For exam questions about "ground LLM answers in private documents," prefer Vertex AI Search unless the question explicitly mentions custom embeddings or domain-specific chunking.
Training Compute: GPUs, TPUs, and Sizing
Vertex AI Training exposes Google's full accelerator catalog through the machineSpec field of a training job.
TPU Options
- TPU v5e: cost-optimized for inference and mid-size training. ct5lp-hightpu-{1,4,8}t shapes. Best price/perf for foundation-model fine-tuning under ~10B parameters.
- TPU v5p: performance-optimized successor to v4. Pods up to 8960 chips. Used for frontier-scale pre-training.
- TPU v4: still widely available. ct4p-hightpu-4t shape. Mature software stack (JAX/PAX).
- TPU v3: legacy; choose v5e for new workloads.
GPU Options
- NVIDIA A100 (40GB / 80GB): a2-highgpu-* and a2-ultragpu-* machine types. The default for PyTorch foundation-model work.
- NVIDIA H100: a3-highgpu-8g and a3-megagpu-8g. Top-of-stack for large language model training.
- NVIDIA L4: g2-standard-*. Best price/perf for inference, video, and lightweight training.
- NVIDIA T4 / V100: legacy; available but rarely the best choice today.
Sizing Heuristics
- Embedding fine-tuning, BERT-class: single A100 or 4x L4.
- 7B-13B LLM fine-tuning: 8x A100 80GB or TPU v5e-8.
- 70B+ LLM training: TPU v5p pod slice or H100 cluster with NVLink.
- Inference under 1B params: L4 or TPU v5e-1.
Reduction Server
For multi-worker GPU training, enable Vertex AI's Reduction Server to offload all-reduce traffic from the worker nodes, often improving throughput 20-40% on large language models.
Consider creating training jobs with --enable-web-access so you can open an interactive shell in the running container for debugging; logs can be followed separately with gcloud ai custom-jobs stream-logs. Combine with Vertex AI Experiments to compare hyperparameter sweeps rather than hand-rolling a spreadsheet of runs.
Vertex AI Pipelines (The MLOps Backbone)
Vertex AI Pipelines allows you to automate, monitor, and govern your ML systems by orchestrating your ML workflows using Kubeflow Pipelines (KFP) or TensorFlow Extended (TFX).
- Metadata Tracking: Every time a pipeline runs, Vertex AI automatically records the artifacts (data, models, metrics) and the lineage (which data produced which model).
- Reusability: Build components once and use them across multiple pipelines.
- Architectural Tip: Use Pipelines to implement Continuous Training (CT). When new data arrives in BigQuery, trigger a pipeline to retrain and re-evaluate the model.
- Triggering: Cloud Scheduler -> Pub/Sub -> Cloud Run -> aiplatform.PipelineJob.run(), or use Eventarc to fire on BigQuery table change events.
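The idea of metadata tracking in the list above (each step's inputs and outputs are recorded so lineage can be traced later) can be shown with a toy runner. This is a plain-Python sketch and not Kubeflow Pipelines or Vertex ML Metadata: the Run class, run_pipeline, upstream, and the three step functions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    artifacts: dict = field(default_factory=dict)
    lineage: list = field(default_factory=list)  # (step_name, input_names, output_name)


def run_pipeline(steps, initial):
    """Execute steps in order, recording which artifacts each step consumed and produced."""
    run = Run(artifacts=dict(initial))
    for name, fn, inputs, output in steps:
        args = [run.artifacts[i] for i in inputs]
        run.artifacts[output] = fn(*args)
        run.lineage.append((name, tuple(inputs), output))
    return run


def upstream(run, artifact):
    """Walk lineage backwards to find every artifact that contributed to this one."""
    produced_by = {out: ins for _, ins, out in run.lineage}
    seen, stack = set(), [artifact]
    while stack:
        for parent in produced_by.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical steps: scale the data, "train" a mean predictor, then score it.
def prepare(raw):
    return [x / max(raw) for x in raw]

def train(features):
    return sum(features) / len(features)

def evaluate(model, features):
    return sum((x - model) ** 2 for x in features) / len(features)


steps = [
    ("prepare", prepare, ["raw_data"], "features"),
    ("train", train, ["features"], "model"),
    ("evaluate", evaluate, ["model", "features"], "metrics"),
]
run = run_pipeline(steps, {"raw_data": [2.0, 4.0, 6.0, 8.0]})
print(sorted(upstream(run, "metrics")))
```

Replacing one step, for example a different prepare function, leaves the rest of the sequence untouched, and the lineage record still answers which data produced which model.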
Vertex AI Security and IAM
- IAM Roles: Use specific roles like roles/aiplatform.user for data scientists, roles/aiplatform.viewer for read-only auditors, and roles/aiplatform.admin for platform owners. Avoid the broad roles/ml.admin legacy role.
- VPC Service Controls: Protect Vertex AI resources from data exfiltration by placing them within a security perimeter. Critical for HIPAA/financial-services workloads.
- Private Endpoints: Ensure that prediction traffic stays within your private VPC and never traverses the public internet, via Private Service Connect.
- CMEK: Encrypt training data, model artifacts, and feature store contents with Customer-Managed Encryption Keys from Cloud KMS.
Cost Optimization Strategies
- Spot VMs (formerly Preemptible): Use for non-critical, long-running training jobs to save 60-91% on compute costs. Always checkpoint to Cloud Storage every N minutes so that a preempted job can resume from the last checkpoint instead of starting over.
- Auto-scaling Endpoints: Configure endpoints with a low min_replica_count and high max_replica_count to balance latency and cost. Note that classical Vertex Endpoints do not scale to zero; use Cloud Run for true scale-to-zero serving of small models.
- Idle Workbench Shutdown: Automatically shut down notebook instances after N minutes of inactivity via the idle-timeout-seconds metadata flag.
- Batch over Online: If latency does not need to be real-time, batch prediction is typically 70-90% cheaper than maintaining an always-on Endpoint.
- Committed Use Discounts: TPU and GPU CUDs can save 20-55% on multi-month workloads.
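The checkpointing advice for Spot VMs follows a simple pattern: load the last checkpoint if one exists, save periodically, and write atomically so a preemption mid-write cannot corrupt the file. The sketch below simulates this locally. It is an assumption-laden illustration: a temp-directory file stands in for a Cloud Storage object, checkpoints are taken every N steps rather than every N minutes, the "training" is a fake loss decay, and preempt_at is a hypothetical knob for simulating a reclaimed VM.

```python
import json
import os
import tempfile


def train(total_steps, ckpt_path, checkpoint_every=10, preempt_at=None):
    """Resume from ckpt_path if present, checkpoint periodically, optionally simulate preemption."""
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # pick up where the previous attempt left off
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            raise RuntimeError("simulated preemption")
        state["step"] += 1
        state["loss"] *= 0.99  # stand-in for one real optimization step
        if state["step"] % checkpoint_every == 0:
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)  # atomic swap: never leaves a half-written checkpoint
    return state


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "checkpoint.json")

try:
    train(50, ckpt, preempt_at=25)   # first attempt is preempted at step 25
except RuntimeError:
    pass

with open(ckpt) as f:
    resumed_from = json.load(f)["step"]  # last checkpoint before the preemption

final = train(50, ckpt)              # retry resumes from the checkpoint, not from step 0
print(resumed_from, final["step"])
```

Only the work since the last checkpoint is lost, so the checkpoint interval directly bounds how much compute a preemption can waste.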
For PCA cost-optimization scenarios, the highest-impact levers are: (1) move from online to batch prediction when latency allows, typically 70-90% savings; (2) use Spot VMs for training with Cloud Storage checkpoints; (3) right-size accelerators (L4 instead of A100 for inference under 1B params). Reserved capacity (CUDs) only beats Spot when uptime SLAs forbid preemption.
FAQ — Vertex AI Platform Workflows
Q1. What is the difference between AutoML and Custom Training?
AutoML is a "low-code" approach where Google handles the model architecture and hyperparameter tuning. Custom Training is a "code-heavy" approach where you provide your own model code and training containers.
Q2. Why do I need a Feature Store?
A Feature Store ensures consistency between training and serving. It prevents "training-serving skew," where the data looks different during production than it did during training, leading to poor model performance.
Q3. How does Vertex AI handle model versioning?
The Model Registry allows you to store multiple versions of the same model. You can tag versions (e.g., "production," "staging") and easily roll back to a previous version if a new deployment fails.
Q4. Can I run Vertex AI Pipelines on-premises?
Vertex AI Pipelines is a managed service on GCP, but it is compatible with Kubeflow, which can run on-premises or on GKE. This allows for a hybrid ML strategy.
Q5. What is "Feature Drift" in Model Monitoring?
Feature Drift occurs when the statistical distribution of the input data changes over time (e.g., a new trend in user behavior). Vertex AI can alert you to this so you can retrain your model with fresh data.
Q6. When should I use Vertex AI Search versus Vector Search?
Use Vertex AI Search when you want managed document Q&A with built-in chunking, indexing, and grounding. Use Vector Search when you need full control over embedding models, chunking strategy, or non-text modalities.
Final Architect Tip
For the PCA exam, focus on MLOps. Understand how Vertex AI Pipelines and Model Registry provide the governance needed for enterprise ML. If a scenario involves "reproducibility" or "tracking model lineage," ML Metadata and Pipelines are the answers. For GenAI scenarios, default to Vertex AI Studio + Gemini + grounding. For document Q&A, default to Vertex AI Search. Explainable AI is the go-to solution for "transparency" and "regulatory compliance."