Introduction to Vector Search and Embeddings for GenAI
Vector search over embeddings is the retrieval backbone behind many production LLM features shipped on Google Cloud today, from chatbots that answer questions about a company handbook to product recommenders that respect user intent rather than keywords. The PDE exam expects you to design these retrieval layers correctly: knowing which index type fits which workload, where embeddings come from, and how to keep the freshness/latency/cost triangle balanced.
This note walks through Vertex AI Vector Search (the service formerly called Matching Engine), the ScaNN algorithm that powers it, the BigQuery VECTOR_SEARCH function for analytical workloads, and how all of these pieces snap into a Retrieval-Augmented Generation (RAG) architecture.
Plain English Explanation
Library catalog vs walking the stacks
Imagine a city library with two million books. A traditional keyword search is like flipping every book open and grepping for a word. A vector search is the catalog system: every book has a fingerprint that captures what the book is about (not which words it uses), and similar fingerprints sit on the same shelf. When you ask for "rainy day comfort reads", the librarian doesn't search for the literal phrase. She looks up the fingerprint of your request and pulls books whose fingerprints sit nearest to it in the catalog space. ScaNN is the sorting system that decides where each book lives so the librarian only has to walk to one or two shelves instead of every aisle.
The mall directory you don't need to read
Most malls have an alphabetical directory and a "you are here" map. The map is geometry; the directory is text. Embeddings turn the directory into geometry. Once "Apple Store", "Best Buy", and "GameStop" are placed near each other because their meaning is similar, finding "stores that sell electronics" stops being a string match and starts being a short walk on a map. Vector search relies on this exact trick. The map is high-dimensional (768 numbers for text-embedding-005), but the principle is identical: meaning becomes distance.
A kitchen that prepped before service
A restaurant kitchen during dinner rush cannot start chopping onions when an order arrives. It runs prep in the morning. Embedding generation is prep. You batch every document through an embedding model long before any user query lands, store the resulting vectors in an index, and at request time you only do the cheap, hot-path work: embed one query, ask the index for nearest neighbors, return them. This separation of cold prep and hot serving is what lets a 50ms retrieval step feel snappy while sitting on top of millions of documents.
Embedding: a dense numerical vector (typically 256–3072 floats) that represents semantic meaning. Two pieces of content with similar meaning produce vectors with a small distance between them. Vertex AI's text-embedding-005 returns 768-dim vectors by default and supports configurable output dimensionality from 256 to 768. See Get text embeddings.
Core Concepts of Vector Search and Embeddings
Embeddings as the substrate
An embedding model is a learned function from raw input (text, image, video frame, audio clip) to a fixed-length vector. The geometry of the resulting space encodes similarity. On Vertex AI you have several first-party options:
- text-embedding-005: the current general-purpose English text embedding model. 768 dimensions by default; supports output_dimensionality truncation for cheaper indexing.
- text-multilingual-embedding-002: same shape, trained for cross-lingual similarity. Use this when your corpus or queries span more than English.
- multimodalembedding@001: produces a single 1408-dim vector for text, image, or video, so a text query can retrieve images and vice versa.
- Open-source alternatives (e.g., E5, BGE) deployed to a Vertex AI endpoint when you need fine-tuning control or air-gapped behavior.
You will not generally train your own embedding model for the exam. The PDE blueprint cares that you choose the right pre-built model for the modality and language of your data.
Approximate Nearest Neighbor (ANN) search
A naive nearest-neighbor query against N vectors costs O(N) distance computations. At a billion vectors that is unacceptable for interactive serving. ANN algorithms trade a small recall loss for a massive latency win, typically returning 95–99% of the true top-k in milliseconds.
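To make the O(N) cost concrete, here is a minimal sketch of an exact nearest-neighbor scan in plain Python. The three-dimensional vectors and document IDs are made up for illustration; real embeddings have hundreds of dimensions, and a production system would not score them in a Python loop.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle). 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_top_k(query, corpus, k):
    """Score every stored vector: N distance computations per query."""
    scored = [(cosine_distance(query, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort()
    return [doc_id for _, doc_id in scored[:k]]

# Toy "embeddings" (hypothetical values, 3 dimensions instead of 768).
corpus = {
    "rain-novel": [0.9, 0.1, 0.0],
    "cozy-mystery": [0.8, 0.3, 0.1],
    "tax-guide": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]
nearest = exact_top_k(query, corpus, k=2)
print(nearest)
```

Every query touches every vector, which is exactly the work an ANN index is designed to skip.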
ScaNN: Google's ANN library
Vertex AI Vector Search is built on ScaNN (Scalable Nearest Neighbors), the same library that powers internal Google retrieval workloads. ScaNN combines three ideas: a partitioning step (so most of the dataset is skipped per query), an asymmetric hashing step (so the kept candidates are scored cheaply with quantized vectors), and a final exact rescoring step on a small short-list. The effect is sub-linear search time with controllable recall.
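The three stages can be sketched in a few lines of Python. This is a toy illustration of the idea, not ScaNN's implementation: the centroids are hard-coded rather than learned, and snapping values to a grid is a crude stand-in for the library's learned quantization. All names and data here are assumptions made for the example.

```python
def sq_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vec, step=0.5):
    """Crude stand-in for learned quantization: snap each value to a grid."""
    return [round(x / step) * step for x in vec]

def build_partitions(vectors, centroids):
    """Stage 1 (build time): bucket each vector under its nearest centroid."""
    partitions = [[] for _ in centroids]
    for doc_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: sq_l2(vec, centroids[i]))
        partitions[nearest].append((doc_id, quantize(vec)))
    return partitions

def search(query, vectors, centroids, partitions, k,
           partitions_to_search=1, shortlist_size=3):
    # Stage 1 (query time): only look inside the closest partitions.
    ranked = sorted(range(len(centroids)), key=lambda i: sq_l2(query, centroids[i]))
    candidates = [item for i in ranked[:partitions_to_search] for item in partitions[i]]
    # Stage 2: asymmetric scoring, full-precision query vs quantized stored vectors.
    candidates.sort(key=lambda item: sq_l2(query, item[1]))
    shortlist = [doc_id for doc_id, _ in candidates[:shortlist_size]]
    # Stage 3: exact rescoring of the short-list with the original vectors.
    shortlist.sort(key=lambda doc_id: sq_l2(query, vectors[doc_id]))
    return shortlist[:k]

vectors = {
    "a": [0.1, 0.2], "b": [0.3, 0.1], "c": [0.9, 1.1],
    "d": [5.0, 5.2], "e": [5.1, 4.9], "f": [4.8, 5.0],
}
centroids = [[0.5, 0.5], [5.0, 5.0]]  # hypothetical; a real index learns these
partitions = build_partitions(vectors, centroids)
top = search([5.1, 5.0], vectors, centroids, partitions, k=2)
print(top)
```

In this toy data the quantized vectors in the second partition are indistinguishable, so it is the exact rescoring step that puts the true nearest neighbor first. That is the role the final stage plays: cheap scoring narrows the field, exact scoring orders it.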
Indexes, deployed indexes, and endpoints
Three resources, easy to confuse:
- Index: the data structure built over your vectors. Lives in regional storage. Building an index is a batch job.
- Index Endpoint: the serving infrastructure. Public, VPC peered, or PSC.
- Deployed Index: the binding of one index onto one endpoint. You can deploy the same index to multiple endpoints, or deploy multiple indexes to one endpoint, each with its own deployedIndexId.
You query a deployedIndexId on an IndexEndpoint, not an Index directly. Splitting these resources lets you A/B two index versions on the same endpoint and cut over by changing which deployedIndexId your client calls.
Deployed indexes are charged by the size of the machine type and replica count, not by query volume. A small idle index on e2-standard-2 still bills 24/7. Tear down deployed indexes that nobody calls; the underlying index file in Cloud Storage is cheap to keep. See Vertex AI pricing.
Architecture & Design Patterns
Reference RAG architecture on GCP
A typical RAG pipeline on Google Cloud has six stages: ingest, chunk, embed, index, retrieve, generate. Each stage maps to a clean GCP service:
- Ingest: Cloud Storage for documents, Pub/Sub for change events, Datastream for OLTP CDC.
- Chunk: Dataflow or Cloud Run Jobs that split documents into 200–800 token chunks with overlap. Store chunk text plus metadata in BigQuery or Firestore.
- Embed: Vertex AI embedding API, called from Dataflow with batched RPCs (typically 5–25 texts per call to balance throughput and timeout risk).
- Index: write {id, embedding, restricts} JSONL to Cloud Storage; trigger an index update.
- Retrieve: at query time, embed the user question and call findNeighbors on the deployed index endpoint.
- Generate: pass the retrieved chunks as context into Gemini via Vertex AI, with a system prompt that grounds the answer.
Hybrid retrieval
Pure vector search misses exact matches (model numbers, SKU codes, named entities). Pure keyword search misses paraphrasing. Production systems run both and merge with Reciprocal Rank Fusion (RRF) or a learned reranker. Vector Search supports hybrid retrieval natively by letting you attach a sparse embedding alongside the dense one in the same index.
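For intuition about the merge step, here is a minimal sketch of Reciprocal Rank Fusion applied client-side to two ranked ID lists. It assumes you already have a dense ranking and a keyword ranking; the SKU IDs are invented, and k=60 is the commonly used smoothing constant rather than anything specific to Vector Search.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_ranking = ["sku-9", "sku-2", "sku-7"]    # from vector search
keyword_ranking = ["sku-2", "sku-4", "sku-9"]  # from keyword search
fused = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
print(fused)
```

RRF only looks at ranks, never at raw scores, which is why it works even though cosine distances and keyword scores live on incomparable scales.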
Metadata filtering with restricts
Every datapoint can carry restricts (categorical) and numeric_restricts (numeric range) tags. At query time you combine an ANN search with a filter expression like category = "kitchen" AND price < 100. The filter is applied during traversal, not after, so a restrictive filter can dramatically cut latency rather than wasting work on candidates you would discard. This is the right hook for tenant isolation in multi-tenant SaaS: stamp a tenant_id restrict on every vector and require it in every query.
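A sketch of what a datapoint with restricts might look like when written as JSONL, plus a local function that mirrors the filter semantics. The field layout (namespace/allow for categorical restricts, namespace/value_int for numeric ones) follows the Vector Search JSON input format as I understand it, so verify it against the current documentation. The matches helper is purely illustrative: the service evaluates filters inside the index, not in your code.

```python
import json

def make_datapoint(doc_id, embedding, tenant_id, category, price):
    """Build one JSONL record with categorical and numeric restricts."""
    return {
        "id": doc_id,
        "embedding": embedding,
        "restricts": [
            {"namespace": "tenant_id", "allow": [tenant_id]},
            {"namespace": "category", "allow": [category]},
        ],
        "numeric_restricts": [{"namespace": "price", "value_int": price}],
    }

def matches(datapoint, tenant_id, category, max_price):
    """Local mirror of: tenant_id = X AND category = Y AND price < max_price."""
    allowed = {r["namespace"]: set(r["allow"]) for r in datapoint["restricts"]}
    numeric = {r["namespace"]: r["value_int"] for r in datapoint["numeric_restricts"]}
    return (
        tenant_id in allowed.get("tenant_id", set())
        and category in allowed.get("category", set())
        and numeric.get("price", float("inf")) < max_price
    )

points = [
    make_datapoint("p1", [0.1, 0.2], "acme", "kitchen", 40),
    make_datapoint("p2", [0.2, 0.1], "acme", "kitchen", 250),
    make_datapoint("p3", [0.1, 0.3], "globex", "kitchen", 30),
]
jsonl = "\n".join(json.dumps(p) for p in points)  # one record per line
visible = [p["id"] for p in points if matches(p, "acme", "kitchen", 100)]
print(visible)
```

Only p1 survives: p2 fails the price bound and p3 belongs to another tenant.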
Two-tower vs single-encoder retrieval
Two-tower models embed query and document with separate networks tuned for asymmetric tasks (e.g., short query, long passage). The first-party Vertex models are single-encoder and work well across symmetric and mildly asymmetric setups. If you need true two-tower behavior, host your own model on a Vertex AI endpoint and treat it as the embedding source.
GCP Service Deep Dive
Vertex AI Vector Search index types
There are two index algorithmConfig choices and they exist for different reasons.
Tree-AH is the production default. It builds a tree-structured partitioning of the dataset, then within each leaf uses asymmetric hashing for fast scoring. You tune three knobs:
- leafNodeEmbeddingCount: vectors per leaf. Larger leaves mean fewer leaves to traverse but more in-leaf work.
- leafNodesToSearchPercent: what fraction of leaves to scan per query at serving time. Higher means better recall, slower latency.
- fractionLeafNodesToSearchOverride: per-query override so latency-sensitive callers can dial recall down.
A reasonable starting point for ~10M vectors: leafNodeEmbeddingCount=1000, search 7–10% of leaves, expect 95%+ recall at p99 < 50ms on an e2-highmem-16 shard.
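The arithmetic behind that starting point is worth working through once. The sketch below assumes leaves are evenly filled, which a real index only approximates.

```python
# Assumption: every leaf holds exactly leaf_node_embedding_count vectors.
total_vectors = 10_000_000
leaf_node_embedding_count = 1_000
leaf_count = total_vectors // leaf_node_embedding_count

def vectors_scored(search_percent):
    """Vectors actually scored per query at a given leafNodesToSearchPercent."""
    leaves_searched = leaf_count * search_percent // 100
    return leaves_searched * leaf_node_embedding_count

for percent in (7, 10):
    scored = vectors_scored(percent)
    print(f"{percent}% of {leaf_count:,} leaves -> {scored:,} vectors scored "
          f"({scored / total_vectors:.0%} of the corpus)")
```

Under these assumptions each query scores roughly 0.7M to 1M vectors instead of all 10M, and it scores them with cheap quantized comparisons.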
BRUTE_FORCE does no approximation. It is exact, slow, and exists for two reasons: building a recall ground truth to evaluate Tree-AH against, and serving very small indexes (under ~100K vectors) where ANN setup overhead is not worth it. You will not use BRUTE_FORCE in production at scale, but you may see it as a wrong-but-tempting answer on the exam.
A common exam distractor: "Customer needs sub-50ms latency on 50M vectors, choose BRUTE_FORCE for accuracy." Wrong. BRUTE_FORCE on 50M vectors is seconds, not milliseconds. Tree-AH with tuned leafNodesToSearchPercent is the answer. BRUTE_FORCE is for recall benchmarking and tiny corpora only. See Configure index parameters.
Distance measures
Pick one and stay consistent: an index baked with DOT_PRODUCT_DISTANCE cannot be queried with cosine semantics later without rebuild.
- COSINE_DISTANCE: most common for text embeddings; magnitude-invariant.
- DOT_PRODUCT_DISTANCE: equivalent to cosine when vectors are L2-normalized; slightly faster.
- SQUARED_L2_DISTANCE: geometric distance; appropriate when magnitudes carry meaning.
Vertex's first-party text embeddings are pre-normalized, so dot product and cosine produce identical rankings.
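A quick way to convince yourself of the normalization point is to rank the same toy vectors both ways. The vectors below are invented; the sketch only demonstrates the math and makes no claim about any particular model's output.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def rank(query, docs, score):
    """Document IDs ordered from most to least similar."""
    return sorted(docs, key=lambda doc_id: -score(query, docs[doc_id]))

raw_docs = {"a": [3.0, 4.0], "b": [10.0, 0.0], "c": [1.0, 5.0]}
raw_query = [1.0, 1.0]

# Unnormalized: the long vector "b" wins on dot product purely by magnitude.
raw_by_dot = rank(raw_query, raw_docs, dot)
raw_by_cos = rank(raw_query, raw_docs, cosine_similarity)

# Normalized: dot product and cosine now agree.
unit_docs = {doc_id: normalize(v) for doc_id, v in raw_docs.items()}
unit_query = normalize(raw_query)
unit_by_dot = rank(unit_query, unit_docs, dot)
unit_by_cos = rank(unit_query, unit_docs, cosine_similarity)
print(raw_by_dot, raw_by_cos, unit_by_dot, unit_by_cos)
```

If you bring your own embedding model and choose DOT_PRODUCT_DISTANCE, normalizing at ingest and at query time is what keeps the two measures interchangeable.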
Batch vs streaming index updates
Index updates come in two flavors:
Batch update: you write the entire dataset (or a delta) as JSONL files in Cloud Storage and call updateIndex. Indexing throughput is high (millions of vectors per hour), but the change is not visible to queries until the update operation finishes and any deployed index has refreshed to the new version. Latency from upsert to queryable: minutes to tens of minutes.
Streaming update: you call upsertDatapoints and removeDatapoints directly on the index. Changes become queryable within seconds. Required when the corpus moves continuously: news articles, marketplace listings, support tickets. Costs more per write than batch, and you must enable streaming at index creation time (indexUpdateMethod: STREAM_UPDATE).
Default to STREAM_UPDATE for any index you expect to grow during normal operations. The cost premium is small, and you cannot convert a batch index to streaming after creation without rebuilding. See Update and rebuild an active index.
BigQuery VECTOR_SEARCH
For analytical retrieval workloads where data already lives in BigQuery, the VECTOR_SEARCH table function gives you nearest-neighbor results with no separate serving infra:
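    -- ML.GENERATE_EMBEDDING returns the vector in ml_generate_embedding_result;
    -- alias it so the query column matches the base table's 'embedding' column.
    SELECT base.product_id, base.name, distance
    FROM VECTOR_SEARCH(
      TABLE retail.product_embeddings, 'embedding',
      (SELECT ml_generate_embedding_result AS embedding
       FROM ML.GENERATE_EMBEDDING(
         MODEL retail.text_emb,
         (SELECT 'comfortable running shoes' AS content))),
      top_k => 10,
      distance_type => 'COSINE',
      options => '{"fraction_lists_to_search": 0.05}'
    );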
You can build an IVF index on the embedding column (CREATE VECTOR INDEX) to make this sub-second over hundreds of millions of rows. BigQuery vector search is the right pick when:
- The retrieval is part of a larger analytical query (joins, aggregations, window functions).
- Latency budget is hundreds of milliseconds, not tens.
- You want to avoid running a separate serving endpoint.
It is the wrong pick for a chatbot front end demanding p99 < 50ms and high QPS. For that, push embeddings out to Vertex AI Vector Search.
Generating embeddings inside BigQuery
ML.GENERATE_EMBEDDING invokes a Vertex AI embedding model from inside a BigQuery query, which is the cleanest way to keep the embedding column in sync with the source text. Combined with scheduled queries or Dataform, you get a fully SQL-driven embedding pipeline.
Endpoints: public, VPC, PSC
IndexEndpoint resources support three networking modes:
- Public endpoint — easiest to start, mTLS, suitable for managed clients.
- VPC peering — for workloads inside a single VPC. Older pattern, requires reserved IP range.
- Private Service Connect (PSC) — preferred for production. Gives a private IP in your VPC, works across organizations, and supports service attachment.
For any regulated workload (PII, healthcare, financial), PSC plus a CMEK-encrypted index is the expected design.
Common Pitfalls & Trade-offs
Recall vs latency vs cost
Three knobs, one budget. Higher leafNodesToSearchPercent buys recall at the cost of latency. More replicas buy QPS and lower tail latency at the cost of dollars. Larger leafNodeEmbeddingCount reduces tree depth (good for memory) but slows per-query work in leaves. The PDE exam likes scenarios that pin two of these knobs and ask which third lever to pull.
Embedding drift
Embedding models are versioned. If you reindex with text-embedding-005 half your corpus and text-embedding-004 the other half, distances are not comparable and ranking quality silently degrades. Always store the embedding model version as metadata on every chunk, and reindex the entire corpus when you upgrade the model.
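One way to catch mixed versions before they reach an index is a guard in the ingest job. This is a minimal sketch: the embedding_model metadata field and the chunk layout are hypothetical names chosen for the example.

```python
from collections import Counter

def embedding_model_counts(chunks):
    """Count chunks per embedding model, read from a hypothetical metadata field."""
    return Counter(chunk["embedding_model"] for chunk in chunks)

def assert_single_model(chunks):
    """Refuse to build or update an index from vectors of mixed model versions."""
    counts = embedding_model_counts(chunks)
    if len(counts) > 1:
        raise ValueError(f"mixed embedding models in one index: {dict(counts)}")
    return counts

clean = [
    {"id": "c1", "embedding_model": "text-embedding-005"},
    {"id": "c2", "embedding_model": "text-embedding-005"},
]
drifted = clean + [{"id": "c3", "embedding_model": "text-embedding-004"}]

print(assert_single_model(clean))
try:
    assert_single_model(drifted)
except ValueError as err:
    print(err)
```

Failing loudly at ingest is cheaper than discovering weeks later that ranking quality has quietly dropped.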
Chunking too small or too large
Tiny chunks (50 tokens) lose context; the model retrieves "page 4" with no surrounding meaning. Huge chunks (4000 tokens) drown the LLM context and bury the relevant snippet. The empirical sweet spot is 200–500 tokens with 10–20% overlap. This is one of the few places where "tune empirically" is the right exam answer.
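A sliding-window chunker makes the size and overlap knobs tangible. This sketch splits on whitespace as a stand-in for a real tokenizer, so its counts only approximate model tokens; the defaults (300 tokens, 45 overlap, about 15%) are just one point inside the range above.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=45):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

# Whitespace "tokens" for illustration; a real pipeline would use the model's tokenizer.
document = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_tokens(document.split())
print(len(chunks), [len(c) for c in chunks])
```

Because the chunker is a pure function, you can re-run it with different sizes over a sample corpus and compare retrieval recall before committing to a full reindex.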
Forgetting the cold start
A freshly deployed index serves zero queries. The first few queries hit cold caches and can take seconds. For SLA-bound services, run a warm-up job that issues representative queries against every replica before flipping load balancer weight.
Querying from the wrong region
IndexEndpoint is regional. A query from us-east1 to a us-central1 endpoint adds 30–40ms of network latency, often more than the search itself. Co-locate the endpoint, the embedding API call, and the calling service in one region.
The cheapest-looking embedding choice is often output_dimensionality=256 to shrink the index. But when your corpus is below a few million vectors, the index size savings are negligible and the recall loss is real. Only truncate dimensions when storage or memory is a measurable constraint. See Embedding dimensions.
Best Practices
- Standardize on one embedding model per index, store the model name and version on every datapoint, and treat model upgrades as full reindex events.
- Keep chunk text in a system of record (BigQuery or Firestore) keyed by the same ID you store in Vector Search. The index returns IDs and distances, not the source text; you fetch text from the system of record.
- Use STREAM_UPDATE indexes by default; switch to BATCH_UPDATE only for static corpora where the cost difference matters.
- Stamp every datapoint with at least tenant_id, source, and created_at restricts so you can filter, debug, and expire content without rebuilding.
- Run a recall benchmark by building a tiny BRUTE_FORCE index over a representative 100K-vector sample and measuring how often Tree-AH agrees with it on the top-10. Keep recall above 0.95 for general retrieval; above 0.99 for compliance-bound search.
- Pre-warm new deployed indexes with synthetic queries before sending production traffic.
- For RAG, always pass the retrieval distance into the LLM prompt or use it to drop weak matches; do not let the model speak confidently from a 0.6-cosine-distance result.
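The recall benchmark in the list above reduces to a small set comparison once you have both result lists. A minimal sketch, assuming you have already collected the top-k IDs from the Tree-AH index and from the BRUTE_FORCE index for each test query (the IDs below are invented):

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k that the approximate index also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must contain at least one result")
    return len(truth & set(approx_ids[:k])) / len(truth)

def mean_recall(result_pairs, k=10):
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per query."""
    pairs = list(result_pairs)
    return sum(recall_at_k(approx, exact, k) for approx, exact in pairs) / len(pairs)

exact = [f"doc-{i}" for i in range(10)]  # ground truth from BRUTE_FORCE
approx = exact[:9] + ["doc-99"]          # Tree-AH missed one true neighbor
sample_recall = recall_at_k(approx, exact)
overall = mean_recall([(approx, exact), (exact, exact)])
print(sample_recall, overall)
```

Recall here ignores ordering inside the top-k; if rank position matters for your prompt construction, track a rank-aware metric alongside it.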
Vertex AI Vector Search resource trio: Index (the data) -> deployed onto IndexEndpoint (the infra) via a DeployedIndex (the binding). Clients call findNeighbors on the endpoint, naming the deployedIndexId. Reference: Vector Search resources.
Real-World Use Case
A mid-sized e-commerce company (12M SKUs, 4M monthly active users, three regions) wants conversational product discovery. The shopper types "warm jacket for a Tokyo winter business trip" and expects relevant products, not keyword matches.
Architecture:
- Product catalog lives in Cloud SQL. A Datastream pipeline mirrors changes into BigQuery in near real time.
- A Dataflow streaming job reads CDC events, generates embeddings via text-multilingual-embedding-002 (the catalog spans EN/JA/ZH), and upserts into a STREAM_UPDATE Vertex AI Vector Search index.
- The index carries restricts: category, gender, available_regions, and a numeric price_jpy restrict.
- The serving path is a Cloud Run service per region. Each request: embed the query, call findNeighbors on the regional endpoint with restricts derived from the user's market, fetch product detail rows from BigQuery by ID, and pass the top 8 into Gemini for a conversational response.
- Cold start is handled by a Cloud Scheduler job that issues 50 representative queries to every endpoint every 10 minutes.
Outcome: p95 retrieval latency at 28ms, end-to-end response (including LLM streaming) under 1.2s, and a measurable lift in click-through rate over the previous keyword-only search. Cost is dominated by the deployed indexes and the LLM, not the embedding API.
Exam Tips
The PDE exam covers Vector Search at a design level rather than asking you to write gcloud commands. Patterns to recognize:
- "Real-time updates required" -> STREAM_UPDATE, not BATCH_UPDATE.
- "Multi-language corpus" -> text-multilingual-embedding-002 (or multimodal if images are mixed in), not text-embedding-005.
- "Search images by a text query" -> multimodalembedding@001 because both modalities share one space.
- "Multi-tenant SaaS, must isolate tenants" -> restricts with tenant_id, enforced at query time.
- "Analytics team wants nearest neighbors inside SQL joins" -> BigQuery VECTOR_SEARCH, not Vertex AI Vector Search.
- "Sub-50ms p99 at high QPS" -> Vertex AI Vector Search with Tree-AH and tuned leafNodesToSearchPercent, not BigQuery and not BRUTE_FORCE.
- "Need to validate retrieval quality" -> build a small BRUTE_FORCE index as ground truth and measure recall.
- "Private network only, healthcare PII" -> Private Service Connect endpoint with CMEK-encrypted index.
- "Index won't accept new docs after creation" -> the index was created with BATCH_UPDATE; recreate as STREAM_UPDATE.
- "Embedding model upgrade" -> full reindex, do not mix versions.
The exam loves to put BRUTE_FORCE in scenarios where it does not belong. Default to Tree-AH unless the problem is "build a recall benchmark" or "tiny dataset under 100K vectors".
For any RAG question on the exam: the retrieval system (Vector Search) and the generation model (Gemini) are separate services. Vertex AI Search (formerly Enterprise Search / Discovery Engine) is a third, fully-managed product that bundles both. Pick Vertex AI Search when the requirement is "build a search experience over my docs with minimum infrastructure"; pick Vector Search + Gemini when you need control over the retrieval logic. Reference: Vertex AI Search overview.
Frequently Asked Questions (FAQ)
When should I use BigQuery VECTOR_SEARCH instead of Vertex AI Vector Search?
Use BigQuery VECTOR_SEARCH when retrieval is part of a larger analytical query, when the corpus naturally lives in BigQuery, and when latency budget is in the hundreds of milliseconds. Use Vertex AI Vector Search when you need interactive serving (sub-100ms p99), high QPS, or features like streaming updates and PSC endpoints. Both can coexist: many shops use BigQuery VECTOR_SEARCH for offline analytics and Vertex AI Vector Search for online serving, generating embeddings once and storing them in both places.
How do I pick between text-embedding-005 and text-multilingual-embedding-002?
If your corpus and queries are predominantly English, text-embedding-005 gives the best English-only quality. If either side spans multiple languages (a Japanese user querying an English knowledge base, or a multilingual product catalog), text-multilingual-embedding-002 produces vectors in a shared cross-lingual space and will retrieve correctly across languages. Both return 768-dim vectors and are interchangeable in terms of integration code; the difference is purely model quality on your data.
What is the difference between Tree-AH and ScaNN?
ScaNN is the underlying library and family of algorithms. Tree-AH is the specific configuration of ScaNN that Vertex AI Vector Search exposes for production: it combines a tree-structured partitioning step with asymmetric hashing for fast in-leaf scoring. When the documentation says "Vertex AI Vector Search uses ScaNN", what you actually configure is Tree-AH parameters. BRUTE_FORCE is the other algorithm option; it bypasses ScaNN entirely and computes exact distances.
How fresh can my index be?
With STREAM_UPDATE, individual upserts become queryable within seconds. With BATCH_UPDATE, freshness is bounded by your batch cadence plus the index update job (typically minutes). For most production GenAI features, STREAM_UPDATE with a Dataflow streaming job feeding it gives end-to-end document-to-queryable latency under a minute. If your corpus is genuinely static (annual policy documents, archived research), BATCH_UPDATE saves cost.
Do I need a vector database, or can BigQuery do everything?
If your retrieval QPS is modest (single-digit to low hundreds) and your latency tolerance is hundreds of milliseconds, BigQuery with a vector index handles it cleanly and removes operational overhead. Once you need interactive p99 latency under 100ms or QPS into the thousands, a purpose-built vector serving system (Vertex AI Vector Search, or third parties like Pinecone or Weaviate) becomes worth the operational cost. The PDE exam tends to use "interactive chatbot, low latency, high QPS" as the trigger phrase for Vertex AI Vector Search.
How do I prevent data leakage between tenants in a shared index?
Stamp every datapoint with a tenant_id restrict at ingest time. Require every query to include a restricts filter on tenant_id matching the authenticated caller. Enforce this in a thin server-side wrapper rather than trusting the client. For very high-sensitivity workloads (regulated industries), use a separate index per tenant; the cost trade-off is significant deployed-endpoint overhead per tenant, but the isolation is absolute.
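The thin server-side wrapper could look something like the sketch below. The search_fn callable and its (embedding, restricts, k) signature are hypothetical stand-ins for whatever client you use to call the index; the point is only that the tenant restrict is derived from the authenticated identity and overrides anything the caller supplied.

```python
class TenantScopedSearch:
    """Wraps a search backend so every query carries the caller's tenant_id."""

    def __init__(self, search_fn):
        # Hypothetical backend: search_fn(embedding, restricts, k) -> list of IDs.
        self._search_fn = search_fn

    def search(self, authenticated_tenant, embedding, client_restricts=None, k=10):
        # Discard any tenant_id restrict the client tried to supply.
        restricts = [
            r for r in (client_restricts or []) if r["namespace"] != "tenant_id"
        ]
        # Always add the restrict derived from the authenticated identity.
        restricts.append({"namespace": "tenant_id", "allow": [authenticated_tenant]})
        return self._search_fn(embedding, restricts, k)

# Fake backend that records what it was asked, for demonstration only.
seen_restricts = []

def fake_search(embedding, restricts, k):
    seen_restricts.append(restricts)
    return ["doc-1"]

service = TenantScopedSearch(fake_search)
spoofed = [
    {"namespace": "tenant_id", "allow": ["globex"]},  # client tries another tenant
    {"namespace": "category", "allow": ["kitchen"]},
]
result = service.search("acme", [0.1, 0.2], client_restricts=spoofed, k=5)
print(result, seen_restricts[-1])
```

Keeping this logic in one wrapper means a missing tenant filter is a code-review problem in a single place rather than something every caller has to remember.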
Does Vector Search support hybrid (sparse + dense) retrieval?
Yes. You can attach a sparse embedding (typically from a model like SPLADE or a learned BM25 variant) alongside the dense embedding when you index a datapoint. At query time, Vector Search returns a fused ranking. This recovers the exact-match ability that pure dense retrieval lacks (model numbers, named entities) without running a separate keyword index.
Related Topics
- Vertex AI Feature Store for ML Serving: the sibling Vertex AI service for low-latency feature retrieval, often used alongside Vector Search in recommender architectures.
- BigQuery ML for Data Engineers: covers ML.GENERATE_EMBEDDING and other in-database ML functions that pair with VECTOR_SEARCH.
- Dataflow Streaming Pipelines for ML: the recommended pattern for keeping a streaming-update vector index continuously fresh from CDC events.
Further Reading
- Vertex AI Vector Search overview: official conceptual documentation.
- BigQuery vector search introduction: VECTOR_SEARCH function and CREATE VECTOR INDEX reference.
- Get text embeddings on Vertex AI: model list, dimensionality, request format.
- Configure Vector Search index parameters: Tree-AH tuning reference.