examlab .net The most efficient path to the most valuable certifications.
In this note ≈ 20 min

Data Prep for LLMs and RAG

3,850 words · ≈ 20 min read ·

Practical GCP PDE guide to data preparation for LLM fine-tuning and RAG: chunking, embeddings with Vertex AI and BigQuery ML, JSONL formats, DLP cleaning, and Vector Search.

Do 20 practice questions → Free · No signup · PDE

Introduction to Data Preparation for LLM Fine-tuning and RAG

Data Preparation for LLM Fine-tuning and RAG is the silent backbone of every generative AI system that actually works in production. The model gets the credit, but the dataset does the heavy lifting. On Google Cloud, this work spans BigQuery, Vertex AI, Cloud Storage, Cloud DLP, and Vector Search, and the PDE exam expects you to know which tool fits which stage. This note walks through chunking, embedding, JSONL formatting, supervised tuning versus RLHF, DLP-based cleaning, evaluation set design, Vector Search index population, and the freshness problem that haunts every RAG architect.

白話文解釋(Plain English Explanation)

Think of it like prepping a kitchen before service

A restaurant kitchen does not start cooking when the first ticket prints. The prep cooks have already washed lettuce, portioned proteins, labeled sauces, and stacked them in line order. When the chef finally fires a dish, every component is within arm's reach. Data Preparation for LLM Fine-tuning and RAG plays the same role. By the time a user types a question, your chunks are already embedded, your training pairs are already validated, and your PII is already redacted. If you skip prep, you are improvising in front of a paying customer.

Think of RAG like an open-book exam

In a closed-book exam you must memorize everything. Fine-tuning is closer to that: you bake knowledge into the weights and the model recalls it later. RAG is the open-book version. The model still needs to be smart, but it gets to flip to the right page during the exam. The catch is that the textbook has to be well organized. If pages are out of order, paragraphs are cut mid-sentence, or the table of contents is wrong, the student finds nothing useful. Data Preparation for LLM Fine-tuning and RAG is what turns a pile of company documents into a textbook with a clean index.

Think of embeddings like coordinates on a map

Words by themselves do not know where they live. Embeddings give every chunk of text a coordinate in a high-dimensional space, the same way GPS gives every coffee shop a latitude and longitude. Two cafes on the same street end up close together; a coffee shop and a tire repair shop end up far apart. When a user asks a question, you convert the question into coordinates and look for the nearest neighbours. Without good Data Preparation for LLM Fine-tuning and RAG, half your map is mislabeled and the navigation system sends people to the wrong block.

Core Concepts of Data Preparation for LLM Fine-tuning and RAG

The work splits cleanly into two pipelines that share a lot of plumbing. The RAG pipeline is read-heavy and runs continuously: ingest, chunk, embed, index, refresh. The fine-tuning pipeline is write-heavy and runs in bursts: collect examples, clean, format, split, validate, then submit a tuning job. Both pipelines pull from the same upstream sources, both need PII protection, and both live or die by evaluation quality.

Chunking strategies

Chunking decides how much text travels together as a single retrieval unit. Three strategies dominate Data Preparation for LLM Fine-tuning and RAG on Google Cloud:

  • Fixed-size chunking splits text every N tokens with a small overlap, often 512 tokens with 50 tokens of overlap. It is the easiest to implement and the easiest to reason about. The downside is that it cuts mid-sentence and mid-thought, which hurts retrieval precision on dense technical documents.
  • Semantic chunking uses an embedding model to score sentence-to-sentence similarity and breaks where similarity drops. It produces chunks that respect topic boundaries, but it costs more compute up front because every sentence needs a preliminary embedding.
  • Recursive character chunking walks a hierarchy of separators (paragraph, line, sentence, word) and splits at the largest separator that keeps each chunk under the size limit. It is the LangChain default and a sensible middle ground for mixed corpora.

Embedding generation

Once chunks exist, you turn them into vectors. On Google Cloud you have two production paths. ML.GENERATE_EMBEDDING runs inside BigQuery and is ideal when your chunks already live in a BigQuery table; you write SQL, point at a remote model that wraps text-embedding-005 or text-multilingual-embedding-002, and the vectors land in a column. The Vertex AI textembedding-gecko family (now text-embedding-005 for English and text-multilingual-embedding-002 for cross-language work) is callable directly from any client and shines when chunking happens in Dataflow or Cloud Run jobs.

JSONL training format

Vertex AI supervised tuning accepts JSONL where every line is a complete example. For Gemini supervised tuning the schema uses a contents array with role and parts, plus a systemInstruction field. Each line is independent, line breaks inside a record are forbidden, and the file must be UTF-8. A clean JSONL is the difference between a tuning job that starts in 10 minutes and one that fails after two hours of validation.

Supervised tuning vs RLHF

Supervised fine-tuning teaches the model from prompt-response pairs. RLHF (reinforcement learning from human feedback) goes further: you also provide preference pairs ("response A is better than response B") and a reward model learns to rank outputs. Supervised tuning is cheaper, faster, and covers most enterprise use cases. RLHF earns its cost when style, safety, or subjective quality matter more than factual accuracy.

Dataset cleaning

Cleaning happens before chunking and before formatting. The non-negotiables are PII removal via Cloud DLP, deduplication, language detection, and toxicity filtering. Skipping any one of these is how teams end up with models that quote a customer's home address back at them in production.

Evaluation datasets

Two flavors are required. The golden set is small (typically 50 to 500 examples) and hand-curated by domain experts. The synthetic set is large and generated by another LLM, often Gemini, prompted to create realistic queries against your corpus. The golden set is the source of truth; the synthetic set lets you regression-test at scale.

Vector Search index population

Vertex AI Vector Search uses ScaNN under the hood. You write embeddings as JSON to Cloud Storage, create an index pointing at that bucket, and deploy the index to an endpoint. You can choose between batch updates (cheap, hourly to daily) and streaming updates (higher cost, near real-time).

Data freshness for RAG

Knowledge bases drift. A RAG pipeline that ingested last quarter's product catalog will confidently quote discontinued SKUs. Freshness strategies range from full nightly rebuilds to change-data-capture from Spanner or Cloud SQL into Pub/Sub, with a Dataflow job that re-embeds only the changed rows.

A dense numerical vector (typically 768, 1024, or 3072 dimensions on Vertex AI) that represents the semantic meaning of a text chunk so that similar meanings end up close together in vector space. Reference: https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings

Architecture and Design Patterns

A reference RAG architecture on Google Cloud usually flows: source systems (Cloud Storage, Drive, Confluence, Spanner) feed a Dataflow ingestion job, which writes raw documents to a landing bucket. A second Dataflow job parses, chunks, and calls Cloud DLP for redaction. The cleaned chunks land in a BigQuery staging table. ML.GENERATE_EMBEDDING populates an embeddings column, then a Cloud Composer DAG exports the embeddings to Cloud Storage in the JSONL format Vector Search expects, triggers an index update, and logs metadata to a tracking table.

For fine-tuning, the architecture is simpler but stricter. Curated examples sit in a BigQuery table with strong schema enforcement. A Cloud Run job exports them to JSONL in Cloud Storage, runs a validation pass (line count, schema, token limits), then kicks off a Vertex AI tuning job through the Python SDK. The resulting tuned endpoint is registered in Vertex AI Model Registry and routed through Vertex AI Endpoints with traffic splitting for A/B comparison against the base model.

The design pattern worth memorizing is separation of write paths. Ingestion writes raw, transformation writes cleaned, embedding writes vectors. Each stage is idempotent and resumable, which is what saves you when a Dataflow worker crashes halfway through a 200 GB corpus.

Data Preparation for LLM Fine-tuning and RAG must keep raw and processed data in separate buckets with separate IAM roles. The raw bucket holds unredacted source material and must be locked down to the ingestion service account only; the processed bucket is what humans and downstream services read. Mixing them is how unredacted PII leaks into a vector index. Reference: https://cloud.google.com/storage/docs/access-control/iam-permissions

GCP Service Deep Dive

BigQuery and ML.GENERATE_EMBEDDING

BigQuery is the most underrated tool in this stack. If your documents land in BigQuery anyway (rare, but common for support tickets, chat transcripts, and product reviews), ML.GENERATE_EMBEDDING skips an entire pipeline. You create a remote model that wraps a Vertex AI endpoint, then call it as a SQL function. A typical query looks like SELECT *, ml_generate_embedding_result FROM ML.GENERATE_EMBEDDING(MODEL my_model, TABLE my_chunks, STRUCT(TRUE AS flatten_json_output)). The function handles batching, retry, and quota management automatically. The output column is an ARRAY<FLOAT64> you can hand straight to Vector Search.

Vertex AI text embedding models

The textembedding-gecko branding has been replaced by the text-embedding-005 model for English and text-multilingual-embedding-002 for over 100 languages. Both produce 768-dimensional vectors by default but support output_dimensionality truncation down to as low as 256, which cuts storage and latency at a measured precision cost. Use the multilingual model whenever your corpus mixes languages; the English-only model is faster but useless on Japanese support tickets.

Vertex AI supervised tuning

Supervised tuning on Gemini models runs as a managed pipeline. You upload a JSONL file to Cloud Storage, call vertexai.tuning.sft.train(), and Vertex AI handles GPU provisioning, checkpointing, and evaluation. Minimum dataset size is 16 examples; the official recommendation is between 100 and 500 for most tasks, though style transfer can need more.

Vector Search supports two index types: Tree-AH (good for static or batch-updated corpora, lowest query latency) and Brute Force (only for evaluation against ground truth, never production). For streaming updates you enable STREAM_UPDATE at index creation time and push upserts via the API. Latency at the 95th percentile typically sits between 10 and 50 ms for indexes under 100 million vectors.

Cloud DLP (Sensitive Data Protection)

DLP is the gatekeeper for Data Preparation for LLM Fine-tuning and RAG. It detects over 150 infoTypes out of the box (credit cards, SSNs, phone numbers, regional ID formats) and supports custom infoTypes via regex or dictionary. For training data you typically use de-identification templates that replace PII with format-preserving tokens, so the model still learns sentence structure without memorizing personal data.

Dataflow

Dataflow handles the chunking and embedding-at-scale workload that BigQuery cannot. The Apache Beam Python SDK ships with MLTransform, which wraps Vertex AI embedding calls in a streaming-friendly DoFn with automatic batching and retry. For 100 GB-plus corpora, Dataflow is the only sane choice.

For chunk sizes in production RAG, start with 512 tokens and 50-token overlap, measure retrieval recall on your golden set, then adjust. Going below 256 tokens usually hurts because chunks lose context; going above 1024 hurts because too many concepts mix into one vector. Reference: https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings

Common Pitfalls and Trade-offs

The first mistake teams make is embedding before cleaning. Once a vector is in your index, you cannot easily undo the PII it encoded. The fix is to run DLP before chunking, not after embedding.

The second mistake is treating chunking as a one-time decision. Different document types need different chunk sizes. Legal contracts work best with semantic chunking at clause boundaries; chat transcripts work best with fixed-size chunks at message granularity; code repositories need AST-aware chunking that respects function boundaries. A single global chunk size is usually wrong for at least one document class in your corpus.

The third mistake is using the same embedding model for query and corpus without versioning. When Google upgrades text-embedding-005, your existing index becomes incompatible with new query embeddings. The fix is to pin the model version in your index metadata and re-embed the entire corpus when you upgrade.

The fourth mistake is using production data as evaluation data. If your golden set is sampled from the same logs you trained on, you are testing memorization, not generalization. Hold out evaluation queries from the training cutoff date forward.

Do not deduplicate fine-tuning examples by exact string match alone. Near-duplicates (same prompt with one extra space, same response with different capitalization) survive exact-match dedup and bias the model toward those patterns. Use MinHash or embedding-based similarity (cosine > 0.95) to catch near-duplicates. The Vertex AI tuning service does not deduplicate for you. Reference: https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-models

For supervised tuning on Gemini, hold out a separate evaluation JSONL (typically 10-20 percent of curated pairs, minimum 100 examples for stable metrics) and pass it via the validation_dataset_uri parameter of vertexai.tuning.sft.train(); never reuse training examples for eval or the golden-set regression check described above will mask overfitting. Orchestrate the chunk-clean-embed-index DAG with Vertex AI Pipelines (or Cloud Composer) and run the embedding-at-scale step on Dataflow with MLTransform, so a single re-embedding event when text-embedding-005 is upgraded is reproducible end-to-end. Reference: https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-models

Best Practices

  • Run Cloud DLP before chunking, not after, and store de-identification templates in version control alongside your pipeline code.
  • Pin the embedding model version in index metadata; treat a model upgrade as a full re-index event, not an in-place change.
  • Keep raw and processed data in separate Cloud Storage buckets with separate IAM bindings.
  • Maintain a golden evaluation set of at least 100 hand-curated examples, refreshed quarterly by domain experts.
  • Use streaming Vector Search updates only when business requires sub-hour freshness; batch updates are 5 to 10 times cheaper.
  • Validate JSONL files locally with a schema check before uploading to Cloud Storage; Vertex AI tuning errors after upload are slow to surface.
  • Track every dataset version in BigQuery with a dataset_version column so you can reproduce any tuning run from six months ago.
  • For multilingual corpora, default to text-multilingual-embedding-002; do not mix embedding models across the same index.

Real-World Use Case

A mid-size insurance company with 4,000 employees decided to build an internal claims-assistance chatbot. The corpus was 800 GB of policy PDFs, 12 years of email correspondence, and a Spanner-backed claims database with 40 million rows. Compliance required that no PII reach the LLM unredacted, that the model never hallucinate policy numbers, and that fresh policy revisions appear in the assistant within four hours of approval.

The team chose RAG over fine-tuning because policy language changes too often to bake into weights. The pipeline ran on Dataflow for ingestion and chunking with MLTransform for embeddings, Cloud DLP for PII redaction with a custom infoType for internal policy IDs, and Vector Search with STREAM_UPDATE for the policy index plus a separate batch-updated index for historical emails. BigQuery held the chunk-to-source mapping and dataset version history.

For the eventual fine-tuning of response style (formal, sympathetic, never legally binding), the team curated 320 prompt-response pairs reviewed by two senior claims adjusters. They used Vertex AI supervised tuning on gemini-2.0-flash-001, which finished in 90 minutes. The golden evaluation set caught a regression where the tuned model started over-using the phrase "I understand your frustration"; they pulled the offending examples and re-tuned. Total time from kickoff to production: 11 weeks. Total spend: roughly USD 38,000 in compute, dominated by the initial embedding generation pass over the historical email archive.

Exam Tips

The PDE exam loves to test the line between fine-tuning and RAG. If a question mentions "frequently changing data" or "domain-specific knowledge that updates daily," the answer is RAG. If it mentions "consistent style," "specialized format," or "task-specific behavior," the answer is fine-tuning.

When the exam asks about embedding generation inside BigQuery, the answer is ML.GENERATE_EMBEDDING with a remote model wrapping a Vertex AI endpoint. When it asks about embedding generation in a streaming pipeline, the answer is Dataflow with MLTransform calling Vertex AI directly.

For PII in training data, the answer is always Cloud DLP de-identification before the data leaves the cleaning pipeline. The exam will offer distractors like "use IAM to restrict access" or "encrypt with CMEK"; those are good practices but they do not remove PII from the model's training data.

For Vector Search, remember that Tree-AH is the production index type and that streaming updates cost more than batch. The exam may frame this as a cost-versus-freshness tradeoff.

For JSONL formatting, remember that each line is independent JSON, the file is UTF-8, and the schema differs between PaLM-era models (input_text/output_text) and Gemini models (contents with role and parts).

PDE exam shortcut: Fine-tuning bakes, RAG fetches. Fine-tuning changes weights and is best for style, format, and behavior. RAG retrieves at query time and is best for facts that change. Use both together when you need consistent style over fresh facts. Reference: https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-models

Frequently Asked Questions (FAQ)

Start with 512 tokens and 50 tokens of overlap, then tune based on retrieval recall measured against your golden set. Dense technical documents often benefit from smaller chunks (256 to 384 tokens) with semantic boundaries, while narrative content like meeting transcripts can use larger chunks (768 to 1024 tokens) without losing precision.

When should I use BigQuery ML.GENERATE_EMBEDDING versus calling Vertex AI directly?

Use ML.GENERATE_EMBEDDING when your text already lives in BigQuery and you want to keep the embedding column alongside the source data; it eliminates an entire data movement step. Call Vertex AI directly from Dataflow or Cloud Run when you need streaming, when chunks come from non-BigQuery sources, or when you need fine-grained control over batching and retry behavior.

Is supervised fine-tuning enough or do I need RLHF?

Supervised tuning covers the vast majority of enterprise use cases at a fraction of the cost and complexity. Reach for RLHF only when subjective quality matters more than correctness, for example tone-of-voice, safety alignment, or creative writing. Vertex AI's managed RLHF pipeline expects preference pairs and a reward model; the operational overhead is significant.

How do I handle PII in fine-tuning data without losing semantic context?

Use Cloud DLP de-identification with format-preserving tokenization rather than full redaction. The model still sees that a credit card number was present and learns the surrounding sentence structure, but the specific number is replaced with a synthetic value. For higher sensitivity data, use cryptographic hashing so the same input always maps to the same token, which preserves entity consistency across examples.

How fresh can my RAG index realistically stay?

With Vector Search streaming updates and a Pub/Sub-triggered Dataflow job, you can hit sub-minute freshness for individual document updates. Full corpus rebuilds are typically nightly or weekly because of the embedding cost. Most production RAG systems use a hybrid approach: streaming updates for high-priority sources (product catalog, policy documents) and batch updates for everything else.

What evaluation metrics should I track for RAG quality?

Track retrieval metrics (recall@k, mean reciprocal rank) separately from generation metrics (faithfulness, answer relevance, groundedness). Vertex AI offers Gen AI Evaluation Service which automates many of these. The single most useful metric in practice is groundedness, which measures whether the generated answer is supported by the retrieved chunks. A high groundedness score with low retrieval recall means your model is making things up confidently.

Do I need to re-embed my entire corpus when Google upgrades the embedding model?

Yes. Different embedding models produce vectors in incompatible spaces, so a query embedded with text-embedding-005 cannot be matched against chunks embedded with the previous model. Pin the model version in your index metadata, plan for re-embedding as a quarterly maintenance event, and budget the compute cost accordingly.

Further Reading

Official sources

More PDE topics