Introduction to BigQuery ML Inference and Deployment
BigQuery ML inference and deployment is the practice of generating predictions from trained models without leaving the warehouse. You write SQL, BigQuery moves the compute next to the data, and the predictions land in a table that downstream pipelines can query immediately. For a Professional Data Engineer, this matters because most production ML problems are not training problems; they are serving problems. Where do predictions live, who refreshes them, how do you swap a model without breaking the dashboard?
BigQuery ML inference and deployment covers a spectrum. On one end you have nightly batch scoring of a churn model over a terabyte fact table. On the other end you have a remote model calling Gemini through Vertex AI to summarise customer reviews row by row. In between sit object tables for unstructured data, embedding generation for semantic search, and exported TensorFlow artefacts running on a low-latency Vertex AI endpoint. This guide walks through each scenario with the trade-offs and the SQL patterns the exam expects you to recognise.
BigQuery ML inference and deployment is billed by bytes processed for built-in models and by Vertex AI tokens or invocations for remote models. The two pricing axes are independent, and most surprise bills come from confusing them. See https://cloud.google.com/bigquery/pricing#bqml for the current rates.
白話文解釋(Plain English Explanation)
Think of BigQuery ML inference and deployment as a kitchen where the pantry cooks for you
In a normal restaurant, the chef walks to the pantry, carries ingredients to a separate kitchen, cooks, and walks the plate back to the dining room. That round trip is what classical ML inference looks like: data sits in BigQuery, gets exported to a separate ML server, predictions come back, and someone joins them to the original table. BigQuery ML inference and deployment removes the trip. The pantry has burners. You point at a shelf with SELECT * FROM ML.PREDICT(...) and the meal arrives where the ingredients already lived. That is why batch scoring of a billion rows feels boring in BigQuery while it would be a multi-day ordeal in most other stacks.
Think of remote models as ordering takeout from a Michelin restaurant down the street
Some dishes you do not cook in-house. Image classification, large language model generation, and text embeddings all need specialised hardware that you do not want to manage. A remote model in BigQuery is like a phone line to a takeout place. You hand over the order (a row of text), the kitchen at Vertex AI cooks it with Gemini or an embedding model, and the result comes back to your table. The dining room never sees the GPUs. You still pay for the food, but you stop paying rent on a kitchen you only use twice a week.
Think of model versioning and A/B testing as running two menus in parallel
A confident chef does not replace the menu overnight. They print version 12 next to version 11, send half the diners to each, and watch which plates come back empty. Vertex AI Model Registry plays that role for BigQuery ML inference and deployment. You register version 11 of the churn model, register version 12 alongside it, and split prediction traffic. If version 12 wins, you promote the alias. If it loses, the rollback is a single command. The registry is the binder of menus, the endpoint is the dining room, and your evaluation queries are the head chef tasting plates after service.
Inference is the act of feeding new, unseen rows into a trained model to produce a prediction, score, embedding, or generated text. In BigQuery ML, inference is always a SELECT against a function such as ML.PREDICT, ML.GENERATE_TEXT, or ML.GENERATE_EMBEDDING. See https://cloud.google.com/bigquery/docs/bqml-introduction#supported_models for the full list of model families.
Core Concepts of BigQuery ML Inference and Deployment
The inference function family
BigQuery ML inference and deployment is exposed as a small family of table-valued functions. ML.PREDICT runs supervised models. ML.FORECAST returns the next N steps from an ARIMA_PLUS model. ML.RECOMMEND produces top-K items from matrix factorisation. ML.GENERATE_TEXT, ML.GENERATE_EMBEDDING, ML.UNDERSTAND_TEXT, and ML.TRANSLATE reach Vertex AI through remote models. Each function takes the model handle as the first argument and a SQL subquery as the second. The output is always a table you can JOIN, filter, or write back into another table.
Built-in models versus remote models
Built-in models live entirely inside BigQuery. The training data, the weights, and the inference compute all stay in your project. Linear regression, logistic regression, k-means, matrix factorisation, ARIMA_PLUS, boosted trees, DNN, and Wide-and-Deep are all built-in. Remote models are pointers. The model handle in BigQuery only stores the connection details and the endpoint name; the actual weights live in Vertex AI Model Garden or on a custom endpoint. This split is the single most important mental model for the exam, because billing, latency, and quotas behave very differently across the two.
Imported models and exported models
The third category bridges BigQuery and the wider TensorFlow ecosystem. You can CREATE MODEL ... OPTIONS(MODEL_TYPE='TENSORFLOW', MODEL_PATH='gs://bucket/saved_model/') to load a SavedModel into BigQuery and then run ML.PREDICT on it as if it were native. Going the other direction, EXPORT MODEL my_model OPTIONS(URI='gs://bucket/export/') writes a SavedModel or XGBoost booster to Cloud Storage so it can be deployed on a Vertex AI endpoint, on Cloud Run, or on the edge. This portability is what makes BigQuery a viable training environment for serving infrastructures it does not own.
When you intend to deploy on Vertex AI for low-latency serving, train with a model type that exports cleanly. Linear models, boosted trees, and DNNs all export to TensorFlow SavedModel. AutoML Tables models trained inside BigQuery cannot be exported, so plan that constraint up front. Reference https://cloud.google.com/bigquery/docs/exporting-models#model_export_format_and_sample_for_each_model_type.
Architecture and Design Patterns
Pattern 1: pure batch scoring
The simplest BigQuery ML inference and deployment pattern keeps everything inside BigQuery. A scheduled query runs ML.PREDICT against last night's fact table, writes the results to predictions.churn_scores_YYYYMMDD, and a Looker dashboard reads from a view that always points at the latest partition. There is no endpoint, no container, no autoscaler. Cost scales with bytes scanned, not with QPS. This pattern fits churn scoring, propensity models, demand forecasting, and any other workload where freshness measured in hours is good enough.
Pattern 2: hybrid batch plus online
When a few percent of inferences need millisecond latency but the majority can run in batch, you keep the BigQuery ML inference and deployment pattern for the bulk and export the model to a Vertex AI endpoint for the hot path. The same trained weights serve both worlds. Your application calls the Vertex endpoint for live recommendations during a checkout, while a nightly job pre-computes scores for the entire catalogue for email campaigns. The trick is that both must train from the same feature pipeline so the online and offline scores stay aligned.
Pattern 3: remote LLM enrichment
Remote models unlock a pattern that did not exist before generative AI. You have a table of two million customer support tickets. You want a one-line summary, a sentiment label, and a category for each. You write a single SQL statement against ML.GENERATE_TEXT referencing a Gemini remote model, BigQuery batches the rows into Vertex AI calls under the hood, and the results land in a new column. This BigQuery ML inference and deployment pattern is reshaping ETL because the enrichment step that used to require a Python service with retry logic is now a query.
Pattern 4: object table inference for unstructured data
Object tables make Cloud Storage objects look like rows. You point an object table at a folder of PDFs or images, and each row exposes the URI, content type, and metadata. With a remote vision or document model, ML.PREDICT reads the binary, sends it through Vertex AI, and returns structured fields. This is how BigQuery ML inference and deployment now reaches OCR, image labelling, and document understanding without anyone writing a Cloud Function.
Remote models depend on a BigQuery connection resource (an external connection of type CLOUD_RESOURCE). The connection's service account must have the Vertex AI User role on the project hosting the endpoint. Forgetting this grant is the most common reason ML.GENERATE_TEXT returns a 403. See https://cloud.google.com/bigquery/docs/generate-text-tutorial#grant-permissions.
GCP Service Deep Dive
ML.PREDICT for built-in models
ML.PREDICT is the workhorse of BigQuery ML inference and deployment. The signature is ML.PREDICT(MODEL dataset.model_name, (subquery), STRUCT(threshold AS threshold, keep_original_columns AS keep_original_columns)). The subquery must return columns whose names and types match the training schema; if the model used TRANSFORM, you pass raw columns and BigQuery applies the transform automatically before scoring. For classification, the output adds predicted_<label>, predicted_<label>_probs, and the original columns when you set keep_original_columns => TRUE. For regression, it adds predicted_<label> only.
A typical scoring query joins the prediction back to a customer dimension and writes the result with MERGE into a slowly-changing table. BigQuery handles the parallelism. A model with ten thousand parameters scoring a hundred million rows finishes in minutes on a healthy reservation. The latency floor for a single-row ML.PREDICT is in the hundreds of milliseconds, so do not use this function as a synchronous backend for an interactive UI; that is what Vertex AI endpoints exist for.
ML.EXPLAIN_PREDICT and ML.GLOBAL_EXPLAIN
Explainability is no longer optional. ML.EXPLAIN_PREDICT returns the same predictions as ML.PREDICT plus a top_feature_attributions array with each contributing feature and its Shapley-style value. For a credit decision, this is the column you show the regulator. ML.GLOBAL_EXPLAIN aggregates those attributions across an evaluation set and returns the overall feature importance. Use the local function for individual decision audits and the global function for model documentation. Both work on linear, boosted tree, and DNN models; they are not supported on remote models because the underlying weights are not exposed.
ML.GENERATE_TEXT and the Gemini integration
ML.GENERATE_TEXT is the front door to Gemini from SQL. You first create a remote model: CREATE OR REPLACE MODEL my_dataset.gemini_flash REMOTE WITH CONNECTION 'us.my_connection' OPTIONS(ENDPOINT = 'gemini-2.5-flash'). Then you call SELECT * FROM ML.GENERATE_TEXT(MODEL my_dataset.gemini_flash, (SELECT prompt FROM source_table), STRUCT(0.2 AS temperature, 1024 AS max_output_tokens, TRUE AS flatten_json_output)). The result includes the generated text, safety attributes, and the raw JSON response. Temperature, top-k, top-p, and JSON mode are all passed through the STRUCT.
For BigQuery ML inference and deployment workloads that summarise, classify, or extract from text, this is the cleanest path. BigQuery batches the rows, retries on transient failures, and respects per-project Vertex AI quotas. The cost is the Gemini per-token rate; the BigQuery side only charges for bytes scanned in the subquery, which is usually negligible compared to the LLM call.
ML.GENERATE_EMBEDDING for semantic search
Embeddings are the other side of the LLM coin. ML.GENERATE_EMBEDDING calls a Vertex AI text or multimodal embedding model and returns a FLOAT64 array per row. You typically write the embeddings into a column on the same table, then use VECTOR_SEARCH or a CREATE VECTOR INDEX to retrieve nearest neighbours. This is how a BigQuery ML inference and deployment workflow powers a "find similar tickets" feature without a separate vector database. The same function also handles image embeddings when pointed at an object table, which lets you search a product catalogue by image similarity using only SQL.
ML.UNDERSTAND_TEXT and ML.TRANSLATE
ML.UNDERSTAND_TEXT wraps the Cloud Natural Language API. It returns sentiment scores, entity extraction, syntax analysis, and content classification depending on the feature you request. ML.TRANSLATE wraps Cloud Translation. Both use remote models against Google's pre-built APIs, so you do not train anything; you point a model at the API and run SQL. These functions exist for teams that need NLP results without standing up a custom pipeline. They share the same connection and IAM model as the Gemini remote models.
ML.EVALUATE for ongoing quality checks
ML.EVALUATE is not just a training-time function. The exam loves to test the BigQuery ML inference and deployment pattern where you re-run ML.EVALUATE weekly against a fresh holdout set and compare metrics against a baseline. For classification, it returns precision, recall, accuracy, F1, log loss, and ROC AUC. For regression, MAE, MSE, MSLE, median absolute error, and R-squared. For clustering, the Davies-Bouldin index and mean squared distance. For ARIMA_PLUS, MAE, MAPE, and RMSE per forecast horizon. Pipe the output into a monitoring table and alert when any metric drifts beyond a threshold.
Model export and Vertex AI deployment
EXPORT MODEL writes the trained model to Cloud Storage in a format Vertex AI understands. From there, gcloud ai models upload registers it in Model Registry and gcloud ai endpoints deploy-model serves it on managed infrastructure. The same model handle can also be registered directly from BigQuery using the Vertex AI integration, which skips the GCS hop. Once the model is on an endpoint, you get HTTP/gRPC inference, traffic splitting between versions, autoscaling, and Cloud Monitoring metrics. This is where the BigQuery ML inference and deployment story formally hands off to MLOps.
Versioning and aliases in Model Registry
Vertex AI Model Registry treats every retrain as a new version of the same model resource. You can promote a version with the default alias, which downstream code uses without knowing the version number, and you can attach custom aliases such as champion and challenger for A/B tests. A BigQuery ML inference and deployment pipeline that re-trains nightly registers each new run, lets the evaluation job compare metrics against the current champion, and only promotes the alias when the new version wins on the agreed metric. The audit trail is automatic.
A/B testing on Vertex AI endpoints
Once two model versions are deployed to the same endpoint, you set traffic splits like 90/10 or 50/50 in the deployment configuration. Requests are routed accordingly and request-level logging captures which version handled each call. Pair this with prediction logging into BigQuery and you can join the served predictions with the eventual outcomes (a click, a churn event, a conversion) and compute lift between the two versions. This closes the loop on BigQuery ML inference and deployment because the data warehouse is now the analytics surface for the very models it helped train.
For LLM A/B tests, you do not need a Vertex AI endpoint at all. Create two remote models pointing at two prompt templates or two Gemini variants, run both with ML.GENERATE_TEXT on a sample of rows, and have human raters score the outputs. The whole experiment fits in a few SQL statements and a Looker Studio dashboard. Reference https://cloud.google.com/bigquery/docs/generate-text.
Common Pitfalls and Trade-offs
Pitfall 1: assuming ML.PREDICT is low-latency
ML.PREDICT is a query. It goes through query parsing, slot allocation, and result materialisation. Even on a single row, the floor is hundreds of milliseconds. Teams that try to call it from a synchronous web request end up disappointed. The fix is either to pre-compute predictions in batch and serve them from a Memorystore cache, or to export the model and host it on a Vertex AI endpoint. Recognising which side of this line a workload sits on is half of the BigQuery ML inference and deployment design problem.
Pitfall 2: ignoring quotas on remote models
Remote models share Vertex AI quotas with everything else in your project. A ML.GENERATE_TEXT query over ten million rows can blow through your tokens-per-minute quota and start failing partway through. BigQuery will retry, but only within limits. The professional pattern is to chunk large jobs into multiple queries with LIMIT and OFFSET, monitor Vertex AI quota dashboards, and request quota increases ahead of time. Treat remote model calls as an external dependency, not as a free function.
Pitfall 3: skipping TRANSFORM and getting training-serving skew
If you do feature engineering in a query before training and then forget to apply the same engineering before inference, your predictions will be wrong in subtle ways. CREATE MODEL ... TRANSFORM(...) solves this by binding the feature transformations to the model itself. At inference time, ML.PREDICT applies the same transforms automatically. This is the single biggest BigQuery ML inference and deployment hygiene win, and the exam will test whether you remember to use it.
Pitfall 4: not invalidating the prediction table
A common batch BigQuery ML inference and deployment design writes predictions into a partitioned table. If a retrain happens mid-day and you keep appending, the partition now contains predictions from two different model versions. Querying it gives a meaningless mix. The fix is to include model_version as a column on the prediction table, or to write to date-and-version partitions, so consumers can always pick a consistent slice.
Do not assume ML.PREDICT returns rows in the same order as your input subquery. BigQuery is a parallel engine and ordering is undefined unless you add ORDER BY to the outer query. Many BigQuery ML inference and deployment bugs come from joining predictions back to inputs by row number instead of by a real key. Always carry a primary key column through the subquery and join on it.
Pitfall 5: forgetting export limits
EXPORT MODEL does not work for every model type. AutoML, ARIMA_PLUS, matrix factorisation, and remote models cannot be exported. If your downstream plan is to deploy on Vertex AI or the edge, pick a model family that supports export before you start training. The error message at export time is unambiguous, but discovering it after a week of tuning is a bad day.
Best Practices
- Always carry a primary key column through
ML.PREDICTsubqueries so you can join predictions back to source rows by ID, never by row order. - Use
CREATE MODEL ... TRANSFORMfor any feature engineering so training and inference share the same logic with no chance of drift. - Schedule
ML.EVALUATEagainst a fresh holdout weekly and write the results to a metrics table. Alert on drift, do not wait for a complaint. - Tag every prediction row with
model_name,model_version, andprediction_timestamp. Future debugging will thank you. - For remote model jobs over a million rows, batch with
LIMIT/OFFSETor process by partition so a single query does not exhaust Vertex AI quotas. - Reserve BigQuery slots for predictable inference workloads. On-demand pricing is fine for ad hoc work, but a recurring nightly scoring job benefits from committed capacity.
- When deploying to a Vertex AI endpoint, start with traffic splits (90 percent champion, 10 percent challenger) before promoting a new model.
- Keep a separate connection resource per environment (dev, staging, prod) so IAM grants for remote models stay scoped.
Real-World Use Case
A mid-sized European retailer with around 18 million active customers needed to run three BigQuery ML inference and deployment workloads in parallel. The first was a churn model used for weekly retention campaigns. A boosted tree model trained on 90 days of transaction history scored every customer overnight, and the predictions landed in a partitioned table that the CRM tool read directly. Total runtime was under twenty minutes on a 500-slot reservation, and the marketing team got fresh scores by 06:00 every morning.
The second workload was personalised product recommendations for the homepage. The team trained a matrix factorisation model in BigQuery, exported the embeddings for each customer and product into a separate table, and then served the top-K via a Vertex AI endpoint backed by a custom container. The endpoint replied in under 80 milliseconds at the 95th percentile. When a new model version was retrained, they deployed it alongside the old one with a 90/10 traffic split and used the conversion events flowing back into BigQuery to decide promotion within seven days.
The third workload was new. The customer service team wanted automatic categorisation, sentiment, and a suggested reply for every incoming email. Instead of building a Python service, the data engineering team created a remote Gemini model via ML.GENERATE_TEXT and ran it as a scheduled query every fifteen minutes against new rows in the email ingest table. Throughput averaged 12,000 rows per run, well within Vertex AI quotas. The whole BigQuery ML inference and deployment pipeline for this workload was 40 lines of SQL and a single connection resource. The customer service team saw average response time drop by 35 percent within two months, and the engineering footprint was zero new services.
Exam Tips
- Know the difference between batch inference (BigQuery, query latency, cents per terabyte) and online inference (Vertex AI endpoint, sub-100 ms, per-node-hour pricing). The exam loves to give a latency requirement and ask you to pick.
- Remember that
ML.PREDICTrequires column names matching the training schema unless you usedTRANSFORM, in which case raw columns work. ML.EXPLAIN_PREDICTreturns local feature attributions per row.ML.GLOBAL_EXPLAINreturns model-wide feature importance. They are different functions, do not confuse them.- Remote models require a CLOUD_RESOURCE connection and the Vertex AI User role on the connection's service account. This grant is the most tested operational detail.
ML.GENERATE_TEXToutputs includeml_generate_text_status(an empty string on success, an error message on failure). Check this column before trusting the generated text.EXPORT MODELproduces TensorFlow SavedModel for DNN, linear, and Wide-and-Deep models, and Booster format for boosted trees. ARIMA_PLUS, matrix factorisation, and AutoML cannot be exported.- Vertex AI Model Registry supports versioning, aliases, and traffic splitting on endpoints. A/B testing in production is done by traffic split, not by deploying two endpoints.
- For semantic search use cases,
ML.GENERATE_EMBEDDINGplusVECTOR_SEARCHis the BigQuery-native answer. Do not propose a separate vector database unless the question explicitly demands one. - ARIMA_PLUS uses
ML.FORECASTfor inference, notML.PREDICT. This catches people who memorised one function name.
The Professional Data Engineer exam tests architectural choice under constraints. When you see a question that mentions latency under 100 ms, treat that as a signal to export the model and serve from Vertex AI. When you see "millions of rows nightly," that is a signal for in-warehouse batch with ML.PREDICT. The constraint determines the answer. Reference https://cloud.google.com/bigquery/docs/bqml-introduction.
Memorise the BQML inference function map: ML.PREDICT for batch scoring of built-in models, ML.EVALUATE for re-checking precision/recall/RMSE on a weekly holdout, and ML.EXPLAIN_PREDICT for per-row Shapley feature attributions. For deployment, EXPORT MODEL writes a TensorFlow SavedModel or XGBoost booster to GCS for a Vertex AI endpoint, while a remote model created with CREATE MODEL ... REMOTE WITH CONNECTION calls Gemini or embedding endpoints inside Vertex AI from SQL. ARIMA_PLUS, matrix factorisation, AutoML, and remote models cannot be exported.
Frequently Asked Questions
Can ML.PREDICT serve a real-time API directly?
Not in the way most teams want. Each invocation is a BigQuery query with hundreds of milliseconds of overhead even for a single row, and there is no per-row pricing model. For real-time serving, export the model with EXPORT MODEL, register it in Vertex AI Model Registry, and deploy it to an endpoint. The endpoint will return predictions in under 100 ms at the 95th percentile and bills per node-hour rather than per query.
What happens if a remote model call fails partway through ML.GENERATE_TEXT?
BigQuery handles transient failures with retries up to a small limit. Persistent failures land in the ml_generate_text_status column for that row as a non-empty error string, while the rest of the rows succeed. Your downstream pipeline should filter or branch on that column. For total failures of the connection or the endpoint itself, the whole query fails and you can re-run after fixing IAM or quota.
How do I A/B test two BigQuery ML models?
The cleanest pattern is to run ML.PREDICT against both models on the same input, write the predictions side by side with a model_id column, and then join the eventual outcomes (a click, a conversion, a churn event) back to compare lift. For online serving, deploy both versions to the same Vertex AI endpoint with a traffic split, log requests to BigQuery, and run the same join. The analytics layer is identical in both cases.
When should I use ML.UNDERSTAND_TEXT versus ML.GENERATE_TEXT?
ML.UNDERSTAND_TEXT calls the Cloud Natural Language API for fixed tasks like sentiment, entities, and syntax. The output is structured and the model is pre-trained. ML.GENERATE_TEXT calls a Gemini variant and returns whatever the prompt asks for. Use the former when you want consistent structured output for a known task. Use the latter when you need flexibility, custom categories, or generated content like summaries and replies.
Do I need Vertex AI Pipelines for a BigQuery ML inference and deployment workflow?
Not necessarily. For pure batch scoring, BigQuery scheduled queries plus a small Cloud Composer DAG are usually enough. Vertex AI Pipelines becomes worthwhile when the workflow includes custom training containers, hyperparameter tuning, or multi-step orchestration that crosses several services. Many production BigQuery ML inference and deployment systems run for years on scheduled queries alone.
Can I use BigQuery ML embeddings with a vector index for semantic search?
Yes, this is now a first-class pattern. Generate embeddings with ML.GENERATE_EMBEDDING, store them in a column, create a VECTOR INDEX on that column, and query with VECTOR_SEARCH. The index supports approximate nearest neighbour search at scale, and the whole BigQuery ML inference and deployment pipeline stays in SQL. This avoids the operational cost of a separate vector database for most analytical use cases.
How do I handle model versioning when retraining nightly?
Treat each nightly retrain as a new version. Train into a versioned model name like churn_model_v20260512, run ML.EVALUATE against a fresh holdout, and only update an alias view (CREATE OR REPLACE VIEW churn_model_current AS SELECT * FROM ML.PREDICT(MODEL churn_model_v20260512, ...)) when the metrics beat the current champion. For Vertex AI endpoints, use Model Registry aliases instead of view indirection. Either way, never overwrite the model in place.
Related Topics
- BigQuery Data Modeling and Clustering — partitioning and clustering choices that determine how cheap your
ML.PREDICTqueries will be at scale. - Vertex AI Feature Store and Pipelines — the canonical place to keep features that both BigQuery ML and online endpoints share, eliminating training-serving skew.
- MLOps with Vertex AI — Model Registry, endpoints, traffic splitting, and continuous evaluation patterns that finish the BigQuery ML inference and deployment story.
Further Reading
- Google Cloud, "Introduction to BigQuery ML": https://cloud.google.com/bigquery/docs/bqml-introduction
- Google Cloud, "The ML.PREDICT function": https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict
- Google Cloud, "Generate text by using the ML.GENERATE_TEXT function": https://cloud.google.com/bigquery/docs/generate-text
- Google Cloud, "Exporting models from BigQuery ML": https://cloud.google.com/bigquery/docs/exporting-models
- Google Cloud, "Vertex AI Model Registry overview": https://cloud.google.com/vertex-ai/docs/model-registry/introduction