examlab .net The most efficient path to the most valuable certifications.
In this note ≈ 20 min

Cloud Dataprep and Data Cleaning

3,820 words · ≈ 20 min read ·

Master Cloud Dataprep and data cleaning for the GCP PDE exam. Learn Trifacta-powered visual ETL, recipes, transforms, profiling, and when to choose Dataprep over Dataflow or Dataform.

Do 20 practice questions → Free · No signup · PDE

Introduction to Cloud Dataprep and Data Cleaning

Cloud Dataprep, powered by Trifacta (now part of Alteryx), is the visual data preparation service on Google Cloud. It lets analysts and data engineers explore raw files, fix dirty values, and ship cleaned datasets to BigQuery or Cloud Storage without writing Beam pipelines or SQL. The tool sits in a useful niche: too messy for Dataform, too interactive for Dataflow. This study note walks through the engine, the workflow, and the exact decision criteria the PDE exam expects you to apply when picking Cloud Dataprep over alternatives.

白話文解釋(Plain English Explanation)

Before diving into recipes and transforms, it helps to anchor the abstract pieces of Cloud Dataprep to things you already understand. Three analogies cover the surface area you need.

Cloud Dataprep as a Restaurant Kitchen Prep Station

Picture the back of a restaurant during morning prep. A delivery truck dumps crates of vegetables on the steel counter: muddy carrots, bruised tomatoes, leafy greens still tied with rubber bands. Before any of it touches a customer plate, the prep cooks wash, peel, dice, and sort everything into labelled bins. The prep station is not the stove and it is not the dining room. It is the in-between place where chaos becomes ingredients.

Cloud Dataprep is exactly that prep station for your data. The CSV with mixed date formats, the JSON export with three different country code schemes, the spreadsheet someone emailed with leading whitespace in every column — that is your muddy delivery. The Cloud Dataprep recipe is the prep cook turning the mess into clean, typed columns ready for BigQuery to consume. You are not running the analytical query yet (that is the stove). You are not writing the dashboard (that is the dining room). You are washing the carrots.

Cloud Dataprep as a Library Cataloguing Desk

Imagine a public library that just received a donation of ten thousand books from someone's estate. Before any of those books can hit the shelves, a librarian sits at the cataloguing desk and goes through them one by one: assigning Dewey numbers, fixing missing publication years, removing duplicates that already exist in the collection, and flagging anything in poor physical shape for the binding repair queue.

Cloud Dataprep gives you the cataloguing desk for raw data. The visual grid shows you every column the way the librarian sees every spine. Suggestion cards appear like the librarian's training manual: "this column looks like a phone number, want to standardise the format?" Recipes are the procedures the head librarian wrote down so junior staff can repeat the same cataloguing rules on next month's donation without re-thinking from scratch.

Cloud Dataprep as a Carpenter's Workbench

A carpenter does not start cutting straight from a rough log. The log goes on the workbench first. There the carpenter inspects for knots, planes the surface flat, marks measurements, and only then takes it to the saw. Cloud Dataprep is the workbench. The raw file is the rough log. Dataflow or BigQuery is the saw and the lathe.

The workbench analogy matters because it explains why Cloud Dataprep exists alongside Dataflow. Dataflow is a powerful saw — it cuts at scale, it cuts continuously, it cuts in patterns. But you would not use a table saw to inspect lumber for cracks. You use eyes and hands at a workbench. Cloud Dataprep is the eyes-and-hands tool. Once you know what cuts to make, the recipe gets compiled into a Dataflow job and the saw takes over.

Core Concepts of Cloud Dataprep

Cloud Dataprep operates on a small set of concepts that the exam will probe. Knowing the vocabulary cold prevents losing points on definition-flavoured questions.

A Flow is the top-level container, similar to a project folder. Inside a Flow you arrange datasets, recipes, and outputs into a graph. A Dataset can be either an Imported Dataset (a pointer to a file in Cloud Storage, a BigQuery table, or an upload) or a Reference Dataset (the output of another recipe used as input downstream). A Recipe is the ordered list of transformation steps applied to a dataset. Each step in a recipe is a single transform like Split Column, Extract Pattern, Deduplicate, or Replace Mismatched.

A Transform is the atomic unit of work. Cloud Dataprep ships with hundreds of transforms grouped into categories: structural transforms (pivot, unpivot, transpose), column transforms (split, merge, extract), value transforms (replace, standardise, format), filter transforms (keep mismatched, drop nulls), and aggregation transforms (group by, window). Transforms speak Wrangle Language under the hood, but you rarely type Wrangle directly. You click suggestion cards and the language gets generated for you.

The Sample is the slice of data shown in the visual grid. Cloud Dataprep does not load the whole dataset into the browser — that would melt your laptop on a terabyte file. It pulls a sample (default 10 MB or 1 million rows, configurable up to 1 GB) and applies your recipe to that sample so you can see results in real time. When you finally run the Job, the recipe gets translated into a Dataflow pipeline that processes the entire dataset.

A Profile is the statistical summary Cloud Dataprep generates for every column: data type distribution, null counts, mismatch counts, value frequency, min/max/mean for numeric columns, and a quality bar showing the percentage of valid versus invalid versus empty cells. Profiles run on demand or as part of a job and they are the diagnostic dashboard for data cleaning.

The internal DSL Cloud Dataprep generates from your visual clicks. Each recipe step compiles to a Wrangle statement like splitpatterns col: address on: ',' limit: 3. You can view, edit, and copy Wrangle directly when the visual UI is too slow for bulk edits. See Wrangle Language reference.

Architecture and Design Patterns

Cloud Dataprep is a fully managed service, but understanding what runs where matters for cost, security, and integration design. The user-facing application is a SaaS hosted by Alteryx (the current Trifacta operator), authenticated against your Google identity through OAuth. When you click Run Job, however, the actual data processing happens inside your Google Cloud project on Dataflow workers using a service account you control.

This split-plane architecture has implications. The control plane (the recipe editor, the flow graph, the schedule manager) lives outside your VPC. The data plane (Dataflow workers, Cloud Storage staging buckets, BigQuery sinks) lives inside your project. Sensitive data does not flow to Alteryx servers during job execution; only metadata about your recipe and the sample data you actively browsed.

A typical design pattern looks like this. Raw files land in a Cloud Storage landing bucket from upstream sources — SFTP transfers, Pub/Sub-triggered exports, manual uploads. Cloud Dataprep watches that bucket through an Imported Dataset definition. The recipe normalises columns, drops PII, and writes the cleaned result to a BigQuery staging table. A scheduled flow runs every hour or every night. A separate Dataform workflow then takes the staging table and runs SQL transformations to build the modelled marts.

Notice the layering. Dataprep handles the messy edge between unstructured-ish files and your warehouse. Dataform handles the structured SQL transformations once data is already clean and tabular. Dataflow handles whatever does not fit either box, particularly streaming and complex code-based pipelines. Trying to make Dataprep do streaming or trying to make Dataform clean a CSV with mixed delimiters wastes engineering effort.

Cloud Dataprep is batch only. There is no streaming Dataprep flow. If your source is Pub/Sub or Kafka, you need Dataflow. The exam regularly tests this boundary by describing a real-time scenario and offering Dataprep as a distractor. See Cloud Dataprep concepts.

For lineage and reproducibility, treat Flows as code even though they are visual. Export Flow definitions to JSON and check them into Git. Use the Cloud Dataprep API to programmatically deploy Flows across dev, staging, and production projects rather than clicking through the UI in each environment.

GCP Service Deep Dive

This section unpacks the parts of Cloud Dataprep that the exam questions usually drill into: source/sink connectors, the transform catalogue, sampling, and scheduling.

Sources and Sinks

Imported Datasets in Cloud Dataprep can come from several places. Cloud Storage is the most common — Dataprep reads CSV, TSV, JSON (newline-delimited and standard), Avro, Parquet, and Excel files directly from buckets. BigQuery tables and views work as both sources and sinks; Dataprep pushes table reads down into BigQuery so it does not have to physically copy gigabytes into a sample bucket. Direct upload from your laptop is fine for prototyping but obviously not for scheduled production flows.

Sinks (called Outputs in the UI) write to Cloud Storage in CSV, JSON, Avro, or Parquet, or to BigQuery in append, truncate, drop-and-recreate, or merge modes. The merge mode is particularly useful for slowly changing dimension patterns where you want Dataprep to upsert rows based on a key column.

The Transform Catalogue

Three transforms appear in nearly every PDE-style scenario. Split Column breaks one string into multiple columns by delimiter, position, or regex pattern. The classic use case is splitting a full_name column into first_name and last_name, or breaking an address line into street, city, and postal code. Extract Pattern uses regex or a built-in pattern (email, URL, phone, IP address) to pull a substring into a new column without destroying the original. Deduplicate removes exact-match rows or, with the Group By variant, removes rows that match on a chosen subset of columns while keeping the first occurrence.

Beyond those three, the cleaning workhorses are Replace Mismatched, which targets cells that fail the column's inferred data type (a string in a numeric column, for example) and replaces them with null or a default value, and Standardise, which normalises capitalisation, whitespace, and formatting variations within a single column. Standardise is the transform that fixes the classic problem where a country column contains "USA", "U.S.A.", "United States", and "us" all referring to the same country.

When you are not sure which transform to use, click on a value in the grid and watch the suggestion cards that appear at the bottom. Cloud Dataprep's suggestion engine is the fastest way to discover transforms and learn the Wrangle Language. See Transform suggestions.

Data Profiling

Profiling is what separates Cloud Dataprep from a glorified spreadsheet editor. Every column header in the grid shows a coloured quality bar: green for valid values matching the inferred type, red for mismatched values, grey for nulls. Click the column header and a profile panel opens with histograms, top values, percentiles, and pattern analysis showing the regex patterns most common values follow.

Profiles run on the active sample by default. To get column statistics across the entire dataset, you generate a Job Profile by checking the "Profile Results" box when running a job. The full-dataset profile is what you want for production data quality monitoring; the sample profile is fine for interactive exploration.

Sample Sizing

The default sample is the first 10 MB of your dataset. That works for tiny files and for datasets where the first rows are representative, but it fails badly when data is sorted by date and recent rows have a different schema than old rows, or when bad values cluster in particular partitions.

Cloud Dataprep offers sampling strategies to address this. Random Sample pulls a uniform random subset across the entire file. Stratified Sample picks rows weighted by the distribution of a chosen column, which is essential when you care about rare categories that random sampling would miss. Filter-Based Sample lets you preview only the rows that match a condition, perfect for debugging a specific bad cohort. Anomaly-Based Sample prioritises rows with mismatched or empty cells so you see the broken data first instead of the easy clean rows.

Designing recipes against the default 10 MB sequential sample is the single most common Cloud Dataprep mistake. Your recipe looks perfect in the editor, then the production job runs against the full dataset and discovers values you never saw in the sample. Always switch to Random or Stratified sampling before finalising a recipe for production. See Cloud Dataprep sampling overview.

Scheduled Flows

Production Cloud Dataprep is not someone clicking Run every morning. Scheduled Flows let you trigger a flow on a cron-like schedule. Each scheduled run executes every output destination in the flow in dependency order. Schedules support timezone configuration, hourly through monthly cadences, and pause/resume. Failed runs send notifications to configured email addresses.

For event-driven runs (rather than time-based), use the Cloud Dataprep API to trigger flows from Cloud Composer, Cloud Functions, or Workflows. A common pattern is a Cloud Storage trigger that fires a Cloud Function which calls the Dataprep API to run the flow whenever a new file lands in the landing bucket.

Common Pitfalls and Trade-offs

The boundaries of Cloud Dataprep are where engineers get tripped up. Below are the recurring mistakes worth committing to memory before exam day.

Cloud Dataprep generates Dataflow jobs under the hood, which means you pay both a Cloud Dataprep platform fee per user per month and the Dataflow processing fees per run. Teams often see the per-user platform fee and assume that covers everything, then get surprised when the Dataflow line item shows up. For high-volume cleaning jobs, the Dataflow cost can dominate. Always model both fees together when sizing a Dataprep deployment.

Recipes can become unreadable monsters. A recipe with 200 steps takes minutes to render in the UI and is impossible to review in code review. Break large recipes into smaller chained recipes using Reference Datasets. Each recipe should do one logical thing — type casting, deduplication, joining, formatting — and the next recipe in the chain consumes the output of the previous.

Cloud Dataprep has limited support for custom code. There are user-defined functions, but they are written in the Wrangle Language, not in Python or Java. If your transformation needs an external API call, a custom machine learning model inference, or anything more exotic than string manipulation and arithmetic, you have hit the ceiling. Switch to Dataflow.

Joins in Cloud Dataprep work, but they are not the right tool for production data warehouse joins across billion-row tables. The visual join experience is great for understanding what a join produces; the actual execution at scale is better handled by BigQuery SQL after Dataprep has cleaned the inputs. Use Dataprep for join exploration, BigQuery for join execution.

Schema drift bites everyone eventually. A new column appears in the source file, or an existing column changes type, and the recipe either silently ignores the new column or breaks at the column-rename step. Build schema validation into the start of your recipe — a column count check or a header validation step — and configure email alerts on job failure so drift gets caught immediately rather than discovered three days later when the dashboard looks wrong.

If your Cloud Dataprep job fails with a Dataflow worker out-of-memory error, the usual cause is a single-key skew in a Group By or Join transform. One key has billions of rows while others have thousands. The fix is to pre-aggregate or pre-filter in the source query before importing into Dataprep, not to throw bigger Dataflow workers at it. See Dataflow troubleshooting.

Best Practices

A short list of practices that consistently produce maintainable Cloud Dataprep deployments.

  • Pin recipe steps to column names rather than positional indexes whenever possible. Position-based references break the moment someone reorders columns upstream.
  • Add a header comment as the first step of every recipe describing the input contract: expected columns, expected types, and known data quality issues being handled.
  • Use Reference Datasets to share common cleaning logic (PII redaction, country code normalisation) across multiple flows. Update the shared recipe in one place instead of copy-pasting steps.
  • Profile production data weekly even when nothing has obviously broken. Watching the quality bar trend over time catches gradual data degradation before it becomes a fire.
  • Keep Flows export JSON in version control. Treat the JSON as the source of truth and rebuild dev and staging environments from it.
  • Set Dataflow worker autoscaling bounds explicitly when scheduling jobs. The default autoscaling can blow past expected cost ceilings on a bad-data day with unexpected row counts.
  • Tag every Cloud Dataprep flow with cost-centre and owner labels through the underlying Dataflow job labels. This makes the bill comprehensible six months later when someone asks "what is this Dataflow line item from?"
  • Document the sampling strategy chosen for each recipe in the recipe header comment. Future maintainers need to know whether the recipe was tested against random, stratified, or sequential sample.

Cloud Dataprep platform fee is per user per month, not per job. Dataflow fees are per job run. Total cost equals user licences plus job execution. Both lines appear on the Google Cloud bill. See Cloud Dataprep pricing.

Real-World Use Case

Consider a mid-size insurance company with around 800 employees and a single five-person data team. The marketing department receives weekly customer prospect files from a dozen partner brokers. Each broker uses a different Excel template, different column ordering, different date formats (US, European, ISO), different ways of representing missing values ("N/A", "null", blank, "-"), and different country code conventions. Some files have leading and trailing whitespace in every cell because they were exported from a system that pads strings to fixed widths.

Before Cloud Dataprep, a marketing analyst spent three days per week manually fixing these files in Excel before loading them into the CRM. Errors slipped through; duplicate prospects got contacted twice; bad postal codes caused mailings to be returned at significant cost.

The data team built one Cloud Dataprep flow per broker (twelve flows total). Each flow points at the broker's specific Cloud Storage subfolder. Cloud Storage triggers fire a Cloud Function on file arrival. The function calls the Dataprep API to run the appropriate flow. The flow standardises columns to a canonical schema, normalises country codes to ISO 3166, parses dates to ISO 8601, deduplicates against the previous week's file using fuzzy matching on email, and outputs to a BigQuery staging table. A scheduled Dataform workflow then merges the staging table into the master prospect table and updates the CRM through Reverse ETL.

The marketing analyst's three-day weekly task became a fifteen-minute review of the Cloud Dataprep job profile reports. The data team owns the twelve flows and updates them when a broker changes their export format, which now happens through a code review on the Flow JSON in Git rather than through frantic Slack messages.

The tooling boundary is worth noting. The team rejected a pure Dataflow build because the marketing analyst needed to be able to read and occasionally tweak the cleaning rules without writing Java. They rejected a pure SQL build in Dataform because the broker files have semi-structured weirdness (merged Excel cells, header rows in the middle of data) that SQL cannot reasonably handle. Cloud Dataprep occupied the right slot.

Exam Tips

The PDE exam tests Cloud Dataprep through scenario questions, not memorised facts. The pattern to recognise: a stakeholder describes a data preparation problem, four services are offered as candidates, and you must pick the best fit. The discriminating signals to watch for follow.

Pick Cloud Dataprep when the scenario emphasises non-engineer users, visual exploration, interactive transformation, batch source files, profiling and quality assessment, or iterative recipe refinement. The presence of words like analyst, business user, exploratory, or self-service is a strong Dataprep signal.

Pick Dataflow instead when the scenario mentions streaming, low latency, custom code, large-scale joins as the primary workload, complex windowing, or anything Beam-specific. Streaming is the cleanest disqualifier for Dataprep.

Pick Dataform instead when the scenario describes SQL transformations on data that is already in BigQuery and already has a clean schema. Dataform is for SQL-based ELT modelling; Dataprep is for upstream visual cleaning. They complement rather than compete.

Pick Data Fusion when the scenario emphasises hundreds of pre-built connectors to non-GCP systems (Salesforce, ServiceNow, on-premises databases) or a code-free pipeline that is heavier than visual data prep. Data Fusion is the heavier code-free ETL tool; Dataprep is lighter and more analyst-focused.

Remember that Cloud Dataprep recipes execute as Dataflow jobs. Cost-related questions where the answer involves Dataflow billing apply equally to Dataprep. Security questions about VPC controls, customer-managed encryption keys, and IAM apply through the Dataflow execution layer.

The exam loves to test the Dataprep is batch-only boundary, the per-user licence plus per-job Dataflow cost model, and the sampling strategy must match production data shape principle. Memorise these three. See Cloud Dataprep documentation.

Frequently Asked Questions

Is Cloud Dataprep free with a Google Cloud account?

No. Cloud Dataprep is a paid third-party service operated by Alteryx (formerly Trifacta) and integrated with Google Cloud. Pricing is a per-user monthly subscription with three editions (Starter, Professional, Enterprise) plus the Dataflow execution costs incurred when recipes run. There is a free Starter tier with limited features and limited job runs suitable for evaluation, but production usage requires a paid plan.

Can Cloud Dataprep process streaming data from Pub/Sub?

No. Cloud Dataprep is a batch-only service. The underlying Dataflow jobs it generates are batch jobs, not streaming jobs. If your source is Pub/Sub, Kafka, or any other streaming source, you need to build a Dataflow streaming pipeline directly. A common workaround is to micro-batch by writing streaming events to Cloud Storage in small files and then having Cloud Dataprep process those files on a frequent schedule, but this still introduces minutes of latency and is not true streaming.

What is the difference between Cloud Dataprep and Dataflow?

Dataflow is the underlying execution engine; Cloud Dataprep is a visual interface that compiles to Dataflow jobs. Use Cloud Dataprep when an analyst needs to interactively clean data with point-and-click transforms. Use Dataflow directly when you need streaming, custom Java or Python code, complex windowing, or fine-grained control over pipeline behaviour. The cost model differs too: Dataprep adds a per-user platform fee on top of the Dataflow execution cost.

Should I use Cloud Dataprep or Dataform for ELT in BigQuery?

Use Cloud Dataprep for the messy upstream stage where files arrive with inconsistent schemas, mixed types, or quality issues that need profiling. Use Dataform for the downstream SQL modelling stage where data is already in BigQuery in tabular form and you need to build dimensional models, aggregates, and tested transformations. The two tools chain naturally: Dataprep cleans files into a BigQuery staging table, Dataform models the staging table into marts.

How do I version control Cloud Dataprep flows?

Cloud Dataprep does not have built-in Git integration, but Flows can be exported as JSON through the UI or via the Cloud Dataprep API. The recommended pattern is to treat the exported JSON as source of truth, commit it to a Git repository, and use the Dataprep API to redeploy Flows to dev, staging, and production projects through your CI/CD pipeline. Direct UI edits should happen only in development; promotion to higher environments goes through code review on the JSON.

What sampling strategy should I use when designing a recipe?

Default sequential sampling (the first 10 MB of the file) is risky because it assumes the first rows are representative of the whole dataset. Switch to Random Sample for general-purpose recipes, Stratified Sample when you care about rare categories that random sampling might miss, and Anomaly-Based Sample when you are specifically debugging data quality issues. Always verify your recipe against a representative sample before scheduling it for production.

Does Cloud Dataprep support customer-managed encryption keys (CMEK)?

Yes, through the underlying Dataflow execution. Configure CMEK on the Dataflow service account that Cloud Dataprep uses to run jobs, and on the Cloud Storage staging buckets and BigQuery datasets that Dataprep reads and writes. The Cloud Dataprep control plane stores recipe metadata and sample data outside your project, so be aware that the visual sample you browse is not protected by your CMEK; only the production job execution and its outputs are.

Further Reading

  • Cloud Dataprep documentation — official service documentation including concepts, how-to guides, and the Wrangle Language reference.
  • Cloud Dataprep overview — high-level concept guide covering Flows, Recipes, Datasets, and the execution model.
  • Dataflow documentation — the execution engine that runs every Cloud Dataprep job; understanding Dataflow cost and tuning applies directly to Dataprep cost and tuning.
  • Dataform documentation — the SQL-based ELT companion that handles the downstream modelling stage after Cloud Dataprep finishes cleaning.

Official sources

More PDE topics