Data Analytics and Looker — CDL Study Notes

What Is the Google Cloud Data Analytics Pipeline?

From raw data to business decisions

Modern businesses generate data at an unprecedented rate — every website click, every IoT sensor reading, every credit card swipe is a potential source of insight. But raw data sitting in a database is worthless. To turn data into decisions, you need a full analytics pipeline that moves data from its origin to a dashboard on an executive's screen.

Five integrated stages on Google Cloud

Google Cloud provides a complete, integrated set of services that map cleanly onto five stages: ingest → store → process → analyze → visualize. For the Cloud Digital Leader (CDL) exam, you do not need to know how to write code for any of these services. You need to recognize the name of each tool, understand what stage of the pipeline it serves, and pick the right one for a given business scenario.

Why the Looker family deserves a deep dive

The visualization layer — where the data finally meets the human eye — is dominated by two Google products: Looker (the enterprise BI platform) and Looker Studio (the free self-service dashboard tool, formerly Data Studio). Confusing these two is one of the most common CDL mistakes, and understanding the difference between governed enterprise BI and ad-hoc reporting is one of the most heavily tested concepts in the analytics section of the exam. This study note walks through the full pipeline from end to end, with the deepest dive reserved for the Looker family.

The Five Stages of the Analytics Pipeline

Every data analytics workload on Google Cloud, no matter how complex, can be broken down into the same five stages. Memorize these — they appear in dozens of CDL scenario questions:

Ingest — get the data into the cloud (often via Pub/Sub for streaming or Storage Transfer Service for batch).
Store — put it somewhere durable and queryable (BigQuery for structured warehouse, Cloud Storage for raw data lake).
Process — clean, transform, and enrich (Dataflow, Dataproc, Cloud Composer).
Analyze — run queries to answer business questions (BigQuery SQL, BigQuery ML).
Visualize — present results to humans (Looker, Looker Studio, Connected Sheets).

A retail company, for example, might ingest point-of-sale transactions via Pub/Sub, store them in BigQuery, process them with Dataflow to clean up nulls and convert currencies, analyze quarterly sales with BigQuery SQL, and visualize the results on a Looker dashboard for the CFO. Every stage is managed by Google — the customer focuses on the business question, not the plumbing.
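
As an optional study aid, the retail example above can be written down as a simple lookup from stage to service. This is only a sketch in Python: the dictionary and function names are made up for this note and nothing here calls a Google Cloud API.

```python
# Study-aid sketch: the retail example expressed as stage -> service.
# All names are hypothetical; this does not talk to Google Cloud.
RETAIL_PIPELINE = {
    "ingest": "Pub/Sub",
    "store": "BigQuery",
    "process": "Dataflow",
    "analyze": "BigQuery SQL",
    "visualize": "Looker",
}

# The five stages in the order the notes list them.
STAGE_ORDER = ["ingest", "store", "process", "analyze", "visualize"]


def describe_pipeline(pipeline):
    """Return the services in stage order, joined with arrows."""
    return " -> ".join(pipeline[stage] for stage in STAGE_ORDER)


print(describe_pipeline(RETAIL_PIPELINE))
# Pub/Sub -> BigQuery -> Dataflow -> BigQuery SQL -> Looker
```

Swapping a value (for example "Looker Studio" for "Looker") is a quick way to rehearse how a different scenario changes only one stage.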

Plain English Explanation

Data analytics can feel abstract, but every dashboard you have ever seen has a real-world equivalent. Here are three different ways to picture what Looker, Looker Studio, BigQuery, and Dataflow are doing under the hood.

Analogy 1 — The Aircraft Cockpit (Real-Time Operational Dashboards)

Imagine you are the captain of a commercial jet. In front of you is a cockpit dashboard with dozens of dials and gauges — altitude, airspeed, fuel level, engine temperature, hydraulic pressure. Each of those gauges is fed by a different sensor somewhere on the airplane. The sensors stream raw electrical signals through a central computer, which converts them into the clean numbers and colors you see on the glass panel.

Now map that onto Google Cloud. The sensors are like Pub/Sub — they emit a constant stream of events (one per second, thousands per second, sometimes millions per second). The central computer that cleans and converts the signals is Dataflow — it processes the stream in real time, drops bad readings, and calculates derived metrics. The cockpit glass itself is Looker — a governed, curated dashboard that always shows the same trusted numbers, no matter who sits in the captain's seat. Just like a pilot does not want one gauge showing fuel in gallons and another in litres, an enterprise does not want one VP seeing "revenue = $4.2M" and another seeing "revenue = $4.1M" because they pulled from different sources. Looker enforces a single source of truth via LookML, the same way an aircraft manufacturer enforces a single specification for every cockpit gauge.

Analogy 2 — The Hospital Emergency Room Monitor (Streaming Analytics)

Walk into an ER and you will see patients connected to monitors showing heart rate, blood pressure, blood oxygen, and respiration. These are streaming dashboards — they update second by second, and a doctor watching them can react to a sudden change in seconds, not hours.

In Google Cloud terms, the patient is the source of events (clicks on your website, payments through your app, errors from your servers). Pub/Sub is the wire connecting the patient to the monitor. Dataflow is the monitor's signal processor — it smooths out noise, computes rolling averages, and flags anomalies. BigQuery is the patient's medical chart — every reading is archived for later analysis. Looker Studio is the printed daily summary the doctor takes home — a quick, free, self-service report that summarizes the night's events. Looker is the hospital-wide governance dashboard that the chief of medicine uses to monitor outcomes across every department, using consistent KPI definitions across all wards. The cheap, ad-hoc reports come from Looker Studio. The enterprise-grade, governed metrics come from Looker.

Analogy 3 — The Stock Exchange Ticker Wall (Embedded Analytics)

Picture the iconic ticker walls at the New York or Tokyo stock exchanges — giant LED boards showing real-time quotes for thousands of securities. Now imagine that every brokerage app on your phone shows you a personalized slice of that same data, embedded right inside the broker's own interface. That embedded experience is what Looker embedded analytics delivers.

BigQuery holds the trade history. Dataflow processes the firehose of price updates from exchanges. Looker sits on top with a LookML semantic model that defines what "price," "volume," and "daily change" mean — and then exposes those governed metrics via APIs and iframes that the brokerage embeds inside its own product. The customer never sees Looker as a brand. They see "MyBroker Insights." But under the hood, it is the same governed BI platform, with the same metric definitions, used by the company's internal analysts. This is why fintechs, SaaS vendors, and marketplaces all standardize on Looker for customer-facing analytics: one model, one set of numbers, infinite surfaces. Looker Studio cannot do this at scale — it is a free dashboard tool for internal teams, not an embedded analytics platform for paying customers.

Looker: The Enterprise BI Platform

What Looker is and why Google bought it

Looker is Google Cloud's flagship enterprise Business Intelligence (BI) platform. Acquired by Google in 2020, it is now the recommended choice for any organization that needs governed metrics, a single source of truth, and embedded analytics in customer-facing products.

LookML — define a metric once, reuse everywhere

The defining feature of Looker is LookML, a YAML-like modeling language that lets data engineers define business logic — "What does revenue mean?" "How do we calculate active users?" — exactly once, in code, and have every dashboard, every report, every API call use that same definition. This eliminates the chaos of spreadsheet-based reporting, where every department invents its own version of the same KPI.
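
The define-once idea can be illustrated with a toy example. The sketch below is plain Python, not LookML syntax, and the data, the `METRICS` dictionary, and the function names are all invented for illustration. It only shows why two reports that share one definition cannot disagree.

```python
# Toy "semantic layer": each metric is defined once and reused everywhere.
# This is NOT LookML; it only mimics the define-once idea in Python.
ORDERS = [
    {"region": "EMEA", "amount": 120.0, "refunded": False},
    {"region": "EMEA", "amount": 80.0, "refunded": True},
    {"region": "APAC", "amount": 200.0, "refunded": False},
]

# Single shared definition of "revenue": refunded orders do not count.
METRICS = {
    "revenue": lambda rows: sum(r["amount"] for r in rows if not r["refunded"]),
}


def finance_dashboard(rows):
    """One consumer of the shared definition."""
    return METRICS["revenue"](rows)


def sales_report_by_region(rows):
    """A second consumer; it reuses METRICS, so its numbers cannot drift."""
    regions = sorted({r["region"] for r in rows})
    return {
        region: METRICS["revenue"]([r for r in rows if r["region"] == region])
        for region in regions
    }


print(finance_dashboard(ORDERS))       # 320.0
print(sales_report_by_region(ORDERS))  # {'APAC': 200.0, 'EMEA': 120.0}
```

Because both functions call the same `METRICS["revenue"]`, the regional figures always add up to the company-wide total, which is the behaviour the notes describe as a single source of truth.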

Database connectivity and pricing model

Looker connects to over 50 databases including BigQuery, Snowflake, Redshift, MySQL, and PostgreSQL. It is database-agnostic but works especially well with BigQuery thanks to BI Engine acceleration. Looker is sold as a paid subscription with per-user or capacity pricing — there is no free tier.

Looker is the answer when a CDL question mentions "single source of truth," "governed metrics," "consistent KPIs across the enterprise," "LookML," or "embedded analytics in a customer-facing product." It is the choice for organizations with a dedicated data team that needs to enforce metric definitions in code. See the Looker overview for product details.

Looker Studio: Self-Service Dashboards

From Data Studio to Looker Studio

Looker Studio (formerly Google Data Studio, rebranded in 2022) is a free, self-service dashboard and reporting tool. If Looker is the enterprise BI cockpit, Looker Studio is the personal scratchpad where a single analyst, a small team, or a marketing manager can drag-and-drop their way to a quick dashboard.

Connectors and zero-install sharing

Looker Studio connects to BigQuery, Google Sheets, Google Analytics, Google Ads, YouTube Analytics, and dozens of third-party sources through community connectors. It is browser-based, requires zero installation, and dashboards can be shared with a link just like a Google Doc.

Ideal use cases

Looker Studio is ideal for:

A marketing analyst building a quick campaign performance report.
A small business tracking weekly sales from a Google Sheet.
An exploratory dashboard before deciding whether to invest in a full Looker rollout.
Publicly shareable reports (such as government open-data dashboards).

For mission-critical, enterprise-wide reporting where consistency matters, Looker is the right tool. For agility, experimentation, and "good enough" reporting at zero cost, Looker Studio wins.

Looker and Looker Studio are NOT the same product. Despite the shared brand, they have different audiences, different pricing (Looker is paid, Looker Studio is free), different governance models (Looker has LookML, Looker Studio does not), and different use cases. CDL questions deliberately exploit this confusion — read carefully. Reference the Looker Studio product page for the free tool.

Looker Studio Pro: The Middle Ground

Between the free Looker Studio and the full enterprise Looker, Google offers Looker Studio Pro. It adds enterprise features on top of the free version: team workspaces, Cloud customer support, and Dataplex integration for data governance. It does not include LookML, which remains a Looker-exclusive feature. Pro is a paid subscription priced per user, per project, and is a sensible choice for mid-sized teams that have outgrown free Looker Studio but cannot justify a full Looker deployment.

When to Choose Looker vs Looker Studio

This decision shows up on the CDL exam constantly. Here is the cheat sheet:

Treat Looker and Looker Studio as two completely different products that happen to share a brand. Looker is an enterprise BI platform built on the LookML semantic modeling language, governed metrics, role-based access through IAM, and embedded analytics in customer-facing applications. Looker Studio (formerly Data Studio) is a free self-service dashboarding tool for ad-hoc reports across Google Ads, Google Analytics, Sheets, and BigQuery. CDL exam scenarios that mention "governed KPIs," "single source of truth," or "embed analytics in our product" point to Looker; scenarios that mention "free," "quick share with the team," or "marketing analyst pulling ad data" point to Looker Studio.

Scenario	Choose
Enterprise-wide governed KPIs, LookML, single source of truth	Looker
Embedded analytics in a customer-facing SaaS product	Looker
Free, quick, ad-hoc dashboard for a small team	Looker Studio
Publicly shareable report linked from a marketing site	Looker Studio
Mid-sized team needing collaboration and support but not LookML	Looker Studio Pro
Marketing analyst pulling Google Ads + Sheets into one view	Looker Studio
CFO dashboard backed by a curated revenue model	Looker

On the exam, look for the words "governed," "enterprise," "single source of truth," "embedded," or "LookML" — those almost always point to Looker. Words like "free," "quick," "ad-hoc," "self-service," or "small team" point to Looker Studio. If the scenario mentions a customer-facing app showing analytics, it is Looker (embedded) every time.
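
The keyword rule above can be practised with a small helper. This is a rough sketch, assuming a simple count of keyword matches is good enough for drilling; the function name and the tie-breaking message are invented here, and real exam questions still need careful reading.

```python
# Keyword lists copied from the exam tip above.
LOOKER_KEYWORDS = ("governed", "enterprise", "single source of truth", "embedded", "lookml")
STUDIO_KEYWORDS = ("free", "quick", "ad-hoc", "self-service", "small team")


def suggest_tool(scenario):
    """Count keyword hits for each product and return the likelier answer."""
    text = scenario.lower()
    looker_hits = sum(keyword in text for keyword in LOOKER_KEYWORDS)
    studio_hits = sum(keyword in text for keyword in STUDIO_KEYWORDS)
    if looker_hits > studio_hits:
        return "Looker"
    if studio_hits > looker_hits:
        return "Looker Studio"
    return "unclear: read the scenario again"


print(suggest_tool("We need governed KPIs and a single source of truth"))
print(suggest_tool("A small team wants a free, quick dashboard"))
```

The first call prints Looker and the second prints Looker Studio. A scenario with no keywords, or an equal number from each list, falls through to the "unclear" message on purpose.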

The Processing Layer: Dataflow

Apache Beam under the hood

Dataflow is Google Cloud's fully managed service for executing Apache Beam pipelines. The defining characteristic is that Dataflow handles both batch processing (a fixed-size dataset processed once) and streaming processing (an unbounded stream of events processed continuously) using the same pipeline code.
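
To make "the same pipeline code for batch and streaming" concrete, here is a minimal sketch in plain Python. It does not use the Apache Beam SDK; the `clean` function, the event fields, and the generator are assumptions made for illustration. The point is only that one transform can run over a bounded list and over an unbounded source.

```python
import itertools


def clean(events):
    """Shared pipeline logic: drop missing readings, convert cents to dollars.

    It accepts any iterable, so it works on a bounded list and on a generator.
    """
    for event in events:
        if event["amount_cents"] is None:
            continue
        yield {"id": event["id"], "amount": event["amount_cents"] / 100}


# Batch: a fixed-size dataset processed once.
batch = [
    {"id": 1, "amount_cents": 1250},
    {"id": 2, "amount_cents": None},
    {"id": 3, "amount_cents": 400},
]
batch_result = list(clean(batch))


def endless_events():
    """Streaming stand-in: an unbounded source that never ends on its own."""
    for i in itertools.count(1):
        yield {"id": i, "amount_cents": None if i % 2 == 0 else i * 100}


# Streaming: same transform, results consumed as they arrive (first 3 here).
stream_result = list(itertools.islice(clean(endless_events()), 3))

print(batch_result)
print(stream_result)
```

In a real Dataflow job the runner, not a Python generator, supplies the bounded or unbounded input, but the idea of writing the transform once is the same.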

Common Dataflow workloads

Real-world Dataflow scenarios include:

Cleaning and transforming streaming data from Pub/Sub before loading into BigQuery.
ETL (Extract-Transform-Load) jobs that move data from operational databases into a warehouse.
Real-time fraud detection on payment streams.
Sessionization of clickstream data for product analytics.
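
Sessionization, the last workload in the list, is easy to picture with a small example. The sketch below is plain Python rather than a Beam pipeline, and the 30-minute inactivity gap is an assumed value chosen only for illustration.

```python
def sessionize(timestamps, gap_seconds=1800):
    """Group event timestamps (in seconds) into sessions split on inactivity gaps."""
    sessions = []
    for ts in sorted(timestamps):
        # Extend the current session if the gap since its last event is small.
        if sessions and ts - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


clicks = [0, 60, 300, 4000, 4100, 9000]
print(sessionize(clicks))
# [[0, 60, 300], [4000, 4100], [9000]]
```

Each inner list is one user session. A streaming pipeline would do the same grouping continuously as clicks arrive instead of over a finished list.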

Serverless autoscaling

Dataflow is serverless — Google manages all autoscaling, worker pools, and shuffling. The customer just submits a pipeline; Dataflow figures out how many workers to use, when to scale up, and when to scale back down.

Dataflow is a fully managed Google Cloud service that runs Apache Beam pipelines, providing a unified programming model for batch and stream processing with automatic scaling and no infrastructure to manage. It is the recommended choice for new, cloud-native data pipelines on Google Cloud. See Dataflow overview.

The Processing Layer: Dataproc

Managed Hadoop and Spark

Dataproc is Google Cloud's managed service for running Apache Hadoop and Apache Spark clusters. While Dataflow is the recommended choice for greenfield projects, Dataproc exists primarily to support lift-and-shift migrations of existing on-premises Hadoop and Spark workloads.

Lift-and-shift without rewrites

If a company has spent years building Spark jobs in Scala or PySpark on a Cloudera or Hortonworks cluster, it does not want to rewrite everything in Apache Beam. Dataproc lets the company lift those jobs and run them on Google Cloud with minimal or no code changes, while still getting cloud benefits: clusters typically start in about 90 seconds, scale elastically, and can be torn down when idle to save money.

When the Hadoop ecosystem is required

Dataproc is also the right answer when the workload requires the specific Hadoop ecosystem — Hive, Pig, Presto, HBase, Zeppelin — that does not have a direct Dataflow equivalent.

Dataflow = streaming + batch via Apache Beam; recommended for new cloud-native pipelines. Dataproc = managed Hadoop/Spark for lift-and-shift of existing on-prem workloads. When the question mentions "Apache Beam" or "new streaming pipeline," choose Dataflow. When it mentions "existing Hadoop," "migrate Spark jobs," or "lift and shift," choose Dataproc. See the Dataproc overview.

The Processing Layer: Cloud Composer

Managed Apache Airflow

Cloud Composer is Google Cloud's managed Apache Airflow service. Where Dataflow processes the data itself, Composer orchestrates the pipelines — it decides when each step runs, what depends on what, and what to do if a step fails.

A typical orchestration DAG

A typical Composer workflow might:

Wait for new files to land in Cloud Storage at 3 AM.
Trigger a Dataflow job to clean them.
Wait for the Dataflow job to finish.
Run a BigQuery SQL transform.
Notify Looker to refresh the daily dashboard.
Send a Slack message if any step failed.
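
The steps above form a dependency graph, which can be sketched without Airflow. The example below uses Python's standard `graphlib` module (Python 3.9 or later); the task names, the `run` helper, and the `notify` callback are hypothetical stand-ins, not Cloud Composer or Airflow APIs.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical task names).
DAG = {
    "wait_for_files": set(),
    "run_dataflow_clean": {"wait_for_files"},
    "wait_for_dataflow": {"run_dataflow_clean"},
    "bigquery_transform": {"wait_for_dataflow"},
    "refresh_looker": {"bigquery_transform"},
}


def run(dag, tasks, notify):
    """Run tasks in dependency order; stop and notify on the first failure."""
    completed = []
    for name in TopologicalSorter(dag).static_order():
        try:
            tasks[name]()
        except Exception as exc:
            notify(f"{name} failed: {exc}")  # stands in for the Slack message
            break
        completed.append(name)
    return completed


def ok():
    pass


def boom():
    raise RuntimeError("query error")


alerts = []
tasks = {name: ok for name in DAG}
tasks["bigquery_transform"] = boom  # simulate a failing step
done = run(DAG, tasks, alerts.append)
print(done)
print(alerts)
```

Here the first three steps complete, the BigQuery step fails, the dashboard refresh never runs, and one alert is recorded. An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this basic ordering.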

Composer complements, not replaces, Dataflow

Composer is the choice when a CDL question describes "orchestration," "scheduling," "workflow management," "DAG" (Directed Acyclic Graph), or "Airflow." It is not a replacement for Dataflow — it works alongside Dataflow to coordinate multi-step jobs.

The Ingest Layer: Pub/Sub

Topics, subscriptions, and decoupling

Pub/Sub (Publish/Subscribe) is Google Cloud's globally distributed messaging service and the standard entry point for streaming data. Publishers (producers of data) send messages to topics; subscribers (consumers) pull or receive messages from subscriptions. Pub/Sub decouples the producer from the consumer so they can scale independently.
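
The topic and subscription model can be mimicked in a few lines of plain Python. This is an in-memory toy, not the Pub/Sub client library; the `Topic` class and its methods are invented for illustration. It shows how a publisher writes once while each subscription reads at its own pace.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a topic; each subscription gets its own queue."""

    def __init__(self):
        self._subscriptions = {}

    def subscribe(self, name):
        self._subscriptions[name] = deque()

    def publish(self, message):
        # Fan out: every subscription receives its own copy of the message.
        for queue in self._subscriptions.values():
            queue.append(message)

    def pull(self, name, max_messages=10):
        queue = self._subscriptions[name]
        count = min(max_messages, len(queue))
        return [queue.popleft() for _ in range(count)]


topic = Topic()
topic.subscribe("analytics")
topic.subscribe("alerting")
topic.publish({"event": "click", "page": "/home"})
topic.publish({"event": "purchase", "sku": "A1"})

fast = topic.pull("alerting")       # a fast consumer drains both messages
slow = topic.pull("analytics", 1)   # a slow consumer takes one for now
print(len(fast), slow)
```

The publisher never needs to know how many subscribers exist or how quickly they read, which is the decoupling described above.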

The canonical PDB-L reference pipeline

Pub/Sub is the front door of nearly every streaming analytics architecture on Google Cloud. The canonical reference pipeline is:

Pub/Sub → Dataflow → BigQuery → Looker

A device or web application publishes events to Pub/Sub. Dataflow consumes the events, cleans and enriches them, and streams them into BigQuery. Looker queries BigQuery to power a real-time dashboard. This four-service pipeline is so common on the CDL exam that it deserves its own mnemonic — remember the chain PDB-L (Pub/Sub, Dataflow, BigQuery, Looker).

Scale, durability, and ingestion patterns

Pub/Sub supports millions of messages per second, guarantees at-least-once delivery, and replicates messages across multiple zones for durability. It is the recommended ingestion service for IoT telemetry, application event logs, change-data-capture from databases, and webhook fan-out. Reference the Pub/Sub overview for service details.
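
At-least-once delivery means a subscriber can occasionally receive the same message twice, so consumers are usually written to tolerate duplicates. The sketch below shows one common approach, deduplicating on a message ID. The message shape and the `process_once` name are assumptions for illustration, not the Pub/Sub client API.

```python
def process_once(deliveries):
    """Add up each message's amount once, even if a message is delivered twice."""
    seen_ids = set()
    total = 0
    for message in deliveries:
        if message["id"] in seen_ids:
            continue  # duplicate redelivery, already handled
        seen_ids.add(message["id"])
        total += message["amount"]
    return total


deliveries = [
    {"id": "m1", "amount": 10},
    {"id": "m2", "amount": 5},
    {"id": "m1", "amount": 10},  # redelivered copy of m1
]
print(process_once(deliveries))  # 15
```

Without the `seen_ids` check the total would be 25, which is the kind of double counting that at-least-once consumers need to guard against.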

The Storage Layer: BigQuery and Cloud Storage

Warehouse vs lake — two distinct roles

Storage in an analytics pipeline splits into two roles:

BigQuery is the data warehouse — structured, queryable with SQL, optimized for fast analytical queries against trillions of rows. This is where cleaned, modelled data lives so that Looker and Looker Studio can visualize it. For a deep dive on the warehouse layer, see Data Warehousing with BigQuery.
Cloud Storage is the data lake — raw, unstructured or semi-structured files (CSV, JSON, Parquet, images, video) stored cheaply at any scale. This is where data lands first, before processing, and where archival copies are kept for compliance.

BigLake — bridging warehouse and lake

Modern data architectures often use BigLake, which lets BigQuery query data directly in Cloud Storage without copying it — combining the cheapness of object storage with the SQL power of the warehouse. To compare BigQuery to other databases on the platform, see Google Cloud Databases.

Real-Time vs Batch Analytics

A critical CDL decision is whether a workload needs real-time (streaming) or batch analytics.

When streaming is the only option

Real-time (streaming) is required when latency matters in seconds:

Fraud detection on credit card transactions.
Personalized recommendations as a user browses.
Operational dashboards for a 24/7 service desk.
IoT alerting for industrial equipment.

When batch is good enough

Batch is acceptable when latency in hours or days is fine:

Monthly financial close.
Daily sales reports emailed at 8 AM.
Quarterly executive reviews.
Archive analytics on historical data.
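
The two lists above boil down to one question: how long can the business wait for an answer? A minimal sketch of that rule follows. The 60-second cut-off is an assumed threshold chosen for illustration, since the notes only distinguish "seconds" from "hours or days".

```python
def choose_mode(max_acceptable_latency_seconds, streaming_threshold_seconds=60):
    """Return "streaming" when answers are needed within seconds, else "batch"."""
    if max_acceptable_latency_seconds <= 0:
        raise ValueError("latency must be positive")
    if max_acceptable_latency_seconds <= streaming_threshold_seconds:
        return "streaming"
    return "batch"


print(choose_mode(2))            # fraud detection: streaming
print(choose_mode(24 * 3600))    # daily sales report: batch
```

Workloads that fall between the two extremes, such as a dashboard refreshed every few minutes, are judgment calls that depend on cost as well as latency.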

Cost trade-offs between streaming and batch

The cost difference matters. Batch jobs run once and shut down; streaming jobs run 24/7 and accumulate cost continuously. Use streaming only when the business value justifies always-on processing. For the broader business case for cloud, see the Cloud value proposition.

Embedded Analytics: A Looker Superpower

Iframe embedding and Powered by Looker

One feature that distinguishes Looker from Looker Studio is embedded analytics. Through Looker's iframe embedding and Powered by Looker programs, Looker dashboards and visualizations can be embedded directly inside a third-party SaaS product. Customers of that SaaS see the analytics as a native feature of the product, never knowing that Looker is the engine.

White-labelled B2B SaaS analytics

This is increasingly the standard for B2B SaaS vendors who want to offer "white-labelled analytics" to their own customers without building a BI tool from scratch. Companies like HubSpot, Lyft, and Spotify have built customer-facing analytics on Looker. Looker Studio does not support this enterprise-grade embedding scenario.

Business Stories That Map to the Pipeline

Pattern matching exam scenarios to the stack

CDL questions are almost always framed as business stories. Here are the four most common stories and how they map to the analytics stack:

"Our executives want a real-time KPI dashboard with consistent definitions across regions" → Pub/Sub + Dataflow + BigQuery + Looker (governance and LookML).
"Our marketing analyst wants to pull Google Ads, Google Analytics, and a Sheet into one view by Friday" → Looker Studio (free, self-service, fast).
"We are a SaaS company that wants to show usage charts to our paying customers inside our app" → Looker embedded analytics.
"We have ten years of Hive jobs running on-prem and need to move to the cloud quickly without rewriting" → Dataproc (lift-and-shift), with output in BigQuery and dashboards in Looker.

Cost and Pricing Comparison

Service	Pricing model	Free tier
Looker	Per-user or capacity subscription; contact sales	None
Looker Studio	Free	Unlimited
Looker Studio Pro	Per-user, per-project subscription	Free trial
Dataflow	Per vCPU-hour, per GB-RAM-hour, per GB-data-processed	Limited
Dataproc	Per vCPU-hour for cluster nodes	Limited
Pub/Sub	Per GB of message data	10 GB/month free
BigQuery	Storage + query (on-demand or slot)	10 GB storage + 1 TB queries/month free

The combined free tier of Looker Studio + BigQuery means a small business can run real analytics at no cost for as long as its usage stays within the BigQuery free-tier limits. As soon as governance, embedded analytics, or strict KPI consistency become requirements, the upgrade path to Looker begins.
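
A quick way to sanity-check the free-tier claim is to compare expected usage with the BigQuery limits quoted in the table above (10 GB of storage and 1 TB of queries per month). The helper below is a sketch with invented names; actual limits and billing rules should be confirmed against current Google Cloud pricing.

```python
# Limits taken from the pricing table in these notes.
FREE_STORAGE_GB = 10
FREE_QUERY_TB_PER_MONTH = 1


def within_bigquery_free_tier(storage_gb, monthly_query_tb):
    """Return True when both storage and query volume fit the free tier."""
    return (
        storage_gb <= FREE_STORAGE_GB
        and monthly_query_tb <= FREE_QUERY_TB_PER_MONTH
    )


print(within_bigquery_free_tier(4, 0.3))    # True: small shop, stays free
print(within_bigquery_free_tier(25, 0.3))   # False: storage exceeds 10 GB
```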

Frequently Asked Questions

Q: What is the difference between Looker and Looker Studio?

A: Looker is an enterprise, paid BI platform with LookML semantic modeling, governed metrics, and embedded analytics. Looker Studio is a free, self-service dashboard tool for ad-hoc reporting. Same brand, different products, different use cases.

Q: When should I choose Dataflow over Dataproc?

A: Choose Dataflow for new cloud-native pipelines, especially streaming ones, since it offers a unified batch + stream model via Apache Beam. Choose Dataproc when migrating existing Hadoop or Spark workloads from on-premises with minimal code changes.

Q: What is LookML?

A: LookML is Looker's modeling language. Data engineers use LookML to define dimensions, measures, and business logic exactly once, and every dashboard, report, and API call uses those shared definitions — producing a single source of truth across the organization.

Q: Can Looker Studio connect to BigQuery?

A: Yes. Looker Studio has a native BigQuery connector and is one of the most common free tools for visualizing BigQuery data. For governed enterprise reporting, Looker is preferred; for quick or ad-hoc charts, Looker Studio is more than sufficient.

Q: What is the canonical streaming analytics pipeline on Google Cloud?

A: Pub/Sub → Dataflow → BigQuery → Looker (mnemonic: PDB-L). Events enter via Pub/Sub, are processed by Dataflow, stored in BigQuery, and visualized through Looker (or Looker Studio for free/ad-hoc cases).

Q: Is Cloud Composer a replacement for Dataflow?

A: No. Cloud Composer (managed Apache Airflow) orchestrates workflows — it schedules and coordinates multi-step pipelines. Dataflow executes the data processing inside each step. They complement each other rather than compete.

Summary: From Pipeline to Decision

For the Google Cloud Digital Leader exam, the data analytics layer is about recognizing the five stages (ingest → store → process → analyze → visualize) and matching the right Google service to each. The visualization stage is where two products demand careful attention: Looker for governed enterprise BI with LookML and embedded analytics, and Looker Studio for free self-service dashboards. Behind the scenes, Pub/Sub ingests streams, Dataflow and Dataproc process data, Cloud Composer orchestrates workflows, and BigQuery stores everything ready for SQL. Master this map of services and the business scenarios that fit each one, and you have mastered roughly one-fifth of the CDL exam.
