examlab .net The most efficient path to the most valuable certifications.
In this note ≈ 20 min

Looker and Looker Studio Integration

3,820 words · ≈ 20 min read ·

Practical GCP PDE study notes on Looker integration with BigQuery: LookML semantic layer, PDTs, BI Engine, aggregate awareness, embed strategies, and IAM.

Do 20 practice questions → Free · No signup · PDE

Introduction to Looker Integration with BigQuery

Looker Integration with BigQuery is the canonical analytics pattern on Google Cloud: a governed semantic layer on top of a serverless warehouse. For the PDE exam, you need to know when Looker fits, when Looker Studio is enough, and how the wiring choices (service account vs OAuth, PDTs, BI Engine, aggregate awareness) impact cost, latency, and governance.

This note walks through the architecture end-to-end. We will cover the difference between Looker (original) and Looker Studio, the LookML semantic layer, persistent derived tables, BI Engine acceleration, connection patterns to BigQuery, aggregate awareness, materialized view caching, embed strategies, IAM, and governance through the Looker Modeler. By the end, you should be able to design a Looker Integration with BigQuery deployment that scales to hundreds of analysts without burning a hole in the FinOps report.

Looker Integration with BigQuery shows up in PDE scenarios that mention "single source of truth metrics," "embedded analytics in a SaaS product," "row-level security across tenants," or "dashboards that hit a 10 TB table without a query queue." Spotting those signals is half the battle.

Looker Integration with BigQuery is a semantic-layer-on-warehouse pattern, not a BI tool sitting next to a warehouse. The LookML model is the source of truth for metrics; SQL is generated, not written by analysts. Understand that distinction before any other detail. Reference: https://cloud.google.com/looker/docs/what-is-looker

白話文解釋(Plain English Explanation)

Looker Integration with BigQuery is one of those topics where the marketing diagrams hide the real mechanics. Three analogies make it click.

The restaurant kitchen with a master recipe book

Picture a chain restaurant with 200 locations. Each branch could let cooks improvise burger recipes, and you would get 200 different burgers. Instead, headquarters publishes a master recipe book. Every cook follows the same grams of beef, the same sauce ratio, the same plating. Customers in Taipei and Tokyo get the same burger.

LookML is the master recipe book. Analysts in marketing, finance, and product all ask "what is monthly active users." Without Looker, three teams write three SQL queries and produce three numbers. With Looker, the metric is defined once in LookML, and BigQuery executes the same generated SQL every time. The kitchen (BigQuery) is powerful, but the recipe book (LookML) is what keeps the brand consistent.

Looker Studio, by contrast, is more like a home kitchen. Anyone can grab ingredients and cook. Great for exploration, terrible for governance.

The express lane at the airport security checkpoint

BI Engine acceleration is the TSA PreCheck of BigQuery. Regular passengers go through the standard line: take off shoes, unpack laptop, scan everything. PreCheck members keep shoes on, walk through fast, and clear in 30 seconds.

When a Looker dashboard hits BigQuery, the standard path provisions slots, scans columns from Capacitor storage, and builds the result. With BI Engine reservation, frequently queried tables are cached in a columnar in-memory layer, and the dashboard returns in sub-second. You did not change the SQL. You did not change the dashboard. You bought an express lane for hot data.

Aggregate awareness is the same idea but at the modeling layer. Instead of asking the warehouse to scan a billion-row fact table for a yearly trend, Looker silently rewrites the query against a pre-aggregated daily rollup. The user sees the same chart; BigQuery scans 365 rows instead of a billion.

The library card catalog that knows every book

A modern library has millions of books. Without a catalog, you wander aisles for hours. The catalog is a small, well-indexed dataset that points to the right shelf, the right floor, the right ISBN.

The Looker Modeler is the card catalog. The Explores, dimensions, measures, and joins are the index entries. When a business user clicks "revenue by region by month," the Modeler translates that intent into a precise SQL query against BigQuery. The user never sees a JOIN. The user never picks a partition filter. The Modeler does it.

Governance lives in the catalog: which books are restricted (row-level security), who can check out reference-only books (access grants), what late fees apply (usage attribution). Strip the catalog and the library collapses into a warehouse of paper.

Core Concepts of Looker Integration with BigQuery

The PDE exam tests whether you can recognize the moving parts and connect them to scenario constraints.

Looker (original) vs Looker Studio

These are two different products that share branding and not much else.

Looker (original, formerly Looker 7) is the enterprise platform with LookML, the IDE, Git integration, scheduled deliveries, embed SDKs, and the API. It runs as a managed service on Google Cloud and bills per-user or per-instance. Looker Integration with BigQuery in this context means a governed semantic layer with version-controlled metrics.

Looker Studio (formerly Data Studio) is the free, self-serve dashboarding tool. No LookML. No semantic layer. Direct connectors to BigQuery, Sheets, Cloud SQL, and 800+ third-party sources. Looker Studio Pro adds team workspaces, asset management, and Cloud-managed support, but still no LookML.

For PDE scenarios mentioning "governed metrics across the enterprise" or "embedded white-label analytics in a SaaS product," Looker (original) is the answer. For "ad hoc marketing dashboards" or "free internal reporting," Looker Studio.

LookML semantic layer

LookML is a YAML-flavored DSL that describes views (tables), explores (joinable starting points), dimensions (columns), and measures (aggregations). The Looker query engine reads LookML, generates BigQuery Standard SQL, runs it, and returns results.

Critical for the exam: LookML never stores data. It is metadata. The data stays in BigQuery. This is what makes Looker Integration with BigQuery elegant: no data duplication, no ETL into a BI cube, no proprietary storage format. The semantic layer is pure metadata over the warehouse.

Looker Modeling Language is the declarative DSL Looker uses to define the semantic layer. A LookML project is a Git repository containing model files (which Explores exist), view files (table definitions, dimensions, measures), and manifest files (project settings). Reference: https://cloud.google.com/looker/docs/what-is-lookml

Persistent Derived Tables (PDTs)

A derived table in Looker is a SQL query treated as if it were a view. An ephemeral derived table is recomputed on every dashboard load. A persistent derived table (PDT) is materialized into a scratch schema in BigQuery and refreshed on a schedule (datagroup trigger, sql_trigger_value, or persist_for interval).

PDTs solve two problems: expensive joins (precompute once, reuse 1000 times) and slow source queries (cache results overnight). The trade-off is freshness: a daily-refreshed PDT lags real-time data by up to 24 hours.

For Looker Integration with BigQuery, PDTs need a writable scratch dataset. The Looker service account requires BigQuery Data Editor on that dataset.

BI Engine acceleration

BI Engine is BigQuery's in-memory columnar cache. You buy a reservation (measured in GB) and pin tables or queries to it. When Looker submits a SQL query that hits a cached table and uses supported functions, BI Engine serves it from RAM in sub-second.

For Looker Integration with BigQuery, BI Engine is the dashboard accelerator. A 30-second cold query becomes a 500ms warm query. Users stop blaming the dashboard for being slow.

Connection: service account vs OAuth

Looker connects to BigQuery in one of two modes:

Service account (the default for most deployments): Looker authenticates as a single GCP service account. All queries run as that identity. Row-level security is enforced inside LookML using access_filter or sql_always_where.

OAuth (per-user credentials): Each Looker user authenticates to BigQuery with their own Google identity. BigQuery sees the actual user. Row-level security and column-level security from BigQuery (policy tags, authorized views) apply directly.

OAuth gives you native BigQuery IAM enforcement but requires every analyst to have a BigQuery role. Service account gives you simpler IAM but pushes RLS enforcement into LookML.

A common production bug: deploying Looker Integration with BigQuery with service account auth, then expecting BigQuery's row-level access policies to filter results per user. They will not. BigQuery sees only the service account. You must implement RLS in LookML with access_filter or switch to OAuth. Reference: https://cloud.google.com/looker/docs/db-config-google-bigquery

Aggregate awareness

Aggregate awareness lets you define rollup tables (daily, weekly, monthly aggregates) and tell Looker to use them transparently. A user asking for monthly revenue gets routed to the monthly rollup. A user asking for hourly revenue falls through to the raw fact table.

The cost win is enormous. Scanning a 1 TB fact table costs roughly 5 USD on-demand. Scanning a 100 MB monthly rollup costs less than a cent. Multiply by 1000 dashboard loads per day.

Materialized view caching

BigQuery materialized views are precomputed query results that BigQuery automatically refreshes (within the smart refresh window). Unlike PDTs (managed by Looker), materialized views are managed by BigQuery and benefit from automatic query rewriting: a query against the base table can be transparently served from the materialized view.

For Looker Integration with BigQuery, materialized views are an alternative to PDTs when:

  • The aggregation is simple (sum, count, avg, min, max with group by)
  • You want BigQuery to manage refresh, not Looker
  • You want other workloads (not just Looker) to benefit from the cache

Embed strategies

Looker supports three embed modes:

  • Public embed: anyone with the link sees the dashboard. Use only for non-sensitive data.
  • Private embed: requires a Looker user session.
  • Signed embed (SSO embed): your application generates a signed URL with an embedded user identity, permissions, and access filters. The user never logs into Looker directly.

Signed embed is the standard for SaaS products embedding analytics into customer-facing portals.

Architecture and Design Patterns

A reference Looker Integration with BigQuery architecture has six layers stacked from data to user.

At the bottom, raw data lands in BigQuery via Dataflow, Datastream, BigQuery Data Transfer Service, or batch loads from Cloud Storage. Partitioning by ingestion time and clustering by tenant ID set up the foundation for cost-efficient scans.

A transformation layer (Dataform or dbt) builds curated, modeled tables. This is where slowly changing dimensions, fact tables, and conformed dimensions live. Looker should not be doing transformation work; it should consume well-shaped tables.

The acceleration layer is where BI Engine reservations and BigQuery materialized views live. Pre-aggregated rollups (daily, weekly, monthly) sit here, ready for aggregate awareness routing.

Looker sits on top with LookML projects in Git, connection definitions pointing to BigQuery datasets, and PDTs writing to a scratch dataset. Explores expose curated joins; access grants enforce role-based visibility.

The delivery layer is dashboards, scheduled email/Slack deliveries, the Looker API, and embed SDKs for SaaS products.

The governance layer wraps everything: IAM on BigQuery datasets, Looker user groups, content access controls, and audit logs streamed to Cloud Logging.

Treat the LookML project as production code. Use Git branches, pull requests, code review, and a test environment that points at a BigQuery dev dataset. Looker's IDE supports all of this natively. Skipping Git discipline is the single biggest source of "the dashboard broke after that change" tickets. Reference: https://cloud.google.com/looker/docs/version-control-and-deploying-changes

GCP Service Deep Dive

Looker Integration with BigQuery touches several GCP services. Knowing each one's role keeps the architecture clear.

BigQuery as the query engine

BigQuery does the heavy lifting. Every Looker dashboard query becomes a BigQuery job. Slot allocation (on-demand or reservations), query queueing, and concurrency limits all apply. For Looker Integration with BigQuery at scale, a dedicated BigQuery reservation prevents dashboard queries from being throttled by ad hoc analyst queries.

Partitioning and clustering on the underlying tables matter enormously. A LookML measure that triggers a WHERE date BETWEEN x AND y will prune partitions only if the table is partitioned on that date column. Cluster on the high-cardinality columns that appear in dashboard filters (customer_id, region, product_id).

BI Engine reservations

BI Engine is purchased as a fixed-capacity reservation in GB. You pin specific tables or let BI Engine choose automatically based on query patterns. Supported SQL is a subset of Standard SQL: most aggregations, joins, and filters work; some user-defined functions and complex window functions do not.

For Looker Integration with BigQuery, allocate BI Engine capacity equal to the working set of your hottest dashboards. Monitor BI_Engine_statistics in INFORMATION_SCHEMA.JOBS to see hit rates.

Cloud IAM and Looker roles

Two IAM systems are in play. GCP IAM controls who and what can read which BigQuery datasets. Looker IAM controls who can see which dashboards, run which Explores, and apply which permissions.

The Looker service account needs:

  • roles/bigquery.dataViewer on source datasets
  • roles/bigquery.dataEditor on the PDT scratch dataset
  • roles/bigquery.jobUser on the project running the queries

Looker users need Looker permissions (developer, explorer, viewer) and, if using OAuth, BigQuery IAM roles too.

Dataform and dbt for upstream modeling

Looker generates SQL but does not transform raw data. Dataform (native to BigQuery) and dbt (community standard) handle the ELT layer. The Looker view files reference the curated marts produced by Dataform/dbt, not raw landing tables.

Cloud Logging and audit

Every BigQuery query Looker generates is logged with the service account or user identity. Stream logs to BigQuery for cost analysis: which dashboards burn the most slot-hours, which users trigger expensive queries.

Common Pitfalls and Trade-offs

Real production deployments fail in predictable ways.

The first failure mode is the runaway PDT. An analyst writes a derived table that joins five billion-row tables. PDT refresh runs nightly and consumes 4 TB of slot-hours every refresh. Fix: review PDTs the way you review production SQL, and cap concurrent PDT builds in connection settings.

The second failure mode is BI Engine misconfiguration. Teams buy a 100 GB reservation, expect 10x speedup, and see no improvement. Cause: dashboards use unsupported SQL functions, so BI Engine silently falls back to standard slots. Fix: monitor bi_engine_statistics.bi_engine_mode in job metadata; refactor offending measures.

The third failure mode is OAuth without IAM hygiene. Switching to OAuth surfaces every BigQuery permission gap. Users see "permission denied" errors instead of dashboards. Fix: pre-flight IAM with a test group before flipping the connection mode.

The fourth failure mode is over-modeling. Teams build 200 Explores when 20 would do. LookML becomes unmaintainable. Fix: one Explore per business process (orders, sessions, marketing attribution), not one per analyst request.

The fifth is forgetting clustering. The underlying BigQuery table is unclustered, so every dashboard scans the whole partition. Fix: cluster by the columns that appear in WHERE clauses of generated SQL.

Aggregate awareness with mismatched timezones is a classic silent bug. Your daily rollup is built in UTC; your dashboard filter is in America/Los_Angeles. Aggregate awareness routes the query to the rollup, but the day boundaries are off by 7-8 hours. Numbers look right, are wrong. Always align rollup and dashboard timezones explicitly. Reference: https://cloud.google.com/looker/docs/aggregate_awareness

Best Practices

A working set of habits for Looker Integration with BigQuery deployments:

  • Version-control LookML in Git from day one. Use branches and pull requests.
  • Separate dev, staging, and production BigQuery projects, with one Looker connection each.
  • Use service account auth unless you have a hard requirement for per-user BigQuery audit trails.
  • Cluster BigQuery tables on columns that appear in dashboard filters; partition on date.
  • Pre-aggregate hot metrics into rollup tables and wire them into aggregate awareness.
  • Set query and row limits in connection settings to prevent runaway queries.
  • Use BI Engine for dashboards with millisecond latency requirements; budget the reservation.
  • Stream Looker query logs to BigQuery for cost attribution by dashboard, user, and Explore.
  • Implement row-level security in LookML using access_filter, with user attributes.
  • Document Explores. A LookML model with no descriptions is a dashboard graveyard.

Real-World Use Case

Consider a mid-sized B2B SaaS company, 300 employees, 8000 paying customer accounts, BigQuery as the primary warehouse. The product has an in-app analytics tab that customers use daily.

Requirements: every customer sees only their own data; latency under 2 seconds for the in-app dashboard; internal teams (CS, sales, finance) need different views over the same data; metrics must reconcile across teams.

Architecture chosen: Looker (original) on Google Cloud. LookML project in GitHub with PR-based deployment. BigQuery curated dataset built nightly by Dataform from raw event logs. Looker connects via service account with BigQuery Data Viewer on curated and Data Editor on PDT scratch.

Row-level security is enforced via an access_filter on account_id, mapped from the account_id user attribute. Internal users have a wildcard attribute; customer-facing embeds inject the specific account_id from the SaaS app's signed URL.

Aggregate awareness handles the in-app dashboard: a daily account_metrics_daily rollup serves 80% of queries with sub-second latency. BI Engine pins this rollup table, giving consistent low latency.

Embedded analytics use signed embed URLs generated by the SaaS backend. Customers never see Looker's UI directly; they see Looker dashboards rendered inside the product. White-labeling hides the Looker brand.

Cost story: pre-aggregation reduced BigQuery scan volume by 95%. BI Engine reservation is 50 GB at roughly 1500 USD/month, far less than the slot-hours saved. Total Looker Integration with BigQuery cost is well within the analytics budget.

Exam Tips

PDE exam questions about Looker Integration with BigQuery cluster around a few patterns.

When the scenario emphasizes "single source of truth for metrics across teams" or "consistent KPIs," the answer is Looker (original) with LookML, not Looker Studio.

When the scenario says "free, self-serve dashboards for marketing" or "executive scorecard with no engineering involvement," Looker Studio is the answer.

When latency is the constraint ("sub-second response on a 1 TB table"), think BI Engine reservation plus aggregate awareness, not bigger BigQuery slot reservations.

When the scenario mentions "different users should see different rows," distinguish between OAuth (BigQuery enforces) and service account with access_filter (LookML enforces). Both are valid; the right answer depends on whether per-user BigQuery audit is required.

When the scenario talks about "embedded analytics in our SaaS product for customers," think signed embed with user attributes for tenant isolation.

When asked about cost optimization, the layered answer is: cluster the source tables, pre-aggregate hot metrics into rollups, configure aggregate awareness, then add BI Engine for the latency-sensitive dashboards.

The Looker pricing model on GCP is per-user (Standard, Enterprise, Embed editions) or per-instance (legacy). Looker Studio is free; Looker Studio Pro is per-user. BI Engine is reserved capacity in GB, billed hourly. BigQuery scan or slot costs are separate and stack on top. Reference: https://cloud.google.com/looker/docs/google-cloud-billing

For latency-sensitive Looker Integration with BigQuery dashboards, the exam-correct stack is aggregate awareness routing to pre-aggregated rollups plus a BI Engine reservation pinning those rollups, not a bigger BigQuery slot reservation. Aggregate awareness rewrites the SQL to scan a 365-row monthly rollup instead of a billion-row fact table, and BI Engine then serves the result from in-memory columnar cache in sub-second. Verify acceleration with bi_engine_statistics.bi_engine_mode = FULL in INFORMATION_SCHEMA.JOBS. Reference: https://cloud.google.com/bigquery/docs/bi-engine-intro

Frequently Asked Questions (FAQ)

Should I use Looker or Looker Studio for my project?

Use Looker (original) when you need a governed semantic layer, version-controlled metrics, embedded analytics in a SaaS product, or row-level security across many users. Use Looker Studio when you need free, self-serve dashboards for small teams or executives without engineering involvement. The two products share a brand but solve different problems; do not let the name overlap mislead you.

Does Looker store any data, or does it always query BigQuery live?

Looker itself stores only metadata: the LookML project, user accounts, dashboard definitions, scheduled job history. Data lives in BigQuery (or whichever connected database). The exception is persistent derived tables, which Looker materializes back into a scratch BigQuery dataset, but the data still physically resides in BigQuery storage. Looker is a metadata layer, not a data store.

When should I choose persistent derived tables vs BigQuery materialized views?

PDTs win when the underlying SQL is complex (multi-step CTEs, window functions, things materialized views do not support) or when refresh logic depends on Looker datagroups. Materialized views win when the aggregation is simple, when you want BigQuery to handle refresh automatically, and when other workloads (not just Looker) benefit from the cache. Many production stacks use both.

How do I enforce row-level security across thousands of customers in an embedded scenario?

Use signed embed URLs and inject the customer ID as a Looker user attribute. In LookML, define an access_filter on the relevant view that filters on that user attribute. The Looker query engine appends a WHERE clause to every generated SQL query, so even a curious user inspecting the embedded iframe cannot see other tenants' rows. Combined with a service account connection, this scales to tens of thousands of tenants without per-customer GCP IAM.

Will BI Engine accelerate every Looker dashboard automatically?

No. BI Engine accelerates queries only when the SQL is supported (most aggregations, joins, filters), the table is in a region matching the reservation, and the data fits in cache. Unsupported functions cause silent fallback to standard slots. Always verify acceleration by inspecting bi_engine_statistics in BigQuery's INFORMATION_SCHEMA.JOBS. Look for bi_engine_mode = FULL for fully accelerated queries.

What happens to dashboards if my LookML project has a syntax error?

The Looker IDE blocks deploys when LookML fails validation. If a bad commit slips into production (rare with PR review), the affected Explores fail to load with an error message; unaffected Explores keep working. Treating LookML as production code with mandatory review and a staging environment makes this a non-event in mature deployments.

Further Reading

Official sources

More PDE topics