Cloud DLP: Inspection and Templates

Introduction to Cloud DLP

In a data-driven world, knowing where your sensitive data resides is the first step toward securing it. Cloud Data Loss Prevention (DLP) is a fully managed service designed to help you discover, classify, and protect sensitive data across your Google Cloud ecosystem and beyond.

For the PSE exam, you must understand how to configure inspection jobs, leverage templates for scalability, and interpret risk analysis metrics to ensure compliance with regulations like GDPR, HIPAA, and PCI DSS.

Plain English Explanation

Cloud DLP can feel abstract because it bundles detection, classification, and statistical risk analysis into one product. The analogies below tie each concept to something physical so the PSE-level decisions become obvious.

Analogy 1: The Airport Security Scanner (InfoType detectors)

Built-in InfoTypes are like the X-ray machines at airport security — they are pre-trained to spot known shapes: laptops, liquids, knives. CREDIT_CARD_NUMBER and US_SOCIAL_SECURITY_NUMBER are equivalent to the "metallic object" silhouettes the machine already knows. Custom InfoTypes (regex + dictionary) are the supplementary list a specific airport adds, like "ban this country's currency over $X" — narrow rules layered on top of the global ruleset. Hotwords are the agent at the side reading the luggage tag: even if the X-ray is ambiguous, the tag saying "MEDICAL EQUIPMENT" raises confidence on what's inside.

Analogy 2: The Library Cataloging Project (Inspection Templates)

Imagine cataloging 10,000 books across 50 branches. You don't want each librarian inventing their own classification rules. An Inspection Template is the shared cataloging manual — one team writes it, and every branch (read: every job on every BigQuery dataset or GCS bucket) references the same manual. If you update the rule for "what counts as a rare book" (e.g., raise the minimum likelihood to VERY_LIKELY), every future inventory uses the new rule automatically. De-identification templates are the redaction stamp the librarian uses on book records before publishing the public catalog.

Analogy 3: The Census Anonymization Office (Risk Analysis)

After collecting census data, the statistics bureau wants to publish aggregate findings without exposing any individual. k-anonymity is the rule "every published row must look identical to at least k-1 other rows on the quasi-identifiers" — like making sure no household is the only one in its ZIP+age+income bucket. l-diversity adds "and within that bucket, the sensitive column must show at least l different values" — so even if an attacker narrows you down to a group, they still can't pin the diagnosis. δ-presence answers "does this person even appear in the dataset?" — useful when the mere act of being included is itself sensitive (e.g., a list of HIV-clinic visitors).

Understanding InfoTypes

At the heart of Cloud DLP are InfoTypes. An InfoType is a type of sensitive data, such as a credit card number, an email address, or a government-issued ID.

Types of InfoTypes:

Built-in InfoTypes: Google provides over 150 pre-defined detectors for global PII, PHI, and financial data (e.g., CREDIT_CARD_NUMBER, US_SOCIAL_SECURITY_NUMBER).
Custom InfoTypes: Tailored detectors created by you.
- Regular Expressions (Regex): For patterns like employee IDs or internal project codes.
- Dictionaries: A list of specific words or phrases to search for.
- Surrogate Term InfoTypes: Used to identify tokens or placeholders from previous de-identification steps.

An InfoType is a category of sensitive data that Cloud DLP is trained to recognize using various detection techniques like pattern matching and machine learning.
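As a rough sketch of how the regex and dictionary detectors described above could be declared, the following builds a `custom_info_types` list in the snake_case dict shape accepted by the Python client. The InfoType names `EMPLOYEE_ID` and `PROJECT_CODENAME`, the pattern, and the word list are invented for illustration, and the local `re` check only approximates DLP's regex engine for simple patterns.

```python
import re

# Hypothetical detectors: names, pattern, and words are illustrative only.
EMPLOYEE_ID_PATTERN = r"EMP-\d{6}"

custom_info_types = [
    {
        # Regex detector for an internal identifier format.
        "info_type": {"name": "EMPLOYEE_ID"},
        "regex": {"pattern": EMPLOYEE_ID_PATTERN},
        "likelihood": "POSSIBLE",
    },
    {
        # Dictionary detector for a short list of internal code names.
        "info_type": {"name": "PROJECT_CODENAME"},
        "dictionary": {"word_list": {"words": ["bluebird", "nightjar"]}},
    },
]

# Built-in and custom detectors sit side by side in one inspect config.
inspect_config = {
    "info_types": [{"name": "CREDIT_CARD_NUMBER"}],
    "custom_info_types": custom_info_types,
    "include_quote": True,
}


def regex_matches_locally(pattern, text):
    """Quick local sanity check of a pattern before sending it to DLP."""
    return [m.group(0) for m in re.finditer(pattern, text)]


sample = "Badge EMP-004217 issued; EMP-12 is invalid"
local_hits = regex_matches_locally(EMPLOYEE_ID_PATTERN, sample)
```

Checking the pattern locally first is cheap, but the service's own inspection result remains the authority on what actually matches.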

Hotword Rules and Likelihood Adjustment

A raw regex or built-in detector often produces too many false positives because numbers and strings are ambiguous out of context. Hotword rules let you raise (or lower) the likelihood score of a finding when a configured phrase appears within a proximity window of the match.

Anatomy of a hotword rule

A hotword rule lives inside an InspectionRuleSet (the ruleSet field of inspectConfig), which ties the rule to one or more InfoTypes. The rule contains:

hotwordRegex — the trigger phrase, e.g. (?i)(employee\s*id|emp\s*#)
proximity — windowBefore and windowAfter measured in characters (typical: 30–50)
likelihoodAdjustment — either fixedLikelihood: VERY_LIKELY or relativeLikelihood: +2

Example: tightening US_SOCIAL_SECURITY_NUMBER

{
  "ruleSet": [{
    "infoTypes": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
    "rules": [{
      "hotwordRule": {
        "hotwordRegex": {"pattern": "(?i)(ssn|social\\s*security)"},
        "proximity": {"windowBefore": 50},
        "likelihoodAdjustment": {"fixedLikelihood": "VERY_LIKELY"}
      }
    }]
  }]
}

Without the hotword, a 9-digit invoice number might trigger POSSIBLE. With the rule, only digits preceded by "SSN:" within 50 characters get promoted to VERY_LIKELY, letting you filter the rest out via a likelihood threshold.

Hotwords change likelihood, not the regex itself: DLP still has to match the underlying InfoType pattern first. If your custom regex never fires, no hotword rule in the world will rescue it. Always validate the base detector with projects.content.inspect against sample payloads before layering hotwords on top.
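To make the proximity window concrete, here is a toy local model of the behavior described above. It is not DLP's scoring algorithm: the base pattern, the default likelihoods, and the function name `score_matches` are assumptions chosen only to show that the base match comes first and the hotword merely adjusts its likelihood.

```python
import re


def score_matches(text, base_pattern, hotword_pattern, window_before,
                  base="POSSIBLE", boosted="VERY_LIKELY"):
    """Toy model: boost a match when a hotword occurs shortly before it."""
    hotword = re.compile(hotword_pattern)
    results = []
    for match in re.finditer(base_pattern, text):
        # Only the window_before characters preceding the match are searched.
        start = max(0, match.start() - window_before)
        context = text[start:match.start()]
        likelihood = boosted if hotword.search(context) else base
        results.append((match.group(0), likelihood))
    return results


text = "Invoice 123456789 was paid. Applicant SSN: 987-65-4321 on file."
found = score_matches(
    text,
    base_pattern=r"\b\d{3}-?\d{2}-?\d{4}\b",   # simplified 9-digit shape
    hotword_pattern=r"(?i)(ssn|social\s*security)",
    window_before=50,
)
```

Both numbers match the base pattern, but only the one preceded by "SSN" is promoted, so a VERY_LIKELY threshold would keep it and drop the invoice number.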

Exclusion Rules: Suppressing False Positives

The mirror image of hotwords is the exclusion rule, which removes findings that match a noisy pattern even if the InfoType would otherwise fire. Exclusions are essential when scanning test data, fixture files, or domains where placeholders look real.

Common exclusion patterns

dictionary — suppress findings that appear in a list (e.g., [email protected], [email protected]).
regex — suppress findings matching a regex (e.g., test phone numbers 555-01\d{2}).
excludeInfoTypes — drop a finding if it overlaps with another InfoType (e.g., suppress PERSON_NAME when it also matches US_HEALTHCARE_NPI).
matchingType — MATCHING_TYPE_FULL_MATCH, MATCHING_TYPE_PARTIAL_MATCH, or MATCHING_TYPE_INVERSE_MATCH.

{
  "exclusionRule": {
    "dictionary": {"wordList": {"words": ["[email protected]", "[email protected]"]}},
    "matchingType": "MATCHING_TYPE_FULL_MATCH"
  }
}

A common PSE-exam trap: candidates assume that raising the minimum likelihood is equivalent to an exclusion rule. It is not. Likelihood thresholding drops all low-confidence findings globally, including real PII. Exclusion rules surgically remove a known false-positive cohort while keeping the likelihood floor low enough to catch genuine matches. Use exclusion rules first; only raise likelihood as a coarse last resort.
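The difference between the two controls can be illustrated with a small local simulation. The findings, addresses, and helper names below are made up; the enum values mirror the five likelihood levels DLP reports.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


# Hypothetical findings as (quote, likelihood) pairs.
findings = [
    ("qa-bot@example.test", Likelihood.LIKELY),       # known test fixture
    ("jane.doe@example.org", Likelihood.POSSIBLE),    # real address, weak context
    ("sam.lee@example.org", Likelihood.VERY_LIKELY),  # real address, strong context
]


def apply_threshold(findings, minimum):
    """Coarse control: drop everything below the likelihood floor."""
    return [quote for quote, likelihood in findings if likelihood >= minimum]


def apply_exclusion(findings, excluded):
    """Surgical control: drop only full matches against a known list."""
    blocked = {word.lower() for word in excluded}
    return [quote for quote, _ in findings if quote.lower() not in blocked]


by_threshold = apply_threshold(findings, Likelihood.LIKELY)
by_exclusion = apply_exclusion(findings, ["qa-bot@example.test"])
```

In this sample the threshold keeps the test fixture and loses a genuine low-confidence address, while the exclusion list removes only the fixture.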

Scan Jobs on BigQuery and Cloud Storage

Inspection jobs (dlp.dlpJobs.create with an inspectJob payload) handle the heavy lifting for at-rest data. Two storage targets dominate PSE scenarios: BigQuery and Cloud Storage.

BigQuery storage config

tableReference — {projectId, datasetId, tableId} identifies the target.
rowsLimit / rowsLimitPercent — sample a subset for very large tables to control cost.
sampleMethod — TOP (rows in the order BigQuery returns them) or RANDOM_START (randomly picked groups of rows, usually more representative).
identifyingFields — columns DLP echoes back so you can pinpoint the source row of any finding (critical for remediation playbooks).
Findings sink: set actions.saveFindings.outputConfig.table to write structured findings to a BigQuery table — then query findings with SQL like any other dataset.

Cloud Storage storage config

fileSet.url — gs://bucket/prefix/* glob; a trailing /* (or a trailing slash) scans that level only, so objects in subdirectories are not included.
bytesLimitPerFile / bytesLimitPerFilePercent — cap scan volume per object.
fileTypes — restrict to [CSV, TEXT_FILE, IMAGE, PDF, AVRO, WORD, EXCEL, ...].
sampleMethod — TOP reads from the start of each file; RANDOM_START jumps to a random offset.

gcloud dlp jobs create inspect \
  --inspect-template-name="projects/$PROJECT/locations/global/inspectTemplates/pii-baseline" \
  --bigquery-table="$PROJECT.analytics.events" \
  --action-save-findings-table="$PROJECT.dlp_findings.events_pii"

For the PSE exam, remember the three pillars of a scan job: (1) storageConfig (where to scan — BigQuery / GCS / Datastore / hybrid), (2) inspectConfig or inspectTemplateName (what to look for — InfoTypes, rule sets, min likelihood), and (3) actions (what to do with findings — saveFindings to BigQuery, pubSub for alerting, publishSummaryToCscc for Security Command Center, publishFindingsToCloudDataCatalog for tagging). Any exam question about DLP architecture maps to one of these three pillars.
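The three pillars map directly onto the keys of an inspect job. The sketch below assembles one as a plain dict using the snake_case field names of the Python client; the function name, project, dataset, table, and topic values are placeholders, and the row limit is arbitrary.

```python
def build_inspect_job(project, dataset, table, template_name,
                      findings_dataset, findings_table, topic=None):
    """Assemble the three parts of an inspect job as a plain dict."""
    # Pillar 1: where to scan.
    storage_config = {
        "big_query_options": {
            "table_reference": {
                "project_id": project,
                "dataset_id": dataset,
                "table_id": table,
            },
            "rows_limit": 10000,  # illustrative sampling cap
            "sample_method": "RANDOM_START",
        }
    }
    # Pillar 3: what to do with findings.
    actions = [{
        "save_findings": {
            "output_config": {
                "table": {
                    "project_id": project,
                    "dataset_id": findings_dataset,
                    "table_id": findings_table,
                }
            }
        }
    }]
    if topic:
        actions.append({"pub_sub": {"topic": topic}})
    return {
        "storage_config": storage_config,
        "inspect_template_name": template_name,  # Pillar 2: what to look for
        "actions": actions,
    }


job = build_inspect_job(
    project="my-project",
    dataset="analytics",
    table="events",
    template_name="projects/my-project/locations/global/inspectTemplates/pii-baseline",
    findings_dataset="dlp_findings",
    findings_table="events_pii",
    topic="projects/my-project/topics/dlp-alerts",
)
```

A dict like this would typically be passed as the `inspect_job` field of a `create_dlp_job` request together with a `parent`; that call is left out here so the sketch runs without credentials.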

Sample Inspection via Cloud Functions and Eventarc

For event-driven scanning — "every file uploaded to the landing bucket must be scanned before it lands in the curated bucket" — combine Cloud Functions (or Cloud Run) with the DLP API.

Architecture pattern

Eventarc trigger on google.cloud.storage.object.v1.finalized fires on each new GCS object.
The Cloud Function downloads (or streams) the object and calls dlp.projects.content.inspect with the bytes inline for files under ~0.5 MB, or kicks off an async dlp.dlpJobs.create for larger files.
Findings are routed via the function: clean files copied to the curated bucket; files with LIKELY+ findings moved to a quarantine bucket with a Pub/Sub alert to the security team.

Sample Python snippet

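import os

from google.cloud import dlp_v2

# Assumed configuration: supply these environment variables at deploy time.
PROJECT = os.environ["GOOGLE_CLOUD_PROJECT"]
INSPECT_TEMPLATE = os.environ["INSPECT_TEMPLATE"]  # full template resource name

def scan_object(event, context):
    """Inspect a newly finalized object and quarantine it on LIKELY+ findings."""
    client = dlp_v2.DlpServiceClient()
    parent = f"projects/{PROJECT}/locations/global"
    # fetch_bytes and quarantine are helpers you supply:
    # one downloads the object's bytes, the other moves it to the quarantine bucket.
    item = {"byte_item": {"type_": "TEXT_UTF8", "data": fetch_bytes(event)}}
    response = client.inspect_content(
        request={
            "parent": parent,
            "inspect_template_name": INSPECT_TEMPLATE,
            "item": item,
        }
    )
    # Likelihood is an ordered enum, so >= LIKELY covers LIKELY and VERY_LIKELY.
    if any(f.likelihood >= dlp_v2.Likelihood.LIKELY for f in response.result.findings):
        quarantine(event)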

Quota and cost considerations

The content.inspect method is billed by the volume of data inspected; cap object size before calling.
For files larger than the 0.5 MB synchronous limit, create a DlpJob with a Cloud Storage storageConfig and have the function return immediately; reconcile results from the Pub/Sub notification published when the job completes.
Use regional endpoints (us-east1, europe-west1, etc.) when data residency matters — the parent becomes projects/$PROJECT/locations/europe-west1.
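The routing decisions in this section can be sketched as three small pure functions. The byte limit below interprets the 0.5 MB figure as 512 KiB, which is an assumption, and the function names and return labels are hypothetical.

```python
# Assumed interpretation of the 0.5 MB synchronous limit quoted above.
SYNC_LIMIT_BYTES = 512 * 1024

LIKELIHOOD_ORDER = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]


def choose_scan_path(size_bytes):
    """Small objects go inline to content.inspect; larger ones get a DlpJob."""
    return "inline" if size_bytes <= SYNC_LIMIT_BYTES else "job"


def destination(finding_likelihoods, threshold="LIKELY"):
    """Quarantine when any finding is at or above the threshold."""
    floor = LIKELIHOOD_ORDER.index(threshold)
    flagged = any(LIKELIHOOD_ORDER.index(name) >= floor
                  for name in finding_likelihoods)
    return "quarantine" if flagged else "curated"


def parent_for(project, location="global"):
    """Build the parent path; pass a region when data residency matters."""
    return f"projects/{project}/locations/{location}"
```

Keeping these decisions free of API calls makes them easy to unit test separately from the function that talks to Cloud Storage and DLP.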

Inspection Jobs and Scopes

Cloud DLP can scan data in various locations using Inspection Jobs.

Cloud Storage (GCS): Scans files (CSV, JSON, PDF, Images).
BigQuery: Scans tables and columns.
Datastore: Scans entities.
Hybrid Environments: Using the DLP API, you can scan data from on-premises databases or other cloud providers by sending payloads to the service.

Likelihood Scores and Confidence Levels

DLP doesn't just say "this is a match"; it provides a Likelihood score:

VERY_UNLIKELY
UNLIKELY
POSSIBLE
LIKELY
VERY_LIKELY

When configuring an inspection job, you can set a minimum likelihood threshold to reduce noise. For example, only report matches that are LIKELY or VERY_LIKELY.

Designing Inspection Templates for Scale

Managing individual jobs for hundreds of buckets or tables is inefficient. Inspection Templates allow you to define your configuration once and reuse it across multiple jobs.

Consistency: Ensures that the same InfoTypes and likelihood levels are applied across the entire organization.
Updatability: Changing the template automatically affects all future jobs that reference it.
Separation of Duties: A security admin can create the template, while a data engineer triggers the job using that template without needing to know the specific detection logic.
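A minimal sketch of that split of responsibilities follows: one function produces the template definition a security admin would own, and another produces per-bucket job configs that only reference the template by name. The helper names and the bucket URLs are hypothetical; the dict keys follow the Python client's snake_case field names.

```python
def build_template_request(project, template_id, info_type_names, min_likelihood):
    """What the security admin defines once."""
    return {
        "parent": f"projects/{project}/locations/global",
        "template_id": template_id,
        "inspect_template": {
            "display_name": template_id,
            "inspect_config": {
                "info_types": [{"name": name} for name in info_type_names],
                "min_likelihood": min_likelihood,
            },
        },
    }


def template_name(project, template_id):
    """Resource name that jobs use to reference the shared template."""
    return f"projects/{project}/locations/global/inspectTemplates/{template_id}"


def jobs_for_buckets(project, template_id, bucket_urls):
    """What the data engineer builds: targets only, no detection logic."""
    name = template_name(project, template_id)
    return [
        {
            "storage_config": {"cloud_storage_options": {"file_set": {"url": url}}},
            "inspect_template_name": name,
        }
        for url in bucket_urls
    ]


template_request = build_template_request(
    "my-project", "pii-baseline",
    ["CREDIT_CARD_NUMBER", "US_SOCIAL_SECURITY_NUMBER"], "LIKELY",
)
jobs = jobs_for_buckets(
    "my-project", "pii-baseline",
    ["gs://landing-a/*", "gs://landing-b/*"],
)
```

Because every job carries only the template name, tightening the template changes what later jobs detect without touching any job definition.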

Hybrid Inspection and Streaming Data

For data that doesn't reside in Google Cloud managed storage, you can use the DLP API for:

Content Inspection: Sending small snippets of text or images for immediate analysis.
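Hybrid Jobs: An agent you run next to the data (on-premises or in another cloud) sends samples to a hybrid job or job trigger through the hybridInspect method; Cloud DLP inspects them and keeps the findings in Google Cloud for centralized reporting.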
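As a sketch of what an agent might send for a hybrid job, the function below wraps rows read from an external source into request dicts, one per row. The field names follow my understanding of the hybrid inspect request shape (`hybrid_item`, `finding_details`, `row_offset`, `labels`) and should be checked against the API reference; the job name, label key, and rows are placeholders.

```python
def build_hybrid_requests(job_name, rows, source_label):
    """Wrap externally read rows as hybrid inspect request dicts."""
    requests = []
    for row_number, text in enumerate(rows):
        requests.append({
            "name": job_name,
            "hybrid_item": {
                # The content DLP will inspect.
                "item": {"value": text},
                # Metadata that lets you trace a finding back to its source.
                "finding_details": {
                    "row_offset": row_number,
                    "labels": {"source": source_label},
                },
            },
        })
    return requests


hybrid_requests = build_hybrid_requests(
    job_name="projects/my-project/locations/global/dlpJobs/onprem-crm",
    rows=["Jane Doe, 987-65-4321", "order 42 shipped"],
    source_label="onprem-crm",
)
```

The labels and offsets matter because the findings land in Google Cloud while the data itself stays outside it; without them a finding cannot be traced to a row.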

Risk Analysis: Beyond Simple Detection

Risk analysis helps you understand the privacy risk of a dataset even if it has been partially de-identified.

k-anonymity: Measures re-identification risk by checking that each record is indistinguishable from at least k-1 other records on its quasi-identifiers.
l-diversity: Ensures that sensitive attributes have at least l distinct values within each "quasi-identifier" group, preventing "homogeneity attacks."
δ-presence (Delta-presence): Estimates the probability that a specific individual is part of the dataset.

Risk analysis is crucial for Open Data initiatives. It helps you determine if a "sanitized" dataset is truly safe to publish.
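The first two metrics are simple enough to compute by hand on a small table. The sketch below is a local illustration of the definitions, not how Cloud DLP implements risk analysis, and the rows are invented.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_ids):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_ids)].append(row)
    return groups


def k_anonymity(rows, quasi_ids):
    """Size of the smallest group: every row hides among at least k rows."""
    return min(len(g) for g in equivalence_classes(rows, quasi_ids).values())


def l_diversity(rows, quasi_ids, sensitive):
    """Fewest distinct sensitive values found in any group."""
    return min(
        len({row[sensitive] for row in group})
        for group in equivalence_classes(rows, quasi_ids).values()
    )


rows = [
    {"zip": "941**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "941**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "941**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "100**", "age": "40-49", "diagnosis": "diabetes"},
    {"zip": "100**", "age": "40-49", "diagnosis": "diabetes"},
]

k = k_anonymity(rows, ["zip", "age"])
l = l_diversity(rows, ["zip", "age"], "diagnosis")
```

Here the table is 2-anonymous, yet its l-diversity is 1: anyone known to be in the second group is revealed to have diabetes, which is the homogeneity attack that l-diversity is meant to catch.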

Managing Job Triggers and Schedules

You can automate discovery using Job Triggers.

Schedules: Run a scan every day or every week to catch new sensitive data.
Event-based: Start a scan when a new file is uploaded to a specific GCS bucket, using Cloud Functions or Eventarc to call the DLP API.
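A scheduled trigger is essentially an inspect job plus a recurrence period. The sketch below builds such a definition as a plain dict with the Python client's snake_case field names; the helper name, display name, and bucket are placeholders, and the one-day minimum enforced here is a guard for this example.

```python
SECONDS_PER_DAY = 86400


def build_job_trigger(display_name, inspect_job, every_days):
    """Wrap an inspect job in a schedule that recurs every N days."""
    if every_days < 1:
        raise ValueError("recurrence must be at least one day in this sketch")
    return {
        "display_name": display_name,
        "inspect_job": inspect_job,
        "triggers": [
            {
                "schedule": {
                    "recurrence_period_duration": {
                        "seconds": every_days * SECONDS_PER_DAY
                    }
                }
            }
        ],
        "status": "HEALTHY",
    }


weekly = build_job_trigger(
    display_name="weekly-landing-scan",
    inspect_job={
        "storage_config": {
            "cloud_storage_options": {"file_set": {"url": "gs://landing-a/*"}}
        },
        "inspect_template_name":
            "projects/my-project/locations/global/inspectTemplates/pii-baseline",
    },
    every_days=7,
)
```

The resulting dict would be submitted as the `job_trigger` field of a create request; the event-based path instead starts a job from code each time an object arrives.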

Integration with Security Command Center (SCC)

DLP findings can be sent directly to Security Command Center.

Provides a "single pane of glass" for security posture.
Allows for automated alerting and remediation (e.g., if DLP finds PII in a public bucket, the SCC finding can feed a workflow that restricts access).

PSE Exam Scenarios

Scenario 1: Minimizing False Positives

"A security engineer is scanning a BigQuery table for customer IDs. The built-in detectors are flagging too many unrelated numbers. How can the engineer improve accuracy?" Answer: 1. Create a Custom InfoType with a more specific regex for the customer ID format. 2. Use Hotwords (contextual clues like 'ID' or 'Account' appearing near the number) to increase the confidence score. 3. Add exclusion rules for known false-positive values. 4. Raise the minimum likelihood threshold only as a coarse last resort, because it also drops genuine low-confidence matches.

Scenario 2: Compliance for On-Prem Data

"A company needs to identify PII in an on-premises SQL database but wants to use Google Cloud DLP for classification. What is the best approach?" Answer: Use the DLP API with a Hybrid Job. An on-premises agent reads the data and sends it in small payloads to Cloud DLP through the hybridInspect method; Cloud DLP performs the classification and stores the findings in Google Cloud for centralized reporting.

Summary Checklist

Define the difference between Built-in and Custom InfoTypes.
Explain the five levels of Likelihood.
Describe the benefits of using Inspection Templates.
Contrast k-anonymity and l-diversity in risk analysis.
Identify the services where DLP can natively perform inspection jobs.
List the ways to trigger a DLP job automatically.
