Cloud DLP: De-identification and Masking

Introduction to De-identification

Discovering sensitive data is only half the battle. Once identified, you must decide how to protect it while still allowing your organization to derive value from it. De-identification is the process of removing or transforming sensitive information so that it can no longer be associated with an individual.

For a Professional Cloud Security Engineer (PSE), mastering the various transformation techniques in Cloud DLP is essential for enabling secure analytics, machine learning, and data sharing across different environments.

Plain English Explanation

Analogy 1: The Library's Restricted Section

Think of Cloud DLP de-identification like a library that keeps rare manuscripts behind glass. Visitors can still read a photocopy (the transformed dataset) — but the photocopy has names blacked out (redaction), credit card numbers half-covered (masking), or page numbers replaced with library codes (tokenization). The original manuscript stays locked away, accessible only to authorized librarians holding the KMS-wrapped key.

Analogy 2: The Airline's Passenger Manifest

Format-Preserving Encryption (FPE) is like an airline replacing a passenger's real passport number with a fake one that still has the same shape: 9 digits in, 9 digits out. Boarding gate scanners (your legacy applications) don't crash because the length and character set are intact, but a stolen manifest reveals nothing about real travelers. Only the shape survives, though; a check digit in the original will generally not validate in the token. Date Shifting is similar: a flight log shifted by 47 days still shows the gaps between flights (useful for fraud detection) without exposing when the trip actually happened.

Analogy 3: The Hospital's Research Data Room

Hospitals share anonymized records with researchers via bucketing/generalization. Instead of "Patient A, age 27, ZIP 94043", the record becomes "Patient A, age 20-30, ZIP 940**". Researchers can still calculate "average age of diabetic patients" but cannot single out one person — that's k-anonymity in action. The DLP De-identification Template is the hospital's IRB-approved recipe book that ensures every dataset leaving the room gets the same consistent treatment.

Core Transformation Techniques

Cloud DLP offers several ways to transform data, depending on the required level of privacy and data utility.

1. Masking and Redaction

Redaction: Completely removes the sensitive value (e.g., replacing "John Doe" with [REDACTED]).
Masking: Replaces a portion of the data with a fixed character (e.g., 4532 XXXX XXXX 1234).
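The difference between the two can be sketched locally in plain Python. This only illustrates the effect on a value; it is not how Cloud DLP implements the transformations, and the helper names redact and mask_card are hypothetical.

```python
import re


def redact(text: str, pattern: str, placeholder: str = "[REDACTED]") -> str:
    """Redaction: replace every match of `pattern` with a fixed placeholder."""
    return re.sub(pattern, placeholder, text)


def mask_card(number: str, keep_first: int = 4, keep_last: int = 4,
              mask_char: str = "X") -> str:
    """Masking: hide the middle digits, keep separators and the outer digits."""
    total_digits = sum(ch.isdigit() for ch in number)
    seen = 0
    out = []
    for ch in number:
        if ch.isdigit():
            seen += 1
            visible = seen <= keep_first or seen > total_digits - keep_last
            out.append(ch if visible else mask_char)
        else:
            out.append(ch)  # spaces and dashes pass through unchanged
    return "".join(out)


redacted = redact("Contact John Doe today", r"John Doe")
masked = mask_card("4532 1111 2222 1234")
```

Redaction leaves nothing of the value behind, while masking keeps enough (here the first and last four digits) for a human to recognize the record.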

2. Pseudonymization and Tokenization

Pseudonymization replaces identifying fields within a data record with one or more artificial identifiers (surrogate values, often called "tokens" or pseudonyms).

Deterministic: The same input and key always produce the same output. This allows for JOIN operations in BigQuery across different datasets.
Non-deterministic: The same input produces different outputs each time. More secure, but destroys data utility for joining.
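Why determinism matters for joins can be shown with a small standard-library sketch. It uses a keyed hash as a stand-in for a deterministic tokenizer and a random value as a stand-in for a non-deterministic one; the function names, the demo key, and the sample data are all assumptions for illustration, not Cloud DLP behavior.

```python
import hashlib
import hmac
import secrets


def deterministic_token(value: str, key: bytes) -> str:
    """Keyed hash: the same (key, value) pair always yields the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def random_token() -> str:
    """Fresh random token on every call, so it cannot serve as a join key."""
    return secrets.token_hex(8)


key = b"demo-key-not-for-production"
orders = {"alice@example.com": 3, "bob@example.com": 1}
tickets = {"alice@example.com": 2}

# Tokenize the join column in both datasets with the same key.
orders_tok = {deterministic_token(k, key): v for k, v in orders.items()}
tickets_tok = {deterministic_token(k, key): v for k, v in tickets.items()}

# The shared customer is still discoverable without exposing the email.
joined = set(orders_tok) & set(tickets_tok)
```

With random_token in place of deterministic_token, the two token sets would share no members and the join would return nothing.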

3. Format-Preserving Encryption (FPE)

FPE encrypts a value while maintaining its original format and length.

Example: A 16-digit credit card number remains a 16-digit number after encryption.
Benefit: Allows legacy applications that expect a specific format to continue functioning without modification.
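As a rough sketch, an FPE transformation for card numbers might be expressed as the following request fragment for the google-cloud-dlp Python client. The builder name fpe_transformation, the surrogate name CARD_TOKEN, and the key values are assumptions for illustration; the dictionary keys follow the DLP API's snake_case field names.

```python
def fpe_transformation(wrapped_key: bytes, kms_key_name: str,
                       surrogate: str = "CARD_TOKEN") -> dict:
    """Build one info-type transformation that applies FFX-FPE to card numbers.

    `wrapped_key` is a data key already encrypted by the Cloud KMS key
    identified by `kms_key_name` (both supplied by the caller).
    """
    return {
        "info_types": [{"name": "CREDIT_CARD_NUMBER"}],
        "primitive_transformation": {
            "crypto_replace_ffx_fpe_config": {
                "crypto_key": {
                    "kms_wrapped": {
                        "wrapped_key": wrapped_key,
                        "crypto_key_name": kms_key_name,
                    }
                },
                # Digits in, digits out: the output stays in the same alphabet.
                "common_alphabet": "NUMERIC",
                # Label attached to tokens so they can be located later.
                "surrogate_info_type": {"name": surrogate},
            }
        },
    }


example = fpe_transformation(
    wrapped_key=b"\x00" * 32,  # placeholder bytes, not a real wrapped key
    kms_key_name="projects/my-project/locations/global/keyRings/r/cryptoKeys/k",
)
```

The fragment would sit inside deidentify_config.info_type_transformations.transformations of a de-identify request; the input must contain only characters from the chosen alphabet.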

Advanced Transformations

Date Shifting

Instead of removing dates (which might be needed for longitudinal studies), you can shift them by a random number of days.

The shift is consistent for a specific user within a specific dataset if you supply a Crypto Key together with a context field (for example, a user ID column).
Maintains the sequence and spacing of events while obscuring the actual dates.
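The idea of a keyed, per-context shift can be sketched locally as follows. The offset derivation (an HMAC of the context value) is an assumption chosen for illustration and is not Cloud DLP's actual algorithm; shift_for and shift_dates are hypothetical helpers.

```python
import hashlib
import hmac
from datetime import date, timedelta


def shift_for(context: str, key: bytes, max_days: int = 100) -> int:
    """Derive a stable offset in [-max_days, max_days] from a key and a context."""
    digest = hmac.new(key, context.encode("utf-8"), hashlib.sha256).digest()
    span = 2 * max_days + 1
    return int.from_bytes(digest[:8], "big") % span - max_days


def shift_dates(dates: list[date], context: str, key: bytes) -> list[date]:
    """Shift every date belonging to one context by the same offset."""
    offset = timedelta(days=shift_for(context, key))
    return [d + offset for d in dates]


key = b"demo-key-not-for-production"
visits = [date(2024, 1, 10), date(2024, 1, 24), date(2024, 3, 1)]
shifted = shift_dates(visits, context="patient-42", key=key)

# Intervals between events are identical before and after the shift.
gaps_before = [(b - a).days for a, b in zip(visits, visits[1:])]
gaps_after = [(b - a).days for a, b in zip(shifted, shifted[1:])]
```

Because every date for the same context moves by one offset, ordering and spacing survive while the absolute dates do not.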

Bucketing (Generalization)

Replaces a specific value with a range or a less specific value.

Example: Replacing age 27 with the bucket 20-30.
Benefit: Reduces the risk of re-identification (k-anonymity) while preserving statistical trends.
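A small local sketch of generalization, plus a check of the resulting group sizes, might look like this. The helpers bucket_age and generalize_zip and the sample records are hypothetical; the k value is simply the size of the smallest group of records that share the same generalized attributes.

```python
from collections import Counter


def bucket_age(age: int, size: int = 10) -> str:
    """Map an age to a fixed-size bucket (lower bound inclusive, upper exclusive)."""
    low = (age // size) * size
    return f"{low}-{low + size}"


def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep the leading digits of a ZIP code and star out the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


records = [(27, "94043"), (24, "94041"), (29, "94040"), (35, "94043")]
groups = Counter((bucket_age(a), generalize_zip(z)) for a, z in records)

# k is the size of the smallest group; k == 1 means someone is still unique.
k = min(groups.values())
```

In this sample the three people in their twenties blend together, but the single 35-year-old remains alone in their group, so wider buckets (or suppression of that record) would be needed to reach k of 2 or more.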

When choosing a transformation, there is always a trade-off between Privacy and Utility. Redaction offers maximum privacy but zero utility, while Bucketing offers high utility but lower privacy.

Managing De-identification Keys

Most cryptographic transformations (FPE, Tokenization, Date Shifting) require a key.

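Transient Keys: Generated on-the-fly and then discarded. Use this if you never need to reverse the transformation.
Unwrapped Keys: A raw key supplied directly in the request. Insecure; suitable only for development.
Wrapped Keys (KMS): A data key that is wrapped (encrypted) by a key managed in Cloud KMS. This allows for Re-identification (unmasking) if authorized.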

If you use a transient key for tokenization, you cannot perform join operations across different DLP jobs because the same input will map to different tokens.
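In request terms, the key choice is a single field on the crypto transformation. The fragment builders below are a sketch for the google-cloud-dlp Python client; the function names, the key name, and the surrogate name EMAIL_TOKEN are assumptions, while the dictionary keys follow the DLP API's CryptoKey message.

```python
def transient_key(name: str) -> dict:
    """Key generated by the service for the request and then discarded."""
    return {"transient": {"name": name}}


def kms_wrapped_key(wrapped_key: bytes, kms_key_name: str) -> dict:
    """Data key encrypted by a Cloud KMS key; reusable across jobs."""
    return {
        "kms_wrapped": {
            "wrapped_key": wrapped_key,
            "crypto_key_name": kms_key_name,
        }
    }


def deterministic_transformation(crypto_key: dict,
                                 surrogate: str = "EMAIL_TOKEN") -> dict:
    """Deterministic encryption using whichever key fragment is passed in."""
    return {
        "primitive_transformation": {
            "crypto_deterministic_config": {
                "crypto_key": crypto_key,
                "surrogate_info_type": {"name": surrogate},
            }
        }
    }


one_off = deterministic_transformation(transient_key("job-2024-01"))
repeatable = deterministic_transformation(
    kms_wrapped_key(b"\x00" * 32,  # placeholder, not a real wrapped key
                    "projects/my-project/locations/global/keyRings/r/cryptoKeys/k")
)
```

Only the second variant produces tokens that line up across separate jobs, because both jobs can present the same wrapped key.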

De-identification Templates

Similar to inspection templates, De-identification Templates allow you to standardize your transformation logic.

Define which InfoTypes should be masked, which should be encrypted with FPE, and which should be bucketed.
Can be referenced in DLP jobs or when using the DLP API for real-time de-identification.
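A template body could be sketched as below, assuming the google-cloud-dlp Python client is installed and credentials are configured when create_template is actually called. The display name, the chosen InfoTypes, and both function names are illustrative assumptions.

```python
def build_template() -> dict:
    """Template body: mask phone numbers, replace emails with their InfoType name."""
    return {
        "display_name": "pii-standard",
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "info_types": [{"name": "PHONE_NUMBER"}],
                        "primitive_transformation": {
                            "character_mask_config": {"masking_character": "#"}
                        },
                    },
                    {
                        "info_types": [{"name": "EMAIL_ADDRESS"}],
                        "primitive_transformation": {
                            "replace_with_info_type_config": {}
                        },
                    },
                ]
            }
        },
    }


def create_template(project_id: str, template_id: str) -> str:
    """Store the template in the project and return its resource name."""
    from google.cloud import dlp_v2  # imported lazily; needs google-cloud-dlp

    client = dlp_v2.DlpServiceClient()
    response = client.create_deidentify_template(
        request={
            "parent": f"projects/{project_id}/locations/global",
            "deidentify_template": build_template(),
            "template_id": template_id,
        }
    )
    return response.name


template = build_template()
```

Later requests can then pass the returned resource name instead of repeating the transformation logic inline.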

Secure Analytics with BigQuery

A common pattern for PSEs is to integrate DLP with BigQuery:

DLP Proxy: Use a Cloud Function to intercept data, call the DLP API for de-identification, and then store the "sanitized" data in BigQuery.
Remote Functions: Call the DLP API directly from within a SQL query in BigQuery to mask data for specific users.
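The remote-function side of this pattern can be sketched as a small handler. BigQuery remote functions send a JSON body whose calls field holds one argument list per row and expect a replies list of the same length; the handler name and the mask_email stand-in (used here instead of a real DLP call) are assumptions for illustration.

```python
import json
from typing import Callable


def handle_remote_call(body: str, transform: Callable[[str], str]) -> str:
    """Apply `transform` to the first argument of every row in the batch."""
    request = json.loads(body)
    replies = [
        transform(args[0]) if args[0] is not None else None
        for args in request["calls"]
    ]
    return json.dumps({"replies": replies})


def mask_email(value: str) -> str:
    """Stand-in for a DLP de-identify call: hide the local part of an email."""
    local, _, domain = value.partition("@")
    return "*" * len(local) + "@" + domain


body = json.dumps({"calls": [["a@x.com"], [None], ["bob@y.org"]]})
response = json.loads(handle_remote_call(body, mask_email))
```

In a deployed function, transform would wrap a batched DLP request so that one HTTP invocation from BigQuery results in one DLP call rather than one per row.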

Performance and Optimization

Calling the DLP API for every single row in a large database can be slow and expensive.

Batching: Send multiple records in a single API call.
Transformation at the Source: De-identify data before it reaches the cloud to minimize exposure.
Sampling: Use inspection jobs to find where PII exists, and then only apply de-identification to those specific columns.
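Batching can be sketched by packing many rows into a single DLP table item. The chunk size of 2 below is arbitrary and the helper names are hypothetical; real batch sizes should be chosen against the service's current request size limits.

```python
from typing import Iterator


def chunked(rows: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def to_table_item(rows: list[dict], columns: list[str]) -> dict:
    """Pack rows into the `item.table` shape accepted by de-identify requests."""
    return {
        "table": {
            "headers": [{"name": c} for c in columns],
            "rows": [
                {"values": [{"string_value": str(r[c])} for c in columns]}
                for r in rows
            ],
        }
    }


rows = [{"id": i, "email": f"user{i}@example.com"} for i in range(5)]
# Only the column flagged by inspection is sent, not the whole row.
items = [to_table_item(batch, ["email"]) for batch in chunked(rows, size=2)]
```

Five rows become three requests instead of five, and only the sensitive column travels to the API.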

Sensitive Data Protection: Transformation Method Selection

Cloud DLP (rebranded as Sensitive Data Protection in 2023) exposes the seven PrimitiveTransformation types compared below via the content.deidentify API. Picking the wrong one wastes utility or leaks data, and the PSE exam tests this matrix directly.

Decision Matrix

API Field	Mechanism	Reversible?	Preserves Format	Preserves Joins	Typical InfoType
`replaceConfig`	static string	No	No	No	`EMAIL_ADDRESS` → `[EMAIL]`
`characterMaskConfig`	char + count	No	Partially	No	`CREDIT_CARD_NUMBER` → `****1234`
`cryptoReplaceFfxFpeConfig`	FFX-FPE w/ KMS	Yes (with key)	Yes	Yes	`US_SOCIAL_SECURITY_NUMBER`
`cryptoDeterministicConfig`	AES-SIV w/ KMS	Yes (with key)	No	Yes	`PERSON_NAME` for analytics
`cryptoHashConfig`	HMAC-SHA-256	No	No	Yes (same key)	`EMAIL_ADDRESS` for tracking
`dateShiftConfig`	±N days, context-keyed	No	Yes (date)	Yes (per context)	`DATE_OF_BIRTH` for cohorts
`redactConfig`	drop value	No	No	No	Free-text PII removal

When Each Wins

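cryptoReplaceFfxFpeConfig (FPE): The right choice when a downstream mainframe or legacy COBOL system validates the length and character set of a field. Uses the FFX mode of operation (see NIST SP 800-38G). FPE preserves length and alphabet, not check digits, so a Luhn-valid card number will generally not produce a Luhn-valid token. Input must be ASCII drawn from the configured alphabet (such as NUMERIC or ALPHA_NUMERIC); spaces and Unicode are not supported.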
cryptoDeterministicConfig (AES-SIV): Best default for BigQuery analytics joins. No format constraints — output is an opaque base64-ish token. Cannot be used for legacy format compatibility.
cryptoHashConfig: One-way HMAC. Use for GA-style anonymous user IDs where you never need to recover the original — same email always hashes to same value (per key), enabling cohort counts without storing PII.
dateShiftConfig with context: Pass a FieldId (e.g., patient_id) as context — DLP derives a per-patient offset so all dates for that patient shift by the same amount, preserving longitudinal study integrity.
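For structured data, the context-keyed date shift described above might be configured as in the following sketch for the google-cloud-dlp Python client. The builder name, the column names, the 100-day bound, and the transient key name are assumptions for illustration.

```python
def date_shift_config(date_fields: list[str], context_field: str,
                      crypto_key: dict, bound_days: int = 100) -> dict:
    """Build a record transformation that shifts date columns per context value."""
    return {
        "record_transformations": {
            "field_transformations": [
                {
                    "fields": [{"name": f} for f in date_fields],
                    "primitive_transformation": {
                        "date_shift_config": {
                            "upper_bound_days": bound_days,
                            "lower_bound_days": -bound_days,
                            # Rows sharing this column's value get one offset.
                            "context": {"name": context_field},
                            "crypto_key": crypto_key,
                        }
                    },
                }
            ]
        }
    }


config = date_shift_config(
    date_fields=["admit_date", "discharge_date"],
    context_field="patient_id",
    crypto_key={"transient": {"name": "cohort-export"}},
)
```

The context column itself is usually tokenized or dropped in the same request, since leaving it in clear text would defeat the purpose of shifting the dates.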

Only FFX-FPE and deterministic encryption can be re-identified via content.reidentify, and only if the original CryptoKey (Unwrapped or KMS-Wrapped) is still available; a Transient key is discarded after the request, so its output cannot be reversed. Date shifting is not reversible through the API. Hashing is mathematically one-way: if you lose the original value, it cannot be recovered from the hash.
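A re-identification request for deterministically encrypted tokens might be assembled as in this sketch, assuming the google-cloud-dlp Python client. The surrogate name EMAIL_TOKEN, the function names, the placeholder token text, and the key values are illustrative; the key and surrogate must match whatever was used at de-identification time.

```python
def reidentify_request(project_id: str, text: str, crypto_key: dict,
                       surrogate: str = "EMAIL_TOKEN") -> dict:
    """Build a request that reverses deterministic encryption in free text."""
    return {
        "parent": f"projects/{project_id}/locations/global",
        "reidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "primitive_transformation": {
                            "crypto_deterministic_config": {
                                "crypto_key": crypto_key,
                                "surrogate_info_type": {"name": surrogate},
                            }
                        }
                    }
                ]
            }
        },
        # Tell inspection to look for tokens labelled with the surrogate type.
        "inspect_config": {
            "custom_info_types": [
                {"info_type": {"name": surrogate}, "surrogate_type": {}}
            ]
        },
        "item": {"value": text},
    }


def reidentify(request: dict) -> str:
    """Send the request; needs google-cloud-dlp and permission on the KMS key."""
    from google.cloud import dlp_v2  # imported lazily

    client = dlp_v2.DlpServiceClient()
    return client.reidentify_content(request=request).item.value


request = reidentify_request(
    project_id="my-project",
    text="Contact: EMAIL_TOKEN(52):placeholder-token-text",
    crypto_key={
        "kms_wrapped": {
            "wrapped_key": b"\x00" * 32,  # placeholder, not a real wrapped key
            "crypto_key_name": "projects/my-project/locations/global/keyRings/r/cryptoKeys/k",
        }
    },
)
```

Access to re-identification is then controlled in two places: who may call the API and who may use the wrapping key in Cloud KMS.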

For BigQuery-native pipelines, skip the DLP API entirely for low-sensitivity columns and use BigQuery's built-in encryption functions with a KMS-wrapped keyset: DETERMINISTIC_ENCRYPT (AES-SIV keysets) gives deterministic, joinable ciphertext, while AEAD.ENCRYPT (AES-GCM keysets) is non-deterministic. Both run at SQL speed without per-row API quotas. Reserve the DLP API for InfoType discovery and complex multi-column templates.

Seven transformations, two reversibility classes:

Reversible (need key): FPE, Deterministic
Irreversible: Redact, Replace, Mask, Hash, Date Shift

Key types for crypto transformations:

Transient: DLP generates and discards the key (no re-identify possible)
Unwrapped: you supply the raw key, base64-encoded, in the request (insecure, dev only)
KMS-Wrapped: production standard; the data key is unwrapped through Cloud KMS, which requires the cloudkms.cryptoKeyVersions.useToDecrypt permission on the wrapping key

PSE Exam Scenarios

Scenario 1: Preserving Data Utility for Analytics

"A data scientist needs to analyze customer transaction patterns in BigQuery. The table contains credit card numbers. How can you protect the numbers while allowing the scientist to count unique customers?" Answer: Use Deterministic Tokenization or Format-Preserving Encryption (FPE). Both will ensure that the same credit card number always maps to the same token, allowing for COUNT(DISTINCT card_token) operations.

Scenario 2: Re-identifying Data for Support

"A customer support agent needs to see the original email address of a user who was de-identified in a dataset. How can this be achieved?" Answer: De-identify with a reversible transformation (deterministic encryption or FPE) and a KMS-wrapped Crypto Key. Then, authorize the support agent's identity to perform the Re-identify operation using the same key.

Summary Checklist

List the differences between Redaction, Masking, and Tokenization.
Explain why Format-Preserving Encryption (FPE) is useful for legacy apps.
Describe how Date Shifting maintains event sequences.
Differentiate between Deterministic and Non-deterministic encryption.
Identify the three types of keys used for transformations (Transient, Unwrapped, KMS-wrapped).
Explain the trade-off between data privacy and data utility.
