Security Monitoring and Alerting

Introduction to Security Monitoring

In a complex cloud environment, "you cannot secure what you do not monitor." For a Professional Cloud Security Engineer (PSE), Cloud Monitoring is the toolset used to gain visibility into the health and security of your infrastructure. While Security Command Center (SCC) focuses on findings and threats, Cloud Monitoring focuses on performance metrics, system health, and log-based patterns that might indicate a slow-moving attack or a misconfiguration.

The goal of security monitoring is to reduce the "Mean Time to Detect" (MTTD) by ensuring that the right people are alerted at the right time.

Plain English Explanation

1. The Dashboard of a Car (Security Dashboards)

When you drive, you don't look under the hood. You look at the dashboard. It tells you your speed (Traffic volume), your fuel level (Quota usage), and if the engine light is on (System errors). A Security Dashboard in GCP provides this high-level view of your cloud environment's "health."

2. The Smoke Detector (Log-based Alerts)

A smoke detector doesn't wait for a fire to start; it looks for a specific pattern (smoke particles). A Log-based Alert is like that detector. It scans millions of log entries for a specific pattern, like "100 failed login attempts in 1 minute," and screams (sends an alert) as soon as it sees it.

3. The Neighborhood Watch (Uptime Checks)

Imagine a neighbor who walks by your house every hour to make sure the front door is closed. If they see it open, they call you. Uptime Checks do this for your web applications. They "ping" your security endpoints from different locations around the world to ensure they are available and responding correctly.

Creating Security-Focused Dashboards

A security dashboard should provide a high-level view of risks across the organization.

Key Metrics to Include:
- Number of active SCC "Critical" findings.
- VPC Flow Log volume (to detect spikes in traffic).
- IAM policy change frequency.
- API Error rates (4xx and 5xx errors can indicate probing).
- Service Account usage spikes.

Use Custom Dashboards in Cloud Monitoring to group metrics by "Security Domain" (e.g., Network Security, IAM, Data Access).

Log-Based Metrics and Alerts

Sometimes, the information you need isn't a metric (like CPU usage) but a specific string in a log.

Counter Metrics: Count the number of times a specific log entry appears (e.g., "Access Denied" in BigQuery).
Distribution Metrics: Track the size or latency of events (e.g., "Size of objects downloaded from a sensitive GCS bucket").
Alerting on Metrics: Once you have a metric, you can set a threshold. "If the 'Access Denied' metric > 50 in 5 minutes, trigger an alert."

Log-based Metrics are Cloud Monitoring metrics that are based on the content of log entries in Cloud Logging.
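
The two metric types above can be sketched as request bodies for the Cloud Logging metrics API. This is a minimal sketch, not a definitive configuration: the metric names, the bucket name, and the `jsonPayload.bytes_sent` field are hypothetical, and the filters assume your audit logs carry the standard `protoPayload` fields (status code 7 is PERMISSION_DENIED).

```python
def counter_metric(name: str, log_filter: str, description: str = "") -> dict:
    """Counter metric body: counts log entries that match log_filter."""
    return {
        "name": name,
        "description": description,
        "filter": log_filter,
        "metricDescriptor": {"metricKind": "DELTA", "valueType": "INT64"},
    }


def distribution_metric(name: str, log_filter: str, value_field: str) -> dict:
    """Distribution metric body: builds a histogram of a numeric log field."""
    return {
        "name": name,
        "filter": log_filter,
        "valueExtractor": f"EXTRACT({value_field})",
        "metricDescriptor": {"metricKind": "DELTA", "valueType": "DISTRIBUTION"},
        # Bucket boundaries at 1 KiB, 2 KiB, 4 KiB, ... (illustrative sizing).
        "bucketOptions": {
            "exponentialBuckets": {
                "numFiniteBuckets": 20,
                "growthFactor": 2.0,
                "scale": 1024.0,
            }
        },
    }


access_denied = counter_metric(
    "bigquery_access_denied",  # hypothetical metric name
    'protoPayload.serviceName="bigquery.googleapis.com" '
    "AND protoPayload.status.code=7",
)
download_size = distribution_metric(
    "sensitive_bucket_download_bytes",  # hypothetical metric name
    'resource.type="gcs_bucket" '
    'AND resource.labels.bucket_name="sensitive-bucket"',  # hypothetical bucket
    "jsonPayload.bytes_sent",  # hypothetical field; depends on your log format
)
```

Once created, a user-defined metric of this kind appears in Cloud Monitoring under the `logging.googleapis.com/user/` prefix, which is the name an alert policy filter would reference.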

Notification Channels and Incident Management

An alert is useless if nobody sees it.

Channels: GCP supports Email, SMS, Slack, PagerDuty, Webhooks, Pub/Sub, and the Google Cloud Mobile App.
Best Practice: Use Slack or PagerDuty for high-priority security alerts to ensure immediate visibility. Use Email for low-priority, informational alerts.

Monitoring API Usage and Quotas

Attackers often probe APIs or try to spin up hundreds of VMs for cryptomining.

Quota Monitoring: Set alerts on serviceruntime.googleapis.com/quota/exceeded. A sudden spike in quota exceeded errors is a strong indicator of an automated attack or an out-of-control script.
API Anomalies: Monitor for unusual API calls (e.g., a "Read" heavy service account suddenly performing "Delete" operations).
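
The "read-heavy account suddenly deleting" idea can be illustrated with a small baseline comparison. This is a simplified sketch of the logic only, not a Cloud Monitoring feature: it assumes you have already extracted `(principal, method)` pairs from audit logs, and the service account and method names are illustrative.

```python
from collections import defaultdict


def build_baseline(events):
    """Map each principal to the set of API methods it has been seen calling."""
    baseline = defaultdict(set)
    for principal, method in events:
        baseline[principal].add(method)
    return baseline


def unusual_calls(baseline, new_events):
    """Return (principal, method) pairs whose method is not in the baseline."""
    return [
        (principal, method)
        for principal, method in new_events
        if method not in baseline.get(principal, set())
    ]


# Illustrative history: a service account that only ever reads objects.
history = [
    ("reader@example.iam.gserviceaccount.com", "storage.objects.get"),
    ("reader@example.iam.gserviceaccount.com", "storage.objects.list"),
]
today = [
    ("reader@example.iam.gserviceaccount.com", "storage.objects.get"),
    ("reader@example.iam.gserviceaccount.com", "storage.objects.delete"),
]
suspicious = unusual_calls(build_baseline(history), today)
```

Here `suspicious` contains only the delete call, since it is the one method the account has never used before. A real deployment would also need a policy for brand-new principals, which this sketch flags on every call.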

Uptime Checks for Security Endpoints

Uptime checks are not just for availability; they can also verify integrity.

Scenario: You have a critical internal security API. You configure an uptime check to verify that it returns 200 OK and that the response body contains a specific string. If an attacker replaces your API with a malicious one that doesn't return that string, the uptime check fails and you are alerted.

Uptime checks can be configured to originate from multiple geographic regions to ensure global availability.
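
An integrity-style uptime check might look roughly like the following request body for the Cloud Monitoring uptime check API. Treat it as a sketch under assumptions: the project, host, path, and marker string are hypothetical, and it presumes the endpoint is reachable by Google's public checkers over HTTPS.

```python
def uptime_check_config(project_id: str, host: str, path: str, expected_text: str) -> dict:
    """Build an HTTPS uptime check that also matches a string in the response."""
    return {
        "displayName": f"integrity-check-{host}",
        "monitoredResource": {
            "type": "uptime_url",
            "labels": {"project_id": project_id, "host": host},
        },
        "httpCheck": {"path": path, "port": 443, "useSsl": True, "validateSsl": True},
        # The check fails if the response does not contain expected_text.
        "contentMatchers": [{"content": expected_text, "matcher": "CONTAINS_STRING"}],
        "period": "300s",   # run every 5 minutes
        "timeout": "10s",
        # Probe from several regions so a single-region outage is not mistaken
        # for a global failure.
        "selectedRegions": ["USA", "EUROPE", "ASIA_PACIFIC"],
    }


config = uptime_check_config(
    "my-security-project",        # hypothetical project ID
    "secapi.example.com",         # hypothetical host
    "/healthz",                   # hypothetical path
    "secapi-build-7f3a",          # hypothetical marker string
)
```

Choosing a marker string that an impostor service is unlikely to reproduce is what turns a plain availability probe into a basic integrity signal.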

Managing Alert Fatigue

One of the biggest risks for a PSE is Alert Fatigue: receiving so many alerts that you start ignoring them.

Refine Thresholds: Don't alert on every single failure. Use "M of N" logic (e.g., "Alert if 3 failures occur within 5 minutes").
Grouping: Group related alerts into a single incident to reduce noise.
Auto-Close: Configure alerts to automatically close if the condition resolves itself, so your dashboard stays clean.
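
The threshold-refinement idea ("alert if 3 failures occur within 5 minutes") can be shown with a small sliding-window counter. This is an illustration of the logic rather than how Cloud Monitoring implements it; the class name is hypothetical and timestamps are assumed to arrive in non-decreasing order, in seconds.

```python
from collections import deque


class WindowedFailureAlert:
    """Fire only when at least `m` failures fall inside a sliding time window."""

    def __init__(self, m: int, window_seconds: float):
        self.m = m
        self.window_seconds = window_seconds
        self.failures = deque()

    def record_failure(self, timestamp: float) -> bool:
        """Record one failure and return True if the alert should fire."""
        self.failures.append(timestamp)
        # Drop failures that have aged out of the window.
        while timestamp - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        return len(self.failures) >= self.m


alert = WindowedFailureAlert(m=3, window_seconds=300)
results = [alert.record_failure(t) for t in (0, 100, 200, 900)]
```

With these inputs the first two failures stay quiet, the third fires, and the isolated failure at 900 seconds does not, because the earlier ones have aged out of the window.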

Infrastructure-as-Code (IaC) for Monitoring

For a consistent security posture, you should manage your monitoring via code.

Terraform: Use the google_monitoring_alert_policy and google_monitoring_dashboard resources to deploy your security monitoring along with your infrastructure.
Benefit: This ensures that every new project you create automatically has the required security alerts and dashboards.
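
To show the "same alerts in every project" benefit, the sketch below generates Terraform's JSON syntax (a `.tf.json` file) with one `google_monitoring_alert_policy` per project. Teams usually write HCL directly; Python is used here only to keep the example self-contained. The project IDs, the `login_failed` metric name, and the `gce_instance` resource type are assumptions.

```python
import json


def baseline_alert_policies(project_ids):
    """Return a Terraform JSON document with one alert policy per project."""
    policies = {}
    for pid in project_ids:
        resource_name = "failed_logins_" + pid.replace("-", "_")
        policies[resource_name] = {
            "project": pid,
            "display_name": "Failed login spike",
            "combiner": "OR",
            "conditions": [{
                "display_name": "Failed logins above threshold",
                "condition_threshold": [{
                    # Hypothetical user-defined log-based metric.
                    "filter": 'metric.type="logging.googleapis.com/user/login_failed" '
                              'AND resource.type="gce_instance"',
                    "comparison": "COMPARISON_GT",
                    "threshold_value": 100,
                    "duration": "0s",
                    "aggregations": [{
                        "alignment_period": "60s",
                        "per_series_aligner": "ALIGN_SUM",
                    }],
                }],
            }],
        }
    return {"resource": {"google_monitoring_alert_policy": policies}}


document = baseline_alert_policies(["prod-app", "prod-data"])  # hypothetical projects
tf_json = json.dumps(document, indent=2)  # write this to e.g. monitoring.tf.json
```

Adding a project to the list produces an identical policy for it on the next `terraform apply`, which is the consistency property described above.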

Monitoring for Sensitive Data Access

Combined with Cloud DLP, you can monitor for access to sensitive data.

Scenario: Create a log-based metric for every time a "High Sensitivity" tag is accessed in a BigQuery audit log.
Alert: Trigger an alert if a user outside of the "Data Science" group touches that data.
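
The alert rule in this scenario reduces to "a sensitive resource was touched by someone not on the allowed list." The sketch below applies that rule to already-parsed audit log entries. It assumes the standard `protoPayload.authenticationInfo.principalEmail` and `protoPayload.resourceName` fields, and that group membership has been resolved to a set of emails beforehand; the emails and table path are hypothetical.

```python
def unauthorized_accesses(entries, allowed_members, sensitive_resources):
    """Return (principal, resource) pairs for sensitive reads by non-members."""
    flagged = []
    for entry in entries:
        payload = entry.get("protoPayload", {})
        principal = payload.get("authenticationInfo", {}).get("principalEmail")
        resource = payload.get("resourceName")
        if resource in sensitive_resources and principal not in allowed_members:
            flagged.append((principal, resource))
    return flagged


SENSITIVE = {"projects/p/datasets/hr/tables/salaries"}   # hypothetical table
DATA_SCIENCE = {"alice@example.com"}                      # hypothetical group members

entries = [
    {"protoPayload": {"authenticationInfo": {"principalEmail": "alice@example.com"},
                      "resourceName": "projects/p/datasets/hr/tables/salaries"}},
    {"protoPayload": {"authenticationInfo": {"principalEmail": "bob@example.com"},
                      "resourceName": "projects/p/datasets/hr/tables/salaries"}},
    {"protoPayload": {"authenticationInfo": {"principalEmail": "bob@example.com"},
                      "resourceName": "projects/p/datasets/public/tables/zip_codes"}},
]
flagged = unauthorized_accesses(entries, DATA_SCIENCE, SENSITIVE)
```

Only the second entry is flagged: the first is an allowed member, and the third touches a table that is not marked sensitive.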

Monitoring can generate a large volume of data. Be mindful of Cloud Monitoring costs, especially when using high-cardinality custom metrics or frequent uptime checks.

Integrating Monitoring with SCC

Cloud Monitoring and SCC are complementary.

SCC tells you "The door is unlocked" (Finding).
Monitoring tells you "Someone just walked through the door" (Metric/Alert).
Integration: You can view SCC findings directly within Cloud Monitoring dashboards to correlate security threats with system performance.

Security Best Practices for PSE

Define "Normal": You cannot detect an anomaly if you don't know what "normal" behavior looks like. Spend time baselining your environment.
Tiered Alerting: Create a "Critical" channel for things that require waking someone up (e.g., a Super Admin account login) and a "Standard" channel for daily review.
Use Error Reporting: Enable Cloud Error Reporting to automatically group application-level security exceptions (e.g., SQL injection attempts that caused code crashes).
Audit the Monitors: Periodically check that your alert policies are still active and that the notification channels are still valid (e.g., the Slack webhook hasn't expired).

PSE Exam Scenarios

For SOAR-style automated response, PSE scenarios expect the pipeline Cloud Monitoring Alert Policy → Pub/Sub notification channel → Eventarc → Cloud Run / Cloud Functions to trigger remediation (e.g., auto-quarantine a VM on a brute-force alert). Routing alerts straight to email or Slack stops at human notification and does not satisfy "automated response" requirements.
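
The last hop of that pipeline, the function that receives the alert, might be sketched as below. It assumes the Pub/Sub message data is base64-encoded JSON with a top-level `incident` object carrying `state` and `resource.labels`; verify this against the payload your notification channel actually emits. The `quarantine_vm` action name is hypothetical, and the sketch only decides what to do rather than calling any Compute Engine API.

```python
import base64
import json


def parse_incident(pubsub_data_b64: str) -> dict:
    """Decode the base64 Pub/Sub message data and return the incident object."""
    payload = json.loads(base64.b64decode(pubsub_data_b64).decode("utf-8"))
    return payload["incident"]


def plan_remediation(incident: dict):
    """Return an (action, target) tuple, or None when no action is needed."""
    if incident.get("state") != "open":
        return None  # closed incidents need no remediation
    labels = incident.get("resource", {}).get("labels", {})
    instance_id = labels.get("instance_id")
    if instance_id is None:
        return None  # alert is not about a specific VM
    return ("quarantine_vm", {"instance_id": instance_id, "zone": labels.get("zone")})


# Illustrative message, shaped like an alert notification for a VM.
sample = {
    "incident": {
        "state": "open",
        "policy_name": "ssh-brute-force",
        "resource": {
            "type": "gce_instance",
            "labels": {"instance_id": "1234567890", "zone": "us-central1-a"},
        },
    }
}
encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
action = plan_remediation(parse_incident(encoded))
```

Separating "decide" from "act" keeps the decision logic testable without cloud credentials; the actual quarantine step (for example, applying a deny-all firewall tag) would be a separate call made with the returned target.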

Long-term SIEM retention pattern: Cloud Logging → Log Sink → Pub/Sub → Dataflow → Chronicle or Splunk. Chronicle ingests Google-native telemetry (Cloud Audit Logs, VPC Flow Logs, DNS) for 12-month hot retention, while Splunk/3rd-party SIEMs consume the same Pub/Sub stream via the Splunk Dataflow template. Log-based metrics stay inside Cloud Monitoring for alerting; the sink path is what feeds the SIEM.
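
The first hop of the SIEM path, the log sink, can be sketched as a request body for the Cloud Logging sinks API. This is an illustrative fragment: the project, topic, and sink names are hypothetical, and the filter assumes you want Cloud Audit Logs and VPC Flow Logs only.

```python
def siem_sink(project_id: str, topic_id: str, sink_name: str = "siem-export") -> dict:
    """Build a log sink body that routes security-relevant logs to Pub/Sub."""
    log_filter = (
        'logName:"cloudaudit.googleapis.com" '
        'OR logName:"compute.googleapis.com%2Fvpc_flows"'
    )
    return {
        "name": sink_name,
        "destination": f"pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}",
        "filter": log_filter,
    }


sink = siem_sink("my-security-project", "siem-feed")  # hypothetical names
```

After the sink is created, its writer identity still needs publish permission on the topic before any entries reach the SIEM side of the pipeline.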

Scenario 1: Detecting a Brute Force Attack

"A PSE needs to set up an alert to detect if someone is attempting to brute-force a legacy application's login page, which logs failed attempts as 'Login Failed' in Cloud Logging. How should this be implemented?" Answer: Create a Log-based Counter Metric that filters for the string "Login Failed". Then, create a Cloud Monitoring Alert Policy based on this metric with a threshold (e.g., > 100 per minute). Set the notification channel to the security team's Slack.

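The answer to Scenario 1 can be sketched as two pieces: the log filter behind the counter metric, and an alert policy body that thresholds it. This is a hedged illustration, not the only valid configuration: the metric name `login_failed`, the channel ID, the project ID, and the `gce_instance` resource type are assumptions about where the legacy application runs.

```python
# Filter for the log-based counter metric: entries whose text contains the string.
LOGIN_FAILED_FILTER = 'textPayload:"Login Failed"'


def brute_force_alert_policy(project_id: str, metric_name: str, channel_id: str,
                             per_minute_threshold: int = 100) -> dict:
    """Build an alert policy body that fires above N failed logins per minute."""
    return {
        "displayName": "Possible brute force: failed logins",
        "combiner": "OR",
        "conditions": [{
            "displayName": f"More than {per_minute_threshold} failed logins per minute",
            "conditionThreshold": {
                "filter": f'metric.type="logging.googleapis.com/user/{metric_name}" '
                          'AND resource.type="gce_instance"',
                # Sum the counter over each 60-second window.
                "aggregations": [{"alignmentPeriod": "60s",
                                  "perSeriesAligner": "ALIGN_SUM"}],
                "comparison": "COMPARISON_GT",
                "thresholdValue": per_minute_threshold,
                "duration": "0s",
            },
        }],
        "notificationChannels": [
            f"projects/{project_id}/notificationChannels/{channel_id}"
        ],
    }


policy = brute_force_alert_policy(
    "my-security-project",  # hypothetical project ID
    "login_failed",         # hypothetical log-based metric name
    "1234567890",           # hypothetical Slack channel ID in Cloud Monitoring
)
```

The notification channel entry would point at the security team's Slack channel, which has to be registered in Cloud Monitoring before the policy can reference it.
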
Scenario 2: Monitoring for Cryptomining

"You want to be alerted if there is a sudden, massive spike in Compute Engine CPU usage across your entire organization, which could indicate a cryptomining compromise. What is the best approach?" Answer: Create a Dashboard that aggregates compute.googleapis.com/instance/cpu/utilization across all projects. Set an Alert Policy with a threshold based on a percentage increase over the historical baseline (e.g., 50% increase in 10 minutes).

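The "percentage increase over the historical baseline" in Scenario 2 is simple arithmetic, shown here as a sketch. The sample values are illustrative CPU utilization percentages, and using the mean as the baseline is an assumption; a median or a seasonal baseline may suit bursty workloads better.

```python
from statistics import mean


def percent_increase(baseline_samples, recent_samples) -> float:
    """Percentage change of the recent mean relative to the baseline mean."""
    baseline = mean(baseline_samples)
    if baseline == 0:
        raise ValueError("baseline mean is zero; percentage change is undefined")
    return (mean(recent_samples) - baseline) / baseline * 100.0


def is_cpu_spike(baseline_samples, recent_samples, threshold_percent: float = 50.0) -> bool:
    """True when recent utilization exceeds the baseline by the threshold or more."""
    return percent_increase(baseline_samples, recent_samples) >= threshold_percent


# Fleet-wide average CPU utilization in percent (illustrative numbers).
historical = [20, 20, 20, 20]
last_ten_minutes = [40, 40]
spike = is_cpu_spike(historical, last_ten_minutes)
```

Here utilization doubles from 20% to 40%, a 100% increase, so `spike` is true under the 50% threshold from the scenario.
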
Summary Checklist

List at least three key metrics for a security dashboard.
Explain how to create an alert based on a log entry.
Identify the purpose of Uptime Checks in a security context.
Describe the strategy for reducing alert fatigue.
Understand how to use Terraform to manage monitoring resources.
