Network Monitoring and Logging Design — ANS-C01 Advanced Networking Study Notes

Q1: When do I use VPC Flow Logs vs Traffic Mirroring vs CloudWatch metrics?

Flow Logs for connection-level metadata — cheap, always-on, ideal for security baselines, GuardDuty input, and broad anomaly detection. Traffic Mirroring for full-packet payload — expensive, targeted, ideal for forensic capture, IDS signature matching, and protocol-level debugging. CloudWatch metrics for service-level health and rate signals — bandwidth, error rates, BGP session state, port exhaustion. They are complements, not alternatives. The Specialty exam will combine all three: "Flow Logs to detect the anomaly, Traffic Mirroring to inspect the payload, CloudWatch metrics to validate the symptom".

Q2: How does Reachability Analyzer differ from Network Access Analyzer?

Reachability Analyzer answers "can resource A reach resource B given the current configuration?" — point-to-point analysis with a specific source and destination. Network Access Analyzer answers "given a scope (e.g. 'production VPCs should never reach the internet directly'), what unintended paths exist?" — broad scope analysis surfacing all violations. Use Reachability Analyzer for "verify intended paths"; use Network Access Analyzer for "discover unintended paths". Both are static-configuration analysis (not live tests). The Specialty exam scenarios will keyword "intended" → Reachability Analyzer, and "find any unauthorised" → Network Access Analyzer.

Q3: How do I diagnose a BGP session that is reported as up but is dropping packets?

Layer the telemetry. (a) Check the CloudWatch Direct Connect metrics: ConnectionState (1 if up), ConnectionBpsEgress and ConnectionBpsIngress (saturation?), ConnectionLightLevelTx/Rx (fiber-layer health), and ConnectionPpsEgress (packet rate). Saturation or light-level degradation points to a physical-layer issue. (b) Check VPC Flow Logs with v5 fields. Note that traffic-path is a numeric code recorded for egress traffic (3 means "through a virtual private gateway"), and that Flow Logs do not count retransmits directly; look instead for skewed packet counts and for tcp-flags patterns such as repeated SYN-only records with no matching SYN-ACK. (c) Check the Transit Gateway metrics in Network Manager if traffic crosses a TGW. (d) Set up CloudWatch alarms for BGP session flap (more than 3 VirtualInterfaceState changes in 1 hour). The exam answer for "BGP up but packets dropping": investigate fiber light levels (ConnectionLightLevelTx/Rx), BGP hold-time tuning, and BFD enablement.

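Q4: What does v5 Flow Logs add that v2 doesn't have?

Critical fields, by the version that introduced them. v3: pkt-srcaddr / pkt-dstaddr (true source/destination IPs, distinct from the immediate hops and essential for NAT GW analysis), tcp-flags, and vpc-id, subnet-id, instance-id (for cross-resource correlation). v4: region, az-id, and sublocation-type / sublocation-id for Wavelength, Outposts, and Local Zone workloads. v5: pkt-src-aws-service / pkt-dst-aws-service (identifies AWS service traffic without IP-range lookup), flow-direction (ingress/egress relative to the ENI), and traffic-path (a numeric code for the egress path, such as internet gateway, virtual private gateway, or VPC peering). The ECS fields such as ecs-task-arn arrive in a later version (v7), not v5. There is no per-version surcharge: charges follow the volume of log data, so the extra fields cost only their added record size. Always pick at least v5 for new deployments.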

Q5: How do I set up proactive alerting when a Direct Connect BGP session drops?

Create a CloudWatch alarm on the VirtualInterfaceState metric for each VIF: Sum < 1 for 1 datapoint of 5 minutes. The metric is 1 when up and 0 when down. Set the alarm action to publish to an SNS topic, which routes to PagerDuty, Slack, or email. For the BGP-flap pattern (intermittent up/down), use metric math to compute "VirtualInterfaceState recent changes" and alarm if there are more than 3 state changes in 1 hour. Add a similar alarm for ConnectionBpsEgress saturation (>80% of port capacity, for capacity planning). The Specialty answer for "design proactive BGP monitoring": CloudWatch alarms on VirtualInterfaceState + ConnectionLightLevelTx/Rx degradation + ConnectionPpsEgress saturation, all routing to SNS.
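The "more than 3 state changes in 1 hour" rule reduces to counting transitions in a series of 0/1 samples. The following is a minimal Python sketch of that logic only, assuming the datapoints have already been retrieved (for example via CloudWatch GetMetricData) into a time-ordered list; the function names and the sample values are illustrative and not part of any AWS API.

```python
def count_state_changes(datapoints):
    """Count up/down transitions in a time-ordered list of 0/1 state samples."""
    return sum(1 for prev, cur in zip(datapoints, datapoints[1:]) if prev != cur)


def is_flapping(datapoints, max_changes=3):
    """True when the series changed state more than max_changes times."""
    return count_state_changes(datapoints) > max_changes


# Twelve 5-minute samples covering one hour (hypothetical values).
hour = [1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
print(count_state_changes(hour), is_flapping(hour))
```

A steady outage (all zeros) produces no transitions, which is why the flap alarm complements the plain "state is down" alarm rather than replacing it.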

Q6: When does Traffic Mirroring fail to deliver packets to the target?

Common failure modes: (1) The source ENI is on a non-Nitro instance (m4, c4, etc.), so Traffic Mirroring is silently inert. (2) Source and target are in different accounts or VPCs, and the required VPC peering or TGW connectivity between them is missing. (3) The mirror target NLB is overwhelmed: mirrored traffic is added on top of the original load and the target cannot keep up. (4) The mirror filter is too restrictive, so packets are filtered out before being sent. (5) The mirror target's security groups drop the VXLAN-encapsulated mirrored traffic (UDP port 4789). Diagnose with the EC2 mirroring metrics on the source instance (NetworkMirrorOut for mirrored bytes, NetworkSkipMirrorOut for bytes that were eligible but not mirrored), the target NLB metrics (ActiveFlowCount, ProcessedBytes), and NLB target health.

Q7: Where do I store Flow Logs for long-term retention with cost-effective querying?

S3 with Athena is the canonical pattern. Configure Flow Logs with destination S3, format Parquet (smaller, faster to query than text), partitioned by account/region/date. Athena queries scan only the relevant partitions (drastic cost reduction). For >90 day retention, transition to S3 Glacier Instant Retrieval for further cost reduction. Use Athena's CTAS (CREATE TABLE AS SELECT) to materialize derived tables (e.g. "top 10 talkers by hour") for fast dashboard queries. The Specialty answer for "retain Flow Logs cost-effectively for 1 year with ad-hoc queries": S3 Parquet + Athena + lifecycle policies to Glacier.
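The lifecycle half of that answer can be expressed as a small configuration document. The sketch below builds one in Python, assuming the logs sit under the default AWSLogs/ prefix, a 90-day transition to Glacier Instant Retrieval, and expiry at 365 days; the function name, rule ID, and day counts are illustrative choices, not values mandated by the source.

```python
def flow_log_lifecycle(prefix="AWSLogs/", glacier_after_days=90, expire_after_days=365):
    """Build an S3 lifecycle configuration for a Flow Logs prefix."""
    if expire_after_days <= glacier_after_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "flow-logs-retention",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    # GLACIER_IR is the storage class for Glacier Instant Retrieval.
                    {"Days": glacier_after_days, "StorageClass": "GLACIER_IR"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = flow_log_lifecycle()
print(config["Rules"][0]["Transitions"])
```

With boto3 installed, the dictionary could be passed as the LifecycleConfiguration argument of the S3 client's put_bucket_lifecycle_configuration call, alongside the bucket name.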

Q8: What is the difference between Route Analyzer and Reachability Analyzer?

Reachability Analyzer is VPC-scoped — analyses route tables, SGs, NACLs, gateways within and adjacent to a VPC. Route Analyzer (inside Transit Gateway Network Manager) is TGW-scoped — analyses TGW route tables, attachments, propagations across TGWs and TGW peering. Pick based on where the suspected issue lives. For "VPC A's instance can't reach VPC B's RDS, both attached to the same TGW", you'd use both: Reachability Analyzer to confirm intra-VPC path, Route Analyzer to confirm TGW path. ANS-C01 frequently distinguishes them.

Q9: How do I capture a baseline of normal network behavior to detect anomalies?

Establish baselines for: (a) bandwidth utilization per Direct Connect / VPN / NAT GW (peak, p99, average), (b) TCP retransmit rate (should be <0.1%), (c) latency p50/p99/p999 between regions, between AZs, and to common destinations (S3, DynamoDB, public APIs), (d) NAT GW concurrent connection count, (e) flow rate per VPC (flows per minute), (f) BGP route advertisement count. Run these for 2–4 weeks of typical operation, then set CloudWatch anomaly-detection alarms (statistical baselines that auto-adjust). For 30-day rolling baselines, CloudWatch's anomaly detection band is the easiest answer; for static baselines, set hardcoded thresholds. ANS-C01 will reward "capture baseline before incident, alert on deviation" as the proactive posture.

Q10: What does CloudWatch Logs Insights add over Athena for Flow Log queries?

CloudWatch Logs Insights queries logs in CloudWatch Logs (not S3) with its own pipe-based query language. Pros: near-real-time (sub-minute), interactive, integrated with CloudWatch dashboards. Cons: more expensive per ingested GB, and partitioning is automatic but less granular (retention itself is configurable, from 1 day to 10 years or never expire). Athena queries S3-stored Flow Logs (Parquet preferred). Pros: cheaper, supports massive historical datasets, custom partitioning. Cons: query latency is higher (seconds to minutes), and it needs explicit table DDL. Use Logs Insights for active investigations (the past 1–7 days) and Athena for historical analytics (90 days to 7 years). Many SCS-C02 / ANS-C01 architectures use both.
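For a feel of the Logs Insights side, here is a sketch of a "top rejected sources" query and the parameters a StartQuery call would take. It assumes the flow log uses the default format, for which Logs Insights discovers fields such as srcAddr and action; the helper name, the log group name, and the 24-hour window are hypothetical.

```python
import time

# Logs Insights query: top 10 source addresses by rejected flow records.
TOP_REJECTED_SOURCES = """
fields @timestamp, srcAddr, dstAddr, dstPort
| filter action = "REJECT"
| stats count(*) as rejects by srcAddr
| sort rejects desc
| limit 10
""".strip()


def start_query_params(log_group, hours_back=24, now=None):
    """Build keyword arguments for the CloudWatch Logs StartQuery operation."""
    now = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": now - hours_back * 3600,  # epoch seconds
        "endTime": now,
        "queryString": TOP_REJECTED_SOURCES,
    }


params = start_query_params("/vpc/flow-logs/prod", now=1_700_000_000)
print(params["startTime"], params["endTime"])
```

With boto3 available, these parameters could be passed to the logs client's start_query method, and the results fetched later with get_query_results.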

Network monitoring and logging on ANS-C01 is the conversation that separates a Network Engineer from an SRE. The Specialty exam is full of "the application is occasionally slow, the BGP session is reported as up, but customer-facing latency p99 has spiked from 50ms to 800ms — what telemetry sources do you correlate to root-cause this in 10 minutes?" That is a Network Engineer problem that pulls in VPC Flow Logs v5 with extended fields, CloudWatch metrics for NAT GW errors and BGP session state, Reachability Analyzer for path-validation, Network Access Analyzer for unintended internet egress, Traffic Mirroring for full-packet capture, Transit Gateway Network Manager for cross-region topology, ALB access logs for HTTP-level diagnostics, and proactive alerting on packet loss thresholds, BGP session flapping, and bandwidth saturation — and ANS-C01 routinely tests every one of those moving parts inside a single five-line scenario.

This topic is Domain 1 (Network Design, 30 percent of the exam) Task Statement 1.4 in its entirety. The official ANS-C01 exam guide lists the knowledge bullets verbatim: "Amazon CloudWatch metrics, agents, logs, alarms, dashboards, and insights", "AWS Transit Gateway Network Manager", "VPC Reachability Analyzer", "Flow logs and traffic mirroring", and "Access logging (load balancers, CloudFront)". The skills push you to identify logging requirements, recommend metrics for visibility, and capture baseline performance. Roughly 5 to 8 of the 65 exam questions touch this territory.

Why Monitoring Design Is the Specialty Exam's Reliability Layer

The Specialty exam frames Network Engineer as the role that owns observability of the network plane end-to-end. Without telemetry, no design decision can be validated, no failure can be diagnosed, no SLA can be defended. The exam tests this because real-world Network Engineers spend more time interpreting CloudWatch dashboards and Flow Log queries than they do drawing architecture diagrams.

The mental model the exam rewards is layered telemetry: cheap-and-always-on at the bottom (Flow Logs metadata, CloudWatch metrics), targeted-and-detailed in the middle (Traffic Mirroring full packets, Reachability Analyzer path simulations), and proactive-and-automated at the top (CloudWatch alarms, EventBridge rules, Reachability Analyzer-on-config-change). ANS-C01 will reward this layered thinking with full-credit answers; candidates who think of monitoring as "just turn on Flow Logs" will be punished on questions about packet payload inspection and proactive alerting.

Plain-Language Explanation: Network Monitoring and Logging Design

Network monitoring stacks five distinct telemetry sources (Flow Logs, Traffic Mirroring, Reachability Analyzer, Network Access Analyzer, Network Manager) plus CloudWatch metrics and alarms plus access logs from LBs and CloudFront. Three analogies anchor the moving parts.

Analogy 1: The Hospital Monitoring Suite

A VPC's monitoring is a patient monitoring suite in an ICU.

VPC Flow Logs are the vital signs monitor: heart rate, blood pressure, respiration — continuously sampled metadata, cheap to store, always recording, useful for spotting trends and anomalies. Flow Logs do not show the contents of the conversation between organs (packet payload), only the connection metadata (5-tuple).

Traffic Mirroring is the endoscope camera: when a specific concern arises, the doctor inserts a camera and watches the full content live. Expensive and intrusive, only used on specific patients (Nitro instances), but reveals payload-level detail invisible to the vital signs monitor.

Reachability Analyzer is the surgical pre-op simulation: before the operation begins, the surgical team simulates the procedure on a model — checking that the route from incision to organ is unobstructed. No actual scalpel involved (no packets sent on the wire); the analysis is static-graph-based.

Network Access Analyzer is the medical access control audit: who has the keys to which ward? It surfaces "this nurse can access the pharmacy without a doctor's signature" — analogously, "this VPC can egress to the internet via an unauthorized path".

Transit Gateway Network Manager is the hospital command centre dashboard: showing every wing, every ward, every patient, every device's status across the hospital, with topology maps, route analysis, and event timelines.

CloudWatch metrics and alarms are the bedside alarm bells: when heart rate drops below threshold, the bell rings, the nurse runs in. Configurable thresholds, configurable cool-downs.

Access logs (ALB, CloudFront) are the patient intake records at the hospital reception: every patient who arrived, what they requested, when they were seen, and what response they received. Useful for billing disputes, security investigation, and behavioural analysis.

Analogy 2: The Airport Air Traffic Control

VPC monitoring is airport ATC and ground operations.

VPC Flow Logs are the flight strip records: every aircraft's takeoff time, route, arrival, denied/accepted clearance — continuously recorded for every movement. Cheap, always-on, but you can't hear what's said in the cockpit.

Traffic Mirroring is the cockpit voice recorder pulled for analysis: full audio of pilot/ATC conversations, used after an incident to investigate. Only enabled on specific aircraft (Nitro instances) when needed.

Reachability Analyzer is the flight-plan validator at dispatch: "can this aircraft reach Tokyo from JFK given the routes, NOTAMs, fuel?" — paper-based simulation, no actual flight involved.

Network Access Analyzer is the TSA audit: are there unauthorised paths from the public terminal into restricted airside areas? Surfaces "this gate has unintended access to the cargo apron".

Transit Gateway Network Manager is the regional airspace control centre: showing every aircraft in 10 airports across 3 regions, with alerts when flight patterns deviate from filed flight plans.

CloudWatch metrics are the runway lights and weather sensors: continuous measurements of runway condition, wind speed, visibility, with alarms when thresholds breach.

Access logs are the gate boarding logs: every passenger who boarded, when, on what flight, with what ticket. Forensic records of every ALB or CloudFront request.

Analogy 3: The Manufacturing Quality Control

VPC monitoring is a factory's quality control system.

VPC Flow Logs are the production line counters: every part that came down the line, where it came from, where it went, accepted/rejected. Always on, cheap, metadata-only.

Traffic Mirroring is the microscope and X-ray station: pull a part off the line, examine it in detail, image its internal structure. Targeted, expensive, payload-level.

Reachability Analyzer is the assembly-line simulation in CAD: model the assembly path before building it, to verify there's no obstruction.

Network Access Analyzer is the factory security audit: are there unauthorized doors connecting the warehouse to the secure R&D zone?

Transit Gateway Network Manager is the plant manager's overview dashboard: all factories, all production lines, all status indicators, with topology and route trace tools.

CloudWatch metrics are the production line sensors: temperature, pressure, RPM — continuously sampled, alarmable.

Access logs are the shipping records: every box that left the warehouse, with timestamps, destinations, and contents.

For ANS-C01, the hospital monitoring is the highest-yield mental model when the question mixes multiple telemetry sources for diagnostic reasoning. The airport ATC is best for questions about Transit Gateway Network Manager and topology visibility. The manufacturing QC sub-analogy is intuitive for "Reachability Analyzer is a simulation, not a live test" distinctions. Reference: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html

VPC Flow Logs — The Foundation of Network Telemetry

VPC Flow Logs capture metadata about IP traffic flowing through ENIs, subnets, or entire VPCs. They are the foundational network-security and network-operations telemetry source on AWS.

Capture scope

Flow Logs can be enabled at three scopes, each with different aggregation:

VPC-level — captures all ENIs in the VPC.
Subnet-level — captures all ENIs in the subnet.
ENI-level — captures one specific ENI.

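Flow logs at different scopes are independent of each other: an ENI-level log does not replace a VPC-level log for that ENI. Each flow log captures the traffic and delivers it to its own destination, so overlapping scopes produce duplicate records (and duplicate cost).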

Capture types

A flow log captures one of three traffic types:

ACCEPT — only allowed traffic is logged.
REJECT — only blocked traffic (by SG or NACL) is logged.
ALL — both accepted and rejected flows are logged.

For security analytics, ALL is the recommended default — denied flows reveal misconfigured clients, port scans, and policy violations.

Destinations

CloudWatch Logs — near-real-time, queryable with Logs Insights, costs more per GB but better for short-retention dashboards.
S3 — cheaper for long-term storage, queryable via Athena with partitioning by account/region/date.
Kinesis Data Firehose — streaming into a SIEM (Splunk, Datadog, Sumo Logic) or a custom data lake.

v2 vs v3 vs v4 vs v5 record formats

Flow Logs support multiple record-format versions, each adding fields:

v2 (default if you don't specify) has the original 14 fields: version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status.
v3 adds vpc-id, subnet-id, instance-id, tcp-flags, type (traffic type), and pkt-srcaddr / pkt-dstaddr (the "true" source and destination IPs, distinct from the immediate hop's IPs, which is useful for NAT GW analysis).
v4 adds region, az-id, sublocation-type (e.g. wavelength, outpost, localzone), and sublocation-id.
v5 adds the AWS service identifiers (pkt-src-aws-service, pkt-dst-aws-service, with values like S3, EC2, DYNAMODB), flow-direction (ingress/egress), and traffic-path (a numeric code for the path egress traffic takes).
Later versions add further fields; the ECS fields such as ecs-task-arn, for container workloads, arrive in v7.

Pick v5 or later for full-feature SIEM ingestion. There is no per-version surcharge: you pay for the volume of log data delivered, so extra fields cost only their added record size.
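To make the default record layout concrete, the following sketch parses a space-separated v2 record into a dictionary. The field order follows the 14-field list above; the sample line is modelled on the style of the examples in the AWS documentation, and the function and constant names are illustrative.

```python
V2_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
INT_FIELDS = {"version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end"}


def parse_v2_record(line):
    """Parse one default-format (v2) flow log record into a dict."""
    values = line.split()
    if len(values) != len(V2_FIELDS):
        raise ValueError(f"expected {len(V2_FIELDS)} fields, got {len(values)}")
    record = dict(zip(V2_FIELDS, values))
    for name in INT_FIELDS:
        # NODATA and SKIPDATA records carry "-" in place of missing values.
        if record[name] != "-":
            record[name] = int(record[name])
    return record


SAMPLE = ("2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
          "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK")
rec = parse_v2_record(SAMPLE)
print(rec["srcaddr"], rec["dstport"], rec["action"])
```

A custom format (v3 and later fields) would need its own field list in the order configured on the flow log, since the order of fields in a custom format is user-defined.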

Extended fields — the v5 game-changers

The fields added in v3-v5 are exam-relevant:

pkt-srcaddr / pkt-dstaddr: distinguish the original packet source/destination from the hop. Critical for NAT GW analysis: the immediate source might be the NAT GW IP, but pkt-srcaddr reveals the original instance.
pkt-src-aws-service / pkt-dst-aws-service: identifies AWS service prefixes — answers "is this S3 traffic?" without IP-range lookup.
flow-direction: ingress or egress relative to the ENI — critical for understanding directionality.
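traffic-path: a numeric code for the path that egress traffic takes to its destination (for example through an internet gateway, a virtual private gateway, or a VPC peering connection). Invaluable for diagnosing routing.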
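In the records themselves, traffic-path and tcp-flags arrive as integers rather than names. The sketch below decodes both; the code table is transcribed from the flow log field reference as best I recall it and should be checked against the current documentation, and the helper names are hypothetical.

```python
# traffic-path codes (populated for egress traffic); verify against current docs.
TRAFFIC_PATH = {
    1: "another resource in the same VPC",
    2: "internet gateway or gateway VPC endpoint",
    3: "virtual private gateway",
    4: "intra-region VPC peering connection",
    5: "inter-region VPC peering connection",
    6: "local gateway",
    7: "gateway VPC endpoint (Nitro-based instances only)",
    8: "internet gateway (Nitro-based instances only)",
}

# Standard TCP flag bits; flow logs report FIN, SYN, RST and SYN-ACK (18),
# OR-ed together across the aggregation interval.
TCP_FLAG_BITS = [(1, "FIN"), (2, "SYN"), (4, "RST"), (16, "ACK")]


def describe_traffic_path(code):
    """Map a traffic-path code to a readable description."""
    return TRAFFIC_PATH.get(code, "unknown or not applicable")


def decode_tcp_flags(mask):
    """Return the flag names set in a tcp-flags bitmask."""
    return [name for bit, name in TCP_FLAG_BITS if mask & bit]


print(describe_traffic_path(3), decode_tcp_flags(18))
```

Because the flags are OR-ed over the whole aggregation interval, a value such as 19 (FIN + SYN + ACK) describes everything seen in that window, not a single packet.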

What Flow Logs do NOT capture

Flow Logs explicitly exclude:

Traffic to and from the Amazon DNS server (169.254.169.253).
Windows license activation traffic.
Instance Metadata Service requests (169.254.169.254 — including IMDSv2).
Amazon Time Sync (169.254.169.123).
DHCP traffic.
Traffic to the VPC router reserved address (.1).
Mirrored traffic (Traffic Mirroring copies are not Flow-Log eligible).
Traffic between an endpoint network interface and a Network Load Balancer network interface.

For these blind spots, use Traffic Mirroring or host-based logging.

A frequent ANS-C01 distractor: a question shows REJECT records and asks "is the SG or NACL blocking this?". Flow Logs only record the action (REJECT), not which control did the dropping. To diagnose: check the SG inbound rules for the destination ENI; if the SG allows, the NACL must be denying. Network Firewall drops do not appear as VPC Flow Log REJECTs at all — they appear in Network Firewall flow logs (separate). Memorise: Flow Logs metadata only, no per-control attribution. Reference: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html
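One pattern can narrow the attribution down without reading the rules: security groups are stateful and NACLs are stateless, so when a request is recorded as ACCEPT but the reply in the reverse direction is recorded as REJECT, a NACL is the likely cause. The sketch below looks for that pairing in already-parsed records; it is a heuristic, not a definitive attribution, and the record keys and function names are assumptions for illustration.

```python
def flow_key(rec):
    """5-tuple identifying a flow in its recorded direction."""
    return (rec["srcaddr"], rec["srcport"], rec["dstaddr"], rec["dstport"], rec["protocol"])


def reverse_key(rec):
    """5-tuple of the same conversation seen in the opposite direction."""
    return (rec["dstaddr"], rec["dstport"], rec["srcaddr"], rec["srcport"], rec["protocol"])


def likely_nacl_blocked_replies(records):
    """REJECT records whose opposite direction was ACCEPTed.

    A stateful security group would let the reply through, so a rejected
    reply to an accepted request suggests a stateless NACL rule.
    """
    accepted = {flow_key(r) for r in records if r["action"] == "ACCEPT"}
    return [r for r in records
            if r["action"] == "REJECT" and reverse_key(r) in accepted]


# Hypothetical pair: inbound ICMP accepted, outbound reply rejected.
records = [
    {"srcaddr": "203.0.113.12", "srcport": 0, "dstaddr": "172.31.16.139",
     "dstport": 0, "protocol": 1, "action": "ACCEPT"},
    {"srcaddr": "172.31.16.139", "srcport": 0, "dstaddr": "203.0.113.12",
     "dstport": 0, "protocol": 1, "action": "REJECT"},
]
print(len(likely_nacl_blocked_replies(records)))
```

A REJECT with no accepted counterpart stays ambiguous, which is the case the paragraph above describes: check the security group first, then the NACL.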

VPC Reachability Analyzer — Path Validation

VPC Reachability Analyzer is a static-analysis tool that simulates whether traffic can reach from a source to a destination given the current VPC, route table, SG, NACL, and gateway configurations. It does not send actual packets; it analyses the configuration graph.

What it analyzes

You specify a source and destination — instances, ENIs, IGWs, VGWs, TGWs, peering connections, VPC endpoints. Reachability Analyzer computes the path: each hop, each control evaluated (SG, NACL, route table), and reports REACHABLE or NOT-REACHABLE with the specific blocking control.

Use cases

Pre-deployment validation — before pushing a change, simulate that the change does not break expected paths.
Compliance — recurring scheduled analysis to verify "production VPC is NOT reachable from dev VPC". Failure raises an alert.
Troubleshooting — operator submits "why can't my Lambda reach RDS?" and Reachability Analyzer pinpoints the missing route table entry.

Automation pattern

A typical ANS-C01 scenario: every time a config change is detected (via CloudTrail event), an EventBridge rule triggers a Lambda that calls Reachability Analyzer for a set of "intent paths" (e.g. "production app must reach RDS", "dev VPC must not reach production VPC"). Lambda raises SNS alerts on path-state changes. This is the "automate the verification of connectivity intent" skill from the exam guide.
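The Lambda body in that pattern can be sketched as a loop over intent paths. The example assumes the paths were created beforehand and that an EC2 client object is passed in (with boto3, that would be boto3.client("ec2")); the function name, the shape of the intent mapping, and the polling limits are illustrative. The two client calls used, start_network_insights_analysis and describe_network_insights_analyses, are the EC2 API operations behind Reachability Analyzer.

```python
import time


def check_intent_paths(ec2, intent_paths, poll_seconds=5, max_polls=60):
    """Run Reachability Analyzer for each path and report intent violations.

    intent_paths maps a NetworkInsightsPathId to the expected result:
    True for "must be reachable", False for "must not be reachable".
    """
    violations = []
    for path_id, expected in intent_paths.items():
        started = ec2.start_network_insights_analysis(NetworkInsightsPathId=path_id)
        analysis_id = started["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]
        analysis = {"Status": "running"}
        for _ in range(max_polls):
            analysis = ec2.describe_network_insights_analyses(
                NetworkInsightsAnalysisIds=[analysis_id]
            )["NetworkInsightsAnalyses"][0]
            if analysis["Status"] != "running":
                break
            time.sleep(poll_seconds)
        if analysis["Status"] != "succeeded":
            violations.append((path_id, "analysis " + analysis["Status"]))
        elif analysis["NetworkPathFound"] != expected:
            found = "reachable" if analysis["NetworkPathFound"] else "not reachable"
            violations.append((path_id, "unexpectedly " + found))
    return violations
```

The returned list could then be published to SNS when non-empty. Passing the client in as an argument keeps the function testable with a stub and leaves credentials and region selection to the caller.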

Limitations

Reachability Analyzer is a simulation, not a live test:

It analyses static config — runtime issues (out-of-memory, packet drops at the host) are invisible.
It does not test latency, throughput, or jitter — only reachability.
Cross-region paths are limited to TGW peering reachability.
It does not analyse third-party appliances: the forwarding logic inside, say, a Palo Alto VM-Series behind a GWLB in an inspection VPC is opaque to it.

A common ANS-C01 trap: candidates assume Reachability Analyzer sends ICMP probes or ping. It does not. It is a graph-traversal simulation over the configuration. If the configuration graph says reachable but the application reports timeouts, the issue is at runtime (instance health, application bug, host firewall, MTU mismatch) — diagnose with Traffic Mirroring or instance-level tools. Reference: https://docs.aws.amazon.com/vpc/latest/reachability/what-is-reachability-analyzer.html

Network Access Analyzer — Unintended Access Findings

Network Access Analyzer is a different analyzer: instead of "can A reach B?", it answers "what unauthorised paths exist?". You define a scope (e.g. "VPCs in the production OU should not have any IGW or VPN egress except via the inspection VPC"), and the analyzer surfaces every path that violates the scope.

Scope definition

Scopes are JSON documents specifying source-and-destination predicates: "from any production VPC", "to internet (0.0.0.0/0)", "via any path that does not include the inspection VPC". Scopes can be reused across analyses.
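As a rough illustration of what such a document looks like, the sketch below builds a deliberately simple scope: any path from a network interface to an internet gateway. The MatchPaths / Source / Destination / ResourceStatement structure follows the EC2 CreateNetworkInsightsAccessScope request shape as I understand it; the function name is hypothetical, and a real production scope would narrow the source to specific VPCs and exclude paths through the inspection VPC.

```python
import json


def internet_egress_scope():
    """Scope matching any path from a network interface to an internet gateway."""
    return {
        "MatchPaths": [
            {
                "Source": {
                    "ResourceStatement": {
                        "ResourceTypes": ["AWS::EC2::NetworkInterface"]
                    }
                },
                "Destination": {
                    "ResourceStatement": {
                        "ResourceTypes": ["AWS::EC2::InternetGateway"]
                    }
                },
            }
        ]
    }


scope = internet_egress_scope()
print(json.dumps(scope, indent=2))
```

With boto3, the MatchPaths list could be supplied to the EC2 client's create_network_insights_access_scope call; every path the analysis then returns is a finding to review.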

Findings classification

Findings are categorised by severity and include the full path. Integrate with Security Hub for cross-region aggregation.

Compared to Reachability Analyzer

Reachability Analyzer: "can A reach B?" — given a specific source and destination.
Network Access Analyzer: "what surprising paths exist?" — given a scope, find violations.

Both are static configuration analysis; both are exam-relevant.

Traffic Mirroring — Deep Packet Inspection

VPC Traffic Mirroring copies network packets from a source ENI to a target for out-of-band analysis. It is the answer to "capture full packet payloads for forensic review or IDS without disrupting production".

Components — source, target, filter, session

A mirror source is an ENI on a Nitro-based EC2 instance (older instance types are not supported). A mirror target is a Network Load Balancer, a Gateway Load Balancer endpoint, or another ENI. A mirror filter defines which traffic to copy. A mirror session ties source + target + filter together.

Nitro-only requirement

Traffic Mirroring works only on Nitro-based instances: M5/M5n/M6i, C5/C6i, R5/R6i, T3, and similar generations. Older types (M4, C4, R4, T2) are not supported. ANS-C01 routinely tests this.

Use cases

Forensic capture during an incident.
Threat hunting with Suricata IDS.
Compliance packet retention.
Deep performance debugging for retransmits, packet loss.

Cost

Mirrored bandwidth is duplicated — every byte mirrored adds to the network bill. For high-throughput instances, this can be expensive. Use mirror filters to reduce the volume.
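A filter rule is the main lever for keeping mirrored volume down. The sketch below builds the parameters for a rule that mirrors only inbound TCP/443; the parameter names follow the EC2 CreateTrafficMirrorFilterRule request, while the function name, the rule number, the filter ID, and the choice of port are illustrative assumptions.

```python
def https_ingress_mirror_rule(filter_id, rule_number=100):
    """Parameters for a mirror filter rule that accepts only inbound TCP/443."""
    return {
        "TrafficMirrorFilterId": filter_id,
        "TrafficDirection": "ingress",
        "RuleNumber": rule_number,      # lower numbers are evaluated first
        "RuleAction": "accept",
        "Protocol": 6,                  # IANA protocol number for TCP
        "DestinationPortRange": {"FromPort": 443, "ToPort": 443},
        "SourceCidrBlock": "0.0.0.0/0",
        "DestinationCidrBlock": "0.0.0.0/0",
    }


rule = https_ingress_mirror_rule("tmf-0123456789abcdef0")  # hypothetical filter ID
print(rule["TrafficDirection"], rule["DestinationPortRange"])
```

With boto3, the dictionary could be unpacked into the EC2 client's create_traffic_mirror_filter_rule call. Traffic that matches no accept rule is not mirrored, so a narrow rule like this one caps the duplicated bandwidth.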

A common ANS-C01 distractor: a scenario describes Traffic Mirroring on an m4.large or c4.xlarge instance — the answer is "you cannot mirror from non-Nitro instances; upgrade to m5 or c5". The exam version tests this specifically: "you've enabled Traffic Mirroring but no packets reach the target". Answer: source is non-Nitro. Reference: https://docs.aws.amazon.com/vpc/latest/mirroring/what-is-traffic-mirroring.html

Transit Gateway Network Manager — Global Topology

Transit Gateway Network Manager is the global view of your TGW-based network across regions and accounts. On ANS-C01 it is the canonical answer to "give me one pane of glass for our 5-region 47-account TGW mesh".

Global Network registration

A Global Network is the umbrella construct. You register TGWs (and on-premises devices, sites, and links) in the Global Network for a unified topology view.

Topology and route analysis

The Network Manager UI shows a topology map of all attachments, route tables, and propagated routes. Route Analyzer is a tool inside Network Manager that traces packets across attachments — answering "if I send a packet from VPC A to VPC B, what TGW route table is consulted, what propagated routes are matched, and what is the next hop?". This is the TGW-cross-attachment analogue of Reachability Analyzer.

Event notifications

Network Manager publishes events to EventBridge when topology changes (attachment created, route table modified, peering established) — useful for triggering Reachability Analyzer or compliance checks on every change.

CloudWatch metrics for TGW

Network Manager surfaces TGW metrics: BytesIn, BytesOut, PacketsIn, PacketsOut, PacketDropCountBlackhole, PacketDropCountNoRoute. The drop counters are critical for diagnosis: PacketDropCountBlackhole counts packets that matched a blackhole route in the TGW route table (often intentional); PacketDropCountNoRoute counts packets that matched no route, which usually means a configuration bug.

Compared to Reachability Analyzer

Reachability Analyzer: VPC-level path analysis.
Route Analyzer (in Network Manager): TGW-level path analysis.

Both are exam-relevant; pick based on whether the problem is intra-VPC or cross-attachment.

VPC Flow Logs: metadata of every flow (5-tuple), cheap, always-on.
Traffic Mirroring: full packet copy, expensive, Nitro-only.
Reachability Analyzer: VPC-level static path analysis (not live test).
Network Access Analyzer: scope-based "unintended paths" finder.
Route Analyzer: TGW-level path tracing inside Network Manager.
Transit Gateway Network Manager: global topology and metrics across TGWs.
Reference: https://docs.aws.amazon.com/vpc/latest/tgw/tgw-network-manager.html

CloudWatch Network Metrics

CloudWatch is the metric and alarm plane for AWS networking services. Each service publishes a distinct set of metrics relevant to monitoring.

NAT Gateway metrics

BytesInFromDestination, BytesOutToSource, BytesInFromSource, BytesOutToDestination — bytes by direction.
ConnectionAttemptCount, ConnectionEstablishedCount — connection rates.
ErrorPortAllocation — port exhaustion (the canonical NAT GW failure mode).
IdleTimeoutCount — idle TCP timeouts.
PacketsDropCount — dropped packets.

ErrorPortAllocation > 0 is a critical alert — it means the NAT GW has run out of ephemeral ports and new connections are failing.

Direct Connect metrics

ConnectionState — 1 if up, 0 if down.
ConnectionBpsEgress, ConnectionBpsIngress — bandwidth.
ConnectionPpsEgress, ConnectionPpsIngress — packet rate.
ConnectionLightLevelTx, ConnectionLightLevelRx — fiber optic transmit/receive light levels (warning of fiber degradation).
VirtualInterfaceBpsEgress, etc. — per-VIF.

VPN tunnel metrics

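TunnelState: 1 if up, 0 if down. Per tunnel.
TunnelDataIn, TunnelDataOut: bytes received and sent through the tunnel.
Recent failure events are derived from TunnelState transitions; Site-to-Site VPN does not publish a separate TunnelEstablished metric.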

TGW metrics

BytesIn, BytesOut, PacketsIn, PacketsOut.
PacketDropCountBlackhole, PacketDropCountNoRoute.

ALB metrics

RequestCount, TargetResponseTime, HealthyHostCount, UnHealthyHostCount.
HTTPCode_Target_4XX_Count, HTTPCode_Target_5XX_Count.
RejectedConnectionCount — when ALB rejects due to capacity.

NLB metrics

ActiveFlowCount, NewFlowCount, ProcessedBytes.
TCP_Client_Reset_Count, TCP_Target_Reset_Count, TCP_ELB_Reset_Count — RST counts that reveal misbehavior.

CloudWatch alarms — proactive alerting

Set alarms on critical metrics:

BGP session down: Direct Connect VirtualInterfaceState != UP for 5 minutes → page on-call.
NAT GW port exhaustion: ErrorPortAllocation > 0 → page immediately.
VPN tunnel flap: TunnelState change > 3 in 1 hour → page.
Bandwidth saturation: ConnectionBpsEgress > 80% of port capacity → warn (capacity planning).
TGW route drops: PacketDropCountNoRoute > 0 → page (route table bug).
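As one worked instance of the list above, the sketch below builds the parameters for the NAT GW port-exhaustion alarm. The keys are the standard CloudWatch PutMetricAlarm parameters; the function name, alarm naming scheme, one-minute period, and the example IDs and ARN are assumptions for illustration.

```python
def nat_port_exhaustion_alarm(nat_gateway_id, sns_topic_arn):
    """PutMetricAlarm parameters: page as soon as ErrorPortAllocation is non-zero."""
    return {
        "AlarmName": f"natgw-port-exhaustion-{nat_gateway_id}",
        "Namespace": "AWS/NATGateway",
        "MetricName": "ErrorPortAllocation",
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "Statistic": "Sum",
        "Period": 60,                   # seconds per datapoint
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no data means no failed allocations
        "AlarmActions": [sns_topic_arn],
    }


alarm = nat_port_exhaustion_alarm(
    "nat-0abc1234def567890",                            # hypothetical NAT GW ID
    "arn:aws:sns:us-east-1:111122223333:net-oncall",    # hypothetical topic ARN
)
print(alarm["AlarmName"])
```

With boto3, the dictionary could be unpacked into the CloudWatch client's put_metric_alarm call. The other alarms in the list follow the same shape with a different namespace, metric, dimension, and threshold.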

NAT GW: ErrorPortAllocation is the canonical port-exhaustion signal.
Direct Connect: ConnectionState, ConnectionLightLevelTx/Rx for fiber-layer health.
VPN tunnel: TunnelState per tunnel; alarm on flap.
TGW: PacketDropCountNoRoute is a route table bug signal.
ALB: HTTPCode_ELB_5XX_Count is LB-side; HTTPCode_Target_5XX_Count is backend.
CloudWatch metrics latency: ~1–3 minutes for standard, ~10 seconds for high-resolution.
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html

Access Logging — ALB, NLB, CloudFront, API Gateway

Access logs capture every HTTP request (or TCP connection for NLB) handled by the LB or edge service. They complement Flow Logs (network-layer) with application-layer detail.

ALB access logs

ALB writes access logs to S3 (no CloudWatch option). Each log line contains: time, ALB ARN, client IP:port, target IP:port, request processing time, target processing time, response time, ELB and target status codes, received and sent bytes, request line, user agent, SSL cipher and protocol, target group ARN, trace ID, domain name, chosen cert ARN, matched rule priority, request creation time, action executed, redirect URL, error reason, target status, classification, classification reason.

Access logs are written every 5 minutes to S3. Use Athena for querying.

NLB access logs

NLB access logs (TLS listeners only — no logs for TCP/UDP) similarly write to S3. Less detailed than ALB logs (NLB doesn't see HTTP).

CloudFront access logs

CloudFront has two log options: standard logs (delivered to S3 in batches, typically within an hour of the requests and occasionally much later) and real-time logs (delivered to Kinesis Data Streams within seconds). Real-time logs are field-customisable; standard logs have a fixed schema.

API Gateway access logs

API Gateway writes execution logs and access logs to CloudWatch Logs. Configurable per-stage.

Use cases

Security investigation: which IP made the malicious request, what user agent, what response was returned.
Performance debugging: tail latency analysis, slow-path identification.
Billing dispute: prove or disprove a customer's claim about request volume.

Baseline Performance Capture

The Specialty exam expects you to capture a performance baseline before incidents, so that deviations are detectable.

What to baseline

TCP retransmit rate: should be <0.1% on healthy networks.
Latency p50, p99, p999: per-region, per-AZ, per-endpoint.
Bandwidth utilization: peak vs average per Direct Connect, per VPN tunnel.
NAT GW connections per second: peak.
Flow Logs flow rate: flows per minute per VPC.
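Two of these baselines are simple arithmetic once the samples exist. The sketch below computes nearest-rank latency percentiles and a retransmit rate to compare against the 0.1% guideline; the sample values are made up, and where the counters come from (host TCP statistics, agents, packet captures) is left open as an assumption.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def retransmit_rate(retransmitted_segments, total_segments):
    """Fraction of sent TCP segments that were retransmissions."""
    return retransmitted_segments / total_segments if total_segments else 0.0


latencies_ms = list(range(1, 101))   # hypothetical samples: 1 ms .. 100 ms
baseline = {
    "p50_ms": percentile(latencies_ms, 50),
    "p99_ms": percentile(latencies_ms, 99),
    "retransmit_rate": retransmit_rate(5, 10_000),
}
healthy = baseline["retransmit_rate"] < 0.001   # the <0.1% guideline
print(baseline, healthy)
```

Recording these numbers per region, AZ, and endpoint during normal operation gives the static thresholds mentioned in these notes; anomaly-detection alarms cover the metrics that CloudWatch already holds.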

Tools

CloudWatch metrics (built-in) — for AWS services.
CloudWatch Synthetics — canaries that probe HTTP endpoints continuously.
CloudWatch Logs Insights — query Flow Logs for derived metrics like flow rate, top talkers.
Third-party agents (Datadog, New Relic) — for deeper application-layer telemetry.

Build CloudWatch dashboards combining Flow Logs queries, NAT GW metrics, BGP session state, and alarm history. Update dashboards as the architecture evolves.

Common Traps Recap — Network Monitoring on ANS-C01

Trap 1: Flow Logs distinguish SG vs NACL drops

Wrong. REJECT is binary; no per-control attribution.

Trap 2: Reachability Analyzer is a live ping test

Wrong. It's a static configuration simulation.

Trap 3: Traffic Mirroring works on all instance types

Wrong. Nitro-based instances only.

Trap 4: Flow Logs capture all traffic in the VPC

Wrong. Traffic to AWS DNS, IMDS, NTP, license activation, and DHCP is excluded.

Trap 5: VPC Flow Logs include packet payloads

Wrong. Metadata only (5-tuple, bytes, packets). Use Traffic Mirroring for payload.

Trap 6: Network Firewall drops appear in VPC Flow Logs

Wrong. NFW has separate flow logs and alert logs.

Trap 7: Reachability Analyzer works across AWS accounts

Partial. Cross-account analysis requires explicit RAM-shared resources or AWS Organisations integration.

Trap 8: NAT GW `BytesInFromSource` indicates the bottleneck

Partial. The bottleneck signal is ErrorPortAllocation > 0, not bytes.
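
A minimal sketch of an alarm on that signal is shown below. It only builds the parameter dictionary; the alarm name, period, and SNS topic ARN are illustrative assumptions, while the namespace, metric, and dimension names are the ones NAT Gateway publishes to CloudWatch.

```python
def port_exhaustion_alarm(nat_gateway_id, sns_topic_arn):
    """Build CloudWatch alarm parameters that fire on any ErrorPortAllocation."""
    return {
        "AlarmName": f"natgw-port-exhaustion-{nat_gateway_id}",  # hypothetical naming scheme
        "Namespace": "AWS/NATGateway",
        "MetricName": "ErrorPortAllocation",
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "Statistic": "Sum",
        "Period": 60,                      # assumed 1-minute evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


if __name__ == "__main__":
    params = port_exhaustion_alarm(
        "nat-0123456789abcdef0",                                # placeholder ID
        "arn:aws:sns:us-east-1:111122223333:network-alerts",    # placeholder ARN
    )
    print(params["MetricName"], params["ComparisonOperator"], params["Threshold"])
```

With boto3 installed and credentials configured, the dictionary can be passed as keyword arguments to the CloudWatch client's put_metric_alarm call.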

Trap 9: Network Manager's Route Analyzer is for VPC routes

Wrong. TGW-level routes only. For VPC route analysis, use Reachability Analyzer.

Trap 10: ALB access logs go to CloudWatch by default

Wrong. ALB logs go to S3 only. CloudFront logs go to S3 (standard) or Kinesis Data Streams (real-time).

Trap 11: v5 Flow Logs cost more than v2

Wrong. Same per-GB cost. Always pick v5.

Trap 12: Reachability Analyzer can verify TGW peering reachability

Partial. It can analyse paths through TGW peering but with limitations on cross-region complexity.

ANS-C01 exam priority — Network Monitoring and Logging Design. This topic carries weight on the ANS-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes — the exam will test scenarios that hinge on knowing which service is the wrong answer, not just which is right.

FAQ — Network Monitoring on ANS-C01

Q1: When do I use VPC Flow Logs vs Traffic Mirroring vs CloudWatch metrics?

Flow Logs for connection-level metadata — cheap, always-on, ideal for security baselines, GuardDuty input, and broad anomaly detection. Traffic Mirroring for full-packet payload — expensive, targeted, ideal for forensic capture, IDS signature matching, and protocol-level debugging. CloudWatch metrics for service-level health and rate signals — bandwidth, error rates, BGP session state, port exhaustion. They are complements, not alternatives. The Specialty exam will combine all three: "Flow Logs to detect the anomaly, Traffic Mirroring to inspect the payload, CloudWatch metrics to validate the symptom".

Q2: How does Reachability Analyzer differ from Network Access Analyzer?

Reachability Analyzer answers "can resource A reach resource B given the current configuration?" — point-to-point analysis with a specific source and destination. Network Access Analyzer answers "given a scope (e.g. 'production VPCs should never reach the internet directly'), what unintended paths exist?" — broad scope analysis surfacing all violations. Use Reachability Analyzer for "verify intended paths"; use Network Access Analyzer for "discover unintended paths". Both are static-configuration analysis (not live tests). The Specialty exam scenarios will keyword "intended" → Reachability Analyzer, and "find any unauthorised" → Network Access Analyzer.

Q3: How do I diagnose a BGP session that is reported as up but is dropping packets?

Layer the telemetry. (a) Check CloudWatch Direct Connect metrics: ConnectionState (1 if up), ConnectionBpsEgress and ConnectionBpsIngress (saturation?), ConnectionLightLevelTx/Rx (fiber-layer health), ConnectionPpsEgress (packet rate). Saturation or light-level degradation suggests a physical-layer issue. (b) Check VPC Flow Logs with v5 fields. The traffic-path field is a numeric code recorded for egress traffic only and has no Direct Connect-specific value; traffic that leaves through a virtual private gateway is logged as traffic-path = 3. Flow Logs do not record retransmissions directly, so look for indirect signs: repeated SYN or SYN-ACK values in tcp-flags and skewed packet counts between the two directions of a flow. (c) Check TGW Network Manager metrics if traffic crosses TGW. (d) Set up CloudWatch alarms for BGP session flap (VirtualInterfaceState changes > 3 in 1 hour). The exam answer for "BGP up but packets dropping": investigate fiber light levels (ConnectionLightLevelTx/Rx), BGP holdtime tuning, BFD enablement.
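
When reading TCP flags in Flow Logs, note that the tcp-flags field is a bitmask and that flags seen during one aggregation interval are OR-ed together. A small decoder, sketched here with illustrative names, makes the values readable.

```python
# Bit values used by the tcp-flags field (standard TCP header flag bits).
TCP_FLAG_BITS = {"FIN": 1, "SYN": 2, "RST": 4, "PSH": 8, "ACK": 16, "URG": 32}


def decode_tcp_flags(value):
    """Return the names of the flags set in a tcp-flags bitmask."""
    value = int(value)
    return [name for name, bit in TCP_FLAG_BITS.items() if value & bit]


if __name__ == "__main__":
    for raw in (2, 18, 19):
        print(raw, decode_tcp_flags(raw))
```

A value of 2 is a bare SYN and 18 is SYN-ACK; many SYN-only records toward one destination with no matching SYN-ACK records are a hint that connection attempts are being lost.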

Q4: What does v5 Flow Logs add that v2 doesn't have?

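Critical fields, cumulative across format versions. Version 3 adds pkt-srcaddr / pkt-dstaddr (true source/destination IPs distinct from immediate hops, essential for NAT GW analysis), VPC ID, subnet ID, instance ID (for cross-resource correlation), and TCP flags. Version 4 adds region, AZ ID, and sublocation type/id for Outposts, Local Zones, and Wavelength workloads. Version 5 adds pkt-src-aws-service / pkt-dst-aws-service (identifies AWS service traffic without IP-range lookup), flow-direction (ingress/egress relative to the ENI), and traffic-path (a numeric code for the egress path: internet gateway, virtual private gateway, VPC peering, gateway endpoint, and so on). ECS fields such as the task ARN arrived in a later format version (7), not in v5. Cost is identical to v2. Always pick v5 for new deployments.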
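
The traffic-path field is recorded as a numeric code on egress traffic rather than as a label. The lookup below sketches the mapping as listed in the Flow Logs documentation; the short descriptions and the helper name are paraphrased for illustration.

```python
# traffic-path codes (egress traffic only); "-" means no value was recorded.
TRAFFIC_PATH = {
    1: "another resource in the same VPC",
    2: "internet gateway or gateway VPC endpoint",
    3: "virtual private gateway",
    4: "intra-region VPC peering connection",
    5: "inter-region VPC peering connection",
    6: "local gateway",
    7: "gateway VPC endpoint (Nitro-based instances only)",
    8: "internet gateway (Nitro-based instances only)",
}


def describe_traffic_path(value):
    """Translate a raw traffic-path value into a readable description."""
    if value in (None, "", "-"):
        return "not recorded"
    return TRAFFIC_PATH.get(int(value), "unknown code")


if __name__ == "__main__":
    for raw in ("3", "8", "-"):
        print(raw, "->", describe_traffic_path(raw))
```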

Q5: How do I set up proactive alerting when a Direct Connect BGP session drops?

Create a CloudWatch alarm on the VirtualInterfaceState metric for each VIF: Sum < 1 for 1 datapoint of 5 minutes. The metric is 1 when up and 0 when down. Set the alarm action to publish to an SNS topic, which routes to PagerDuty, Slack, or email. For the BGP-flap pattern (intermittent up/down), use metric math to compute "VirtualInterfaceState recent changes" and alarm if there are more than 3 state changes in 1 hour. Add a similar alarm for ConnectionBpsEgress saturation (>80% of port capacity, for capacity planning). The Specialty answer for "design proactive BGP monitoring": CloudWatch alarms on VirtualInterfaceState + ConnectionLightLevelTx/Rx degradation + ConnectionPpsEgress saturation, all routing to SNS.
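
The flap rule ("more than 3 state changes in 1 hour") is easy to reason about offline. The sketch below counts transitions in a list of 1/0 state samples covering one hour; it models the logic only and is not the metric-math expression itself, and the function names are illustrative.

```python
def count_state_changes(states):
    """Count up/down transitions in an ordered series of 1 (up) / 0 (down) samples."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def is_flapping(states_last_hour, max_changes=3):
    """True when the series changes state more often than the allowed limit."""
    return count_state_changes(states_last_hour) > max_changes


if __name__ == "__main__":
    stable = [1, 1, 1, 0, 0, 0]        # one outage, one transition
    flappy = [1, 1, 0, 1, 0, 1, 1]     # four transitions
    print(is_flapping(stable), is_flapping(flappy))
```

A single clean outage produces one or two transitions and is caught by the plain state alarm; the flap check exists to catch sessions that keep recovering before the state alarm fires.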

Q6: When does Traffic Mirroring fail to deliver packets to the target?

Common failure modes:

Source ENI is on an unsupported instance type (for example T2): mirroring requires a Nitro-based instance or one of a short list of older supported types.
Source and target are in different VPCs or accounts: the target must be shared with the source account (via AWS RAM) and be routable through VPC peering or TGW.
Mirror target NLB is overwhelmed: bandwidth is doubled (original + mirrored), and the NLB can't keep up.
Mirror filter is too restrictive: packets are filtered out before being sent.
Mirror target ENI has tight security groups that drop the VXLAN-encapsulated mirror traffic (UDP port 4789).

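Diagnose with: the Traffic Mirroring CloudWatch metrics in the AWS/EC2 namespace (NetworkMirrorIn / NetworkMirrorOut for mirrored bytes, NetworkSkipMirrorIn / NetworkSkipMirrorOut for bytes that matched but were not mirrored), target NLB metrics (ActiveFlowCount, ProcessedBytes), and NLB target health.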

Q7: Where do I store Flow Logs for long-term retention with cost-effective querying?

S3 with Athena is the canonical pattern. Configure Flow Logs with destination S3, format Parquet (smaller, faster to query than text), partitioned by account/region/date. Athena queries scan only the relevant partitions (drastic cost reduction). For >90 day retention, transition to S3 Glacier Instant Retrieval for further cost reduction. Use Athena's CTAS (CREATE TABLE AS SELECT) to materialize derived tables (e.g. "top 10 talkers by hour") for fast dashboard queries. The Specialty answer for "retain Flow Logs cost-effectively for 1 year with ad-hoc queries": S3 Parquet + Athena + lifecycle policies to Glacier.
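
The lifecycle half of that pattern can be sketched as a configuration dictionary. The rule ID, prefix, and day counts below are illustrative assumptions; GLACIER_IR is the S3 storage-class value for Glacier Instant Retrieval.

```python
def flow_log_lifecycle(prefix="AWSLogs/", transition_days=90, expire_days=365):
    """Build an S3 lifecycle configuration: move to Glacier IR, then expire."""
    if expire_days <= transition_days:
        raise ValueError("expiration must come after the transition")
    return {
        "Rules": [
            {
                "ID": "flow-logs-to-glacier-ir",   # hypothetical rule name
                "Filter": {"Prefix": prefix},      # assumed Flow Logs key prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "GLACIER_IR"}
                ],
                "Expiration": {"Days": expire_days},
            }
        ]
    }


if __name__ == "__main__":
    rule = flow_log_lifecycle()["Rules"][0]
    print(rule["Transitions"], rule["Expiration"])
```

With boto3, the dictionary is the LifecycleConfiguration argument of the S3 client's put_bucket_lifecycle_configuration call.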

Q8: What is the difference between Route Analyzer and Reachability Analyzer?

Reachability Analyzer is VPC-scoped — analyses route tables, SGs, NACLs, gateways within and adjacent to a VPC. Route Analyzer (inside Transit Gateway Network Manager) is TGW-scoped — analyses TGW route tables, attachments, propagations across TGWs and TGW peering. Pick based on where the suspected issue lives. For "VPC A's instance can't reach VPC B's RDS, both attached to the same TGW", you'd use both: Reachability Analyzer to confirm intra-VPC path, Route Analyzer to confirm TGW path. ANS-C01 frequently distinguishes them.

Q9: How do I capture a baseline of normal network behavior to detect anomalies?

Establish baselines for: (a) bandwidth utilization per Direct Connect / VPN / NAT GW (peak, p99, average), (b) TCP retransmit rate (should be <0.1%), (c) latency p50/p99/p999 between regions, between AZs, and to common destinations (S3, DynamoDB, public APIs), (d) NAT GW concurrent connection count, (e) flow rate per VPC (flows per minute), (f) BGP route advertisement count. Run these for 2–4 weeks of typical operation, then set CloudWatch anomaly-detection alarms (statistical baselines that auto-adjust). For rolling baselines, CloudWatch's anomaly detection band, which trains on up to two weeks of recent metric data, is the easiest answer; for static baselines, set hardcoded thresholds. ANS-C01 will reward "capture baseline before incident, alert on deviation" as the proactive posture.
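
For intuition about "alert on deviation", the sketch below derives a static band from historical samples as mean plus or minus k standard deviations. This is a simple stand-in chosen for illustration; it is not the model CloudWatch anomaly detection uses, and the names and the choice of k = 3 are assumptions.

```python
import statistics


def baseline_band(history, k=3.0):
    """Return (low, high) as mean +/- k sample standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate spread")
    mean = statistics.fmean(history)
    spread = statistics.stdev(history)
    return (mean - k * spread, mean + k * spread)


def is_anomalous(value, band):
    """True when a new observation falls outside the baseline band."""
    low, high = band
    return value < low or value > high


if __name__ == "__main__":
    flows_per_minute = [100, 102, 98, 101, 99]   # made-up baseline samples
    band = baseline_band(flows_per_minute)
    print(band, is_anomalous(110, band), is_anomalous(101, band))
```

A static band like this suits metrics with a stable level; metrics with daily or weekly cycles are better served by an adaptive band.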

Q10: What does CloudWatch Logs Insights add over Athena for Flow Log queries?

CloudWatch Logs Insights queries logs in CloudWatch Logs (not S3) with a purpose-built, pipe-delimited query language. Pros: near real-time (sub-minute), interactive, integrated with CloudWatch dashboards. Cons: more expensive per ingested GB, retention is set per log group (1 day to 10 years, or never expire), partitioning is automatic but less granular. Athena queries S3-stored Flow Logs (Parquet preferred). Pros: cheaper, supports massive historical datasets, custom partitioning. Cons: query latency higher (seconds to minutes), needs explicit table DDL. Use Logs Insights for active investigations (the past 1–7 days) and Athena for historical analytics (90 days to 7 years). Many SCS-C02 / ANS-C01 architectures use both.

Once monitoring and logging is in place, the natural next operational layers on ANS-C01 are: VPC Flow Logs, Reachability Analyzer, and Traffic Mirroring for deeper Domain 3 troubleshooting; Network Performance — ENA, EFA, Placement Groups, and Jumbo Frames for the throughput layer that monitoring observes; AWS Network Firewall — Suricata Rules, TLS Inspection, and Centralized Deployment which generates its own flow and alert logs separate from VPC Flow Logs; and Compliance, Auditing, and Network Governance which composes Flow Logs, CloudTrail, Config, and Firewall Manager into an audit story.
