Network monitoring and logging on ANS-C01 is the conversation that separates a Network Engineer from an SRE. The Specialty exam is full of "the application is occasionally slow, the BGP session is reported as up, but customer-facing latency p99 has spiked from 50ms to 800ms — what telemetry sources do you correlate to root-cause this in 10 minutes?" That is a Network Engineer problem that pulls in VPC Flow Logs v5 with extended fields, CloudWatch metrics for NAT GW errors and BGP session state, Reachability Analyzer for path-validation, Network Access Analyzer for unintended internet egress, Traffic Mirroring for full-packet capture, Transit Gateway Network Manager for cross-region topology, ALB access logs for HTTP-level diagnostics, and proactive alerting on packet loss thresholds, BGP session flapping, and bandwidth saturation — and ANS-C01 routinely tests every one of those moving parts inside a single five-line scenario.
This topic is Domain 1 (Network Design, 30 percent of the exam) Task Statement 1.4 in its entirety. The official ANS-C01 exam guide lists the knowledge bullets verbatim: "Amazon CloudWatch metrics, agents, logs, alarms, dashboards, and insights", "AWS Transit Gateway Network Manager", "VPC Reachability Analyzer", "Flow logs and traffic mirroring", and "Access logging (load balancers, CloudFront)". The skills push you to identify logging requirements, recommend metrics for visibility, and capture baseline performance. Roughly 5 to 8 of the 65 exam questions touch this territory.
Why Monitoring Design Is the Specialty Exam's Reliability Layer
The Specialty exam frames Network Engineer as the role that owns observability of the network plane end-to-end. Without telemetry, no design decision can be validated, no failure can be diagnosed, no SLA can be defended. The exam tests this because real-world Network Engineers spend more time interpreting CloudWatch dashboards and Flow Log queries than they do drawing architecture diagrams.
The mental model the exam rewards is layered telemetry: cheap-and-always-on at the bottom (Flow Logs metadata, CloudWatch metrics), targeted-and-detailed in the middle (Traffic Mirroring full packets, Reachability Analyzer path simulations), and proactive-and-automated at the top (CloudWatch alarms, EventBridge rules, Reachability Analyzer-on-config-change). ANS-C01 will reward this layered thinking with full-credit answers; candidates who think of monitoring as "just turn on Flow Logs" will be punished on questions about packet payload inspection and proactive alerting.
Plain-Language Explanation: Network Monitoring and Logging Design
Network monitoring stacks five distinct telemetry sources (Flow Logs, Traffic Mirroring, Reachability Analyzer, Network Access Analyzer, Network Manager) plus CloudWatch metrics and alarms plus access logs from LBs and CloudFront. Three analogies anchor the moving parts.
Analogy 1: The Hospital Monitoring Suite
A VPC's monitoring is a patient monitoring suite in an ICU.
VPC Flow Logs are the vital signs monitor: heart rate, blood pressure, respiration — continuously sampled metadata, cheap to store, always recording, useful for spotting trends and anomalies. Flow Logs do not show the contents of the conversation between organs (packet payload), only the connection metadata (5-tuple).
Traffic Mirroring is the endoscope camera: when a specific concern arises, the doctor inserts a camera and watches the full content live. Expensive and intrusive, only used on specific patients (Nitro instances), but reveals payload-level detail invisible to the vital signs monitor.
Reachability Analyzer is the surgical pre-op simulation: before the operation begins, the surgical team simulates the procedure on a model — checking that the route from incision to organ is unobstructed. No actual scalpel involved (no packets sent on the wire); the analysis is static-graph-based.
Network Access Analyzer is the medical access control audit: who has the keys to which ward? It surfaces "this nurse can access the pharmacy without a doctor's signature" — analogously, "this VPC can egress to the internet via an unauthorized path".
Transit Gateway Network Manager is the hospital command centre dashboard: showing every wing, every ward, every patient, every device's status across the hospital, with topology maps, route analysis, and event timelines.
CloudWatch metrics and alarms are the bedside alarm bells: when heart rate drops below threshold, the bell rings, the nurse runs in. Configurable thresholds, configurable cool-downs.
Access logs (ALB, CloudFront) are the patient intake records at the hospital reception — every patient who arrived, what they requested, when they were seen, and what response they received. Useful for billing, billing disputes, and behavioural analysis.
Analogy 2: The Airport Air Traffic Control
VPC monitoring is airport ATC and ground operations.
VPC Flow Logs are the flight strip records: every aircraft's takeoff time, route, arrival, denied/accepted clearance — continuously recorded for every movement. Cheap, always-on, but you can't hear what's said in the cockpit.
Traffic Mirroring is the cockpit voice recorder pulled for analysis: full audio of pilot/ATC conversations, used after an incident to investigate. Only enabled on specific aircraft (Nitro instances) when needed.
Reachability Analyzer is the flight-plan validator at dispatch: "can this aircraft reach Tokyo from JFK given the routes, NOTAMs, fuel?" — paper-based simulation, no actual flight involved.
Network Access Analyzer is the TSA audit: are there unauthorised paths from the public terminal into restricted airside areas? Surfaces "this gate has unintended access to the cargo apron".
Transit Gateway Network Manager is the regional airspace control centre: showing every aircraft in 10 airports across 3 regions, with alerts when flight patterns deviate from filed flight plans.
CloudWatch metrics are the runway lights and weather sensors: continuous measurements of runway condition, wind speed, visibility, with alarms when thresholds breach.
Access logs are the gate boarding logs: every passenger who boarded, when, on what flight, with what ticket. Forensic records of every ALB or CloudFront request.
Analogy 3: The Manufacturing Quality Control
VPC monitoring is a factory's quality control system.
VPC Flow Logs are the production line counters: every part that came down the line, where it came from, where it went, accepted/rejected. Always on, cheap, metadata-only.
Traffic Mirroring is the microscope and X-ray station: pull a part off the line, examine it in detail, image its internal structure. Targeted, expensive, payload-level.
Reachability Analyzer is the assembly-line simulation in CAD: model the assembly path before building it, to verify there's no obstruction.
Network Access Analyzer is the factory security audit: are there unauthorized doors connecting the warehouse to the secure R&D zone?
Transit Gateway Network Manager is the plant manager's overview dashboard: all factories, all production lines, all status indicators, with topology and route trace tools.
CloudWatch metrics are the production line sensors: temperature, pressure, RPM — continuously sampled, alarmable.
Access logs are the shipping records: every box that left the warehouse, with timestamps, destinations, and contents.
For ANS-C01, the hospital monitoring suite is the highest-yield mental model when the question mixes multiple telemetry sources for diagnostic reasoning. The airport ATC analogy is best for questions about Transit Gateway Network Manager and topology visibility. The manufacturing QC analogy is intuitive for "Reachability Analyzer is a simulation, not a live test" distinctions. Reference: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html
VPC Flow Logs — The Foundation of Network Telemetry
VPC Flow Logs capture metadata about IP traffic flowing through ENIs, subnets, or entire VPCs. They are the foundational network-security and network-operations telemetry source on AWS.
Capture scope
Flow Logs can be enabled at three scopes, each with different aggregation:
- VPC-level — captures all ENIs in the VPC.
- Subnet-level — captures all ENIs in the subnet.
- ENI-level — captures one specific ENI.
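Scopes do not override one another. Each flow log is an independent subscription, so an ENI covered by both a VPC-level and an ENI-level flow log is recorded by both, and each flow log can have its own traffic type, record format, and destination.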
Capture types
A flow log captures one of three traffic types:
- ACCEPT — only allowed traffic is logged.
- REJECT — only blocked traffic (by SG or NACL) is logged.
- ALL — both accepted and rejected flows are logged.
For security analytics, ALL is the recommended default — denied flows reveal misconfigured clients, port scans, and policy violations.
Destinations
- CloudWatch Logs — near-real-time, queryable with Logs Insights, costs more per GB but better for short-retention dashboards.
- S3 — cheaper for long-term storage, queryable via Athena with partitioning by account/region/date.
- Kinesis Data Firehose — streaming into a SIEM (Splunk, Datadog, Sumo Logic) or a custom data lake.
v2 vs v3 vs v4 vs v5 record formats
Flow Logs support multiple record-format versions, each adding fields:
- v2 (default if you don't specify) — original 14 fields: version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status.
- v3 — adds VPC ID, subnet ID, instance ID, TCP flags, traffic type, packet src/dst (the "true" source and destination IPs, distinct from the immediate hop's IPs — useful for NAT GW analysis).
- v4: adds region, AZ ID, sublocation type (e.g. wavelength, outpost), and sublocation ID, which matter for Wavelength, Outposts, and Local Zone workloads.
- v5: adds AWS service identifiers (pkt-src-aws-service, pkt-dst-aws-service, with values like S3, EC2, RDS), flow direction (ingress/egress), and traffic path. ECS task fields such as the task ARN arrive in a later format version, so check the current field table when container attribution matters.
Pick v5 for full-feature SIEM ingestion. Cost is identical regardless of format.
Extended fields — the v5 game-changers
The fields added in v3-v5 are exam-relevant:
- pkt-srcaddr / pkt-dstaddr: distinguish the original packet source/destination from the hop. Critical for NAT GW analysis: the immediate source might be the NAT GW IP, but pkt-srcaddr reveals the original instance.
- pkt-src-aws-service / pkt-dst-aws-service: identifies AWS service prefixes — answers "is this S3 traffic?" without IP-range lookup.
- flow-direction: ingress or egress relative to the ENI — critical for understanding directionality.
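- traffic-path: a numeric code for the path that egress traffic takes out of the VPC (for example through an internet gateway, a virtual private gateway, VPC peering, or a gateway VPC endpoint). Invaluable for diagnosing routing.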
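To make the pkt-srcaddr versus srcaddr distinction concrete, here is a minimal sketch that parses a custom-format record and flags flows relayed by an intermediate hop. The format string, the sample record, and the helper names (`parse_record`, `is_relayed`) are illustrative assumptions, not anything AWS ships; a real record follows whatever field order you configured on the flow log.

```python
from typing import Dict, List, Optional

# Hypothetical custom format; in practice this is the LogFormat you configured.
LOG_FORMAT = (
    "${version} ${interface-id} ${srcaddr} ${dstaddr} ${pkt-srcaddr} ${pkt-dstaddr} "
    "${pkt-dst-aws-service} ${flow-direction} ${traffic-path} ${action}"
)


def field_names(log_format: str) -> List[str]:
    """Turn '${srcaddr} ${dstaddr}' into ['srcaddr', 'dstaddr']."""
    return [token[2:-1] for token in log_format.split()]


def parse_record(line: str, log_format: str = LOG_FORMAT) -> Dict[str, Optional[str]]:
    """Map one space-separated record onto its field names; '-' means no data."""
    names = field_names(log_format)
    values = line.split()
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    return {name: (None if value == "-" else value) for name, value in zip(names, values)}


def is_relayed(record: Dict[str, Optional[str]]) -> bool:
    """True when the packet-level source differs from the interface-level source."""
    pkt_src = record["pkt-srcaddr"]
    return pkt_src is not None and pkt_src != record["srcaddr"]


# Made-up record as it might appear on a NAT gateway ENI: srcaddr is the NAT
# gateway's private IP, pkt-srcaddr is the instance that originated the packet.
SAMPLE = "5 eni-0123456789abcdef0 10.0.1.25 203.0.113.10 10.0.2.77 203.0.113.10 - egress 8 ACCEPT"

record = parse_record(SAMPLE)
print(record["srcaddr"], record["pkt-srcaddr"], is_relayed(record))
```

Keeping the format string next to the parser means a change to the flow log's field list only has to be mirrored in one place.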
What Flow Logs do NOT capture
Flow Logs explicitly exclude:
- Traffic to and from the Amazon DNS server (169.254.169.253).
- Windows license activation traffic.
- Instance Metadata Service requests (169.254.169.254 — including IMDSv2).
- Amazon Time Sync (169.254.169.123).
- DHCP traffic.
- Traffic to the VPC router reserved address (.1).
- Mirrored traffic (copies produced by Traffic Mirroring are not Flow-Log eligible).
- Traffic between an endpoint network interface and a Network Load Balancer network interface.
For these blind spots, use Traffic Mirroring or host-based logging.
A frequent ANS-C01 distractor: a question shows REJECT records and asks "is the SG or NACL blocking this?". Flow Logs only record the action (REJECT), not which control did the dropping. To diagnose: check the SG inbound rules for the destination ENI; if the SG allows, the NACL must be denying. Network Firewall drops do not appear as VPC Flow Log REJECTs at all — they appear in Network Firewall flow logs (separate). Memorise: Flow Logs metadata only, no per-control attribution. Reference: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html
VPC Reachability Analyzer — Path Validation
VPC Reachability Analyzer is a static-analysis tool that simulates whether traffic can reach from a source to a destination given the current VPC, route table, SG, NACL, and gateway configurations. It does not send actual packets; it analyses the configuration graph.
What it analyzes
You specify a source and destination — instances, ENIs, IGWs, VGWs, TGWs, peering connections, VPC endpoints. Reachability Analyzer computes the path: each hop, each control evaluated (SG, NACL, route table), and reports REACHABLE or NOT-REACHABLE with the specific blocking control.
Use cases
- Pre-deployment validation — before pushing a change, simulate that the change does not break expected paths.
- Compliance — recurring scheduled analysis to verify "production VPC is NOT reachable from dev VPC". Failure raises an alert.
- Troubleshooting — operator submits "why can't my Lambda reach RDS?" and Reachability Analyzer pinpoints the missing route table entry.
Automation pattern
A typical ANS-C01 scenario: every time a config change is detected (via CloudTrail event), an EventBridge rule triggers a Lambda that calls Reachability Analyzer for a set of "intent paths" (e.g. "production app must reach RDS", "dev VPC must not reach production VPC"). Lambda raises SNS alerts on path-state changes. This is the "automate the verification of connectivity intent" skill from the exam guide.
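The Lambda side of that pattern can be sketched roughly as follows. This assumes the intent paths were created beforehand (for example with `create_network_insights_path`) and that `ec2` is a boto3 EC2 client passed in by the caller; the function name `check_intent`, the polling parameters, and the shape of the returned dict are hypothetical choices, and the SNS publish step is left out.

```python
import time
from typing import Any, Dict


def check_intent(ec2: Any, path_id: str, expect_reachable: bool,
                 poll_seconds: float = 2.0, max_polls: int = 30) -> Dict[str, Any]:
    """Run one Reachability Analyzer analysis and compare it with the stated intent."""
    started = ec2.start_network_insights_analysis(NetworkInsightsPathId=path_id)
    analysis_id = started["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]

    # The analysis is asynchronous, so poll until it leaves the 'running' state.
    for _ in range(max_polls):
        response = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id]
        )
        analysis = response["NetworkInsightsAnalyses"][0]
        if analysis["Status"] != "running":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"analysis {analysis_id} still running after {max_polls} polls")

    if analysis["Status"] != "succeeded":
        raise RuntimeError(f"analysis {analysis_id} ended with status {analysis['Status']}")

    reachable = bool(analysis["NetworkPathFound"])
    # A violation is any mismatch: a must-reach path that is blocked,
    # or a must-not-reach path that is open.
    return {"path_id": path_id, "reachable": reachable,
            "violation": reachable != expect_reachable}
```

A handler would loop over its list of intent paths, call `check_intent` for each, and publish to SNS for every result whose `violation` flag is true.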
Limitations
Reachability Analyzer is a simulation, not a live test:
- It analyses static config — runtime issues (out-of-memory, packet drops at the host) are invisible.
- It does not test latency, throughput, or jitter — only reachability.
- Cross-region paths are limited to TGW peering reachability.
- It does not analyse third-party appliances (e.g. a Palo Alto VM-Series in a GWLB inspection VPC's logic is opaque).
A common ANS-C01 trap: candidates assume Reachability Analyzer sends ICMP probes or ping. It does not. It is a graph-traversal simulation over the configuration. If the configuration graph says reachable but the application reports timeouts, the issue is at runtime (instance health, application bug, host firewall, MTU mismatch) — diagnose with Traffic Mirroring or instance-level tools. Reference: https://docs.aws.amazon.com/vpc/latest/reachability/what-is-reachability-analyzer.html
Network Access Analyzer — Unintended Access Findings
Network Access Analyzer is a different analyzer: instead of "can A reach B?", it answers "what unauthorised paths exist?". You define a scope (e.g. "VPCs in the production OU should not have any IGW or VPN egress except via the inspection VPC"), and the analyzer surfaces every path that violates the scope.
Scope definition
Scopes are JSON documents specifying source-and-destination predicates: "from any production VPC", "to internet (0.0.0.0/0)", "via any path that does not include the inspection VPC". Scopes can be reused across analyses.
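As a rough illustration of what such a scope looks like, the sketch below builds one as a Python dict and prints it as JSON. The key names follow the request shape of the EC2 `create_network_insights_access_scope` API as I understand it; treat the exact structure as an assumption to verify against the API reference. The function name and the choice of Network Firewall as the permitted inspection hop are hypothetical.

```python
import json
from typing import Any, Dict


def internet_egress_scope() -> Dict[str, Any]:
    """Match any ENI-to-internet-gateway path, excluding paths that are inspected."""
    return {
        "MatchPaths": [
            {
                "Source": {
                    "ResourceStatement": {"ResourceTypes": ["AWS::EC2::NetworkInterface"]}
                },
                "Destination": {
                    "ResourceStatement": {"ResourceTypes": ["AWS::EC2::InternetGateway"]}
                },
            }
        ],
        # Assumption: paths that traverse a Network Firewall count as authorised,
        # so they are excluded from the findings.
        "ExcludePaths": [
            {
                "ThroughResources": [
                    {
                        "ResourceStatement": {
                            "ResourceTypes": ["AWS::NetworkFirewall::NetworkFirewall"]
                        }
                    }
                ]
            }
        ],
    }


print(json.dumps(internet_egress_scope(), indent=2))
```

Every path that matches MatchPaths and is not removed by ExcludePaths becomes a finding, which is what makes the scope a statement of what should never exist rather than a single source-to-destination test.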
Findings classification
Findings are categorised by severity and include the full path. Integrate with Security Hub for cross-region aggregation.
Compared to Reachability Analyzer
- Reachability Analyzer: "can A reach B?" — given a specific source and destination.
- Network Access Analyzer: "what surprising paths exist?" — given a scope, find violations.
Both are static configuration analysis; both are exam-relevant.
Traffic Mirroring — Deep Packet Inspection
VPC Traffic Mirroring copies network packets from a source ENI to a target for out-of-band analysis. It is the answer to "capture full packet payloads for forensic review or IDS without disrupting production".
Components — source, target, filter, session
A mirror source is an ENI on a Nitro-based EC2 instance (older instance types are not supported). A mirror target is a Network Load Balancer, a Gateway Load Balancer endpoint, or another ENI. A mirror filter defines which traffic to copy. A mirror session ties source + target + filter together.
Nitro-only requirement
Traffic Mirroring works only on Nitro-based instances: M5/M5n/M6i, C5/C6i, R5/R6i, T3, and similar generations. Older types (M4, C4, R4, T2) are not supported. ANS-C01 routinely tests this.
Use cases
- Forensic capture during an incident.
- Threat hunting with Suricata IDS.
- Compliance packet retention.
- Deep performance debugging for retransmits, packet loss.
Cost
Mirrored bandwidth is duplicated — every byte mirrored adds to the network bill. For high-throughput instances, this can be expensive. Use mirror filters to reduce the volume.
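One way to keep mirrored volume down is a filter that accepts only the traffic you actually need to inspect. The sketch below builds the keyword arguments for a single inbound rule that mirrors TCP 443 only; the parameter names match boto3's `create_traffic_mirror_filter_rule` as far as I know, while the function name, filter ID, and default CIDRs are placeholders.

```python
from typing import Any, Dict


def https_only_ingress_rule(filter_id: str, rule_number: int = 100,
                            source_cidr: str = "0.0.0.0/0",
                            dest_cidr: str = "0.0.0.0/0") -> Dict[str, Any]:
    """Keyword arguments for one mirror-filter rule that copies inbound TCP/443 only."""
    return {
        "TrafficMirrorFilterId": filter_id,
        "TrafficDirection": "ingress",
        "RuleNumber": rule_number,
        "RuleAction": "accept",
        "Protocol": 6,  # IANA protocol number for TCP
        "DestinationPortRange": {"FromPort": 443, "ToPort": 443},
        "SourceCidrBlock": source_cidr,
        "DestinationCidrBlock": dest_cidr,
    }


# Usage with a real client would look like:
#   ec2.create_traffic_mirror_filter_rule(**https_only_ingress_rule("tmf-..."))
params = https_only_ingress_rule("tmf-0123456789abcdef0")
print(params["DestinationPortRange"])
```

Traffic that matches no accept rule is not mirrored, so a narrow accept list is usually enough to cut the duplicated bandwidth.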
A common ANS-C01 distractor: a scenario describes Traffic Mirroring on an m4.large or c4.xlarge instance — the answer is "you cannot mirror from non-Nitro instances; upgrade to m5 or c5". The exam version tests this specifically: "you've enabled Traffic Mirroring but no packets reach the target". Answer: source is non-Nitro. Reference: https://docs.aws.amazon.com/vpc/latest/mirroring/what-is-traffic-mirroring.html
Transit Gateway Network Manager — Global Topology
Transit Gateway Network Manager is the global view of your TGW-based network across regions and accounts. On ANS-C01 it is the canonical answer to "give me one pane of glass for our 5-region 47-account TGW mesh".
Global Network registration
A Global Network is the umbrella construct. You register TGWs (and on-premises devices, sites, and links) in the Global Network for a unified topology view.
Topology and route analysis
The Network Manager UI shows a topology map of all attachments, route tables, and propagated routes. Route Analyzer is a tool inside Network Manager that traces packets across attachments — answering "if I send a packet from VPC A to VPC B, what TGW route table is consulted, what propagated routes are matched, and what is the next hop?". This is the TGW-cross-attachment analogue of Reachability Analyzer.
Event notifications
Network Manager publishes events to EventBridge when topology changes (attachment created, route table modified, peering established) — useful for triggering Reachability Analyzer or compliance checks on every change.
CloudWatch metrics for TGW
Network Manager surfaces TGW metrics: BytesIn, BytesOut, PacketsIn, PacketsOut, PacketDropCountBlackhole, PacketDropCountNoRoute. The drop counters are critical for diagnosis: PacketDropCountBlackhole counts packets that matched a blackhole route (usually an intentional drop in the TGW route table), while PacketDropCountNoRoute counts packets that matched no route at all, which usually points to a configuration bug.
Compared to Reachability Analyzer
- Reachability Analyzer: VPC-level path analysis.
- Route Analyzer (in Network Manager): TGW-level path analysis.
Both are exam-relevant; pick based on whether the problem is intra-VPC or cross-attachment.
- VPC Flow Logs: metadata of every flow (5-tuple), cheap, always-on.
- Traffic Mirroring: full packet copy, expensive, Nitro-only.
- Reachability Analyzer: VPC-level static path analysis (not live test).
- Network Access Analyzer: scope-based "unintended paths" finder.
- Route Analyzer: TGW-level path tracing inside Network Manager.
- Transit Gateway Network Manager: global topology and metrics across TGWs.
- Reference: https://docs.aws.amazon.com/vpc/latest/tgw/tgw-network-manager.html
CloudWatch Network Metrics
CloudWatch is the metric and alarm plane for AWS networking services. Each service publishes a distinct set of metrics relevant to monitoring.
NAT Gateway metrics
- BytesInFromDestination, BytesOutToSource, BytesInFromSource, BytesOutToDestination: bytes by direction.
- ConnectionAttemptCount, ConnectionEstablishedCount: connection rates.
- ErrorPortAllocation: port exhaustion (the canonical NAT GW failure mode).
- IdleTimeoutCount: idle TCP timeouts.
- PacketsDropCount: dropped packets.
ErrorPortAllocation > 0 is a critical alert — it means the NAT GW has run out of ephemeral ports and new connections are failing.
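A minimal sketch of that alarm, expressed as the keyword arguments for boto3's `put_metric_alarm`. The namespace, metric name, and dimension are the ones NAT Gateway publishes; the function name, alarm name, one-minute period, and SNS topic ARN are illustrative assumptions.

```python
from typing import Any, Dict


def nat_port_exhaustion_alarm(nat_gateway_id: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Keyword arguments for an alarm that fires on any port-allocation error."""
    return {
        "AlarmName": f"natgw-port-exhaustion-{nat_gateway_id}",
        "Namespace": "AWS/NATGateway",
        "MetricName": "ErrorPortAllocation",
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "Statistic": "Sum",
        "Period": 60,                 # evaluate one-minute buckets
        "EvaluationPeriods": 1,       # a single breaching bucket is enough
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no datapoints means no errors
        "AlarmActions": [sns_topic_arn],
    }


# Usage with a real client would look like:
#   cloudwatch.put_metric_alarm(**nat_port_exhaustion_alarm(nat_id, topic_arn))
alarm = nat_port_exhaustion_alarm(
    "nat-0123456789abcdef0",
    "arn:aws:sns:us-east-1:111122223333:network-oncall",  # placeholder ARN
)
print(alarm["AlarmName"])
```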
Direct Connect metrics
- ConnectionState: 1 if up, 0 if down.
- ConnectionBpsEgress, ConnectionBpsIngress: bandwidth.
- ConnectionPpsEgress, ConnectionPpsIngress: packet rate.
- ConnectionLightLevelTx, ConnectionLightLevelRx: fiber optic transmit/receive light levels (warning of fiber degradation).
- VirtualInterfaceBpsEgress, etc.: per-VIF.
VPN tunnel metrics
- TunnelState: 1 if up, 0 if down. Per tunnel.
- TunnelDataIn, TunnelDataOut.
- TunnelEstablished: recent failure events.
TGW metrics
- BytesIn, BytesOut, PacketsIn, PacketsOut.
- PacketDropCountBlackhole, PacketDropCountNoRoute.
ALB metrics
- RequestCount, TargetResponseTime, HealthyHostCount, UnHealthyHostCount.
- HTTPCode_Target_4XX_Count, HTTPCode_Target_5XX_Count.
- RejectedConnectionCount: when ALB rejects due to capacity.
NLB metrics
- ActiveFlowCount, NewFlowCount, ProcessedBytes.
- TCP_Client_Reset_Count, TCP_Target_Reset_Count, TCP_ELB_Reset_Count: RST counts that reveal misbehavior.
CloudWatch alarms — proactive alerting
Set alarms on critical metrics:
- BGP session down: Direct Connect VirtualInterfaceState != UP for 5 minutes → page on-call.
- NAT GW port exhaustion: ErrorPortAllocation > 0 → page immediately.
- VPN tunnel flap: TunnelState changes > 3 in 1 hour → page.
- Bandwidth saturation: ConnectionBpsEgress > 80% of port capacity → warn (capacity planning).
- TGW route drops: PacketDropCountNoRoute > 0 → page (route table bug).
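Two of these thresholds involve a little arithmetic, sketched below under stated assumptions: the state series is a list of 1/0 samples already pulled from CloudWatch (one per period), and port capacity is given in Gbps. The function names and defaults are hypothetical.

```python
from typing import Sequence


def count_state_changes(samples: Sequence[int]) -> int:
    """Number of up/down transitions between consecutive samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def is_flapping(samples: Sequence[int], max_changes: int = 3) -> bool:
    """Flap rule from the list above: more than 3 state changes in the window."""
    return count_state_changes(samples) > max_changes


def saturation_threshold_bps(port_gbps: int, percent: int = 80) -> int:
    """Alarm threshold in bits per second for a given port speed (integer math)."""
    return port_gbps * 1_000_000_000 * percent // 100


# One hour of five-minute TunnelState samples (made-up data).
hour = [1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]
print(count_state_changes(hour), is_flapping(hour))
print(saturation_threshold_bps(10))
```

For the sample hour there are four transitions, so the flap rule fires; a 10 Gbps port yields a threshold of 8,000,000,000 bps.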
Key signals to remember:
- NAT GW: ErrorPortAllocation is the canonical port-exhaustion signal.
- Direct Connect: ConnectionState, plus ConnectionLightLevelTx/Rx for fiber-layer health.
- VPN tunnel: TunnelState per tunnel; alarm on flap.
- TGW: PacketDropCountNoRoute is a route table bug signal.
- ALB: HTTPCode_ELB_5XX_Count is LB-side; HTTPCode_Target_5XX_Count is backend.
- CloudWatch metrics latency: ~1–3 minutes for standard, ~10 seconds for high-resolution.
- Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html
Access Logging — ALB, NLB, CloudFront, API Gateway
Access logs capture every HTTP request (or TCP connection for NLB) handled by the LB or edge service. They complement Flow Logs (network-layer) with application-layer detail.
ALB access logs
ALB writes access logs to S3 (no CloudWatch option). Each log line contains: request type, time, load balancer ID, client IP:port, target IP:port, request processing time, target processing time, response processing time, ELB and target status codes, received and sent bytes, request line, user agent, SSL cipher and protocol, target group ARN, trace ID, domain name, chosen cert ARN, matched rule priority, request creation time, action executed, redirect URL, error reason, target status, classification, classification reason.
Access logs are written every 5 minutes to S3. Use Athena for querying.
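For quick inspection outside Athena, the leading fields of a log line can be split with the standard library. This is a sketch only: the sample line is made up, just the first 14 fields are named, and the helper names are hypothetical. Quoted fields (request line, user agent) are why `shlex.split` is used instead of a plain `split`.

```python
import shlex
from typing import Dict

# Names of the leading fields, in the order they appear in an ALB access log line.
LEADING_FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time", "response_processing_time",
    "elb_status_code", "target_status_code", "received_bytes", "sent_bytes",
    "request", "user_agent",
]


def parse_alb_line(line: str) -> Dict[str, str]:
    """Split one log line, honouring double-quoted fields, and name the leading ones."""
    return dict(zip(LEADING_FIELDS, shlex.split(line)))


def total_seconds(entry: Dict[str, str]) -> float:
    """Sum of the three processing times; a value of -1 means that stage never completed."""
    parts = [float(entry[key]) for key in LEADING_FIELDS[5:8]]
    if any(value < 0 for value in parts):
        raise ValueError("request was not fully processed")
    return sum(parts)


# Made-up sample line, truncated after the TLS fields.
SAMPLE_LINE = (
    'https 2024-05-01T12:00:00.123456Z app/my-alb/50dc6c495c0c9188 '
    '198.51.100.7:46532 10.0.1.15:8080 0.001 0.790 0.000 200 200 512 2048 '
    '"GET https://example.com:443/orders?id=1 HTTP/1.1" "curl/8.0.1" '
    'ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2'
)

entry = parse_alb_line(SAMPLE_LINE)
print(entry["client"], entry["elb_status_code"], round(total_seconds(entry), 3))
```

A large target_processing_time with small request and response times points at the backend rather than the load balancer, which is the split the 5XX metrics express in aggregate.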
NLB access logs
NLB access logs (TLS listeners only — no logs for TCP/UDP) similarly write to S3. Less detailed than ALB logs (NLB doesn't see HTTP).
CloudFront access logs
CloudFront has two log options: standard logs (delivered to S3 on a best-effort basis, typically several times an hour rather than in real time) and real-time logs (delivered to Kinesis Data Streams within seconds of the request). Real-time logs are field-customisable; standard logs have a fixed schema.
API Gateway access logs
API Gateway writes execution logs and access logs to CloudWatch Logs. Configurable per-stage.
Use cases
- Security investigation: which IP made the malicious request, what user agent, what response was returned.
- Performance debugging: tail latency analysis, slow-path identification.
- Billing dispute: prove or disprove a customer's claim about request volume.
Baseline Performance Capture and Trending
The Specialty exam expects you to capture a performance baseline before incidents — so deviations are detectable.
What to baseline
- TCP retransmit rate: should be <0.1% on healthy networks.
- Latency p50, p99, p999: per-region, per-AZ, per-endpoint.
- Bandwidth utilization: peak vs average per Direct Connect, per VPN tunnel.
- NAT GW connections per second: peak.
- Flow Logs flow rate: flows per minute per VPC.
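A baseline is only useful if it is reduced to numbers you can compare against. The sketch below derives p50 and p99 from a set of latency samples and flags a reading that exceeds the baseline p99 by a chosen factor. The sample data, the factor of 2, and the function names are illustrative assumptions; in practice the samples would come from CloudWatch or Synthetics canaries.

```python
import statistics
from typing import Dict, Sequence


def latency_baseline(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarise a baseline window as p50 and p99 (needs at least two samples)."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p99": cuts[98]}


def is_anomalous(current_p99_ms: float, baseline: Dict[str, float],
                 factor: float = 2.0) -> bool:
    """Flag the reading when it exceeds the baseline p99 by the given factor."""
    return current_p99_ms > factor * baseline["p99"]


# Made-up baseline: mostly ~20 ms with a slow tail approaching 50 ms.
baseline_samples = [20.0] * 90 + [30.0, 35.0, 40.0, 42.0, 44.0, 46.0, 47.0, 48.0, 49.0, 50.0]
baseline = latency_baseline(baseline_samples)

print(baseline)
print(is_anomalous(800.0, baseline))  # the 50 ms to 800 ms spike from the intro scenario
```

A fixed multiplier is the simplest possible rule; CloudWatch anomaly detection replaces it with a band learned from the metric's history.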
Tools
- CloudWatch metrics (built-in) — for AWS services.
- CloudWatch Synthetics — canaries that probe HTTP endpoints continuously.
- CloudWatch Logs Insights — query Flow Logs for derived metrics like flow rate, top talkers.
- Third-party agents (Datadog, New Relic) — for deeper application-layer telemetry.
Trending and dashboards
Build CloudWatch dashboards combining Flow Logs queries, NAT GW metrics, BGP session state, and alarm history. Update dashboards as the architecture evolves.
Common Traps Recap — Network Monitoring on ANS-C01
Trap 1: Flow Logs distinguish SG vs NACL drops
Wrong. REJECT is binary; no per-control attribution.
Trap 2: Reachability Analyzer is a live ping test
Wrong. It's a static configuration simulation.
Trap 3: Traffic Mirroring works on all instance types
Wrong. Nitro-based instances only.
Trap 4: Flow Logs capture all traffic in the VPC
Wrong. Traffic to AWS DNS, IMDS, NTP, license activation, and DHCP is excluded.
Trap 5: VPC Flow Logs include packet payloads
Wrong. Metadata only (5-tuple, bytes, packets). Use Traffic Mirroring for payload.
Trap 6: Network Firewall drops appear in VPC Flow Logs
Wrong. NFW has separate flow logs and alert logs.
Trap 7: Reachability Analyzer works across AWS accounts
Partial. Cross-account analysis requires explicit RAM-shared resources or AWS Organizations integration.
Trap 8: NAT GW BytesInFromSource indicates the bottleneck
Partial. The bottleneck signal is ErrorPortAllocation > 0, not bytes.
Trap 9: Network Manager's Route Analyzer is for VPC routes
Wrong. TGW-level routes only. For VPC route analysis, use Reachability Analyzer.
Trap 10: ALB access logs go to CloudWatch by default
Wrong. ALB logs go to S3 only. CloudFront logs go to S3 (standard) or Kinesis (real-time).
Trap 11: v5 Flow Logs cost more than v2
Wrong. Same per-GB cost. Always pick v5.
Trap 12: Reachability Analyzer can verify TGW peering reachability
Partial. It can analyse paths through TGW peering but with limitations on cross-region complexity.
ANS-C01 exam priority — Network Monitoring and Logging Design. This topic carries weight on the ANS-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes — the exam will test scenarios that hinge on knowing which service is the wrong answer, not just which is right.
FAQ — Network Monitoring on ANS-C01
Q1: When do I use VPC Flow Logs vs Traffic Mirroring vs CloudWatch metrics?
Flow Logs for connection-level metadata — cheap, always-on, ideal for security baselines, GuardDuty input, and broad anomaly detection. Traffic Mirroring for full-packet payload — expensive, targeted, ideal for forensic capture, IDS signature matching, and protocol-level debugging. CloudWatch metrics for service-level health and rate signals — bandwidth, error rates, BGP session state, port exhaustion. They are complements, not alternatives. The Specialty exam will combine all three: "Flow Logs to detect the anomaly, Traffic Mirroring to inspect the payload, CloudWatch metrics to validate the symptom".
Q2: How does Reachability Analyzer differ from Network Access Analyzer?
Reachability Analyzer answers "can resource A reach resource B given the current configuration?" — point-to-point analysis with a specific source and destination. Network Access Analyzer answers "given a scope (e.g. 'production VPCs should never reach the internet directly'), what unintended paths exist?" — broad scope analysis surfacing all violations. Use Reachability Analyzer for "verify intended paths"; use Network Access Analyzer for "discover unintended paths". Both are static-configuration analysis (not live tests). The Specialty exam scenarios will keyword "intended" → Reachability Analyzer, and "find any unauthorised" → Network Access Analyzer.
Q3: How do I diagnose a BGP session that is reported as up but is dropping packets?
Layer the telemetry. (a) Check CloudWatch Direct Connect metrics: ConnectionState (1 if up), ConnectionBpsEgress and ConnectionBpsIngress (saturation?), ConnectionLightLevelTx/Rx (fiber-layer health), ConnectionPpsEgress (packet rate). Saturation or light-level degradation suggests a physical-layer issue. (b) Check VPC Flow Logs with v5 fields. traffic-path is a numeric code, so filter on the value for the relevant gateway (for example 3 for traffic leaving through a virtual private gateway) and look for skewed packet and byte counts. Flow Logs do not record retransmissions directly; tcp-flags only shows which flags (SYN, SYN-ACK, FIN, RST) were seen in the aggregation window, so confirm suspected retransmits with Traffic Mirroring or host-level counters. (c) Check TGW Network Manager metrics if traffic crosses TGW. (d) Set up CloudWatch alarms for BGP session flap (VirtualInterfaceState changes > 3 in 1 hour). The exam answer for "BGP up but packets dropping": investigate fiber light levels (ConnectionLightLevelTx/Rx), BGP holdtime tuning, BFD enablement.
Q4: What does v5 Flow Logs add that v2 doesn't have?
Critical fields: pkt-srcaddr / pkt-dstaddr (true source/destination IPs distinct from immediate hops, essential for NAT GW analysis), pkt-src-aws-service / pkt-dst-aws-service (identifies AWS service traffic without IP-range lookup), flow-direction (ingress/egress relative to ENI), traffic-path (a numeric code for the egress path, e.g. via IGW, VGW, or VPC peering), VPC ID, subnet ID, instance ID (for cross-resource correlation), TCP flags, and region, AZ ID, and sublocation type/ID for Outposts and Wavelength workloads. ECS task fields arrive in a later format version. The per-GB price is the same as v2, although each extra field makes records slightly larger. Pick at least v5 for new deployments.
Q5: How do I set up proactive alerting when a Direct Connect BGP session drops?
Create a CloudWatch alarm on the VirtualInterfaceState metric for each VIF: Sum < 1 for 1 datapoint of 5 minutes. The metric is 1 when up and 0 when down. Set the alarm action to publish to an SNS topic, which routes to PagerDuty, Slack, or email. For the BGP-flap pattern (intermittent up/down), use CloudWatch metric math to derive a count of recent state changes and alarm if there are more than 3 in 1 hour. Add a similar alarm for ConnectionBpsEgress saturation (>80% of port capacity, for capacity planning). The Specialty answer for "design proactive BGP monitoring": CloudWatch alarms on VirtualInterfaceState + ConnectionLightLevelTx/Rx degradation + ConnectionPpsEgress saturation, all routing to SNS.
Q6: When does Traffic Mirroring fail to deliver packets to the target?
Common failure modes:
- Source ENI is on a non-Nitro instance (m4, c4, etc.) — Traffic Mirroring is silently inert.
- Source and target are in different accounts — VPC peering or TGW connectivity is required.
- Mirror target NLB is overwhelmed — bandwidth doubled (original + mirrored), and the NLB can't keep up.
- Mirror filter is too restrictive — packets are filtered out before being sent.
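- Mirror target ENI has tight security groups that drop the VXLAN-encapsulated mirror traffic (UDP port 4789).
Diagnose with: the source instance's EC2 mirroring metrics in CloudWatch (NetworkMirrorIn/NetworkMirrorOut for mirrored traffic and NetworkSkipMirrorIn/NetworkSkipMirrorOut for traffic that was eligible but not mirrored), target NLB metrics (ActiveFlowCount, ProcessedBytes), and NLB target health.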
Q7: Where do I store Flow Logs for long-term retention with cost-effective querying?
S3 with Athena is the canonical pattern. Configure Flow Logs with destination S3, format Parquet (smaller, faster to query than text), partitioned by account/region/date. Athena queries scan only the relevant partitions (drastic cost reduction). For >90 day retention, transition to S3 Glacier Instant Retrieval for further cost reduction. Use Athena's CTAS (CREATE TABLE AS SELECT) to materialize derived tables (e.g. "top 10 talkers by hour") for fast dashboard queries. The Specialty answer for "retain Flow Logs cost-effectively for 1 year with ad-hoc queries": S3 Parquet + Athena + lifecycle policies to Glacier.
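The flow log side of that pattern can be sketched as the keyword arguments for boto3's `create_flow_logs`. The parameter names are the ones the EC2 API accepts as far as I know; the function name, bucket ARN, VPC ID, and the particular field list are placeholders chosen for illustration.

```python
from typing import Any, Dict

# Illustrative field selection; extend it with whatever your queries need.
LOG_FORMAT = (
    "${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} "
    "${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status} "
    "${pkt-srcaddr} ${pkt-dstaddr} ${flow-direction} ${traffic-path}"
)


def s3_parquet_flow_log(vpc_id: str, bucket_arn: str) -> Dict[str, Any]:
    """Keyword arguments for a VPC-level flow log written to S3 as partitioned Parquet."""
    return {
        "ResourceIds": [vpc_id],
        "ResourceType": "VPC",
        "TrafficType": "ALL",
        "LogDestinationType": "s3",
        "LogDestination": bucket_arn,
        "LogFormat": LOG_FORMAT,
        "MaxAggregationInterval": 60,  # seconds; 600 is the other allowed value
        "DestinationOptions": {
            "FileFormat": "parquet",
            "HiveCompatiblePartitions": True,  # key=value prefixes Athena can discover
            "PerHourPartition": True,
        },
    }


# Usage with a real client would look like:
#   ec2.create_flow_logs(**s3_parquet_flow_log(vpc_id, bucket_arn))
params = s3_parquet_flow_log("vpc-0123456789abcdef0", "arn:aws:s3:::example-flow-logs/prod/")
print(params["DestinationOptions"])
```

Hive-compatible, per-hour partitions let an Athena query for a single hour in a single account read only that prefix, which is where most of the cost saving comes from.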
Q8: What is the difference between Route Analyzer and Reachability Analyzer?
Reachability Analyzer is VPC-scoped — analyses route tables, SGs, NACLs, gateways within and adjacent to a VPC. Route Analyzer (inside Transit Gateway Network Manager) is TGW-scoped — analyses TGW route tables, attachments, propagations across TGWs and TGW peering. Pick based on where the suspected issue lives. For "VPC A's instance can't reach VPC B's RDS, both attached to the same TGW", you'd use both: Reachability Analyzer to confirm intra-VPC path, Route Analyzer to confirm TGW path. ANS-C01 frequently distinguishes them.
Q9: How do I capture a baseline of normal network behavior to detect anomalies?
Establish baselines for: (a) bandwidth utilization per Direct Connect / VPN / NAT GW (peak, p99, average), (b) TCP retransmit rate (should be <0.1%), (c) latency p50/p99/p999 between regions, between AZ, to common destinations (S3, DynamoDB, public APIs), (d) NAT GW concurrent connection count, (e) Flow rate per VPC (flows per minute), (f) BGP route advertisement count. Run these for 2–4 weeks of typical operation, then set CloudWatch anomaly-detection alarms (statistical baselines that auto-adjust). For 30-day rolling baselines, CloudWatch's anomaly detection band is the easiest answer; for static baselines, set hardcoded thresholds. ANS-C01 will reward "capture baseline before incident, alert on deviation" as the proactive posture.
Q10: What does CloudWatch Logs Insights add over Athena for Flow Log queries?
CloudWatch Logs Insights queries logs in CloudWatch Logs (not S3) with a purpose-built, pipe-delimited query syntax. Pros: near-real-time (sub-minute), interactive, integrated with CloudWatch dashboards. Cons: more expensive per ingested GB; retention is configurable from 1 day to 10 years (or never expire), but long retention multiplies that higher storage cost; partitioning is automatic but less granular. Athena queries S3-stored Flow Logs (Parquet preferred). Pros: cheaper, supports massive historical datasets, custom partitioning. Cons: query latency higher (seconds to minutes), needs explicit table DDL. Use Logs Insights for active investigations (the past 1–7 days) and Athena for historical analytics (90 days to 7 years). Many SCS-C02 / ANS-C01 architectures use both.
Further Reading and Related Operational Patterns
- VPC Flow Logs
- VPC Flow Logs Record Examples
- VPC Reachability Analyzer
- Network Access Analyzer
- VPC Traffic Mirroring
- Transit Gateway Network Manager
- Amazon CloudWatch User Guide
- ALB Access Logs
Once monitoring and logging is in place, the natural next operational layers on ANS-C01 are: VPC Flow Logs, Reachability Analyzer, and Traffic Mirroring for deeper Domain 3 troubleshooting; Network Performance — ENA, EFA, Placement Groups, and Jumbo Frames for the throughput layer that monitoring observes; AWS Network Firewall — Suricata Rules, TLS Inspection, and Centralized Deployment which generates its own flow and alert logs separate from VPC Flow Logs; and Compliance, Auditing, and Network Governance which composes Flow Logs, CloudTrail, Config, and Firewall Manager into an audit story.