
Network Monitoring, Troubleshooting, and Cost Optimization

5,400 words · ≈ 27 min read

ANS-C01 Domain 3 cross-cutting deep dive on AWS network cost optimization (NAT Gateway per-AZ deployment, cross-AZ traffic doubling, VPC peering vs Transit Gateway economics, CloudFront for egress reduction, gateway endpoints), CloudWatch network metrics, alarm patterns, and Route 53 for blue-green DNS migrations.


The AWS Certified Advanced Networking — Specialty exam (ANS-C01) treats cost optimisation as a first-class network engineering skill, not an afterthought. Domain 3 task 3.3 explicitly demands you "Optimize AWS networks for performance, reliability, and cost-effectiveness", and the cost-related skill bullets call out reducing bandwidth utilisation (CloudFront, multicast vs unicast), choosing cost-effective hybrid connectivity, picking between VPC peering and Transit Gateway based on cost, and optimising subnets to prevent IP exhaustion. The exam writes scenarios where four architectures all work but only one is economically rational — the skill being tested is recognising that the cheap path is the right answer.

This topic synthesises the cross-cutting cost and operational story across all of Domain 3. We cover the AWS data transfer pricing model with its in-region/cross-AZ/inter-region/internet-egress tiers; the NAT Gateway trap of cross-AZ data transfer being double-billed and the per-AZ deployment fix; the VPC peering versus Transit Gateway decision matrix where peering is "free" but TGW provides centralisation worth its per-attachment fee; the CloudFront lever for offloading expensive internet egress; gateway endpoints for free private access to S3 and DynamoDB; the PrivateLink interface endpoint cost trade-off; the CloudWatch metrics and alarms that catch problems before they become bills; Route 53 weighted routing for blue-green DNS migrations; subnet optimisation including secondary CIDR for IP exhaustion; and the canonical troubleshooting flow when the bill spikes unexpectedly.

Why Network Cost Sits at the Heart of ANS-C01

Networking is the second-largest line item on most AWS bills (after compute), and within networking the largest line items are typically NAT Gateway, inter-AZ data transfer, and inter-region replication. The exam cares because the wrong network architecture can multiply the bill by 5–10x without changing functionality — and recognising those architectural pitfalls is exactly what differentiates an Advanced Networking Specialty engineer from a Solutions Architect Associate.

The ANS-C01 cost-optimisation surface area:

  • NAT Gateway charges per-hour ($0.045/hr) plus per-GB processed ($0.045/GB). The hourly fee alone is ~$33/month per gateway at 24×7; with moderate egress a single NAT Gateway easily exceeds $200/month. Multiply by AZs (the standard recommendation is one NAT per AZ for HA), and the costs add up.
  • Cross-AZ data transfer charges $0.01/GB in each direction — meaning $0.02/GB round-trip — and many designs accidentally double-pay this for traffic that bounces between AZs through a single NAT Gateway.
  • Inter-region traffic is more expensive: $0.02/GB cross-region within North America, higher across continents. Inter-region VPC peering is not free.
  • VPC peering is free for the connection itself; data transfer through it is charged at the cross-AZ or inter-region rate as appropriate.
  • Transit Gateway charges per attachment per hour ($0.05/hr) plus per-GB processed ($0.02/GB). For many spoke VPCs, TGW is more economical than mesh peering; for two or three VPCs, peering wins on cost.
  • Gateway endpoints for S3 and DynamoDB are entirely free — no per-hour fee, no per-GB processing fee. They are the single highest-leverage cost lever in AWS networking.
  • Interface endpoints (PrivateLink) are charged per ENI per hour per AZ ($0.01/hr/ENI/AZ) plus per-GB processed ($0.01/GB) — meaningful at scale, and often cheaper than the NAT Gateway alternative for AWS service traffic.
  • CloudFront offloads internet egress to its edge network at lower per-GB rates than direct EC2/S3 egress, with regional and volume-tiered discounts.

ANS-C01 expects you to internalise this pricing model and recognise the patterns where a small architecture change produces 50%+ cost savings.

Plain-Language Explanation: Network Monitoring, Troubleshooting, and Cost Optimization

The mental model has three layers: the bill is shaped by topology (where traffic flows), the topology is shaped by routing decisions (which gateway each subnet uses), and the routing decisions are observable through metrics and Flow Logs (so cost spikes are diagnosable). Three analogies help.

Analogy 1: The City Highway Toll System

Imagine your VPC as a city with multiple districts (AZs), highways (subnets and route tables) connecting them, and tolls (per-GB charges) at specific bridges. The NAT Gateway is a toll bridge to the suburbs (the internet) — every car (packet) pays a toll to cross outbound, and there is a flat rate to keep the bridge open 24×7. If you put one toll bridge in each district (one NAT GW per AZ) and route district-A cars only through bridge-A, traffic crosses the bridge once and pays once. If you put one toll bridge in district-A only and force cars from district-B to first drive across the inter-district highway (cross-AZ data transfer) to reach the bridge, every car pays both the inter-district highway toll and the bridge toll — the double-charge trap. The exam-canonical fix is "one NAT GW per AZ" so each car uses its local bridge.

VPC peering is a direct private road between two specific districts — no toll on the road itself, though you still pay the inter-district fee. Transit Gateway is a central hub: one entrance, one set of tolls, and every district can route through it once connected. For a city of 3 districts, the direct private roads (full mesh = 3 roads) are cheaper than building the central hub. For a city of 30 districts, the central hub (30 connections) is dramatically cheaper than the 435 direct roads of a full mesh. Gateway endpoints to S3/DynamoDB are free private subway stations — no toll, no fare; for districts that mostly travel to the central library (S3), the subway eliminates bridge tolls entirely. CloudFront is the citywide delivery service that picks up packages from the central library and distributes them through neighborhood depots, charging less per delivery than direct shipping.

Analogy 2: The Power Grid

Network costs are like electricity bills. The per-hour NAT Gateway charge is the monthly meter rental fee — fixed regardless of usage. The per-GB data processing charge is the per-kilowatt-hour cost — variable with traffic. Cross-AZ data transfer is inter-substation transmission losses charged separately — the energy still gets delivered, but the utility charges for the long-distance transmission. Gateway endpoints are like on-site solar generation — produce/consume locally, no transmission charges, no meter fee. CloudFront is the citywide rooftop solar network — local generation at neighborhood depots reduces long-distance transmission needs.

Analogy 3: The Shipping Container Logistics

Treat each packet as a shipping container. NAT Gateway is the central port-of-export with a flat operating cost and a per-container handling fee. Cross-AZ shipping is the inter-terminal trucking within the same harbor. Direct inter-VPC peering is a rail link between two specific warehouses (free track, but a per-container fee). Transit Gateway is the central rail hub that all warehouses connect to (modest hub fee plus per-container fee). CloudFront is the prepaid bulk-shipping franchise that aggregates many shippers at lower per-container rates. Gateway endpoints to S3 are dedicated free pneumatic tubes between every warehouse and the central library — no fees, no friction.

For ANS-C01 cost questions, the shipping logistics + free pneumatic tubes for S3 mental model captures the asymmetry between paid paths (NAT GW, TGW, cross-AZ) and free paths (gateway endpoints) cleanly. When a question asks how to reduce a NAT bill, think "swap to free pneumatic tubes for S3-bound traffic" (gateway endpoint). When it asks about offloading internet egress, think "prepaid bulk-shipping franchise" (CloudFront). Reference: https://aws.amazon.com/vpc/pricing/

AWS Data Transfer Pricing Model

The pricing model has tiers based on where the bytes go. ANS-C01 expects you to recognise these on sight.

In-region, intra-AZ

Traffic between two ENIs in the same AZ via private IPv4 (using the local RFC 1918 address) is free. This is why AWS architects recommend keeping latency-critical and bandwidth-heavy components in the same AZ — both for performance and for cost.

In-region, cross-AZ (private IP)

Traffic between AZs in the same region via private IPv4 costs $0.01/GB in each direction = $0.02/GB round-trip. This is the "cross-AZ tax" — small per byte, but enormous at scale. A database replicating 1 TB/day across AZs moves ~30 TB/month; billed at $0.01/GB on each side of the transfer, that is ~$600/month in cross-AZ fees alone.

In-region, public IP (or via NAT)

Traffic to the public IP of another EC2 instance in the same region (or via NAT Gateway out and back in) costs more than private-IP traffic. This is why you should always use private IPs for intra-region communication — the path is cheaper and the security perimeter is tighter.

Inter-region

Traffic between AWS regions via inter-region VPC peering or Transit Gateway peering costs $0.02/GB within North America and Europe; higher across continents (Asia-Pacific, South America). Important: inter-region peering is NOT free, even though same-region peering's connection itself is free.

Internet egress

Traffic out to the public internet from EC2/NAT/IGW costs $0.09/GB for the first 10 TB/month, with volume tiers reducing the rate at higher monthly bands ($0.085 at 10–50 TB, $0.07 at 50–150 TB, $0.05 at 150 TB+). There is no charge for inbound traffic from the internet.

CloudFront egress

Egress through CloudFront is at lower per-GB rates than direct EC2 egress, with significant volume tiers — $0.085/GB at low volume tapering down to ~$0.02/GB at very high volume. Data transfer from AWS origins (S3, EC2, ALB) to CloudFront edge locations is free, so cache misses do not add origin egress charges on top. The CloudFront + S3 origin pattern can reduce egress costs by 50–80% for high-traffic web applications.

  • Intra-AZ private IP: free; the cheapest path.
  • Cross-AZ private IP: $0.01/GB each direction = $0.02/GB round-trip.
  • Inter-region peering: $0.02/GB within continents; higher across.
  • Internet egress (EC2 / NAT / IGW): $0.09/GB at low volume, tiered down.
  • CloudFront egress: lower per-GB than direct, with volume discounts.
  • NAT Gateway: per-hour + per-GB processed (both charged).
  • Transit Gateway: per-attachment-hour + per-GB processed.
  • Gateway endpoint (S3, DynamoDB): free.
  • Interface endpoint (PrivateLink): per-AZ-ENI-hour + per-GB processed.
  • Reference: https://aws.amazon.com/vpc/pricing/
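The tiered egress rates above lend themselves to a quick blended-cost estimator. A minimal sketch in Python, using the list prices quoted in this section (verify against the current pricing page before relying on them):

```python
# Blended monthly internet-egress cost from the volume tiers quoted above.
# Rates are the list prices cited in this section, not live pricing.
EGRESS_TIERS = [          # (tier size in GB, $/GB)
    (10 * 1024, 0.09),    # first 10 TB
    (40 * 1024, 0.085),   # next 40 TB (the 10-50 TB band)
    (100 * 1024, 0.07),   # next 100 TB (the 50-150 TB band)
    (float("inf"), 0.05), # everything above 150 TB
]

def egress_cost_usd(gb_out: float) -> float:
    """Monthly internet-egress cost for gb_out gigabytes."""
    remaining, cost = gb_out, 0.0
    for tier_gb, rate in EGRESS_TIERS:
        billed = min(remaining, tier_gb)
        cost += billed * rate
        remaining -= billed
        if remaining <= 0:
            break
    return cost
```

At 60 TB/month this works out to 10 TB at $0.09 plus 40 TB at $0.085 plus 10 TB at $0.07, i.e. $5,120 — the kind of number a cost question expects you to estimate quickly.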

NAT Gateway: The Most Misunderstood Cost Centre

The NAT Gateway is the canonical AWS cost trap. It is essential (private subnets need outbound internet for OS updates, package downloads, AWS service calls without endpoints), it is expensive ($0.045/hr per gateway plus $0.045/GB processed in most regions), and the deployment pattern dramatically affects the bill.

Per-AZ NAT GW deployment

The cost-optimal HA pattern: one NAT Gateway per AZ, with each AZ's private subnet route table pointing 0.0.0.0/0 to the local-AZ NAT GW. Traffic from an instance in AZ-1 goes to NAT-1 (no cross-AZ fee); traffic from an instance in AZ-2 goes to NAT-2 (no cross-AZ fee). Hourly cost: ~$0.045/hr × 3 AZs × 730 hours = ~$98/month for the gateways alone, plus per-GB processing.

The single-AZ NAT GW trap

The wrong pattern: deploy one NAT GW in AZ-1, and route private subnets in AZ-2 and AZ-3 to it. Now every byte from AZ-2 instances travels (a) cross-AZ to NAT-1 ($0.01/GB), (b) through NAT-1's processing ($0.045/GB), (c) out to the internet ($0.09/GB). For each gigabyte: $0.145 versus $0.135 with per-AZ NAT (just NAT processing + egress). Sounds small, but at TB scale the cross-AZ cost compounds — and the architecture also has lower availability (a NAT-1 AZ failure breaks AZ-2 and AZ-3 too).
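The arithmetic behind the trap is worth being able to reproduce on sight. A minimal sketch using the list prices quoted in this section ($0.01 cross-AZ, $0.045 NAT processing, $0.09 internet egress):

```python
# Per-GB cost of pushing one outbound gigabyte to the internet through
# a NAT Gateway, using the list prices quoted in this section.
CROSS_AZ, NAT_GB, EGRESS = 0.01, 0.045, 0.09
NAT_HOURLY, HOURS_PER_MONTH = 0.045, 730

def per_gb_cost(local_nat: bool) -> float:
    """Cost to send 1 GB out via NAT; remote-AZ NAT adds a cross-AZ hop."""
    hop = 0.0 if local_nat else CROSS_AZ
    return hop + NAT_GB + EGRESS

def monthly_hourly_cost(num_azs: int) -> float:
    """Fixed hourly charge for running one NAT Gateway per AZ."""
    return NAT_HOURLY * num_azs * HOURS_PER_MONTH
```

per_gb_cost(False) gives the $0.145 figure, per_gb_cost(True) the $0.135 figure, and monthly_hourly_cost(3) is ~$98.55 — matching the per-AZ deployment numbers above.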

NAT GW SNAT port exhaustion

A NAT Gateway uses source NAT (SNAT) with a fixed port range per gateway. The CloudWatch metric ErrorPortAllocation counts SNAT failures, which occur when too many concurrent connections to the same destination IP+port exhaust the port range. The fix: distribute connections across multiple destination IPs (load balancers in front), use multiple NAT Gateways with split routes, or move the workload to a service that does not need NAT (use VPC endpoints for AWS services).

Replacing NAT with endpoints

The single highest-impact NAT-cost reduction: eliminate NAT for AWS service traffic. (1) S3 traffic → gateway endpoint (free). (2) DynamoDB → gateway endpoint (free). (3) KMS, Secrets Manager, STS, Systems Manager, ECR, SQS, SNS, EventBridge, etc. → interface endpoints (per-AZ-ENI-hour cost, but no per-GB NAT processing). For a workload where 80% of egress is AWS-service traffic, swapping to endpoints can drop NAT bills by 80%, with a small interface-endpoint hourly cost replacing the variable per-GB NAT fee.

A standard exam pattern: scenario describes a NAT Gateway bill that doubled in a month after an auto-scaling group expanded into AZ-2 and AZ-3 from AZ-1 only. The candidate is offered four solutions. Wrong: "increase NAT Gateway size", "switch to NAT Instance", "add CloudFront". Right: deploy one NAT Gateway per AZ, route each private subnet to its local NAT Gateway, eliminating cross-AZ data transfer. Memorise: a NAT Gateway in AZ-1 serving traffic from AZ-2 instances incurs cross-AZ charges in BOTH directions (request out, response back) plus the NAT processing fee — three charges instead of one. Reference: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html

VPC Peering vs Transit Gateway: The Cost Decision

A common ANS-C01 question pattern: "you have N VPCs that all need to communicate. Which connectivity option minimises cost?". The answer depends on N.

VPC peering economics

The VPC peering connection itself is free; you pay only for data transfer at the standard cross-AZ or inter-region rate. For two VPCs in the same region, peering is the cheapest option — no per-hour fees, just data transfer.

A full mesh of N VPCs requires N(N-1)/2 peering connections — for N=3, 3 peerings; for N=10, 45 peerings; for N=30, 435 peerings. Each peering must be manually accepted, and route tables in each VPC must be updated for every other VPC's CIDR. At small N (2–4 VPCs), peering wins on cost and simplicity; beyond that, the route-table maintenance burden becomes operational pain even though raw cost stays low.
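The N(N-1)/2 growth is the whole argument against mesh peering at scale; a one-liner makes the curve concrete:

```python
def full_mesh_peerings(n: int) -> int:
    """Number of VPC peering connections needed for a full mesh of n VPCs."""
    return n * (n - 1) // 2
```

full_mesh_peerings(3), (10), and (30) reproduce the 3, 45, and 435 figures above — and each of those connections implies route-table entries in both member VPCs.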

VPC peering does not support transitive routing — VPC-A peered to VPC-B and VPC-B peered to VPC-C does not give VPC-A access to VPC-C. This is a hard limit.

Transit Gateway economics

TGW charges per-attachment-per-hour ($0.05/hr) plus per-GB processed ($0.02/GB). For a 10-VPC TGW, that is ~$365/month for the 10 attachments alone, plus data processing. For high-traffic workloads, the per-GB fee is often the dominant cost.

But TGW has overwhelming operational advantages at scale: centralised route management (one TGW route table per security segment), supports transitive routing, integrates with AWS RAM for cross-account sharing, supports inter-region peering, integrates with Site-to-Site VPN and Direct Connect Gateway, and supports SD-WAN via TGW Connect. For more than ~5 VPCs, TGW is operationally far simpler and the cost difference is justified.

The break-even

A rough heuristic: for 2 to 4 VPCs, peering is cheaper and simpler. For 5 to 10 VPCs, TGW becomes operationally better; cost is comparable. For 10+ VPCs, TGW is dominant. For multi-account or organisation-wide, TGW with AWS RAM is the only practical option — peering does not scale operationally.
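The heuristic can be sanity-checked with the attachment arithmetic from this section. A sketch that models only the TGW-specific fees; both options pay the same underlying cross-AZ/inter-region data-transfer rates, so those cancel out of the comparison:

```python
# TGW-specific monthly fees, per the list prices quoted in this section.
# VPC peering has no equivalent fees, so this total is also the monthly
# premium of a TGW hub over a full-mesh peering design.
TGW_ATTACH_HOURLY = 0.05   # $ per attachment-hour
TGW_PER_GB = 0.02          # $ per GB processed
HOURS = 730

def tgw_monthly_premium(num_vpcs: int, gb_processed: float) -> float:
    """Monthly TGW cost: one attachment per VPC plus processing fees."""
    return num_vpcs * TGW_ATTACH_HOURLY * HOURS + gb_processed * TGW_PER_GB
```

tgw_monthly_premium(10, 0) reproduces the ~$365/month attachment figure; add the per-GB term for high-traffic hubs and the processing fee quickly dominates.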

When to mix

A common pattern: TGW as the hub with a few direct VPC peerings between high-traffic pairs. For example, a database VPC and an analytics VPC sending hundreds of TB/month between them peer directly to bypass the TGW per-GB fee, while the rest of the org connects through TGW. The exam may write this as a cost-optimisation question.

Gateway endpoints — S3 and DynamoDB only

Gateway endpoints are free — no per-hour, no per-GB. Add them to every VPC that uses S3 or DynamoDB. The endpoint is a route-table entry pointing to a managed prefix list, intercepting traffic to the service's public IP ranges and routing across the AWS backbone. No ENI, no security group, no DNS override.

For workloads that read/write a lot of S3 (ML training reading datasets, log archival, backup), gateway endpoints can save hundreds to thousands of dollars per month in NAT processing fees.

Interface endpoints are per-AZ-ENI-per-hour ($0.01/hr) plus per-GB processed ($0.01/GB). For a 3-AZ deployment with one interface endpoint to KMS, that is ~$22/month plus per-GB fees.

The economic sweet spot: interface endpoints are cheaper than NAT GW when the per-GB volume of AWS service calls is high enough to amortise the per-AZ-ENI-hour. For a workload making millions of KMS calls per month, the interface endpoint pays for itself; for a workload making a few KMS calls per day, NAT is fine.

For SCS-C02-style "data perimeter" architecture, interface endpoints are mandatory regardless of cost — they provide the network-layer enforcement of the perimeter. ANS-C01 cost questions assume the security requirement is met first, then optimise.

Centralised interface endpoint pattern

For organisations with hundreds of VPCs, replicating interface endpoints in every VPC is expensive. The centralised endpoint pattern: deploy interface endpoints in one shared services VPC, and use Route 53 Resolver (with conditional forwarding rules) to direct VPC-private DNS queries for those AWS services to the central endpoint. Spoke VPCs route through TGW to reach the central endpoints. Cost: one set of endpoints amortised across many VPCs; trade-off: cross-VPC TGW fees when calling AWS services.

A subtle ANS-C01 distractor: candidates assume interface endpoints always save money. They do not — interface endpoints have hourly fees that exist whether you use them or not. For low-volume AWS-service usage (e.g., a Lambda function making a few KMS calls per day), the interface endpoint hourly fee exceeds the NAT processing fee for that small volume. The break-even calculation: the hourly fee is $0.01/hr × 730 ≈ $7.30/month per endpoint per AZ; weighed against NAT's $0.045/GB alone that is ~160 GB/month, and ~210 GB/month once the endpoint's own $0.01/GB processing fee is counted. Below that monthly volume to a given service, NAT is cheaper. ANS-C01 expects you to make this calculation when the question gives volume hints. Reference: https://aws.amazon.com/vpc/pricing/
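The break-even arithmetic fits in a few lines. A sketch using this section's list prices; note that counting the endpoint's own $0.01/GB processing fee pushes the break-even above a naive hourly-fee-only estimate:

```python
# Interface-endpoint vs NAT break-even, per the list prices in this section.
ENDPOINT_HOURLY = 0.01   # $ per ENI-hour per AZ
ENDPOINT_PER_GB = 0.01   # $ per GB processed by the interface endpoint
NAT_PER_GB = 0.045       # $ per GB processed by NAT Gateway
HOURS = 730

def endpoint_break_even_gb(num_azs: int = 1) -> float:
    """Monthly GB of AWS-service traffic above which the interface
    endpoint undercuts routing that traffic through NAT."""
    fixed = ENDPOINT_HOURLY * HOURS * num_azs
    saving_per_gb = NAT_PER_GB - ENDPOINT_PER_GB  # $0.035/GB differential
    return fixed / saving_per_gb
```

endpoint_break_even_gb(1) lands around 209 GB/month per AZ; a 3-AZ endpoint needs roughly three times that volume before it beats NAT on cost.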

CloudFront for Egress Cost Reduction

Amazon CloudFront is AWS's content delivery network with edge locations globally. The performance benefit (reduced latency for end users) is well-known; the cost benefit is equally important and tested by ANS-C01.

How CloudFront reduces egress cost

CloudFront has its own per-GB egress pricing, lower than direct EC2/S3 egress at all volume tiers, and offers regional volume tiers that compound to substantial savings at scale. For traffic that is cacheable (static content, video, software downloads), CloudFront can serve from edge cache without touching the origin — meaning zero origin egress cost for the cache hit ratio. Even for un-cached traffic, CloudFront-fronted egress is billed at lower rates than direct.

Origin protection and pricing patterns

When CloudFront fronts S3, configure Origin Access Control (OAC) so the S3 bucket is private and only CloudFront can read it. This (a) tightens security, (b) ensures all egress goes through CloudFront's lower pricing, and (c) prevents bypass that would generate direct S3 egress charges.

For ALB-fronted dynamic content, CloudFront still helps via TLS termination at the edge (offloading from ALB), and via the lower-rate egress tier on the path back to the user.

When CloudFront is NOT the answer

CloudFront makes economic sense when (a) the response is cacheable, (b) the user base is geographically distributed (the latency benefit matters), or (c) the egress volume is high enough that the volume tiers kick in. For a low-volume internal API serving a few thousand requests per day to users in one region, CloudFront overhead is not justified.

CloudFront does not support UDP origins or non-HTTP protocols. For UDP/TCP non-HTTP workloads (gaming, VoIP, financial trading), Global Accelerator is the equivalent — anycast static IPs, traffic dial, and lower-latency routing, but without caching.

CloudWatch Metrics, Alarms, and Operational Patterns

Network metrics worth alarming on

  • NAT Gateway: ErrorPortAllocation > 0 (SNAT exhaustion), BytesOutToDestination sustained spike (cost spike alert), IdleTimeoutCount (connection-pool sizing).
  • Transit Gateway: PacketDropCountBlackhole > 0 (intentional drops happening too often = misconfiguration), BytesIn / BytesOut spike (cost alert).
  • Direct Connect: ConnectionState = 0 (link down), ConnectionLightLevelTx / Rx outside nominal range (physical-layer warning).
  • VPN connection: TunnelState = 0 (tunnel down), TunnelDataIn / Out to validate active/active sharing.
  • Application Load Balancer: HTTPCode_Target_5XX_Count, UnHealthyHostCount, TargetResponseTime p99.
  • VPC Flow Logs: derived metric filters on REJECT count, top byte consumers per CIDR.

Cost-spike alarm pattern

CloudWatch billing alarms on the AWS/Billing namespace can alert on overall account or per-service spend exceeding thresholds. For network-specific spend, AWS Cost Explorer with daily granularity and Cost Anomaly Detection give finer visibility. The actionable canary: NAT Gateway processed bytes doubling week-over-week is the typical signal of an architectural change that bloated the bill.
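A NAT cost-spike alarm of this kind can be sketched as the parameter set for boto3's cloudwatch.put_metric_alarm. The namespace, metric, and dimension names are the real NAT Gateway CloudWatch names; the threshold, period, and SNS topic are placeholders you would tune:

```python
def nat_bytes_alarm_params(nat_gateway_id: str, threshold_bytes: float,
                           sns_topic_arn: str) -> dict:
    """Alarm parameters for a sustained spike in NAT processed bytes.

    Pass the result to boto3: cloudwatch.put_metric_alarm(**params).
    The period/evaluation values here are illustrative, not recommendations.
    """
    return {
        "AlarmName": f"nat-bytes-spike-{nat_gateway_id}",
        "Namespace": "AWS/NATGateway",
        "MetricName": "BytesOutToDestination",
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "Statistic": "Sum",
        "Period": 3600,                       # one-hour buckets
        "EvaluationPeriods": 3,               # three consecutive breaches
        "Threshold": threshold_bytes,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],      # e.g. an SNS topic ARN
    }
```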

Network insights via Network Manager

For TGW deployments, TGW Network Manager centralises metrics across multiple TGWs and exposes them as CloudWatch metrics for the registered "global network". Alarms on BytesProcessed per-attachment can flag a misbehaving spoke.

  • Intra-AZ private IP: free.
  • Cross-AZ private IP: $0.01/GB each direction = $0.02/GB RT.
  • Inter-region peering: $0.02/GB within continents.
  • Internet egress (EC2/NAT/IGW): $0.09/GB up to 10 TB, tiered down.
  • NAT Gateway: ~$0.045/hr + ~$0.045/GB processed.
  • Transit Gateway: ~$0.05/hr per attachment + ~$0.02/GB processed.
  • Gateway endpoint: free (S3, DynamoDB only).
  • Interface endpoint: ~$0.01/hr per AZ per ENI + ~$0.01/GB.
  • CloudFront egress: tiered, lower than direct.
  • NAT cross-AZ trap: traffic from AZ-2 instance to AZ-1 NAT GW pays cross-AZ fee BOTH directions plus NAT fee.
  • VPC peering: free same-region; inter-region $0.02/GB.
  • VPC peering does NOT support transitive routing.
  • Break-even (rough): 2-4 VPCs peering wins; 5-10 break-even; 10+ TGW dominant.
  • NAT replacement candidates: S3/DynamoDB to gateway endpoint; KMS/SM/STS to interface endpoint.
  • Reference: https://aws.amazon.com/transit-gateway/pricing/

Route 53 for Blue-Green DNS Migrations

A canonical Domain 3.3 skill: Route 53 weighted routing for traffic migration. The pattern: deploy a new version of the workload to a new ALB or instance set; create Route 53 records with weighted routing — 95% old, 5% new; gradually shift weights over hours or days while monitoring metrics; complete migration when 100% on new and old can be decommissioned.

Weighted routing economics

Route 53 charges per query ($0.40 per million queries for the first billion). For most applications this is negligible. Health checks ($0.50 per health check per month for HTTPS) add to the bill but are required for failover routing.

Failover routing for HA

Primary record points to the active site; secondary record points to standby. A health check on the primary determines routing — if the primary fails, Route 53 returns the secondary. Useful for active-passive disaster recovery; for active-active, use weighted records with health checks.

Latency-based routing for geo-optimisation

Route 53 latency policy directs each user to the AWS region with lowest measured latency from the user's resolver. Reduces application response time without changing the application; particularly useful for global user bases. Latency routing pairs naturally with regional ALBs/NLBs.

The exam-canonical traffic-migration pattern: (1) deploy new infrastructure (new ALB, new instance set) alongside old; (2) create Route 53 weighted records — 95% pointing to old ALB, 5% to new; (3) monitor CloudWatch metrics on the new infrastructure for errors, latency, capacity; (4) shift weights in 5%-10% increments until 100% on new; (5) decommission old. The TTL on the records determines how fast clients pick up changes — short TTL (60s) for fast rollback, longer TTL for cost reduction. Combine with health checks on the new infrastructure for automatic rollback. Reference: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy.html
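Step (2) of the pattern maps to a Route 53 change batch. A hedged sketch of the request body for boto3's route53.change_resource_record_sets; the record name, CNAME targets, and the 0-100 weight scale are illustrative choices (Route 53 weights are relative, so a 0-100 scale simply reads as percentages):

```python
def weighted_change_batch(record_name: str, old_target: str,
                          new_target: str, new_weight: int,
                          ttl: int = 60) -> dict:
    """ChangeBatch shifting new_weight% of traffic to new_target.

    Usage: route53.change_resource_record_sets(
               HostedZoneId=zone_id, ChangeBatch=batch)
    """
    def upsert(set_id: str, target: str, weight: int) -> dict:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,   # distinguishes weighted records
                "Weight": weight,
                "TTL": ttl,                # short TTL -> fast rollback
                "ResourceRecords": [{"Value": target}],
            },
        }
    return {"Changes": [
        upsert("old", old_target, 100 - new_weight),
        upsert("new", new_target, new_weight),
    ]}
```

Shifting weights over the migration is then repeated calls with new_weight = 5, 25, 50, 100, watching CloudWatch between each step.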

Subnet Optimisation and IP Exhaustion

A frequent task 3.3 skill: updating and optimising subnets to prevent the depletion of available IP addresses within a VPC and to support increased application load via auto scaling.

Detecting IP exhaustion

CloudWatch does not directly alarm on subnet IP exhaustion, but a custom metric publishing AvailableIpAddressCount (queried via the EC2 API) on a schedule lets you alarm on remaining IPs in critical subnets. Auto Scaling failures with "InsufficientFreeAddressesInSubnet" error are the visible symptom — by then, scaling is already broken.
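The custom-metric approach can be sketched as a small transform over the EC2 DescribeSubnets response. The SubnetId and AvailableIpAddressCount fields are the real API field names; the Custom/VPC namespace and the EventBridge-scheduled Lambda wrapper are assumptions:

```python
def subnet_ip_metric_data(describe_subnets_response: dict) -> list:
    """Convert an EC2 DescribeSubnets response into MetricData entries.

    Publish on a schedule (e.g. an EventBridge-triggered Lambda) with
    cloudwatch.put_metric_data(Namespace="Custom/VPC", MetricData=data),
    then alarm on low values for critical subnets.
    """
    return [
        {
            "MetricName": "AvailableIpAddressCount",
            "Dimensions": [{"Name": "SubnetId", "Value": s["SubnetId"]}],
            "Value": s["AvailableIpAddressCount"],
            "Unit": "Count",
        }
        for s in describe_subnets_response["Subnets"]
    ]
```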

Mitigations

  • Add secondary CIDR blocks to the VPC. AWS supports up to 5 CIDRs per VPC (raisable). Add a non-overlapping range and create new subnets in it. Existing subnets are unchanged; new workloads go to the new subnets.
  • RFC 6598 (100.64.0.0/10) for shared address space — useful when RFC 1918 (10/8, 172.16/12, 192.168/16) is exhausted across the org. RFC 6598 is reserved for CGNAT but is technically usable in private VPCs without conflict.
  • Larger subnets — refactor /24 (256 IPs) to /22 (1024 IPs) where feasible; this requires a migration since AWS does not resize subnets in place.
  • Prefix delegation for IPv6 — assign /80 to the ENI rather than individual IPv6 addresses; supports massive scale.
  • Move stateless workloads to managed services that handle IPs internally — Fargate, Lambda — reducing the IP-per-replica ratio.

IP planning patterns

For new VPCs at scale, plan the CIDR with 5x growth headroom. For example, 10.x.0.0/16 (65k addresses) divided into per-AZ /20 blocks (4k each) gives plenty of headroom for several years. For high-density EKS clusters with secondary CIDR for pods (Cilium, Calico-with-VPC-IPAM), allocate the secondary CIDR up front.
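One arithmetic detail worth memorising when sizing subnets: AWS reserves 5 addresses in every subnet (network, VPC router, DNS, future use, broadcast), so usable capacity is slightly below the power of two. A one-line helper:

```python
def usable_ips(prefix_len: int) -> int:
    """Usable addresses in an AWS subnet of the given IPv4 prefix length.
    AWS reserves 5 addresses in every subnet regardless of size."""
    return 2 ** (32 - prefix_len) - 5
```

A /24 yields 251 usable addresses, a /22 yields 1019, and a per-AZ /20 block yields 4091 — headroom numbers to keep in mind when the question quotes auto-scaling maximums.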

Common Traps Recap — Network Cost on ANS-C01

The traps the exam writes most frequently in cost optimisation.

Trap 1: VPC peering is always free

Wrong. Peering connection itself is free; inter-region peering data transfer is $0.02/GB, and same-region cross-AZ traffic via peering is still $0.01/GB each way. Only intra-AZ peer traffic via private IP is free.

Trap 2: NAT Gateway is the only outbound option

Wrong. Many workloads can replace NAT GW with gateway endpoints (S3/DynamoDB) and interface endpoints (other AWS services), eliminating most or all NAT processing fees.

Trap 3: Cross-AZ traffic through NAT GW is single-charged

Wrong. Cross-AZ to NAT GW + NAT processing = double-charged. Per-AZ NAT GW deployment is the fix.

Trap 4: Transit Gateway is always more expensive than peering

Wrong. For more than ~5 VPCs, TGW is operationally and often economically better. Mesh peering at scale has hidden operational costs.

Trap 5: CloudFront only helps for static content

Wrong. CloudFront also reduces egress cost for dynamic content via lower per-GB pricing and TLS termination offload, even when cache hit ratio is low.

Trap 6: Interface endpoints are always cheaper than NAT

Wrong. Interface endpoints have per-AZ-ENI hourly fees. For low-volume AWS-service usage, NAT is cheaper. The break-even is roughly 160–210 GB/month per service per AZ, depending on whether the endpoint's own per-GB fee is counted.

Trap 7: AvailableIpAddressCount alarms automatically

Wrong. CloudWatch does not natively expose IP availability per subnet; you must publish it as a custom metric via Lambda/EventBridge or rely on Auto Scaling failure as the late indicator.

Trap 8: Adding secondary CIDR resizes existing subnets

Wrong. Secondary CIDR creates new IP ranges; existing subnets are untouched. New workloads must be placed in new subnets created in the secondary CIDR.

Trap 9: VPC peering supports transitive routing

Wrong. Hard limit; A-B and B-C does not give A-C. Use TGW for transitive routing.

Trap 10: Route 53 health checks come free with hosted zones

Wrong. Health checks are billed separately (~$0.50/check/month for HTTPS, more for advanced features like latency measurement and string matching).

Trap 11: Inter-region traffic via Direct Connect is free

Wrong. Direct Connect data transfer is charged (lower than internet egress, but not free). Cross-region traffic on Direct Connect Gateway across AWS regions has its own pricing.

Trap 12: Multiple NAT Gateways always cost more than one large one

Wrong. AWS NAT GW does not have a "size" — bandwidth scales automatically up to 100 Gbps. Multiple NAT GWs are NOT more expensive in per-GB processing; you pay the same per-GB regardless of how many gateways. The hourly fee multiplies, but cross-AZ savings often exceed it.

Decision Matrix — Cost Construct for Each ANS-C01 Goal

Cost optimisation goal → primary construct (with notes):

  • Reduce NAT bill for S3-heavy workload → S3 gateway endpoint. Free; eliminates NAT processing for S3.
  • Reduce NAT bill for DynamoDB-heavy workload → DynamoDB gateway endpoint. Free.
  • Reduce NAT bill for AWS-service traffic at high volume → Interface VPC endpoints. Per-AZ-ENI fee; cheaper at high volume.
  • Eliminate cross-AZ NAT trap → Per-AZ NAT GW with local routing. Each AZ's private subnets route to its own NAT.
  • Reduce internet egress for global users → CloudFront with OAC to S3. Lower per-GB egress; cache hits incur zero origin egress.
  • Reduce internet egress for non-HTTP UDP/TCP → Global Accelerator. Anycast, lower latency, no caching.
  • Connect 2–4 VPCs in the same region → VPC peering. Free connection, no per-hour fee.
  • Connect 10+ VPCs across accounts → Transit Gateway with RAM share. Centralised, transitive routing.
  • Reduce inter-region replication cost → AWS-internal paths (peering or TGW peering). Cheaper than internet egress; bulk via DataSync.
  • Detect cost spike before the bill arrives → CloudWatch + Cost Anomaly Detection. Daily cost granularity.
  • Plan IP space for scale → Multi-CIDR VPC with reserved blocks. Up to 5 CIDRs per VPC.
  • Solve IP exhaustion in an auto scaling subnet → Add secondary CIDR + new subnets. Existing subnets unchanged.
  • Migrate traffic gradually to new infra → Route 53 weighted records. 5% → 25% → 50% → 100% with metric monitoring.
  • Active-passive DR routing → Route 53 failover with health checks. Primary + secondary records.
  • Geo-optimise a multi-region application → Route 53 latency-based routing. Lowest-latency region per user.

ANS-C01 exam priority — Network Monitoring, Troubleshooting, and Cost Optimization. This topic carries weight on the ANS-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes — the exam will test scenarios that hinge on knowing which service is the wrong answer, not just which is right.

FAQ — Network Monitoring, Troubleshooting, and Cost Optimization

Q1: My NAT Gateway bill doubled this month — what is the diagnostic procedure?

Step-by-step: (1) Identify the spike — open Cost Explorer, filter by "Service: EC2-Other", "Usage Type: NatGateway-Bytes" and "NatGateway-Hours". The Bytes line is the per-GB processing fee — that is what doubled. (2) Find the source — query VPC Flow Logs at the NAT GW ENI for the past 30 days, group by pkt-srcaddr (the original instance IP before NAT), order by total bytes. The top consumers are your culprits. (3) Categorise destinations — check dstaddr of the top flows. If the destinations are AWS service IP ranges (S3, DynamoDB, etc.), the answer is add a gateway endpoint or interface endpoint. If destinations are public internet (third-party APIs, package mirrors, container registries), examine whether the volume is legitimate or runaway (a misbehaving cron job hammering an API). (4) Check for cross-AZ amplification — if private subnets in AZ-2 route to a NAT GW in AZ-1, every byte is paying both the cross-AZ fee and the NAT processing fee. Deploy per-AZ NAT GW with local routing. (5) Validate the fix — after deploying endpoints or per-AZ NAT, monitor BytesOutToDestination on each NAT GW for a few days; expect a 50–80% reduction for typical AWS-heavy workloads.
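Step (2) is commonly run as an Athena query over the flow-logs table. A hedged sketch; the table name and the pkt_srcaddr/dstaddr/bytes column names assume an Athena table over VPC Flow Logs with those fields enabled, and will vary with your log format and partitioning:

```python
def top_nat_talkers_sql(table: str = "vpc_flow_logs", limit: int = 20) -> str:
    """Athena SQL for the top sources behind a NAT Gateway by total bytes.

    pkt_srcaddr is the original source before NAT translation; grouping
    by (source, destination) pairs surfaces both the culprit instances
    and where their traffic is going.
    """
    return (
        f"SELECT pkt_srcaddr, dstaddr, SUM(bytes) AS total_bytes "
        f"FROM {table} "
        f"GROUP BY pkt_srcaddr, dstaddr "
        f"ORDER BY total_bytes DESC "
        f"LIMIT {limit}"
    )
```

If the top dstaddr values fall inside S3 or DynamoDB prefix lists, the gateway-endpoint fix in step (3) applies directly.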

Q2: When does Transit Gateway beat VPC peering on cost, and when does peering beat TGW?

For raw same-region inter-VPC traffic, VPC peering is always cheaper because it has no per-hour fee and no per-GB processing fee — you pay only the standard cross-AZ data transfer rate. TGW adds ~$0.05/hr per attachment ($36/month per VPC) plus $0.02/GB processing. So for 2 to 4 VPCs, peering is dramatically cheaper. The break-even shifts at scale: a 10-VPC TGW costs ~$365/month in attachments alone, but the equivalent 10-VPC full mesh is 45 peerings to maintain, route tables to update on every change, and operationally untenable. For organisation-wide multi-account multi-region connectivity, TGW is the only practical option — peering does not scale operationally. The cost-optimal mixed pattern: TGW as the hub for general connectivity, plus direct VPC peering between high-traffic pairs to bypass the TGW per-GB fee for those flows (e.g., a database-VPC-to-analytics-VPC peering carrying TBs/month, while the rest of the org uses TGW).
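The break-even arithmetic can be captured in a small cost model, using the headline rates quoted in this answer (illustrative figures; always check current AWS pricing):

```python
TGW_ATTACH_HR = 0.05   # per attachment-hour (rate quoted above)
TGW_GB = 0.02          # TGW data processing per GB
CROSS_AZ_GB = 0.01     # standard cross-AZ transfer per GB, each direction
HOURS = 730            # hours in an average month

def tgw_monthly(vpcs, gb):
    """Fixed attachment fees plus per-GB processing."""
    return vpcs * TGW_ATTACH_HR * HOURS + gb * TGW_GB

def peering_monthly(vpcs, gb):
    """Peering itself is free; only standard cross-AZ transfer applies."""
    return gb * CROSS_AZ_GB

def full_mesh_peerings(vpcs):
    """Number of peering connections for a full mesh: n*(n-1)/2."""
    return vpcs * (vpcs - 1) // 2

print(full_mesh_peerings(10))     # 45 peerings to maintain
print(round(tgw_monthly(10, 0)))  # ~$365/month in attachment fees alone
```

The model shows why the decision is operational rather than purely financial: peering stays cheaper per byte at any VPC count, but the mesh size grows quadratically.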

Q3: How do I architect Lambda outbound to AWS services without paying NAT fees?

Put Lambda in a VPC, give it private subnets, and use VPC endpoints. For each AWS service the Lambda calls (Secrets Manager, STS, KMS, Systems Manager Parameter Store, S3 via gateway endpoint, DynamoDB via gateway endpoint), provision the corresponding endpoint in the function's VPC. Lambda's outbound calls through endpoints involve no NAT and incur no NAT processing fees; only the per-GB endpoint processing fee applies (zero for gateway endpoints to S3/DynamoDB). For a Lambda that also needs to call non-AWS internet endpoints (a third-party API), add a tightly scoped NAT Gateway — but the endpoint pattern dramatically reduces total NAT volume. ANS-C01 expects this pattern as the answer to any "Lambda in VPC with cost-optimal egress" question. When the Lambda only calls AWS services, there is no NAT at all, just endpoints — the function's subnets are purely private, with no internet route and the lowest possible network cost.
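A rough planning sketch for this pattern, assuming the free gateway endpoints for S3 and DynamoDB and an assumed ~$7.30/month per interface-endpoint ENI (illustrative rate, not an official price):

```python
GATEWAY_SERVICES = {"s3", "dynamodb"}  # free gateway endpoints
INTERFACE_ENI_MONTHLY = 7.3            # assumed ~$0.01/hr per AZ ENI

def endpoint_plan(services, azs):
    """Split required services into free gateway vs paid interface
    endpoints and estimate the fixed monthly endpoint cost."""
    gateway = sorted(s for s in services if s in GATEWAY_SERVICES)
    interface = sorted(s for s in services if s not in GATEWAY_SERVICES)
    fixed_cost = len(interface) * azs * INTERFACE_ENI_MONTHLY
    return gateway, interface, round(fixed_cost, 2)

gw, itf, cost = endpoint_plan(
    {"s3", "dynamodb", "secretsmanager", "kms", "sts"}, azs=3)
print(gw)    # ['dynamodb', 's3'] -> free gateway endpoints
print(itf)   # ['kms', 'secretsmanager', 'sts'] -> interface endpoints
print(cost)  # 3 services x 3 AZs x ~$7.3 = 65.7
```

Comparing that fixed cost against the NAT per-GB fees the endpoints displace gives the break-even volume for the workload.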

Q4: What CloudWatch alarms should I always have on networking infrastructure?

The non-negotiable set: (1) NAT GW ErrorPortAllocation > 0 — SNAT port exhaustion is the most common NAT failure mode and impacts production immediately. (2) NAT GW BytesOutToDestination week-over-week comparison alarm — early warning on cost spikes before the bill arrives. (3) Direct Connect ConnectionState = 0 — link-down is the highest-severity hybrid event. (4) VPN TunnelState = 0 — tunnel-down on either of the two tunnels of any VPN. (5) TGW PacketDropCountBlackhole > 0 — blackhole drops are intentional by design, but a high rate signals routing misconfiguration or a new spoke conflict. (6) TGW PacketDropCountNoRoute > 0 — packets arriving without a matching route, signalling missing route propagation. (7) ALB HTTPCode_Target_5XX_Count and UnHealthyHostCount — backend health. (8) VPC Flow Logs metric filter on REJECT count from outside the org CIDR — possible scan or attack. (9) Custom metric: subnet IP availability < 20% — pre-emptive IP exhaustion warning; requires a scheduled Lambda (via EventBridge) to publish values from DescribeSubnets.
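As a sketch (not a definitive implementation), the NAT Gateway portion of this set can be expressed as parameter dicts ready for boto3's `put_metric_alarm`; the actual API call is omitted so the example runs offline:

```python
def nat_gw_alarms(nat_gw_id, sns_topic_arn):
    """Build CloudWatch alarm specs for one NAT Gateway.
    Each dict can be passed as cloudwatch.put_metric_alarm(**spec)."""
    base = {
        "Namespace": "AWS/NATGateway",
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gw_id}],
        "Period": 60,
        "EvaluationPeriods": 1,
        "Statistic": "Sum",
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
    return [
        {**base, "AlarmName": f"{nat_gw_id}-port-exhaustion",
         "MetricName": "ErrorPortAllocation", "Threshold": 0},
        {**base, "AlarmName": f"{nat_gw_id}-packet-drops",
         "MetricName": "PacketsDropCount", "Threshold": 0},
    ]

# Hypothetical IDs for illustration
specs = nat_gw_alarms("nat-0abc123",
                      "arn:aws:sns:us-east-1:111122223333:net-alerts")
print([s["MetricName"] for s in specs])
```

The same template extends to the DX, VPN, and TGW metrics in the list by swapping the namespace, dimension, and metric names.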

Q5: How does the centralised vs distributed VPC endpoint pattern affect cost?

Distributed pattern — each VPC has its own interface endpoints for the AWS services it uses. Cost: per-AZ-ENI-hour for each service in each VPC ≈ $7/month/service/AZ × VPCs × AZs. For 50 VPCs in 3 AZs each using 5 services: 50 × 3 × 5 × $7 = $5,250/month in fixed endpoint fees. Pros: simple, no cross-VPC traffic for endpoint calls. Centralised pattern — one shared services VPC has all the interface endpoints; spoke VPCs reach them via TGW with Route 53 Resolver conditional forwarding for the AWS service DNS names. Cost: 1 × 3 × 5 × $7 = $105/month in fixed endpoint fees, plus TGW per-GB for the cross-VPC traffic. Pros: dramatic reduction in hourly fees at scale; cons: TGW adds latency and per-GB cost, and there is operational complexity in DNS forwarding setup. Break-even: distributed wins below ~10 VPCs; centralised wins above ~20. ANS-C01 expects you to recognise this trade-off, particularly in multi-account org-wide questions.
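The fixed-fee comparison works out as follows, using the ~$7/month/service/AZ figure from above (illustrative rate):

```python
ENI_MONTHLY = 7  # assumed fixed fee per interface-endpoint ENI per month

def distributed_fixed(vpcs, azs, services):
    """Every VPC hosts its own interface endpoints."""
    return vpcs * azs * services * ENI_MONTHLY

def centralised_fixed(azs, services):
    """One shared-services VPC hosts all interface endpoints."""
    return 1 * azs * services * ENI_MONTHLY

print(distributed_fixed(50, 3, 5))  # 5250 $/month across the org
print(centralised_fixed(3, 5))      # 105  $/month in the hub VPC
```

The gap is the budget available for the centralised pattern's TGW per-GB charges and DNS-forwarding complexity before it stops paying off.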

Q6: Is CloudFront ever the wrong answer for egress optimisation?

Yes. CloudFront is wrong when: (a) the workload is non-HTTP (UDP, raw TCP) — CloudFront only handles HTTP/HTTPS; the alternative is Global Accelerator with anycast IPs. (b) the user base is single-region and the data is mostly uncacheable dynamic content — the latency benefit is small, and CloudFront's per-request fee plus origin egress (un-cached) might exceed direct egress at low volume. (c) compliance or data-locality rules forbid edge caching — some regulated workloads cannot have copies at edge PoPs, requiring origin-only delivery. (d) the egress volume is too low to hit volume tiers — a few GB/day at the standard tier sees only marginal savings. The exam-canonical case for CloudFront is a global user base, cacheable content (static assets, images, video), or any S3 origin where OAC plus CloudFront converts S3 egress charges into CloudFront's lower tiered egress.
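A hedged sketch of this decision boundary, encoding the disqualifiers as simple checks (the 100 GB/month threshold is an arbitrary illustration, not an AWS number):

```python
def cloudfront_fits(protocol, cacheable, edge_caching_allowed, monthly_gb):
    """Return (fits, reason) for the CloudFront-as-egress-lever decision."""
    if protocol not in ("http", "https"):
        return False, "non-HTTP workload: consider Global Accelerator"
    if not edge_caching_allowed:
        return False, "compliance forbids edge copies: origin-only delivery"
    if not cacheable and monthly_gb < 100:  # arbitrary low-volume cutoff
        return False, "low-volume dynamic traffic: direct egress may be cheaper"
    return True, "good fit: cacheable and/or high-volume HTTP delivery"

print(cloudfront_fits("udp", False, True, 5000))
print(cloudfront_fits("https", True, True, 50))
```

A real decision would also weigh request fees and regional price classes; the function only captures the qualitative boundaries from the paragraph above.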

Q7: How do I plan VPC CIDR for organisation-wide scale without hitting IP exhaustion?

The hierarchical approach. (1) Allocate large supernets at the AWS Organisation level — e.g., 10.0.0.0/8 for the entire org, divided into per-account /16s using AWS IPAM. (2) Within each account, plan the VPC CIDR for at least 5x current scale — a /16 (65k IPs) for production, /20 (4k) for development, accounting for EKS pod-IP density if using the VPC CNI. (3) Reserve a secondary CIDR space in each VPC for future expansion — RFC 6598 100.64.0.0/10 is widely used for shared services or NAT-translated EKS pod ranges to avoid running out of RFC 1918 space. (4) Plan subnet sizing — for auto-scaling worker subnets, use /22 (1024 IPs) or /20 (4k) rather than /24 (256), giving ASG headroom. (5) Offload stateless workloads to Lambda where possible — VPC-attached Lambda shares Hyperplane ENIs across function instances, consuming far fewer subnet IPs than one-IP-per-instance models (note that Fargate tasks in awsvpc mode still consume one IP each). (6) Monitor available IPs via a custom CloudWatch metric — a scheduled Lambda queries DescribeSubnets, publishes AvailableIpAddressCount per subnet, and CloudWatch alarms when it drops below 20%. ANS-C01 expects this proactive planning, not a reactive secondary CIDR once the subnets are already exhausted.
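Steps (1) and (4) can be checked mechanically with Python's `ipaddress` module; note that AWS reserves five addresses in every subnet (network address, the first three host IPs, and broadcast):

```python
import ipaddress

# Step (1): org-level /8 divided into per-account /16 blocks
org = ipaddress.ip_network("10.0.0.0/8")
account_blocks = list(org.subnets(new_prefix=16))
print(len(account_blocks))        # 256 per-account /16s available
print(account_blocks[0])          # 10.0.0.0/16

# Step (4): usable IPs per subnet after AWS's 5 reserved addresses
def usable_ips(cidr):
    return ipaddress.ip_network(cidr).num_addresses - 5

print(usable_ips("10.0.0.0/22"))  # 1019 usable in a /22 worker subnet
print(usable_ips("10.0.0.0/24"))  # 251 -> tight for a large ASG
```

In practice AWS IPAM automates the top-level carving; the sketch just shows the arithmetic the plan rests on.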

Q8: Why does the NAT cross-AZ trap charge me so much?

Because every byte that crosses AZs through a NAT Gateway pays two extra cross-AZ charges on top of the NAT processing and internet egress fees. The flow: an instance in AZ-2 sends a packet to a NAT GW in AZ-1. (1) Cross-AZ data transfer outbound from AZ-2 to AZ-1: $0.01/GB. (2) NAT GW data processing: $0.045/GB. (3) Internet egress: $0.09/GB. Plus on the response: (4) Internet ingress: free. (5) NAT GW processes the response: same $0.045/GB. (6) Cross-AZ data transfer back AZ-1 to AZ-2: $0.01/GB. Total: $0.20/GB round-trip. With per-AZ NAT (a NAT GW in AZ-2 used by AZ-2 instances), the cross-AZ steps disappear: just the NAT processing twice ($0.09/GB) plus internet egress ($0.09/GB) = $0.18/GB round-trip. Saving: $0.02/GB. At 1 TB/month, that is $20 saved per TB — at 100 TB/month, $2,000 saved. The savings compound dramatically as workloads grow, which is why ANS-C01 emphasises per-AZ NAT deployment as the canonical answer.
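The round-trip arithmetic, reproduced as code with the illustrative per-GB rates from this answer:

```python
CROSS_AZ = 0.01   # $/GB cross-AZ transfer, each direction
NAT_PROC = 0.045  # $/GB processed by the NAT Gateway
EGRESS   = 0.09   # $/GB internet egress

# Instance in AZ-2, NAT GW in AZ-1: cross-AZ both ways + NAT both ways + egress
cross_az_nat = CROSS_AZ + NAT_PROC + EGRESS + NAT_PROC + CROSS_AZ
# Per-AZ NAT: the cross-AZ hops disappear
local_nat = NAT_PROC + EGRESS + NAT_PROC

print(round(cross_az_nat, 3))  # 0.2  $/GB round-trip
print(round(local_nat, 3))     # 0.18 $/GB round-trip
print(round((cross_az_nat - local_nat) * 1000, 2))  # 20.0 $ saved per TB
```

The per-GB delta looks small; the monthly multiplier is what makes it an exam-worthy architectural decision.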

Q9: What are the right operational metrics for a multi-AZ NAT Gateway architecture?

Per-NAT-GW alarms: ErrorPortAllocation > 0 (SNAT exhaustion immediate fix), IdleTimeoutCount baseline trend (connection-pool sizing), BytesOutToDestination p95 baseline with anomaly detection (cost early-warning), PacketsDropCount > 0 sustained (capacity or backend issue). Cross-NAT comparison: ratio of bytes between AZs should track the AZ-distribution of your workload — if AZ-2 NAT GW is processing 10x what AZ-3 processes despite equal Auto Scaling group sizes, something is wrong (typically a VPC route table misconfiguration sending all traffic through AZ-2). VPC Flow Logs at the NAT GW ENI provide per-instance attribution via pkt-srcaddr. CloudWatch dashboard combining all NAT GW metrics in one view, plus a Cost Explorer monthly export filtered by NAT GW usage type, gives the full picture. ANS-C01 expects you to treat NAT Gateway as a critical infrastructure component, not a passive utility.
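The cross-NAT skew comparison described above can be sketched as a simple ratio test (the 2x threshold is an assumption; tune it to your workload's actual AZ distribution):

```python
def skewed_nat_gws(bytes_by_az, max_ratio=2.0):
    """Flag AZs whose NAT GW processed bytes exceed max_ratio times the
    per-AZ average; expected roughly equal when ASGs are AZ-balanced."""
    expected = sum(bytes_by_az.values()) / len(bytes_by_az)
    return sorted(az for az, b in bytes_by_az.items() if b > max_ratio * expected)

# Illustrative BytesOutToDestination totals per AZ
sample = {"az-1": 1_000, "az-2": 10_000, "az-3": 1_200}
print(skewed_nat_gws(sample))  # ['az-2'] -> check route tables sending all traffic there
```

Fed with GetMetricData output per NAT GW, this check catches the classic misconfiguration of all private route tables pointing at a single AZ's NAT Gateway.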

Q10: How do I cost-optimise a multi-region active-active application?

Four levers, in priority order. (1) Minimise inter-region traffic — design for region-local data and replicas; avoid cross-region database queries when possible. DynamoDB Global Tables and S3 Cross-Region Replication (CRR) are the AWS-managed replication primitives that minimise inter-region byte transfer (replication is delta-only, eventually consistent). (2) Use Route 53 latency-based routing to direct users to the nearest region, shortening the egress path and using local-region resources. CloudFront in front of regional ALBs further offloads egress at the edge tier. (3) Use Direct Connect Gateway for high-volume hybrid traffic that would otherwise traverse the internet at $0.09/GB; DX is dramatically cheaper at scale once the port-hour fee is amortised. For inter-region replication itself, TGW peering between regions or inter-region VPC peering is the AWS-internal path at $0.02/GB. (4) Add caching layers — ElastiCache in-region, CloudFront at the edge — to reduce repeat fetches across regions. ANS-C01 frames this as "design a multi-region architecture that minimises data transfer cost while meeting RTO/RPO" — the answer combines all four levers, weighted toward the specific service the scenario emphasises.
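A quick illustration of the path-choice saving: moving a hypothetical 10 TB/month of cross-region traffic off internet-rate egress onto the AWS-internal inter-region path, using the two rates quoted in this answer (illustrative):

```python
INTERNET_EGRESS = 0.09  # $/GB out to the internet
INTER_REGION    = 0.02  # $/GB over inter-region VPC/TGW peering

def monthly_saving(gb_moved):
    """Saving from re-homing cross-region bytes onto the internal path."""
    return gb_moved * (INTERNET_EGRESS - INTER_REGION)

print(round(monthly_saving(10_000), 2))  # 700.0 $/month on 10 TB
```

The same one-liner, with DX rates substituted, sizes the hybrid-connectivity case for lever (3).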

After understanding cost optimisation, the natural next ANS-C01 operational layers are: VPC Flow Logs and Reachability Analyzer for diagnosing cost spikes via traffic attribution; network performance with ENA/EFA/jumbo frames for throughput per dollar; hybrid connectivity maintenance including BGP route limits, Direct Connect failover, and PrivateLink maintenance; and Transit Gateway routing and attachments for the central fabric that ties it all together.
