
Network Performance — ENA, EFA, Placement Groups, and Jumbo Frames


ANS-C01 Domain 3.3 deep dive into EC2 network performance — ENA enhanced networking, EFA OS-bypass with SRD and libfabric, cluster/partition/spread placement groups, MTU 9001 jumbo frames inside VPC vs 1500 across IGW/VPN, PMTUD black holes, single-flow vs multi-flow bandwidth, and Nitro instance bandwidth tiers.


The AWS Certified Advanced Networking — Specialty exam (ANS-C01) Domain 3 task 3.3 demands fluency with EC2 network performance internals that no other AWS certification expects. You have to know which instance types support which network interface generations, when an HPC workload needs EFA rather than ENA, why the canonical 9001-byte jumbo frame works inside a VPC but breaks at the Internet Gateway, what a placement group actually constrains, and the many practical limits — single-flow caps, baseline-versus-burst bandwidth, PMTUD black holes, and CPU-pinning gotchas — that explain why measured throughput is often half what the marketing slides promised. This deep dive walks the entire stack with a focus on the trap-rich exam scenarios that the Specialty writes most frequently.

We cover the four EC2 network interface generations (vNIC, ENI, ENA, EFA), the OS-bypass model EFA uses with the SRD transport and libfabric library, the three placement-group flavours (cluster, partition, spread) and what each isolates, the precise jumbo-frame story across VPC peering, IGW, VPN, and Direct Connect, the path-MTU-discovery (PMTUD) black hole that every networking engineer eventually hits, and the single-flow-versus-multi-flow bandwidth distinction the exam loves. Throughout, the focus is on the configuration choices an exam scenario can test and the rationale for why a given choice is right while three plausible-looking alternatives are not.

Why Network Performance Is a First-Class ANS-C01 Topic

ANS-C01 task 3.3 reads "Optimize AWS networks for performance, reliability, and cost-effectiveness", and its skill bullets explicitly call out "Selecting the right network interface for the best performance (for example, elastic network interface, Elastic Network Adapter [ENA], Elastic Fabric Adapter [EFA])" and "Configuring jumbo frame support across connection types". These two skill bullets alone produce three to five exam questions — and they are not abstract questions. The exam will write you a 200-word scenario describing a numerical-simulation cluster needing tight all-reduce performance across 64 nodes, ask which network configuration to use, and offer four answers that look superficially similar: ENA on c5n.18xlarge with cluster placement group; EFA on c5n.18xlarge with cluster placement group; EFA on m5.large with cluster placement group; ENA on hpc6a.48xlarge with partition placement group. Three of those answers are wrong for specific, knowable reasons.

The other half of task 3.3's exam surface is jumbo frames. The 9001 MTU is the default inside a VPC, but the moment a packet crosses an Internet Gateway, a Site-to-Site VPN, or an unconfigured Direct Connect, the path MTU drops to 1500 (or 1438 on VPN) — and if the application or kernel does not handle PMTUD correctly, large packets are silently dropped. ANS-C01 frames this as "TCP throughput is unexpectedly low between two specific endpoints; what is the most likely cause?" with PMTUD failure as the right answer hidden among MSS clamping, SACK disabled, and jumbo frames not enabled distractors. Knowing which boundary preserves jumbo frames and which truncates is a memorisation task with high yield.

Plain-Language Explanation: EC2 Network Performance — ENA, EFA, Jumbo Frames

The performance stack is layered: NIC hardware (ENA vs EFA) → physical-placement constraints (placement groups) → frame-size choices (MTU 1500 vs 9001) → throughput shape (single-flow vs multi-flow). Three analogies anchor the geometry.

Analogy 1: The Logistics Warehouse

Think of an EC2 instance as a distribution warehouse. The ENI is the regular loading dock with standard-size truck bays (1500-byte frames over a 1 Gbps pipe). ENA is the upgraded multi-bay loading dock at the same warehouse — same trucks, but more bays in parallel, on a wider road (10/25/100 Gbps), with electronic dispatch (SR-IOV) so trucks don't queue. EFA is the dedicated freight rail terminal built next to the warehouse — bypassing the public road entirely (OS-bypass), running on its own track (the SRD protocol), and only useful when you have specific freight types (HPC MPI/NCCL) that benefit from rail. Jumbo frames are double-length 18-wheel trucks instead of standard 12-foot ones — six times the cargo per trip, but only legal on the highways inside the AWS VPC; the moment they hit the public road network (IGW, VPN, internet), they must be broken down into standard trucks at the loading dock (fragmentation) or refused (PMTUD black hole). A cluster placement group is the rule that all your warehouses must be on the same campus (same physical rack, same network spine), so freight rail and high-speed dock-to-dock transfers are possible. A spread placement group is the opposite rule that no two warehouses share the same campus — guaranteeing a fire at one campus does not destroy two warehouses simultaneously.

Analogy 2: The Operating Room

A hospital's operating room network is the EC2 network stack in miniature. ENI is the standard hospital phone line to other departments. ENA is the upgraded fibre-optic intercom with multiple parallel channels and enhanced clarity. EFA is the direct surgeon-to-surgeon hotline that bypasses the switchboard entirely, used only when two surgeons are operating in coordination on the same patient (HPC tightly-coupled jobs). SRD is the specific protocol the hotline uses — out-of-order delivery handled at the application layer for lower latency than ordered TCP. Jumbo frames are bulk equipment shipments between adjacent ORs in the same hospital wing — efficient when the corridor is wide (intra-VPC), impossible when packages must go through the front-door security checkpoint to a different hospital (IGW boundary). Placement groups are scheduling rules: cluster = same wing, same floor (low latency), spread = different buildings (fault isolation), partition = grouped by section (rack-level isolation).

Analogy 3: The Race Track

An EC2 instance is a race car. ENI is the stock production engine. ENA is the factory-tuned high-performance engine with multiple intake valves (SR-IOV queues) and direct injection (Nitro card offload). EFA is the Formula 1 racing engine — purpose-built for HPC race conditions, requiring specialised fuel (libfabric runtime), only legal on dedicated track (cluster placement group), and useless for daily driving. Cluster placement group is the rule that all cars start in the same pit lane so they have the lowest possible inter-car latency. Partition placement group is the rule that no two cars share the same fuel reservoir — a fuel-line accident affects only one car. Spread placement group is the rule that cars are placed in different garages on different days, the strongest fault-isolation guarantee but with no latency benefit. Jumbo frames are wider tyres giving more grip per lap inside the track (intra-VPC), but the cars cannot leave the track wearing them (cross-IGW, MTU truncation).

For HPC and tightly-coupled distributed training questions, the race track + Formula 1 engine mental model captures the EFA-specific picture cleanly. For multi-region replication and large object transfers, the logistics warehouse + jumbo trucks model maps to MTU 9001 vs 1500 trade-offs. For HA and fault tolerance scenarios, the race track placement group model (same pit lane vs different garages) captures cluster vs spread. Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking-ena.html

EC2 Network Interface Generations: ENI, ENA, EFA

Four generations of network interfaces have shipped on EC2; the exam expects you to recognise the differences and pick correctly.

vNIC (legacy paravirtualised)

The original Xen-based virtual NIC. Maximum throughput around 1 Gbps per instance. Found on m1, m2, c1, t1, m3 (some), c3 (some), r3 (some). No exam question writes vNIC as a correct answer in 2026 — these instance families are deprecated.

ENI (Elastic Network Interface)

The standard virtual network interface in the VPC abstraction. An ENI has a primary IP, optional secondary IPs, a MAC address, security groups, source/dest check setting, and can be detached and reattached to another instance. ENI is the abstraction — every running EC2 instance has at least one ENI. The performance characteristics depend on the instance's underlying hardware.

ENA (Elastic Network Adapter)

ENA is the enhanced networking driver that runs on Nitro and pre-Nitro instances supporting SR-IOV (single-root I/O virtualization). ENA delivers higher PPS (packets per second), lower jitter, and lower latency than vNIC. It is the default on all modern instance types: c5/c5n/c6i/c6in/c7i/c7g, m5/m5n/m6i/m7i/m7g, r5/r5n/r6i/r7i, p3/p4/p5, hpc6a/hpc7a, and many more. ENA bandwidth scales with instance size: a c5.large gets up to 10 Gbps burst (with much lower baseline), a c5n.18xlarge gets 100 Gbps, a c6in.metal gets 200 Gbps, and a p5.48xlarge gets 3200 Gbps to GPU NICs.

The Linux ENA driver is bundled in modern kernels (Amazon Linux, Ubuntu 18.04+, RHEL 7.5+); for older OSes you may need to build the driver from the amzn-drivers repository on GitHub. Verify via ethtool -i eth0, looking for driver: ena.
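
A quick way to confirm ENA is active — a minimal sketch assuming a Linux instance with the AWS CLI configured (the instance ID is a placeholder):

    # Driver check on the instance itself
    ethtool -i eth0           # expect "driver: ena"
    modinfo ena | head -n 3   # version of the in-kernel module

    # API-side check: the instance's enaSupport attribute
    aws ec2 describe-instance-attribute \
        --instance-id i-0123456789abcdef0 \
        --attribute enaSupport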

EFA (Elastic Fabric Adapter)

EFA is a superset of ENA that adds an OS-bypass transport mode for HPC. EFA appears on the instance as both an ena interface (for normal TCP/IP) and an efa interface (for bypass-mode transport using AWS's Scalable Reliable Datagram (SRD) protocol). HPC libraries — libfabric, MPI implementations (Intel MPI, Open MPI, MPICH, MVAPICH), NCCL (NVIDIA Collective Communications Library) — link against libfabric and use the EFA bypass path for inter-node MPI/collective communications, achieving lower latency and higher all-reduce throughput than TCP can deliver.

EFA is not a drop-in replacement for ENA — applications must explicitly use libfabric or a compatible runtime. Standard TCP/IP applications get no benefit from EFA being attached; they continue to use the ENA interface. EFA is enabled by attaching an EFA-capable ENI at instance launch, supported only on specific instance types: c5n.18xlarge/9xlarge, c6in.32xlarge/16xlarge, m5n/m5dn (some sizes), r5n (some), p3dn.24xlarge, p4d.24xlarge, p5.48xlarge, hpc6a.48xlarge, hpc7a, hpc7g, and a small number of others. Check the official EFA-supported instance list before recommending EFA on the exam.
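
One way to attach an EFA-capable ENI at launch and confirm the bypass path is visible — a hedged sketch assuming an EFA-supported instance type, a cluster placement group that already exists, and placeholder AMI/subnet/security-group IDs:

    # Launch with InterfaceType=efa into an existing cluster placement group
    aws ec2 run-instances \
        --image-id ami-0123456789abcdef0 \
        --instance-type hpc6a.48xlarge \
        --count 8 \
        --placement GroupName=hpc-cluster-pg \
        --network-interfaces "DeviceIndex=0,InterfaceType=efa,SubnetId=subnet-0abc,Groups=sg-0abc"

    # On a node: the EFA provider should appear to libfabric
    fi_info -p efa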

SRD (Scalable Reliable Datagram)

SRD is AWS's bespoke transport protocol underneath EFA's bypass path. Compared to TCP, SRD: (a) supports out-of-order delivery with reordering at the application/library layer, eliminating head-of-line blocking; (b) uses multi-path routing across many AWS-internal paths simultaneously, hashing flows differently to spread load; (c) has lower latency than TCP for tightly-coupled HPC workloads. SRD is invisible to applications — they see the libfabric API; SRD is the wire format underneath. Use cases include HPC MPI all-reduce, distributed deep learning gradient synchronisation, and certain real-time simulation protocols.

A frequent ANS-C01 distractor: "we attached EFA to all 64 nodes but throughput did not improve". The answer is that the application is using standard TCP — it sees the ENA interface, not the EFA bypass path. EFA gives a benefit only when the application is built against libfabric (or a runtime that uses libfabric, like Intel MPI, Open MPI, NCCL). For pure TCP workloads — web servers, databases, S3 transfers — EFA provides no benefit; the answer is to optimise on ENA bandwidth, multi-flow parallelism, and instance sizing. Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html

Placement Groups: Cluster, Partition, Spread

A placement group is a logical constraint on the physical placement of EC2 instances within an AZ. There are three types, each optimising a different goal.

Cluster placement group

All instances in a cluster placement group are packed onto the same network spine — often the same rack — within a single AZ. Result: the lowest possible inter-instance latency (round-trips measured in tens of microseconds) and the highest possible inter-instance bandwidth (full instance NIC bandwidth, up to 100+ Gbps on c6in/c7i, 3200 Gbps on p5). Required for HPC and EFA workloads — EFA's tight-coupling benefits only manifest when nodes are on the same spine.

Limitations: capacity is rack-dependent, so launching a large cluster placement group can hit "InsufficientInstanceCapacity" errors; the recommendation is to launch all instances in a single API call with the same AMI and instance type, and to avoid stop/start, which can re-place instances onto a different rack.
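
A minimal sketch of the single-call launch pattern (group name and AMI ID are placeholders):

    # Create the cluster placement group once
    aws ec2 create-placement-group \
        --group-name sim-cluster --strategy cluster

    # Launch every node in ONE RunInstances call so EC2 can pack the rack
    aws ec2 run-instances \
        --image-id ami-0123456789abcdef0 \
        --instance-type c6in.32xlarge \
        --count 64 \
        --placement GroupName=sim-cluster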

Partition placement group

Instances are divided into logical partitions (up to 7 per AZ, configurable), each on separate underlying hardware (different racks, different network spines). All instances within a partition share rack-level fate (a rack failure takes them all down); instances in different partitions are isolated. Used for distributed databases (Cassandra, HDFS, Kafka) where data is replicated across partitions and the application can tolerate single-partition failure. The application controls partition assignment via the PartitionNumber launch parameter.
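
A sketch of partition-aware placement (group name, AMI, and instance type are placeholders):

    # Seven partitions, each on separate racks
    aws ec2 create-placement-group \
        --group-name cassandra-pg --strategy partition --partition-count 7

    # Pin these replicas to partition 3 explicitly
    aws ec2 run-instances \
        --image-id ami-0123456789abcdef0 \
        --instance-type i4i.4xlarge \
        --count 3 \
        --placement "GroupName=cassandra-pg,PartitionNumber=3"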

Spread placement group

Instances are placed on distinct underlying hardware with no two sharing the same rack. Maximum 7 instances per AZ. Used for HA-critical workloads where every instance must be on its own rack (small clusters, control plane components). The strongest fault-isolation guarantee, but the smallest scale and no latency benefit.

Placement group + AZ behaviour

Placement groups are AZ-scoped — a cluster placement group is single-AZ by definition. For multi-AZ HPC, you launch one cluster placement group per AZ; cross-AZ traffic uses normal VPC routing (with the cross-AZ latency penalty of ~1 ms instead of tens of microseconds). For multi-AZ HA without a latency requirement, use spread placement groups in each AZ.

  • ENI: VPC abstraction for a virtual network interface (every instance has at least one).
  • ENA: enhanced-networking driver using SR-IOV; modern default.
  • EFA: superset of ENA with OS-bypass via libfabric and SRD; HPC-specific.
  • SR-IOV: single-root I/O virtualization, hardware-level NIC sharing.
  • SRD: AWS Scalable Reliable Datagram, EFA's wire transport.
  • libfabric: open-source HPC fabric API; required to use EFA bypass path.
  • MPI: Message Passing Interface; HPC parallel-programming runtime.
  • NCCL: NVIDIA Collective Communications Library; GPU all-reduce optimised.
  • Cluster PG: same-rack/spine, lowest latency, EFA-required.
  • Partition PG: separate racks per partition, up to 7, distributed-DB pattern.
  • Spread PG: every instance on its own rack, up to 7 per AZ, HA-critical.
  • Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html

Jumbo Frames: MTU 9001 and Where It Survives

A jumbo frame is an Ethernet frame larger than the standard 1500-byte MTU. AWS supports jumbo frames at 9001 bytes (MTU 9001) inside a VPC, but only inside specific path types. Crossing the wrong boundary causes packets to be either fragmented (slow, sometimes blocked by intermediate firewalls) or silently dropped (PMTUD black hole).
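
Checking and setting the interface MTU, and probing what a path actually supports — a minimal Linux sketch (the destination IP is a placeholder):

    # Current MTU on the primary interface (9001 is the default on most AMIs)
    ip link show eth0 | grep -o 'mtu [0-9]*'

    # Force 9001 (or drop to 1500 for internet-facing flows)
    sudo ip link set dev eth0 mtu 9001

    # Discover the real end-to-end path MTU
    tracepath -n 10.1.2.3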

Where MTU 9001 works

  • Within the same VPC — instances on the same subnet, or routed within the VPC via VPC route tables, support 9001-byte frames natively.
  • VPC peering connections within the same region — support 9001-byte frames end to end. (Inter-region peering does not; see below.)
  • Transit Gateway VPC attachments — carry jumbo frames between attached VPCs, but cap at 8500 bytes rather than the full 9001.
  • Direct Connect private VIFs — support 9001 if jumbo frames are enabled on the VIF (configurable; both ends must agree).
  • Cluster placement group flows — 9001 is the canonical configuration.

Where MTU 9001 does NOT work

  • Internet Gateway (IGW) — frames egressing or ingressing through an IGW are limited to 1500 bytes. Larger frames are dropped (no fragmentation by the IGW).
  • Site-to-Site VPN — IPsec tunnel encapsulation overhead means the effective inner MTU is 1438 bytes (or so, varying by cipher); larger frames must be fragmented or dropped.
  • NAT Gateway — frames through a NAT GW are limited to 1500 bytes for egress to the internet.
  • VPN over Direct Connect (VPN over Public/Transit VIF) — IPsec overhead drops effective MTU to 1438 even on a jumbo-frame-enabled DX connection.
  • Inter-region VPC peering — supports up to 1500 bytes only for traffic crossing the inter-region boundary (intra-region traffic is 9001).

PMTUD and the black hole problem

Path MTU Discovery (PMTUD) is the mechanism by which TCP endpoints discover the smallest MTU on the path between them. When a router cannot forward a packet because the next-hop MTU is smaller and the IPv4 "Don't Fragment" (DF) bit is set, the router sends back an ICMP Type 3 Code 4 ("Fragmentation needed and DF set") response telling the sender to reduce the packet size. The sender's TCP stack reduces its TCP MSS accordingly and retransmits.

The PMTUD black hole occurs when ICMP Type 3 Code 4 messages are silently dropped — by an overly-restrictive firewall, a security group that does not permit ICMP, or a customer-network firewall blocking ICMP. The sender continues to send 9001-byte packets, the path drops them, no ICMP feedback returns, and TCP retransmits the same too-large packet repeatedly until the connection times out. From the application's perspective, small packets (control messages) flow fine; large packets (data, file transfers) hang.

The fix: TCP MSS clamping at the egress device, which intercepts TCP SYN packets and rewrites the MSS option to a known-safe value (e.g., 1380 bytes for VPN, accounting for IPsec overhead). Most modern customer-gateway devices support MSS clamping. AWS Site-to-Site VPN performs MSS clamping automatically for the IPsec tunnel; on Direct Connect with jumbo frames, the customer router must clamp MSS for any flow heading to non-jumbo paths.
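
What MSS clamping looks like on a Linux egress/router instance — a sketch, assuming 1380 bytes is a safe MSS for the IPsec path in question:

    # Rewrite the MSS option on forwarded SYNs to a fixed safe value
    sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
        -j TCPMSS --set-mss 1380

    # Alternative: derive the MSS from the discovered path MTU
    sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
        -j TCPMSS --clamp-mss-to-pmtu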

The most-tested ANS-C01 jumbo-frame trap: a scenario describes a high-throughput TCP flow between an EC2 instance with jumbo frames enabled and an external SaaS endpoint over the internet, with throughput far below expected. The cause is that the IGW drops 9001-byte frames, PMTUD ICMP is being filtered somewhere along the path, and the application hangs in a black hole. The fix is to disable jumbo frames on the egress interface, configure MSS clamping, or accept 1500 MTU for any flow that traverses an IGW. Inside the VPC, jumbo flows fine; the moment a packet crosses an IGW, jumbo dies. Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html
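
A quick black-hole diagnosis from the sending instance — a sketch using DF-bit pings against a placeholder destination:

    # 1472 = 1500 - 28 bytes of IP+ICMP headers; should pass on any 1500 path
    ping -M do -s 1472 -c 3 203.0.113.10

    # 8973 = 9001 - 28; fails for any flow that leaves the VPC via an IGW
    ping -M do -s 8973 -c 3 203.0.113.10

If the large probe reports "Frag needed" errors, PMTUD feedback is arriving and working; if it times out silently, the ICMP is being filtered — the black hole.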

Single-Flow vs Multi-Flow Bandwidth

A subtle exam-trap area: instance bandwidth is advertised as an aggregate (e.g., "100 Gbps on c5n.18xlarge"), but a single TCP flow is capped at a much smaller value due to per-flow flow-table entries in the underlying Nitro hypervisor and the inherent serialisation of a TCP connection's send/ack window.

Single-flow caps

  • Within the same AZ, same cluster placement group: typically 10 Gbps per single TCP flow on standard ENA (versus ~5 Gbps without a placement group). ENA Express, which carries TCP over SRD on supported instance types, raises the single-flow ceiling to 25 Gbps.
  • Across AZs (within region): typically 5 Gbps per single TCP flow on standard ENA.
  • EFA with SRD can multipath a single application flow across many internal paths, effectively bypassing the single-flow TCP cap — this is part of why EFA matters for HPC.

Multi-flow scaling

To use full instance bandwidth, the application must open multiple parallel flows — for example, 10 parallel TCP connections, each carrying a portion of the data. Tools like iperf3 with -P 10 validate this. S3 multipart upload uses parallel HTTP connections by default. Database replication, on the other hand, often runs on a single connection and hits the single-flow cap.
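
The standard validation, sketched with iperf3 (the server IP is a placeholder):

    # On the receiver
    iperf3 -s

    # On the sender: one flow first (expect the per-flow cap) ...
    iperf3 -c 10.0.1.20 -t 30

    # ... then 10 parallel flows (expect to approach instance bandwidth)
    iperf3 -c 10.0.1.20 -t 30 -P 10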

Implications for design

  • Cluster placement group + EFA + libfabric for HPC all-reduce — bypasses TCP entirely.
  • Multiple parallel TCP connections for bulk data transfer (S3, replication, EBS snapshot copy).
  • Choose larger instance types so baseline bandwidth never throttles a flow below the per-flow cap — a c6i.large's baseline sits well under 5 Gbps, while a c6in.32xlarge sustains every flow at the cap.
  • Avoid cross-AZ for bandwidth-critical paths — single-flow drops to 5 Gbps cross-AZ.

Direct Connect and Frame Size

A Direct Connect dedicated connection supports both 1500 and 9001 MTU on private VIFs (transit VIFs top out at 8500; public VIFs are 1500-only). The setting is per-VIF and must match on both the AWS and on-premises router ends. Jumbo frames over DX dramatically improve throughput for bulk transfers — for example, EBS snapshot copy from on-prem to S3, large-object uploads, and database replication over private networks.

To enable: at VIF creation, set the MTU to 9001 (default is 1500); on the on-premises router, configure the L2 link MTU to 9001 (or higher, accounting for VLAN tag overhead). All intermediate switches in the on-premises path must also support jumbo frames — a single 1500-MTU switch in the path causes the same PMTUD black hole as the IGW boundary.

A Direct Connect Gateway (DXGW) path to multiple Transit Gateways rides a transit VIF, which supports at most 8500-byte MTU — so flows to TGW-attached VPCs are capped at 8500 even where a private VIF would carry 9001. This subtle limit is one of the deepest jumbo-frame distinctions on ANS-C01.

  • Jumbo MTU: 9001 bytes inside VPC, on same-region VPC peering, and on Direct Connect private VIFs (if enabled); TGW attachments cap at 8500.
  • DXGW-to-TGW path: 8500 MTU max (lower than 9001).
  • IGW MTU: 1500 bytes; jumbo dropped.
  • Site-to-Site VPN effective MTU: 1438 (after IPsec overhead).
  • NAT Gateway MTU: 1500 for outbound to internet.
  • Single TCP flow cap intra-AZ same PG: ~10 Gbps (standard ENA).
  • Single TCP flow cap cross-AZ: ~5 Gbps.
  • EFA + SRD: multi-pathed single application flow; HPC-only benefit.
  • Cluster PG: single-AZ, same rack/spine, lowest latency.
  • Partition PG: up to 7 partitions per AZ, separate racks per partition.
  • Spread PG: max 7 instances per AZ, every instance on its own rack.
  • EFA-supported instances: c5n.18xl, c6in.32xl, m5n/m5dn (some), p3dn.24xl, p4d.24xl, p5.48xl, hpc6a.48xl, hpc7a, hpc7g — verify list before exam.
  • PMTUD ICMP: Type 3 Code 4 ("Fragmentation needed and DF set"); blocked = black hole.
  • MSS clamping: rewrite TCP MSS option in SYN to known-safe value; mitigation for PMTUD black hole.
  • Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html

Performance Patterns: HPC Cluster, Bulk Transfer, Distributed Database

HPC tightly-coupled cluster

Requirements: lowest possible inter-node latency, highest possible all-reduce bandwidth, often hundreds of GB/s aggregate. Solution: instance type from the EFA-supported list (typically hpc6a.48xlarge, hpc7a, c6in.32xlarge, or p5.48xlarge for GPU); attach EFA-capable ENI; place in a cluster placement group in a single AZ; use MPI (Open MPI, Intel MPI) or NCCL (for GPU) compiled against libfabric; ensure 9001 MTU; turn off CPU C-states and configure NUMA-aware placement for predictable latency. The exam-canonical HPC architecture.

Bulk data transfer to S3

Requirements: maximum throughput from EC2 to S3 within the region. Solution: use S3 multipart upload with concurrent parts (the AWS CLI defaults to 10 concurrent requests, configurable via aws configure set default.s3.max_concurrent_requests 100); use Transfer Acceleration for cross-region (CloudFront-based); use an S3 gateway endpoint to avoid NAT GW data-processing fees and stay on the AWS backbone (free); choose an instance type with sufficient bandwidth for the parallel flows (c6in.4xlarge or larger is typical).
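
The CLI tuning mentioned above, as a sketch (bucket name and file path are placeholders):

    # Raise parallelism and chunk size for bulk transfers
    aws configure set default.s3.max_concurrent_requests 100
    aws configure set default.s3.multipart_chunksize 64MB

    # Multipart upload kicks in automatically above the multipart threshold
    aws s3 cp /data/export.tar s3://my-bucket/export.tar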

Distributed database (Cassandra, HDFS, Kafka)

Requirements: replication across racks for fault isolation, predictable network for replication lag. Solution: partition placement group with one partition per Cassandra node ring or HDFS rack; sufficient instance size for the per-flow cap (single-flow replication can be the bottleneck); jumbo frames within the cluster for replication efficiency; CloudWatch monitoring on the per-instance NetworkOut and replication-lag metrics.

Cross-region replication (DR)

Requirements: efficient bulk data transfer across AWS regions with reasonable cost. Solution: inter-region VPC peering (1500 MTU at the boundary, but minimal overhead); use the AWS backbone path (data transfer pricing favourable vs internet); for very large daily transfers, consider AWS DataSync or AWS Snowcone/Snowball for one-time migrations.

A frequent ANS-C01 distractor pattern: separating cluster PG from EFA. The exam will offer "EFA on m5.large with cluster PG" or "ENA on hpc6a with cluster PG" as distractors. The right answer almost always combines: (a) an EFA-supported instance type, (b) EFA-capable ENI attached, (c) cluster placement group, (d) libfabric-using application (MPI/NCCL). Missing any element produces sub-optimal performance and is the wrong answer. Reference: https://aws.amazon.com/hpc/efa/

Common Traps Recap — Domain 3.3

The traps the exam writes most frequently in network performance.

Trap 1: EFA improves all networking performance

Wrong. EFA gives a benefit only when the application uses libfabric (MPI, NCCL). Standard TCP applications continue to use the ENA interface and see no benefit.

Trap 2: Jumbo frames work everywhere in AWS

Wrong. Jumbo (9001) works inside a VPC, across same-region VPC peering, and on Direct Connect private VIFs (if enabled); Transit Gateway attachments carry jumbo frames but cap at 8500. Jumbo does NOT work across IGW (1500), Site-to-Site VPN (~1438 effective), or NAT GW (1500).

Trap 3: A single TCP flow can use full instance bandwidth

Wrong. Single TCP flow caps at ~10 Gbps intra-AZ, ~5 Gbps cross-AZ on standard ENA. Multi-flow parallelism is required for full bandwidth.

Trap 4: Cluster placement group spans multiple AZs

Wrong. Cluster PG is single-AZ by definition. For multi-AZ HPC, use one cluster PG per AZ and accept cross-AZ latency for inter-AZ flows.

Trap 5: Spread placement group has unlimited instances

Wrong. Spread PG is capped at 7 instances per AZ. For larger spreads, use multiple spread PGs or partition PG.

Trap 6: ENA needs special setup on modern AMIs

Wrong. ENA driver is bundled in modern Linux kernels (Amazon Linux, Ubuntu 18.04+, RHEL 7.5+). Only legacy AMIs need explicit driver install.

Trap 7: Enabling jumbo frames automatically improves throughput

Wrong. Without PMTUD working end-to-end and without the application generating large enough writes, jumbo frames may not help and can cause black-hole drops. Test before deploying.

Trap 8: EFA is free / always-on

Wrong. EFA is free as a feature (no extra hourly fee), but it is opt-in at instance launch and only available on specific instance types.

Trap 9: PMTUD always works as long as ICMP is allowed

Wrong. PMTUD requires the specific ICMP Type 3 Code 4 message to traverse the entire return path. Many customer firewalls block ICMP entirely or block specific subtypes, breaking PMTUD silently.

Trap 10: Cluster PG guarantees same physical machine

Wrong. Cluster PG places instances on the same rack and spine, but they run on separate hosts. The latency benefit is real (round-trips in the tens of microseconds), but instances do not share a kernel or hardware.

Trap 11: SRD is exposed to the application as a socket API

Wrong. SRD is the wire transport beneath libfabric. Applications use the libfabric API; SRD is invisible.

Trap 12: Partition placement group provides instance-level isolation

Wrong. Partition PG provides partition-level isolation — all instances in a single partition share rack-level fate. Instance-level isolation requires spread PG.

Decision Matrix — Network Performance Construct for Each ANS-C01 Goal

Performance goal | Primary construct | Notes
Tightly-coupled HPC, microsecond-scale latency | EFA + cluster PG + libfabric MPI/NCCL | All four elements required.
Maximum aggregate throughput, same AZ | Multi-flow parallelism + cluster PG + jumbo | Multi-flow critical; single flow caps at ~10 Gbps.
Distributed DB rack isolation | Partition placement group | Up to 7 partitions per AZ.
Small HA cluster, max isolation | Spread placement group | Max 7 instances per AZ.
Bulk transfer EC2 to S3 in-region | S3 gateway endpoint + multipart upload | Free, no NAT, parallel flows.
Bulk transfer EC2 to S3 cross-region | S3 Transfer Acceleration | CloudFront PoP edge upload.
9001 MTU between two VPCs | Same-region VPC peering | Inter-region peering caps at 1500; TGW at 8500.
9001 MTU on hybrid link | Direct Connect private VIF with jumbo enabled | VIF and on-prem router must agree.
Single TCP flow > 10 Gbps | EFA, bypassing TCP entirely | TCP single-flow caps; SRD multi-paths.
Diagnose lower-than-expected throughput | Test PMTUD + single-flow vs multi-flow + Flow Logs | Likely PMTUD or single-flow cap.
Avoid cross-AZ traffic cost in HPC | Cluster PG, single AZ | Cluster is single-AZ; cross-AZ transfer is billed.
GPU all-reduce | EFA + NCCL on p4d/p5 instances + cluster PG | NCCL uses libfabric.

FAQ — Network Performance, ENA, EFA, Jumbo Frames

Q1: When does EFA actually help, and when is it a distraction?

EFA delivers a measurable performance benefit only for HPC and tightly-coupled distributed-compute workloads — MPI all-reduce, NCCL collective operations on GPU clusters, certain real-time simulation runtimes — where the application explicitly uses libfabric to drive the OS-bypass datapath. For pure TCP applications (web servers, REST APIs, S3 transfers, database replication using TCP, file copies), EFA provides zero benefit because the application never touches libfabric; it sees only the ENA interface and uses standard TCP/IP. ANS-C01 frequently writes scenarios where a candidate is tempted to recommend EFA for "maximum performance" — wrong unless the workload is HPC. The right framing: "do my engineers compile against libfabric, MPI, or NCCL?" — if no, EFA is irrelevant and the answer is ENA on a larger instance with multi-flow parallelism.

Q2: I enabled jumbo frames inside my VPC and now some traffic to the internet hangs — why?

Classic PMTUD black hole. Inside the VPC, your instance sends 9001-byte frames. When a flow traverses the Internet Gateway, the IGW limits MTU to 1500 — frames larger than 1500 with the DF bit set should trigger an ICMP Type 3 Code 4 ("Fragmentation needed and DF set") back to the sender, which then reduces TCP MSS. The black hole forms when that ICMP response is dropped — by your security group not allowing ICMP, by an upstream ISP firewall blocking ICMP, or by intermediate routers along the path. The sender keeps retransmitting 9001-byte packets, none get through, the connection hangs. Fixes: (a) disable jumbo on the egress interface used for internet traffic, (b) configure TCP MSS clamping at the egress (typically 1380 bytes, accounting for IP+TCP overhead), (c) ensure ICMP Type 3 Code 4 is allowed end-to-end. AWS Site-to-Site VPN performs MSS clamping automatically; on direct internet egress, the OS-level fix is iptables -t mangle -A OUTPUT -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu.

Q3: How do I size an HPC cluster for maximum all-reduce performance?

The four-step recipe. (1) Choose the right instance type from the EFA-supported list — hpc7a.96xlarge for general HPC (200 Gbps), c6in.32xlarge for compute with EFA (200 Gbps), p5.48xlarge for GPU all-reduce (3200 Gbps to GPU NICs). (2) Attach EFA-capable ENIs at launch — pass InterfaceType=efa in the launch spec; one EFA ENI per instance is standard. (3) Place in a cluster placement group in a single AZ — launch all instances in a single API call to maximise capacity placement; avoid stop/start which can re-place onto a different rack. (4) Build the application against libfabric — most distributions of Open MPI 4.x+, Intel MPI 2019+, and NCCL 2.10+ support libfabric out of the box; verify with ompi_info | grep libfabric and ensure FI_PROVIDER=efa is set in the environment. Validate with mpirun -np N ./osu_allreduce (Ohio State Microbenchmark) before running production workloads. The exam-canonical answer to "how do I optimise HPC inter-node communication" is exactly this stack.
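
The validation steps from the recipe, sketched (the hostfile, process count, and benchmark path are placeholders):

    # Sanity checks: EFA visible to libfabric, Open MPI built with OFI support
    fi_info -p efa
    ompi_info | grep -i libfabric

    # Force the EFA provider and run the OSU all-reduce microbenchmark
    export FI_PROVIDER=efa
    mpirun -np 64 --hostfile hosts ./osu_allreduce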

Q4: My TCP throughput between two instances in the same VPC is much lower than the advertised instance bandwidth — why?

Three likely causes, in order of frequency. (1) Single-flow cap — a single TCP connection caps at ~10 Gbps intra-AZ (or ~5 Gbps cross-AZ) on standard ENA, regardless of the advertised aggregate bandwidth. Validate with iperf3 -P 10 (10 parallel flows) — if the parallel test shows full bandwidth, the single-flow cap is your bottleneck and the fix is application-level parallelism. (2) Cross-AZ traffic — if the two instances are in different AZs, single-flow caps are halved, and there is unavoidable additional latency (~1 ms instead of tens of microseconds). The fix is to colocate latency-critical workloads in the same AZ with a cluster placement group. (3) CPU bottleneck — at very high packet rates (millions of packets per second), the kernel network stack can saturate a CPU core; the fix is RPS (receive packet steering) configuration, larger instance types with more cores, or moving to EFA where the CPU is bypassed. ANS-C01 expects you to test with multiple flows and across-AZ measurements before concluding "the network is broken".

Q5: Should I use a partition placement group or a spread placement group for my MongoDB cluster?

Depends on cluster size and HA model. Partition PG when you have more than 7 nodes and the application has rack-aware replication — typically Cassandra, HDFS, Kafka, or ScyllaDB which understand partition topology and replicate across partitions. MongoDB replica sets with up to 7 secondaries can use either; for replica sets larger than 7 nodes (uncommon), partition is required. Spread PG for small (≤7 instances per AZ) clusters where every instance must be on its own physical rack — typical for MongoDB primaries plus small numbers of secondaries, etcd quorum members, ZooKeeper ensembles, or other small-N control plane components. The trade-off is scale (spread is capped) versus granularity (spread isolates per-instance, partition isolates per-partition). For multi-AZ MongoDB, you typically combine: spread PG within each AZ for the local replica members, MongoDB replica-set replication across AZs.

Q6: How do I configure jumbo frames on Direct Connect, and what can break?

Three coordination points. (1) AWS side: when creating the Virtual Interface, set the MTU to 9001 on a private VIF or 8500 on a transit VIF (public VIFs are 1500-only; the default everywhere is 1500). For an existing VIF, you can update the MTU but must coordinate with the on-premises end. (2) On-premises router: configure the L2 link MTU to 9001 (or higher to accommodate VLAN tag overhead — typically 9100 on the physical interface). All intermediate switches in the on-premises path between the customer router and the workload VLAN must also support 9001+ MTU. (3) Path consistency: a single switch in the path with a 1500 MTU re-creates a PMTUD black hole. Test with ping -M do -s 8973 from the source (8973 = 9001 minus 28 bytes of IP/ICMP headers); if the ping succeeds at 1472 but fails at 8973, MTU is constrained somewhere. Important DXGW-to-TGW caveat: traffic via Direct Connect Gateway to Transit Gateway tops out at 8500 MTU, not 9001 — a subtle limit ANS-C01 occasionally tests.

Q7: What is the difference between SR-IOV, DPDK, and EFA?

SR-IOV (Single-Root I/O Virtualization) is a hardware-level NIC sharing technology — one physical NIC presents multiple virtual functions (VFs), each appearing as a dedicated NIC to a guest OS. ENA on Nitro uses SR-IOV. DPDK (Data Plane Development Kit) is a userspace networking library that bypasses the kernel network stack for high-PPS workloads — used commonly with NFV (network function virtualization) appliances; supported on EC2 with the ENA driver in DPDK mode. EFA (Elastic Fabric Adapter) is AWS's HPC-specific OS-bypass adapter using SRD as its wire transport, exposing libfabric as the application API. The three are different layers: SR-IOV is hypervisor-level, DPDK is userspace networking-stack-bypass for general NFV, EFA is HPC-specific bypass for MPI/NCCL. ANS-C01 expects you to recognise EFA as the HPC option, not DPDK; DPDK questions are typically phrased around "high-PPS network appliance" rather than HPC.

Q8: My security team wants to disable ICMP entirely — what are the consequences for performance?

Disabling ICMP at any boundary in the path breaks PMTUD, which means jumbo frames inside the VPC cannot safely transit any MTU-narrowing boundary. Specific failure modes: (a) flows from VPC to internet hang silently when packets exceed 1500 bytes; (b) flows over Site-to-Site VPN hang when packets exceed 1438 bytes; (c) flows over Direct Connect with mixed jumbo/non-jumbo paths hang at the narrowing point. Mitigations: (1) MSS clamping at the egress device, which proactively constrains TCP MSS at SYN time so packets never exceed the path MTU; AWS Site-to-Site VPN does this automatically. (2) Disable jumbo frames on interfaces facing non-jumbo paths, accepting 1500 MTU for those flows. (3) Allow the specific ICMP subtypes needed for PMTUD — Type 3 Code 4 ("Fragmentation needed and DF set") and Type 3 Code 3 ("Port unreachable") — without allowing all ICMP. The "default deny ICMP" policy is a security best practice that has unintended performance consequences; design accordingly.

Q9: Why does a cluster placement group launch fail with InsufficientInstanceCapacity, and how do I avoid it?

A cluster placement group packs all instances onto the same network spine, often the same rack. As the cluster grows, AWS may not have enough free capacity on a single rack for a large-instance-type cluster — particularly for niche instance types (hpc6a.48xlarge is a giant 96-core machine; only a few fit per rack). Mitigations: (1) Launch all instances in a single API call with RunInstances and a count parameter — AWS optimises rack allocation when it sees the full request upfront; launching one at a time gives much worse capacity. (2) Use Capacity Reservations ahead of time — reserve N hpc6a.48xlarge in a specific AZ with cluster PG, and the capacity is guaranteed for your launches. (3) Avoid stop/start cycles — when an instance is stopped and restarted, AWS may re-place it on a different rack, breaking the cluster PG guarantee. For HPC workloads, treat instances as immutable; replace rather than restart. (4) For very large clusters (>100 large instances), accept that cluster PG may not fit and either split into multiple cluster PGs (with some inter-PG latency) or use partition PG.

Q10: What is the right placement group strategy for a multi-AZ Kubernetes cluster running mixed workloads?

A typical pattern. (1) Control plane — etcd and the Kubernetes API server replicas use a spread placement group in each AZ to ensure rack-level fault isolation; etcd is sensitive to a single rack failure taking out two replicas. (2) Data plane workers — most pods run on workers without a placement group (default placement); the cluster scheduler handles spread across nodes via topology spread constraints. (3) HPC workloads (HPC operators, ML training) — dedicated worker node groups in cluster placement groups with EFA-capable instances (hpc6a.48xlarge, p5.48xlarge); the cluster autoscaler labels nodes for scheduling HPC pods only here. (4) Stateful workloads (StatefulSet pods using EBS) — typically default placement; for distributed databases (Cassandra, Kafka), partition placement groups by zone with stable node labels. The key insight: placement groups are infrastructure-level constraints, and Kubernetes scheduler placement is workload-level; they compose by mapping node groups to PG types. ANS-C01 may not deep-dive Kubernetes specifics, but the underlying PG patterns are exactly the same.

After understanding network performance internals, the natural next ANS-C01 operational layers are: VPC Flow Logs and Reachability Analyzer for diagnosing throughput problems and verifying intent; network monitoring and cost optimisation for tracking NAT GW expense and TGW data-processing fees; hybrid connectivity maintenance including BGP route limits and Direct Connect failover testing; and Direct Connect VIF types and MACsec for the line-rate encryption story on dedicated cross-connects.
