examlab.net · The most efficient path to the most valuable certifications.

Hybrid Connectivity Maintenance — BGP Limits, Route Management, and PrivateLink Access

5,400 words · ≈ 27 min read

ANS-C01 Domain 3.1 deep dive into hybrid connectivity maintenance — Direct Connect BGP prefix limits (100 received per VIF), VGW vs TGW route table quotas, route summarisation, AWS maintenance windows, BFD failover testing, PrivateLink endpoint health, and the active/passive Direct Connect plus VPN backup pattern.

Do 20 practice questions → Free · No signup · ANS-C01

The AWS Certified Advanced Networking — Specialty exam (ANS-C01) Domain 3 task 3.1, "Maintain routing and connectivity on AWS and hybrid networks", is the operational counterpart to the design-heavy task 1.5. Where 1.5 asks you to architect dual Direct Connect plus VPN backup with BGP attribute manipulation, 3.1 asks you to keep that architecture healthy — diagnosing BGP session flaps, summarising routes when prefix limits are hit, scheduling failover tests, surviving AWS-side maintenance windows, validating PrivateLink endpoint health, and reconciling cross-account RAM-shared resources after they drift. Specialty-level fluency with route limits, BGP timers, BFD, and the canonical maintenance procedures separates ANS-C01 from every other AWS certification.

This topic covers task 3.1 in the depth the exam demands. We walk through the route limits that dictate AWS networking architecture (Direct Connect's 100-prefix-per-VIF cap, VGW's 100-route limit, TGW's 10,000-route limit), the route summarisation strategies for staying within those limits, the BFD and BGP timer tuning for sub-second failover, the AWS-managed maintenance windows for Direct Connect and VGW that operators must understand to avoid surprise outages, the Direct Connect failover testing procedure (intentional BGP shutdown, traffic shift, restoration), the VPN tunnel maintenance including DPD and NAT-T, the PrivateLink endpoint health patterns, and the cross-account resource sharing maintenance procedures. The narrative is grounded in the exam's frequent scenario form: "the BGP session is flapping every 90 seconds — what is the most likely cause?" with three plausible distractors and one specific correct diagnosis.

Why Maintenance Owns Task 3.1 of ANS-C01 Domain 3

Task 3.1's knowledge bullets explicitly call out "industry-standard routing protocols (BGP over Direct Connect)", "limits and quotas", "private and public access methods", and "inter-Regional and intra-Regional communication patterns". The skill bullets push you to manage routing protocols over hybrid links, maintain private access via PrivateLink and peering, use route tables to direct traffic with automatic propagation, and optimize routing over dynamic and static routing protocols (route summarization, CIDR overlap). That last skill is the highest-yield exam territory in 3.1 — almost every Specialty cohort sees a question about "Direct Connect prefix limit exceeded" or "BGP route summarisation".

The exam framing is operational, not architectural: the network was built correctly six months ago, but now something is failing, and you must diagnose and fix without re-architecting. This requires knowing the specific limits (Direct Connect accepts max 100 prefixes from on-premises per VIF), the specific failure modes (BGP session drops when prefix limit is exceeded), the specific tools (CloudWatch on ConnectionState, BGP show commands on the customer router, TGW Network Manager Route Analyzer), and the specific remediation procedures (summarise on-premises announcements, raise via support if hard quota, schedule maintenance window for the change).

ANS-C01 also tests the active/passive Direct Connect with VPN backup pattern in detail — not as a design question (that is 1.5) but as a maintenance question: "during a planned Direct Connect maintenance, how do you ensure traffic flows over the VPN without flapping?". The right answer involves AS-PATH prepending on the VPN BGP session, BFD timing, and validating BGP session state via CloudWatch metrics. Three plausible distractors lurk in that question.

Plain-Language Explanation: Hybrid Connectivity Maintenance — BGP Limits and Route Management

Maintaining a hybrid network is like maintaining a critical bridge between two cities. The bridge has weight limits, lane limits, scheduled inspections, and emergency procedures — the network has prefix limits, route limits, AWS maintenance windows, and BGP failover procedures. Three analogies anchor the concepts.

Analogy 1: The City Bridge Maintenance Schedule

A Direct Connect link is a dedicated bridge between two cities (your data centre and AWS). The bridge has a weight limit on how many trucks per direction (BGP prefix limit — 100 from on-premises to AWS per VIF). If you try to send the 101st truck, traffic stops entirely (BGP session drops) until you reduce. Route summarisation is consolidating cargo into fewer larger trucks — instead of advertising fifteen /24s, advertise one /20 covering the same ground. BGP is the dispatch protocol the bridge uses to announce which destinations are reachable; BFD is the bridge structural health monitor that pings sub-second and reports failure faster than BGP's slow keepalive timers.

AWS maintenance windows are the city's scheduled bridge inspection times — AWS posts upcoming maintenance for Direct Connect endpoints, VGWs, and Transit Gateways via the Personal Health Dashboard; during those windows the bridge may have one lane closed (one tunnel down). Without redundancy (a second bridge, the Site-to-Site VPN backup), you risk traffic stopping during inspection. The active/passive failover is the rule that during normal operations all trucks use the primary bridge (Direct Connect), but if the primary closes, trucks automatically reroute over the secondary bridge (VPN); the routing announcements use AS-PATH prepending on the secondary to make it less attractive — the dispatcher announces "this route is longer, use it only if no other route exists".

Analogy 2: The Power Grid Substation

A Direct Connect link is a high-voltage transmission line between your factory and the regional power utility (AWS). The line has a rated capacity (BGP prefix limit) and the substation has route table slots (VGW or TGW route table size limits). The BGP session is the power-line synchronisation signal — both ends must stay in lock-step or the line trips. BFD is the fault interrupter that responds in milliseconds vs the seconds BGP would take. Route summarisation is transmitting at higher voltage to cover the same distance with fewer conductors — combining many small announcements into a few large ones. AWS maintenance windows are the utility's scheduled substation maintenance, announced in advance via the Personal Health Dashboard; redundant power feeds (dual Direct Connect or DX + VPN) keep the factory operational during maintenance. Active/passive routing is the utility's preferred-feed-vs-backup-feed configuration — primary feed handles normal load, backup feed kicks in within seconds when primary drops.

Analogy 3: The Regional Airline Hub Operations

A Direct Connect is a dedicated commercial flight slot between your home airport and the AWS hub airport. The slot has a maximum number of cargo destinations announced per flight (BGP prefix limit). The BGP session is the flight schedule synchronisation with airport operations. BFD is the rapid weather-warning radio — sub-second failure detection. Route summarisation is consolidating multiple destination tags into one container — instead of announcing each individual neighbourhood your trucks deliver to, announce the metro region. AWS maintenance windows are the runway maintenance schedule, posted on the AWS Personal Health Dashboard; flights during maintenance windows may be diverted to backup airports. The VPN backup is the chartered backup flight on a different carrier — used only when the primary is unavailable; AS-PATH prepending is the flight plan that adds extra waypoints to make this route less preferred than the primary commercial flight.

For 3.1 exam questions about BGP limits, prefer the bridge with weight limit mental model — it captures both the hard cap (100 prefixes) and the consequence of exceeding (session drops). For failover and maintenance windows, the bridge inspection schedule with backup route is the highest-yield analogy. For route summarisation, the bigger trucks carrying more cargo per trip image makes the supernetting concept intuitive. Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/limits.html

AWS Networking Route Limits — The Numbers That Shape Architecture

ANS-C01 tests the specific quotas and limits that drive hybrid network design. Memorise these.

Direct Connect BGP prefix limits

  • 100 prefixes maximum received from on-premises per Private VIF or Transit VIF. Hard quota — exceeding causes the BGP session to drop, not just suppression.
  • 1000 prefixes maximum received per Public VIF — higher because Public VIFs can advertise on-prem networks to AWS public services.
  • AWS advertises all VPC CIDRs of attached VPCs to your on-premises router — so if you have 50 VPCs attached to a Direct Connect Gateway, AWS will advertise 50+ prefixes to you. This is normally fine because customer routers typically handle hundreds of thousands of BGP prefixes; the constraint is in the AWS direction, not the customer direction.

VGW route limits

  • 100 routes propagated from a single VGW to a VPC route table. Exceeding causes propagation to fail silently for excess routes — not a session drop.
  • 20 connections (VPN + Direct Connect VIFs) that can attach to a single VGW.

Transit Gateway quotas

  • 10,000 routes per TGW route table — generous, but exceedable in very large multi-region deployments.
  • 20 TGW route tables per TGW by default (raisable).
  • 5,000 attachments per TGW (raisable to 50,000 via AWS Support).
  • 20 virtual private gateways (one per VPC) per Direct Connect Gateway via Private VIF; 3 Transit Gateways per Direct Connect Gateway via Transit VIF.

VPC route table limits

  • 50 routes per VPC route table by default; raisable to 1,000 — but performance starts to degrade above ~500 routes.
  • 5 secondary CIDR blocks per VPC by default; raisable.

Why these numbers matter for maintenance

  • Adding a new VPC to a TGW with on-premises attachment can push past Direct Connect's 100-prefix cap if on-prem already advertises 99 prefixes — the new VPC's CIDR pushes the count over the limit and the BGP session drops.
  • A new on-prem subnet propagated via BGP may be the 101st prefix that drops the session.
  • VGW route table propagation hitting 100 routes silently drops new routes — connectivity to new on-prem networks fails without an obvious "session down" signal.

The single most-tested ANS-C01 BGP-limit fact: Direct Connect's 100-prefix-from-on-premises limit per Private VIF or Transit VIF is enforced by dropping the BGP session entirely when exceeded. This is not a soft quota where excess prefixes are silently discarded — the session goes down, all on-prem connectivity through that VIF stops, and the CloudWatch ConnectionState alarm fires. The fix is route summarisation on the customer-side BGP advertisements before re-establishing the session. For example, if on-prem advertises 120 contiguous /24s, summarise them into a handful of supernets (a /18, a /19, a /20, and a /21 cover exactly 120 /24s), reducing the announced count from 120 to 4. AWS does not summarise the routes for you. Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/limits.html
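The collapse arithmetic can be checked with Python's standard-library ipaddress module; a minimal sketch, assuming a hypothetical advertisement of 120 contiguous /24s:

```python
import ipaddress

# Hypothetical on-prem advertisement: 120 contiguous /24s
# (10.0.0.0/24 .. 10.0.119.0/24), over the 100-prefix hard limit.
advertised = [ipaddress.ip_network(f"10.0.{i}.0/24") for i in range(120)]
assert len(advertised) > 100  # the BGP session would drop at this count

# collapse_addresses merges exactly-adjacent networks into supernets
# without covering any address space that was not advertised.
summarised = list(ipaddress.collapse_addresses(advertised))
print([str(n) for n in summarised])
# ['10.0.0.0/18', '10.0.64.0/19', '10.0.96.0/20', '10.0.112.0/21']
print(len(summarised))  # 4, comfortably under the 100-prefix cap
```

The same collapse could run as a pre-change check in a CI pipeline before the customer router's advertisement list is updated.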

Route Summarisation Strategies

Summarisation (also called supernetting or aggregation) is the operational lever for staying within prefix limits.

Customer-side summarisation

The customer router announces a larger CIDR that contains all of the smaller CIDRs you want to make reachable. For example, instead of announcing 10.0.0.0/24, 10.0.1.0/24, ..., 10.0.15.0/24 (16 prefixes), announce 10.0.0.0/20 (one prefix covering the same range). Within the AWS network, return traffic to any 10.0.x.x address that falls within the /20 routes correctly across the link.

Trade-offs: summarisation works only when the underlying CIDRs are contiguous. A non-contiguous range like 10.0.0.0/24 and 10.5.0.0/24 cannot be summarised into a single supernet without including unallocated space (which causes black-holing of traffic to that unallocated space). For non-contiguous, summarise into the smallest supernet covering each contiguous chunk.
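A sketch of that trade-off, using the two non-contiguous /24s from the paragraph above (stdlib ipaddress only):

```python
import ipaddress

a = ipaddress.ip_network("10.0.0.0/24")
b = ipaddress.ip_network("10.5.0.0/24")

# Non-contiguous networks do not merge: safe summarisation keeps them separate.
print([str(n) for n in ipaddress.collapse_addresses([a, b])])
# ['10.0.0.0/24', '10.5.0.0/24']

# Forcing a single covering supernet means shrinking the prefix length
# until one network contains both, which also advertises unallocated space.
covering = a
while not b.subnet_of(covering):
    covering = covering.supernet()
print(covering)  # 10.0.0.0/13

blackholed = covering.num_addresses - a.num_addresses - b.num_addresses
print(blackholed)  # 523776 addresses of black-hole space inside the supernet
```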

AWS-side advertisement

AWS does not automatically summarise the VPC CIDRs it advertises to your on-premises router. Each attached VPC's CIDR is announced as-is. If you have 30 VPCs each with /16 CIDRs, you receive 30 BGP prefixes from AWS. To reduce, you would need to design VPCs with contiguous CIDRs that could be summarised — but AWS still advertises each individually. The practical implication: AWS-side prefix count grows with attached VPCs.

Allowed Prefixes filter

For a Public VIF advertising your on-premises networks to AWS public services, you specify an "allowed prefixes" filter — AWS will accept only those prefixes from your on-premises BGP announcements. This prevents accidental hijacking and serves as a sanity check on what you announce. For Private VIFs and Transit VIFs there is no equivalent content filter on inbound announcements; AWS accepts whatever you advertise, up to the 100-prefix cap.

When summarisation isn't enough

If you genuinely need more than 100 prefixes advertised from on-premises, the options are:

  • Multiple VIFs on the same connection — each VIF has its own 100-prefix budget.
  • Multiple Direct Connect connections — each with its own VIFs.
  • Restructure on-premises addressing to enable more aggressive summarisation.
  • Engage AWS Support for a quota raise (rarely granted; typically AWS recommends summarisation first).

A persistent ANS-C01 distractor: candidates assume AWS will helpfully summarise the routes it advertises. It does not. Each VPC CIDR attached to a Direct Connect Gateway or a Transit Gateway is advertised as a separate BGP prefix to your on-premises router. If you have 60 VPCs and your on-premises router has a small BGP table size limit (some legacy customer-edge devices), the AWS announcements alone exceed the table. The fix is on the customer-router side — most modern routers handle hundreds of thousands of prefixes — or to consolidate VPCs. Memorise: the 100-prefix limit on Direct Connect is inbound to AWS (from on-prem), not outbound from AWS. AWS-side advertisement is unbounded by that specific limit. Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/routing-and-bgp.html

BGP Timer Tuning and BFD

BGP keepalive and hold timers

Default BGP timers on AWS Direct Connect: keepalive 30 seconds, hold time 90 seconds. This means BGP detects a session failure 60–90 seconds after the actual loss — too slow for mission-critical workloads. Reducing the timers to keepalive 10 / hold 30 detects failures faster but adds CPU load on both routers and is not always supported uniformly.

BFD — sub-second failover

Bidirectional Forwarding Detection (BFD) is a lightweight protocol designed for sub-second link failure detection. Both routers send BFD packets every 100–300 milliseconds; missing 3–5 consecutive packets declares the link down. AWS Direct Connect supports asynchronous BFD with a default detection time of ~900 ms (300ms × 3), tunable down to ~150 ms (50ms × 3) for ultra-fast failover.

BFD is enabled by default on the AWS side of each VIF; you turn it on at the customer router, and the BFD session runs once both ends are configured. AWS ties BFD to the BGP session — when BFD detects link failure, BGP is immediately torn down and traffic shifts to the backup path (VPN, second DX) without waiting for BGP hold time. Without BFD, a clean BGP graceful shutdown takes 5–10 seconds at best; an ungraceful failure takes 90+ seconds.
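The detection-time arithmetic, using the timer values quoted in this section (a sketch, not a configuration tool):

```python
def detection_time_ms(interval_ms: int, multiplier: int) -> int:
    """Time to declare a link dead: N consecutive missed packets at a fixed interval."""
    return interval_ms * multiplier

# BGP alone: failure is detected only when the hold timer expires without a keepalive.
bgp_default_ms = 90 * 1000                  # default hold time 90 s
bgp_tuned_ms = 30 * 1000                    # keepalive 10 / hold 30 tuning

# BFD: sub-second detection.
bfd_default_ms = detection_time_ms(300, 3)  # 900 ms AWS default
bfd_tuned_ms = detection_time_ms(50, 3)     # 150 ms aggressive tuning

print(bfd_default_ms, bfd_tuned_ms)         # 900 150
print(bgp_default_ms // bfd_default_ms)     # default BGP is 100x slower than default BFD
```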

BGP MD5 authentication

AWS supports MD5 password authentication on BGP sessions, recommended for security. Both ends must configure the same password; mismatched passwords prevent the BGP session from establishing. SHA-based and stronger authentication is not currently supported on AWS BGP sessions — MD5 is the only option.

BGP graceful restart

AWS Direct Connect supports BGP graceful restart, allowing the BGP session to recover without traffic loss if the BGP control plane on either router restarts (e.g., during a software upgrade). Graceful restart is enabled by default; it sets a "stale routes timer" so existing forwarding state is preserved while BGP re-establishes.

  • BGP keepalive: how often BGP sends "I'm alive" messages (default 30s).
  • BGP hold time: how long BGP waits for keepalives before declaring failure (default 90s).
  • BFD: Bidirectional Forwarding Detection; sub-second failure detection (default 900ms on DX).
  • BGP MD5: password-based BGP session authentication (only option on AWS).
  • Graceful restart: BGP control-plane restart without forwarding-plane loss.
  • AS-PATH prepending: artificially adding AS hops to make a route less preferred.
  • LOCAL_PREF: BGP attribute influencing outbound path selection; local to your AS (carried in iBGP only), so it is never transmitted to AWS across the eBGP session.
  • MED: Multi-Exit Discriminator; influences inbound traffic from AWS to on-prem.
  • NO_EXPORT / NO_ADVERTISE: BGP communities scoping route propagation.
  • Allowed prefixes filter: per-VIF list of acceptable prefix announcements (Public VIF).
  • Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/bfd-enable.html

AWS-Managed Maintenance Windows

AWS performs scheduled maintenance on Direct Connect endpoints, Virtual Private Gateways, and Transit Gateways. Operators must know the announcement channel, the impact, and the redundancy requirements to ride through.

AWS Personal Health Dashboard

Maintenance announcements appear on the AWS Personal Health Dashboard (now part of AWS Health) for the affected account. EventBridge can subscribe to AWS Health events to automate notifications via SNS, Slack, or PagerDuty integration. Maintenance is typically announced 7–14 days in advance.

Direct Connect endpoint maintenance

AWS may need to upgrade the AWS-side router that terminates a Direct Connect VIF. During the maintenance window, the BGP session for that VIF goes down and traffic must reroute to a redundant path. AWS recommends:

  • Two Direct Connect connections at two different AWS Direct Connect locations for full redundancy.
  • Two VIFs on the same connection are NOT redundancy — both terminate on the same AWS router and both go down together during that router's maintenance.
  • Active/passive failover with BGP — primary VIF gets all traffic; passive VIF takes over during primary's maintenance via BGP automatic re-routing.

VGW maintenance

VGWs are managed services and AWS handles the maintenance. The VGW supports tunnel resilience natively; during maintenance, one tunnel may go down briefly. As long as the customer gateway uses both tunnels, traffic continues on the other.

TGW maintenance

TGW is highly resilient by design — the multi-AZ deployment means individual control-plane updates do not affect data-plane traffic. Customers rarely see TGW maintenance impact unless specific cross-attachment routing is being modified.

Customer maintenance procedures

For customer-side router or firewall maintenance affecting BGP sessions:

  • Schedule during low-traffic windows if no redundancy.
  • Announce via internal change management so on-call understands the expected BGP flap.
  • Use AS-PATH prepending or BGP graceful shutdown to drain traffic from the path before taking it down.
  • Validate the backup path in advance — demonstrate via a controlled test that traffic flows over the secondary correctly.

A subtle ANS-C01 trap. Operators sometimes provision two VIFs (e.g., Private VIF and Transit VIF) on a single Direct Connect connection thinking that gives them redundancy. It does not — both VIFs terminate on the same AWS router; if AWS performs maintenance on that router, both VIFs flap simultaneously. True redundancy requires two separate Direct Connect connections at two different Direct Connect locations, ideally in different physical paths. The "highest resilience" tier is two connections at two locations with dual customer-edge routers. SCS-C02 and SAA-C03 occasionally test this; ANS-C01 always tests it. Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/Welcome.html

Direct Connect Failover Testing and Validation

Operationally, regular failover tests validate that your redundancy actually works before you need it.

Planned failover test procedure

The canonical procedure: (1) Notify stakeholders of the test window. (2) Prepare the rollback plan — exact commands to restore primary, expected timing. (3) Initiate the failover — administratively shut down BGP on the primary VIF (shutdown on the BGP neighbor in customer router config), or temporarily remove the primary's static route. (4) Validate — confirm CloudWatch metrics show traffic flowing on the secondary path; check application metrics for any errors during the transition; check the duration of the gap (should be sub-second with BFD, 5–30s without). (5) Restore — bring the primary BGP session back up; verify traffic returns to the primary; confirm no asymmetric routing remains. (6) Document — failover time, any unexpected behavior, follow-up actions.
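Step (4)'s validation logic can be sketched as a simple predicate over metric samples already fetched from CloudWatch; the function name and sample values are illustrative, not an AWS API:

```python
def failover_succeeded(primary_state: int, secondary_state: int,
                       primary_bps_egress: float, secondary_bps_egress: float) -> bool:
    """Mid-test check: primary down (ConnectionState 0), secondary up (1),
    and traffic actually flowing on the secondary path."""
    return (primary_state == 0
            and secondary_state == 1
            and primary_bps_egress == 0
            and secondary_bps_egress > 0)

# Hypothetical mid-test snapshot: primary VIF shut down, secondary carrying ~400 Mbps.
print(failover_succeeded(0, 1, 0.0, 400e6))  # True
# Primary went down but traffic never shifted: the test failed.
print(failover_succeeded(0, 1, 0.0, 0.0))    # False
```

A real harness would pull ConnectionState and ConnectionBpsEgress samples for both connections and evaluate this predicate over the whole test window, not a single snapshot.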

Failover test frequency

For mission-critical hybrid connectivity, quarterly failover testing is the AWS recommendation. For lower-tier workloads, annual testing combined with cloud-side configuration audits suffices. Failover testing is also useful immediately after any significant network change — a new VPC attached to TGW, a new on-prem subnet announced, a customer-router upgrade.

Validation metrics

CloudWatch metrics that prove successful failover:

  • ConnectionState — primary 0, secondary 1 during the test; both 1 after restore.
  • ConnectionBpsEgress / ConnectionBpsIngress — should drop on primary during failover, rise on secondary.
  • VPC Flow Logs at the workload ENIs — should show continuous traffic with brief gap during transition.
  • Application-level metrics — request counts, error rates, latency p99 during the test window.

AS-PATH prepending for active/passive

The mechanism that makes traffic prefer the primary in normal operation but use the secondary during failover. On the secondary VPN, configure BGP to prepend its own AS multiple times when announcing on-prem prefixes — route-map AS_PREPEND_OUT permit 10 with set as-path prepend 65000 65000 65000. AWS sees the secondary path as longer and prefers the primary. When primary fails, the secondary becomes the only path and traffic shifts. Restoring primary reverses the preference automatically.
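A minimal sketch of the AS_PATH-length tie-break (the ASN and prepend count are illustrative; real BGP best-path selection considers other attributes first):

```python
def best_path(paths: list[list[int]]) -> list[int]:
    """Pick the advertisement with the shortest AS_PATH, other attributes being equal."""
    return min(paths, key=len)

primary = [65000]                        # Direct Connect: on-prem AS appears once
backup = [65000, 65000, 65000, 65000]    # VPN: same AS prepended three extra times

# Normal operation: both paths present, primary wins on AS_PATH length.
print(best_path([primary, backup]))      # [65000]

# Primary withdrawn (failure or maintenance): the backup is the only path left.
print(best_path([backup]))               # [65000, 65000, 65000, 65000]
```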

A surprisingly common operational failure: an engineer initiates a failover test without a clear rollback procedure, the secondary path has a misconfiguration the engineer didn't anticipate, traffic is partially down, and the engineer takes 30 minutes to figure out the rollback. ANS-C01 doesn't test this directly, but it tests the prerequisites for safe failover testing: documented rollback, validation metrics, AS-PATH or weight tuning so traffic shifts cleanly, both paths pre-validated independently. The exam-canonical answer to "how do you safely test Direct Connect failover" includes BGP graceful shutdown, AS-PATH prepending, BFD-enabled fast failure detection, and CloudWatch dashboard for real-time validation. Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/routing-and-bgp.html

Site-to-Site VPN Tunnel Maintenance

Site-to-Site VPN tunnels need ongoing maintenance: monitoring tunnel state, handling DPD timeouts, NAT-T traversal for clients behind NAT, and rekeying.

DPD (Dead Peer Detection)

Dead Peer Detection (DPD) is the IPsec mechanism for detecting an unresponsive peer and tearing down the IKE association. AWS sends DPD probes every 10 seconds by default; after 30 seconds of no response (3 missed probes), the tunnel is declared down and re-negotiated. On the customer-gateway side, DPD timing should match — most enterprise firewalls support configurable DPD intervals.

NAT-T (NAT Traversal)

If the customer gateway sits behind NAT, IPsec ESP traffic (protocol 50) cannot traverse NAT. NAT Traversal (NAT-T) encapsulates ESP in UDP/4500, allowing NAT translation. AWS Site-to-Site VPN supports NAT-T automatically; the customer router must be configured to negotiate NAT-T at IKE time. Detection: if the customer router is behind NAT, NAT-T is auto-detected and used.

Tunnel rekey

By default, AWS Site-to-Site VPN tunnels rekey IKE Phase 1 every 8 hours and IPsec Phase 2 every 1 hour. Rekey is transparent to applications — traffic continues during the negotiation. Customer gateway must support modern cipher suites (AES-256, SHA-2, DH group 14+) for compatibility with current AWS defaults.

Tunnel endpoint replacement

AWS may replace tunnel endpoint IPs during planned maintenance. The customer gateway must be configured with both tunnel endpoint IPs (each tunnel has its own pair) and use both for HA. If only one is configured, AWS-side endpoint replacement breaks the connection until the customer router is updated. The Personal Health Dashboard announces these.

CloudWatch metrics for VPN

  • TunnelState: 1 if up, 0 if down. Alarm on this for paging.
  • TunnelDataIn, TunnelDataOut: validate active/active sharing or single-tunnel domination.
  • TunnelIpAddress: a CloudWatch dimension (not a metric) identifying the AWS-side endpoint IP, used to filter the tunnel metrics per tunnel.

PrivateLink Endpoint Maintenance

PrivateLink (interface and gateway endpoints) has its own operational considerations.

Interface endpoint health

Each interface endpoint provisions ENIs in your subnets — one ENI per AZ where the endpoint is enabled. The ENIs are managed by AWS but are visible in your account. Health-check considerations:

  • Endpoint status: pending, available, deleting, deleted, rejected, failed. Monitor via DescribeVpcEndpoints API and alarm on non-available state.
  • Per-AZ availability: each AZ's endpoint ENI is independent; an AZ-level issue affects only that AZ's endpoint.
  • DNS resolution: with private DNS enabled, the endpoint overrides public DNS. Verify resolution from within the VPC using dig <service-name> and ensure responses point to ENI private IPs, not public IPs.
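The monitoring bullet above can be sketched as a filter over DescribeVpcEndpoints output; the State and VpcEndpointId field names match the API response shape, while the endpoint data here is hypothetical:

```python
def unhealthy_endpoints(endpoints: list[dict]) -> list[str]:
    """Return IDs of VPC endpoints not in the 'available' state.
    Input is shaped like the VpcEndpoints list from DescribeVpcEndpoints."""
    return [e["VpcEndpointId"] for e in endpoints if e["State"] != "available"]

# Hypothetical API response fragment.
endpoints = [
    {"VpcEndpointId": "vpce-0a1b2c3d", "State": "available"},
    {"VpcEndpointId": "vpce-4e5f6a7b", "State": "failed"},
    {"VpcEndpointId": "vpce-8c9d0e1f", "State": "pendingAcceptance"},
]
print(unhealthy_endpoints(endpoints))  # ['vpce-4e5f6a7b', 'vpce-8c9d0e1f']
```

In production the list would come from a scheduled DescribeVpcEndpoints call, with any non-empty result pushed to an alarm topic.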

Endpoint security group drift

Interface endpoints have security groups; rule changes drift over time. Periodically audit endpoint SGs to ensure they allow access only from intended client subnets. AWS Config rules can flag endpoints with overly permissive SGs.

Endpoint policies

VPC endpoint policies are IAM-style policies attached to the endpoint that restrict which actions are allowed through it. Updates to endpoint policies require careful change management — an overly restrictive policy denies all traffic through the endpoint, breaking client applications. Test policy changes in a non-prod endpoint first.

Cross-account endpoint sharing maintenance

When a service provider exposes a PrivateLink endpoint service to consumer accounts, the acceptance workflow can drift. New consumer accounts request connection; the provider accepts manually (unless auto-accept is enabled). When provider scaling adds or removes NLB targets behind the endpoint service, consumers see no impact (transparent to the endpoint). When provider moves to a new NLB or endpoint service ID, consumers must update their endpoint configuration — this is a breaking change that requires coordination.

Direct Connect SiteLink

Direct Connect SiteLink is a feature that allows traffic between two Direct Connect locations to traverse the AWS backbone without first hairpinning to a VPC. Useful when an organisation has multiple branch offices each connected to AWS via Direct Connect — SiteLink enables direct branch-to-branch traffic over AWS-internal links instead of internet.

SiteLink is enabled per-VIF (Private or Transit). Pricing: SiteLink-eligible data transfer is charged at a separate rate from standard Direct Connect data transfer. SiteLink is region-aware: cross-region SiteLink uses inter-region pricing.

For maintenance, SiteLink-enabled VIFs follow the same Direct Connect maintenance rules — the AWS-side endpoint may be subject to scheduled maintenance, requiring redundant SiteLink-enabled VIFs at multiple Direct Connect locations.

Quick Reference: Limits and Timers

  • Direct Connect Private/Transit VIF prefix limit: 100 received from on-prem (hard, drops session when exceeded).
  • Direct Connect Public VIF prefix limit: 1000 received from on-prem.
  • VGW route propagation limit: 100 routes (silent failure beyond).
  • TGW route table size limit: 10,000 routes per table.
  • TGW route tables per TGW: 20 default (raisable).
  • TGW attachments per TGW: 5,000 default (raisable to 50,000).
  • DXGW association limits: 20 VGWs per DXGW (Private VIF path); 3 TGWs per DXGW (Transit VIF path).
  • Default BGP keepalive / hold time on DX: 30s / 90s.
  • BFD on Direct Connect: default 300ms × 3 = 900ms detection; tunable to 150ms.
  • AWS BGP MD5 authentication: only option (no SHA).
  • VPN DPD default: AWS sends every 10s; declares dead after 30s (3 missed).
  • VPN tunnels per connection: 2 (both should be configured at customer gateway).
  • VPN IKE Phase 1 rekey: every 8 hours by default.
  • VPN IKE Phase 2 rekey: every 1 hour by default.
  • Two Direct Connect connections at two locations: minimum for true redundancy.
  • Two VIFs on one connection: NOT redundancy (same AWS router).
  • Reference: https://docs.aws.amazon.com/vpc/latest/tgw/transit-gateway-quotas.html

Common Traps Recap — Domain 3.1

The traps the exam writes most frequently in hybrid connectivity maintenance.

Trap 1: AWS summarises VPC CIDRs in BGP advertisements

Wrong. Each VPC CIDR is announced as a separate BGP prefix. Customer-side summarisation is the only summarisation in the system.

Trap 2: 100-prefix limit on Direct Connect is a soft cap

Wrong. It is a hard limit; exceeding drops the BGP session entirely.

Trap 3: Two VIFs on one Direct Connect provide redundancy

Wrong. Same AWS router; same maintenance window. Two connections at two locations is the minimum.

Trap 4: BFD eliminates BGP timer tuning

Wrong. BFD provides fast failure detection; BGP timers still govern session keepalive and graceful restart. Both should be tuned.

Trap 5: VGW supports more than 100 propagated routes

Wrong. 100 routes is the limit; new propagated routes silently drop without alarm.

Trap 6: NAT-T must be manually configured

Wrong. AWS Site-to-Site VPN auto-detects NAT and uses NAT-T transparently when the customer gateway negotiates it.

Trap 7: VPN tunnel rekey causes traffic loss

Wrong. Rekey is transparent — traffic continues during IKE Phase 1/2 negotiation.

Trap 8: AWS-managed maintenance never affects customer traffic

Wrong. Direct Connect endpoint maintenance does affect a single VIF on a single connection. Redundancy is required to ride through.

Trap 9: Failover testing is optional for redundant designs

Wrong. Untested failover paths frequently fail when actually needed. Quarterly testing is the AWS-recommended cadence.

Trap 10: AS-PATH prepending makes the path unusable

Wrong. AS-PATH prepending makes a path less preferred but still usable. Traffic flows over the prepended path when no shorter path exists.

Trap 11: PrivateLink endpoints need no maintenance attention

Wrong. Per-AZ ENIs are subject to AWS maintenance individually; endpoint provider services may also have maintenance windows.

Trap 12: SiteLink replaces VPC connectivity

Wrong. SiteLink only enables direct on-prem-to-on-prem traffic over the AWS backbone for traffic between Direct Connect locations. It does not replace VPC connectivity.

Decision Matrix — Maintenance Construct for Each ANS-C01 Goal

Maintenance goal → primary construct (notes):

  • BGP session dropping with "max-prefixes exceeded" → customer-side route summarisation (AWS does not summarise).
  • Sub-second failure detection on Direct Connect → enable BFD on both ends (default 900 ms; tunable lower).
  • Active/passive Direct Connect + VPN → AS-PATH prepending on the VPN BGP session (VPN prepended; primary preferred).
  • True redundancy for Direct Connect → two connections at two locations (NOT two VIFs on one connection).
  • Surviving AWS Direct Connect maintenance → active/passive with a redundant connection (windows posted on the Personal Health Dashboard).
  • Detect BGP flapping → CloudWatch ConnectionState alarm plus BGP show commands on the customer router (check both endpoints).
  • Surviving a customer-router upgrade → BGP graceful shutdown plus AS-PATH draining (pre-drain traffic; restart; reattach).
  • VGW propagation failures → check the 100-route limit (silent failure beyond).
  • TGW route table near 10,000 routes → summarise and use multiple route tables (per-segment isolation).
  • Quarterly failover validation → documented test procedure plus CloudWatch dashboard (rollback plan mandatory).
  • Detect SNAT exhaustion on the hybrid path → combine NAT Gateway ErrorPortAllocation with customer-side stats (look at both ends).
  • PrivateLink endpoint health → DescribeVpcEndpoints plus per-AZ ENI status (alarm on non-available).
  • Cross-account TGW share drift → AWS Config rule on TGW attachment configuration (detect unauthorised changes).
  • MAC address changes after VGW maintenance → use IP, not MAC, in customer firewall rules (MACs may change).
  • On-prem-to-on-prem traffic over the AWS backbone → Direct Connect SiteLink (per-VIF feature).

FAQ — Hybrid Connectivity Maintenance

Q1: Our BGP session on Direct Connect just dropped — how do I diagnose whether we hit the 100-prefix limit?

The diagnostic ladder. (1) Check CloudWatch on the Direct Connect connection: ConnectionState should show 0; the alarm should have fired. (2) SSH to the customer-edge router and run show ip bgp summary (Cisco) or equivalent. Look at the BGP session state — if it shows "Idle (PfxCt)" or "Idle (max-prefix)" the customer router is reporting that it received too many prefixes from AWS (rare, since AWS rarely advertises >100). If it shows session timeout or hold-time expiry without prefix-count, the issue is elsewhere. (3) Check the AWS-side BGP: in the Direct Connect console, the VIF will show "Down" with a reason such as "BGP_PEER_NEIGHBOUR_RECEIVED_TOO_MANY_PREFIXES". (4) Count prefixes being announced from on-prem: on the customer router, show ip bgp neighbors <aws-bgp-peer> advertised-routes | count. If this count is ≥100, you have hit the limit. (5) Summarise: identify contiguous CIDR ranges and consolidate announcements into supernets. Re-establish BGP session. (6) Document and alert: add a CloudWatch alarm monitoring "advertised prefix count" — a custom metric you publish from the customer-router via SNMP-to-CloudWatch — to alert when count nears 90 (early warning before 100).
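The early-warning logic in step (6) can be sketched in Python. The namespace, metric name, and 90-prefix warning threshold below are illustrative assumptions, not AWS-defined values; the actual prefix count would come from your router via SNMP.

```python
# Sketch: early-warning classification for the advertised-prefix count,
# before AWS tears down the BGP session at the 100-prefix VIF limit.
def prefix_limit_state(advertised: int, warn: int = 90, limit: int = 100) -> str:
    """Classify the advertised-prefix count against the per-VIF limit."""
    if advertised >= limit:
        return "CRITICAL"   # session will be (or has been) torn down
    if advertised >= warn:
        return "WARN"       # early warning: summarise before hitting 100
    return "OK"

def publish_metric(count: int) -> None:
    """Publish the count as a custom CloudWatch metric (requires credentials;
    namespace and metric name are assumed, not AWS-defined)."""
    import boto3
    boto3.client("cloudwatch").put_metric_data(
        Namespace="Hybrid/BGP",
        MetricData=[{"MetricName": "AdvertisedPrefixCount", "Value": count}],
    )

if __name__ == "__main__":
    print(prefix_limit_state(87), prefix_limit_state(95), prefix_limit_state(101))
```

A CloudWatch alarm on the custom metric at the WARN threshold then gives you headroom to summarise before the session drops.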

Q2: I have BGP configured with default timers and BFD disabled — how long will my failover take during a Direct Connect outage?

With default timers and no BFD, failure detection is 60–90 seconds. The BGP keepalive timer is 30 seconds, and the hold time is 90 seconds — meaning BGP must miss 3 keepalives before declaring the session dead. After session declaration, BGP must converge to the alternative path, which takes additional seconds (typically 3–10s for AS-PATH-prepended VPN backup). Total time-to-recovery: 65–100 seconds. This is unacceptable for most production workloads. With BFD enabled at default 300ms × 3 = 900ms detection, total recovery drops to about 1–5 seconds. With aggressive BFD timing of 50ms × 3 = 150ms, total recovery approaches sub-second — the gold standard for mission-critical hybrid links. Enabling BFD requires both AWS and customer-router support; most modern enterprise routers (Cisco, Juniper, Arista, Palo Alto) support BFD. Configure both ends, test, and validate via failover drills.
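The timer arithmetic above, as a quick sanity check (the 3–10 s convergence figures are the assumptions from the text, not measured values):

```python
# Back-of-envelope failover arithmetic: detection time is N consecutive
# missed hellos/keepalives, then BGP converges to the backup path.
def detection_seconds(interval_s: float, multiplier: int = 3) -> float:
    return interval_s * multiplier

bgp_default    = detection_seconds(30)     # 90 s hold time (3 x 30 s keepalive)
bfd_default    = detection_seconds(0.300)  # 0.9 s
bfd_aggressive = detection_seconds(0.050)  # 0.15 s

worst_case_no_bfd = bgp_default + 10   # ~100 s with slow convergence
best_case_bfd     = bfd_default + 3    # a few seconds total

print(bgp_default, round(bfd_default, 3), round(bfd_aggressive, 3))
```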

Q3: We need to advertise 150 on-premises prefixes to AWS — how do we work around the 100-prefix limit?

Four viable approaches in priority order. (1) Customer-side summarisation — almost always the right answer. Identify contiguous CIDR ranges that can be consolidated into supernets. For example, 150 individual /24s from 10.0.0.0/24 through 10.0.149.0/24 can all be covered by a single /16 (10.0.0.0/16). Trade-offs: summarisation works only for contiguous ranges, and the supernet may include unallocated space that should not route traffic (which is acceptable if you use VPC route tables to filter). (2) Multiple VIFs on the same Direct Connect connection — each VIF has its own 100-prefix budget. If your on-prem ranges naturally split into two groups (e.g., production 10.0.0.0/12, development 172.16.0.0/12), provision one Private VIF per group. Trade-off: more BGP sessions to maintain; not all customer routers handle multi-VIF gracefully. (3) Multiple Direct Connect connections — each with its own VIFs. The most expensive but most flexible. Often the right answer when the underlying need is also for redundancy. (4) Restructure on-premises addressing — if the prefix sprawl is the result of historical accumulation, a long-term re-IP project consolidates many small ranges into a few summarisable ranges. ANS-C01 expects answer (1) as the first response; (2) and (3) are escalations.
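The summarisation step can be prototyped with Python's standard ipaddress module. The 150 example /24s below (10.0.0.0/24 through 10.0.149.0/24) are illustrative, and show the trade-off between an exact minimal summary and a single covering supernet:

```python
import ipaddress

# 150 contiguous /24s, illustrative of the prefix-sprawl scenario above
prefixes = [ipaddress.ip_network(f"10.0.{i}.0/24") for i in range(150)]

# Exact minimal summarisation: no unallocated space advertised.
# 150 = 128 + 16 + 4 + 2, so this collapses to four supernets:
# 10.0.0.0/17, 10.0.128.0/20, 10.0.144.0/22, 10.0.148.0/23
collapsed = list(ipaddress.collapse_addresses(prefixes))

# Single covering supernet: one advertisement, but it includes
# unallocated space (10.0.150.0 - 10.0.255.255) that must be filtered.
supernet = ipaddress.ip_network("10.0.0.0/16")
assert all(supernet.supernet_of(p) for p in prefixes)

print(len(collapsed), [str(n) for n in collapsed])
```

Either result stays well under the 100-prefix limit; the choice is between advertisement count and precision.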

Q4: How do I architect for graceful Direct Connect maintenance windows announced by AWS?

The architecture. (1) Two Direct Connect connections at two different Direct Connect locations — when AWS performs maintenance on one location's router, the other location continues to serve traffic. (2) Active/passive BGP routing with AS-PATH prepending on the secondary, so primary handles all traffic in normal state and secondary takes over only during primary's maintenance. (3) BFD enabled for sub-second failover detection. (4) CloudWatch alarms on ConnectionState for both connections, with SNS escalation to on-call. (5) EventBridge rule subscribed to AWS Health events for the Direct Connect resource type, sending notifications when AWS announces maintenance for your connections — typically 7–14 days in advance. (6) Documented runbook for each maintenance scenario: which connection is being maintained, expected failover behavior, validation steps, escalation path. (7) Quarterly failover testing to validate that the maintenance-day failover will actually work. With this stack, AWS Direct Connect maintenance is a non-event — traffic shifts automatically, you receive a CloudWatch alarm during the window, and traffic shifts back when the maintenance completes. ANS-C01 expects this as the canonical answer to "how do you handle AWS-side Direct Connect maintenance gracefully".
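Step (5) can be sketched as an EventBridge event pattern. The shape below follows AWS Health event conventions (source aws.health, service DIRECTCONNECT, category scheduledChange for maintenance announcements), but verify the field values against the actual events your account receives before relying on it:

```python
import json

# Assumed event pattern for AWS Health maintenance notifications on
# Direct Connect; attach to an EventBridge rule, e.g.
#   events.put_rule(Name=..., EventPattern=json.dumps(pattern))
# and add an SNS topic as the rule target for on-call notification.
pattern = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["DIRECTCONNECT"],
        "eventTypeCategory": ["scheduledChange"],  # maintenance announcements
    },
}

print(json.dumps(pattern, indent=2))
```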

Q5: What is the role of BFD specifically, and how is it different from BGP keepalive?

BFD (Bidirectional Forwarding Detection) is a dataplane-level fast-failure-detection protocol designed to be lightweight enough to send packets every 50–300 milliseconds without significant CPU cost. BGP keepalive is a control-plane-level alive-check that runs at much slower intervals (30 seconds default) and uses the BGP TCP session itself. The asymmetry: BFD detects link failure in milliseconds (e.g., 900ms with default settings); BGP keepalive detects in 60–90 seconds (3 missed keepalives × 30s = 90s hold time). BFD packets are tiny UDP datagrams sent independently of BGP, so a BGP control plane that is responsive but unable to forward traffic (e.g., CPU pegged) will be detected by BFD but not by BGP keepalive. AWS Direct Connect supports BFD with both async mode (each side sends independently) and demand mode (poll-on-demand), with async being the default. BFD failure tears down the BGP session immediately, which then converges to alternative paths via standard BGP processes. The two protocols are complementary, not redundant — you should run both, and BFD is the difference between sub-second and minute-scale failover.

Q6: How do we monitor PrivateLink endpoint health operationally?

Multiple complementary signals. (1) DescribeVpcEndpoints API on a schedule (Lambda + EventBridge every 5 minutes) — query each endpoint's state and ENI per-AZ availability. Alarm if state ≠ available or any AZ ENI is missing. (2) VPC Flow Logs on the endpoint ENIs — periodic queries via Athena to confirm traffic is flowing to/from the endpoint. A sudden drop in flow volume can indicate downstream issues even before the AWS-managed health detects them. (3) DNS resolution validation — for endpoints with private DNS enabled, periodic Lambda checks resolving the AWS service hostname from inside the VPC and confirming the resolved IPs match the endpoint ENI IPs. Drift here can signal DHCP option set or Resolver issues. (4) Application-level health metrics — the consuming application should publish custom metrics for its endpoint calls (latency, error rate, throughput). End-to-end success is the ultimate validation. (5) Endpoint policy validation — periodic IAM Access Analyzer review of endpoint policies to detect drift. (6) For producer-side custom endpoint services — monitor the underlying NLB health, target health, and acceptance state of consumer connections. The collection of these monitors gives full coverage of PrivateLink operational health.
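A minimal sketch of signal (1), operating on a DescribeVpcEndpoints-shaped response so it runs without credentials. Field names follow the EC2 API; the expected per-AZ ENI count of 3 is an assumption about the deployment:

```python
# Flag interface endpoints whose state is not "available" or whose
# per-AZ ENI count has dropped below the expected number of AZs.
def unhealthy_endpoints(response: dict, expected_azs: int = 3) -> list:
    flagged = []
    for ep in response.get("VpcEndpoints", []):
        state_bad = ep.get("State") != "available"
        enis_missing = len(ep.get("NetworkInterfaceIds", [])) < expected_azs
        if state_bad or enis_missing:
            flagged.append(ep["VpcEndpointId"])
    return flagged

# Sample response shaped like ec2.describe_vpc_endpoints() output
sample = {"VpcEndpoints": [
    {"VpcEndpointId": "vpce-aaa", "State": "available",
     "NetworkInterfaceIds": ["eni-1", "eni-2", "eni-3"]},
    {"VpcEndpointId": "vpce-bbb", "State": "pending",
     "NetworkInterfaceIds": ["eni-4"]},
]}

print(unhealthy_endpoints(sample))  # -> ['vpce-bbb']
```

In a Lambda, the sample dict would be replaced by the live describe_vpc_endpoints call and the flagged list published as a CloudWatch metric.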

Q7: AS-PATH prepending vs LOCAL_PREF vs MED — what does each control on AWS Direct Connect?

Each BGP attribute influences a different routing decision. AS-PATH prepending influences inbound traffic from AWS to on-premises — by making the path appear longer in AS-hops, AWS prefers shorter paths, so traffic flows over the non-prepended (primary) path inbound to your on-prem. Set this on the customer router for outbound BGP advertisements to AWS. LOCAL_PREF influences outbound traffic from on-premises to AWS — but LOCAL_PREF is iBGP-only (within an autonomous system), so AWS never sees your LOCAL_PREF settings; it influences only your internal routing decision among multiple AWS BGP peers (which is rare unless you have multiple Direct Connects to the same AWS region). MED (Multi-Exit Discriminator) is more nuanced — it influences how AWS chooses among multiple BGP paths with the same AS-PATH length, but only when those paths come from the same neighboring AS (yours). For a typical Direct Connect deployment with two connections from the same on-prem AS to AWS, MED can tip the preference between them. The exam-canonical pattern: AS-PATH prepending on the secondary VPN for active/passive Direct Connect + VPN failover. MED for tweaking between two Direct Connects. LOCAL_PREF rarely visible to AWS. Memorise the asymmetry: AS-PATH affects inbound; LOCAL_PREF affects outbound; MED is the tiebreaker.
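The AS-PATH-then-MED decision can be modelled with a toy comparator. This is a deliberate simplification of full BGP best-path selection (it ignores LOCAL_PREF, origin, and router-ID tiebreaks), and the names and ASNs are invented:

```python
# Toy model of two decision steps: shorter AS-PATH wins; MED breaks the
# tie (lower preferred) only when the tied paths come from the same
# neighbouring AS. Ties across different neighbour ASes fall through
# to list order here, where real BGP would apply further tiebreakers.
def best_path(paths: list) -> dict:
    shortest = min(len(p["as_path"]) for p in paths)
    candidates = [p for p in paths if len(p["as_path"]) == shortest]
    if len(candidates) > 1 and len({p["as_path"][0] for p in candidates}) == 1:
        candidates.sort(key=lambda p: p["med"])  # same neighbour AS: lower MED wins
    return candidates[0]

primary    = {"name": "dx-primary", "as_path": [65000],               "med": 100}
vpn_backup = {"name": "vpn-backup", "as_path": [65000, 65000, 65000], "med": 0}

# Prepended VPN loses despite its lower MED: AS-PATH is compared first.
print(best_path([primary, vpn_backup])["name"])  # -> dx-primary
```

This mirrors the exam-canonical pattern: prepending keeps the VPN as backup, while MED only matters between paths of equal AS-PATH length from the same AS.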

Q8: When is Site-to-Site VPN the right backup for Direct Connect, and when is a second Direct Connect the right answer?

Depends on the criticality and traffic volume. Site-to-Site VPN backup is appropriate when: (a) the workload can tolerate VPN's lower bandwidth (typically 1.25 Gbps per VPN connection, ECMP-aggregable to higher), (b) the cost of a second Direct Connect is not justified, (c) the recovery-time objective allows for the VPN's slightly higher latency during failover. Common pattern: a 1 Gbps Direct Connect with VPN backup for non-critical hybrid workloads. Second Direct Connect is appropriate when: (a) bandwidth needs exceed VPN aggregate capacity, (b) the workload has stringent latency SLAs that VPN cannot meet, (c) regulatory requirements mandate non-internet transit even during failover, (d) the cost of a second port is justified by the workload value. Common pattern: tier-1 mission-critical hybrid with two 10 Gbps Direct Connects at different DX locations, plus VPN as third-tier failover. For ANS-C01, scenarios that mention "regulated industry", "mission-critical", or "consistent low latency" typically favor dual Direct Connect; scenarios mentioning "cost-effective" or "small bandwidth" favor DX + VPN.

Q9: How do I plan VGW route propagation when it's nearing the 100-route limit?

Multiple options. (1) Migrate from VGW to Transit Gateway — TGW supports 10,000 routes per route table, dramatically larger limits. The migration involves attaching the TGW to all relevant VPCs, attaching the Direct Connect via Transit VIF + Direct Connect Gateway, updating VPC route tables to point at TGW instead of VGW. This is the right long-term answer for any organization growing past simple single-VPC hybrid. (2) Customer-side summarisation — reduce the BGP advertisement count from on-prem so fewer routes propagate to the VGW. Same techniques as Direct Connect prefix-limit summarisation. (3) Static routes to override propagation — if some routes are stable and don't need BGP propagation, configure them as static in the VPC route table, and rely on propagation only for the dynamic ones. (4) Multiple VPCs each with their own VGW — split the workload into smaller VPCs, each with its own 100-route VGW; total capacity multiplies. Trade-off: operational complexity. ANS-C01 expects answer (1) as the canonical — AWS recommends TGW over VGW for hybrid architectures, and VGW's limits make it a poor fit for growing environments. New designs should default to TGW.
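Before choosing among these options, measure how close the route table is to the limit. A sketch counting propagated routes in a DescribeRouteTables-shaped response; in the EC2 API, propagated routes carry Origin EnableVgwRoutePropagation:

```python
# Headroom check against the VGW's 100-propagated-route limit, operating
# on data shaped like one entry of ec2.describe_route_tables() output.
def propagated_route_headroom(route_table: dict, limit: int = 100) -> int:
    propagated = [r for r in route_table.get("Routes", [])
                  if r.get("Origin") == "EnableVgwRoutePropagation"]
    return limit - len(propagated)

sample = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16",     "Origin": "CreateRouteTable"},
    {"DestinationCidrBlock": "192.168.1.0/24",  "Origin": "EnableVgwRoutePropagation"},
    {"DestinationCidrBlock": "192.168.2.0/24",  "Origin": "EnableVgwRoutePropagation"},
]}

print(propagated_route_headroom(sample))  # -> 98
```

Publishing the headroom as a custom metric, as with the Direct Connect prefix count, turns the silent drop beyond 100 into an actionable alarm.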

Q10: What are the operational signals for "BGP session is healthy but traffic is asymmetric or missing"?

Symptom triage. (1) CloudWatch shows ConnectionState=1 for both DX connections, BGP session is up on customer router — the link is fine. (2) Application reports intermittent timeouts to certain on-prem destinations — investigation begins. (3) Run TGW Route Analyzer or Reachability Analyzer for the source-to-destination pair; verify the path is correct and the route is propagated. (4) Check VPC Flow Logs at the source ENI for REJECT records — if SG/NACL is dropping, you'll see them. (5) Check Network Firewall logs separately if NFW is in the path — drops there don't appear in Flow Logs. (6) On the customer router, run show ip route <destination> and show ip bgp <destination> — confirm AWS is advertising the prefix and the route table contains it. (7) Check for asymmetric routing — a packet flowing AWS-to-on-prem via DX-1 with the response returning to AWS via DX-2 can break stateful firewalls; look at TGW appliance mode if inspection VPCs are involved. (8) Validate MTU end-to-end — a PMTUD black hole produces "small packets work, large packets hang" symptoms even with healthy BGP. The diagnostic order: link state → BGP route → VPC routing → SG/NACL/NFW → asymmetry → MTU. ANS-C01 frames this as "intermittent connection failure with no obvious BGP error" — the answer requires walking the entire stack.

After understanding hybrid connectivity maintenance, the natural next ANS-C01 layers are: Direct Connect VIF types and MACsec for the design-level decisions on cross-connect encryption; BGP attributes and route manipulation for the design-level traffic-engineering tools; Transit Gateway routing and attachments for the hub-and-spoke fabric; and VPC routing with longest-prefix match for the rules that determine packet egress decisions in any VPC.

Official sources

More ANS-C01 topics