examlab.net · The most efficient path to the most valuable certifications.

Hybrid Connectivity Maintenance — BGP Limits, Route Management, and PrivateLink Access

5,400 words · ≈ 27 min read

ANS-C01 Domain 3.1 deep dive into hybrid connectivity maintenance — Direct Connect BGP prefix limits (100 received per VIF), VGW vs TGW route table quotas, route summarisation, AWS maintenance windows, BFD failover testing, PrivateLink endpoint health, and the active/passive Direct Connect plus VPN backup pattern.

Do 20 practice questions → Free · No signup · ANS-C01

The AWS Certified Advanced Networking — Specialty exam (ANS-C01) Domain 3 task 3.1, "Maintain routing and connectivity on AWS and hybrid networks", is the operational counterpart to the design-heavy task 1.5. Where 1.5 asks you to architect dual Direct Connect plus VPN backup with BGP attribute manipulation, 3.1 asks you to keep that architecture healthy — diagnosing BGP session flaps, summarising routes when prefix limits are hit, scheduling failover tests, surviving AWS-side maintenance windows, validating PrivateLink endpoint health, and reconciling cross-account RAM-shared resources after they drift. Specialty-level fluency with route limits, BGP timers, BFD, and the canonical maintenance procedures separates ANS-C01 from every other AWS certification.

This topic covers task 3.1 in the depth the exam demands. We walk through the route limits that dictate AWS networking architecture (Direct Connect's 100-prefix-per-VIF cap, VGW's 100-route limit, TGW's 10,000-route limit), the route summarisation strategies for staying within those limits, the BFD and BGP timer tuning for sub-second failover, the AWS-managed maintenance windows for Direct Connect and VGW that operators must understand to avoid surprise outages, the Direct Connect failover testing procedure (intentional BGP shutdown, traffic shift, restoration), the VPN tunnel maintenance including DPD and NAT-T, the PrivateLink endpoint health patterns, and the cross-account resource sharing maintenance procedures. The narrative is grounded in the exam's frequent scenario form: "the BGP session is flapping every 90 seconds — what is the most likely cause?" with three plausible distractors and one specific correct diagnosis.

Why Maintenance Owns Task 3.1 of ANS-C01 Domain 3

Task 3.1's knowledge bullets explicitly call out "industry-standard routing protocols (BGP over Direct Connect)", "limits and quotas", "private and public access methods", and "inter-Regional and intra-Regional communication patterns". The skill bullets push you to manage routing protocols over hybrid links, maintain private access via PrivateLink and peering, use route tables to direct traffic with automatic propagation, and optimize routing over dynamic and static routing protocols (route summarization, CIDR overlap). That last skill is the highest-yield exam territory in 3.1 — almost every Specialty cohort sees a question about "Direct Connect prefix limit exceeded" or "BGP route summarisation".

The exam framing is operational, not architectural: the network was built correctly six months ago, but now something is failing, and you must diagnose and fix without re-architecting. This requires knowing the specific limits (Direct Connect accepts max 100 prefixes from on-premises per VIF), the specific failure modes (BGP session drops when prefix limit is exceeded), the specific tools (CloudWatch on ConnectionState, BGP show commands on the customer router, TGW Network Manager Route Analyzer), and the specific remediation procedures (summarise on-premises announcements, raise via support if hard quota, schedule maintenance window for the change).

ANS-C01 also tests the active/passive Direct Connect with VPN backup pattern in detail — not as a design question (that is 1.5) but as a maintenance question: "during a planned Direct Connect maintenance, how do you ensure traffic flows over the VPN without flapping?". The right answer involves AS-PATH prepending on the VPN BGP session, BFD timing, and validating BGP session state via CloudWatch metrics. Three plausible distractors lurk in that question.

Plain-Language Explanation: Hybrid Connectivity Maintenance — BGP Limits and Route Management

Maintaining a hybrid network is like maintaining a critical bridge between two cities. The bridge has weight limits, lane limits, scheduled inspections, and emergency procedures — the network has prefix limits, route limits, AWS maintenance windows, and BGP failover procedures. Three analogies anchor the concepts.

Analogy 1: The City Bridge Maintenance Schedule

A Direct Connect link is a dedicated bridge between two cities (your data centre and AWS). The bridge has a weight limit on how many trucks per direction (BGP prefix limit — 100 from on-premises to AWS per VIF). If you try to send the 101st truck, traffic stops entirely (BGP session drops) until you reduce. Route summarisation is consolidating cargo into fewer larger trucks — instead of advertising fifteen /24s, advertise one /20 covering the same ground. BGP is the dispatch protocol the bridge uses to announce which destinations are reachable; BFD is the bridge structural health monitor that pings sub-second and reports failure faster than BGP's slow keepalive timers.

AWS maintenance windows are the city's scheduled bridge inspection times — AWS posts upcoming maintenance for Direct Connect endpoints, VGWs, and Transit Gateways via the Personal Health Dashboard; during those windows the bridge may have one lane closed (one tunnel down). Without redundancy (a second bridge, the Site-to-Site VPN backup), you risk traffic stopping during inspection. The active/passive failover is the rule that during normal operations all trucks use the primary bridge (Direct Connect), but if the primary closes, trucks automatically reroute over the secondary bridge (VPN); the routing announcements use AS-PATH prepending on the secondary to make it less attractive — the dispatcher announces "this route is longer, use it only if no other route exists".

Analogy 2: The Power Grid Substation

A Direct Connect link is a high-voltage transmission line between your factory and the regional power utility (AWS). The line has a rated capacity (BGP prefix limit) and the substation has route table slots (VGW or TGW route table size limits). The BGP session is the power-line synchronisation signal — both ends must stay in lock-step or the line trips. BFD is the fault interrupter that responds in milliseconds vs the seconds BGP would take. Route summarisation is transmitting at higher voltage to cover the same distance with fewer conductors — combining many small announcements into a few large ones. AWS maintenance windows are the utility's scheduled substation maintenance, announced in advance via the Personal Health Dashboard; redundant power feeds (dual Direct Connect or DX + VPN) keep the factory operational during maintenance. Active/passive routing is the utility's preferred-feed-vs-backup-feed configuration — primary feed handles normal load, backup feed kicks in within seconds when primary drops.

Analogy 3: The Regional Airline Hub Operations

A Direct Connect is a dedicated commercial flight slot between your home airport and the AWS hub airport. The slot has a maximum number of cargo destinations announced per flight (BGP prefix limit). The BGP session is the flight schedule synchronisation with airport operations. BFD is the rapid weather-warning radio — sub-second failure detection. Route summarisation is consolidating multiple destination tags into one container — instead of announcing each individual neighbourhood your trucks deliver to, announce the metro region. AWS maintenance windows are the runway maintenance schedule, posted on the AWS Personal Health Dashboard; flights during maintenance windows may be diverted to backup airports. The VPN backup is the chartered backup flight on a different carrier — used only when the primary is unavailable; AS-PATH prepending is the flight plan that adds extra waypoints to make this route less preferred than the primary commercial flight.

For 3.1 exam questions about BGP limits, prefer the bridge with weight limit mental model — it captures both the hard cap (100 prefixes) and the consequence of exceeding (session drops). For failover and maintenance windows, the bridge inspection schedule with backup route is the highest-yield analogy. For route summarisation, the bigger trucks carrying more cargo per trip image makes the supernetting concept intuitive. Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/limits.html

AWS Networking Route Limits — The Numbers That Shape Architecture

ANS-C01 tests the specific quotas and limits that drive hybrid network design. Memorise these.

Direct Connect BGP prefix limits

  • 100 prefixes maximum received from on-premises per Private VIF or Transit VIF. Hard quota — exceeding causes the BGP session to drop, not just suppression.
  • 1000 prefixes maximum received per Public VIF — higher because Public VIFs can advertise on-prem networks to AWS public services.
  • AWS advertises all VPC CIDRs of attached VPCs to your on-premises router — so if you have 50 VPCs attached to a Direct Connect Gateway, AWS will advertise 50+ prefixes to you. This is normally fine because customer routers typically handle hundreds of thousands of BGP prefixes; the constraint is in the AWS direction, not the customer direction.

VGW route limits

  • 100 routes propagated from a single VGW to a VPC route table. Exceeding causes propagation to fail silently for excess routes — not a session drop.
  • 20 connections (VPN + Direct Connect VIFs) that can attach to a single VGW.

Transit Gateway quotas

  • 10,000 routes per TGW route table — generous, but exceedable in very large multi-region deployments.
  • 20 TGW route tables per TGW by default (raisable).
  • 5,000 attachments per TGW (raisable to 50,000 via AWS Support).
  • 20 virtual private gateways (one per VPC) per Direct Connect Gateway via Private VIF; 3 Transit Gateways per Direct Connect Gateway via Transit VIF.

VPC route table limits

  • 50 routes per VPC route table by default; raisable to 1,000 — but performance starts to degrade above ~500 routes.
  • 5 secondary CIDR blocks per VPC by default; raisable.

Why these numbers matter for maintenance

  • Adding a new VPC to a TGW with on-premises attachment can push past Direct Connect's 100-prefix cap if on-prem already advertises 99 prefixes — the new VPC's CIDR pushes the count over the limit and the BGP session drops.
  • A new on-prem subnet propagated via BGP may be the 101st prefix that drops the session.
  • VGW route table propagation hitting 100 routes silently drops new routes — connectivity to new on-prem networks fails without an obvious "session down" signal.

The single most-tested ANS-C01 BGP-limit fact: Direct Connect's 100-prefix-from-on-premises limit per Private VIF or Transit VIF is enforced by dropping the BGP session entirely when exceeded. This is not a soft quota where excess prefixes are silently discarded — the session goes down, all on-prem connectivity through that VIF stops, and the CloudWatch ConnectionState alarm fires. The fix is route summarisation on the customer-side BGP advertisements before re-establishing the session. For example, if on-prem advertises 120 contiguous /24s, summarise them into a handful of supernets (a /18, a /19, a /20, and a /21 cover exactly 120 /24s), reducing the announced count from 120 to 4. AWS does not summarise the routes for you. Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/limits.html
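The collapse arithmetic can be checked with Python's standard-library ipaddress module; a minimal sketch, assuming a hypothetical advertisement of 120 contiguous /24s:

```python
import ipaddress

# Hypothetical on-prem advertisement: 120 contiguous /24s
# (10.0.0.0/24 .. 10.0.119.0/24), over the 100-prefix hard limit.
advertised = [ipaddress.ip_network(f"10.0.{i}.0/24") for i in range(120)]
assert len(advertised) > 100  # the BGP session would drop at this count

# collapse_addresses merges exactly-adjacent networks into supernets
# without covering any address space that was not advertised.
summarised = list(ipaddress.collapse_addresses(advertised))
print([str(n) for n in summarised])
# ['10.0.0.0/18', '10.0.64.0/19', '10.0.96.0/20', '10.0.112.0/21']
print(len(summarised))  # 4, comfortably under the 100-prefix cap
```

The same collapse could run as a pre-change check in a CI pipeline before the customer router's advertisement list is updated.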

Route Summarisation Strategies

Summarisation (also called supernetting or aggregation) is the operational lever for staying within prefix limits.

Customer-side summarisation

The customer router announces a larger CIDR that contains all of the smaller CIDRs you want to make reachable. For example, instead of announcing 10.0.0.0/24, 10.0.1.0/24, ..., 10.0.15.0/24 (16 prefixes), announce 10.0.0.0/20 (one prefix covering the same range). Within the AWS network, return traffic to any 10.0.x.x address that falls within the /20 routes correctly across the link.

Trade-offs: summarisation works only when the underlying CIDRs are contiguous. A non-contiguous range like 10.0.0.0/24 and 10.5.0.0/24 cannot be summarised into a single supernet without including unallocated space (which causes black-holing of traffic to that unallocated space). For non-contiguous, summarise into the smallest supernet covering each contiguous chunk.
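A sketch of that trade-off, using the two non-contiguous /24s from the paragraph above (stdlib ipaddress only):

```python
import ipaddress

a = ipaddress.ip_network("10.0.0.0/24")
b = ipaddress.ip_network("10.5.0.0/24")

# Non-contiguous networks do not merge: safe summarisation keeps them separate.
print([str(n) for n in ipaddress.collapse_addresses([a, b])])
# ['10.0.0.0/24', '10.5.0.0/24']

# Forcing a single covering supernet means shrinking the prefix length
# until one network contains both, which also advertises unallocated space.
covering = a
while not b.subnet_of(covering):
    covering = covering.supernet()
print(covering)  # 10.0.0.0/13

blackholed = covering.num_addresses - a.num_addresses - b.num_addresses
print(blackholed)  # 523776 addresses of black-hole space inside the supernet
```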

AWS-side advertisement

AWS does not automatically summarise the VPC CIDRs it advertises to your on-premises router. Each attached VPC's CIDR is announced as-is. If you have 30 VPCs each with /16 CIDRs, you receive 30 BGP prefixes from AWS. To reduce, you would need to design VPCs with contiguous CIDRs that could be summarised — but AWS still advertises each individually. The practical implication: AWS-side prefix count grows with attached VPCs.

Allowed Prefixes filter

For a Public VIF advertising your on-premises networks to AWS public services, you specify an "allowed prefixes" filter — AWS will accept only those prefixes from your on-premises BGP announcements. This prevents accidental hijacking and serves as a sanity check on what you announce. For Private VIFs and Transit VIFs there is no equivalent content filter on inbound announcements; AWS accepts whatever you advertise, up to the 100-prefix cap.

When summarisation isn't enough

If you genuinely need more than 100 prefixes advertised from on-premises, the options are:

  • Multiple VIFs on the same connection — each VIF has its own 100-prefix budget.
  • Multiple Direct Connect connections — each with its own VIFs.
  • Restructure on-premises addressing to enable more aggressive summarisation.
  • Engage AWS Support for a quota raise (rarely granted; typically AWS recommends summarisation first).

A persistent ANS-C01 distractor: candidates assume AWS will helpfully summarise the routes it advertises. It does not. Each VPC CIDR attached to a Direct Connect Gateway or a Transit Gateway is advertised as a separate BGP prefix to your on-premises router. If you have 60 VPCs and your on-premises router has a small BGP table size limit (some legacy customer-edge devices), the AWS announcements alone exceed the table. The fix is on the customer-router side — most modern routers handle hundreds of thousands of prefixes — or to consolidate VPCs. Memorise: the 100-prefix limit on Direct Connect is inbound to AWS (from on-prem), not outbound from AWS. AWS-side advertisement is unbounded by that specific limit. Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/routing-and-bgp.html

BGP Timer Tuning and BFD

BGP keepalive and hold timers

Default BGP timers on AWS Direct Connect: keepalive 30 seconds, hold time 90 seconds. This means BGP detects a session failure 60–90 seconds after the actual loss — too slow for mission-critical workloads. Reducing the timers to keepalive 10 / hold 30 detects failures faster but adds CPU load on both routers and is not always supported uniformly.

BFD — sub-second failover

Bidirectional Forwarding Detection (BFD) is a lightweight protocol designed for sub-second link failure detection. Both routers send BFD packets every 100–300 milliseconds; missing 3–5 consecutive packets declares the link down. AWS Direct Connect supports asynchronous BFD with a default detection time of ~900 ms (300ms × 3), tunable down to ~150 ms (50ms × 3) for ultra-fast failover.

BFD is enabled by default on the AWS side of each VIF; you turn it on at the customer router, and the BFD session runs once both ends are configured. AWS ties BFD to the BGP session — when BFD detects link failure, BGP is immediately torn down and traffic shifts to the backup path (VPN, second DX) without waiting for BGP hold time. Without BFD, a clean BGP graceful shutdown takes 5–10 seconds at best; an ungraceful failure takes 90+ seconds.
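The detection-time arithmetic, using the timer values quoted in this section (a sketch, not a configuration tool):

```python
def detection_time_ms(interval_ms: int, multiplier: int) -> int:
    """Time to declare a link dead: N consecutive missed packets at a fixed interval."""
    return interval_ms * multiplier

# BGP alone: failure is detected only when the hold timer expires without a keepalive.
bgp_default_ms = 90 * 1000                  # default hold time 90 s
bgp_tuned_ms = 30 * 1000                    # keepalive 10 / hold 30 tuning

# BFD: sub-second detection.
bfd_default_ms = detection_time_ms(300, 3)  # 900 ms AWS default
bfd_tuned_ms = detection_time_ms(50, 3)     # 150 ms aggressive tuning

print(bfd_default_ms, bfd_tuned_ms)         # 900 150
print(bgp_default_ms // bfd_default_ms)     # default BGP is 100x slower than default BFD
```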

BGP MD5 authentication

AWS supports MD5 password authentication on BGP sessions, recommended for security. Both ends must configure the same password; mismatched passwords prevent the BGP session from establishing. SHA-based and stronger authentication is not currently supported on AWS BGP sessions — MD5 is the only option.

BGP graceful restart

AWS Direct Connect supports BGP graceful restart, allowing the BGP session to recover without traffic loss if the BGP control plane on either router restarts (e.g., during a software upgrade). Graceful restart is enabled by default; it sets a "stale routes timer" so existing forwarding state is preserved while BGP re-establishes.

  • BGP keepalive: how often BGP sends "I'm alive" messages (default 30s).
  • BGP hold time: how long BGP waits for keepalives before declaring failure (default 90s).
  • BFD: Bidirectional Forwarding Detection; sub-second failure detection (default 900ms on DX).
  • BGP MD5: password-based BGP session authentication (only option on AWS).
  • Graceful restart: BGP control-plane restart without forwarding-plane loss.
  • AS-PATH prepending: artificially adding AS hops to make a route less preferred.
  • LOCAL_PREF: BGP attribute influencing outbound path selection; local to your AS (carried in iBGP only), so it is never transmitted to AWS across the eBGP session.
  • MED: Multi-Exit Discriminator; influences inbound traffic from AWS to on-prem.
  • NO_EXPORT / NO_ADVERTISE: BGP communities scoping route propagation.
  • Allowed prefixes filter: per-VIF list of acceptable prefix announcements (Public VIF).
  • Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/bfd-enable.html

AWS-Managed Maintenance Windows

AWS performs scheduled maintenance on Direct Connect endpoints, Virtual Private Gateways, and Transit Gateways. Operators must know the announcement channel, the impact, and the redundancy requirements to ride through.

AWS Personal Health Dashboard

Maintenance announcements appear on the AWS Personal Health Dashboard (now part of AWS Health) for the affected account. EventBridge can subscribe to AWS Health events to automate notifications via SNS, Slack, or PagerDuty integration. Maintenance is typically announced 7–14 days in advance.

Direct Connect endpoint maintenance

AWS may need to upgrade the AWS-side router that terminates a Direct Connect VIF. During the maintenance window, the BGP session for that VIF goes down and traffic must reroute to a redundant path. AWS recommends:

  • Two Direct Connect connections at two different AWS Direct Connect locations for full redundancy.
  • Two VIFs on the same connection are NOT redundancy — both terminate on the same AWS router and both go down together during that router's maintenance.
  • Active/passive failover with BGP — primary VIF gets all traffic; passive VIF takes over during primary's maintenance via BGP automatic re-routing.

VGW maintenance

VGWs are managed services and AWS handles the maintenance. The VGW supports tunnel resilience natively; during maintenance, one tunnel may go down briefly. As long as the customer gateway uses both tunnels, traffic continues on the other.

TGW maintenance

TGW is highly resilient by design — the multi-AZ deployment means individual control-plane updates do not affect data-plane traffic. Customers rarely see TGW maintenance impact unless specific cross-attachment routing is being modified.

Customer maintenance procedures

For customer-side router or firewall maintenance affecting BGP sessions:

  • Schedule during low-traffic windows if no redundancy.
  • Announce via internal change management so on-call understands the expected BGP flap.
  • Use AS-PATH prepending or BGP graceful shutdown to drain traffic from the path before taking it down.
  • Validate the backup path in advance — demonstrate via a controlled test that traffic flows over the secondary correctly.

A subtle ANS-C01 trap. Operators sometimes provision two VIFs (e.g., Private VIF and Transit VIF) on a single Direct Connect connection thinking that gives them redundancy. It does not — both VIFs terminate on the same AWS router; if AWS performs maintenance on that router, both VIFs flap simultaneously. True redundancy requires two separate Direct Connect connections at two different Direct Connect locations, ideally in different physical paths. The "highest resilience" tier is two connections at two locations with dual customer-edge routers. SCS-C02 and SAA-C03 occasionally test this; ANS-C01 always tests it. Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/Welcome.html

Direct Connect Failover Testing and Validation

Operationally, regular failover tests validate that your redundancy actually works before you need it.

Planned failover test procedure

The canonical procedure: (1) Notify stakeholders of the test window. (2) Prepare the rollback plan — exact commands to restore primary, expected timing. (3) Initiate the failover — administratively shut down BGP on the primary VIF (shutdown on the BGP neighbor in customer router config), or temporarily remove the primary's static route. (4) Validate — confirm CloudWatch metrics show traffic flowing on the secondary path; check application metrics for any errors during the transition; check the duration of the gap (should be sub-second with BFD, 5–30s without). (5) Restore — bring the primary BGP session back up; verify traffic returns to the primary; confirm no asymmetric routing remains. (6) Document — failover time, any unexpected behavior, follow-up actions.
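Step (4)'s validation logic can be sketched as a simple predicate over metric samples already fetched from CloudWatch; the function name and sample values are illustrative, not an AWS API:

```python
def failover_succeeded(primary_state: int, secondary_state: int,
                       primary_bps_egress: float, secondary_bps_egress: float) -> bool:
    """Mid-test check: primary down (ConnectionState 0), secondary up (1),
    and traffic actually flowing on the secondary path."""
    return (primary_state == 0
            and secondary_state == 1
            and primary_bps_egress == 0
            and secondary_bps_egress > 0)

# Hypothetical mid-test snapshot: primary VIF shut down, secondary carrying ~400 Mbps.
print(failover_succeeded(0, 1, 0.0, 400e6))  # True
# Primary went down but traffic never shifted: the test failed.
print(failover_succeeded(0, 1, 0.0, 0.0))    # False
```

A real harness would pull ConnectionState and ConnectionBpsEgress samples for both connections and evaluate this predicate over the whole test window, not a single snapshot.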

Failover test frequency

For mission-critical hybrid connectivity, quarterly failover testing is the AWS recommendation. For lower-tier workloads, annual testing combined with cloud-side configuration audits suffices. Failover testing is also useful immediately after any significant network change — a new VPC attached to TGW, a new on-prem subnet announced, a customer-router upgrade.

Validation metrics

CloudWatch metrics that prove successful failover:

  • ConnectionState — primary 0, secondary 1 during the test; both 1 after restore.
  • ConnectionBpsEgress / ConnectionBpsIngress — should drop on primary during failover, rise on secondary.
  • VPC Flow Logs at the workload ENIs — should show continuous traffic with brief gap during transition.
  • Application-level metrics — request counts, error rates, latency p99 during the test window.

AS-PATH prepending for active/passive

The mechanism that makes traffic prefer the primary in normal operation but use the secondary during failover. On the secondary VPN, configure BGP to prepend its own AS multiple times when announcing on-prem prefixes — route-map AS_PREPEND_OUT permit 10 with set as-path prepend 65000 65000 65000. AWS sees the secondary path as longer and prefers the primary. When primary fails, the secondary becomes the only path and traffic shifts. Restoring primary reverses the preference automatically.
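A minimal sketch of the AS_PATH-length tie-break (the ASN and prepend count are illustrative; real BGP best-path selection considers other attributes first):

```python
def best_path(paths: list[list[int]]) -> list[int]:
    """Pick the advertisement with the shortest AS_PATH, other attributes being equal."""
    return min(paths, key=len)

primary = [65000]                        # Direct Connect: on-prem AS appears once
backup = [65000, 65000, 65000, 65000]    # VPN: same AS prepended three extra times

# Normal operation: both paths present, primary wins on AS_PATH length.
print(best_path([primary, backup]))      # [65000]

# Primary withdrawn (failure or maintenance): the backup is the only path left.
print(best_path([backup]))               # [65000, 65000, 65000, 65000]
```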

A surprisingly common operational failure: an engineer initiates a failover test without a clear rollback procedure, the secondary path has a misconfiguration the engineer didn't anticipate, traffic is partially down, and the engineer takes 30 minutes to figure out the rollback. ANS-C01 doesn't test this directly, but it tests the prerequisites for safe failover testing: documented rollback, validation metrics, AS-PATH or weight tuning so traffic shifts cleanly, both paths pre-validated independently. The exam-canonical answer to "how do you safely test Direct Connect failover" includes BGP graceful shutdown, AS-PATH prepending, BFD-enabled fast failure detection, and CloudWatch dashboard for real-time validation. Reference: https://docs.aws.amazon.com/directconnect/latest/UserGuide/routing-and-bgp.html

Site-to-Site VPN Tunnel Maintenance

Site-to-Site VPN tunnels need ongoing maintenance: monitoring tunnel state, handling DPD timeouts, NAT-T traversal for clients behind NAT, and rekeying.

DPD (Dead Peer Detection)

Dead Peer Detection (DPD) is the IPsec mechanism for detecting an unresponsive peer and tearing down the IKE association. AWS sends DPD probes every 10 seconds by default; after 30 seconds of no response (3 missed probes), the tunnel is declared down and re-negotiated. On the customer-gateway side, DPD timing should match — most enterprise firewalls support configurable DPD intervals.

NAT-T (NAT Traversal)

If the customer gateway sits behind NAT, IPsec ESP traffic (protocol 50) cannot traverse NAT. NAT Traversal (NAT-T) encapsulates ESP in UDP/4500, allowing NAT translation. AWS Site-to-Site VPN supports NAT-T automatically; the customer router must be configured to negotiate NAT-T at IKE time. Detection: if the customer router is behind NAT, NAT-T is auto-detected and used.

Tunnel rekey

By default, AWS Site-to-Site VPN tunnels rekey IKE Phase 1 every 8 hours and IPsec Phase 2 every 1 hour. Rekey is transparent to applications — traffic continues during the negotiation. Customer gateway must support modern cipher suites (AES-256, SHA-2, DH group 14+) for compatibility with current AWS defaults.

Tunnel endpoint replacement

AWS may replace tunnel endpoint IPs during planned maintenance. The customer gateway must be configured with both tunnel endpoint IPs (each tunnel has its own pair) and use both for HA. If only one is configured, AWS-side endpoint replacement breaks the connection until the customer router is updated. The Personal Health Dashboard announces these.

CloudWatch metrics for VPN

  • TunnelState: 1 if up, 0 if down. Alarm on this for paging.
  • TunnelDataIn, TunnelDataOut: validate active/active sharing or single-tunnel domination.
  • TunnelIpAddress: a CloudWatch dimension (not a metric) identifying the AWS-side endpoint IP, used to filter the tunnel metrics per tunnel.

PrivateLink Endpoint Maintenance

PrivateLink (interface and gateway endpoints) has its own operational considerations.

Interface endpoint health

Each interface endpoint provisions ENIs in your subnets — one ENI per AZ where the endpoint is enabled. The ENIs are managed by AWS but are visible in your account. Health-check considerations:

  • Endpoint status: pending, available, deleting, deleted, rejected, failed. Monitor via DescribeVpcEndpoints API and alarm on non-available state.
  • Per-AZ availability: each AZ's endpoint ENI is independent; an AZ-level issue affects only that AZ's endpoint.
  • DNS resolution: with private DNS enabled, the endpoint overrides public DNS. Verify resolution from within the VPC using dig <service-name> and ensure responses point to ENI private IPs, not public IPs.
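The monitoring bullet above can be sketched as a filter over DescribeVpcEndpoints output; the State and VpcEndpointId field names match the API response shape, while the endpoint data here is hypothetical:

```python
def unhealthy_endpoints(endpoints: list[dict]) -> list[str]:
    """Return IDs of VPC endpoints not in the 'available' state.
    Input is shaped like the VpcEndpoints list from DescribeVpcEndpoints."""
    return [e["VpcEndpointId"] for e in endpoints if e["State"] != "available"]

# Hypothetical API response fragment.
endpoints = [
    {"VpcEndpointId": "vpce-0a1b2c3d", "State": "available"},
    {"VpcEndpointId": "vpce-4e5f6a7b", "State": "failed"},
    {"VpcEndpointId": "vpce-8c9d0e1f", "State": "pendingAcceptance"},
]
print(unhealthy_endpoints(endpoints))  # ['vpce-4e5f6a7b', 'vpce-8c9d0e1f']
```

In production the list would come from a scheduled DescribeVpcEndpoints call, with any non-empty result pushed to an alarm topic.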

Endpoint security group drift

Interface endpoints have security groups; rule changes drift over time. Periodically audit endpoint SGs to ensure they allow access only from intended client subnets. AWS Config rules can flag endpoints with overly permissive SGs.

Endpoint policies

VPC endpoint policies are IAM-style policies attached to the endpoint that restrict which actions are allowed through it. Updates to endpoint policies require careful change management — an overly restrictive policy denies all traffic through the endpoint, breaking client applications. Test policy changes in a non-prod endpoint first.

Cross-account endpoint sharing maintenance

When a service provider exposes a PrivateLink endpoint service to consumer accounts, the acceptance workflow can drift. New consumer accounts request connection; the provider accepts manually (unless auto-accept is enabled). When provider scaling adds or removes NLB targets behind the endpoint service, consumers see no impact (transparent to the endpoint). When provider moves to a new NLB or endpoint service ID, consumers must update their endpoint configuration — this is a breaking change that requires coordination.

Direct Connect SiteLink

Direct Connect SiteLink is a feature that allows traffic between two Direct Connect locations to traverse the AWS backbone without first hairpinning to a VPC. Useful when an organisation has multiple branch offices each connected to AWS via Direct Connect — SiteLink enables direct branch-to-branch traffic over AWS-internal links instead of internet.

SiteLink is enabled per-VIF (Private or Transit). Pricing: SiteLink-eligible data transfer is charged at a separate rate from standard Direct Connect data transfer. SiteLink is region-aware: cross-region SiteLink uses inter-region pricing.

For maintenance, SiteLink-enabled VIFs follow the same Direct Connect maintenance rules — the AWS-side endpoint may be subject to scheduled maintenance, requiring redundant SiteLink-enabled VIFs at multiple Direct Connect locations.

Quick Reference: Limits and Timers

  • Direct Connect Private/Transit VIF prefix limit: 100 received from on-prem (hard, drops session when exceeded).
  • Direct Connect Public VIF prefix limit: 1000 received from on-prem.
  • VGW route propagation limit: 100 routes (silent failure beyond).
  • TGW route table size limit: 10,000 routes per table.
  • TGW route tables per TGW: 20 default (raisable).
  • TGW attachments per TGW: 5,000 default (raisable to 50,000).
  • DXGW association limits: 20 VGWs per DXGW (Private VIF path); 3 TGWs per DXGW (Transit VIF path).
  • Default BGP keepalive / hold time on DX: 30s / 90s.
  • BFD on Direct Connect: default 300ms × 3 = 900ms detection; tunable to 150ms.
  • AWS BGP MD5 authentication: only option (no SHA).
  • VPN DPD default: AWS sends every 10s; declares dead after 30s (3 missed).
  • VPN tunnels per connection: 2 (both should be configured at customer gateway).
  • VPN IKE Phase 1 rekey: every 8 hours by default.
  • VPN IKE Phase 2 rekey: every 1 hour by default.
  • Two Direct Connect connections at two locations: minimum for true redundancy.
  • Two VIFs on one connection: NOT redundancy (same AWS router).
  • Reference: https://docs.aws.amazon.com/vpc/latest/tgw/transit-gateway-quotas.html

Common Traps Recap — Domain 3.1

The traps the exam writes most frequently in hybrid connectivity maintenance.

Trap 1: AWS summarises VPC CIDRs in BGP advertisements

Wrong. Each VPC CIDR is announced as a separate BGP prefix. Customer-side summarisation is the only summarisation in the system.

Trap 2: 100-prefix limit on Direct Connect is a soft cap

Wrong. It is a hard limit; exceeding drops the BGP session entirely.

Trap 3: Two VIFs on one Direct Connect provide redundancy

Wrong. Same AWS router; same maintenance window. Two connections at two locations is the minimum.

Trap 4: BFD eliminates BGP timer tuning

Wrong. BFD provides fast failure detection; BGP timers still govern session keepalive and graceful restart. Both should be tuned.

Trap 5: VGW supports more than 100 propagated routes

Wrong. 100 routes is the limit; new propagated routes silently drop without alarm.

Trap 6: NAT-T must be manually configured

Wrong. AWS Site-to-Site VPN auto-detects NAT and uses NAT-T transparently when the customer gateway negotiates it.

Trap 7: VPN tunnel rekey causes traffic loss

Wrong. Rekey is transparent — traffic continues during IKE Phase 1/2 negotiation.

Trap 8: AWS-managed maintenance never affects customer traffic

Wrong. Direct Connect endpoint maintenance does affect a single VIF on a single connection. Redundancy is required to ride through.

Trap 9: Failover testing is optional for redundant designs

Wrong. Untested failover paths frequently fail when actually needed. Quarterly testing is the AWS-recommended cadence.

Trap 10: AS-PATH prepending makes the path unusable

Wrong. AS-PATH prepending makes a path less preferred but still usable. Traffic flows over the prepended path when no shorter path exists.

Trap 11: PrivateLink endpoints need no maintenance attention

Wrong. Per-AZ ENIs are subject to AWS maintenance individually; endpoint provider services may also have maintenance windows.

Trap 12: SiteLink replaces VPC connectivity

Wrong. SiteLink only enables direct on-prem-to-on-prem traffic over the AWS backbone for traffic between Direct Connect locations. It does not replace VPC connectivity.

Decision Matrix — Maintenance Construct for Each ANS-C01 Goal

Maintenance goal → primary construct (notes):

  • BGP session dropping with "max-prefixes exceeded" → customer-side route summarisation (AWS does not summarise).
  • Sub-second failure detection on Direct Connect → enable BFD on both ends (default 900 ms; tunable lower).
  • Active/passive Direct Connect + VPN → AS-PATH prepending on the VPN BGP session (VPN prepended; primary preferred).
  • True redundancy for Direct Connect → two connections at two locations (NOT two VIFs on one connection).
  • Surviving AWS Direct Connect maintenance → active/passive with a redundant connection (windows posted on the Personal Health Dashboard).
  • Detect BGP flapping → CloudWatch ConnectionState alarm plus BGP show commands on the customer router (check both endpoints).
  • Surviving a customer-router upgrade → BGP graceful shutdown plus AS-PATH draining (pre-drain traffic; restart; reattach).
  • VGW propagation failures → check the 100-route limit (silent failure beyond).
  • TGW route table near 10,000 routes → summarise and use multiple route tables (per-segment isolation).
  • Quarterly failover validation → documented test procedure plus CloudWatch dashboard (rollback plan mandatory).
  • Detect SNAT exhaustion on the hybrid path → combine NAT Gateway ErrorPortAllocation with customer-side stats (look at both ends).
  • PrivateLink endpoint health → DescribeVpcEndpoints plus per-AZ ENI status (alarm on non-available).
  • Cross-account TGW share drift → AWS Config rule on TGW attachment configuration (detect unauthorised changes).
  • MAC address changes after VGW maintenance → use IP, not MAC, in customer firewall rules (MACs may change).
  • On-prem-to-on-prem traffic over the AWS backbone → Direct Connect SiteLink (per-VIF feature).

FAQ — Hybrid Connectivity Maintenance

Q1: Our BGP session on Direct Connect just dropped — how do I diagnose whether we hit the 100-prefix limit?

The diagnostic ladder. (1) Check CloudWatch on the Direct Connect connection: ConnectionState should show 0; the alarm should have fired. (2) SSH to the customer-edge router and run show ip bgp summary (Cisco) or equivalent. Look at the BGP session state — if it shows "Idle (PfxCt)" or "Idle (max-prefix)" the customer router is reporting that it received too many prefixes from AWS (rare, since AWS rarely advertises >100). If it shows session timeout or hold-time expiry without prefix-count, the issue is elsewhere. (3) Check the AWS-side BGP: in the Direct Connect console, the VIF will show "Down" with a reason such as "BGP_PEER_NEIGHBOUR_RECEIVED_TOO_MANY_PREFIXES". (4) Count prefixes being announced from on-prem: on the customer router, show ip bgp neighbors <aws-bgp-peer> advertised-routes | count. If this count is ≥100, you have hit the limit. (5) Summarise: identify contiguous CIDR ranges and consolidate announcements into supernets. Re-establish BGP session. (6) Document and alert: add a CloudWatch alarm monitoring "advertised prefix count" — a custom metric you publish from the customer-router via SNMP-to-CloudWatch — to alert when count nears 90 (early warning before 100).
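The early-warning logic in step (6) can be sketched in Python. The namespace, metric name, and 90-prefix warning threshold below are illustrative assumptions, not AWS-defined values; the actual prefix count would come from your router via SNMP.

```python
# Sketch: early-warning classification for the advertised-prefix count,
# before AWS tears down the BGP session at the 100-prefix VIF limit.
def prefix_limit_state(advertised: int, warn: int = 90, limit: int = 100) -> str:
    """Classify the advertised-prefix count against the per-VIF limit."""
    if advertised >= limit:
        return "CRITICAL"   # session will be (or has been) torn down
    if advertised >= warn:
        return "WARN"       # early warning: summarise before hitting 100
    return "OK"

def publish_metric(count: int) -> None:
    """Publish the count as a custom CloudWatch metric (requires credentials;
    namespace and metric name are assumed, not AWS-defined)."""
    import boto3
    boto3.client("cloudwatch").put_metric_data(
        Namespace="Hybrid/BGP",
        MetricData=[{"MetricName": "AdvertisedPrefixCount", "Value": count}],
    )

if __name__ == "__main__":
    print(prefix_limit_state(87), prefix_limit_state(95), prefix_limit_state(101))
```

A CloudWatch alarm on the custom metric at the WARN threshold then gives you headroom to summarise before the session drops.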

Q2: I have BGP configured with default timers and BFD disabled — how long will my failover take during a Direct Connect outage?

With default timers and no BFD, failure detection is 60–90 seconds. The BGP keepalive timer is 30 seconds, and the hold time is 90 seconds — meaning BGP must miss 3 keepalives before declaring the session dead. After session declaration, BGP must converge to the alternative path, which takes additional seconds (typically 3–10s for AS-PATH-prepended VPN backup). Total time-to-recovery: 65–100 seconds. This is unacceptable for most production workloads. With BFD enabled at default 300ms × 3 = 900ms detection, total recovery drops to about 1–5 seconds. With aggressive BFD timing of 50ms × 3 = 150ms, total recovery approaches sub-second — the gold standard for mission-critical hybrid links. Enabling BFD requires both AWS and customer-router support; most modern enterprise routers (Cisco, Juniper, Arista, Palo Alto) support BFD. Configure both ends, test, and validate via failover drills.
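The timer arithmetic above, as a quick sanity check (the 3–10 s convergence figures are the assumptions from the text, not measured values):

```python
# Back-of-envelope failover arithmetic: detection time is N consecutive
# missed hellos/keepalives, then BGP converges to the backup path.
def detection_seconds(interval_s: float, multiplier: int = 3) -> float:
    return interval_s * multiplier

bgp_default    = detection_seconds(30)     # 90 s hold time (3 x 30 s keepalive)
bfd_default    = detection_seconds(0.300)  # 0.9 s
bfd_aggressive = detection_seconds(0.050)  # 0.15 s

worst_case_no_bfd = bgp_default + 10   # ~100 s with slow convergence
best_case_bfd     = bfd_default + 3    # a few seconds total

print(bgp_default, round(bfd_default, 3), round(bfd_aggressive, 3))
```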

Q3: We need to advertise 150 on-premises prefixes to AWS — how do we work around the 100-prefix limit?

Four viable approaches in priority order. (1) Customer-side summarisation — almost always the right answer. Identify contiguous CIDR ranges that can be consolidated into supernets. For example, 150 individual /24s from 10.0.0.0/24 through 10.0.149.0/24 can all be covered by a single /16 (10.0.0.0/16). Trade-offs: summarisation works only for contiguous ranges, and the supernet may include unallocated space that should not route traffic (which is acceptable if you use VPC route tables to filter). (2) Multiple VIFs on the same Direct Connect connection — each VIF has its own 100-prefix budget. If your on-prem ranges naturally split into two groups (e.g., production 10.0.0.0/12, development 172.16.0.0/12), provision one Private VIF per group. Trade-off: more BGP sessions to maintain; not all customer routers handle multi-VIF gracefully. (3) Multiple Direct Connect connections — each with its own VIFs. The most expensive but most flexible. Often the right answer when the underlying need is also for redundancy. (4) Restructure on-premises addressing — if the prefix sprawl is the result of historical accumulation, a long-term re-IP project consolidates many small ranges into a few summarisable ranges. ANS-C01 expects answer (1) as the first response; (2) and (3) are escalations.
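The summarisation step can be prototyped with Python's standard ipaddress module. The 150 example /24s below (10.0.0.0/24 through 10.0.149.0/24) are illustrative, and show the trade-off between an exact minimal summary and a single covering supernet:

```python
import ipaddress

# 150 contiguous /24s, illustrative of the prefix-sprawl scenario above
prefixes = [ipaddress.ip_network(f"10.0.{i}.0/24") for i in range(150)]

# Exact minimal summarisation: no unallocated space advertised.
# 150 = 128 + 16 + 4 + 2, so this collapses to four supernets:
# 10.0.0.0/17, 10.0.128.0/20, 10.0.144.0/22, 10.0.148.0/23
collapsed = list(ipaddress.collapse_addresses(prefixes))

# Single covering supernet: one advertisement, but it includes
# unallocated space (10.0.150.0 - 10.0.255.255) that must be filtered.
supernet = ipaddress.ip_network("10.0.0.0/16")
assert all(supernet.supernet_of(p) for p in prefixes)

print(len(collapsed), [str(n) for n in collapsed])
```

Either result stays well under the 100-prefix limit; the choice is between advertisement count and precision.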

Q4: How do I architect for graceful Direct Connect maintenance windows announced by AWS?

The architecture. (1) Two Direct Connect connections at two different Direct Connect locations — when AWS performs maintenance on one location's router, the other location continues to serve traffic. (2) Active/passive BGP routing with AS-PATH prepending on the secondary, so primary handles all traffic in normal state and secondary takes over only during primary's maintenance. (3) BFD enabled for sub-second failover detection. (4) CloudWatch alarms on ConnectionState for both connections, with SNS escalation to on-call. (5) EventBridge rule subscribed to AWS Health events for the Direct Connect resource type, sending notifications when AWS announces maintenance for your connections — typically 7–14 days in advance. (6) Documented runbook for each maintenance scenario: which connection is being maintained, expected failover behavior, validation steps, escalation path. (7) Quarterly failover testing to validate that the maintenance-day failover will actually work. With this stack, AWS Direct Connect maintenance is a non-event — traffic shifts automatically, you receive a CloudWatch alarm during the window, and traffic shifts back when the maintenance completes. ANS-C01 expects this as the canonical answer to "how do you handle AWS-side Direct Connect maintenance gracefully".
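Step (5) can be sketched as an EventBridge event pattern. The shape below follows AWS Health event conventions (source aws.health, service DIRECTCONNECT, category scheduledChange for maintenance announcements), but verify the field values against the actual events your account receives before relying on it:

```python
import json

# Assumed event pattern for AWS Health maintenance notifications on
# Direct Connect; attach to an EventBridge rule, e.g.
#   events.put_rule(Name=..., EventPattern=json.dumps(pattern))
# and add an SNS topic as the rule target for on-call notification.
pattern = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["DIRECTCONNECT"],
        "eventTypeCategory": ["scheduledChange"],  # maintenance announcements
    },
}

print(json.dumps(pattern, indent=2))
```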

Q5: What is the role of BFD specifically, and how is it different from BGP keepalive?

BFD (Bidirectional Forwarding Detection) is a dataplane-level fast-failure-detection protocol designed to be lightweight enough to send packets every 50–300 milliseconds without significant CPU cost. BGP keepalive is a control-plane-level alive-check that runs at much slower intervals (30 seconds default) and uses the BGP TCP session itself. The asymmetry: BFD detects link failure in milliseconds (e.g., 900ms with default settings); BGP keepalive detects in 60–90 seconds (3 missed keepalives × 30s = 90s hold time). BFD packets are tiny UDP datagrams sent independently of BGP, so a BGP control plane that is responsive but unable to forward traffic (e.g., CPU pegged) will be detected by BFD but not by BGP keepalive. AWS Direct Connect supports BFD with both async mode (each side sends independently) and demand mode (poll-on-demand), with async being the default. BFD failure tears down the BGP session immediately, which then converges to alternative paths via standard BGP processes. The two protocols are complementary, not redundant — you should run both, and BFD is the difference between sub-second and minute-scale failover.

Q6: How do we monitor PrivateLink endpoint health operationally?

Multiple complementary signals. (1) DescribeVpcEndpoints API on a schedule (Lambda + EventBridge every 5 minutes) — query each endpoint's state and ENI per-AZ availability. Alarm if state ≠ available or any AZ ENI is missing. (2) VPC Flow Logs on the endpoint ENIs — periodic queries via Athena to confirm traffic is flowing to/from the endpoint. A sudden drop in flow volume can indicate downstream issues even before the AWS-managed health detects them. (3) DNS resolution validation — for endpoints with private DNS enabled, periodic Lambda checks resolving the AWS service hostname from inside the VPC and confirming the resolved IPs match the endpoint ENI IPs. Drift here can signal DHCP option set or Resolver issues. (4) Application-level health metrics — the consuming application should publish custom metrics for its endpoint calls (latency, error rate, throughput). End-to-end success is the ultimate validation. (5) Endpoint policy validation — periodic IAM Access Analyzer review of endpoint policies to detect drift. (6) For producer-side custom endpoint services — monitor the underlying NLB health, target health, and acceptance state of consumer connections. The collection of these monitors gives full coverage of PrivateLink operational health.
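A minimal sketch of signal (1), operating on a DescribeVpcEndpoints-shaped response so it runs without credentials. Field names follow the EC2 API; the expected per-AZ ENI count of 3 is an assumption about the deployment:

```python
# Flag interface endpoints whose state is not "available" or whose
# per-AZ ENI count has dropped below the expected number of AZs.
def unhealthy_endpoints(response: dict, expected_azs: int = 3) -> list:
    flagged = []
    for ep in response.get("VpcEndpoints", []):
        state_bad = ep.get("State") != "available"
        enis_missing = len(ep.get("NetworkInterfaceIds", [])) < expected_azs
        if state_bad or enis_missing:
            flagged.append(ep["VpcEndpointId"])
    return flagged

# Sample response shaped like ec2.describe_vpc_endpoints() output
sample = {"VpcEndpoints": [
    {"VpcEndpointId": "vpce-aaa", "State": "available",
     "NetworkInterfaceIds": ["eni-1", "eni-2", "eni-3"]},
    {"VpcEndpointId": "vpce-bbb", "State": "pending",
     "NetworkInterfaceIds": ["eni-4"]},
]}

print(unhealthy_endpoints(sample))  # -> ['vpce-bbb']
```

In a Lambda, the sample dict would be replaced by the live describe_vpc_endpoints call and the flagged list published as a CloudWatch metric.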

Q7: AS-PATH prepending vs LOCAL_PREF vs MED — what does each control on AWS Direct Connect?

Each BGP attribute influences a different routing decision. AS-PATH prepending influences inbound traffic from AWS to on-premises — by making the path appear longer in AS-hops, AWS prefers shorter paths, so traffic flows over the non-prepended (primary) path inbound to your on-prem. Set this on the customer router for outbound BGP advertisements to AWS. LOCAL_PREF influences outbound traffic from on-premises to AWS — but LOCAL_PREF is iBGP-only (within an autonomous system), so AWS never sees your LOCAL_PREF settings; it influences only your internal routing decision among multiple AWS BGP peers (which is rare unless you have multiple Direct Connects to the same AWS region). MED (Multi-Exit Discriminator) is more nuanced — it influences how AWS chooses among multiple BGP paths with the same AS-PATH length, but only when those paths come from the same neighboring AS (yours). For a typical Direct Connect deployment with two connections from the same on-prem AS to AWS, MED can tip the preference between them. The exam-canonical pattern: AS-PATH prepending on the secondary VPN for active/passive Direct Connect + VPN failover. MED for tweaking between two Direct Connects. LOCAL_PREF rarely visible to AWS. Memorise the asymmetry: AS-PATH affects inbound; LOCAL_PREF affects outbound; MED is the tiebreaker.
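The AS-PATH-then-MED decision can be modelled with a toy comparator. This is a deliberate simplification of full BGP best-path selection (it ignores LOCAL_PREF, origin, and router-ID tiebreaks), and the names and ASNs are invented:

```python
# Toy model of two decision steps: shorter AS-PATH wins; MED breaks the
# tie (lower preferred) only when the tied paths come from the same
# neighbouring AS. Ties across different neighbour ASes fall through
# to list order here, where real BGP would apply further tiebreakers.
def best_path(paths: list) -> dict:
    shortest = min(len(p["as_path"]) for p in paths)
    candidates = [p for p in paths if len(p["as_path"]) == shortest]
    if len(candidates) > 1 and len({p["as_path"][0] for p in candidates}) == 1:
        candidates.sort(key=lambda p: p["med"])  # same neighbour AS: lower MED wins
    return candidates[0]

primary    = {"name": "dx-primary", "as_path": [65000],               "med": 100}
vpn_backup = {"name": "vpn-backup", "as_path": [65000, 65000, 65000], "med": 0}

# Prepended VPN loses despite its lower MED: AS-PATH is compared first.
print(best_path([primary, vpn_backup])["name"])  # -> dx-primary
```

This mirrors the exam-canonical pattern: prepending keeps the VPN as backup, while MED only matters between paths of equal AS-PATH length from the same AS.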

Q8: When is Site-to-Site VPN the right backup for Direct Connect, and when is a second Direct Connect the right answer?

Depends on the criticality and traffic volume. Site-to-Site VPN backup is appropriate when: (a) the workload can tolerate VPN's lower bandwidth (typically 1.25 Gbps per VPN connection, ECMP-aggregable to higher), (b) the cost of a second Direct Connect is not justified, (c) the recovery-time objective allows for the VPN's slightly higher latency during failover. Common pattern: a 1 Gbps Direct Connect with VPN backup for non-critical hybrid workloads. Second Direct Connect is appropriate when: (a) bandwidth needs exceed VPN aggregate capacity, (b) the workload has stringent latency SLAs that VPN cannot meet, (c) regulatory requirements mandate non-internet transit even during failover, (d) the cost of a second port is justified by the workload value. Common pattern: tier-1 mission-critical hybrid with two 10 Gbps Direct Connects at different DX locations, plus VPN as third-tier failover. For ANS-C01, scenarios that mention "regulated industry", "mission-critical", or "consistent low latency" typically favor dual Direct Connect; scenarios mentioning "cost-effective" or "small bandwidth" favor DX + VPN.

Q9: How do I plan VGW route propagation when it's nearing the 100-route limit?

Multiple options. (1) Migrate from VGW to Transit Gateway — TGW supports 10,000 routes per route table, dramatically larger limits. The migration involves attaching the TGW to all relevant VPCs, attaching the Direct Connect via Transit VIF + Direct Connect Gateway, updating VPC route tables to point at TGW instead of VGW. This is the right long-term answer for any organization growing past simple single-VPC hybrid. (2) Customer-side summarisation — reduce the BGP advertisement count from on-prem so fewer routes propagate to the VGW. Same techniques as Direct Connect prefix-limit summarisation. (3) Static routes to override propagation — if some routes are stable and don't need BGP propagation, configure them as static in the VPC route table, and rely on propagation only for the dynamic ones. (4) Multiple VPCs each with their own VGW — split the workload into smaller VPCs, each with its own 100-route VGW; total capacity multiplies. Trade-off: operational complexity. ANS-C01 expects answer (1) as the canonical — AWS recommends TGW over VGW for hybrid architectures, and VGW's limits make it a poor fit for growing environments. New designs should default to TGW.
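Before choosing among these options, measure how close the route table is to the limit. A sketch counting propagated routes in a DescribeRouteTables-shaped response; in the EC2 API, propagated routes carry Origin EnableVgwRoutePropagation:

```python
# Headroom check against the VGW's 100-propagated-route limit, operating
# on data shaped like one entry of ec2.describe_route_tables() output.
def propagated_route_headroom(route_table: dict, limit: int = 100) -> int:
    propagated = [r for r in route_table.get("Routes", [])
                  if r.get("Origin") == "EnableVgwRoutePropagation"]
    return limit - len(propagated)

sample = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16",     "Origin": "CreateRouteTable"},
    {"DestinationCidrBlock": "192.168.1.0/24",  "Origin": "EnableVgwRoutePropagation"},
    {"DestinationCidrBlock": "192.168.2.0/24",  "Origin": "EnableVgwRoutePropagation"},
]}

print(propagated_route_headroom(sample))  # -> 98
```

Publishing the headroom as a custom metric, as with the Direct Connect prefix count, turns the silent drop beyond 100 into an actionable alarm.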

Q10: What are the operational signals for "BGP session is healthy but traffic is asymmetric or missing"?

Symptom triage. (1) CloudWatch shows ConnectionState=1 for both DX connections, BGP session is up on customer router — the link is fine. (2) Application reports intermittent timeouts to certain on-prem destinations — investigation begins. (3) Run TGW Route Analyzer or Reachability Analyzer for the source-to-destination pair; verify the path is correct and the route is propagated. (4) Check VPC Flow Logs at the source ENI for REJECT records — if SG/NACL is dropping, you'll see them. (5) Check Network Firewall logs separately if NFW is in the path — drops there don't appear in Flow Logs. (6) On the customer router, run show ip route <destination> and show ip bgp <destination> — confirm AWS is advertising the prefix and the route table contains it. (7) Check for asymmetric routing — a packet flowing AWS-to-on-prem via DX-1 with the response returning to AWS via DX-2 can break stateful firewalls; look at TGW appliance mode if inspection VPCs are involved. (8) Validate MTU end-to-end — a PMTUD black hole produces "small packets work, large packets hang" symptoms even with healthy BGP. The diagnostic order: link state → BGP route → VPC routing → SG/NACL/NFW → asymmetry → MTU. ANS-C01 frames this as "intermittent connection failure with no obvious BGP error" — the answer requires walking the entire stack.

After understanding hybrid connectivity maintenance, the natural next ANS-C01 layers are: Direct Connect VIF types and MACsec for the design-level decisions on cross-connect encryption; BGP attributes and route manipulation for the design-level traffic-engineering tools; Transit Gateway routing and attachments for the hub-and-spoke fabric; and VPC routing with longest-prefix match for the rules that determine packet egress decisions in any VPC.

Official sources

More ANS-C01 topics