Cloud NAT

Introduction

Cloud NAT (Network Address Translation) is a regional, distributed, software-defined managed service that allows Google Cloud resources without external IP addresses, such as Compute Engine VMs, private GKE nodes, Cloud Run for Anthos workloads, and Serverless VPC Access connectors, to initiate outbound connections to the internet and to specific Google APIs. It performs only Source NAT (SNAT) and the corresponding response translation; it never accepts unsolicited inbound connections, which is the property that makes it both a connectivity tool and a defence-in-depth boundary. Cloud NAT is implemented inside the Andromeda network virtualization stack on the same host as your VM, so the data plane is fully distributed and there is no NAT VM, appliance, or single-tenant gateway to size, patch, or scale.

A Cloud NAT gateway always belongs to exactly one region and exactly one VPC network, and you attach it to a Cloud Router in that same region. The Cloud Router does not run BGP for Cloud NAT; instead it stores configuration, programs the Andromeda host agents that perform translation, and reports per-VM port reservations. Each gateway can serve some or all subnets in its region, can be configured for primary IP ranges only or both primary and alias IP ranges, and can carry IPv4, IPv6, or IPv4 plus NAT64 traffic depending on subnet stack type. This study note walks through the mechanics, IP allocation modes, port behaviour, logging, drop reasons, troubleshooting flow, and exam pitfalls in enough depth to handle the PCNE Cloud NAT blueprint end-to-end.

How Cloud NAT Mechanics Work (Regional Managed NAT)

Distributed Data Plane on Andromeda

Cloud NAT is not a gateway VM or a proxy hop. When a private VM sends a packet to a public destination, the Andromeda virtual switch on the VM's host rewrites the source IPv4 from the VM's internal address to one of the NAT IPs the gateway has been assigned, picks a source port from the VM's reserved range, and records the 5-tuple in a per-host connection table. Return traffic is reversed locally on the same host. Because each host translates independently, the data plane scales horizontally with VM count and has no central choke point or single point of failure.

Cloud Router Control Plane

You create a Cloud Router in the target region (for example gcloud compute routers create nat-router --region=us-central1 --network=prod-vpc), then attach the NAT gateway with gcloud compute routers nats create. The Cloud Router is the configuration anchor; it tells the fleet which subnets are eligible, which IP allocation mode applies, the port-allocation parameters, and which timeouts to use. A single Cloud Router can host multiple NAT gateways only if their subnet selections do not overlap.

Regional, Per-VPC Scope

A Cloud NAT gateway in us-central1 serves only resources whose primary internal IP belongs to a subnet in us-central1 in the same VPC. It cannot translate traffic for VMs in another region, another VPC, peered VPCs, or on-premises networks reached through Cloud VPN or Cloud Interconnect. Multi-region private workloads therefore require one gateway per region.

Cloud NAT is a Google-managed regional Source NAT service that translates the internal IPv4/IPv6 address of a private Google Cloud resource into a public NAT IP attached to a Cloud Router, providing outbound internet egress without exposing the resource to inbound internet connections.

Auto vs Manual IP Allocation

Auto-Allocate NAT IPs

With --auto-allocate-nat-external-ips, Google automatically allocates and releases external IPv4 addresses as gateway port demand grows or shrinks. Auto-allocate is the simplest mode and scales without operator intervention, but the NAT IPs themselves are not stable: Google may rotate them, so you cannot pre-register them with external services such as third-party API allowlists, partner SaaS, or banking endpoints that require a fixed source IP.

Manual IP Allocation

With --nat-external-ip-pool=ADDRESS1,ADDRESS2,... and reserved static regional external IPs, you tell Cloud NAT exactly which public IPs to source from. Manual mode is mandatory whenever an external party requires an allowlist or audit trail of source IPs. The trade-off is capacity planning: each manual IP supports up to 64,512 source ports (65,536 minus the 1024 well-known reserved ports), so you must reserve enough addresses to cover peak concurrent flows divided by the per-VM port allocation.
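
The capacity-planning rule above can be sketched as a small shell helper. This is a rough estimate only: it assumes static allocation where every VM reserves exactly the configured number of ports, and the function name nat_ips_needed is hypothetical.

```shell
#!/bin/sh
# Estimate how many manual NAT IPs a gateway needs under static allocation.
# Assumption: each VM reserves exactly ports_per_vm ports, and each NAT IP
# offers 64,512 usable source ports (65,536 minus the 1,024 well-known ports).
nat_ips_needed() {
  vm_count=$1
  ports_per_vm=$2
  ports_per_ip=64512
  total=$(( vm_count * ports_per_vm ))
  # Integer ceiling division: a partial IP still means one more address.
  echo $(( (total + ports_per_ip - 1) / ports_per_ip ))
}

nat_ips_needed 500 256   # 500 VMs at 256 ports each: prints 2
```

Treat the result as a floor; bursty workloads usually deserve headroom on top of it.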

Choosing the Right Mode

Use auto-allocate for back-office workloads pulling packages, container images, or public datasets where source IP is irrelevant. Use manual allocation for production paths to payment processors, SWIFT or FIX endpoints, vendor APIs with IP allowlists, or anywhere compliance demands a known egress IP. Manual mode is also the right choice when you want to retain a NAT IP after deleting and recreating a gateway, since unreserved auto IPs are not guaranteed to come back.

Manual NAT IP allocation pins specific reserved static external addresses to a gateway via --nat-external-ip-pool. This is the only way to guarantee that partner systems whitelisting your egress IP continue to accept traffic; auto-allocate IPs may be rotated by Google without notice and must never be used for compliance-critical egress.

Dynamic Port Allocation (DPA)

Static Port Allocation Defaults

By default each VM is reserved a fixed number of source ports per NAT IP. The default minimum is 64 ports per VM, and historically this was static: every VM got exactly the configured floor, even if the gateway had spare ports. Static allocation is predictable but wastes ports on idle VMs and starves chatty VMs that briefly need thousands of simultaneous flows.

How DPA Works

Dynamic Port Allocation, enabled with --enable-dynamic-port-allocation along with --min-ports-per-vm and --max-ports-per-vm, lets Cloud NAT scale each VM's port reservation between the minimum and maximum as demand changes. A VM idle at 64 ports can burst up to, say, 32,768 ports during a traffic spike and then release them when activity drops. DPA significantly reduces OUT_OF_RESOURCES drops for spiky workloads such as crawlers, ETL jobs, and outbound webhooks.

Tuning Min and Max

The minimum must be a power of two between 32 and 32,768; the maximum must be a power of two between 64 and 65,536, and strictly greater than the minimum. A typical production tuning is --min-ports-per-vm=64 --max-ports-per-vm=32768 for general-purpose nodes, with the maximum raised to 65,536 for very fan-out heavy hosts such as private GKE nodes running connection-pool-hungry services.
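
A quick pre-flight check of a min/max pair against the constraints stated above might look like the following sketch. The function names are hypothetical, and the bounds are copied from this note rather than queried from the API.

```shell
#!/bin/sh
# Succeeds only when n is a positive power of two.
is_power_of_two() {
  n=$1
  [ "$n" -gt 0 ] || return 1
  while [ $(( n % 2 )) -eq 0 ]; do n=$(( n / 2 )); done
  [ "$n" -eq 1 ]
}

# Succeeds when (min, max) satisfies the DPA rules described in this note:
# both powers of two, min in 32..32768, max in 64..65536, and max > min.
valid_dpa_ports() {
  min=$1
  max=$2
  is_power_of_two "$min" && is_power_of_two "$max" || return 1
  [ "$min" -ge 32 ] && [ "$min" -le 32768 ] || return 1
  [ "$max" -ge 64 ] && [ "$max" -le 65536 ] || return 1
  [ "$max" -gt "$min" ]
}

if valid_dpa_ports 64 32768; then
  echo "64/32768 is a valid pair"
fi
```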

For private GKE clusters, enable Dynamic Port Allocation and set --max-ports-per-vm=65536. GKE nodes run many Pods that share the node's NAT port reservation, so a chatty Pod can otherwise exhaust the node's 64 default ports and trigger OUT_OF_RESOURCES drops that look like flaky network behaviour to application teams.

Endpoint-Independent Mapping (EIM)

Default: Endpoint-Dependent Mapping

By default Cloud NAT uses endpoint-dependent mapping: the (NAT IP, NAT port) tuple assigned to a flow depends on both the source and destination, so the same VM connecting to two different destinations may appear from two different external ports. This maximises port reuse but breaks protocols that rely on UDP hole punching or STUN-style NAT traversal, including many WebRTC and VoIP setups.

Turning on EIM

Enabling --enable-endpoint-independent-mapping makes Cloud NAT use the same (NAT IP, NAT port) for all outbound flows from a given internal (VM IP, VM port), regardless of destination. Combined with endpoint-independent filtering, this lets peers learn the external mapping via STUN and then send packets to it from a different destination, which is the foundation of WebRTC NAT traversal.

Trade-Offs

EIM consumes more ports because each internal (VM IP, VM port) holds its external mapping until it ages out. Cloud NAT does not allow EIM and Dynamic Port Allocation on the same gateway, so an EIM gateway needs a generous static --min-ports-per-vm rather than a DPA burst ceiling. EIM is only supported for UDP; TCP always uses endpoint-dependent mapping regardless of this flag.

Endpoint-Independent Mapping applies to UDP only. Enabling --enable-endpoint-independent-mapping does nothing for TCP flows, so claims that EIM "fixes" a TCP-based protocol issue indicate a different root cause. Verify the protocol in NAT logs before blaming or tuning EIM.

NAT Logging and Observability

Two Independent Log Streams

Cloud NAT publishes two log categories to Cloud Logging under the nat_gateway resource type (log name compute.googleapis.com/nat_flows). Translation logs (TRANSLATIONS_ONLY) emit one record per successful new connection with source VM, destination, NAT IP, and NAT port. Error logs (ERRORS_ONLY) emit a record when Cloud NAT drops a connection, with a vm_ip, endpoint, and a structured drop reason such as OUT_OF_RESOURCES. ALL enables both; logging is disabled by default for cost reasons.

Enabling Logging

Use gcloud compute routers nats update NAT_NAME --router=ROUTER --region=REGION --enable-logging --log-filter=ALL or specify ERRORS_ONLY to capture only failures, which keeps log volume low while still flagging exhaustion incidents. Logs are billable Cloud Logging entries, so high-throughput gateways should usually start with ERRORS_ONLY.
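
Once error logging is on, dropped connections can be pulled back out with gcloud logging read. The sketch below assumes the NAT log entries use the nat_gateway resource type with a gateway_name label and an allocation_status field set to DROPPED; verify those field names against a real entry in your project before relying on the filter. The function name nat_drop_logs is hypothetical.

```shell
#!/bin/sh
# List recent dropped-connection records for one NAT gateway.
# Assumed field names: resource.labels.gateway_name, jsonPayload.allocation_status.
nat_drop_logs() {
  gateway=$1
  gcloud logging read \
    "resource.type=\"nat_gateway\" AND resource.labels.gateway_name=\"$gateway\" AND jsonPayload.allocation_status=\"DROPPED\"" \
    --freshness=1h --limit=20 --format=json
}

# Example (hypothetical gateway name): nat_drop_logs prod-nat
```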

Metrics in Cloud Monitoring

Cloud NAT exposes metrics including router.googleapis.com/nat/port_usage, nat/allocated_ports, nat/sent_bytes_count, nat/dropped_sent_packets_count, and nat/new_connections_count. Alert on port_usage approaching the gateway maximum and on any non-zero dropped_sent_packets_count with reason OUT_OF_RESOURCES. These signals catch port exhaustion before users notice.

Drop Reasons and OUT_OF_RESOURCES

OUT_OF_RESOURCES

The most common Cloud NAT drop reason is OUT_OF_RESOURCES, which means the VM hit its currently allocated port ceiling for the protocol in question and Cloud NAT had no spare ports to grow into. The fix is one or more of: increase --min-ports-per-vm, enable Dynamic Port Allocation with a higher --max-ports-per-vm, add more NAT IPs to expand the gateway-wide port pool, or shorten timeouts so closed flows free their ports sooner.

ENDPOINT_INDEPENDENT_CONFLICT

With EIM enabled, two flows from different VMs may try to claim the same external (IP, port) for the same internal port; Cloud NAT will drop one with ENDPOINT_INDEPENDENT_CONFLICT. Adding more NAT IPs reduces collision probability.

Other Reasons

Other drop reasons logged include NAT_ALLOCATION_FAILED when the gateway cannot allocate any port at all, and protocol-specific drops when the flow exceeds idle or established timeouts. Each drop record carries the offending VM and destination, so you can correlate to application logs directly.

OUT_OF_RESOURCES is the canonical Cloud NAT port-exhaustion drop reason. Remediation order: enable Dynamic Port Allocation, raise --max-ports-per-vm (max 65,536), add more NAT IPs (each adds up to 64,512 source ports), then tighten --tcp-established-idle-timeout and --udp-idle-timeout.

Timeout Tuning and Port Reservation

Default Timeouts

Cloud NAT applies per-protocol idle timeouts: TCP established 1200 seconds, TCP transitory 30 seconds, TCP time-wait 120 seconds, UDP 30 seconds, and ICMP 30 seconds. A port remains reserved for a flow until its idle timeout expires; long timeouts therefore lock ports against short-lived bursty workloads.

When to Shorten Timeouts

Lowering --tcp-established-idle-timeout to 600 seconds, for example, frees ports about twice as fast on long-lived but lightly used TCP connections, which can avert OUT_OF_RESOURCES on port-pressured gateways. Do not shorten timeouts below the keepalive intervals of legitimate long-running connections (database pools, persistent message-broker links) or you will sever working sessions.
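
As a sketch, the timeout change described above can be wrapped in a small helper. The function name and the gateway, router, and region names in the example are placeholders.

```shell
#!/bin/sh
# Set the TCP established idle timeout (in seconds) on an existing NAT gateway.
set_tcp_established_timeout() {
  nat=$1
  router=$2
  region=$3
  seconds=$4
  gcloud compute routers nats update "$nat" \
    --router="$router" --region="$region" \
    --tcp-established-idle-timeout="${seconds}s"
}

# Example (hypothetical names):
#   set_tcp_established_timeout prod-nat prod-router us-central1 600
```

Check the keepalive interval of long-lived clients first, since any idle gap longer than the new value will drop the mapping.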

Port Reservation Math

Total ports available on a gateway equal (number of NAT IPs) times 64,512. Static minimum ports per VM equal --min-ports-per-vm. Maximum concurrent VMs a gateway can serve at the floor equal total ports divided by min ports per VM. For example, 2 NAT IPs and a min of 1024 ports per VM supports 2 times 64,512 divided by 1024, which is 126 VMs at the floor; beyond that the gateway either auto-allocates more IPs (in auto mode) or starts dropping.
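
The same arithmetic as a one-line shell function, useful for checking a planned configuration. The function name is hypothetical, and the model assumes every VM sits exactly at its minimum reservation.

```shell
#!/bin/sh
# VMs a gateway can serve when each VM holds exactly min_ports ports.
# 64512 = usable source ports per NAT IP.
nat_vm_capacity() {
  nat_ips=$1
  min_ports=$2
  echo $(( nat_ips * 64512 / min_ports ))
}

nat_vm_capacity 2 1024   # the example above: prints 126
```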

IPv6 and NAT64

Cloud NAT for IPv6

On dual-stack and IPv6-only subnets, you can attach a Cloud NAT gateway in IPv6 mode using --nat64=ENABLED or by configuring the gateway to source-translate IPv6 prefixes. Native IPv6 endpoints can therefore egress through a managed NAT just like IPv4 endpoints.

NAT64 Translation

NAT64 lets IPv6-only VMs reach IPv4-only destinations on the public internet. Cloud NAT translates the IPv6 source to an IPv4 NAT IP and embeds the IPv4 destination inside an IPv6 prefix that DNS64 hands the VM. This is critical for IPv6-only workloads that still need to consume legacy IPv4 SaaS APIs.

Configuration Notes

NAT64 requires the subnet stack type to include IPv6 and the gateway to be IPv6-aware. Combined with Private Google Access for IPv6, you can build IPv6-only private subnets whose VMs reach both Google APIs (via PGA) and IPv4 internet endpoints (via NAT64) without external IPv6 addresses.

NAT64 is the only Cloud NAT translation type that changes address family. It maps an IPv6 source to an IPv4 NAT IP so IPv6-only VMs can reach IPv4-only public services. Pair it with DNS64 (provided by Google's metadata resolver) so that legacy A-record-only services are reachable as synthesized AAAA records.

Cloud Router Dependency and Configuration

Why Cloud Router

Every Cloud NAT gateway is attached to a Cloud Router because Google reuses the Router as a programmable control plane: it stores the gateway config, propagates allocations to Andromeda agents, and reports per-VM port assignments back through monitoring. Even though no BGP session is required for Cloud NAT itself, the Cloud Router object must exist in the same region and VPC.
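
The per-VM port assignments mentioned above can be inspected through the router itself. A minimal sketch, with a hypothetical function name and placeholder router and region:

```shell
#!/bin/sh
# Show the NAT IPs and port ranges currently reserved for each VM interface
# behind the NAT gateways hosted on this Cloud Router.
show_nat_port_mappings() {
  router=$1
  region=$2
  gcloud compute routers get-nat-mapping-info "$router" --region="$region"
}

# Example (hypothetical names): show_nat_port_mappings prod-router us-central1
```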

Sharing a Router with Other Services

A single Cloud Router can simultaneously back Cloud NAT, dynamic routing for Cloud VPN, and BGP sessions for Cloud Interconnect VLAN attachments. Sharing is cost-effective but you must size BGP keepalive and ASN settings for the most demanding tenant.

Common gcloud Pattern

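# Control-plane anchor for the gateway; no BGP session is needed for NAT.
gcloud compute routers create prod-router \
  --region=us-central1 --network=prod-vpc
# Reserve two static regional external IPs so the egress addresses stay fixed.
gcloud compute addresses create prod-nat-ip-1 prod-nat-ip-2 \
  --region=us-central1
# Manual IP pool, all subnets in the region, DPA from 64 to 32768 ports,
# and error-only logging to keep log volume low.
gcloud compute routers nats create prod-nat \
  --router=prod-router --region=us-central1 \
  --nat-external-ip-pool=prod-nat-ip-1,prod-nat-ip-2 \
  --nat-all-subnet-ip-ranges \
  --enable-dynamic-port-allocation \
  --min-ports-per-vm=64 --max-ports-per-vm=32768 \
  --enable-logging --log-filter=ERRORS_ONLY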

Private Google Access Integration

What PGA Solves

Private Google Access (PGA) lets private VMs reach Google APIs and services (*.googleapis.com, gcr.io, storage.googleapis.com) without an external IP and without Cloud NAT. PGA is enabled per-subnet (--enable-private-ip-google-access) and routes API traffic over Google's internal network using the default route 0.0.0.0/0 to the default-internet-gateway alongside special private route handling for Google CIDRs.

How PGA and Cloud NAT Coexist

With both PGA and Cloud NAT configured on the same subnet, traffic destined for Google APIs takes the PGA path (no NAT translation, no NAT port consumed) while everything else takes Cloud NAT. This split keeps the Cloud NAT gateway's port budget reserved for genuine third-party internet traffic and reduces translation overhead for the much larger volume of Google API calls.
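
Turning PGA on for a subnet that already sits behind Cloud NAT is a single subnet update. The sketch below uses a hypothetical function name and placeholder subnet and region, and reads the setting back afterwards.

```shell
#!/bin/sh
# Enable Private Google Access on one subnet, then read the setting back.
enable_pga() {
  subnet=$1
  region=$2
  gcloud compute networks subnets update "$subnet" \
    --region="$region" --enable-private-ip-google-access
  # Confirm the flag is now set on the subnet.
  gcloud compute networks subnets describe "$subnet" \
    --region="$region" --format="value(privateIpGoogleAccess)"
}

# Example (hypothetical names): enable_pga prod-subnet us-central1
```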

Private Service Connect Alternative

Private Service Connect endpoints offer an even tighter alternative: a project-local internal IP address that resolves to Google APIs or partner services without traversing the internet at all. Use PSC where you need a stable internal IP and per-VPC DNS for Google APIs; use PGA when you simply want Google API egress without spending NAT ports.

On any subnet that hosts both private VMs and a Cloud NAT egress path, enable Private Google Access. Google API calls (Cloud Storage, Pub/Sub, BigQuery client libraries, container registry pulls) will then bypass the NAT gateway entirely, freeing ports for genuine third-party internet egress and shrinking your NAT logging bill.

Cloud NAT for Serverless and GKE

Private GKE Clusters

Private GKE node pools have no external IP and rely on Cloud NAT for any reachable internet egress, such as pulling images from public registries (when not mirrored to Artifact Registry) or invoking third-party APIs. Each node is one Compute Engine VM from NAT's perspective, so DPA tuning is essential because a single node may host dozens of Pods sharing its port budget.

Serverless VPC Access

Cloud Run, Cloud Functions, and App Engine connect to a VPC via a Serverless VPC Access connector. Egress to the internet from those workloads can be forced through the VPC by setting egress to all-traffic, in which case Cloud NAT translates the connector's traffic. This is the standard pattern for serverless calls to vendor APIs that require a fixed source IP.
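
For Cloud Run, the pattern above comes down to two deploy flags. This is a minimal sketch: the function name is hypothetical, the service, image, and connector names are placeholders, and it assumes the connector's subnet is covered by a Cloud NAT gateway with manual IP allocation.

```shell
#!/bin/sh
# Deploy a Cloud Run service whose outbound traffic all leaves through the VPC,
# so internet-bound requests are translated by Cloud NAT.
deploy_with_vpc_egress() {
  service=$1
  image=$2
  region=$3
  connector=$4
  gcloud run deploy "$service" \
    --image="$image" --region="$region" \
    --vpc-connector="$connector" \
    --vpc-egress=all-traffic
}

# Example (hypothetical names):
#   deploy_with_vpc_egress billing-api IMAGE_URL us-central1 prod-connector
```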

Cloud Run Direct VPC Egress

Cloud Run direct VPC egress (without a connector) attaches service instances directly to the VPC and again uses Cloud NAT for outbound internet. Manual IP allocation here pins a known source IP for vendor allowlists.

Exam Tips and Common Traps

Trap: Inbound NAT

Cloud NAT performs only Source NAT; it never accepts inbound connections. For inbound public access to private backends, use an external HTTP(S) Load Balancer, Network Load Balancer, or Identity-Aware Proxy TCP tunneling.

Trap: Hybrid Egress

Traffic to on-premises destinations via Cloud VPN or Cloud Interconnect does not flow through Cloud NAT. Cloud NAT only intercepts traffic whose next hop is the default-internet-gateway. For hybrid egress translation, configure NAT on the on-premises side or use a vendor virtual appliance.

Trap: Subnet Selection

--nat-all-subnet-ip-ranges covers every subnet of the VPC in the gateway's region, and subnets added later are included automatically for as long as the flag stays set. Switching to --nat-custom-subnet-ip-ranges means you must list each subnet explicitly from then on, which is a common cause of "new subnet has no internet" tickets.
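
With a custom subnet list, every new subnet needs an explicit gateway update. A sketch of that update follows, with a hypothetical function name and placeholder names. The list passed in is the full list to apply, so include the existing subnets along with the new one.

```shell
#!/bin/sh
# Apply an explicit subnet list to a NAT gateway.
# subnets is a comma-separated list of subnet names in the gateway's region.
set_nat_subnets() {
  nat=$1
  router=$2
  region=$3
  subnets=$4
  gcloud compute routers nats update "$nat" \
    --router="$router" --region="$region" \
    --nat-custom-subnet-ip-ranges="$subnets"
}

# Example (hypothetical names):
#   set_nat_subnets prod-nat prod-router us-central1 subnet-a,subnet-b,subnet-new
```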

Trap: Per-Region Scope

A NAT gateway never spans regions. Multi-region private workloads require one Cloud Router and one Cloud NAT gateway per region, each with its own NAT IPs and port budget.
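
One way to stamp out the per-region pair is a simple loop. The sketch below uses auto-allocated IPs for brevity and a hypothetical naming scheme (nat-router-REGION, nat-REGION); swap in --nat-external-ip-pool where a fixed egress IP is required.

```shell
#!/bin/sh
# Create one Cloud Router and one Cloud NAT gateway in each listed region.
# Usage: create_regional_nat NETWORK REGION [REGION...]
create_regional_nat() {
  network=$1
  shift
  for region in "$@"; do
    gcloud compute routers create "nat-router-$region" \
      --network="$network" --region="$region"
    gcloud compute routers nats create "nat-$region" \
      --router="nat-router-$region" --region="$region" \
      --auto-allocate-nat-external-ips \
      --nat-all-subnet-ip-ranges
  done
}

# Example (hypothetical names): create_regional_nat prod-vpc us-central1 europe-west1
```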

Plain English Explanation

Analogy 1: The Apartment Building's Front Desk

Imagine a high-security apartment building. Residents (private VMs) can walk out the front door to run errands (outbound internet), and the doorman (Cloud NAT) signs them out using the building's address (the NAT IP). When something arrives in the mail, the doorman can only deliver it back to the resident who originally sent the outgoing request. Strangers off the street cannot ask the doorman to send them up to apartment 7B; they have no way to initiate contact. That is exactly Source NAT: the building's single street address shields every resident's apartment number from the outside world, while still letting them order takeout.

Analogy 2: Hotel Switchboard and Extension Ports

A 500-room hotel has a single outside phone number (the NAT IP) but only 64,512 outbound trunk lines. When room 312 dials out, the switchboard reserves a trunk line (a NAT port) for the duration of that call. With static port allocation, each room is permanently assigned 64 trunk lines whether they use them or not, which means a busy convention room runs out fast while empty rooms hoard unused lines. Dynamic Port Allocation is the manager telling the switchboard "give extra trunks to whoever is busy, take them back when they hang up". When demand exceeds the building's trunk lines you hear "all circuits are busy" — that is OUT_OF_RESOURCES. The fix is to either lease more outside phone numbers (more NAT IPs) or let the busy room borrow more trunks (raise --max-ports-per-vm).

Analogy 3: Letter Forwarding Service vs Post Office Boxes

Private Google Access is like having a private courier route inside Google's campus: VMs that need to mail something to Cloud Storage or BigQuery hand the letter to the internal courier, which never touches the public postal system. Cloud NAT is the regular post office on the corner that handles letters going outside the campus, stamping them with the NAT IP as the return address. NAT64 is the bilingual translator behind the counter who rewrites IPv6 letters into IPv4 envelopes so they can be delivered to old-school recipients. Sending Cloud Storage requests through the post office (Cloud NAT) instead of the courier (PGA) works but wastes stamps (NAT ports); turning on PGA tells your VMs to use the courier for anything addressed to Google.

FAQs

Q: Does Cloud NAT consume VM CPU or bandwidth? A: No. Translation happens in the Andromeda host networking stack outside the VM. There is no per-VM CPU or memory cost, no traffic hairpinning through a NAT VM, and bandwidth is limited only by the VM's own egress cap and the gateway's total port budget.

Q: Can I use Cloud NAT for private GKE clusters? A: Yes, and it is the recommended pattern. Private GKE nodes have no external IPs; Cloud NAT lets them pull container images from public registries and call third-party APIs. Always enable Dynamic Port Allocation and raise --max-ports-per-vm because each node shares its port budget across many Pods.

Q: How do I get a fixed egress IP for a vendor allowlist? A: Use manual IP allocation. Reserve one or more regional static external IPv4 addresses with gcloud compute addresses create ... --region=REGION, then attach them via --nat-external-ip-pool=ADDRESS_NAMES. Do not use auto-allocate for allowlisted egress because Google may rotate auto-allocated IPs.

Q: What is the difference between Cloud NAT and Private Google Access? A: Cloud NAT provides general internet egress through Source NAT for any external destination. Private Google Access provides direct access to Google APIs (*.googleapis.com, Cloud Storage, BigQuery, etc.) over Google's internal network, without translation and without consuming NAT ports. They coexist on the same subnet and serve complementary destinations.

Q: Why do I see OUT_OF_RESOURCES drops only during traffic spikes? A: A spike pushes one or more VMs past their currently allocated port ceiling. If you are on static allocation, increase --min-ports-per-vm. If you are on Dynamic Port Allocation, raise --max-ports-per-vm so VMs can burst higher, or add more NAT IPs to grow the gateway-wide pool. Also check whether long TCP idle timeouts are holding ports hostage.

Q: Can Cloud NAT translate traffic going to my on-premises data center via Cloud VPN? A: No. Cloud NAT only translates traffic whose route's next hop is the default-internet-gateway. Traffic to on-premises CIDRs via Cloud VPN or Cloud Interconnect bypasses Cloud NAT entirely. If you need source-IP translation for hybrid traffic, do it on the on-premises side.

Q: Does enabling Endpoint-Independent Mapping affect TCP performance? A: No. EIM applies to UDP only. TCP always uses endpoint-dependent mapping in Cloud NAT regardless of the EIM flag. Enable EIM specifically for WebRTC, STUN, and other UDP-based NAT-traversal scenarios.
