
Who's Calling the Internet? Hunting Unknown Egress with VPC Flow Logs and R53 Resolver Logs

Learn how to enable VPC flow logs on your NAT gateway ENI, store them in S3, and query with Athena to identify which services in your private cluster are making unexpected external DNS calls.


Your private EKS cluster is making calls to the public internet and you have no idea which service is doing it — or why.

  • Your services run in private subnets with no direct internet access, so you assume they’re safe
  • Egress goes through a NAT gateway, masking individual pod IPs behind a single IP
  • No alerts fire; traffic looks normal, but the calls are happening silently
  • The fix: enable VPC flow logs on the NAT gateway’s ENI, ship logs to S3, and query with Athena to see every external connection — including pods bypassing your VPC resolver with direct DNS calls

TL;DR

  1. Prerequisite: Use AWS VPC CNI with AWS_VPC_K8S_CNI_EXTERNALSNAT=true so pods get real VPC IPs and SNAT happens at the NAT gateway, not the node.
  2. Enable ENI-level VPC flow logs on the NAT gateway’s network interface → send to S3 (not CloudWatch — too expensive).
  3. Query with Athena to see every external IP your pods are hitting (flow_direction = 'ingress'). Filter on dstport = 53 to catch pods bypassing the VPC resolver with direct public DNS calls.
  4. Enable Route 53 Resolver query logging → same S3 bucket. Join pkt_dstaddr from flow logs against answers.Rdata in resolver logs to resolve external IPs to domain names.
  5. Caveat: R53 Resolver logs show 172.20.0.10 (CoreDNS ClusterIP) as srcaddr, not the pod IP — use flow logs for pod attribution and resolver logs only for IP → domain mapping.

The Problem

I first noticed this during a security review. We were auditing outbound traffic from our private EKS clusters and found connections to external IPs we couldn’t attribute to any known service. No alerts had fired, no deployments had gone out recently, and the APM dashboards looked clean. But something was clearly talking to the internet.

The tricky part: in a private subnet setup, all outbound traffic exits through the NAT gateway. Every pod and VM funnels through the same gateway IP. Once traffic leaves through the NAT, external systems only see the NAT’s public IP — the pod IP is gone. Standard flow logs at the VPC or subnet level capture this post-NAT traffic, so you can’t tell which pod originated the connection. That’s where the NAT gateway ENI’s pkt-srcaddr field comes in — covered in Step 5.

This is a common blind spot for platform and security teams. Developers add SDK calls, telemetry agents, or third-party libraries that phone home during initialization. In a monolith, you’d catch it in a network scan. In EKS with dozens of microservices, it goes unnoticed for months.

Symptoms to Watch For

If you’re seeing any of these, you likely have the same problem:

  • NAT gateway data processing charges are higher than expected for the volume of traffic your services nominally generate
  • Unknown destination IPs in VPC flow logs at the VPC or subnet level, but you can’t trace them back to a specific service
  • Port 53 traffic flowing outbound through the NAT — DNS queries to public resolvers (like 8.8.8.8) rather than your Route 53 private resolver
  • Security scanner alerts showing your private workloads reaching external domains you didn’t whitelist

The port 53 traffic is the one most people miss. If your pods are using a custom DNS config or if a sidecar is bypassing the cluster DNS, those queries go out to the public internet through the NAT gateway. You’d never see it in CloudWatch metrics or standard monitoring.

The Solution: Flow Logs on the NAT Gateway ENI

The NAT gateway has its own elastic network interface (ENI). Flow logs at this specific ENI capture every connection that passes through the NAT. Crucially, they include the pkt-srcaddr field — the original pod IP before SNAT — alongside the external destination IP. That’s exactly the visibility we need.

Sending to S3 instead of CloudWatch Logs is deliberate. A busy NAT gateway can generate millions of flow records per day. CloudWatch Logs ingestion at $0.50/GB adds up fast. S3 at a fraction of that cost, combined with Athena’s pay-per-query model, makes this practical to run continuously.
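To see why S3 wins, here is a quick back-of-the-envelope sketch. The prices are illustrative us-east-1 figures and change over time; plug in your own region's rates:

```python
# Rough monthly cost comparison for flow log delivery.
# Assumed prices (illustrative, us-east-1): CloudWatch Logs ingestion
# $0.50/GB; S3 Standard storage $0.023/GB-month. Ignores Athena scan
# costs and S3 request charges.
def monthly_cost(gb_per_day: float) -> dict:
    gb_month = gb_per_day * 30
    return {
        "cloudwatch_ingestion": round(gb_month * 0.50, 2),
        # S3: pay storage on the accumulated month of logs
        "s3_storage": round(gb_month * 0.023, 2),
    }

print(monthly_cost(10))  # 10 GB of flow records per day
# → {'cloudwatch_ingestion': 150.0, 's3_storage': 6.9}
```

At 10 GB/day the ingestion bill alone is roughly 20x the storage bill, which is why continuous collection to S3 is the practical choice.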

Prerequisite — AWS VPC CNI with external SNAT enabled

This approach only works if your pods have real VPC CIDR IPs and SNAT happens at the NAT gateway, not at the node. Two things need to be true:

  1. Use AWS VPC CNI, not an overlay CNI like Calico or Flannel. Overlay CNIs assign pod IPs from a separate private range (e.g. 192.168.0.0/16) and perform SNAT at the node before traffic ever reaches the NAT gateway. The NAT ENI flow logs then only see the node IP — pod-level attribution is lost.

  2. Set AWS_VPC_K8S_CNI_EXTERNALSNAT=true on the VPC CNI DaemonSet. By default this is false, which means the VPC CNI itself does SNAT at the node (replacing the pod IP with the node IP). Setting it to true disables node-level SNAT and delegates it to the NAT gateway — so the pod’s real VPC IP is preserved all the way to the NAT ENI.

kubectl describe daemonset aws-node -n kube-system | grep AWS_VPC_K8S_CNI_EXTERNALSNAT

With this in place, each pod gets an IP directly from the VPC CIDR, and pkt-srcaddr in the NAT gateway ENI flow logs will be the actual pod IP.
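A quick sanity check that attribution will actually work: every pod IP should fall inside the VPC CIDR. A minimal Python sketch (the CIDR and IPs below are illustrative):

```python
import ipaddress

# With external SNAT enabled and the AWS VPC CNI, pod IPs come straight
# from the VPC CIDR (illustrative value below).
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")

def pod_ip_is_attributable(pod_ip: str) -> bool:
    """True if NAT ENI flow logs will show this IP in pkt-srcaddr."""
    return ipaddress.ip_address(pod_ip) in vpc_cidr

print(pod_ip_is_attributable("10.0.1.45"))    # VPC CNI pod IP -> True
print(pod_ip_is_attributable("192.168.3.7"))  # overlay-CNI-style IP -> False
```

If `kubectl get pods -o wide` shows IPs that fail this check, you're on an overlay CNI or node-level SNAT, and the NAT ENI logs will only show node IPs.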

Other pitfalls to watch for:

  • Flow log records are aggregated (up to 10 minutes by default) and delivered to S3 with additional lag; this isn't real-time monitoring
  • Parquet format for flow logs is faster to query in Athena and roughly 20% smaller on disk than the default gzipped text; use it from the start, retrofitting is painful

Step 1: Find Your NAT Gateway ENI

# List NAT gateways in your VPC
aws ec2 describe-nat-gateways \
  --filter "Name=vpc-id,Values=vpc-XXXXXXXXXXXXXXXXX" \
  --query "NatGateways[*].{NatGatewayId:NatGatewayId, State:State, NetworkInterfaceId:NatGatewayAddresses[0].NetworkInterfaceId}" \
  --output table

Note the NetworkInterfaceId — this is the ENI ID you’ll enable flow logs on. It looks like eni-XXXXXXXXXXXXXXXXX.

Step 2: Create an S3 Bucket for Flow Logs

# Create the bucket (use a unique name)
aws s3api create-bucket \
  --bucket my-vpc-flow-logs-bucket \
  --region <region> \
  --create-bucket-configuration LocationConstraint=<region>

# Block all public access
aws s3api put-public-access-block \
  --bucket my-vpc-flow-logs-bucket \
  --public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Add a lifecycle rule to auto-expire old logs (optional but recommended):

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-vpc-flow-logs-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-flow-logs-90-days",
      "Status": "Enabled",
      "Filter": {"Prefix": "nat-gateway-flow-logs/"},
      "Expiration": {"Days": 90}
    }]
  }'

Step 3: Enable Flow Logs on the NAT Gateway ENI

AWS automatically attaches the required S3 bucket policy when you create the flow log — no manual bucket policy setup needed if you own the bucket.

aws ec2 create-flow-logs \
  --resource-type NetworkInterface \
  --resource-ids eni-XXXXXXXXXXXXXXXXX \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination "arn:aws:s3:::my-vpc-flow-logs-bucket/nat-gateway-flow-logs/" \
  --log-format '${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status} ${vpc-id} ${subnet-id} ${flow-direction} ${traffic-path} ${pkt-srcaddr} ${pkt-dstaddr}' \
  --max-aggregation-interval 60

Key choices here:

  • --max-aggregation-interval 60 gives 1-minute granularity instead of the default 10 minutes
  • The custom --log-format adds pkt-srcaddr and pkt-dstaddr (v3 fields) which capture the original pod IP before NAT translation — srcaddr alone would only show the NAT gateway’s own private IP
  • flow-direction and traffic-path fields help distinguish inbound vs. outbound
  • We’re logging ALL traffic, not just ACCEPT, so we catch rejected egress attempts too
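For quick local inspection of delivered log files, the custom format above splits cleanly on spaces. A minimal parser sketch — the field names mirror the `--log-format` order, and the sample record is illustrative:

```python
# Field order must match the --log-format string exactly.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status", "vpc_id", "subnet_id",
    "flow_direction", "traffic_path", "pkt_srcaddr", "pkt_dstaddr",
]

def parse_flow_record(line: str) -> dict:
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

# Illustrative record: pod 10.0.1.45 reaching 142.250.80.46:443 via the NAT
rec = parse_flow_record(
    "3 123456789012 eni-0abc 10.0.0.7 142.250.80.46 43210 443 6 10 8400 "
    "1700000000 1700000060 ACCEPT OK vpc-0abc subnet-0abc ingress 2 "
    "10.0.1.45 142.250.80.46"
)
print(rec["pkt_srcaddr"], rec["pkt_dstaddr"])  # 10.0.1.45 142.250.80.46
```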

Verify the flow log was created:

aws ec2 describe-flow-logs \
  --filter "Name=resource-id,Values=eni-XXXXXXXXXXXXXXXXX" \
  --query "FlowLogs[*].{FlowLogId:FlowLogId, Status:FlowLogStatus, Destination:LogDestination}" \
  --output table

It takes a few minutes to start collecting data. Check your S3 bucket after ~5 minutes:

aws s3 ls s3://my-vpc-flow-logs-bucket/nat-gateway-flow-logs/ --recursive | head -20

Step 4: Set Up Athena to Query the Logs

First, create an Athena results bucket if you don’t have one:

aws s3api create-bucket \
  --bucket my-athena-query-results \
  --region <region> \
  --create-bucket-configuration LocationConstraint=<region>

Set the query result location in Athena (Console: Athena → Settings → Manage → Query result location):

s3://my-athena-query-results/

Now create the Athena table. Run this in the Athena query editor:

CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs_nat (
  version INT,
  account_id STRING,
  interface_id STRING,
  srcaddr STRING,
  dstaddr STRING,
  srcport INT,
  dstport INT,
  protocol BIGINT,
  packets BIGINT,
  bytes BIGINT,
  `start` BIGINT,
  `end` BIGINT,
  action STRING,
  log_status STRING,
  vpc_id STRING,
  subnet_id STRING,
  flow_direction STRING,
  traffic_path STRING,
  pkt_srcaddr STRING,
  pkt_dstaddr STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://my-vpc-flow-logs-bucket/nat-gateway-flow-logs/AWSLogs/<account-id>/vpcflowlogs/<region>/'
TBLPROPERTIES (
  'skip.header.line.count'='1'
);

Replace <account-id> and <region> with your actual values. AWS writes flow logs to this path structure automatically.

Step 5: Query the Flow Logs

srcaddr vs pkt-srcaddr on a NAT gateway ENI: srcaddr records the IP of the intermediate layer — on a NAT gateway ENI that is the NAT’s own private IP. pkt-srcaddr records the original packet-level source IP before SNAT, which is the real pod IP. The AWS docs explicitly call out NAT gateways and EKS pods as the primary use case for this field. Always use pkt_srcaddr / pkt_dstaddr when querying NAT gateway flow logs.

Traffic arriving at the NAT ENI from private subnet pods is flow_direction = 'ingress' — at that point the pod IP has not yet been translated. Use ingress records to get the real pod IP in pkt_srcaddr.
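One subtlety when excluding private destinations in the queries below: RFC 1918's middle block is 172.16.0.0/12, spanning 172.16.x through 172.31.x, so a naive prefix check on `172.16.` misses half of it. Python's stdlib `ipaddress` gets the boundaries right, which is handy when post-processing Athena results:

```python
import ipaddress

# is_private covers all RFC 1918 space (10/8, 172.16/12, 192.168/16)
# plus loopback/link-local, which is what we want when deciding
# whether a destination is truly external.
def is_private(ip: str) -> bool:
    return ipaddress.ip_address(ip).is_private

print(is_private("172.31.0.5"))     # True  (inside 172.16.0.0/12)
print(is_private("172.32.0.5"))     # False (public space)
print(is_private("142.250.80.46"))  # False
```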

All external calls from pods

Start here. This gives you every external IP your pods are reaching, ranked by connection count. No port filter — you want the full picture first.

SELECT
  pkt_srcaddr AS pod_ip,
  pkt_dstaddr AS external_ip,
  dstport,
  protocol,
  COUNT(*)    AS connection_count,
  SUM(bytes)  AS total_bytes,
  MIN(from_unixtime("start")) AS first_seen,
  MAX(from_unixtime("end")) AS last_seen
FROM vpc_flow_logs_nat
WHERE
  flow_direction = 'ingress'
  AND action = 'ACCEPT'
  AND NOT regexp_like(pkt_dstaddr, '^(10\.|172\.(1[6-9]|2\d|3[01])\.|192\.168\.)') -- all RFC 1918 ranges; LIKE '172.16.%' would miss 172.17-172.31
GROUP BY pkt_srcaddr, pkt_dstaddr, dstport, protocol
ORDER BY connection_count DESC
LIMIT 100;

Take the pkt_dstaddr values from this result into Step 6 — matching them against the Rdata field in Route 53 Resolver query logs will resolve these IPs to domain names.

To map a pod IP back to a Kubernetes workload:

kubectl get pods --all-namespaces -o wide | grep "<pkt_srcaddr-from-query>"

Check for pods bypassing the VPC resolver (direct external DNS)

As a separate check, filter specifically on port 53. If any pods show up here, they are sending DNS queries directly to a public resolver (e.g. 8.8.8.8, 1.1.1.1) instead of going through CoreDNS and the VPC resolver. This is a misconfiguration worth fixing — it bypasses your private hosted zone resolution and any DNS Firewall rules you have in place.

SELECT
  pkt_srcaddr AS pod_ip,
  pkt_dstaddr AS external_dns_resolver,
  COUNT(*) AS query_count,
  SUM(bytes) AS total_bytes,
  MIN(from_unixtime("start")) AS first_seen,
  MAX(from_unixtime("end")) AS last_seen
FROM vpc_flow_logs_nat
WHERE
  dstport = 53
  AND action = 'ACCEPT'
  AND flow_direction = 'ingress'
  AND NOT regexp_like(pkt_dstaddr, '^(10\.|172\.(1[6-9]|2\d|3[01])\.|192\.168\.)') -- all RFC 1918 ranges
GROUP BY pkt_srcaddr, pkt_dstaddr
ORDER BY query_count DESC
LIMIT 50;
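The same filter is easy to replicate offline against records you've already parsed into dicts keyed by the log-format field names (the record below is illustrative):

```python
import ipaddress

def is_direct_external_dns(rec: dict) -> bool:
    """Flag a flow record as a DNS query sent straight to a public resolver."""
    return (
        rec["dstport"] == "53"
        and rec["flow_direction"] == "ingress"   # pre-SNAT side of the NAT ENI
        and not ipaddress.ip_address(rec["pkt_dstaddr"]).is_private
    )

rec = {"dstport": "53", "flow_direction": "ingress", "pkt_dstaddr": "8.8.8.8"}
print(is_direct_external_dns(rec))  # True -> this pod bypasses CoreDNS
```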

Step 6: Resolve IPs to Domain Names with Route 53 Resolver Query Logs

VPC flow logs tell you that pod 10.0.1.45 connected to 142.250.80.46 on port 443 — but they don’t tell you the domain name. To get the actual hostname, you need Route 53 Resolver query logging. This is different from Route 53 public hosted zone query logging — it captures DNS queries made by resources inside your VPC, including every pod in your EKS cluster.

Each resolver query log entry is a JSON record that includes the domain name queried, the DNS response code, and the resolved IP in the answers field. This lets you map an external IP back to its domain name — which is the gap VPC flow logs leave.
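A sketch of what that extraction looks like outside Athena, using a hand-built record with illustrative values (real entries carry more fields, per the table definition below):

```python
import json

# One resolver query log line is one JSON object (newline-delimited).
line = json.dumps({
    "query_name": "api.example.com.",
    "rcode": "NOERROR",
    "answers": [{"Rdata": "142.250.80.46", "Type": "A", "Class": "IN"}],
    "srcaddr": "172.20.0.10",  # CoreDNS ClusterIP, not a pod (see caveat)
})

def ip_to_domain_pairs(line: str):
    """Return (resolved_ip, domain) pairs from one resolver log line."""
    rec = json.loads(line)
    return [(a["Rdata"], rec["query_name"]) for a in rec.get("answers", [])]

print(ip_to_domain_pairs(line))  # [('142.250.80.46', 'api.example.com.')]
```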

EKS caveat: The srcaddr in Resolver logs is the CoreDNS ClusterIP (172.20.0.10), not the originating pod IP — see Lessons Learned for the full explanation.

Caching note: Resolver query logging only records unique queries. Subsequent lookups served from the VPC resolver’s cache (within TTL) are not logged, so you may not see every call.

Enable Resolver Query Logging

Unlike VPC flow logs (which auto-attach the required S3 policy), Route 53 Resolver requires you to add the bucket policy manually first — otherwise you get RSLVR-01605 Missing permission to log destination.

# Save this as resolver-bucket-policy.json
cat > resolver-bucket-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AWSLogDeliveryWrite",
      "Effect": "Allow",
      "Principal": { "Service": "delivery.logs.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-vpc-flow-logs-bucket/resolver-query-logs/AWSLogs/<account-id>/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-acl": "bucket-owner-full-control",
          "aws:SourceAccount": "<account-id>"
        }
      }
    },
    {
      "Sid": "AWSLogDeliveryAclCheck",
      "Effect": "Allow",
      "Principal": { "Service": "delivery.logs.amazonaws.com" },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::my-vpc-flow-logs-bucket",
      "Condition": {
        "StringEquals": { "aws:SourceAccount": "<account-id>" }
      }
    }
  ]
}
EOF

# Apply the policy (replace <account-id> with your actual account ID)
aws s3api put-bucket-policy \
  --bucket my-vpc-flow-logs-bucket \
  --policy file://resolver-bucket-policy.json

Now create the logging config:

aws route53resolver create-resolver-query-log-config \
  --name eks-vpc-resolver-logs \
  --destination-arn "arn:aws:s3:::my-vpc-flow-logs-bucket/resolver-query-logs/" \
  --creator-request-id "eks-resolver-logs-$(date +%s)"

Note the Id from the response — you’ll need it to associate with your VPC:

# Associate the config with your VPC
aws route53resolver associate-resolver-query-log-config \
  --resolver-query-log-config-id rqlc-XXXXXXXXXXXXXXXXX \
  --resource-id vpc-XXXXXXXXXXXXXXXXX

Verify the association is active:

aws route53resolver list-resolver-query-log-config-associations \
  --filters Name=Status,Values=ACTIVE \
  --query "ResolverQueryLogConfigAssociations[*].{ConfigId:ResolverQueryLogConfigId,VPC:ResourceId,Status:Status}" \
  --output table

Create the Athena Table

Resolver logs land in S3 as newline-delimited JSON (one record per line), so the table uses the OpenX JSON SerDe:

CREATE EXTERNAL TABLE IF NOT EXISTS resolver_query_logs (
  version        STRING,
  account_id     STRING,
  region         STRING,
  vpc_id         STRING,
  query_timestamp STRING,
  query_name     STRING,
  query_type     STRING,
  query_class    STRING,
  rcode          STRING,
  answers        ARRAY<STRUCT<
    Rdata: STRING,
    Type:  STRING,
    Class: STRING
  >>,
  srcaddr        STRING,
  srcport        STRING,
  transport      STRING,
  srcids         STRUCT<
    instance:          STRING,
    resolver_endpoint: STRING
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-vpc-flow-logs-bucket/resolver-query-logs/AWSLogs/<account-id>/vpcdnsquerylogs/<vpc-id>/';

Resolve an IP to Its Domain Name

Given an external IP you spotted via pkt_dstaddr in VPC flow logs (e.g. 142.250.80.46), find which domain it resolved from:

SELECT
  query_name        AS domain,
  query_type,
  rcode,
  answer.Rdata      AS resolved_ip,
  query_timestamp
FROM resolver_query_logs
CROSS JOIN UNNEST(answers) AS t(answer)
WHERE answer.Rdata = '142.250.80.46'
ORDER BY query_timestamp DESC
LIMIT 20;

Join Flow Logs and Resolver Logs for the Full Picture

Join on the resolved IP (answer.Rdata = fl.pkt_dstaddr) — not on srcaddr, because the source in Resolver logs is CoreDNS, not the individual pod:

SELECT
  fl.pkt_srcaddr    AS pod_ip,
  rl.query_name     AS domain,
  fl.pkt_dstaddr    AS external_ip,
  fl.dstport,
  SUM(fl.bytes)     AS total_bytes,
  COUNT(*)          AS connection_count
FROM vpc_flow_logs_nat fl
JOIN (
  SELECT query_name, answer.Rdata AS resolved_ip
  FROM resolver_query_logs
  CROSS JOIN UNNEST(answers) AS t(answer)
) rl ON fl.pkt_dstaddr = rl.resolved_ip
WHERE
  fl.flow_direction = 'ingress'
  AND fl.action     = 'ACCEPT'
  AND NOT regexp_like(fl.pkt_dstaddr, '^(10\.|172\.(1[6-9]|2\d|3[01])\.|192\.168\.)')
GROUP BY fl.pkt_srcaddr, rl.query_name, fl.pkt_dstaddr, fl.dstport
ORDER BY total_bytes DESC
LIMIT 50;

This gives you the pod IP from VPC flow logs, the domain name behind the destination IP from Resolver logs, and how much data was transferred. You still get the full picture — you just can’t trace the DNS query itself back to the individual pod without CoreDNS logs.
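Conceptually the join is just a dictionary lookup on the destination IP. A tiny Python sketch with illustrative data:

```python
# IP -> domain map, as extracted from resolver log answers.Rdata
ip_to_domain = {"142.250.80.46": "telemetry.vendor.com."}

# Flow records from the NAT ENI (ingress side, so pkt_srcaddr is the pod)
flows = [
    {"pkt_srcaddr": "10.0.1.45", "pkt_dstaddr": "142.250.80.46", "bytes": 8400},
    {"pkt_srcaddr": "10.0.2.17", "pkt_dstaddr": "203.0.113.9",  "bytes": 120},
]

# Annotate each flow with the domain behind its destination IP
for f in flows:
    f["domain"] = ip_to_domain.get(f["pkt_dstaddr"], "<unresolved>")

print(flows[0]["domain"])  # telemetry.vendor.com.
print(flows[1]["domain"])  # <unresolved> -> no DNS query seen for this IP
```

The `<unresolved>` case is worth watching: an external IP with no matching resolver record means the destination was hard-coded, cached beyond your log window, or resolved outside the VPC resolver entirely.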

Lessons Learned

A few things I got wrong or had to figure out the hard way while setting this up.

Route 53 Resolver logs give you CoreDNS’s IP, not the pod IP

When you look at the srcaddr field in Route 53 Resolver query logs, you’ll see something like 172.20.0.10 as the source for almost every DNS query. That’s not a pod. It’s the ClusterIP of the CoreDNS (kube-dns) service, which EKS assigns from the cluster’s service CIDR at creation time (172.20.0.10 when the VPC CIDR falls in the 10.x range, 10.100.0.10 otherwise).

The actual DNS resolution chain in EKS, and how the pod then uses the resolved IP to reach the external service:

flowchart LR
    subgraph dns[" DNS Resolution "]
        direction TB
        pod1([" EKS Pod 10.0.1.45 "]):::podStyle
        coredns([" CoreDNS 172.20.0.10 "]):::corednsStyle
        vpc([" VPC Resolver 169.254.169.253 "]):::resolverStyle
        r53([" Route 53 Resolver "]):::r53Style

        pod1 -- "① query: api.example.com" --> coredns
        coredns -- "② forward external" --> vpc
        vpc -- "③ resolve" --> r53
        r53 -. "④ 142.250.80.46" .-> vpc
        vpc -. "⑤" .-> coredns
        coredns -. "⑥ return IP to pod" .-> pod1
    end

    subgraph conn[" TCP Connection "]
        direction TB
        pod2([" EKS Pod 10.0.1.45 "]):::podStyle
        nat([" NAT Gateway ENI pkt-srcaddr = pod IP "]):::natStyle
        ext([" External Service 142.250.80.46 "]):::extStyle

        pod2 -- "⑦ connect :443" --> nat
        nat -- "⑧ SNAT → EIP" --> ext
        ext -. "⑨ response" .-> nat
        nat -. "⑩ DNAT → pod" .-> pod2
    end

    dns -. " resolved IP triggers connection " .-> conn

    classDef podStyle     fill:#eef2ff,stroke:#6366f1,color:#3730a3
    classDef corednsStyle fill:#f5f3ff,stroke:#8b5cf6,color:#5b21b6
    classDef resolverStyle fill:#ecfeff,stroke:#06b6d4,color:#155e75
    classDef r53Style     fill:#fffbeb,stroke:#f59e0b,color:#92400e
    classDef natStyle     fill:#ecfdf5,stroke:#10b981,color:#064e3b
    classDef extStyle     fill:#f8fafc,stroke:#64748b,color:#334155

By the time the query reaches the Route 53 Resolver, the original pod IP is gone — CoreDNS is the caller. So Resolver logs are useful for mapping IPs to domain names (via the answers.Rdata field), but they cannot tell you which pod initiated the DNS lookup. That’s why the join query in Step 6 joins on pkt_dstaddr = resolved_ip, not on source IP.

CoreDNS / node-local-dns logs can give you pod IP + domain, but with trade-offs

If you need to trace a specific pod to a specific DNS query, CoreDNS query logs are the right tool — they log the client pod IP and the queried domain directly. You can enable them by editing the CoreDNS ConfigMap:

kubectl edit configmap coredns -n kube-system

Add log to the Corefile block:

.:53 {
    log        # <-- adds per-query logging
    errors
    health
    ...
}

But be aware of the trade-offs before you flip this on in production:

  • Scope: CoreDNS logs only cover workloads inside the EKS cluster. VMs, Lambda functions, ECS tasks, or any other compute that routes through the same NAT gateway won’t appear here.
  • Resource cost: Enabling query-level logging (log plugin) significantly increases CoreDNS CPU and memory usage under load. In a busy cluster, DNS query volume is high — every pod startup, every service call, every health check generates DNS queries. This has caused CoreDNS OOM kills in clusters that weren’t sized for it.
  • Log volume: CoreDNS debug logs are verbose. Without a log aggregation pipeline that can handle the volume, you’ll either drop logs or run up a large CloudWatch bill.

If you just need occasional ad-hoc investigation (not continuous monitoring), a better approach is to temporarily enable CoreDNS logging, capture what you need, then disable it — rather than leaving it on permanently.
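Once logging is on, each line contains the client pod IP and the queried name. A sketch of pulling both out with a regex — the line format here is assumed from typical CoreDNS `log` plugin output, so adjust the pattern if your version differs:

```python
import re

# Illustrative CoreDNS query log line (format assumed, not guaranteed
# to match every CoreDNS version).
LINE = ('[INFO] 10.0.1.45:52034 - 29451 '
        '"A IN api.example.com. udp 40 false 4096" NOERROR qr,rd,ra 68 0.000183s')

# client = pod IP, qtype = record type, qname = queried domain
PATTERN = re.compile(
    r'\[INFO\] (?P<client>[\d.]+):\d+ - \d+ "(?P<qtype>\S+) IN (?P<qname>\S+)'
)

m = PATTERN.search(LINE)
if m:
    print(m.group("client"), m.group("qname"))  # 10.0.1.45 api.example.com.
```

This is the one data source in this post that gives you pod IP and domain in the same record, which is why it's worth the temporary enable/capture/disable cycle for ad-hoc investigations.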

Conclusion

VPC flow logs on the NAT gateway ENI — queried via pkt-srcaddr — tell you which pod is talking to which external IP and how much. Route 53 Resolver query logs tell you the domain name behind that IP. R53 Resolver logs won’t expose individual pod IPs (CoreDNS is the caller), but joining on the destination IP closes the loop: you go from “something is hitting 142.250.80.46” to “pods are connecting to telemetry.vendor.com” in a single Athena query.



This post is licensed under CC BY 4.0 by the author.