Kubernetes Networking Explained: CNI, kube-proxy, DNS, and NetworkPolicy
Why Kubernetes Networking Confuses People
Kubernetes networking has four distinct layers that all need to work together. Most engineers learn each in isolation and then get confused when they interact. This article explains all four layers and how they connect.
The four problems Kubernetes networking solves:
- Pod-to-pod communication (on the same node and across nodes)
- Service discovery and load balancing (stable IP for a group of pods)
- DNS resolution (names, not IPs)
- Network policy (access control between pods)
Layer 1: Pod Networking (CNI)
Every pod gets its own IP address. Pods on the same node and pods on different nodes must be able to reach each other without NAT. This is the Kubernetes networking model.
The Container Network Interface (CNI) plugin is responsible for making this work. When a pod is scheduled, the kubelet calls the CNI plugin to:
- Create a network namespace for the pod
- Create a virtual ethernet pair (veth) — one end inside the pod namespace, one end on the host
- Assign an IP address to the pod from the node's CIDR range
- Set up routing so packets destined for this pod's IP arrive at this node
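To make these steps concrete, here is a rough manual equivalent using standard iproute2 commands (a sketch only: real CNI plugins do this programmatically, and the namespace name pod-ns, the interface names, and the 10.244.1.5 address are illustrative):
# Create the pod's network namespace
ip netns add pod-ns
# Create a veth pair and move one end into the pod namespace
ip link add veth-host type veth peer name veth-pod
ip link set veth-pod netns pod-ns
# Assign a pod IP from the node's CIDR and bring both ends up
ip netns exec pod-ns ip addr add 10.244.1.5/24 dev veth-pod
ip netns exec pod-ns ip link set veth-pod up
ip link set veth-host up
# Route packets destined for this pod's IP to its host-side veth
ip route add 10.244.1.5/32 dev veth-host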
How cross-node communication works (with flannel/VXLAN):
When pod A on node-1 (10.244.1.5) sends a packet to pod B on node-2 (10.244.2.7):
- The packet leaves pod A via its veth into node-1's network namespace
- node-1's routing table says: 10.244.2.0/24 → flannel.1 (VXLAN interface)
- flannel.1 encapsulates the packet in a UDP VXLAN packet and sends it to node-2's host IP
- node-2's flannel.1 decapsulates it and delivers it to pod B via its veth
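You can watch each step of this path on a flannel node with standard tooling. A sketch, assuming the underlay NIC is eth0 and flannel's default VXLAN port 8472:
# The route that sends the remote pod CIDR to the VXLAN interface
ip route | grep flannel.1
# VXLAN details: VNI, local address, underlay device
ip -d link show flannel.1
# Watch the encapsulated pod-to-pod traffic on the underlay network
tcpdump -ni eth0 udp port 8472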
With AWS VPC CNI (used by EKS), pods get real VPC IPs from the node's ENI — no overlay encapsulation. This is faster and simpler, but requires sufficient ENI capacity per node.
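Since pod density on EKS is capped by ENI and per-ENI IP limits, it is worth knowing a node's effective pod capacity. One way to check (node name is a placeholder):
# Max pods the scheduler will place on this node (derived from ENI/IP limits on EKS)
kubectl get node <node-name> -o jsonpath='{.status.allocatable.pods}'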
Debugging CNI issues:
# Pod stuck in ContainerCreating — often a CNI failure
kubectl describe pod <pod> | grep -A5 Events
# Check CNI plugin logs
kubectl logs -n kube-system -l k8s-app=aws-node # AWS VPC CNI
# Verify pod has an IP
kubectl get pod <pod> -o wide
# Test pod-to-pod connectivity
kubectl exec -it pod-a -- curl http://10.244.2.7:8080
Layer 2: Services and kube-proxy
A Service gives a stable virtual IP (ClusterIP) to a group of pods. The actual pods behind it can come and go — the Service IP stays constant.
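A minimal ClusterIP Service, for reference (names and ports are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: api          # pods carrying this label become the endpoints
  ports:
  - port: 80          # the port on the Service's ClusterIP
    targetPort: 8080  # the container port on each backing pod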
How kube-proxy implements Services:
kube-proxy watches the API server for Service and EndpointSlice objects (Endpoints in older releases). When it sees a new Service, it programs iptables rules (or IPVS rules in ipvs mode) on every node:
Packet to 10.96.45.100:80 (ClusterIP)
→ iptables DNAT → randomly select one of:
10.244.1.5:8080 (pod-1)
10.244.2.7:8080 (pod-2)
10.244.3.2:8080 (pod-3)
The DNAT (destination NAT) rewrites the destination IP to a real pod IP before the packet leaves the node. The response path is handled by conntrack — the kernel remembers the translation and reverses it on the way back.
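You can inspect both the rules and the conntrack state directly on any node. A sketch, continuing the 10.96.45.100 example (the conntrack CLI may need installing):
# The DNAT entry point kube-proxy programs for each Service
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.45.100
# Per-endpoint chains, including the random-selection probabilities
iptables -t nat -L -n | grep KUBE-SEP
# The tracked connection that reverses the NAT on the return path
conntrack -L -d 10.96.45.100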
ClusterIP vs NodePort vs LoadBalancer:
- ClusterIP: only reachable within the cluster. Default.
- NodePort: exposes the service on a high port (default range 30000–32767) on every node's IP.
- LoadBalancer: provisions a cloud load balancer (ALB/NLB on AWS) pointing to NodePorts.
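The type is just a field on the Service; for example, creating a NodePort Service for an existing workload (the deployment name is illustrative):
# Create a NodePort Service for an existing deployment
kubectl expose deployment api --port=80 --target-port=8080 --type=NodePort
# See which port in the 30000-32767 range was assigned
kubectl get svc api -o jsonpath='{.spec.ports[0].nodePort}'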
The iptables scale problem:
At 10,000 services, iptables becomes the bottleneck: rules are evaluated linearly (O(N) per packet), and every Service or endpoint change rewrites the whole ruleset. IPVS mode uses a kernel hash table with O(1) lookups. For large clusters, switch kube-proxy to IPVS mode. On AWS, the AWS Load Balancer Controller in IP target mode sends external traffic straight to pod IPs, bypassing kube-proxy and NodePorts entirely.
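To confirm which mode kube-proxy is actually running in (the ConfigMap location assumes a kubeadm-style cluster; ipvsadm may need installing):
# The configured mode
kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'
# On a node: the active mode, from kube-proxy's metrics endpoint
curl -s http://localhost:10249/proxyMode
# With IPVS enabled, list virtual servers and their backends
ipvsadm -Ln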
Layer 3: DNS
Virtually every Kubernetes cluster runs CoreDNS as the cluster DNS server (it has been the default since v1.13). Each pod's /etc/resolv.conf, under the default ClusterFirst dnsPolicy, points to the CoreDNS Service's ClusterIP.
The search domain problem (ndots:5):
/etc/resolv.conf in pods typically contains:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
ndots:5 means: if the name has fewer than 5 dots, try the search domains first before treating it as an absolute name. So a query for api.example.com (2 dots) generates:
- api.example.com.default.svc.cluster.local — fails
- api.example.com.svc.cluster.local — fails
- api.example.com.cluster.local — fails
- api.example.com. — succeeds
Three unnecessary DNS queries per external lookup. This adds latency and hammers CoreDNS. Fix: append a trailing dot to external names (api.example.com.) or set ndots:1 for workloads that only call external services.
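The ndots override lives in the pod spec's dnsConfig. A sketch (pod name and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: external-caller
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "1"        # names with at least one dot are tried as absolute first
  containers:
  - name: app
    image: example/app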
DNS for Services:
my-service.my-namespace.svc.cluster.local is the full DNS name for a service. Within the same namespace, just my-service works. Headless services (clusterIP: None) return the individual pod IPs instead of a single ClusterIP, which is useful for StatefulSets whose clients need to connect to specific pods.
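A headless Service is an ordinary Service with clusterIP explicitly set to None (names and port are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None      # no virtual IP; DNS returns the pod IPs directly
  selector:
    app: db
  ports:
  - port: 5432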
Debugging DNS:
# Test DNS resolution from inside a pod
kubectl exec -it debug-pod -- nslookup my-service.my-namespace
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check CoreDNS config
kubectl get configmap coredns -n kube-system -o yaml
Layer 4: NetworkPolicy
By default, all pods in a Kubernetes cluster can communicate with all other pods. NetworkPolicy resources restrict this — they are the firewall rules of Kubernetes networking.
NetworkPolicy is implemented by the CNI plugin (not kube-proxy). Not all CNI plugins support NetworkPolicy: Calico, Cilium, and Weave Net do; flannel alone does not.
Default deny all ingress:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {} # applies to all pods
  policyTypes:
  - Ingress
This blocks all inbound traffic to all pods in the production namespace. Then selectively allow:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-from-frontend
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
Common NetworkPolicy mistakes:
- Allowing ingress on the api pod while the frontend sits behind a default-deny-egress policy: the api permits inbound from frontend, but the frontend's egress is blocked, so the connection never starts. (NetworkPolicy is stateful; replies to an allowed connection are permitted automatically, so return traffic never needs its own rule.)
- Forgetting that NetworkPolicy is additive: multiple policies that select the same pod are ORed together
- Not allowing DNS egress (port 53 UDP/TCP to CoreDNS): pods cannot resolve names; see the sketch after this list
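A sketch of the DNS egress allowance from the last bullet, using the same k8s-app=kube-dns label the debugging commands above rely on (the kubernetes.io/metadata.name namespace label is set automatically on Kubernetes v1.21+):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}       # all pods in this namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53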
With Cilium (eBPF-based):
Cilium replaces iptables-based enforcement with eBPF programs attached to network interfaces. It supports L7 NetworkPolicy (HTTP path, gRPC method), which iptables cannot do. It also provides a network observability layer (Hubble) showing which pods are communicating with which.
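An L7 rule uses Cilium's own CRD rather than the core NetworkPolicy API. A sketch against the cilium.io/v2 API (labels, port, and path are illustrative):
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-allow
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/endpoint"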
The Full Picture
When pod-a calls http://my-service/endpoint:
- DNS: pod-a queries CoreDNS for my-service.default.svc.cluster.local → gets ClusterIP 10.96.45.100
- NetworkPolicy: egress from pod-a to port 80 is allowed (if policies are configured)
- kube-proxy/iptables: packet to 10.96.45.100:80 is DNAT'd to a real pod IP, say 10.244.2.7:8080
- CNI: packet is routed to node-2 (via VXLAN or VPC routing) and delivered to pod-b's network namespace
- Response: conntrack reverses the NAT, pod-a receives the response from 10.96.45.100 (the Service IP)
Understanding this chain is what lets you debug networking issues systematically instead of randomly — which is exactly what Staff SRE interviewers are looking for.
Want to go deeper?
15 weeks of structured SRE curriculum.
Hone covers every topic in this article — and 100 more — in a structured 15-week path built for engineers aiming at Staff and Principal SRE. Production scenarios, hands-on labs, and Staff-level interview Q&As in every lesson.