Kubernetes Networking Explained: CNI, kube-proxy, DNS, and NetworkPolicy
Why Kubernetes Networking Confuses People
Kubernetes networking has four distinct layers that all need to work together. Most engineers learn each in isolation and then get confused when they interact. This article explains all four layers and how they connect.
The four problems Kubernetes networking solves:
- Pod-to-pod communication (on the same node and across nodes)
- Service discovery and load balancing (stable IP for a group of pods)
- DNS resolution (names, not IPs)
- Network policy (access control between pods)
Layer 1: Pod Networking (CNI)
Every pod gets its own IP address. Pods on the same node and pods on different nodes must be able to reach each other without NAT. This is the Kubernetes networking model.
The Container Network Interface (CNI) plugin is responsible for making this work. When a pod is scheduled, the kubelet calls the CNI plugin to:
- Create a network namespace for the pod
- Create a virtual ethernet pair (veth) — one end inside the pod namespace, one end on the host
- Assign an IP address to the pod from the node's CIDR range
- Set up routing so packets destined for this pod's IP arrive at this node
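To make these steps concrete, here is a rough manual equivalent using standard iproute2 commands (a sketch only: real CNI plugins do this programmatically, and the namespace name pod-ns, the interface names, and the 10.244.1.5 address are illustrative):
# Create the pod's network namespace
ip netns add pod-ns
# Create a veth pair and move one end into the pod namespace
ip link add veth-host type veth peer name veth-pod
ip link set veth-pod netns pod-ns
# Assign a pod IP from the node's CIDR and bring both ends up
ip netns exec pod-ns ip addr add 10.244.1.5/24 dev veth-pod
ip netns exec pod-ns ip link set veth-pod up
ip link set veth-host up
# Route packets destined for this pod's IP to its host-side veth
ip route add 10.244.1.5/32 dev veth-host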
How cross-node communication works (with flannel/VXLAN):
When pod A on node-1 (10.244.1.5) sends a packet to pod B on node-2 (10.244.2.7):
- The packet leaves pod A via its veth into node-1's network namespace
- node-1's routing table says: 10.244.2.0/24 → flannel.1 (VXLAN interface)
- flannel.1 encapsulates the packet in a UDP VXLAN packet and sends it to node-2's host IP
- node-2's flannel.1 decapsulates it and delivers it to pod B via its veth
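You can watch each step of this path on a flannel node with standard tooling. A sketch, assuming the underlay NIC is eth0 and flannel's default VXLAN port 8472:
# The route that sends the remote pod CIDR to the VXLAN interface
ip route | grep flannel.1
# VXLAN details: VNI, local address, underlay device
ip -d link show flannel.1
# Watch the encapsulated pod-to-pod traffic on the underlay network
tcpdump -ni eth0 udp port 8472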
With AWS VPC CNI (used by EKS), pods get real VPC IPs from the node's ENI — no overlay encapsulation. This is faster and simpler, but requires sufficient ENI capacity per node.
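Since pod density on EKS is capped by ENI and per-ENI IP limits, it is worth knowing a node's effective pod capacity. One way to check (node name is a placeholder):
# Max pods the scheduler will place on this node (derived from ENI/IP limits on EKS)
kubectl get node <node-name> -o jsonpath='{.status.allocatable.pods}'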
Debugging CNI issues:
# Pod stuck in ContainerCreating — often a CNI failure
kubectl describe pod <pod> | grep -A5 Events
# Check CNI plugin logs
kubectl logs -n kube-system -l k8s-app=aws-node # AWS VPC CNI
# Verify pod has an IP
kubectl get pod <pod> -o wide
# Test pod-to-pod connectivity
kubectl exec -it pod-a -- curl http://10.244.2.7:8080
Layer 2: Services and kube-proxy
A Service gives a stable virtual IP (ClusterIP) to a group of pods. The actual pods behind it can come and go — the Service IP stays constant.
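A minimal ClusterIP Service, for reference (names and ports are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: api          # pods carrying this label become the endpoints
  ports:
  - port: 80          # the port on the Service's ClusterIP
    targetPort: 8080  # the container port on each backing pod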
How kube-proxy implements Services:
kube-proxy watches the API server for Service and EndpointSlice objects (Endpoints in older releases). When it sees a new Service, it programs iptables rules (or IPVS rules in ipvs mode) on every node:
Packet to 10.96.45.100:80 (ClusterIP)
→ iptables DNAT → randomly select one of:
10.244.1.5:8080 (pod-1)
10.244.2.7:8080 (pod-2)
10.244.3.2:8080 (pod-3)
The DNAT (destination NAT) rewrites the destination IP to a real pod IP before the packet leaves the node. The response path is handled by conntrack — the kernel remembers the translation and reverses it on the way back.
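You can inspect both the rules and the conntrack state directly on any node. A sketch, continuing the 10.96.45.100 example (the conntrack CLI may need installing):
# The DNAT entry point kube-proxy programs for each Service
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.45.100
# Per-endpoint chains, including the random-selection probabilities
iptables -t nat -L -n | grep KUBE-SEP
# The tracked connection that reverses the NAT on the return path
conntrack -L -d 10.96.45.100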
ClusterIP vs NodePort vs LoadBalancer:
- ClusterIP: only reachable within the cluster. Default.
- NodePort: exposes the service on a high port (default range 30000–32767) on every node's IP.
- LoadBalancer: provisions a cloud load balancer (ALB/NLB on AWS) pointing to NodePorts.
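The type is just a field on the Service; for example, creating a NodePort Service for an existing workload (the deployment name is illustrative):
# Create a NodePort Service for an existing deployment
kubectl expose deployment api --port=80 --target-port=8080 --type=NodePort
# See which port in the 30000-32767 range was assigned
kubectl get svc api -o jsonpath='{.spec.ports[0].nodePort}'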
The iptables scale problem:
At 10,000 services, iptables becomes the bottleneck: rules are evaluated linearly (O(N) per packet), and every Service or endpoint change rewrites the whole ruleset. IPVS mode uses a kernel hash table with O(1) lookups. For large clusters, switch kube-proxy to IPVS mode. On AWS, the AWS Load Balancer Controller in IP target mode sends external traffic straight to pod IPs, bypassing kube-proxy and NodePorts entirely.
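To confirm which mode kube-proxy is actually running in (the ConfigMap location assumes a kubeadm-style cluster; ipvsadm may need installing):
# The configured mode
kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'
# On a node: the active mode, from kube-proxy's metrics endpoint
curl -s http://localhost:10249/proxyMode
# With IPVS enabled, list virtual servers and their backends
ipvsadm -Ln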
Layer 3: DNS
Virtually every Kubernetes cluster runs CoreDNS as the cluster DNS server (it has been the default since v1.13). Each pod's /etc/resolv.conf, under the default ClusterFirst dnsPolicy, points to the CoreDNS Service's ClusterIP.
The search domain problem (ndots:5):
/etc/resolv.conf in pods typically contains:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
ndots:5 means: if the name has fewer than 5 dots, try the search domains first before treating it as an absolute name. So a query for api.example.com (2 dots) generates:
- api.example.com.default.svc.cluster.local — fails
- api.example.com.svc.cluster.local — fails
- api.example.com.cluster.local — fails
- api.example.com. — succeeds
Three unnecessary DNS queries per external lookup. This adds latency and hammers CoreDNS. Fix: append a trailing dot to external names (api.example.com.) or set ndots:1 for workloads that only call external services.
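The ndots override lives in the pod spec's dnsConfig. A sketch (pod name and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: external-caller
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "1"        # names with at least one dot are tried as absolute first
  containers:
  - name: app
    image: example/app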
DNS for Services:
my-service.my-namespace.svc.cluster.local is the full DNS name for a service. Within the same namespace, just my-service works. Headless services (clusterIP: None) return the individual pod IPs instead of a single ClusterIP, which is useful for StatefulSets whose clients need to connect to specific pods.
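A headless Service is an ordinary Service with clusterIP explicitly set to None (names and port are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None      # no virtual IP; DNS returns the pod IPs directly
  selector:
    app: db
  ports:
  - port: 5432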
Debugging DNS:
# Test DNS resolution from inside a pod
kubectl exec -it debug-pod -- nslookup my-service.my-namespace
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check CoreDNS config
kubectl get configmap coredns -n kube-system -o yaml
Layer 4: NetworkPolicy
By default, all pods in a Kubernetes cluster can communicate with all other pods. NetworkPolicy resources restrict this — they are the firewall rules of Kubernetes networking.
NetworkPolicy is implemented by the CNI plugin (not kube-proxy). Not all CNI plugins support NetworkPolicy: Calico, Cilium, and Weave Net do; flannel alone does not.
Default deny all ingress:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {} # applies to all pods
  policyTypes:
  - Ingress
This blocks all inbound traffic to all pods in the production namespace. Then selectively allow:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-from-frontend
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
Common NetworkPolicy mistakes:
- Allowing ingress on the api pod while the frontend sits behind a default-deny-egress policy: the api permits inbound from frontend, but the frontend's egress is blocked, so the connection never starts. (NetworkPolicy is stateful; replies to an allowed connection are permitted automatically, so return traffic never needs its own rule.)
- Forgetting that NetworkPolicy is additive: multiple policies that select the same pod are ORed together
- Not allowing DNS egress (port 53 UDP/TCP to CoreDNS): pods cannot resolve names; see the sketch after this list
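A sketch of the DNS egress allowance from the last bullet, using the same k8s-app=kube-dns label the debugging commands above rely on (the kubernetes.io/metadata.name namespace label is set automatically on Kubernetes v1.21+):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}       # all pods in this namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53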
With Cilium (eBPF-based):
Cilium replaces iptables-based enforcement with eBPF programs attached to network interfaces. It supports L7 NetworkPolicy (HTTP path, gRPC method), which iptables cannot do. It also provides a network observability layer (Hubble) showing which pods are communicating with which.
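An L7 rule uses Cilium's own CRD rather than the core NetworkPolicy API. A sketch against the cilium.io/v2 API (labels, port, and path are illustrative):
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-allow
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/endpoint"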
The Full Picture
When pod-a calls http://my-service/endpoint:
- DNS: pod-a queries CoreDNS for my-service.default.svc.cluster.local → gets ClusterIP 10.96.45.100
- NetworkPolicy: egress from pod-a to port 80 is allowed (if policies are configured)
- kube-proxy/iptables: packet to 10.96.45.100:80 is DNAT'd to a real pod IP, say 10.244.2.7:8080
- CNI: packet is routed to node-2 (via VXLAN or VPC routing) and delivered to pod-b's network namespace
- Response: conntrack reverses the NAT, pod-a receives the response from 10.96.45.100 (the Service IP)
Understanding this chain is what lets you debug networking issues systematically instead of randomly — which is exactly what Staff SRE interviewers are looking for.
Want to go deeper?
15 weeks of structured SRE curriculum.
Hone covers every topic in this article — and 100 more — in a structured 15-week path built for engineers aiming at Staff and Principal SRE. Production scenarios, hands-on labs, and Staff-level interview Q&As in every lesson.