
50 Staff SRE Interview Questions (With Full Answers)

Shivan Bhimireddy·April 15, 2025·18 min read

Why Staff SRE Interviews Are Different

A Senior SRE interview tests whether you know how systems work. A Staff SRE interview tests whether you can reason about systems you have never seen before — at scale, under failure, with incomplete information.

The difference in expected answers is stark:

Senior answer to "How does Kubernetes schedule a pod?"

"The scheduler watches for unscheduled pods and assigns them to nodes based on resource availability."

Staff answer:

"The scheduler pipeline has several stages: filtering (which nodes are feasible given resource requests, node selectors, taints/tolerations, and affinity rules), then scoring (which feasible node is best given spread preferences, bin-packing goals, and custom priorities). The binding decision is written to etcd via the API server. The kubelet on the winning node watches for pods assigned to it and starts the container runtime. A pod can get stuck Pending if the filter phase eliminates all nodes — common causes are resource pressure, taint mismatches, or PVCs that can't be satisfied in the available AZs."
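In miniature, that filter-then-score pipeline looks like the sketch below. The node and pod shapes are invented for illustration — the real scheduler works through plugin interfaces with far richer filtering (taints, affinity, volume topology) and scoring:

```python
# Minimal sketch of the scheduler's filter/score pipeline.
# Node and pod dict shapes are simplified illustrations, not the real API.

def filter_nodes(pod, nodes):
    """Filtering: keep only nodes that can satisfy the pod's requests
    and whose taints the pod tolerates."""
    return [
        n for n in nodes
        if n["free_cpu"] >= pod["cpu"]
        and n["free_mem"] >= pod["mem"]
        and pod.get("taints_tolerated", set()) >= set(n.get("taints", []))
    ]

def score_node(pod, node):
    """Scoring: prefer the least-allocated node (a spread-style priority)."""
    cpu_free_after = (node["free_cpu"] - pod["cpu"]) / node["total_cpu"]
    mem_free_after = (node["free_mem"] - pod["mem"]) / node["total_mem"]
    return (cpu_free_after + mem_free_after) / 2

def schedule(pod, nodes):
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None  # pod stays Pending: the filter phase eliminated all nodes
    return max(feasible, key=lambda n: score_node(pod, n))["name"]
```

Note how the Pending state falls out naturally: an empty feasible set means no binding is written, and the pod sits unscheduled until the cluster changes.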

That depth is what Hone trains.


Linux and Kernel Questions

1. A process is hung and unresponsive to SIGTERM. Walk through your diagnosis.

Check /proc/<pid>/status for the process state — if it shows D (uninterruptible sleep), the process is blocked on a kernel-level I/O operation and cannot be killed with SIGTERM or SIGKILL. Common causes: NFS mount not responding, a hung storage driver, or a broken block device. cat /proc/<pid>/wchan shows which kernel function it is sleeping in. dmesg | tail often shows storage or filesystem errors. Resolution: fix the underlying I/O issue (unmount the stuck NFS, resolve the storage fault). A SIGKILL on a D-state process will queue but not execute until the process wakes from kernel sleep.
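The status check is easy to script. A small parser over the content of /proc/&lt;pid&gt;/status (the helper names here are my own) makes the D-state test reusable in a triage tool:

```python
# Sketch: parse the State line of /proc/<pid>/status to spot D-state.
# On a live host, feed it open(f"/proc/{pid}/status").read().

def proc_state(status_text):
    """Return the one-letter state code, e.g. 'D' from 'State:\tD (disk sleep)'."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split()[1]
    return None

def is_unkillable(status_text):
    # D (uninterruptible sleep): blocked inside the kernel on I/O.
    # SIGKILL queues but is not delivered until the process wakes.
    return proc_state(status_text) == "D"
```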

2. Explain the difference between soft and hard interrupts.

Hard interrupts (IRQs) are generated by hardware and handled immediately by the CPU, preempting whatever was running. They are kept short — just enough to acknowledge the hardware and queue work. Soft interrupts (softirqs) are deferred work triggered by hard interrupts, running in kernel context without preemption. Network packet processing (NET_RX_SOFTIRQ) and block device completions run as softirqs. /proc/softirqs shows counts per CPU. High softirq time in top (%si) often indicates network saturation or a NIC not distributing interrupts across CPUs (check /proc/interrupts for IRQ affinity).
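The /proc/softirqs check can be scripted the same way. This sketch flags a CPU doing a disproportionate share of NET_RX work — the helper names and the 10x imbalance factor are illustrative choices, not a standard:

```python
# Sketch: parse /proc/softirqs content and flag CPUs handling a
# disproportionate share of NET_RX (a sign of poor IRQ distribution).

def net_rx_per_cpu(softirqs_text):
    """Return the per-CPU NET_RX counters from /proc/softirqs content."""
    for line in softirqs_text.splitlines():
        if line.strip().startswith("NET_RX:"):
            return [int(v) for v in line.split()[1:]]
    return []

def rx_imbalanced(counts, factor=10):
    """True if the busiest CPU handles > factor x the mean of the rest."""
    if len(counts) < 2:
        return False
    ordered = sorted(counts)
    top, rest = ordered[-1], ordered[:-1]
    mean_rest = sum(rest) / len(rest)
    return top > factor * max(mean_rest, 1)
```

If this flags an imbalance, /proc/interrupts and the NIC's RSS/IRQ affinity settings are the next place to look.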

3. What is the OOM killer and how does it decide what to kill?

When the kernel cannot allocate memory and cannot reclaim enough from page cache, it invokes the OOM killer. Each process gets an oom_score (0–1000) based primarily on its memory footprint relative to total RAM (root-owned processes get a small discount). The process with the highest score is killed. You can set oom_score_adj (-1000 to 1000) to protect critical processes (-1000 = never kill) or make them preferentially targeted (1000). In Kubernetes, a container exceeding its memory limit is OOM-killed because its cgroup hit memory.max — the kill is scoped to that cgroup rather than triggered by system-wide memory pressure — a subtle but important distinction for debugging.
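The selection heuristic can be sketched numerically. Modern kernels compute "badness" as roughly memory footprint scaled to 0..1000 plus oom_score_adj; this mirrors that shape only and is not the exact kernel formula:

```python
# Simplified sketch of OOM-killer victim selection, mirroring the
# kernel's badness heuristic (memory footprint scaled to 0..1000,
# adjusted by oom_score_adj). Not the exact kernel code.

def oom_badness(rss_pages, total_pages, oom_score_adj=0):
    if oom_score_adj == -1000:
        return -1000  # never killed
    score = rss_pages * 1000 // total_pages + oom_score_adj
    return max(score, 0)

def pick_victim(processes, total_pages):
    """processes: list of (name, rss_pages, oom_score_adj) tuples."""
    scored = [(oom_badness(rss, total_pages, adj), name)
              for name, rss, adj in processes]
    scored = [(s, n) for s, n in scored if s > -1000]
    return max(scored)[1] if scored else None
```

Note how a -500 adjustment lets a large process survive while a smaller unadjusted one gets killed — exactly the lever you pull to protect a database on a shared host.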

4. Explain cgroups v2 and how Kubernetes uses them.

cgroups v2 is a unified hierarchy — all controllers (cpu, memory, io, pids) are under a single tree, unlike v1 where each controller had its own hierarchy. This makes it possible to atomically apply resource limits to a group. Kubernetes maps each pod to a cgroup directory under /sys/fs/cgroup/kubepods/. Container CPU requests set cpu.weight (proportional CPU scheduling), CPU limits set cpu.max (quota/period), and memory limits set memory.max (hard limit). Exceeding memory.max causes an OOM kill of the container. The Guaranteed, Burstable, and BestEffort QoS classes map to separate cgroup subtrees, which determines eviction order under node pressure.
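The request/limit-to-file mappings are mechanical. A sketch of the arithmetic — the period constant and the shares-to-weight conversion follow the documented kubelet and kernel conventions, reproduced here as assumptions:

```python
# Sketch: how Kubernetes CPU settings translate into cgroup v2 values.
# Constants follow the documented defaults; treat as illustrative.

CFS_PERIOD_US = 100_000  # default cpu.max period used by the kubelet

def cpu_max(cpu_limit_millicores):
    """cpu.max contents ('<quota> <period>') for a CPU limit, e.g. 500m."""
    quota = cpu_limit_millicores * CFS_PERIOD_US // 1000
    return f"{quota} {CFS_PERIOD_US}"

def cpu_weight(cpu_request_millicores):
    """cpu.weight (1..10000) from a CPU request, via v1-style shares
    (request * 1024/1000) and the kernel's shares-to-weight mapping."""
    shares = max(2, cpu_request_millicores * 1024 // 1000)
    return 1 + ((shares - 2) * 9999) // 262142
```

So a container with a 500m limit gets cpu.max of "50000 100000": it may consume 50ms of CPU per 100ms period before being throttled.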


Kubernetes Questions

5. A deployment rollout is stuck. How do you diagnose it?

kubectl rollout status deployment/my-app shows the stuck state. Then: kubectl describe deployment my-app — look at Conditions (Progressing, Available) and Events. Most common causes: (1) new pods are CrashLoopBackOff — check kubectl logs on the new pod, check previous container logs with -p flag; (2) new pods are pending — check node resource pressure or PVC binding issues; (3) readiness probe failing — the deployment waits for minReadySeconds after the probe passes; (4) PodDisruptionBudget blocking old pod termination — kubectl get pdb and check ALLOWED DISRUPTIONS. Use kubectl rollout undo deployment/my-app to revert if the new version is broken.

6. What is the difference between liveness and readiness probes?

Readiness: controls whether the pod receives traffic from Services. A failing readiness probe removes the pod from Service endpoints but does not restart it. Use for: startup warmup, dependency unavailability (if your DB is down, mark yourself not ready rather than crashing). Liveness: controls whether the pod is restarted. A failing liveness probe triggers a container restart. Use for: detecting deadlocks or hung states that the app cannot self-recover from. Startup probe: a third probe that disables liveness checking until the app has started — prevents liveness from killing a slow-starting container before it is ready. Rule: set readiness probe to your health check endpoint. Only set liveness if you have a specific hung-state problem it solves.
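The probe timing knobs compose arithmetically. A quick sketch of the worst-case window before a hung container gets restarted — an upper bound from the probe parameters, not an exact kubelet guarantee:

```python
# Sketch: upper bound on time before a liveness-probe failure restarts
# a hung container. Parameter names follow the probe spec fields.

def liveness_restart_bound(initial_delay, period, failure_threshold, timeout):
    """The kubelet waits initialDelaySeconds, then probes every
    periodSeconds; after failureThreshold consecutive failures (each
    taking up to timeoutSeconds) the container is restarted."""
    return initial_delay + failure_threshold * (period + timeout)
```

With common values (initialDelaySeconds=30, periodSeconds=10, failureThreshold=3, timeoutSeconds=1), a deadlocked container can sit for up to about a minute before its first restart — worth knowing when you reason about detection latency.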

7. How does etcd achieve consistency and what happens if you lose quorum?

etcd uses the Raft consensus algorithm. Writes require acknowledgment from a majority of members (quorum = floor(N/2) + 1). For a 3-node cluster, quorum = 2; for a 5-node cluster, quorum = 3. If fewer than quorum nodes are available, etcd stops accepting writes to prevent split-brain. The cluster becomes read-only. Kubernetes API server cannot write new state — existing pods keep running but no new scheduling, scaling, or config changes work. Recovery: restore a member from backup or bring enough nodes back online. Always run etcd with 3 or 5 nodes, never 2 or 4 (even numbers do not improve fault tolerance).
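The quorum arithmetic is worth having at your fingertips — it is also the cleanest way to show why even-sized clusters buy nothing:

```python
# Quorum and fault tolerance for an N-member Raft cluster like etcd.

def quorum(n):
    """Writes need acknowledgment from a majority: floor(n/2) + 1."""
    return n // 2 + 1

def tolerable_failures(n):
    """Members you can lose while still accepting writes."""
    return n - quorum(n)
```

A 4-node cluster needs 3 for quorum and so tolerates only 1 failure — the same as a 3-node cluster, with an extra machine's worth of cost and replication traffic.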

8. Explain RBAC in Kubernetes: Role vs ClusterRole, RoleBinding vs ClusterRoleBinding.

Role and RoleBinding are namespaced — they grant permissions within a specific namespace. ClusterRole and ClusterRoleBinding are cluster-scoped — they grant permissions across all namespaces or on cluster-scoped resources (nodes, PersistentVolumes, namespaces themselves). You can bind a ClusterRole with a RoleBinding to restrict it to one namespace — useful for reusable role definitions. Minimal privilege: each service account should have only the verbs it needs on only the resources it uses. A pod that only reads ConfigMaps does not need get on Secrets. Audit with kubectl auth can-i --list --as system:serviceaccount:ns:sa-name.


AWS Questions

9. Design a multi-AZ VPC for a production EKS cluster.

Three AZs. Public subnets (one per AZ) for the ALB only — no EC2 or EKS nodes in public subnets. Private subnets (one per AZ) for EKS nodes and RDS. One NAT Gateway per AZ (not shared — AZ failure would kill outbound traffic from other AZs). VPC endpoints for ECR API, ECR DKR, S3, STS, CloudWatch Logs — eliminates NAT Gateway costs for AWS API calls and improves reliability. Route tables: public subnets route 0.0.0.0/0 to IGW; private subnets route 0.0.0.0/0 to their AZ's NAT Gateway. EKS nodes need tags kubernetes.io/cluster/<name>=owned and subnet tags for the load balancer controller to discover subnets.
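The subnet carving itself is mechanical. A sketch with Python's ipaddress module — the /16 VPC, /20 privates, and /24 publics are illustrative size choices, not a prescription:

```python
# Sketch: carving a VPC CIDR into per-AZ private and public subnets.
# CIDR sizes are illustrative: big /20s for nodes, small /24s for the ALB.
import ipaddress

def carve_subnets(vpc_cidr="10.0.0.0/16", azs=("a", "b", "c")):
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=20))
    # One /20 private subnet per AZ for EKS nodes and RDS.
    private = {az: str(blocks[i]) for i, az in enumerate(azs)}
    # Split the next /20 into /24 public subnets, one per AZ, for the ALB.
    publics = list(blocks[len(azs)].subnets(new_prefix=24))
    public = {az: str(publics[i]) for i, az in enumerate(azs)}
    return {"private": private, "public": public}
```

Sizing the privates generously matters in EKS: with VPC CNI, every pod consumes a real subnet IP, so undersized private subnets become a scheduling bottleneck.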

10. Explain IAM policy evaluation order.

AWS evaluates policies in this order: (1) Explicit Deny — always wins, from any policy source; (2) Service Control Policies (SCPs) — org-level guardrails; the action must be allowed by every SCP in the chain or it is denied; (3) Resource-based policies — S3 bucket policies, KMS key policies; (4) Identity-based policies — IAM user/role policies; (5) Permission boundaries — if set, the effective permission is the intersection with identity policies; (6) Session policies — further restrict assumed-role sessions. The default is implicit deny — if no policy explicitly allows an action, it is denied. Common mistake: attaching an identity policy that allows s3:GetObject while the bucket policy explicitly denies cross-account access — the explicit deny wins.
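The two core rules — explicit deny always wins, and implicit deny is the default — can be modeled in a few lines. This is a deliberately simplified sketch: it treats every policy source as a flat statement list and ignores the SCP/boundary intersection semantics:

```python
# Deliberately simplified model of IAM evaluation: explicit deny from
# any source wins; otherwise some statement must allow; else implicit
# deny. Real evaluation also intersects SCPs, boundaries, and session
# policies, which this sketch does not model.

def evaluate(action, policy_sets):
    """policy_sets: lists of statements from each source (identity,
    resource, ...). Each statement: {'effect': 'Allow'|'Deny',
    'actions': set of action names}."""
    allowed = False
    for policies in policy_sets:
        for stmt in policies:
            if action in stmt["actions"]:
                if stmt["effect"] == "Deny":
                    return "Deny"  # explicit deny wins, from any source
                allowed = True
    return "Allow" if allowed else "Deny"  # implicit deny by default
```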


Observability Questions

11. What is an SLO burn rate alert and why is it better than a simple threshold alert?

A threshold alert fires when a metric crosses a value (error rate > 1%). It has no context — 1% errors for 1 minute is very different from 1% errors for 6 hours. A burn rate alert measures how fast you are consuming your error budget relative to the SLO window. A 1-hour burn rate of 14x means you are consuming budget 14 times faster than the SLO allows — at that rate a 30-day budget is gone in ~51 hours (720 / 14). Multi-window alerts (fast burn: 1h + 5m windows; slow burn: 6h + 30m windows) catch both sudden spikes and slow degradations. The math: burn rate = (current error rate) / (1 - SLO). For a 99.9% SLO, the budget is 0.1%, so a 10% error rate is a 100x burn rate.
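The arithmetic in one place:

```python
# Burn-rate arithmetic for an SLO measured over a 30-day window.

def burn_rate(error_rate, slo):
    """How fast the error budget is being consumed relative to plan.
    A rate of 1.0 means the budget lasts exactly the full window."""
    return error_rate / (1 - slo)

def hours_to_exhaustion(rate, window_days=30):
    """At a sustained burn rate, how long the window's budget survives."""
    return window_days * 24 / rate
```

Plugging in the article's numbers: a 10% error rate against a 99.9% SLO is a 100x burn rate, and a sustained 14x burn exhausts a 30-day budget in about 51 hours.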


Incident Management Questions

12. Walk through the first 5 minutes of a SEV1 incident.

(1) Acknowledge the alert — stops the escalation timer. (2) Declare the incident in your incident channel — #incidents — with a brief description. (3) Assign roles: IC (Incident Commander), Comms (customer/stakeholder updates), Scribe (timeline). (4) Assess blast radius: how many users are affected, is data at risk, is revenue impacted? (5) Start mitigation first, root cause investigation second — restore service before explaining why it broke. The IC should be directing, not hands-on-keyboard. Updates every 10-15 minutes to stakeholders even if there is nothing new to report.


These are excerpts from Hone's 500+ question bank. The full set covers Linux, Kubernetes, AWS, Terraform, CI/CD, GitOps, Prometheus, and incident management — with model answers written at the Staff/Principal level.

Want to go deeper?

15 weeks of structured SRE curriculum.

Hone covers every topic in this article — and 100 more — in a structured 15-week path built for engineers aiming at Staff and Principal SRE. Production scenarios, hands-on labs, and Staff-level interview Q&As in every lesson.