
Expert Kubernetes Interview Questions

Curated expert-level Kubernetes interview questions for developers targeting senior positions. 30 questions available.

Kubernetes Interview Questions & Answers


Welcome to our comprehensive collection of Kubernetes interview questions and answers. This page contains expertly curated interview questions covering all aspects of Kubernetes, from fundamental concepts to advanced topics. Whether you're preparing for an entry-level position or a senior role, you'll find questions tailored to your experience level.

Our Kubernetes interview questions are designed to help you:

  • Understand core concepts and best practices in Kubernetes
  • Prepare for technical interviews at all experience levels
  • Master both theoretical knowledge and practical application
  • Build confidence for your next Kubernetes interview

Each question includes detailed answers and explanations to help you understand not just what the answer is, but why it's correct. We cover topics ranging from basic Kubernetes concepts to advanced scenarios that you might encounter in senior-level interviews.

Use the filters below to find questions by difficulty level (Entry, Junior, Mid, Senior, Expert) or focus specifically on code challenges. Each question is carefully crafted to reflect real-world interview scenarios you'll encounter at top tech companies, startups, and MNCs.

Questions

30 questions
Q1:

How does the Kubernetes API Server internally process an incoming request from authentication to admission?

Expert

Answer

Request pipeline: authentication, authorization, mutating admission, validating admission, schema validation, etcd write, and watch notifications.
Quick Summary: The request flows through: Authentication (verify identity — bearer token, certificate, OIDC), Authorization (RBAC — can this identity do this action on this resource?), Admission Controllers (mutating — modify the object; validating — reject invalid objects), then object validation against the API schema, and finally persistence to etcd.
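The pipeline above can be sketched as a chain of stages, each of which either rejects the request or passes it (possibly modified) to the next. This is a minimal illustrative Python model, not the real apiserver code; all function and field names here are hypothetical.

```python
# Toy model of the API server request pipeline:
# authenticate -> authorize -> mutating admission -> validating admission -> "etcd" write.

class Rejected(Exception):
    pass

def authenticate(req):
    # Stand-in for bearer-token / client-cert / OIDC verification.
    if req.get("token") != "valid-token":
        raise Rejected("401 Unauthorized")
    req["user"] = "alice"
    return req

def authorize(req):
    # RBAC: may this identity perform this verb on this resource?
    allowed = {("alice", "create", "pods")}
    if (req["user"], req["verb"], req["resource"]) not in allowed:
        raise Rejected("403 Forbidden")
    return req

def mutating_admission(req):
    # Mutating webhooks may modify the object (e.g. inject defaults).
    req["object"].setdefault("labels", {})["injected"] = "true"
    return req

def validating_admission(req):
    # Validating webhooks can only accept or reject, never modify.
    if not req["object"].get("name"):
        raise Rejected("422 missing name")
    return req

def handle(req, store):
    for stage in (authenticate, authorize, mutating_admission, validating_admission):
        req = stage(req)
    store[req["object"]["name"]] = req["object"]  # persistence ("etcd write")
    return "201 Created"

store = {}
req = {"token": "valid-token", "verb": "create", "resource": "pods",
       "object": {"name": "web-1"}}
print(handle(req, store))        # 201 Created
print(store["web-1"]["labels"])  # {'injected': 'true'}
```

Note the ordering matters: mutation runs before validation, so validating webhooks always see the final shape of the object.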
Q2:

Why does Kubernetes use a watch mechanism instead of continuous polling for state changes?

Expert

Answer

Watch streams push revision-based updates efficiently; polling overloads API server and causes stale reads.
Quick Summary: Polling would require each component to ask the API server "any changes?" every N seconds — generating massive load and adding latency proportional to the poll interval. Watches are long-lived HTTP connections where the API server pushes events to clients instantly when objects change. Much lower API server load and near-real-time response to state changes.
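A back-of-the-envelope comparison makes the load difference concrete. The numbers below are illustrative, not measured: polling cost scales with client count and poll frequency even when nothing changes, while watch traffic scales with actual object changes.

```python
# Polling vs. watch: rough request/event counts over a one-hour window.

def polling_requests(clients, poll_interval_s, window_s):
    # Every client asks "any changes?" every interval, change or not.
    return clients * (window_s // poll_interval_s)

def watch_events(clients, changes, window_s):
    # Each change is pushed once per interested watcher;
    # window_s is unused because idle watches cost essentially nothing.
    return clients * changes

# 1,000 controllers/kubelets polling every 5s for an hour:
print(polling_requests(1000, 5, 3600))  # 720000 requests, mostly "no change"
# The same clients watching, with 200 object changes in that hour:
print(watch_events(1000, 200, 3600))    # 200000 pushed events, zero idle load
```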
Q3:

How does the API Server deduplicate events during heavy watch traffic?

Expert

Answer

Updates are coalesced so only the latest revision is delivered, reducing event storms.
Quick Summary: During heavy watch traffic, many rapid changes to the same object generate many events. The API server uses event aggregation and watch bookmarks to reduce noise. Bookmarks are synthetic events that advance the client's resourceVersion without carrying object data — helping clients stay current without receiving every intermediate state change.
Q4:

Why does etcd use MVCC instead of overwriting values directly?

Expert

Answer

MVCC stores revisions enabling consistent reads, time-travel, and non-blocking watches.
Quick Summary: MVCC (Multi-Version Concurrency Control) keeps old versions of values tagged with their revision number. This enables Watches to replay history from any past revision without locking current values. It also makes atomic compare-and-swap (CAS) operations safe — etcd compares the current revision before writing. Overwriting directly would lose the history needed for watches.
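A minimal sketch of the MVCC idea, in the spirit of etcd but not its real API: every write appends a new revision instead of overwriting, which makes historical reads and compare-and-swap straightforward.

```python
# Toy MVCC key-value store: revisioned writes, time-travel reads, CAS.

class MVCCStore:
    def __init__(self):
        self.rev = 0
        self.history = {}  # key -> list of (revision, value), append-only

    def put(self, key, value):
        self.rev += 1
        self.history.setdefault(key, []).append((self.rev, value))
        return self.rev

    def get(self, key, at_rev=None):
        # Latest value at or before at_rev ("time travel"); newest if None.
        for rev, value in reversed(self.history.get(key, [])):
            if at_rev is None or rev <= at_rev:
                return value
        return None

    def cas(self, key, expected_rev, value):
        # Write only if the key's newest revision matches what the caller saw.
        versions = self.history.get(key, [])
        current = versions[-1][0] if versions else 0
        if current != expected_rev:
            return None  # conflict: someone else wrote in between
        return self.put(key, value)

s = MVCCStore()
r1 = s.put("/pods/web", "v1")
r2 = s.put("/pods/web", "v2")
print(s.get("/pods/web"))             # v2
print(s.get("/pods/web", at_rev=r1))  # v1   (history preserved)
print(s.cas("/pods/web", r1, "v3"))   # None (stale revision rejected)
print(s.cas("/pods/web", r2, "v3"))   # 3    (succeeds, new revision)
```

The preserved history is exactly what lets a watch replay from a past revision, and the revision check is what makes CAS safe.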
Q5:

How does the scheduler handle scheduling cycles to prevent race conditions in multi-scheduler setups?

Expert

Answer

Each cycle locks a Pod; schedulers rely on leader election or partitioning to avoid double-binding.
Quick Summary: Each scheduler instance runs a scheduling cycle: pick a Pod from the queue, find a feasible node, then bind it. Without coordination, two scheduler instances could both try to bind the same Pod. Kubernetes uses optimistic binding — each scheduler writes a Binding to the API server, which validates it against the Pod's current resourceVersion. Only one binding succeeds; the other is rejected and retried.
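The race and its resolution can be sketched with a toy model of optimistic concurrency. The class and field names below are illustrative stand-ins, not the real Kubernetes API objects.

```python
# Two schedulers race to bind the same Pod; only the write whose
# resourceVersion matches the stored one is accepted.

class ApiServer:
    def __init__(self):
        self.pods = {"web-1": {"resourceVersion": 1, "nodeName": None}}

    def bind(self, pod, seen_version, node):
        current = self.pods[pod]
        if current["resourceVersion"] != seen_version:
            return False  # conflict: Pod changed since this scheduler read it
        current["nodeName"] = node
        current["resourceVersion"] += 1
        return True

api = ApiServer()
seen = api.pods["web-1"]["resourceVersion"]   # both schedulers read version 1

print(api.bind("web-1", seen, "node-a"))  # True  - first binding wins
print(api.bind("web-1", seen, "node-b"))  # False - second is rejected, retried
print(api.pods["web-1"]["nodeName"])      # node-a
```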
Q6:

What is preemption toxicity and how does Kubernetes mitigate it?

Expert

Answer

Cascade evictions are avoided by feasibility checks, retry limits, and controlled preemption.
Quick Summary: Preemption toxicity is when preempting lower-priority Pods to make room for a high-priority Pod causes so many Pod disruptions that the cluster destabilizes. Kubernetes mitigates this with PDB (prevents preempting too many replicas), a minimum preemption threshold, and grace periods — preemption is tried only after no available node is found.
Q7:

How does CRI allow Kubernetes to remain runtime-agnostic?

Expert

Answer

CRI defines gRPC APIs for sandboxing, lifecycle, and images; runtimes implement CRI decoupling kubelet.
Quick Summary: CRI (Container Runtime Interface) is a standard gRPC API between kubelet and the container runtime. Any runtime implementing CRI (containerd, CRI-O) can be used. Kubernetes communicates via CRI calls (RunPodSandbox, CreateContainer, StartContainer) — the runtime handles the implementation details. This lets Kubernetes run on any OCI-compatible runtime without code changes.
Q8:

Why does kubelet maintain a pod sandbox even when containers crash?

Expert

Answer

Sandbox holds Pod-level namespaces ensuring stable networking and IPs across restarts.
Quick Summary: The pod sandbox is the network namespace and cgroup hierarchy for the Pod — created once when the Pod is scheduled. It persists across container restarts. Even if all containers in the Pod crash and restart, the sandbox (Pod IP, network, volumes) stays stable. This is why Pod IP doesn't change when a container restarts inside it.
Q9:

How does Kubernetes prevent deadlocks in the node drain process?

Expert

Answer

Drain respects PDBs, uses backoff, ignores DaemonSets/Static Pods, and processes evictions asynchronously.
Quick Summary: kubectl drain cordons the node (stopping new scheduling) and evicts Pods via the Eviction API. If a Pod has a finalizer that is never removed (a deadlocked controller), eviction blocks and the drain hangs. Drain supports a configurable timeout; when it is exceeded, the drain fails rather than waiting forever, and the operator must intervene — remove the finalizer or force-delete the stuck Pod.
Q10:

Why are long-lived connections tricky behind Kubernetes Services?

Expert

Answer

Established connections bypass load balancing, causing backend imbalance.
Quick Summary: Long-lived connections (HTTP/1.1 keep-alive, WebSockets, gRPC/HTTP/2 streams, persistent database connections) don't benefit from Service load balancing because balancing happens at connection setup: one connection goes to one Pod and stays there. Every request on that connection hits the same Pod — no per-request balancing. Use a service mesh (Istio, Linkerd) for request-level load balancing of persistent connections.
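The imbalance is easy to see in a small simulation: connections are spread evenly, but requests are not. The traffic numbers are purely illustrative.

```python
# Connection-level balancing: connections round-robin across pods,
# but every request is pinned to its connection's pod.

from collections import Counter

pods = ["pod-a", "pod-b", "pod-c"]

# 3 long-lived connections, assigned round-robin (one per pod)...
connections = {f"conn-{i}": pods[i % len(pods)] for i in range(3)}

# ...but conn-0 is a chatty gRPC stream sending 98 requests; the others send 1 each.
requests_per_conn = {"conn-0": 98, "conn-1": 1, "conn-2": 1}

load = Counter()
for conn, n in requests_per_conn.items():
    load[connections[conn]] += n   # every request reuses the same backend

print(dict(load))  # {'pod-a': 98, 'pod-b': 1, 'pod-c': 1} - badly imbalanced
```

A request-level balancer (what a mesh sidecar provides) would instead spread those 100 requests roughly evenly across all three pods.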
Q11:

How does Cilium replace kube-proxy using eBPF?

Expert

Answer

eBPF enables direct routing, load balancing, and policy enforcement without iptables.
Quick Summary: Cilium attaches eBPF programs to network interfaces and socket layers. When a Pod connects to a Service ClusterIP, Cilium's eBPF socket-level program intercepts the connect() syscall and rewrites the destination to a backend Pod's IP directly — bypassing iptables/netfilter entirely, so kube-proxy isn't needed at all. L7 policies work similarly.
Q12:

What are API Priority and Fairness queues and why do they matter?

Expert

Answer

They allocate fair request shares, ensuring critical traffic is never starved by noisy clients.
Quick Summary: API Priority and Fairness (APF) classifies incoming API requests into priority levels (e.g. system, leader-election, workload-high, workload-low). Each level has a queue with concurrency limits. This prevents a burst of low-priority requests (like a misconfigured controller) from blocking critical operations (kubelet heartbeats, leader election). Every request gets fair access within its priority class.
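A drastically simplified model shows the core property: each level has its own concurrency budget, so flooding one level cannot consume another level's slots. This is a toy sketch, not APF's actual queuing (which also shuffles requests into fairness queues).

```python
# Per-priority-level concurrency budgets: low-priority floods cannot
# starve the "system" level's reserved slots.

concurrency_limits = {"system": 2, "workload-low": 2}
in_flight = {"system": 0, "workload-low": 0}
rejected = []

def admit(request_id, level):
    # Each level competes only against its own budget, never others'.
    if in_flight[level] >= concurrency_limits[level]:
        rejected.append(request_id)
        return False
    in_flight[level] += 1
    return True

# A misbehaving controller floods the low-priority level...
for i in range(10):
    admit(f"low-{i}", "workload-low")

# ...yet heartbeats at the "system" level still get through.
print(admit("heartbeat-1", "system"))  # True
print(admit("heartbeat-2", "system"))  # True
print(len(rejected))                   # 8 low-priority requests queued/shed
```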
Q13:

How does Kubernetes ensure strict ordering of writes to a single object under high concurrency?

Expert

Answer

Compare-and-swap with resourceVersion ensures linearizable writes.
Quick Summary: All writes to a single etcd key are serialized by etcd's Raft protocol. The API server includes the resourceVersion in writes — etcd rejects writes where the version doesn't match current (optimistic locking). The Raft leader serializes competing writes to the same key — only one succeeds per revision. This provides single-object write linearizability.
Q14:

Why do multi-cluster architectures require federation or service meshes?

Expert

Answer

Cross-cluster discovery, routing, failover, and policy require abstractions beyond core Kubernetes.
Quick Summary: Each cluster is an isolated control plane — Pods in cluster A can't directly discover or communicate with Services in cluster B. Multi-cluster networking (Submariner, Cilium Cluster Mesh) creates cross-cluster service discovery and connectivity. Service meshes (Istio multi-cluster) extend traffic management across clusters. Federation (KubeFed) synchronizes resources across clusters.
Q15:

How does kube-controller-manager prevent infinite reconciliation loops?

Expert

Answer

Rate-limited work queues with exponential backoff break infinite retry loops.
Quick Summary: Controllers use the level-triggered model — they don't replay a sequence of events; they compare the current desired and actual state. After any reconciliation attempt, they re-read the current state from the API server; if it already matches the desired state, they do nothing. Errors are retried through rate-limited workqueues with exponential backoff, so this idempotent design converges instead of looping forever.
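The convergence property can be demonstrated with a toy replica reconciler: once actual matches desired, repeated invocations are no-ops. This is an illustrative sketch of the pattern, not real controller code.

```python
# Level-triggered, idempotent reconcile: act only on the diff between
# desired and actual; an already-converged state produces no actions.

def reconcile(desired_replicas, actual):
    # `actual` is a mutable list of replica names (the "current state",
    # re-read on every pass in a real controller).
    actions = []
    while len(actual) < desired_replicas:
        actual.append(f"replica-{len(actual)}")
        actions.append("create")
    while len(actual) > desired_replicas:
        actual.pop()
        actions.append("delete")
    return actions  # empty list == converged, nothing to do

state = ["replica-0"]
print(reconcile(3, state))  # ['create', 'create'] - scales up
print(reconcile(3, state))  # [] - converged, second pass is a no-op
print(reconcile(2, state))  # ['delete'] - scales down
```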
Q16:

Why do CSI drivers require both node and controller plugins?

Expert

Answer

Controller provisions/attaches; node plugin mounts/unmounts and performs filesystem ops.
Quick Summary: The controller plugin handles cluster-level operations (creating, deleting, snapshotting cloud volumes) — it runs as a Pod, often on control plane nodes. The node plugin handles node-level operations (attaching the volume to the node, mounting it into the container) — it runs as a DaemonSet on all nodes. Both are required for full persistent volume lifecycle management.
Q17:

How does Kubernetes maintain consistency for ConfigMaps and Secrets projected into Pods?

Expert

Answer

Kubelet updates atomic symlink-based volumes in tmpfs when API server changes.
Quick Summary: kubelet watches ConfigMaps and Secrets via the API server. When a change is detected, kubelet updates the projected files in the Pod's volume (by default within roughly its 60s sync period). Environment variables injected from ConfigMaps/Secrets never update — the Pod must restart. Volume mounts update dynamically; env var injections do not.
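The headline mentions atomic symlink-based updates: kubelet writes new file contents into a fresh versioned directory and then atomically repoints a single symlink, so a reader never sees a half-written config. A minimal sketch of that trick (the directory layout and names here are illustrative, not kubelet's exact on-disk format):

```python
# Atomic config publication via symlink swap, in the style of kubelet's
# projected-volume updates.

import os
import tempfile

root = tempfile.mkdtemp()

def publish(version_dir, contents):
    # 1. Write the new payload into its own versioned directory.
    d = os.path.join(root, version_dir)
    os.mkdir(d)
    with open(os.path.join(d, "config"), "w") as f:
        f.write(contents)
    # 2. Atomically swap a single symlink to point at the new directory.
    tmp_link = os.path.join(root, "..data_tmp")
    os.symlink(d, tmp_link)
    os.replace(tmp_link, os.path.join(root, "..data"))  # atomic rename on POSIX

def read_current():
    with open(os.path.join(root, "..data", "config")) as f:
        return f.read()

publish("v1", "log_level=info")
print(read_current())   # log_level=info
publish("v2", "log_level=debug")
print(read_current())   # log_level=debug - swapped in one atomic step
```

Because the swap is a single rename, a concurrent reader sees either the old version or the new one, never a mixture.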
Q18:

Why is vertical scaling of etcd limited even with powerful hardware?

Expert

Answer

Consensus latency, fsync overhead, and majority replication limit scaling.
Quick Summary: etcd uses Raft — all writes go through one leader. Even with more powerful hardware, a single leader serializes all writes — adding more CPU or RAM doesn't add more write throughput. etcd's bottleneck is disk fsync latency (every commit requires fsync). Use fast NVMe SSDs, dedicated etcd nodes, and limit etcd object count rather than scaling hardware vertically.
Q19:

How does Kubernetes handle stale endpoints when Pods die abruptly?

Expert

Answer

EndpointSlice controller removes endpoints after Pod deletion or NotReady signals.
Quick Summary: When a Pod dies abruptly (OOM kill, node failure), its endpoint isn't removed from the Service endpoint list until the endpoint controller learns of the Pod's deletion via the API server. This propagation takes seconds. During that window, traffic still routes to the dead Pod's IP. kube-proxy retries help; readiness probes and preStop hooks minimize the window.
Q20:

How does the scheduler prevent double-binding a Pod to multiple nodes?

Expert

Answer

Posting a Binding object updates Pod status; other schedulers skip already-bound Pods.
Quick Summary: The scheduler uses optimistic binding — it chooses a node and then writes a Binding object to the API server, which validates it against current cluster state. Two schedulers could both choose the same node for the same Pod, but only the first binding write succeeds (the second fails due to the Pod's resourceVersion having changed). The losing scheduler must reschedule that Pod.
Q21:

Why does IPVS mode scale better under tens of thousands of Services?

Expert

Answer

IPVS performs constant-time routing; iptables rule chains grow linearly.
Quick Summary: iptables stores rules in kernel memory as linked lists — lookup is O(n) where n is the number of rules. With 10,000 Services, each packet traverses thousands of rules. IPVS uses kernel hash tables — lookup is O(1) regardless of Service count. IPVS also supports more load balancing algorithms (round-robin, least-connection, shortest-expected-delay) natively.
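The data-structure difference can be modeled directly: a linear rule chain scanned per lookup versus a hash table keyed by destination. The IPs and scale here are illustrative.

```python
# O(n) chain scan (iptables-style) vs O(1) hash lookup (IPVS-style)
# for 10,000 "services".

# ClusterIP -> backend pod, 10,000 entries
rules_list = [(f"10.0.{i // 256}.{i % 256}", f"pod-{i}") for i in range(10000)]
rules_hash = dict(rules_list)

def lookup_iptables(dst_ip):
    # O(n): walk the chain until a rule matches, like iptables NAT chains.
    for ip, backend in rules_list:
        if ip == dst_ip:
            return backend
    return None

def lookup_ipvs(dst_ip):
    # O(1): single hash probe regardless of how many services exist.
    return rules_hash.get(dst_ip)

target = "10.0.39.15"   # service number 9999, last in the chain
print(lookup_iptables(target))  # pod-9999 - after scanning 10,000 rules
print(lookup_ipvs(target))      # pod-9999 - one hash probe
```

With kernel-resident rules the constants differ, but the asymptotic shape is the same: per-packet cost grows with Service count under iptables and stays flat under IPVS.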
Q22:

Why must RBAC roles be tightly scoped in multi-team clusters?

Expert

Answer

Poor scoping allows privilege escalation through rolebinding or secret edits.
Quick Summary: Broad RBAC roles (like cluster-admin) given to developer ServiceAccounts allow any Pod they deploy to read all secrets, modify any Deployment, or delete any resource cluster-wide. A compromised Pod becomes a cluster takeover. Tight scoping (namespace-scoped roles, specific verbs on specific resources) limits the blast radius of any compromised workload.
Q23:

How do admission webhooks affect cluster latency?

Expert

Answer

All writes pass through webhooks; slow webhooks stall API server requests.
Quick Summary: Admission webhooks are synchronous HTTP calls — the API server must wait for the webhook response before proceeding. A slow or unresponsive webhook blocks all object creation/modification of that type across the cluster. Misconfigured webhook failurePolicy can either block all operations (Fail) or silently skip validation (Ignore). Keep webhooks fast, stateless, and HA-deployed.
Q24:

Why is kubelet’s node lease critical for large clusters?

Expert

Answer

NodeLease reduces API load by sending lightweight heartbeats instead of full Node updates.
Quick Summary: In large clusters (1000+ nodes), updating full Node status objects (large JSON) every 10 seconds from every node would flood the API server. Node leases use tiny Lease objects (just a timestamp) for heartbeats — 90% less API server load per node. Full Node status updates still happen, but only when the node's status actually changes — far less frequently.
Q25:

How does Kubernetes avoid excessive DNS traffic due to frequent Pod restarts?

Expert

Answer

CoreDNS caches, EndpointSlices reduce entries, and stable ClusterIP reduces re-resolution.
Quick Summary: Every DNS query from every Pod hits CoreDNS. In clusters with frequent Pod restarts, new Pods look up Service names and ExternalName records. Under heavy DNS load, CoreDNS becomes a bottleneck. NodeLocal DNSCache runs a DNS cache DaemonSet on each node — Pods hit the local cache first, dramatically reducing CoreDNS load and DNS query latency.
Q26:

Why does Pod Disruption Budget not protect against node failures?

Expert

Answer

PDB applies only to voluntary disruptions; node crashes bypass it.
Quick Summary: PDB is enforced only during voluntary disruptions processed by the Eviction API — drain operations, cluster upgrades. Node failures are involuntary — the node goes down hard, and Kubernetes terminates all its Pods immediately without consulting PDB. Use Pod Anti-Affinity across nodes/zones to ensure replicas are physically separated — that's your protection against node failures.
Q27:

Why can aggressive API Server audit logging degrade performance?

Expert

Answer

Audit logs are synchronous; heavy logging slows API request handling.
Quick Summary: Every API request is logged with its user, action, resource, timestamp, and response — including the full request/response body if configured. At high request rates (busy clusters), audit logging writes megabytes per second to disk, synchronously, blocking API server request processing. Use structured logging, filter out high-volume read requests, and write to a fast storage backend.
Q28:

How does pod-level Seccomp differ from AppArmor or SELinux?

Expert

Answer

Seccomp filters syscalls; AppArmor/SELinux enforce filesystem and process restrictions.
Quick Summary: Seccomp filters system calls at the kernel level — it blocks specific syscalls entirely (like ptrace, keyctl) before they reach the kernel. AppArmor restricts what filesystem paths and network operations a process can perform. SELinux uses mandatory type enforcement on all resources. They're complementary — seccomp limits syscalls, AppArmor/SELinux limit resource access.
Q29:

Why do Node Local DNS caches drastically improve performance in large clusters?

Expert

Answer

Local caches reduce CoreDNS load and latency for frequent lookups.
Quick Summary: NodeLocal DNSCache deploys a CoreDNS instance as a DaemonSet on each node, listening on a link-local IP. Pod DNS resolvers are configured to query this local cache first. Cache hits are answered in microseconds without leaving the node. Only cache misses (first lookup of a new name) go to the cluster CoreDNS. Reduces DNS latency from milliseconds to microseconds for cached lookups.
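The load reduction follows from basic cache behavior: only misses leave the node. A toy model (the service name and IP below are made up):

```python
# Node-local DNS cache model: hits answered on-node, misses forwarded
# to a stand-in for the central cluster CoreDNS.

upstream_queries = 0

def cluster_dns(name):
    # Stand-in for the cluster CoreDNS deployment.
    global upstream_queries
    upstream_queries += 1
    return {"web.default.svc": "10.96.0.10"}.get(name)

cache = {}

def node_local_resolve(name):
    if name in cache:           # hit: answered locally, microseconds
        return cache[name]
    ip = cluster_dns(name)      # miss: one round trip to cluster CoreDNS
    cache[name] = ip
    return ip

for _ in range(1000):
    node_local_resolve("web.default.svc")

print(upstream_queries)  # 1 - 999 of 1,000 lookups never left the node
```

A real deployment also has to handle TTL expiry and negative caching, which this sketch omits.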
Q30:

What architectural principles make Kubernetes eventually consistent?

Expert

Answer

Components use cached informers and async reconciliation; etcd is strongly consistent but cluster converges eventually.
Quick Summary: Kubernetes is eventually consistent because: controllers reconcile asynchronously (there's always a lag between desired state change and actual state), etcd watch propagation has latency, kube-proxy endpoint updates have a delay, and DNS TTL means service discovery lags. The system converges to the desired state, but there's always a window where actual and desired state differ.

