Operating Systems
This assumes you know nothing beyond "a computer runs programs," and takes you to the point where you can answer any OS interview question out loud. We build each idea from the one before it — no leaps, no hand-waving. Grounded in OSTEP, CS:APP, and Kerrisk. Read it top to bottom once; then the labs and the Interview Gauntlet at the end will feel like review.
1 · What a computer is actually doing
Before any operating system exists, understand the bare machine, because everything the OS does is a trick played on top of this one simple loop.
A computer has two core parts: a CPU (the part that does things) and memory / RAM (the part that remembers things). Think of RAM as an enormous row of numbered mailboxes, each holding one byte. The number of a mailbox is its address. That's all memory is: a giant array of bytes, each with an address like 0, 1, 2, … up into the billions.
The CPU does exactly one thing, forever, billions of times a second — the fetch–decode–execute cycle:
To do this the CPU has a handful of tiny, ultra-fast storage slots called registers — think of them as the CPU's hands, holding the few values it's working with right now. Two registers matter enormously: the program counter (PC, also called the instruction pointer) holds the address of the next instruction to fetch, and the stack pointer holds the top of the current function's workspace (more on the stack later). "Running a program" literally means: point the program counter at the program's first instruction and let the loop run.
An instruction is astonishingly primitive: "add these two registers," "copy this byte from memory into a register," "if this register is zero, jump to that address." A program — a web browser, a database — is millions of these tiny steps. The magic of computing is entirely in the arrangement, not the individual step.
Now the central problem the OS exists to solve: a CPU can only run one instruction stream at a time, and there is only one pool of RAM — yet you want to run hundreds of programs, and none of them should be able to read or corrupt another's memory, crash the machine, or hog the CPU forever. The operating system is the program that makes one CPU and one RAM feel like a private computer to each of the hundreds of programs running on it. Every section below is one piece of that illusion. Keep the fetch–decode–execute loop in mind: it never stops, and the OS's job is to keep swapping whose instructions it runs and whose memory it sees.
2 · From a file on disk to a living process
Start with the difference between a program and a process, because interviewers love this and beginners blur it. A program is a dead, passive thing: a file on disk (like /bin/ls) containing machine instructions and initial data. A process is that program brought to life — an instance running in memory, with a slice of CPU time, its own private memory, and a pile of kernel bookkeeping. One program can become many processes: run ls in three terminals and you have one program, three processes. Recipe vs. the meal being cooked.
When a process is created, the kernel builds a record for it called the process control block (in Linux, a task_struct). It holds everything the kernel needs to know: the process's ID (its PID), its parent's PID, the saved values of all its registers (for when it's not running), a pointer to its memory map, its list of open files, its scheduling priority and state, and much more. When people say "the OS manages a process," they mean it maintains and updates this structure.
The shape of a process's memory
Every process gets its own private memory laid out in a standard pattern. You must know this cold — it explains stack overflows, heap growth, and segfaults:
malloc() / new. Managed by the program, not the kernel directly. A leak = the heap growing forever.How a process is born: fork + exec
Here's a surprise that trips up everyone: a new process is never created from scratch — it's cloned from an existing one. The system call fork() makes a child that is a near-perfect duplicate of the parent. The child then usually calls exec() to replace its own contents with a different program. Your shell does exactly this for every command. Click through the whole birth-to-death lifecycle — this is the answer to "what happens when you run ./a.out?":
The shell calls fork(). The kernel makes a child that duplicates the parent — same code, memory, and open files. But it doesn't really copy the memory: parent and child share the same physical pages, marked read-only (copy-on-write), and a page is only truly copied the instant one of them writes to it. That's why fork is cheap. fork returns twice: 0 to the child, the child's PID to the parent — that single value is how each side knows who it is.
The child calls execve(). The kernel throws away the child's entire memory image and replaces it with the new program's. The PID stays the same; only the contents change. This is why you fork then exec — if you just execed without forking, you'd replace the shell itself.
The kernel maps the executable's sections (code, data) into the new address space, and for a dynamically-linked program invokes the dynamic linker (ld.so) to map shared libraries like libc. Crucially almost nothing is copied into RAM yet — the pages are mapped (the address space knows about them) but not present (not in physical memory).
Control jumps to _start, which calls main(). The very first instruction is on a page that's mapped but not in RAM, so the CPU raises a page fault; the kernel loads that page on demand and resumes. Pages arrive only as the program touches them — demand paging — which is why even a huge binary starts instantly.
On finishing, the program calls exit(). It doesn't vanish — it becomes a zombie, keeping just its exit code, until the parent calls wait() to collect it (reaping). Unreaped children pile up as zombies; if the parent dies first, the orphan is adopted by PID 1, which reaps it. This is precisely why a container's PID 1 must reap children.
3 · Virtual memory: the great illusion
Now the deepest trick in the whole OS. Run the same program twice and inspect both: both claim to be using memory address 0x400000. They can't share the same physical mailbox — so who's lying? Both are. Neither process ever touches real physical memory directly. Each lives inside its own private, fictional memory called a virtual address space, and the hardware silently translates its fake addresses into real ones on every single memory access.
Why go to all this trouble? Three enormous wins: (1) Isolation — process A literally cannot name, let alone read, process B's memory, because their address spaces are separate maps. (2) Simplicity — every program gets the same clean layout starting at the same addresses, regardless of what else is running. (3) Efficiency — the OS can hand out more memory than physically exists, and load pages only when touched.
Pages, page tables, and the MMU
Translation happens in fixed-size chunks called pages (almost always 4 KB). Physical RAM is divided into equal chunks called frames. Each process has a page table — a kernel-maintained lookup from its virtual page numbers to physical frame numbers. A dedicated piece of hardware, the MMU (memory management unit), performs the lookup on every access. Because walking the page table on every access would be far too slow, the CPU caches recent translations in a small, very fast cache called the TLB (translation lookaside buffer).
Here is one address being translated, step by step:
The program executes an instruction touching virtual address 0x004021a8. As far as the program knows, that is the address in real memory. It has no idea translation is about to happen — that obliviousness is the entire point.
The MMU splits the address. With 4 KB pages, the bottom 12 bits are the offset (position within the page, 0x1a8); the rest is the virtual page number (0x00402). Only the page number gets translated; the offset rides along unchanged — a byte's position within a page is the same in virtual and physical.
The MMU first checks the TLB. On a hit, the physical frame comes back in ~1 ns and we jump straight to a physical address. This is why tight loops and hot data are fast — their translations stay cached. A well-behaved program has a high TLB hit rate.
On a miss, the MMU walks the page table in memory (on x86-64, up to four levels of tables — a tree, so the table itself doesn't need to be one giant contiguous block). That's several extra memory reads. The result is cached back in the TLB. Frame + offset = the real physical address, and the access finally happens.
If the page-table entry says the page isn't in RAM, the CPU raises a page fault and jumps into the kernel. A minor fault: the page is valid but not yet resident (first touch, or shared) — the kernel allocates/zeroes a frame and retries. A major fault: the page must be read from disk (from the executable, or from swap) — slow. An invalid access (a bad pointer): the kernel refuses and sends SIGSEGV. A segfault is just a page fault the kernel wouldn't satisfy.
The two numbers interviewers ask about
VSZ (virtual size) = how much address space a process has mapped, including memory reserved but never touched. RSS (resident set size) = how much is actually in physical RAM right now. VSZ is the promise; RSS is the bill. Because pages become resident only on demand, a process routinely has a huge VSZ and a modest RSS. This is also what lets Linux overcommit: hand out more virtual memory than it physically has, betting not everyone cashes their promise at once. When that bet fails, the OOM killer collects (final section). And when RAM is tight, the kernel can push cold pages out to swap (disk) to free frames — cheap if the pages are rarely used, catastrophic ("swap thrashing") if the working set genuinely exceeds RAM.
4 · How a program actually gets memory
You write malloc(64) or new or [] and get memory. Where does it come from? Beginners assume "the OS hands it over each time." Wrong — and the truth is a favourite interview probe.
Recall the address-space layout: the stack and the heap. They serve different needs:
malloc/new) that outlives the function that created it. You must free it (or a garbage collector must) — forget, and it's a memory leak. Managed by an allocator inside your process.malloc doesn't usually call the kernel
Here's the key insight. Your program links a memory allocator (part of libc). When you malloc, the allocator hands you a slice from a pool of memory it already obtained from the kernel — no syscall needed most of the time. Only when its pool runs low does it ask the kernel for more, via brk/sbrk (grow the heap's top) or mmap (map a fresh region, used for large allocations). Freeing usually returns memory to the allocator's pool, not to the OS — which is why a process's RSS often doesn't shrink after you free: the memory is free within the process but still held from the OS's view. This surprises people debugging "why didn't my RSS drop after I freed everything?"
Two consequences you'll be asked about. Fragmentation: after many allocations and frees of different sizes, the pool can become swiss-cheese — plenty of total free space, but no single contiguous chunk big enough, so the allocator must ask the OS for more anyway. And the layer beneath: because of overcommit and demand paging (Section 3), even a successful malloc of a gigabyte doesn't consume a gigabyte of RAM until you write to those pages — the pages fault in one at a time as you touch them.
5 · The syscall boundary: asking the kernel for anything
A user program cannot open a file, send a packet, or read the clock by itself. By deliberate design, it has no direct access to hardware. Here's the enforcement mechanism. The CPU runs in one of two privilege modes: user mode (called ring 3), where your programs live and where privileged instructions simply don't work, and kernel mode (ring 0), where the OS runs with full access to everything. This wall is enforced by the hardware itself — a program in user mode that tries to touch a device or another process's memory doesn't succeed-but-get-caught; the CPU physically refuses and faults. This, plus virtual-memory isolation, is the entire basis of OS security.
So how does a program get anything done? It asks the kernel through a system call — a single, controlled doorway from user mode into kernel mode. Every interaction with the outside world — read, write, open, connect, mmap, fork — is a syscall. Walk the crossing:
The program wants to read(). It puts the syscall number in a register (rax) and the arguments (which file descriptor, where to put the bytes, how many) in others, then runs the syscall instruction. In C you just call read(); libc is a thin wrapper doing exactly these register moves.
The syscall instruction is a controlled trap: the CPU switches to kernel mode and jumps to one fixed entry point the kernel registered at boot. The program can't choose where in the kernel it lands — that's the safety property. There is exactly one supervised doorway in.
The kernel reads the number from rax and indexes the syscall table to find the handler — 0 is read. It then validates the arguments: is that buffer pointer actually inside the caller's own address space? (A malicious program might pass a kernel address hoping to trick the kernel into writing there — the check stops it.)
Now with full privilege, the kernel does what the program couldn't: talk to the disk or socket, copy data into the user's buffer. If the data isn't ready (an empty socket), the kernel may put the process to sleep and run someone else — the syscall blocks. This is the exact moment a process enters an S or D state (next sections).
The kernel puts the result (bytes read, or a negative error code) in rax, switches back to user mode, and the program continues right after its syscall instruction. libc turns a negative return into the familiar errno. The program never knew it left ring 3.
Two things to carry forward. Syscalls are not free — each is a mode switch with real overhead, so fast code batches them (one big write beats a thousand tiny ones; io_uring exists to amortise them). And strace works by intercepting exactly this boundary, so its output is the truest possible description of what a program actually does.
6 · Interrupts: how the CPU gets pulled away
One puzzle the fetch–decode–execute loop can't explain on its own: if the CPU is busy running program A's instructions forever, how does it ever notice the keyboard was pressed, a network packet arrived, the disk finished, or that A has run long enough and B deserves a turn? The answer is interrupts, and they're the heartbeat of the whole system.
An interrupt is a hardware signal that says "stop what you're doing and handle me." When one fires, the CPU immediately pauses the current program (saving just enough state), switches to kernel mode, and runs a small kernel routine called an interrupt handler (or ISR). When the handler finishes, the CPU restores the interrupted program and resumes as if nothing happened. Devices use interrupts so the CPU never has to sit and poll them: the disk says "your data's ready" by raising an interrupt, freeing the CPU to do other work in the meantime.
syscall instruction. Predictable, requested.The single most important interrupt is the timer interrupt. The kernel programs a hardware timer to fire hundreds of times per second. Each tick yanks the CPU away from whatever program is running and hands control to the kernel — which is the only reason a runaway program (an infinite loop) can't freeze your machine. On each tick the kernel checks: has this program used its time slice? Should someone else run now? That decision is scheduling, next.
7 · The scheduler: sharing one CPU among many
A CPU core runs exactly one instruction stream at a time, yet hundreds of processes appear to run "simultaneously." The trick is time-slicing: give each ready process a tiny slice of CPU (a few milliseconds), then, on the next timer interrupt, forcibly take the CPU back and give it to another — fast enough that everything looks concurrent. Forcibly taking the CPU is called preemption, and an OS that does it (all modern ones) is preemptively multitasking. The kernel component that decides who runs next is the scheduler; Linux's is the Completely Fair Scheduler (CFS), which hands out CPU time roughly in proportion to priority, always favouring whoever has run the least so far. Priority is nudged by the nice value (−20 greediest to +19 most generous).
The context switch — and why it's expensive
Switching from process A to process B is a context switch: the kernel saves A's registers into A's task_struct, loads B's, and (since B has a different address space) points the MMU at B's page table and flushes much of the TLB. Interviewers love "why is a context switch expensive?" The register save/restore is trivial. The real cost is indirect: after the switch, B runs with cold CPU caches and a flushed TLB, so it suffers a burst of cache and TLB misses until its working set warms up again. This is why too many threads, tiny time slices, or heavy lock contention (which forces constant switching) burn CPU in overhead rather than useful work — visible as high %sys and high involuntary-context-switch counts.
Process states — know these cold
At any instant, every process is in one of a handful of states. This table answers half of all "diagnose this process" questions. Hover each:
R tasks.S.kill -9, because it's mid-syscall and can't receive the signal. A pile of D processes = an I/O problem (slow disk, dead NFS mount).wait()s. Uses no CPU/RAM, but a flood means a parent that isn't reaping its children.Ctrl-Z, SIGSTOP) or paused under a debugger. Resumes on SIGCONT.Finally, a classic trap: load average is not CPU usage. On Linux it counts processes that are R and D. So a box can show load average 200 while the CPU is idle — that's 200 processes stuck in D waiting on a wedged disk, not 200 doing work. Reading load average as "CPU busyness" is one of the most common mistakes.
8 · Concurrency: threads, races, locks, deadlock
So far a process has one stream of execution. But you often want several — to use multiple CPU cores, or to keep working while one part waits on I/O. Enter threads: multiple independent streams of execution inside one process, sharing its memory and open files. This sharing is exactly what makes threads powerful and exactly what makes them dangerous.
clone() — a thread is just a task that shares more with its creator. Pick processes for isolation, threads for cheap shared-memory concurrency.The race condition — the root of concurrency bugs
Two threads share a counter x = 0 and each does x = x + 1 a million times. You'd expect 2,000,000. You often get less. Why? Because x = x + 1 isn't one step — it's three: read x, add 1, write x back. If the scheduler switches threads mid-sequence, an update is lost. Step through it:
x is 41. Thread A reads it into its register (41). Before A can finish, the scheduler switches to B, which also reads x — still 41. Both threads now hold "41" privately. The disaster is already baked in.
Each thread adds 1 to its own copy. Both compute 42. Independently. Neither knows the other exists or that they read the same starting value.
A writes 42 back to x. Then B writes 42 back to x. Two increments happened, but x only went up by one. One update was silently lost. This is a race condition: the result depends on the exact interleaving of threads, which is unpredictable — so the bug appears randomly, often only under load, and vanishes when you add a print statement. The worst kind of bug.
The fix: make the three steps atomic — indivisible — with a mutex (mutual-exclusion lock). A thread acquires the lock before touching x and releases it after; while held, no other thread can enter. The read-add-write becomes a critical section only one thread runs at a time. Now no update is lost.
Deadlock — when locks freeze each other
Locks solve races but introduce a new hazard. Thread A holds lock 1 and waits for lock 2; thread B holds lock 2 and waits for lock 1. Neither can proceed; both wait forever. That's deadlock. It requires four conditions simultaneously (the Coffman conditions) — break any one and deadlock is impossible:
One aside that comes up constantly: Python's GIL (global interpreter lock) lets only one thread run Python bytecode at a time, so CPU-bound Python threads don't use multiple cores — which is why Python programs use multiprocessing (separate processes) for CPU parallelism. It's a concrete example of why you'd choose processes over threads.
9 · Signals: the kernel tapping a process on the shoulder
How does Ctrl-C stop a program? How does the kernel tell a process "you segfaulted" or "your child died"? Through signals — tiny asynchronous notifications delivered to a process. A signal interrupts the process's normal flow, and the process either runs a registered handler function, or takes the default action (often: die). Know these by name:
kill. Catchable — a program can run cleanup (flush data, close connections) then exit gracefully. Always try this first.kill -9. Cannot be caught, blocked, or ignored — the kernel just destroys the process. No cleanup runs (risking corrupt state). Use only when SIGTERM fails.Ctrl-C sends. Usually terminates, but catchable (that's how a program cleans up on Ctrl-C).wait() and reap. Ignoring it is how zombies accumulate.SIGSTOP (like Ctrl-Z) freezes a process into the T state; SIGCONT unfreezes it. SIGSTOP, like SIGKILL, can't be caught.This ties two earlier threads together. First, "graceful shutdown": a well-built server catches SIGTERM, stops accepting new work, finishes in-flight requests, then exits — which is exactly what Kubernetes relies on when it stops a pod (it sends SIGTERM, waits, then SIGKILL). Second, it explains the D-state mystery from Section 7: a process in uninterruptible sleep is stuck inside a kernel syscall and cannot receive any signal, not even SIGKILL — so kill -9 genuinely does nothing until the I/O it's blocked on completes.
10 · File descriptors and the I/O models
Now the abstraction that makes Unix Unix. When a process opens a file, the kernel doesn't hand back the file — it hands back a small integer, a file descriptor (fd). Every later operation names the resource by that integer. The profound part: files are not special. Pipes, network sockets, terminals, even the kernel's own event queues are all reached through the identical fd interface (read/write/close). "Everything is a file" really means "everything is a file descriptor" — and it's why one piece of code, one event loop, can juggle files and sockets and timers together.
Every process starts with three fds open: 0 = stdin, 1 = stdout, 2 = stderr — which is why 2>&1 redirects errors to output (you point fd 2 at whatever fd 1 points to). New fds take the lowest free integer, so the next file you open is usually fd 3. Underneath, an fd is one end of a three-link chain:
fork copies them). Browse them live at /proc/<pid>/fd.df and du disagree (you'll meet it in Storage).Blocking, non-blocking, and how one program serves 100,000 connections
When you read() a socket with no data, the default is to block — the process sleeps (state S) until data arrives. Fine for one connection; disastrous for thousands, because you'd need a thread per connection (crushing context-switch and memory cost — this is the famous C10K problem). The solution combines three things you now understand: mark the fds non-blocking (a read with no data returns immediately instead of sleeping), and use an event-notification syscall — epoll on Linux — to ask the kernel "tell me which of these 100,000 fds are ready." A tiny pool of threads (about one per core) each runs an event loop that epoll_waits and services only the ready fds. That single design — non-blocking fds + epoll — is the architecture behind nginx, Redis, and Node.js. It's why "everything is an fd" is more than a slogan: one loop watches every kind of I/O at once.
And the failure mode: every fd costs a table slot, the kernel caps how many you may hold (ulimit -n), and a program that leaks fds (opening without closing) eventually hits EMFILE — too many open files and can't accept new connections. You'll reproduce exactly that in the labs.
11 · The memory hierarchy: why programs wait
Every performance intuition rests on one fact: memory is a hierarchy, and each level down is dramatically slower than the one above. The CPU can do arithmetic in a fraction of a nanosecond, but reaching main memory takes ~100 ns and reaching a spinning disk takes tens of thousands of times longer. All the machinery you've met — CPU caches, the TLB, the page cache — exists to keep the data the CPU needs as high up this pyramid as possible. Hover each tier and burn the orders of magnitude into memory (the exact numbers vary; the ratios are the point):
Two ideas follow. Locality: memory moves between levels in cache lines (~64 bytes), so touching data that sits close together (a contiguous array) is far faster than scattered access (a linked list of pointers all over the heap) — same number of accesses, wildly different speed, because one enjoys cache hits and the other suffers misses. The page cache: Linux uses spare RAM to cache file data, which is why free shows most memory "used" as buff/cache — that memory is reclaimable, not lost, and is why re-reading a file is instant the second time.
12 · When memory runs out: the OOM killer
Recall two facts: Linux overcommits (Section 3) — it promises more virtual memory than it physically has — and it uses spare RAM as reclaimable page cache (Section 11). Usually this is fine. But when processes actually try to use more physical memory than exists, and the kernel has already shrunk the page cache as far as it can, it's cornered: there are no free frames and something must give.
Enter the OOM killer (out-of-memory killer). Rather than let the whole system freeze, the kernel picks a process and kills it (SIGKILL) to reclaim its memory. It doesn't choose randomly: each process has an OOM score — roughly, how much memory it's using, tunable per process via oom_score_adj — and the highest scorer is sacrificed. In the kernel log you'll find the tell-tale Out of memory: Killed process 1234 (mysqld).
The crucial connection for everything that follows: this is the same mechanism a cgroup memory limit uses locally. Give a container a 512 MB limit and the kernel tracks that cgroup's memory; exceed it and the OOM killer fires inside just that cgroup, killing a process there — while the host may have gigabytes free. That's why a container gets "OOMKilled" on a machine with plenty of RAM: it hit its limit, not the host's. One sentence links this entire module to containers and Kubernetes: a container OOM is virtual memory hitting a wall you configured.
You now have the whole machine in your head: a CPU running one fetch–decode–execute loop, virtual memory giving every process a private world, syscalls and interrupts as the only ways into the kernel, the scheduler time-slicing the CPU, threads and locks for concurrency, signals for control, file descriptors for all I/O, and the memory hierarchy explaining every performance question. Everything above this module — Linux, Docker, Kubernetes — is built from exactly these parts.
Every time a program touches the outside world it crosses into the kernel via a system call. strace intercepts that boundary and shows the truest description of what a program actually does.
strace uses the ptrace syscall to stop the target on every syscall entry and exit. -c tallies them and shows in-kernel time per call — the fast way to spot a program hammering the kernel (thousands of tiny reads) vs doing real work. On a stuck process, strace -p <pid> often shows it spinning on one syscall.
Run strace on a simple command and read the syscall tally.
$ strace -c ls /You get a table of syscalls (execve, openat, read, mmap…). Every row is one favour asked of the kernel; execve at the top is the program being loaded (Section 2).
Reveal solution
$ sudo apt install -y strace # if needed $ strace -c ls / $ strace -e trace=openat ls / # just the file opens
Processes are born by fork (clone) then exec (replace image). Every process has a parent, up to PID 1.
sleep 100 & makes your shell fork a child that execves sleep; the ppid you see is the fork relationship. Kill the shell and the child re-parents to PID 1 — the orphan-reaping rule from Section 2, live.
Start a background sleep, then show it, its PID, and its parent.
$ ps -o pid,ppid,cmd --ppid $$The child sleep appears with ppid = your shell's PID. pstree -p $$ draws the tree.
Reveal solution
$ sleep 100 & $ ps -o pid,ppid,cmd --ppid $$ $ pstree -p $$
Every process has the standard layout from Section 2 — text, data, heap, and stack — each in its own private virtual address space.
/proc/self/maps lists every mapped region with permissions and backing file: the executable's .text (r-x), [heap], [stack], and each shared library. Summed, these regions are roughly your VSZ; the resident subset is your RSS.
Print the memory map of a process and find its heap, stack, and code.
$ cat /proc/self/maps | grep -E "heap|stack|r-xp" You'll see [heap], [stack], and the executable/libraries mapped r-xp (read-execute) — the address-space diagram made real.
Reveal solution
$ cat /proc/self/maps | head -20 $ cat /proc/self/maps | grep -E "heap|stack"
Reach engineers who read the man page
Native, contextual, no tracking — this is how the curriculum stays free.
Files, pipes, and sockets are all reached through small integers — file descriptors. 0/1/2 are stdin/stdout/stderr; the kernel assigns the lowest free integer next.
/proc/<pid>/fd is the per-process fd table made browsable — each entry symlinks to what it points at (a path, socket:[…], pipe:[…]). First place to look for "too many open files" or "which socket is this process holding."
Open a file on fd 3, then inspect the shell's open descriptors.
$ ls -l /proc/$$/fdfd 3 symlinks to the file you opened, next to 0/1/2. A socket would appear as socket:[…] in the same list — same mechanism, different backing object.
Reveal solution
$ exec 3< /etc/hostname $ ls -l /proc/$$/fd $ exec 3<&- # close it
A process can reserve a huge virtual address space (VSZ) while keeping little actually in RAM (RSS), because pages become resident only on demand.
This is overcommit + demand paging in one measurement. RSS climbs as a program touches more of its allocation while VSZ stays put — the pages were always there virtually; touching them made them real. As RSS nears a cgroup limit, the OOM killer looms (Section 12).
Compare virtual size vs resident size across your processes.
$ ps -eo pid,vsz,rss,comm --sort=-rss | headVSZ (address space mapped) is almost always far larger than RSS (actually in RAM). The gap is memory promised but not cashed in.
Reveal solution
$ ps -eo pid,vsz,rss,comm --sort=-rss | headEvery fd costs a slot; the kernel caps how many a process may hold (ulimit -n). Hit the cap and syscalls fail with EMFILE — a real outage where a leaking server slowly stops accepting connections.
It surfaces as Too many open files and accept()/open() returning EMFILE. Diagnose with ls /proc/<pid>/fd | wc -l vs the limit in /proc/<pid>/limits: a steadily climbing count is a leak (fix the code); a one-time spike is an under-provisioned limit (raise it). Raising the limit to hide a leak just delays the crash.
Lower the fd limit hard in a subshell and watch opens fail; then reason about the real fix.
$ bash -c 'ulimit -n 8; for i in $(seq 20); do exec {fd}<>/tmp/f$i || { echo "failed at $i"; break; }; done'It fails partway once the 8-fd budget is spent. The right fix depends on why: raise the soft limit for a legitimate need, or close leaked fds if the count grows unbounded.
Reveal solution
Always diagnose before raising: a monotonically climbing fd count is a leak, not a capacity problem.
$ cat /proc/<pid>/limits | grep "open files" $ ls /proc/<pid>/fd | wc -l $ ulimit -n 4096 # only if it is a real need
You now hold the whole machine: a CPU running one fetch–decode–execute loop; virtual memory giving every process a private world; the heap and stack; syscalls and interrupts as the only doors into the kernel; the scheduler time-slicing the CPU; threads, races, locks and deadlock; signals for control; file descriptors and epoll for all I/O; and the memory hierarchy explaining every performance question. Every module above — Linux, Docker, Kubernetes — is assembled from exactly these parts. A container is a process with namespaces and cgroups; a socket is an fd; an OOM kill is virtual memory hitting a wall you set.
The questions actually asked for SRE, systems, and platform-engineering roles — conceptual, debugging, and "explain it to me" prompts, plus a design question. Each expands to show what the interviewer is really probing for, a model answer, and the follow-up traps. Answer all of these out loud and you have mastered this module.
Q1What happens, step by step, when you run ./a.out?
The whole stack: fork/exec, the loader and dynamic linking, demand paging, process lifecycle — not just "it runs."
The shell forks a child (copy-on-write, so nothing is truly copied yet). The child execves the binary: the kernel discards the copied image and maps the executable's sections, invoking ld.so to map libc and resolve symbols for a dynamic binary. Almost nothing is in RAM — pages are mapped but not present. Control reaches _start → main; the first instruction triggers a page fault and the kernel loads that page on demand. The process runs, entering the kernel via syscalls for any I/O, until it exits and becomes a zombie until its parent waits to reap it.
- "Why does fork return twice?" — 0 in the child, child PID in the parent.
- "Why not copy all memory?" — copy-on-write; copied only when written.
- "Parent exits first?" — orphan re-parents to PID 1, which reaps it.
Q2Explain virtual memory as if I know nothing about it.
Clarity of the mental model plus the machinery (pages, page table, TLB, faults) behind it.
Every process gets its own private, fictional memory and believes it owns all of it. The hardware translates these fake (virtual) addresses to real (physical) ones on every access, in page-sized chunks, via a per-process page table; a small cache (the TLB) makes repeated translations fast. Pages load into RAM on demand through page faults. Payoffs: isolation (processes can't see each other's memory), a clean uniform layout, and overcommit. A segfault is a page fault the kernel refused to satisfy.
- "What's in the page table?" — virtual-page → physical-frame plus permission bits.
- "Why is the TLB critical?" — without it every access needs a multi-level page-table walk.
- "Minor vs major fault?" — allocate/zero a frame vs read from disk/swap.
Q3A process is stuck and won't die even with kill -9. Why, and what do you do?
The D (uninterruptible sleep) state, its link to I/O, and why signals can't reach it.
It's almost certainly in uninterruptible sleep (D): blocked inside a kernel syscall on I/O that can't be interrupted — a slow/dead disk or a hung NFS mount. SIGKILL can't be delivered because the process isn't in a state to receive any signal; it's mid-syscall. You can't force it — you fix the underlying I/O (recover the mount/disk). Confirm with ps -o pid,stat,wchan: D plus a wchan in an I/O wait function.
- "So how do you kill it?" — you can't directly; resolve the I/O it awaits.
- "Effect on load average?" —
Dcounts toward load, so load can be huge while CPU is idle.
Q4Your service's memory grows until it's OOM-killed. How do you diagnose it?
Systematic memory debugging: leak vs cache vs limit, RSS trends, and OOM-killer logic.
Confirm it's a leak, not page cache — track the process's RSS over time (/proc/<pid>/status), not system-wide "used." Monotonic climb that never plateaus = leak in the app; a plateau near the limit = working set larger than the limit. Check the cgroup limit and memory.events for prior OOM kills, and dmesg for Out of memory: Killed process naming the victim. Fixes: raise the limit if the working set is legitimately that big, or fix the leak (heap profiler, unbounded caches/queues).
- "Why that process?" — highest OOM score (~memory used, tunable via
oom_score_adj). - "Container killed but host had free RAM?" — it hit the cgroup limit, not the host.
- "free shows no free memory — bad?" — usually not; most is reclaimable page cache.
Q5Walk me through a race condition and how you'd fix it.
Whether you understand shared mutable state, non-atomic operations, and mutual exclusion.
Two threads share a variable and each does x = x + 1, which is really read-add-write. If the scheduler interleaves them — both read the same old value, both add one, both write back — one increment is lost. The result depends on timing, so the bug is intermittent and load-dependent (a race condition). The fix is mutual exclusion: wrap the read-add-write in a mutex so only one thread is in that critical section at a time, making it atomic. Alternatives: atomic instructions, or avoiding shared mutable state entirely.
- "Why does adding a print make it disappear?" — it changes timing, hiding (not fixing) the race.
- "Downside of locks?" — contention (serialises threads) and deadlock.
Q6What is deadlock and how do you prevent it?
The four Coffman conditions and the practical prevention used in real code.
Deadlock is when threads wait on each other in a cycle and none can proceed — A holds lock 1 and wants 2; B holds 2 and wants 1. It needs four conditions at once: mutual exclusion, hold-and-wait, no preemption, and circular wait. Break any one and it can't happen. The standard real-world fix is to eliminate circular wait by imposing a global lock ordering — every thread always acquires locks in the same order. Other options: acquire all locks at once (kills hold-and-wait), or use try-lock with timeouts and back off (kills no-preemption).
- "Difference from livelock?" — livelock: threads keep reacting to each other and make no progress (not blocked, just busy).
- "How do you detect it in production?" — threads stuck, no CPU use; thread dump shows the cycle.
Q7Process vs thread — what's the difference and when do you choose each?
Shared vs isolated address space and the practical trade-offs.
A process has its own private address space and fds; it's isolated (a crash can't corrupt another) but heavier and needs IPC to share data. Threads live in one process and share its address space and fds, so they communicate through shared memory for free and are cheap to create/switch — but a bug in one can corrupt the whole process, and shared data needs locks. On Linux both come from clone(). Choose threads for cheap shared-memory concurrency; choose processes for fault isolation or to escape a global lock (e.g. Python's GIL).
- "How does the scheduler see them?" — it schedules threads (tasks) individually.
- "Why multiprocessing in Python?" — the GIL prevents threads from using multiple cores for CPU-bound work.
Q8What is a context switch and why is it expensive?
The subtle answer — caches and TLB, not the register copy.
It's the kernel saving one task's CPU state and restoring another's so one core can multiplex many tasks. The register save/restore is cheap; the real cost is indirect. Switching to a different process swaps page tables and flushes much of the TLB, and the incoming task runs with cold CPU caches, suffering cache/TLB misses until its working set warms up. So heavy switching (too many threads, tiny slices, lock contention) burns CPU in overhead — high %sys, high involuntary context-switch counts — rather than useful work.
- "Voluntary vs involuntary?" — blocked on I/O vs preempted by the scheduler.
- "Thread switch within a process cheaper?" — yes, same address space, less TLB damage.
Q9Load average is 200 but CPU utilisation is near zero. Explain.
That Linux load average includes D-state processes, not just CPU demand.
Linux load average counts processes that are runnable (R) and in uninterruptible sleep (D). A load of 200 with idle CPU means ~200 processes stuck in D waiting on I/O — not consuming CPU. The classic cause is a slow or hung storage backend (dying disk, unresponsive NFS): requests pile up in I/O wait. The fix is on the I/O side, not the CPU. Treating load average as "CPU busyness" is the mistake — it's "demand for the system, including I/O waits."
- "How confirm?" —
ps -eo stat | grep -c D,iostat/iotopfor disk saturation, check stuck mounts. - "Why does D count?" — to reflect true system demand including I/O.
Q10Explain a file descriptor, and why sockets, files, and pipes all use them.
The "everything is a file" abstraction and why it's powerful.
An fd is a small integer the kernel gives as a handle to an open resource; all operations name the resource by that integer. The power is uniformity: files, pipes, sockets, terminals, and event queues share one interface (read/write/close), so the same code and the same event loops (epoll) work across all of them. Underneath, the fd indexes a per-process table → a system-wide open-file entry (holding the offset) → the underlying object. fds 0/1/2 are stdin/stdout/stderr, which makes shell redirection work.
- "Why does
2>&1work?" — points fd 2 at whatever fd 1 points to. - "fd exhaustion?" — hitting
ulimit -n→EMFILE, often an fd leak.
Q11SIGTERM vs SIGKILL — what's the difference and why does it matter?
Graceful shutdown, catchable vs uncatchable signals, and the tie to orchestration.
SIGTERM (the default kill) politely asks a process to exit and is catchable, so a well-built program runs cleanup — flush buffers, finish in-flight requests, close connections — then exits. SIGKILL (kill -9) is uncatchable: the kernel destroys the process immediately with no cleanup, risking corrupt/partial state. So you always try SIGTERM first. This is exactly Kubernetes' pod-termination flow: it sends SIGTERM, waits a grace period, then SIGKILL — which is why services implement a SIGTERM handler for graceful shutdown.
- "Which signals can't be caught?" — SIGKILL and SIGSTOP.
- "Why did SIGKILL not work on my process?" — it's in
Dstate, can't receive any signal until its I/O completes.
Q12Design: one box, 100,000 concurrent network connections. How?
Combining fds, non-blocking I/O, and event notification into a systems design (C10K).
Not a thread per connection — 100k threads means crushing context-switch and memory overhead. Instead use non-blocking sockets and an event-driven model: register all connection fds with epoll, and run a small pool of threads (~one per core), each an event loop that epoll_waits and services only the fds that are actually ready. This is why "everything is an fd" matters — one epoll loop watches tens of thousands of sockets. It's the architecture of nginx, Redis, and Node. Keep per-connection state small, scale threads to cores, and never make a blocking syscall inside the loop (it would stall every connection on that thread).
- "Why not
select?" — O(n) per call, capped at FD_SETSIZE;epollis O(ready). - "Where does blocking hurt?" — inside the loop it stalls all that thread's connections; push blocking work to a separate pool.
- "Limits to tune?" —
ulimit -nand ephemeral port range.