sn sysnerdunderstand it from the kernel up
← curriculum map
Operating Systems
Core · ~180 min · 6 labs
map / Concepts / Operating Systems
Concepts Core ~180 min

Operating Systems

This assumes you know nothing beyond "a computer runs programs," and takes you to the point where you can answer any OS interview question out loud. We build each idea from the one before it — no leaps, no hand-waving. Grounded in OSTEP, CS:APP, and Kerrisk. Read it top to bottom once; then the labs and the Interview Gauntlet at the end will feel like review.

1 · What a computer is actually doing

Before any operating system exists, understand the bare machine, because everything the OS does is a trick played on top of this one simple loop.

A computer has two core parts: a CPU (the part that does things) and memory / RAM (the part that remembers things). Think of RAM as an enormous row of numbered mailboxes, each holding one byte. The number of a mailbox is its address. That's all memory is: a giant array of bytes, each with an address like 0, 1, 2, … up into the billions.

The CPU does exactly one thing, forever, billions of times a second — the fetch–decode–execute cycle:

The only loop that matters
FETCHread next instruction from RAM—>DECODEwhat does it mean?—>EXECUTEdo it
A CPU fetches an instruction from memory, figures out what it is, does it, and repeats. Programs are just long lists of these tiny instructions.

To do this the CPU has a handful of tiny, ultra-fast storage slots called registers — think of them as the CPU's hands, holding the few values it's working with right now. Two registers matter enormously: the program counter (PC, also called the instruction pointer) holds the address of the next instruction to fetch, and the stack pointer holds the top of the current function's workspace (more on the stack later). "Running a program" literally means: point the program counter at the program's first instruction and let the loop run.

An instruction is astonishingly primitive: "add these two registers," "copy this byte from memory into a register," "if this register is zero, jump to that address." A program — a web browser, a database — is millions of these tiny steps. The magic of computing is entirely in the arrangement, not the individual step.

Now the central problem the OS exists to solve: a CPU can only run one instruction stream at a time, and there is only one pool of RAM — yet you want to run hundreds of programs, and none of them should be able to read or corrupt another's memory, crash the machine, or hog the CPU forever. The operating system is the program that makes one CPU and one RAM feel like a private computer to each of the hundreds of programs running on it. Every section below is one piece of that illusion. Keep the fetch–decode–execute loop in mind: it never stops, and the OS's job is to keep swapping whose instructions it runs and whose memory it sees.

2 · From a file on disk to a living process

Start with the difference between a program and a process, because interviewers love this and beginners blur it. A program is a dead, passive thing: a file on disk (like /bin/ls) containing machine instructions and initial data. A process is that program brought to life — an instance running in memory, with a slice of CPU time, its own private memory, and a pile of kernel bookkeeping. One program can become many processes: run ls in three terminals and you have one program, three processes. Recipe vs. the meal being cooked.

When a process is created, the kernel builds a record for it called the process control block (in Linux, a task_struct). It holds everything the kernel needs to know: the process's ID (its PID), its parent's PID, the saved values of all its registers (for when it's not running), a pointer to its memory map, its list of open files, its scheduling priority and state, and much more. When people say "the OS manages a process," they mean it maintains and updates this structure.

The shape of a process's memory

Every process gets its own private memory laid out in a standard pattern. You must know this cold — it explains stack overflows, heap growth, and segfaults:

A process's address space (high address at top)
stack ↓Grows downward. Holds local variables and the "who called whom" chain of function calls (return addresses). Each function call pushes a frame; each return pops it. Recurse too deep and it collides with the heap: stack overflow.
↕ (the gap)Unmapped space between stack and heap. The stack grows down into it, the heap grows up into it. Touching an address in the unmapped middle is an invalid access → SIGSEGV.
heap ↑Grows upward. Dynamic memory — everything you malloc() / new. Managed by the program, not the kernel directly. A leak = the heap growing forever.
BSS + dataGlobal variables. data holds initialised globals; BSS holds zero-initialised ones (stored as just a size, zeroed on load).
textThe code itself — the machine instructions, marked read-only and executable so the program can't accidentally overwrite its own code.
Every process sees this same layout in its own private address space. Stack and heap grow toward each other; the code is read-only. This picture answers "what is a stack overflow / segfault?" instantly.

How a process is born: fork + exec

Here's a surprise that trips up everyone: a new process is never created from scratch — it's cloned from an existing one. The system call fork() makes a child that is a near-perfect duplicate of the parent. The child then usually calls exec() to replace its own contents with a different program. Your shell does exactly this for every command. Click through the whole birth-to-death lifecycle — this is the answer to "what happens when you run ./a.out?":

Interactive · from ./a.out to a running process
shellparent— fork() —>shell copychild

The shell calls fork(). The kernel makes a child that duplicates the parent — same code, memory, and open files. But it doesn't really copy the memory: parent and child share the same physical pages, marked read-only (copy-on-write), and a page is only truly copied the instant one of them writes to it. That's why fork is cheap. fork returns twice: 0 to the child, the child's PID to the parent — that single value is how each side knows who it is.

shell copy— execve("a.out") —>a.out image

The child calls execve(). The kernel throws away the child's entire memory image and replaces it with the new program's. The PID stays the same; only the contents change. This is why you fork then exec — if you just execed without forking, you'd replace the shell itself.

ELF binary+ld.so—>mapped, not yet loaded

The kernel maps the executable's sections (code, data) into the new address space, and for a dynamically-linked program invokes the dynamic linker (ld.so) to map shared libraries like libc. Crucially almost nothing is copied into RAM yet — the pages are mapped (the address space knows about them) but not present (not in physical memory).

_start → main1st instruction —>page fault → RAM

Control jumps to _start, which calls main(). The very first instruction is on a page that's mapped but not in RAM, so the CPU raises a page fault; the kernel loads that page on demand and resumes. Pages arrive only as the program touches them — demand paging — which is why even a huge binary starts instantly.

running— exit() —>zombieuntil parent wait()s

On finishing, the program calls exit(). It doesn't vanish — it becomes a zombie, keeping just its exit code, until the parent calls wait() to collect it (reaping). Unreaped children pile up as zombies; if the parent dies first, the orphan is adopted by PID 1, which reaps it. This is precisely why a container's PID 1 must reap children.

fork = cheap copy-on-write clone · exec = replace the image · PID survives exec · memory loads on demand · exit → zombie → reaped by wait().

3 · Virtual memory: the great illusion

Now the deepest trick in the whole OS. Run the same program twice and inspect both: both claim to be using memory address 0x400000. They can't share the same physical mailbox — so who's lying? Both are. Neither process ever touches real physical memory directly. Each lives inside its own private, fictional memory called a virtual address space, and the hardware silently translates its fake addresses into real ones on every single memory access.

Why go to all this trouble? Three enormous wins: (1) Isolation — process A literally cannot name, let alone read, process B's memory, because their address spaces are separate maps. (2) Simplicity — every program gets the same clean layout starting at the same addresses, regardless of what else is running. (3) Efficiency — the OS can hand out more memory than physically exists, and load pages only when touched.

Pages, page tables, and the MMU

Translation happens in fixed-size chunks called pages (almost always 4 KB). Physical RAM is divided into equal chunks called frames. Each process has a page table — a kernel-maintained lookup from its virtual page numbers to physical frame numbers. A dedicated piece of hardware, the MMU (memory management unit), performs the lookup on every access. Because walking the page table on every access would be far too slow, the CPU caches recent translations in a small, very fast cache called the TLB (translation lookaside buffer).

Here is one address being translated, step by step:

Interactive · translating virtual address 0x004021a8
virtual 0x004021a8

The program executes an instruction touching virtual address 0x004021a8. As far as the program knows, that is the address in real memory. It has no idea translation is about to happen — that obliviousness is the entire point.

page# 0x00402+offset 0x1a8

The MMU splits the address. With 4 KB pages, the bottom 12 bits are the offset (position within the page, 0x1a8); the rest is the virtual page number (0x00402). Only the page number gets translated; the offset rides along unchanged — a byte's position within a page is the same in virtual and physical.

TLBhit →frame 0x1f3 · ~1 ns

The MMU first checks the TLB. On a hit, the physical frame comes back in ~1 ns and we jump straight to a physical address. This is why tight loops and hot data are fast — their translations stay cached. A well-behaved program has a high TLB hit rate.

TLB miss—>walk page table—>frame#

On a miss, the MMU walks the page table in memory (on x86-64, up to four levels of tables — a tree, so the table itself doesn't need to be one giant contiguous block). That's several extra memory reads. The result is cached back in the TLB. Frame + offset = the real physical address, and the access finally happens.

not present—>page fault → kernel

If the page-table entry says the page isn't in RAM, the CPU raises a page fault and jumps into the kernel. A minor fault: the page is valid but not yet resident (first touch, or shared) — the kernel allocates/zeroes a frame and retries. A major fault: the page must be read from disk (from the executable, or from swap) — slow. An invalid access (a bad pointer): the kernel refuses and sends SIGSEGV. A segfault is just a page fault the kernel wouldn't satisfy.

Translate on every access · TLB caches it · demand paging + faults load memory lazily · minor vs major faults · segfault = unsatisfiable fault.

The two numbers interviewers ask about

VSZ (virtual size) = how much address space a process has mapped, including memory reserved but never touched. RSS (resident set size) = how much is actually in physical RAM right now. VSZ is the promise; RSS is the bill. Because pages become resident only on demand, a process routinely has a huge VSZ and a modest RSS. This is also what lets Linux overcommit: hand out more virtual memory than it physically has, betting not everyone cashes their promise at once. When that bet fails, the OOM killer collects (final section). And when RAM is tight, the kernel can push cold pages out to swap (disk) to free frames — cheap if the pages are rarely used, catastrophic ("swap thrashing") if the working set genuinely exceeds RAM.

4 · How a program actually gets memory

You write malloc(64) or new or [] and get memory. Where does it come from? Beginners assume "the OS hands it over each time." Wrong — and the truth is a favourite interview probe.

Recall the address-space layout: the stack and the heap. They serve different needs:

Stack vs heap · hover each
the stackAutomatic, fast, scoped. Local variables and function call frames. Allocation is literally "move the stack pointer" — one instruction. Freed automatically when the function returns. Limited in size (often 8 MB); blow past it (deep recursion, huge local array) and you get a stack overflow.
the heapManual, flexible, longer-lived. Memory you request at runtime (malloc/new) that outlives the function that created it. You must free it (or a garbage collector must) — forget, and it's a memory leak. Managed by an allocator inside your process.
Stack = automatic + fast + small; heap = manual + flexible + can grow. "Where does a leak live?" → the heap.

malloc doesn't usually call the kernel

Here's the key insight. Your program links a memory allocator (part of libc). When you malloc, the allocator hands you a slice from a pool of memory it already obtained from the kernel — no syscall needed most of the time. Only when its pool runs low does it ask the kernel for more, via brk/sbrk (grow the heap's top) or mmap (map a fresh region, used for large allocations). Freeing usually returns memory to the allocator's pool, not to the OS — which is why a process's RSS often doesn't shrink after you free: the memory is free within the process but still held from the OS's view. This surprises people debugging "why didn't my RSS drop after I freed everything?"

Two consequences you'll be asked about. Fragmentation: after many allocations and frees of different sizes, the pool can become swiss-cheese — plenty of total free space, but no single contiguous chunk big enough, so the allocator must ask the OS for more anyway. And the layer beneath: because of overcommit and demand paging (Section 3), even a successful malloc of a gigabyte doesn't consume a gigabyte of RAM until you write to those pages — the pages fault in one at a time as you touch them.

5 · The syscall boundary: asking the kernel for anything

A user program cannot open a file, send a packet, or read the clock by itself. By deliberate design, it has no direct access to hardware. Here's the enforcement mechanism. The CPU runs in one of two privilege modes: user mode (called ring 3), where your programs live and where privileged instructions simply don't work, and kernel mode (ring 0), where the OS runs with full access to everything. This wall is enforced by the hardware itself — a program in user mode that tries to touch a device or another process's memory doesn't succeed-but-get-caught; the CPU physically refuses and faults. This, plus virtual-memory isolation, is the entire basis of OS security.

So how does a program get anything done? It asks the kernel through a system call — a single, controlled doorway from user mode into kernel mode. Every interaction with the outside world — read, write, open, connect, mmap, fork — is a syscall. Walk the crossing:

Interactive · crossing into the kernel
user mode · ring 3
rax=0 (read)fd, buf, count in registers

The program wants to read(). It puts the syscall number in a register (rax) and the arguments (which file descriptor, where to put the bytes, how many) in others, then runs the syscall instruction. In C you just call read(); libc is a thin wrapper doing exactly these register moves.

user → kernel
syscall instr—>ring 0 · fixed entry

The syscall instruction is a controlled trap: the CPU switches to kernel mode and jumps to one fixed entry point the kernel registered at boot. The program can't choose where in the kernel it lands — that's the safety property. There is exactly one supervised doorway in.

kernel mode · ring 0
syscall_table[0]—>sys_read()

The kernel reads the number from rax and indexes the syscall table to find the handler — 0 is read. It then validates the arguments: is that buffer pointer actually inside the caller's own address space? (A malicious program might pass a kernel address hoping to trick the kernel into writing there — the check stops it.)

kernel mode · ring 0
privileged worktouch device · copy bytes

Now with full privilege, the kernel does what the program couldn't: talk to the disk or socket, copy data into the user's buffer. If the data isn't ready (an empty socket), the kernel may put the process to sleep and run someone else — the syscall blocks. This is the exact moment a process enters an S or D state (next sections).

kernel → user
rax = bytes read— sysret —>program resumes

The kernel puts the result (bytes read, or a negative error code) in rax, switches back to user mode, and the program continues right after its syscall instruction. libc turns a negative return into the familiar errno. The program never knew it left ring 3.

Two rings, one doorway, hardware-enforced · every I/O is a syscall · argument validation is a security boundary · a blocking syscall is where a process sleeps.

Two things to carry forward. Syscalls are not free — each is a mode switch with real overhead, so fast code batches them (one big write beats a thousand tiny ones; io_uring exists to amortise them). And strace works by intercepting exactly this boundary, so its output is the truest possible description of what a program actually does.

6 · Interrupts: how the CPU gets pulled away

One puzzle the fetch–decode–execute loop can't explain on its own: if the CPU is busy running program A's instructions forever, how does it ever notice the keyboard was pressed, a network packet arrived, the disk finished, or that A has run long enough and B deserves a turn? The answer is interrupts, and they're the heartbeat of the whole system.

An interrupt is a hardware signal that says "stop what you're doing and handle me." When one fires, the CPU immediately pauses the current program (saving just enough state), switches to kernel mode, and runs a small kernel routine called an interrupt handler (or ISR). When the handler finishes, the CPU restores the interrupted program and resumes as if nothing happened. Devices use interrupts so the CPU never has to sit and poll them: the disk says "your data's ready" by raising an interrupt, freeing the CPU to do other work in the meantime.

Three ways into the kernel · hover each
interruptAsynchronous, from hardware. Timer, keyboard, network card, disk. Arrives at an unpredictable moment, unrelated to what the CPU was doing. Drives preemption and I/O completion.
trap / syscallSynchronous, on purpose. The running program chose to enter the kernel via the syscall instruction. Predictable, requested.
fault / exceptionSynchronous, by accident. An instruction did something needing the kernel — a page fault, a divide-by-zero, an illegal access. Caused by the current instruction.
All three suspend the program and enter the kernel; they differ in what triggers them. Knowing the distinction is a common interview point.

The single most important interrupt is the timer interrupt. The kernel programs a hardware timer to fire hundreds of times per second. Each tick yanks the CPU away from whatever program is running and hands control to the kernel — which is the only reason a runaway program (an infinite loop) can't freeze your machine. On each tick the kernel checks: has this program used its time slice? Should someone else run now? That decision is scheduling, next.

7 · The scheduler: sharing one CPU among many

A CPU core runs exactly one instruction stream at a time, yet hundreds of processes appear to run "simultaneously." The trick is time-slicing: give each ready process a tiny slice of CPU (a few milliseconds), then, on the next timer interrupt, forcibly take the CPU back and give it to another — fast enough that everything looks concurrent. Forcibly taking the CPU is called preemption, and an OS that does it (all modern ones) is preemptively multitasking. The kernel component that decides who runs next is the scheduler; Linux's is the Completely Fair Scheduler (CFS), which hands out CPU time roughly in proportion to priority, always favouring whoever has run the least so far. Priority is nudged by the nice value (−20 greediest to +19 most generous).

The context switch — and why it's expensive

Switching from process A to process B is a context switch: the kernel saves A's registers into A's task_struct, loads B's, and (since B has a different address space) points the MMU at B's page table and flushes much of the TLB. Interviewers love "why is a context switch expensive?" The register save/restore is trivial. The real cost is indirect: after the switch, B runs with cold CPU caches and a flushed TLB, so it suffers a burst of cache and TLB misses until its working set warms up again. This is why too many threads, tiny time slices, or heavy lock contention (which forces constant switching) burn CPU in overhead rather than useful work — visible as high %sys and high involuntary-context-switch counts.

Process states — know these cold

At any instant, every process is in one of a handful of states. This table answers half of all "diagnose this process" questions. Hover each:

Process states · hover each row
R — runningRunning or runnable. Either on a CPU right now, or on the run queue waiting its turn. A box pegged at 100% CPU is full of R tasks.
S — sleepInterruptible sleep. Waiting for an event (socket data, a timer, a lock) and can be woken by a signal. The normal resting state — most processes on an idle box are S.
D — uninterruptibleUninterruptible sleep. Blocked in the kernel on something that can't be interrupted — nearly always disk or network I/O. Cannot be killed, even with kill -9, because it's mid-syscall and can't receive the signal. A pile of D processes = an I/O problem (slow disk, dead NFS mount).
Z — zombieDead but unreaped. Finished; only its exit code lingers until the parent wait()s. Uses no CPU/RAM, but a flood means a parent that isn't reaping its children.
T — stoppedStopped/traced. Suspended by a signal (Ctrl-Z, SIGSTOP) or paused under a debugger. Resumes on SIGCONT.
R and S are normal; D means I/O and resists kill -9; Z means a parent isn't reaping. Memorise this.

Finally, a classic trap: load average is not CPU usage. On Linux it counts processes that are R and D. So a box can show load average 200 while the CPU is idle — that's 200 processes stuck in D waiting on a wedged disk, not 200 doing work. Reading load average as "CPU busyness" is one of the most common mistakes.

8 · Concurrency: threads, races, locks, deadlock

So far a process has one stream of execution. But you often want several — to use multiple CPU cores, or to keep working while one part waits on I/O. Enter threads: multiple independent streams of execution inside one process, sharing its memory and open files. This sharing is exactly what makes threads powerful and exactly what makes them dangerous.

Process vs thread · hover each
processOwn private address space and fd table. Isolated — a crash in one can't corrupt another. Heavier to create and switch. Must use IPC (pipes, sockets, shared memory) to share data.
threadShares the process's address space and fds; has only its own stack and registers. Cheap to create/switch, communicates through shared memory for free — but a bug in one thread can corrupt the whole process, and shared data needs careful protection.
On Linux both are made by clone() — a thread is just a task that shares more with its creator. Pick processes for isolation, threads for cheap shared-memory concurrency.

The race condition — the root of concurrency bugs

Two threads share a counter x = 0 and each does x = x + 1 a million times. You'd expect 2,000,000. You often get less. Why? Because x = x + 1 isn't one step — it's three: read x, add 1, write x back. If the scheduler switches threads mid-sequence, an update is lost. Step through it:

Interactive · a lost update
A reads x=41B reads x=41

x is 41. Thread A reads it into its register (41). Before A can finish, the scheduler switches to B, which also reads x — still 41. Both threads now hold "41" privately. The disaster is already baked in.

A: 41+1=42B: 41+1=42

Each thread adds 1 to its own copy. Both compute 42. Independently. Neither knows the other exists or that they read the same starting value.

A writes 42thenB writes 42

A writes 42 back to x. Then B writes 42 back to x. Two increments happened, but x only went up by one. One update was silently lost. This is a race condition: the result depends on the exact interleaving of threads, which is unpredictable — so the bug appears randomly, often only under load, and vanishes when you add a print statement. The worst kind of bug.

lockread-add-writeunlock

The fix: make the three steps atomic — indivisible — with a mutex (mutual-exclusion lock). A thread acquires the lock before touching x and releases it after; while held, no other thread can enter. The read-add-write becomes a critical section only one thread runs at a time. Now no update is lost.

Shared mutable state + concurrent access + non-atomic operations = race. The cure is mutual exclusion around the critical section.

Deadlock — when locks freeze each other

Locks solve races but introduce a new hazard. Thread A holds lock 1 and waits for lock 2; thread B holds lock 2 and waits for lock 1. Neither can proceed; both wait forever. That's deadlock. It requires four conditions simultaneously (the Coffman conditions) — break any one and deadlock is impossible:

The four conditions for deadlock · hover each
mutual exclusionA resource can be held by only one thread at a time. (Inherent to locks.)
hold and waitA thread holds one resource while waiting for another. Break it: grab all locks at once, or none.
no preemptionA resource can't be forcibly taken from a thread. Break it: use lock timeouts / try-lock and back off.
circular waitA cycle of threads each waiting on the next. Break it (the usual fix): impose a global lock ordering — always acquire locks in the same order everywhere.
All four must hold at once. The standard prevention in real code is a consistent global lock-ordering, which kills "circular wait."

One aside that comes up constantly: Python's GIL (global interpreter lock) lets only one thread run Python bytecode at a time, so CPU-bound Python threads don't use multiple cores — which is why Python programs use multiprocessing (separate processes) for CPU parallelism. It's a concrete example of why you'd choose processes over threads.

9 · Signals: the kernel tapping a process on the shoulder

How does Ctrl-C stop a program? How does the kernel tell a process "you segfaulted" or "your child died"? Through signals — tiny asynchronous notifications delivered to a process. A signal interrupts the process's normal flow, and the process either runs a registered handler function, or takes the default action (often: die). Know these by name:

The signals you must know · hover each
SIGTERM (15)Polite "please exit." The default of kill. Catchable — a program can run cleanup (flush data, close connections) then exit gracefully. Always try this first.
SIGKILL (9)The unconditional kill. kill -9. Cannot be caught, blocked, or ignored — the kernel just destroys the process. No cleanup runs (risking corrupt state). Use only when SIGTERM fails.
SIGINT (2)Interrupt from the keyboard — what Ctrl-C sends. Usually terminates, but catchable (that's how a program cleans up on Ctrl-C).
SIGSEGV (11)Segmentation fault. Sent by the kernel when a process makes an illegal memory access (the unsatisfiable page fault from Section 3). Default: die with a core dump.
SIGCHLD"A child of yours just died." Sent to the parent so it can wait() and reap. Ignoring it is how zombies accumulate.
SIGSTOP / SIGCONTPause / resume. SIGSTOP (like Ctrl-Z) freezes a process into the T state; SIGCONT unfreezes it. SIGSTOP, like SIGKILL, can't be caught.
SIGTERM = ask nicely (catchable); SIGKILL = force (uncatchable, no cleanup). This is why graceful shutdown handles SIGTERM.

This ties two earlier threads together. First, "graceful shutdown": a well-built server catches SIGTERM, stops accepting new work, finishes in-flight requests, then exits — which is exactly what Kubernetes relies on when it stops a pod (it sends SIGTERM, waits, then SIGKILL). Second, it explains the D-state mystery from Section 7: a process in uninterruptible sleep is stuck inside a kernel syscall and cannot receive any signal, not even SIGKILL — so kill -9 genuinely does nothing until the I/O it's blocked on completes.

10 · File descriptors and the I/O models

Now the abstraction that makes Unix Unix. When a process opens a file, the kernel doesn't hand back the file — it hands back a small integer, a file descriptor (fd). Every later operation names the resource by that integer. The profound part: files are not special. Pipes, network sockets, terminals, even the kernel's own event queues are all reached through the identical fd interface (read/write/close). "Everything is a file" really means "everything is a file descriptor" — and it's why one piece of code, one event loop, can juggle files and sockets and timers together.

Every process starts with three fds open: 0 = stdin, 1 = stdout, 2 = stderr — which is why 2>&1 redirects errors to output (you point fd 2 at whatever fd 1 points to). New fds take the lowest free integer, so the next file you open is usually fd 3. Underneath, an fd is one end of a three-link chain:

The fd chain · hover each stage
fd 0/1/2/3…Per-process fd table. Your integers, private to each process (fork copies them). Browse them live at /proc/<pid>/fd.
open file entrySystem-wide open file table. Holds the offset (where the next read/write lands) and the mode. Two fds can share one entry (shared offset) or point at separate entries on the same file.
inodeThe file object itself — metadata plus pointers to data blocks. A file with zero names but an open fd still exists — the "deleted-but-open" trap that makes df and du disagree (you'll meet it in Storage).
Integer → open-file entry (offset) → inode (data). Sockets and pipes ride the same chain — which is why "a socket is just an fd."

Blocking, non-blocking, and how one program serves 100,000 connections

When you read() a socket with no data, the default is to block — the process sleeps (state S) until data arrives. Fine for one connection; disastrous for thousands, because you'd need a thread per connection (crushing context-switch and memory cost — this is the famous C10K problem). The solution combines three things you now understand: mark the fds non-blocking (a read with no data returns immediately instead of sleeping), and use an event-notification syscall — epoll on Linux — to ask the kernel "tell me which of these 100,000 fds are ready." A tiny pool of threads (about one per core) each runs an event loop that epoll_waits and services only the ready fds. That single design — non-blocking fds + epoll — is the architecture behind nginx, Redis, and Node.js. It's why "everything is an fd" is more than a slogan: one loop watches every kind of I/O at once.

And the failure mode: every fd costs a table slot, the kernel caps how many you may hold (ulimit -n), and a program that leaks fds (opening without closing) eventually hits EMFILE — too many open files and can't accept new connections. You'll reproduce exactly that in the labs.

11 · The memory hierarchy: why programs wait

Every performance intuition rests on one fact: memory is a hierarchy, and each level down is dramatically slower than the one above. The CPU can do arithmetic in a fraction of a nanosecond, but reaching main memory takes ~100 ns and reaching a spinning disk takes tens of thousands of times longer. All the machinery you've met — CPU caches, the TLB, the page cache — exists to keep the data the CPU needs as high up this pyramid as possible. Hover each tier and burn the orders of magnitude into memory (the exact numbers vary; the ratios are the point):

Latency hierarchy · hover each tier
CPU register< 1 ns
L1 cache~1 ns
L2 cache~4 ns
L3 cache~15 ns
Main memory (RAM)~100 ns
NVMe SSD read~100 µs
Network RTT (same datacenter)~0.5 ms
Disk seek (spinning HDD)~10 ms
RAM is ~100× slower than L1; SSD ~1000× slower than RAM; an HDD seek ~100× slower again. "Keep the working set in cache/RAM" is the root of almost every performance win.

Two ideas follow. Locality: memory moves between levels in cache lines (~64 bytes), so touching data that sits close together (a contiguous array) is far faster than scattered access (a linked list of pointers all over the heap) — same number of accesses, wildly different speed, because one enjoys cache hits and the other suffers misses. The page cache: Linux uses spare RAM to cache file data, which is why free shows most memory "used" as buff/cache — that memory is reclaimable, not lost, and is why re-reading a file is instant the second time.

12 · When memory runs out: the OOM killer

Recall two facts: Linux overcommits (Section 3) — it promises more virtual memory than it physically has — and it uses spare RAM as reclaimable page cache (Section 11). Usually this is fine. But when processes actually try to use more physical memory than exists, and the kernel has already shrunk the page cache as far as it can, it's cornered: there are no free frames and something must give.

Enter the OOM killer (out-of-memory killer). Rather than let the whole system freeze, the kernel picks a process and kills it (SIGKILL) to reclaim its memory. It doesn't choose randomly: each process has an OOM score — roughly, how much memory it's using, tunable per process via oom_score_adj — and the highest scorer is sacrificed. In the kernel log you'll find the tell-tale Out of memory: Killed process 1234 (mysqld).

The crucial connection for everything that follows: this is the same mechanism a cgroup memory limit uses locally. Give a container a 512 MB limit and the kernel tracks that cgroup's memory; exceed it and the OOM killer fires inside just that cgroup, killing a process there — while the host may have gigabytes free. That's why a container gets "OOMKilled" on a machine with plenty of RAM: it hit its limit, not the host's. One sentence links this entire module to containers and Kubernetes: a container OOM is virtual memory hitting a wall you configured.

You now have the whole machine in your head: a CPU running one fetch–decode–execute loop, virtual memory giving every process a private world, syscalls and interrupts as the only ways into the kernel, the scheduler time-slicing the CPU, threads and locks for concurrency, signals for control, file descriptors for all I/O, and the memory hierarchy explaining every performance question. Everything above this module — Linux, Docker, Kubernetes — is built from exactly these parts.

Now build it by hand
Q1Watch the syscall boundary with stracewarm-up

Every time a program touches the outside world it crosses into the kernel via a system call. strace intercepts that boundary and shows the truest description of what a program actually does.

Under the hood

strace uses the ptrace syscall to stop the target on every syscall entry and exit. -c tallies them and shows in-kernel time per call — the fast way to spot a program hammering the kernel (thousands of tiny reads) vs doing real work. On a stuck process, strace -p <pid> often shows it spinning on one syscall.

Task

Run strace on a simple command and read the syscall tally.

Verify it yourself
verify
$ strace -c ls /

You get a table of syscalls (execve, openat, read, mmap…). Every row is one favour asked of the kernel; execve at the top is the program being loaded (Section 2).

Reveal solution
solution
$ sudo apt install -y strace   # if needed
$ strace -c ls /
$ strace -e trace=openat ls /   # just the file opens
Q2Watch fork + exec build the process treewarm-up

Processes are born by fork (clone) then exec (replace image). Every process has a parent, up to PID 1.

Under the hood

sleep 100 & makes your shell fork a child that execves sleep; the ppid you see is the fork relationship. Kill the shell and the child re-parents to PID 1 — the orphan-reaping rule from Section 2, live.

Task

Start a background sleep, then show it, its PID, and its parent.

Verify it yourself
verify
$ ps -o pid,ppid,cmd --ppid $$

The child sleep appears with ppid = your shell's PID. pstree -p $$ draws the tree.

Reveal solution
solution
$ sleep 100 &
$ ps -o pid,ppid,cmd --ppid $$
$ pstree -p $$
Q3Read the address-space layoutcore

Every process has the standard layout from Section 2 — text, data, heap, and stack — each in its own private virtual address space.

Under the hood

/proc/self/maps lists every mapped region with permissions and backing file: the executable's .text (r-x), [heap], [stack], and each shared library. Summed, these regions are roughly your VSZ; the resident subset is your RSS.

Task

Print the memory map of a process and find its heap, stack, and code.

Verify it yourself
verify
$ cat /proc/self/maps | grep -E "heap|stack|r-xp" 

You'll see [heap], [stack], and the executable/libraries mapped r-xp (read-execute) — the address-space diagram made real.

Reveal solution
solution
$ cat /proc/self/maps | head -20
$ cat /proc/self/maps | grep -E "heap|stack"
Sponsored

Reach engineers who read the man page

Native, contextual, no tracking — this is how the curriculum stays free.

Q4A socket is just a file descriptorcore

Files, pipes, and sockets are all reached through small integers — file descriptors. 0/1/2 are stdin/stdout/stderr; the kernel assigns the lowest free integer next.

Under the hood

/proc/<pid>/fd is the per-process fd table made browsable — each entry symlinks to what it points at (a path, socket:[…], pipe:[…]). First place to look for "too many open files" or "which socket is this process holding."

Task

Open a file on fd 3, then inspect the shell's open descriptors.

Verify it yourself
verify
$ ls -l /proc/$$/fd

fd 3 symlinks to the file you opened, next to 0/1/2. A socket would appear as socket:[…] in the same list — same mechanism, different backing object.

Reveal solution
solution
$ exec 3< /etc/hostname
$ ls -l /proc/$$/fd
$ exec 3<&-   # close it
Q5VSZ vs RSS — the promise vs the billcore

A process can reserve a huge virtual address space (VSZ) while keeping little actually in RAM (RSS), because pages become resident only on demand.

Under the hood

This is overcommit + demand paging in one measurement. RSS climbs as a program touches more of its allocation while VSZ stays put — the pages were always there virtually; touching them made them real. As RSS nears a cgroup limit, the OOM killer looms (Section 12).

Task

Compare virtual size vs resident size across your processes.

Verify it yourself
verify
$ ps -eo pid,vsz,rss,comm --sort=-rss | head

VSZ (address space mapped) is almost always far larger than RSS (actually in RAM). The gap is memory promised but not cashed in.

Reveal solution
solution
$ ps -eo pid,vsz,rss,comm --sort=-rss | head
Q6Break-and-fix — too many open filesdebug

Every fd costs a slot; the kernel caps how many a process may hold (ulimit -n). Hit the cap and syscalls fail with EMFILE — a real outage where a leaking server slowly stops accepting connections.

Under the hood

It surfaces as Too many open files and accept()/open() returning EMFILE. Diagnose with ls /proc/<pid>/fd | wc -l vs the limit in /proc/<pid>/limits: a steadily climbing count is a leak (fix the code); a one-time spike is an under-provisioned limit (raise it). Raising the limit to hide a leak just delays the crash.

Task

Lower the fd limit hard in a subshell and watch opens fail; then reason about the real fix.

Verify it yourself
verify
$ bash -c 'ulimit -n 8; for i in $(seq 20); do exec {fd}<>/tmp/f$i || { echo "failed at $i"; break; }; done'

It fails partway once the 8-fd budget is spent. The right fix depends on why: raise the soft limit for a legitimate need, or close leaked fds if the count grows unbounded.

Reveal solution

Always diagnose before raising: a monotonically climbing fd count is a leak, not a capacity problem.

solution
$ cat /proc/<pid>/limits | grep "open files"
$ ls /proc/<pid>/fd | wc -l
$ ulimit -n 4096   # only if it is a real need
What you just built

You now hold the whole machine: a CPU running one fetch–decode–execute loop; virtual memory giving every process a private world; the heap and stack; syscalls and interrupts as the only doors into the kernel; the scheduler time-slicing the CPU; threads, races, locks and deadlock; signals for control; file descriptors and epoll for all I/O; and the memory hierarchy explaining every performance question. Every module above — Linux, Docker, Kubernetes — is assembled from exactly these parts. A container is a process with namespaces and cgroups; a socket is an fd; an OOM kill is virtual memory hitting a wall you set.

The interview gauntlet

The questions actually asked for SRE, systems, and platform-engineering roles — conceptual, debugging, and "explain it to me" prompts, plus a design question. Each expands to show what the interviewer is really probing for, a model answer, and the follow-up traps. Answer all of these out loud and you have mastered this module.

Q1What happens, step by step, when you run ./a.out?
What they're really probing

The whole stack: fork/exec, the loader and dynamic linking, demand paging, process lifecycle — not just "it runs."

Model answer

The shell forks a child (copy-on-write, so nothing is truly copied yet). The child execves the binary: the kernel discards the copied image and maps the executable's sections, invoking ld.so to map libc and resolve symbols for a dynamic binary. Almost nothing is in RAM — pages are mapped but not present. Control reaches _start → main; the first instruction triggers a page fault and the kernel loads that page on demand. The process runs, entering the kernel via syscalls for any I/O, until it exits and becomes a zombie until its parent waits to reap it.

Follow-up traps
  • "Why does fork return twice?" — 0 in the child, child PID in the parent.
  • "Why not copy all memory?" — copy-on-write; copied only when written.
  • "Parent exits first?" — orphan re-parents to PID 1, which reaps it.
Q2Explain virtual memory as if I know nothing about it.
What they're really probing

Clarity of the mental model plus the machinery (pages, page table, TLB, faults) behind it.

Model answer

Every process gets its own private, fictional memory and believes it owns all of it. The hardware translates these fake (virtual) addresses to real (physical) ones on every access, in page-sized chunks, via a per-process page table; a small cache (the TLB) makes repeated translations fast. Pages load into RAM on demand through page faults. Payoffs: isolation (processes can't see each other's memory), a clean uniform layout, and overcommit. A segfault is a page fault the kernel refused to satisfy.

Follow-up traps
  • "What's in the page table?" — virtual-page → physical-frame plus permission bits.
  • "Why is the TLB critical?" — without it every access needs a multi-level page-table walk.
  • "Minor vs major fault?" — allocate/zero a frame vs read from disk/swap.
Q3A process is stuck and won't die even with kill -9. Why, and what do you do?
What they're really probing

The D (uninterruptible sleep) state, its link to I/O, and why signals can't reach it.

Model answer

It's almost certainly in uninterruptible sleep (D): blocked inside a kernel syscall on I/O that can't be interrupted — a slow/dead disk or a hung NFS mount. SIGKILL can't be delivered because the process isn't in a state to receive any signal; it's mid-syscall. You can't force it — you fix the underlying I/O (recover the mount/disk). Confirm with ps -o pid,stat,wchan: D plus a wchan in an I/O wait function.

Follow-up traps
  • "So how do you kill it?" — you can't directly; resolve the I/O it awaits.
  • "Effect on load average?" — D counts toward load, so load can be huge while CPU is idle.
Q4Your service's memory grows until it's OOM-killed. How do you diagnose it?
What they're really probing

Systematic memory debugging: leak vs cache vs limit, RSS trends, and OOM-killer logic.

Model answer

Confirm it's a leak, not page cache — track the process's RSS over time (/proc/<pid>/status), not system-wide "used." Monotonic climb that never plateaus = leak in the app; a plateau near the limit = working set larger than the limit. Check the cgroup limit and memory.events for prior OOM kills, and dmesg for Out of memory: Killed process naming the victim. Fixes: raise the limit if the working set is legitimately that big, or fix the leak (heap profiler, unbounded caches/queues).

Follow-up traps
  • "Why that process?" — highest OOM score (~memory used, tunable via oom_score_adj).
  • "Container killed but host had free RAM?" — it hit the cgroup limit, not the host.
  • "free shows no free memory — bad?" — usually not; most is reclaimable page cache.
Q5Walk me through a race condition and how you'd fix it.
What they're really probing

Whether you understand shared mutable state, non-atomic operations, and mutual exclusion.

Model answer

Two threads share a variable and each does x = x + 1, which is really read-add-write. If the scheduler interleaves them — both read the same old value, both add one, both write back — one increment is lost. The result depends on timing, so the bug is intermittent and load-dependent (a race condition). The fix is mutual exclusion: wrap the read-add-write in a mutex so only one thread is in that critical section at a time, making it atomic. Alternatives: atomic instructions, or avoiding shared mutable state entirely.

Follow-up traps
  • "Why does adding a print make it disappear?" — it changes timing, hiding (not fixing) the race.
  • "Downside of locks?" — contention (serialises threads) and deadlock.
Q6What is deadlock and how do you prevent it?
What they're really probing

The four Coffman conditions and the practical prevention used in real code.

Model answer

Deadlock is when threads wait on each other in a cycle and none can proceed — A holds lock 1 and wants 2; B holds 2 and wants 1. It needs four conditions at once: mutual exclusion, hold-and-wait, no preemption, and circular wait. Break any one and it can't happen. The standard real-world fix is to eliminate circular wait by imposing a global lock ordering — every thread always acquires locks in the same order. Other options: acquire all locks at once (kills hold-and-wait), or use try-lock with timeouts and back off (kills no-preemption).

Follow-up traps
  • "Difference from livelock?" — livelock: threads keep reacting to each other and make no progress (not blocked, just busy).
  • "How do you detect it in production?" — threads stuck, no CPU use; thread dump shows the cycle.
Q7Process vs thread — what's the difference and when do you choose each?
What they're really probing

Shared vs isolated address space and the practical trade-offs.

Model answer

A process has its own private address space and fds; it's isolated (a crash can't corrupt another) but heavier and needs IPC to share data. Threads live in one process and share its address space and fds, so they communicate through shared memory for free and are cheap to create/switch — but a bug in one can corrupt the whole process, and shared data needs locks. On Linux both come from clone(). Choose threads for cheap shared-memory concurrency; choose processes for fault isolation or to escape a global lock (e.g. Python's GIL).

Follow-up traps
  • "How does the scheduler see them?" — it schedules threads (tasks) individually.
  • "Why multiprocessing in Python?" — the GIL prevents threads from using multiple cores for CPU-bound work.
Q8What is a context switch and why is it expensive?
What they're really probing

The subtle answer — caches and TLB, not the register copy.

Model answer

It's the kernel saving one task's CPU state and restoring another's so one core can multiplex many tasks. The register save/restore is cheap; the real cost is indirect. Switching to a different process swaps page tables and flushes much of the TLB, and the incoming task runs with cold CPU caches, suffering cache/TLB misses until its working set warms up. So heavy switching (too many threads, tiny slices, lock contention) burns CPU in overhead — high %sys, high involuntary context-switch counts — rather than useful work.

Follow-up traps
  • "Voluntary vs involuntary?" — blocked on I/O vs preempted by the scheduler.
  • "Thread switch within a process cheaper?" — yes, same address space, less TLB damage.
Q9Load average is 200 but CPU utilisation is near zero. Explain.
What they're really probing

That Linux load average includes D-state processes, not just CPU demand.

Model answer

Linux load average counts processes that are runnable (R) and in uninterruptible sleep (D). A load of 200 with idle CPU means ~200 processes stuck in D waiting on I/O — not consuming CPU. The classic cause is a slow or hung storage backend (dying disk, unresponsive NFS): requests pile up in I/O wait. The fix is on the I/O side, not the CPU. Treating load average as "CPU busyness" is the mistake — it's "demand for the system, including I/O waits."

Follow-up traps
  • "How confirm?" — ps -eo stat | grep -c D, iostat/iotop for disk saturation, check stuck mounts.
  • "Why does D count?" — to reflect true system demand including I/O.
Q10Explain a file descriptor, and why sockets, files, and pipes all use them.
What they're really probing

The "everything is a file" abstraction and why it's powerful.

Model answer

An fd is a small integer the kernel gives as a handle to an open resource; all operations name the resource by that integer. The power is uniformity: files, pipes, sockets, terminals, and event queues share one interface (read/write/close), so the same code and the same event loops (epoll) work across all of them. Underneath, the fd indexes a per-process table → a system-wide open-file entry (holding the offset) → the underlying object. fds 0/1/2 are stdin/stdout/stderr, which makes shell redirection work.

Follow-up traps
  • "Why does 2>&1 work?" — points fd 2 at whatever fd 1 points to.
  • "fd exhaustion?" — hitting ulimit -nEMFILE, often an fd leak.
Q11SIGTERM vs SIGKILL — what's the difference and why does it matter?
What they're really probing

Graceful shutdown, catchable vs uncatchable signals, and the tie to orchestration.

Model answer

SIGTERM (the default kill) politely asks a process to exit and is catchable, so a well-built program runs cleanup — flush buffers, finish in-flight requests, close connections — then exits. SIGKILL (kill -9) is uncatchable: the kernel destroys the process immediately with no cleanup, risking corrupt/partial state. So you always try SIGTERM first. This is exactly Kubernetes' pod-termination flow: it sends SIGTERM, waits a grace period, then SIGKILL — which is why services implement a SIGTERM handler for graceful shutdown.

Follow-up traps
  • "Which signals can't be caught?" — SIGKILL and SIGSTOP.
  • "Why did SIGKILL not work on my process?" — it's in D state, can't receive any signal until its I/O completes.
Q12Design: one box, 100,000 concurrent network connections. How?
What they're really probing

Combining fds, non-blocking I/O, and event notification into a systems design (C10K).

Model answer

Not a thread per connection — 100k threads means crushing context-switch and memory overhead. Instead use non-blocking sockets and an event-driven model: register all connection fds with epoll, and run a small pool of threads (~one per core), each an event loop that epoll_waits and services only the fds that are actually ready. This is why "everything is an fd" matters — one epoll loop watches tens of thousands of sockets. It's the architecture of nginx, Redis, and Node. Keep per-connection state small, scale threads to cores, and never make a blocking syscall inside the loop (it would stall every connection on that thread).

Follow-up traps
  • "Why not select?" — O(n) per call, capped at FD_SETSIZE; epoll is O(ready).
  • "Where does blocking hurt?" — inside the loop it stalls all that thread's connections; push blocking work to a separate pool.
  • "Limits to tune?" — ulimit -n and ephemeral port range.