Concepts Core ~180 min

Operating Systems

This assumes you know nothing beyond "a computer runs programs," and takes you to the point where you can answer any OS interview question out loud. We build each idea from the one before it — no leaps, no hand-waving. Grounded in OSTEP, CS:APP, and Kerrisk. Read it top to bottom once; then the labs and the Interview Gauntlet at the end will feel like review.

Now build it by hand

Q1 · Watch the syscall boundary with strace (warm-up)

Every time a program touches the outside world it crosses into the kernel via a system call. strace intercepts that boundary and shows the truest description of what a program actually does.

Under the hood

strace uses the ptrace syscall to stop the target on every syscall entry and exit. -c tallies them and shows in-kernel time per call — the fast way to spot a program hammering the kernel (thousands of tiny reads) vs doing real work. On a stuck process, strace -p <pid> often shows it spinning on one syscall.
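The "thousands of tiny reads" pattern can be reproduced in a few lines. This is a minimal sketch, assuming a Linux or other POSIX system; `count_read_calls` and `demo` are hypothetical names used only for illustration. Python's `os.read` is a thin wrapper, so each call corresponds to one read(2) system call, which is exactly what strace -c would tally.

```python
import os
import tempfile


def count_read_calls(path, chunk_size):
    """Read a whole file with os.read and return how many calls it took.

    Each os.read call maps to one read(2) system call.
    """
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk_size)
            calls += 1
            if not data:          # an empty result signals end of file
                break
    finally:
        os.close(fd)
    return calls


def demo(size=4096):
    """Compare one-byte reads with a single large read on the same file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * size)
        path = tmp.name
    try:
        tiny = count_read_calls(path, 1)        # one byte per syscall
        bulk = count_read_calls(path, 65536)    # whole file in one syscall
    finally:
        os.unlink(path)
    return tiny, bulk
```

Running the script under strace -c should show the same gap in the read row: the byte-at-a-time loop spends its time crossing the boundary rather than doing work.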

Task

Run strace on a simple command and read the syscall tally.

Verify it yourself

$ strace -c ls /

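You get a table of syscalls (execve, openat, read, mmap…), one row per syscall with a call count. Every call is one favour asked of the kernel. Note that -c sorts rows by time spent, so execve, the call that loaded the program (Section 2), is the first line of a plain strace trace but not necessarily the top row here.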

solution

$ sudo apt install -y strace   # if needed
$ strace -c ls /
$ strace -e trace=openat ls /   # just the file opens

Q2 · Watch fork + exec build the process tree (warm-up)

Processes are born by fork (clone) then exec (replace image). Every process has a parent, up to PID 1.

Under the hood

sleep 100 & makes your shell fork a child that execves sleep; the ppid you see is the fork relationship. Kill the shell and the child re-parents to PID 1 (or to a closer subreaper, such as a systemd user instance): the orphan-reaping rule from Section 2, live.
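What the shell does for sleep 100 & can be sketched directly with the same two calls. This is an illustration, not how any particular shell is implemented; `fork_exec_report_parent` is a hypothetical name, and the child simply runs a second Python interpreter that prints its own parent PID.

```python
import os
import sys


def fork_exec_report_parent():
    """Fork a child, exec a new program in it, and read back the ppid it sees."""
    read_end, write_end = os.pipe()
    pid = os.fork()                      # returns 0 in the child, child PID in the parent
    if pid == 0:
        try:
            os.close(read_end)
            os.dup2(write_end, 1)        # child's stdout now feeds the pipe
            # Replace the child's image; on success this call never returns.
            os.execv(sys.executable,
                     [sys.executable, "-c", "import os; print(os.getppid())"])
        finally:
            os._exit(127)                # reached only if exec failed
    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        reported_ppid = int(pipe.read())
    _, status = os.waitpid(pid, 0)       # reap the child so it does not linger
    return pid, reported_ppid, os.WEXITSTATUS(status)
```

The ppid the child reports is the PID of the process that called fork, which is the same relationship ps shows in its ppid column.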

Task

Start a background sleep, then show it, its PID, and its parent.

Verify it yourself

$ ps -o pid,ppid,cmd --ppid $$

The child sleep appears with ppid = your shell's PID. pstree -p $$ draws the tree.

solution

$ sleep 100 &
$ ps -o pid,ppid,cmd --ppid $$
$ pstree -p $$

Q3 · Read the address-space layout (core)

Every process gets its own private virtual address space, laid out in the standard way from Section 2: text, data, heap, and stack.

Under the hood

/proc/self/maps lists every mapped region with permissions and backing file: the executable's .text (r-x), [heap], [stack], and each shared library. Summed, these regions are roughly your VSZ; the resident subset is your RSS.
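The same file is easy to read programmatically. The sketch below assumes Linux's /proc/<pid>/maps line format (address range, permissions, offset, device, inode, optional name); `read_maps` and `summarize` are hypothetical helper names.

```python
def read_maps(pid="self"):
    """Parse /proc/<pid>/maps into a list of region dictionaries."""
    regions = []
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            # Fields: range, perms, offset, dev, inode, then an optional name.
            fields = line.split(maxsplit=5)
            start, end = (int(x, 16) for x in fields[0].split("-"))
            name = fields[5].strip() if len(fields) > 5 else ""
            regions.append({"start": start, "end": end,
                            "perms": fields[1], "name": name})
    return regions


def summarize(regions):
    """Return total mapped KiB, the [heap]/[stack] regions, and executable regions."""
    total_kib = sum(r["end"] - r["start"] for r in regions) // 1024
    named = {r["name"]: r for r in regions if r["name"] in ("[heap]", "[stack]")}
    code = [r for r in regions if r["perms"].startswith("r-x")]
    return total_kib, named, code
```

The total should land close to the VSZ that ps reports for the same process, since both describe the mapped address space rather than what is resident.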

Task

Print the memory map of a process and find its heap, stack, and code.

Verify it yourself

$ cat /proc/self/maps | grep -E "heap|stack|r-xp"

Because /proc/self resolves to whichever process reads it, this is cat's own map. You'll see [heap], [stack], and the executable/libraries mapped r-xp (read-execute): the address-space diagram made real.

solution

$ cat /proc/self/maps | head -20
$ cat /proc/self/maps | grep -E "heap|stack"

Q4 · A socket is just a file descriptor (core)

Files, pipes, and sockets are all reached through small integers — file descriptors. 0/1/2 are stdin/stdout/stderr; the kernel assigns the lowest free integer next.

Under the hood

/proc/<pid>/fd is the per-process fd table made browsable — each entry symlinks to what it points at (a path, socket:[…], pipe:[…]). First place to look for "too many open files" or "which socket is this process holding."
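A process can browse its own fd table the same way. A minimal sketch, assuming Linux's /proc layout; `open_fds` and `demo` are hypothetical names. It opens one regular file and one socket and looks both up in /proc/self/fd.

```python
import os
import socket
import tempfile


def open_fds(pid="self"):
    """Map each open fd number to the target of its /proc symlink."""
    table = {}
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            table[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            pass    # fd was closed in between (e.g. the one used to list the directory)
    return table


def demo():
    """Open a file and a socket, then report what their fds point at."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd = os.open(tmp.name, os.O_RDONLY)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            table = open_fds()
            return (table[fd], os.path.realpath(tmp.name), table[sock.fileno()])
        finally:
            sock.close()
            os.close(fd)
```

The file's entry resolves to a path and the socket's to a socket:[…] label, yet both were found through the same table by the same kind of integer.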

Task

Open a file on fd 3, then inspect the shell's open descriptors.

Verify it yourself

$ ls -l /proc/$$/fd

fd 3 symlinks to the file you opened, next to 0/1/2. A socket would appear as socket:[…] in the same list — same mechanism, different backing object.

solution

$ exec 3< /etc/hostname
$ ls -l /proc/$$/fd
$ exec 3<&-   # close it

Q5 · VSZ vs RSS: the promise vs the bill (core)

A process can reserve a huge virtual address space (VSZ) while keeping little actually in RAM (RSS), because pages become resident only on demand.

Under the hood

This is overcommit + demand paging in one measurement. RSS climbs as a program touches more of its allocation while VSZ stays put — the pages were always there virtually; touching them made them real. As RSS nears a cgroup limit, the OOM killer looms (Section 12).
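The effect can be measured from inside one process. This is a sketch under stated assumptions: Linux, where /proc/self/status exposes VmSize and VmRSS in kB, and enough free memory to touch 32 MiB; `vm_kib` and `demo` are hypothetical names. It maps 64 MiB of anonymous memory, then writes to half of the pages.

```python
import mmap


def vm_kib(field):
    """Return a Vm* field (such as VmSize or VmRSS) from /proc/self/status, in KiB."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)


def demo(size=64 * 1024 * 1024):
    """Return (VmSize, VmRSS) before mapping, after mapping, and after touching."""
    before = vm_kib("VmSize"), vm_kib("VmRSS")
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    mapped = vm_kib("VmSize"), vm_kib("VmRSS")      # promised, not yet paid for
    for offset in range(0, size // 2, mmap.PAGESIZE):
        region[offset] = 1                          # first write faults the page in
    touched = vm_kib("VmSize"), vm_kib("VmRSS")
    region.close()
    return before, mapped, touched
```

VmSize should jump as soon as the mapping exists, while VmRSS should move only after the loop has written to the pages.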

Task

Compare virtual size vs resident size across your processes.

Verify it yourself

$ ps -eo pid,vsz,rss,comm --sort=-rss | head

VSZ (address space mapped) is almost always far larger than RSS (actually in RAM). The gap is memory promised but not cashed in.

solution

$ ps -eo pid,vsz,rss,comm --sort=-rss | head

Q6 · Break-and-fix: too many open files (debug)

Every fd costs a slot; the kernel caps how many a process may hold (ulimit -n). Hit the cap and syscalls fail with EMFILE — a real outage where a leaking server slowly stops accepting connections.

Under the hood

It surfaces as Too many open files and accept()/open() returning EMFILE. Diagnose with ls /proc/<pid>/fd | wc -l vs the limit in /proc/<pid>/limits: a steadily climbing count is a leak (fix the code); a one-time spike is an under-provisioned limit (raise it). Raising the limit to hide a leak just delays the crash.
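The failure can also be triggered inside a single process. A minimal sketch, assuming a POSIX system with /dev/null; `exhaust_fds` is a hypothetical name. It lowers only the soft limit, opens files until the kernel refuses, and restores the original limit afterwards.

```python
import errno
import os
import resource


def exhaust_fds(soft_limit=64):
    """Lower the soft fd limit, open /dev/null until it fails, then restore.

    Returns (number of successful opens, errno of the failure).
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if old_soft != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_soft)      # never raise the limit here
    opened = []
    failure = None
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
    try:
        while True:
            try:
                opened.append(os.open("/dev/null", os.O_RDONLY))
            except OSError as exc:
                failure = exc.errno                 # expected: errno.EMFILE
                break
    finally:
        for fd in opened:
            os.close(fd)                            # the "fix the leak" half
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
    return len(opened), failure
```

Closing the descriptors in the finally block is what a leaking server forgets to do; without it, raising the limit would only postpone the same error.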

Task

Lower the fd limit hard in a subshell and watch opens fail; then reason about the real fix.

Verify it yourself

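$ bash -c 'ulimit -n 16; for i in $(seq 20); do exec {fd}<>/tmp/f$i || { echo "failed at $i"; break; }; done'

It fails partway through the loop with "Too many open files", once no descriptor below the limit is free. The limit here is 16 because bash hands out {fd} descriptors from 10 upward, so a limit below 10 would fail on the very first open for a different reason. The right fix depends on why: raise the soft limit for a legitimate need, or close leaked fds if the count grows unbounded.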

Always diagnose before raising: a monotonically climbing fd count is a leak, not a capacity problem.

solution

$ cat /proc/<pid>/limits | grep "open files"
$ ls /proc/<pid>/fd | wc -l
$ ulimit -n 4096   # only for a real need; applies to this shell and what it starts, not to an already-running <pid>

What you just built

You now hold the whole machine: a CPU running one fetch–decode–execute loop; virtual memory giving every process a private world; the heap and stack; syscalls and interrupts as the only doors into the kernel; the scheduler time-slicing the CPU; threads, races, locks and deadlock; signals for control; file descriptors and epoll for all I/O; and the memory hierarchy explaining every performance question. Every module above — Linux, Docker, Kubernetes — is assembled from exactly these parts. A container is a process with namespaces and cgroups; a socket is an fd; an OOM kill is virtual memory hitting a wall you set.

The interview gauntlet

The questions actually asked for SRE, systems, and platform-engineering roles — conceptual, debugging, and "explain it to me" prompts, plus a design question. Each expands to show what the interviewer is really probing for, a model answer, and the follow-up traps. Answer all of these out loud and you have mastered this module.

Q1 · What happens, step by step, when you run ./a.out?

What they're really probing

The whole stack: fork/exec, the loader and dynamic linking, demand paging, process lifecycle — not just "it runs."

Model answer

The shell forks a child (copy-on-write, so nothing is truly copied yet). The child calls execve on the binary: the kernel discards the inherited image and maps the executable's segments. For a dynamically linked binary it also maps the interpreter named in the ELF header (ld.so), which runs first, maps libc, and resolves symbols. Almost nothing is in RAM yet: pages are mapped but not present, so the very first instruction fetch triggers a page fault and the kernel loads that page on demand. Control then flows from _start to main. The process runs, entering the kernel via syscalls for any I/O, until it exits and remains a zombie until its parent calls wait to reap it.
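The last step of that answer, the zombie, can be observed directly. A minimal sketch assuming Linux's /proc/<pid>/stat format; `task_state` and `observe_zombie` are hypothetical names. The child exits at once, and the parent deliberately delays its wait so the Z state is visible.

```python
import os
import time


def task_state(pid):
    """Return the one-letter state (R, S, D, Z, ...) from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        # The command name sits in parentheses and may contain spaces,
        # so split on the last ')' before reading the state field.
        return f.read().rsplit(")", 1)[1].split()[0]


def observe_zombie(timeout=5.0):
    """Fork a child that exits, watch it become a zombie, then reap it."""
    pid = os.fork()                       # 0 in the child, child PID in the parent
    if pid == 0:
        os._exit(7)                       # child: exit immediately with status 7
    deadline = time.monotonic() + timeout
    state = task_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = task_state(pid)
    _, status = os.waitpid(pid, 0)        # reaping removes the zombie entry
    gone = not os.path.exists(f"/proc/{pid}")
    return state, os.WEXITSTATUS(status), gone
```

The exit status survives inside the zombie entry until waitpid collects it, which is the whole reason the entry is kept around.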

Follow-up traps

"Why does fork return twice?" — 0 in the child, child PID in the parent.
"Why not copy all memory?" — copy-on-write; copied only when written.
"Parent exits first?" — orphan re-parents to PID 1, which reaps it.

Q2 · Explain virtual memory as if I know nothing about it.

What they're really probing

Clarity of the mental model plus the machinery (pages, page table, TLB, faults) behind it.

Model answer

Every process gets its own private, fictional memory and believes it owns all of it. The hardware translates these fake (virtual) addresses to real (physical) ones on every access, in page-sized chunks, via a per-process page table; a small cache (the TLB) makes repeated translations fast. Pages load into RAM on demand through page faults. Payoffs: isolation (processes can't see each other's memory), a clean uniform layout, and overcommit. A segfault is a page fault the kernel refused to satisfy.
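The translation step is plain arithmetic, and a toy model makes it concrete. This sketch assumes 4 KiB pages and a single-level table stored as a dictionary; real hardware uses multi-level tables and a TLB, and every name here (`translate`, `PageFault`, the sample table) is illustrative only.

```python
PAGE_SIZE = 4096          # assumed page size: 4 KiB
OFFSET_BITS = 12          # log2(PAGE_SIZE)


class PageFault(Exception):
    """Raised when the toy page table cannot satisfy an access."""


def translate(page_table, vaddr, write=False):
    """Translate a virtual address using {virtual page: (frame, writable)}."""
    vpn = vaddr >> OFFSET_BITS                 # high bits select the page
    offset = vaddr & (PAGE_SIZE - 1)           # low 12 bits pass through unchanged
    entry = page_table.get(vpn)
    if entry is None:
        raise PageFault(f"page {vpn:#x} is not mapped")
    frame, writable = entry
    if write and not writable:
        raise PageFault(f"page {vpn:#x} is read-only")
    return (frame << OFFSET_BITS) | offset


# One writable page and one read-only page, both hypothetical.
table = {0x12345: (0xABC, True), 0x12346: (0x777, False)}
physical = translate(table, 0x12345678)        # page 0x12345, offset 0x678
```

Here `physical` is 0xABC678: the frame number replaced the page number and the offset was carried over. An access to an unmapped page, or a write to the read-only one, raises PageFault, which is the toy equivalent of the fault the kernel either services or turns into a segfault.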

Follow-up traps

"What's in the page table?" — virtual-page → physical-frame plus permission bits.
"Why is the TLB critical?" — without it every access needs a multi-level page-table walk.
"Minor vs major fault?" — allocate/zero a frame vs read from disk/swap.

Q3 · A process is stuck and won't die even with kill -9. Why, and what do you do?

What they're really probing

The D (uninterruptible sleep) state, its link to I/O, and why signals can't reach it.

Model answer

It's almost certainly in uninterruptible sleep (D): blocked inside a kernel syscall on I/O that can't be interrupted, typically a slow or dead disk or a hung NFS mount. SIGKILL is queued but not acted on, because a task in D state does not process signals until the kernel operation it is waiting on completes. You can't force it; you fix the underlying I/O (recover the mount or disk). Confirm with ps -o pid,stat,wchan -p <pid>: D plus a wchan in an I/O wait function. If the state is Z instead, the process is already dead and only its parent can reap it.

Follow-up traps

"So how do you kill it?" — you can't directly; resolve the I/O it awaits.
"Effect on load average?" — D counts toward load, so load can be huge while CPU is idle.

Q4 · Your service's memory grows until it's OOM-killed. How do you diagnose it?

What they're really probing

Systematic memory debugging: leak vs cache vs limit, RSS trends, and OOM-killer logic.

Model answer

Confirm it's a leak, not page cache — track the process's RSS over time (/proc/<pid>/status), not system-wide "used." Monotonic climb that never plateaus = leak in the app; a plateau near the limit = working set larger than the limit. Check the cgroup limit and memory.events for prior OOM kills, and dmesg for Out of memory: Killed process naming the victim. Fixes: raise the limit if the working set is legitimately that big, or fix the leak (heap profiler, unbounded caches/queues).

Follow-up traps

"Why that process?" — highest OOM score (~memory used, tunable via oom_score_adj).
"Container killed but host had free RAM?" — it hit the cgroup limit, not the host.
"free shows no free memory — bad?" — usually not; most is reclaimable page cache.

Q5 · Walk me through a race condition and how you'd fix it.

What they're really probing

Whether you understand shared mutable state, non-atomic operations, and mutual exclusion.

Model answer

Two threads share a variable and each does x = x + 1, which is really read-add-write. If the scheduler interleaves them — both read the same old value, both add one, both write back — one increment is lost. The result depends on timing, so the bug is intermittent and load-dependent (a race condition). The fix is mutual exclusion: wrap the read-add-write in a mutex so only one thread is in that critical section at a time, making it atomic. Alternatives: atomic instructions, or avoiding shared mutable state entirely.
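Real races are timing-dependent, so the sketch below forces the bad interleaving with a barrier to make the lost update reproducible, then shows the mutex fix. It is an illustration, not a stress test; `lost_update` and `locked_increments` are hypothetical names.

```python
import threading


def lost_update():
    """Force both threads to read before either writes; one increment is lost."""
    state = {"x": 0}
    both_have_read = threading.Barrier(2)

    def worker():
        old = state["x"]              # read
        both_have_read.wait()         # neither thread writes until both have read
        state["x"] = old + 1          # add, then write back a stale result

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["x"]


def locked_increments(n_threads=4, per_thread=10_000):
    """Same read-add-write, but inside a mutex so it behaves atomically."""
    state = {"x": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(per_thread):
            with lock:                # critical section: one thread at a time
                state["x"] = state["x"] + 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["x"]
```

`lost_update()` returns 1 even though two increments ran, because both threads wrote back 0 + 1. `locked_increments()` returns exactly n_threads × per_thread every time.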

Follow-up traps

"Why does adding a print make it disappear?" — it changes timing, hiding (not fixing) the race.
"Downside of locks?" — contention (serialises threads) and deadlock.

Q6 · What is deadlock and how do you prevent it?

What they're really probing

The four Coffman conditions and the practical prevention used in real code.

Model answer

Deadlock is when threads wait on each other in a cycle and none can proceed — A holds lock 1 and wants 2; B holds 2 and wants 1. It needs four conditions at once: mutual exclusion, hold-and-wait, no preemption, and circular wait. Break any one and it can't happen. The standard real-world fix is to eliminate circular wait by imposing a global lock ordering — every thread always acquires locks in the same order. Other options: acquire all locks at once (kills hold-and-wait), or use try-lock with timeouts and back off (kills no-preemption).
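A global lock ordering can be sketched in a few lines. Ordering by `id()` is an assumption made for this illustration (any stable, total order works, such as an account number); `acquire_in_order` and `run_transfers` are hypothetical names, and the helper expects distinct locks.

```python
import threading
from contextlib import contextmanager


@contextmanager
def acquire_in_order(*locks):
    """Acquire distinct locks in one global order, whatever order was requested."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    try:
        yield
    finally:
        for lock in reversed(ordered):
            lock.release()


def run_transfers(rounds=2000):
    """Two threads request the same two locks in opposite orders."""
    lock_a, lock_b = threading.Lock(), threading.Lock()
    balance = {"a": 100, "b": 100}

    def a_to_b():
        for _ in range(rounds):
            with acquire_in_order(lock_a, lock_b):
                balance["a"] -= 1
                balance["b"] += 1

    def b_to_a():
        for _ in range(rounds):
            with acquire_in_order(lock_b, lock_a):    # opposite request order
                balance["b"] -= 1
                balance["a"] += 1

    threads = [threading.Thread(target=a_to_b, daemon=True),
               threading.Thread(target=b_to_a, daemon=True)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    finished = not any(t.is_alive() for t in threads)
    return finished, balance
```

If each thread acquired its locks in the order it named them, this is the A-holds-1-wants-2 cycle from the answer above. Sorting first means a circular wait cannot form, so both threads finish.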

Follow-up traps

"Difference from livelock?" — livelock: threads keep reacting to each other and make no progress (not blocked, just busy).
"How do you detect it in production?" — threads stuck, no CPU use; thread dump shows the cycle.

Q7 · Process vs thread: what's the difference and when do you choose each?

What they're really probing

Shared vs isolated address space and the practical trade-offs.

Model answer

A process has its own private address space and fds; it's isolated (a crash can't corrupt another) but heavier and needs IPC to share data. Threads live in one process and share its address space and fds, so they communicate through shared memory for free and are cheap to create/switch — but a bug in one can corrupt the whole process, and shared data needs locks. On Linux both come from clone(). Choose threads for cheap shared-memory concurrency; choose processes for fault isolation or to escape a global lock (e.g. Python's GIL).
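The shared-versus-isolated distinction shows up in a direct comparison. A minimal sketch for a POSIX system; `thread_vs_process` is a hypothetical name. A thread and a forked child each append to the same list, and the parent checks what it can see afterwards.

```python
import os
import threading


def thread_vs_process():
    """Return the parent's view of a list after a thread, then a child, append to it."""
    data = ["original"]

    # A thread shares the address space: its write is visible to us.
    worker = threading.Thread(target=lambda: data.append("from thread"))
    worker.start()
    worker.join()
    after_thread = list(data)

    # A forked child gets a copy-on-write copy: its write stays private.
    pid = os.fork()
    if pid == 0:
        data.append("from child")
        os._exit(0)
    os.waitpid(pid, 0)
    after_child = list(data)

    return after_thread, after_child
```

The thread's append shows up in the parent; the child's never does. Getting data back from the child would take IPC (a pipe, a socket, shared memory), which is the cost the answer above refers to.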

Follow-up traps

"How does the scheduler see them?" — it schedules threads (tasks) individually.
"Why multiprocessing in Python?" — the GIL prevents threads from using multiple cores for CPU-bound work.

Q8 · What is a context switch and why is it expensive?

What they're really probing

The subtle answer — caches and TLB, not the register copy.

Model answer

It's the kernel saving one task's CPU state and restoring another's so one core can multiplex many tasks. The register save/restore is cheap; the real cost is indirect. Switching to a different process swaps page tables and invalidates much of the TLB (less so on CPUs that tag TLB entries per address space, such as x86 PCID or ARM ASID), and the incoming task runs with cold CPU caches, suffering cache/TLB misses until its working set warms up. So heavy switching (too many threads, tiny slices, lock contention) burns CPU in overhead, visible as high %sys and high involuntary context-switch counts, rather than useful work.
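A process can read its own switch counters through getrusage. The sketch assumes a Unix system where `ru_nvcsw` and `ru_nivcsw` are maintained (Linux does); `switch_counts` and `demo` are hypothetical names. Each short sleep blocks the process, which gives up the CPU voluntarily.

```python
import resource
import time


def switch_counts():
    """Return (voluntary, involuntary) context switches for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


def demo(sleeps=50):
    """Block repeatedly and report how many switches of each kind that caused."""
    vol_before, invol_before = switch_counts()
    for _ in range(sleeps):
        time.sleep(0.001)             # blocking means yielding the CPU voluntarily
    vol_after, invol_after = switch_counts()
    return vol_after - vol_before, invol_after - invol_before
```

Replacing the sleeps with a busy loop on a loaded machine tends to move the growth to the involuntary counter instead, which is the preemption case.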

Follow-up traps

"Voluntary vs involuntary?" — blocked on I/O vs preempted by the scheduler.
"Thread switch within a process cheaper?" — yes, same address space, less TLB damage.

Q9 · Load average is 200 but CPU utilisation is near zero. Explain.

What they're really probing

That Linux load average includes D-state processes, not just CPU demand.

Model answer

Linux load average counts processes that are runnable (R) and in uninterruptible sleep (D). A load of 200 with idle CPU means ~200 processes stuck in D waiting on I/O — not consuming CPU. The classic cause is a slow or hung storage backend (dying disk, unresponsive NFS): requests pile up in I/O wait. The fix is on the I/O side, not the CPU. Treating load average as "CPU busyness" is the mistake — it's "demand for the system, including I/O waits."
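To see what is feeding the number, count tasks by state. A sketch assuming Linux's /proc layout, where each thread has a stat file under /proc/<pid>/task; `task_states` and `load_report` are hypothetical names. The load average is a decaying average, so the instantaneous counts only approximate it.

```python
import os
from collections import Counter


def task_states():
    """Count every thread on the system by its one-letter scheduler state."""
    states = Counter()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        task_dir = f"/proc/{entry}/task"
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue                              # process exited while scanning
        for tid in tids:
            try:
                with open(f"{task_dir}/{tid}/stat") as f:
                    state = f.read().rsplit(")", 1)[1].split()[0]
            except (OSError, IndexError):
                continue
            states[state] += 1
    return states


def load_report():
    """Put the 1-minute load average next to the current R and D counts."""
    states = task_states()
    return {
        "loadavg_1m": os.getloadavg()[0],
        "runnable": states["R"],
        "uninterruptible": states["D"],
    }
```

A large `uninterruptible` count next to a small `runnable` one is the signature described above: the load is I/O wait, not CPU demand.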

Follow-up traps

"How confirm?" — ps -eo stat | grep -c D, iostat/iotop for disk saturation, check stuck mounts.
"Why does D count?" — to reflect true system demand including I/O.

Q10 · Explain a file descriptor, and why sockets, files, and pipes all use them.

What they're really probing

The "everything is a file" abstraction and why it's powerful.

Model answer

An fd is a small integer the kernel gives as a handle to an open resource; all operations name the resource by that integer. The power is uniformity: files, pipes, sockets, terminals, and event queues share one interface (read/write/close), so the same code works across all of them, and one event loop (epoll) can watch anything that may block: sockets, pipes, terminals, eventfd, timerfd. The exception is regular files, which epoll rejects because they always count as ready. Underneath, the fd indexes a per-process table → a system-wide open-file entry (holding the offset) → the underlying object. fds 0/1/2 are stdin/stdout/stderr, which makes shell redirection work.
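The three-level structure (fd → open-file entry → object) can be observed through file offsets. A minimal sketch for a POSIX system; `offsets_after_read` is a hypothetical name. A duplicated fd shares the open-file entry and therefore the offset, while a second open of the same path gets an entry of its own.

```python
import os
import tempfile


def offsets_after_read():
    """Read 4 bytes via one fd, then report the offset seen through three fds."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"0123456789")
        tmp.flush()
        fd = os.open(tmp.name, os.O_RDONLY)
        alias = os.dup(fd)                             # same open-file entry
        independent = os.open(tmp.name, os.O_RDONLY)   # separate open-file entry
        try:
            os.read(fd, 4)                             # advances the shared offset
            return (os.lseek(fd, 0, os.SEEK_CUR),
                    os.lseek(alias, 0, os.SEEK_CUR),
                    os.lseek(independent, 0, os.SEEK_CUR))
        finally:
            for descriptor in (fd, alias, independent):
                os.close(descriptor)
```

The result is (4, 4, 0). Shell redirection such as 2>&1 relies on the same duplication: fd 2 becomes another name for fd 1's open-file entry, so both streams share one offset and do not overwrite each other.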

Follow-up traps

"Why does 2>&1 work?" — points fd 2 at whatever fd 1 points to.
"fd exhaustion?" — hitting ulimit -n → EMFILE, often an fd leak.

Q11 · SIGTERM vs SIGKILL: what's the difference and why does it matter?

What they're really probing

Graceful shutdown, catchable vs uncatchable signals, and the tie to orchestration.

Model answer

SIGTERM (the default kill) politely asks a process to exit and is catchable, so a well-built program runs cleanup — flush buffers, finish in-flight requests, close connections — then exits. SIGKILL (kill -9) is uncatchable: the kernel destroys the process immediately with no cleanup, risking corrupt/partial state. So you always try SIGTERM first. This is exactly Kubernetes' pod-termination flow: it sends SIGTERM, waits a grace period, then SIGKILL — which is why services implement a SIGTERM handler for graceful shutdown.
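A graceful-shutdown handler is short. This is a sketch, assuming it runs in the main thread (Python only allows signal handlers to be installed there); `run_until_sigterm` is a hypothetical name, and the process sends SIGTERM to itself as a stand-in for an external kill <pid> or an orchestrator.

```python
import os
import signal
import time


def run_until_sigterm(max_wait=5.0):
    """Install a SIGTERM handler, receive the signal, then run cleanup."""
    state = {"stop": False, "cleaned_up": False}

    def on_term(signum, frame):
        state["stop"] = True          # only set a flag; do the real work outside

    previous = signal.signal(signal.SIGTERM, on_term)
    try:
        os.kill(os.getpid(), signal.SIGTERM)      # stand-in for `kill <pid>`
        deadline = time.monotonic() + max_wait
        while not state["stop"] and time.monotonic() < deadline:
            time.sleep(0.01)          # a real service would handle requests here
        # Cleanup that SIGKILL would never allow: flush, drain, close.
        state["cleaned_up"] = True
    finally:
        signal.signal(signal.SIGTERM,
                      previous if previous is not None else signal.SIG_DFL)
    return state["stop"], state["cleaned_up"]
```

Keeping the handler down to setting a flag is a common design choice: the main loop notices the flag at a safe point and shuts down in an orderly way within the grace period.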

Follow-up traps

"Which signals can't be caught?" — SIGKILL and SIGSTOP.
"Why did SIGKILL not work on my process?" — it's in D state, can't receive any signal until its I/O completes.

Q12 · Design: one box, 100,000 concurrent network connections. How?

What they're really probing

Combining fds, non-blocking I/O, and event notification into a systems design (C10K).

Model answer

Not a thread per connection — 100k threads means crushing context-switch and memory overhead. Instead use non-blocking sockets and an event-driven model: register all connection fds with epoll, and run a small pool of threads (~one per core), each an event loop that epoll_waits and services only the fds that are actually ready. This is why "everything is an fd" matters — one epoll loop watches tens of thousands of sockets. It's the architecture of nginx, Redis, and Node. Keep per-connection state small, scale threads to cores, and never make a blocking syscall inside the loop (it would stall every connection on that thread).
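The shape of such an event loop can be sketched with Python's selectors module, whose DefaultSelector uses epoll on Linux. This is a toy echo server meant to show the structure, not a production design; `serve_echo` and `demo` are hypothetical names, and writes are simplified as noted in the comments.

```python
import selectors
import socket
import threading


def serve_echo(listener, stop_event):
    """One event loop: a single thread services every socket that is ready."""
    sel = selectors.DefaultSelector()              # epoll on Linux
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ, data="accept")
    try:
        while not stop_event.is_set():
            # Sleep in the kernel until some fd is ready (or 0.1 s passes).
            for key, _events in sel.select(timeout=0.1):
                if key.data == "accept":
                    try:
                        conn, _addr = listener.accept()
                    except BlockingIOError:
                        continue
                    conn.setblocking(False)
                    sel.register(conn, selectors.EVENT_READ, data="echo")
                    continue
                conn = key.fileobj
                try:
                    data = conn.recv(4096)
                except BlockingIOError:
                    continue
                except OSError:
                    data = b""
                if data:
                    # Simplification: a real server would buffer output and wait
                    # for EVENT_WRITE rather than assume the send buffer has room.
                    conn.sendall(data)
                else:
                    sel.unregister(conn)           # peer closed the connection
                    conn.close()
    finally:
        for key in list(sel.get_map().values()):
            if key.data == "echo":
                key.fileobj.close()
        sel.close()


def demo(message=b"hello"):
    """Start the loop in a background thread and echo one message through it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))                # port 0: let the OS choose
    listener.listen()
    stop = threading.Event()
    loop = threading.Thread(target=serve_echo, args=(listener, stop), daemon=True)
    loop.start()
    reply = b""
    try:
        with socket.create_connection(listener.getsockname(), timeout=5) as client:
            client.sendall(message)
            while len(reply) < len(message):
                chunk = client.recv(len(message) - len(reply))
                if not chunk:
                    break
                reply += chunk
    finally:
        stop.set()
        loop.join(timeout=5)
        listener.close()
    return reply
```

Nothing in the loop blocks on a single connection: accept and recv run only after the selector has reported the fd as ready. That property is what lets one thread carry many thousands of sockets.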

Follow-up traps

"Why not select?" — O(n) per call, capped at FD_SETSIZE; epoll is O(ready).
"Where does blocking hurt?" — inside the loop it stalls all that thread's connections; push blocking work to a separate pool.
"Limits to tune?" — ulimit -n and ephemeral port range.

Operating Systems

1 · What a computer is actually doing

2 · From a file on disk to a living process

The shape of a process's memory

How a process is born: fork + exec

3 · Virtual memory: the great illusion

Pages, page tables, and the MMU

The two numbers interviewers ask about

4 · How a program actually gets memory

malloc doesn't usually call the kernel

5 · The syscall boundary: asking the kernel for anything

6 · Interrupts: how the CPU gets pulled away

7 · The scheduler: sharing one CPU among many

The context switch — and why it's expensive

Process states — know these cold

8 · Concurrency: threads, races, locks, deadlock

The race condition — the root of concurrency bugs

Deadlock — when locks freeze each other

9 · Signals: the kernel tapping a process on the shoulder

10 · File descriptors and the I/O models

Blocking, non-blocking, and how one program serves 100,000 connections

11 · The memory hierarchy: why programs wait

12 · When memory runs out: the OOM killer
