Storage
Core · ~180 min · 6 labs
Requires: Operating Systems

This module assumes you know only that "computers save files," and takes you to the point where you can reason about durability, design a storage layer, and debug common disk problems out loud. Every idea builds on the one before it. It is grounded in Kleppmann's Designing Data-Intensive Applications and the Google SRE book. Read it top to bottom; the labs and Interview Gauntlet will then feel like review.

1 · What storage actually is

Start with the one fact that makes storage a distinct, hard problem: RAM forgets. The memory your programs use (from the OS module) is volatile — cut the power and every byte is gone. Storage is about non-volatile media: disks that keep your data across reboots and power loss. That single requirement — surviving a crash — is the source of nearly every complication in this module.

Volatile vs non-volatile
RAM (volatile): Fast, forgets on power loss. Nanosecond access, but its contents vanish when power is cut. Where running programs live.
disk (non-volatile): Slower, but keeps its contents without power. Micro- to millisecond access, and it survives reboots, crashes, and power loss. Where data must ultimately land to be "saved."
The whole game: RAM is fast but forgetful; disk is durable but slow. Every storage design negotiates between the two.

This immediately creates the central tension. Programs want RAM speed but disk durability, and you can't fully have both. Writing straight to disk on every change is safe but slow; buffering writes in RAM and flushing later is fast but risks losing data if the machine dies before the flush. The entire stack you'll learn — the page cache, fsync, journaling, write-ahead logs, replication — exists to manage exactly this trade-off between speed and durability. Keep it in mind: "is this actually on disk yet, or still in volatile RAM?" is the question behind most data-loss bugs.

One more foundation: storage is organised in fixed-size blocks (historically 512-byte sectors, now commonly 4 KB), not individual bytes. The disk can only read or write a whole block at a time. This granularity — plus the physics of the media below — is why how you access data matters as much as how much.
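The block granularity can be made concrete with a little arithmetic. This is a minimal sketch that assumes the 4 KB block size mentioned above; the function names are hypothetical, and it ignores filesystem details such as sparse files or tiny files stored inline.

```python
BLOCK_SIZE = 4096  # assumed 4 KB block size, as in the text


def blocks_touched(offset: int, length: int, block_size: int = BLOCK_SIZE) -> range:
    """Block numbers covered by the byte range [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


def allocated_bytes(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Space a file occupies once its size is rounded up to whole blocks."""
    return -(-file_size // block_size) * block_size  # ceiling division


# A 1-byte file still occupies one whole block.
print(allocated_bytes(1))              # 4096
# A 10-byte write that straddles a block boundary touches two blocks.
print(list(blocks_touched(4090, 10)))  # [0, 1]
```

The second result is why small, misaligned writes cost more than their size suggests: the device has to handle every block the range overlaps.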

2 · The physical layer: why access patterns matter

To reason about storage performance you must know what the hardware is actually doing. Two technologies dominate, and their physics are opposite:

HDD vs SSD
HDD (spinning disk): Spinning magnetic platters plus a moving read head. To read a block, the head must physically move to the right track (seek) and wait for the platter to rotate under it. Sequential reads (adjacent blocks) are fast; random reads (scattered blocks) are agonisingly slow because each one pays a ~10 ms seek. Cheap per GB.
SSD (flash): No moving parts, only electronic flash cells. No seek penalty, so random access is dramatically faster than on an HDD. But cells wear out after a limited number of writes, and flash can only be erased in large blocks, so SSDs do clever internal remapping (wear-levelling, garbage collection). Fast, more expensive per GB.
HDD: sequential fast, random catastrophic (seeks). SSD: random fast, but wears with writes. This shapes every storage-engine design.

The single most important consequence, true on both media (dramatically so on HDDs, still meaningfully on SSDs): sequential access is far faster than random access. On an HDD, reading a megabyte of adjacent blocks in one sweep beats reading a thousand scattered blocks by orders of magnitude; on an SSD the gap is smaller but still real. This one fact explains an enormous amount of database design. It's why databases go to great lengths to turn random writes into sequential ones (the write-ahead log, LSM-trees), and why "just add an index" isn't free.

Recall the latency pyramid from OS: an SSD read is ~100 µs and an HDD seek ~10 ms, roughly 1,000× and 100,000× slower than RAM's ~100 ns. So every trip to disk is precious, which is why the OS aggressively caches disk data in spare RAM (the page cache, Section 5). Two metrics describe a disk: IOPS (I/O operations per second: how many separate reads/writes) and throughput (MB/s: bulk transfer). A workload of many tiny random reads is IOPS-bound; a bulk copy is throughput-bound. Knowing which you're up against is half of storage performance debugging.
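The ratios and the IOPS-versus-throughput distinction can be checked with back-of-envelope arithmetic. This sketch uses the round latency figures from the text; the 4 KB I/O size and the one-request-at-a-time model are simplifying assumptions, not measurements.

```python
RAM_NS = 100               # ~100 ns, from the text
SSD_READ_NS = 100_000      # ~100 µs
HDD_SEEK_NS = 10_000_000   # ~10 ms

print(SSD_READ_NS // RAM_NS)   # 1000
print(HDD_SEEK_NS // RAM_NS)   # 100000


def random_read_mb_per_s(latency_ns: int, io_size_bytes: int = 4096) -> float:
    """Throughput of one-at-a-time random reads: IOPS × I/O size."""
    iops = 1_000_000_000 / latency_ns
    return iops * io_size_bytes / 1_000_000


hdd_iops = 1_000_000_000 // HDD_SEEK_NS
print(hdd_iops)                            # 100
print(random_read_mb_per_s(HDD_SEEK_NS))   # 0.4096
```

Under these assumptions a seek-bound HDD manages about 100 random reads per second, which is well under 1 MB/s of useful data. The same disk streaming sequentially is limited by throughput instead, which is the IOPS-bound versus throughput-bound split described above.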

3 · Block, file, and object: three ways to see storage

Raw blocks are hard to use directly, so storage is exposed through three abstractions. Interviewers ask you to compare them because each maps to different real systems (local disks, NAS, S3):

The three storage abstractions
block storage: A raw array of fixed-size blocks, addressed by number, like a bare disk. No notion of files. What a filesystem or database is built on. Examples: a physical disk, a cloud volume (AWS EBS). Lowest level, highest control.
file storage: A hierarchy of named files in directories, with a filesystem managing the blocks underneath. What you use daily; can be shared over a network (NFS/SMB → a NAS). Familiar, POSIX semantics, but harder to scale to enormous size.
object storage: Flat namespace of "objects" (blob + metadata) accessed by key over HTTP, with no hierarchy and no partial in-place edits (you replace whole objects). Examples: AWS S3. Scales essentially infinitely and is cheap, at the cost of POSIX semantics and low-latency random writes.
Block = raw disk (build on it) · file = named hierarchy (use it) · object = infinite HTTP key-value blobs (scale with it). Choosing correctly is a design question.

The mental model: block is the foundation, file and object are conveniences built for different goals. A database often wants block storage (it manages its own layout for performance). Applications want file storage for familiarity. Anything needing massive, cheap, durable capacity for whole-object reads/writes (backups, images, data lakes) wants object storage. The rest of this module drills into the file/filesystem layer and then into how databases lay bytes on block storage — because that's where the deep interview questions live.

4 · The filesystem: inodes, names, and links

A filesystem is the bookkeeping that turns a flat array of blocks into named files in directories. Its central data structure is the inode ("index node"), and understanding it dissolves a whole class of confusing behaviours.

Here is the key separation, which surprises almost everyone: a file's name is not part of the file. The inode holds everything about the file — its size, permissions, timestamps, and pointers to the data blocks — but not its name. The name lives in the directory, which is simply a table mapping names to inode numbers (a directory entry, or "dirent"). So a directory is just a list like "report.txt → inode 4021."

Walkthrough · finding /home/user/file.txt
/home/user/file.txt

You ask to open a path. The kernel resolves it one component at a time, walking down directories: look up home in /, then user in home, then file.txt in user. Each step is a directory lookup.

dir table: file.txt → inode 4021

In the final directory, the entry file.txt maps to an inode number (4021). That's all the directory holds — a name and a number. The name and the file are two separate things joined here.

inode 4021: size, perms, block ptrs

The kernel reads inode 4021, which holds the file's metadata — size, owner, permissions, timestamps — and, crucially, pointers to the data blocks. The name isn't here. Neither is the data — just where to find it.

block 88 · block 89 · block 240

Following the inode's pointers, the kernel reads the actual data blocks off the disk (possibly scattered — hence sequential vs random matters). Now you have the file's contents. Path → dirent → inode → blocks: that chain is the entire lookup.

Name → directory entry → inode (metadata + block pointers) → data blocks. The name and the file are separate — which explains links and the "deleted but open" trap below.

This name/inode split explains links. A hard link is a second directory entry pointing at the same inode: two names, one file, indistinguishable. The inode keeps a link count, and the data is freed only when the count hits zero and no process still has the file open. A symbolic (soft) link is a tiny file that just contains another path, a pointer to a name, which breaks if the target is renamed or removed. And the famous trap: because an open file descriptor (from OS) is another reference to the inode, deleting a file that's still open removes its name but not the inode. The data lives on, invisible, until the last fd closes. This is why df can report a full disk while du finds nothing large enough to explain it: a deleted-but-open logfile is still consuming blocks. You'll reproduce exactly that in the labs.
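The same behaviour can be observed from a program. The sketch below assumes a POSIX system and a local filesystem that supports hard links and symlinks; the file names are illustrative only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "file.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")

    with open(original, "w") as f:
        f.write("hi\n")

    os.link(original, hard)      # second directory entry, same inode
    os.symlink(original, soft)   # tiny file that stores a path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    links_after_hard_link = os.stat(original).st_nlink

    # Hold a descriptor open, then remove both names.
    fd = os.open(original, os.O_RDONLY)
    os.unlink(original)
    os.unlink(hard)

    links_while_open = os.fstat(fd).st_nlink      # no names point at the inode now
    data_while_open = os.read(fd, 100)            # contents still readable via the fd
    symlink_dangling = not os.path.exists(soft)   # exists() follows the link
    os.close(fd)                                  # last reference gone; blocks can be freed

print(same_inode, links_after_hard_link, links_while_open,
      data_while_open, symlink_dangling)
```

Between the two unlink calls and the close, the file has no name anywhere in the directory tree, yet its contents are still readable through the descriptor. That is the deleted-but-open state in miniature.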

5 · The page cache, buffering, and fsync

Now the most important — and most misunderstood — topic in storage. When your program write()s to a file, the data does not go straight to disk. It goes into the page cache in RAM (the same reclaimable cache from OS), and the kernel reports success immediately. The actual disk write happens later, in the background. This is write-back caching, and it's what makes I/O feel fast — but it means a successful write() does not guarantee your data survived.

Walkthrough · the true path of a write
app: write(fd, data)

Your program calls write(). It expects the data to be saved. The syscall crosses into the kernel (from OS) carrying your bytes.

page cache (RAM) ← returns "ok"

The kernel copies the bytes into the page cache in RAM and immediately returns success — the disk hasn't been touched yet. The page is marked dirty (modified, not yet written back). Your program thinks the data is saved. It is not — it's in volatile memory.

power loss → data gone

Here is the danger window. If the machine crashes or loses power now — after write() returned but before the kernel flushed the dirty page — the data is lost, even though your program was told it succeeded. This is the source of countless "we lost the last few seconds of data" incidents.

app: fsync(fd) → forced to disk

To guarantee durability, the program must call fsync(), which forces the dirty pages to physical disk and only returns once they're truly persisted. This is slow (a real disk write) — which is exactly why databases don't fsync on every write, and instead batch writes into a sequential log (next section). Durability is a choice you pay for.

write() → page cache (RAM, fast, "ok") → background flush. fsync() forces it to disk. A returned write() ≠ durable; only fsync guarantees it.

This trade-off has a name and a knob everywhere. Databases give you durability settings; the OS lets you fsync/fdatasync; even disks have their own volatile write caches (which is why databases sometimes disable them or require battery-backed caches). The mental model to carry into every design: "acknowledged" and "durable" are different guarantees, separated by an fsync. A system that acks before fsync is fast but can lose recently-acked data on crash; a system that fsyncs before acking is durable but slower. Knowing where a given database sits on that line is exactly what interviewers probe.
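In application code, the difference between the two guarantees is only a couple of calls. This is a minimal sketch, assuming a POSIX system; the function names and file name are hypothetical, and the directory fsync is a common extra precaution for newly created files, not something the text above requires.

```python
import os
import tempfile


def write_acknowledged(path: str, data: bytes) -> None:
    """Fast path: returns once the bytes have been handed to the kernel."""
    with open(path, "wb") as f:
        f.write(data)  # ends up in the page cache; the disk write happens later


def write_durable(path: str, data: bytes) -> None:
    """Slow path: returns only after the kernel reports the data flushed."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # Python's userspace buffer -> kernel page cache
        os.fsync(f.fileno())   # page cache -> storage device
    # Assumption: also fsync the parent directory so the new name itself
    # survives a crash.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "ledger.txt")
    write_durable(path, b"balance=100\n")
    with open(path, "rb") as f:
        read_back = f.read()

print(read_back)
```

Both functions leave identical bytes in the file when nothing goes wrong. They differ only in what is guaranteed if power is lost the instant after they return.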

6 · Crash consistency and journaling

Buffering raises a scarier problem than losing recent data: corruption. Many operations touch multiple blocks that must change together. Renaming a file, or appending data, might update the inode, a directory entry, and a free-space map. If the machine crashes between those block writes, the filesystem is left in an inconsistent, half-updated state — a dangling pointer, a lost block, a directory referencing a freed inode. On disk, there's no "undo."

The fix is journaling (a write-ahead log applied to the filesystem). Before performing a multi-block change, the filesystem first writes a description of the entire change to a dedicated on-disk journal, then applies it to the real locations, then marks the journal entry done. The magic is what happens on crash recovery: at mount, the filesystem replays the journal — any change that was fully journalled but not fully applied is re-applied; any change not fully journalled is discarded. Either way the filesystem returns to a consistent state, never a half-broken one.

Why the journal survives a crash
1 · write intent to journal: Record the whole intended change in the journal first, and mark it committed. This write is sequential and cheap.
2 · apply to real blocks: Now update the actual inode/directory/data blocks in place.
3 · crash? → replay: On recovery, a fully-journalled change is re-applied (safe to repeat); a partial one is thrown away. Never a half-done state.
Write the intent first, then act. This "write-ahead" idea — log the change before you make it — is the single most important pattern in all of durable storage, and reappears in every database.

This is the same idea that makes databases crash-safe: a write-ahead log (WAL). Note the deliberate trade-off — journalling means writing the data (or at least its metadata) twice, so most filesystems journal only metadata by default (protecting structure, not necessarily your latest data bytes) and let you choose full data journalling if you need it. Recognise the recurring shape: to make a multi-step change atomic and crash-safe, write your intent to a sequential log first, then apply it. You will see it again in the database section, in replication, and in distributed systems generally.
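The log-then-apply shape fits in a few lines. The sketch below is a toy key-value store, not how any particular filesystem or database implements its journal. The class name is hypothetical, and treating a trailing newline as the "committed" marker is a simplifying assumption (real systems typically use checksums and explicit commit records).

```python
import json
import os
import tempfile


class TinyKV:
    """Toy store: log the change durably first, then apply it."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.data = {}
        self._replay()
        self.log = open(log_path, "ab")

    def _replay(self) -> None:
        if not os.path.exists(self.log_path):
            return
        valid_end = 0
        with open(self.log_path, "rb") as f:
            for line in f:
                if not line.endswith(b"\n"):
                    break  # torn final record: it was never committed
                record = json.loads(line)
                self.data[record["key"]] = record["value"]  # safe to repeat
                valid_end += len(line)
        os.truncate(self.log_path, valid_end)  # discard the partial record

    def put(self, key: str, value: str) -> None:
        record = json.dumps({"key": key, "value": value}).encode() + b"\n"
        self.log.write(record)        # 1. write intent to the log
        self.log.flush()
        os.fsync(self.log.fileno())   #    and make it durable
        self.data[key] = value        # 2. apply to the real state

    def close(self) -> None:
        self.log.close()


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "wal.log")
    db = TinyKV(path)
    db.put("a", "1")
    db.put("b", "2")
    db.close()

    # Simulate a crash in the middle of appending a third record.
    with open(path, "ab") as f:
        f.write(b'{"key": "c", "va')

    recovered = TinyKV(path)   # 3. crash? -> replay
    recovered_data = dict(recovered.data)
    recovered.close()

print(recovered_data)   # {'a': '1', 'b': '2'}
```

The two committed records are rebuilt from the log and the half-written one is dropped, so recovery never exposes a partially applied change.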

7 · RAID: surviving disk failure

Everything so far assumed the disk works. But disks die — mechanically, all the time. RAID (Redundant Array of Independent Disks) combines several physical disks into one logical volume to buy either speed, redundancy, or both. Know the common levels and, precisely, what each protects against:

RAID levels
RAID 0 · stripe: Speed, zero redundancy. Data is split ("striped") across disks so reads/writes run in parallel, giving speed and full capacity. But any one disk failing loses everything. Never for data you care about.
RAID 1 · mirror: Redundancy via full copies. Every block is written to two disks. Survives one disk dying; reads can be faster. Costs 50% of capacity. Simple and safe.
RAID 5 · parity: Redundancy at lower cost. Stripes data plus a distributed parity block, so any one failed disk can be reconstructed from the others. Survives one failure; loses only one disk's worth of capacity. But rebuilds are slow and stressful (risking a second failure).
RAID 10 · mirror + stripe: Speed and redundancy. Disks are mirrored in pairs, then data is striped across the pairs. Fast and tolerant of disk loss (the common choice for databases) at the cost of 50% capacity.
0 = speed only (danger) · 1 = mirror · 5 = parity (cheaper redundancy) · 10 = mirror+stripe (fast + safe). Interviewers ask which survives what.
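The parity trick behind RAID 5 is plain XOR. A minimal sketch with three tiny "data disks" and one parity block follows; real arrays rotate parity across disks and work on much larger stripes, which this ignores.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# The disk holding d1 dies: rebuild it from the survivors plus parity.
rebuilt = xor_blocks(d0, d2, parity)
print(rebuilt == d1)   # True
```

Rebuilding needs every surviving disk to be read in full, which is why the rebuild window is slow and why a second failure during it is fatal.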

Two lessons that separate the fluent from the memorisers. First, RAID is not a backup. It protects against disk hardware failure, not against rm -rf, corruption, ransomware, or fire: deletions and corrupted writes replicate instantly to every mirror, and a fire takes the whole array. You still need backups (ideally off-site). Second, RAID is the local, single-machine version of an idea that goes much further: redundancy through copies. Scale it across machines and datacentres and it becomes replication (Section 9), the same principle applied to survive not just a dead disk but a dead server or a dead region.

8 · How databases store data: B-trees vs LSM-trees

Now the deepest storage-interview territory, straight from Kleppmann. A database must store key-value data on disk so it can be looked up quickly and written to quickly. There are two great families of storage engine, and understanding their opposite trade-offs is what senior interviews probe.

The core tension, again from the physics of Section 2: fast reads want data sorted and in place; fast writes want to avoid random disk seeks. The two engine families resolve it differently:

B-tree vs LSM-tree
B-tree: Update-in-place, sorted on disk. Data lives in a balanced tree of blocks kept sorted by key; a lookup walks a few levels (very fast, predictable reads). A write finds the right block and modifies it in place, which can mean a random write. Great reads, decent writes. Powers Postgres, MySQL/InnoDB, most traditional relational databases.
LSM-tree: Append-only, then merge. Writes go to an in-memory sorted structure and a sequential log, then are flushed to disk as sorted files that are periodically merged/compacted in the background. Writes are sequential and fast; reads may check several files (slower, mitigated by indexes/Bloom filters). Great writes, good-enough reads. Powers Cassandra, RocksDB, LevelDB, and write-heavy systems.
B-tree: update in place, sorted → fast reads (relational DBs). LSM: append + compact → fast writes (write-heavy/NoSQL). The classic "which and why" question.

Everything ties back to earlier sections. LSM-trees are fast at writes precisely because they turn random writes into sequential ones (Section 2) — appending to a log and merging later, rather than seeking to update in place. Both families rely on a write-ahead log for crash safety (Section 6): the write is appended to the WAL and fsync'd (Section 5) before being acknowledged, so a crash can be recovered by replaying the log. And an index is just a secondary sorted structure that trades write cost and space for read speed — which is why "just add an index" always has a price. The one-sentence takeaway interviewers want: B-trees optimise reads by keeping data sorted in place; LSM-trees optimise writes by appending sequentially and merging later.
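The LSM write and read paths can be sketched in memory. This toy version is an illustration only: the class name and the two-entry memtable limit are hypothetical, sorted Python lists stand in for on-disk sorted files, and it omits the write-ahead log, deletes (tombstones), and Bloom filters.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}
        self.runs = []  # newest first; each run is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value  # no seek: just an in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        if self.memtable:
            # One sequential write of a whole sorted run.
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest run wins; may check several runs
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self) -> None:
        merged = {}
        for run in reversed(self.runs):  # oldest first, newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("b", "1")
db.put("a", "1")   # memtable full -> flushed as run [a, b]
db.put("a", "2")
db.put("c", "1")   # flushed as a newer run [a, c]

runs_before = len(db.runs)
value_of_a = db.get("a")
db.compact()
runs_after = len(db.runs)

print(runs_before, value_of_a, runs_after)   # 2 2 1
```

Reads have to consult the runs from newest to oldest, which is the read amplification that compaction and Bloom filters exist to contain.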

9 · Replication, consistency, and CAP

One machine, however well-tuned, can die, fill up, or fall behind. To survive that and to scale, real systems keep copies of data on multiple machines: replication. It's RAID's idea (redundant copies) lifted to the network, and it introduces some of the hardest trade-offs in computing. This is the heart of Kleppmann and of distributed-systems interviews.

The first choice is when the write is acknowledged:

Synchronous vs asynchronous replication
synchronous: Ack only after replicas confirm. The write isn't acknowledged until copies are safely on other machines. Strong durability (no data loss if the primary dies), but slower (wait for the network), and it stalls if a replica is down.
asynchronous: Ack immediately, replicate in the background. Fast, and tolerant of slow/down replicas, but if the primary dies before a write propagates, that write is lost. Most systems default here, accepting a small data-loss window for speed.
Sync = safe but slow (wait for replicas); async = fast but a crash can lose the last few writes. The same "acknowledged ≠ durable" trade-off from Section 5, now across machines.
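The difference in what survives a primary crash can be simulated without any network. In this sketch the classes are hypothetical, and appending to a Python list stands in for shipping a write to another machine.

```python
class Replica:
    def __init__(self):
        self.log = []


class Primary:
    def __init__(self, replicas, synchronous: bool):
        self.log = []
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # acked but not yet shipped (async mode only)

    def write(self, value: str) -> str:
        self.log.append(value)
        if self.synchronous:
            for replica in self.replicas:
                replica.log.append(value)  # wait for every replica, then ack
        else:
            self.pending.append(value)     # ack now, ship in the background
        return "ack"

    def ship_pending(self) -> None:
        for value in self.pending:
            for replica in self.replicas:
                replica.log.append(value)
        self.pending.clear()


sync_replica = Replica()
sync_ack = Primary([sync_replica], synchronous=True).write("x=1")

async_replica = Replica()
async_ack = Primary([async_replica], synchronous=False).write("x=1")
# The async primary "crashes" here, before ship_pending() ever runs.

print(sync_ack, sync_replica.log)     # ack ['x=1']
print(async_ack, async_replica.log)   # ack []
```

Both clients saw "ack", but only the synchronous replica holds the write. That is the acknowledged-versus-durable gap, now across machines.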

The CAP theorem

Once data is spread across machines connected by a network, an unavoidable law applies. A network partition — machines unable to reach each other — will happen. When it does, a distributed system must choose between two things it cannot both keep: consistency (every read sees the latest write — all copies agree) and availability (every request still gets an answer). That's the CAP theorem: under a Partition, pick Consistency or Availability.

CAP: during a network partition, choose
CP · consistency: Refuse to answer rather than risk being wrong. During a partition the minority side rejects requests to guarantee no one reads stale data. Correct but unavailable to some. (e.g. etcd/ZooKeeper, and by extension Kubernetes' state store, which must stay consistent.)
AP · availability: Always answer, reconcile later. Every node keeps serving during a partition, accepting that copies may temporarily disagree (eventual consistency) and be merged afterward. Available but possibly stale. (e.g. Cassandra, DynamoDB, DNS.)
Partitions are inevitable, so the real choice is C vs A when one happens. There's no "CA" system in the real world — the network will partition eventually.

Two nuances that mark a strong answer. CAP is about partition time — when the network is healthy, a system can offer both consistency and availability; the trade-off only bites during a partition. And "consistency" isn't binary: there's a spectrum from strong (reads always see the latest write) through eventual (copies converge given time), and picking the right point on that spectrum for a given feature (a bank balance vs a like count) is exactly the judgement senior data-systems roles are testing.
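The CP-versus-AP choice comes down to what a node does when it cannot reach a majority. The sketch below uses hypothetical function names and a strict-majority rule as the assumed quorum policy; real systems layer leader election and versioning on top of this.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this side of a partition holds a strict majority (counting itself)."""
    return reachable_nodes > cluster_size // 2


def handle_read(mode: str, reachable_nodes: int, cluster_size: int,
                local_value: str) -> str:
    if mode == "CP" and not has_quorum(reachable_nodes, cluster_size):
        raise RuntimeError("unavailable: no quorum")  # refuse rather than risk stale data
    return local_value  # AP (or a healthy CP side): answer, possibly stale


# A 5-node cluster partitions into a side of 3 and a side of 2.
majority_answer = handle_read("CP", 3, 5, "x=2")
ap_minority_answer = handle_read("AP", 2, 5, "x=1")   # served, but may be stale

try:
    handle_read("CP", 2, 5, "x=1")
    cp_minority_refused = False
except RuntimeError:
    cp_minority_refused = True

print(majority_answer, ap_minority_answer, cp_minority_refused)   # x=2 x=1 True
```

With an even split (2 and 2 in a 4-node cluster) neither side has a strict majority, which is one reason such clusters are usually sized with an odd number of nodes.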

10 · Storage in production: the failures you'll actually hit

Theory meets the pager here. A handful of storage failure modes recur constantly; recognising them instantly is what the operational interview questions reward.

The classic storage incidents
disk full (space): No space left on device. Find the culprit with df -h (per-filesystem usage) then du -sh * (what's big). Beware: a big deleted-but-open file (Section 4) consumes space that du can't see. Restart the holder or truncate via /proc/<pid>/fd/<n>.
disk full (inodes): No space left but df -h shows free space? You've exhausted inodes: too many tiny files (each needs one) drawn from a pool that, on filesystems such as ext4, is fixed at format time. Check with df -i. Classic on mail spools and cache dirs full of small files.
slow disk (IOPS): High latency, load average climbing, processes in D state (from OS). The disk is saturated. iostat -x shows %util near 100 and rising await. You're IOPS- or throughput-bound; the fix is faster storage, caching, or fewer/larger I/Os.
silent corruption: Bad blocks / bit rot returning wrong data. Why serious systems use checksums (ZFS, database page checksums) and why RAID isn't a backup: corruption can propagate to mirrors. Detected by scrubs and checksums, not by "it seemed fine."
Space vs inodes (df -h vs df -i), IOPS saturation (iostat, D-state), deleted-but-open files, and corruption. These four cover most storage pages.
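The space-versus-inodes check can also be scripted. This sketch reads the same counters df draws on through os.statvfs (Unix only). The percentages are approximate because df accounts for reserved blocks slightly differently, and the function names and the 95% threshold are hypothetical choices for illustration.

```python
import os


def fs_usage(path: str = "/") -> dict:
    """Approximate space and inode usage for the filesystem containing `path`."""
    st = os.statvfs(path)

    def pct_used(total: int, free: int) -> float:
        # Some filesystems report 0 total inodes; treat that as "not applicable".
        return 0.0 if total == 0 else 100.0 * (total - free) / total

    return {
        "space_used_pct": pct_used(st.f_blocks, st.f_bfree),
        "inodes_used_pct": pct_used(st.f_files, st.f_ffree),
    }


def diagnose(usage: dict, threshold: float = 95.0) -> str:
    if usage["space_used_pct"] >= threshold:
        return "out of space: check du, then deleted-but-open files (lsof +L1)"
    if usage["inodes_used_pct"] >= threshold:
        return "out of inodes: too many small files (df -i)"
    return "ok"


usage = fs_usage("/")
print(usage, diagnose(usage))

# A filesystem with free bytes but no free inodes:
print(diagnose({"space_used_pct": 12.0, "inodes_used_pct": 100.0}))
```

Checking both numbers in one place avoids the common mistake of looking only at byte usage when the error message says "no space left on device".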

The through-line back to earlier modules: a saturated disk shows up as processes stuck in D state and a load average far above CPU usage (OS, Section 7) — because those processes are blocked in uninterruptible I/O. So "load average 200, CPU idle" and "everything is slow" often trace straight to storage. Storage problems wear an OS costume, which is exactly why these two modules sit next to each other.

Now build it by hand
Q1 · Inspect the block layer · warm-up

Underneath every filesystem is a block device — a raw array of fixed-size blocks. lsblk shows the disks, partitions, and mounted filesystems built on top.

Under the hood

lsblk reveals the stack: physical disk → partitions → filesystems → mount points, and any RAID/LVM layers in between. blockdev --getbsz <device> shows that device's block size (it usually needs root). Seeing this hierarchy makes "block storage is the foundation, filesystems are built on it" concrete.

Task

List block devices and their mount points, and see the filesystem types.

Verify it yourself
$ lsblk -f

You see each disk/partition, its filesystem type (ext4, xfs…), and where it's mounted — the block→filesystem stack from Section 3, live.

Solution
$ lsblk
$ lsblk -f          # with filesystem types and mountpoints
$ df -hT            # usage per mounted filesystem, with type
Q2 · Watch the inode chain · core

A file's name lives in its directory; the inode holds the metadata and block pointers. Name and file are separate things.

Under the hood

stat shows a file's inode number, size, blocks, and link count. ls -i shows inode numbers next to names. Two names with the same inode number are hard links to one file. This separation is what makes links and the deleted-but-open trap possible.

Task

Create a file, view its inode and link count, then make a hard link and watch the count rise.

Verify it yourself
$ stat file.txt

stat shows Inode: and Links: 1. After ln file.txt hard.txt, the link count becomes 2 — two names, one inode. ls -i confirms both names share the inode number.

Solution
$ echo hi > file.txt
$ stat file.txt
$ ln file.txt hard.txt
$ stat file.txt          # Links: now 2
$ ls -i file.txt hard.txt  # same inode number
Q3 · Break-and-fix: the deleted-but-open file · debug

Because an open fd is a reference to the inode, deleting a file that's still open removes its name but keeps the data alive — invisible to ls, still consuming disk. A full disk with "no big files" is often this.

Under the hood

The inode's link count drops to 0 when unlinked, but the data isn't freed until the open-fd count also hits 0. lsof +L1 lists deleted-but-open files. The fix: restart or signal the process holding the fd, or truncate it via /proc/<pid>/fd. This is the storage face of the fd chain from the OS module.

Task

Open a file, delete it while it's held open, and prove the space is still used until the holder lets go.

Verify it yourself
$ lsof +L1 2>/dev/null | head

The deleted-but-open file appears in lsof +L1 with a link count of 0 while its process still holds it — space consumed by a file ls can't see. Killing/restarting the holder frees it.

Solution

When a disk is full but du finds nothing, look for deleted-but-open files with lsof +L1 before anything else.

$ exec 4>/tmp/ghost; echo data >&4
$ rm /tmp/ghost          # name gone, fd 4 still open
$ ls -l /tmp/ghost       # gone from the directory
$ lsof +L1 2>/dev/null | grep ghost   # still open, 0 links, using space
$ exec 4>&-              # close it -> space freed
Q4 · Space vs inodes: two ways to be "full" · core

A filesystem can run out of space (bytes) or inodes (file count) independently. Both report "no space left on device", but the fix is completely different.

Under the hood

df -h shows byte usage; df -i shows inode usage. A directory full of millions of tiny files can exhaust inodes (each file needs one, from a fixed pool set at format time) while bytes are nearly empty — bewildering until you check df -i.

Task

Check both the space and inode usage of your filesystems.

Verify it yourself
$ df -i

df -h shows % of bytes used; df -i shows % of inodes used. If -i is at 100% while -h isn't, you're out of inodes, not space — delete small files, don't buy a bigger disk.

Solution
$ df -h        # space (bytes)
$ df -i        # inodes (file count)
Q5 · See durability cost: the page cache and fsync · core

Writes land in the page cache (RAM) and return instantly; only a flush/fsync forces them to disk. The gap between the two is the durability window.

Under the hood

dd with and without conv=fsync exposes the difference: buffered writes are fast because they hit RAM; forcing fsync makes them slow because they hit the physical disk. free -h shows the page cache under buff/cache — memory that's reclaimable, not lost.

Task

Write with and without forcing a sync and compare the reported speed; observe the page cache.

Verify it yourself
$ dd if=/dev/zero of=/tmp/t bs=1M count=256 conv=fsync

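The conv=fsync run reports a lower MB/s than a buffered run (drop conv=fsync to compare) because it waits for the real disk, not just RAM. That slowdown is the price of durability. If /tmp is a RAM-backed tmpfs on your system, point of= at a path on a real disk; otherwise both runs only ever touch RAM.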

Solution
$ dd if=/dev/zero of=/tmp/t bs=1M count=256           # buffered: fast (hits page cache)
$ dd if=/dev/zero of=/tmp/t bs=1M count=256 conv=fsync # forced to disk: slower
$ free -h   # page cache shown under buff/cache
$ rm -f /tmp/t
Q6 · Watch disk saturation cause D-state · debug

A saturated disk makes processes block in uninterruptible sleep (D), driving load average up while the CPU sits idle — the OS module's lesson, caused by storage.

Under the hood

iostat -x shows per-device %util (how busy the disk is) and await (average I/O latency). When %util pins near 100 and await climbs, the disk is the bottleneck and I/O-bound processes pile into D — visible in ps, and reflected in a load average far above CPU usage.

Task

Watch disk utilisation and latency while generating I/O, and correlate with process state.

Verify it yourself
$ iostat -x 1 3

Under load, the busy device shows high %util and rising await. ps -eo stat,comm | grep ^D shows processes stuck in D — the storage cause of "high load, idle CPU".

Solution
$ sudo apt install -y sysstat   # for iostat, if needed
$ iostat -x 1 3
# in another shell, generate I/O, then:
$ ps -eo stat,comm | grep "^D"   # processes blocked on I/O
What you just built

You now understand persistence end to end: why RAM's volatility makes durability hard, why sequential beats random on real media, the block/file/object abstractions, inodes and links and the deleted-but-open trap, the page cache and the acknowledged-vs-durable gap that fsync closes, journaling and write-ahead logs for crash safety, RAID and replication for surviving failure, B-trees vs LSM-trees for storing data, and CAP for the limits of distributed data. This is the foundation under every database, container volume, and stateful system you'll operate — and it connects straight back to the OS: a slow disk is a D-state, and a lost write is a page that never got fsync'd.

The interview gauntlet

The questions actually asked for SRE, systems, and data-engineering roles — conceptual, debugging, and design prompts, many straight out of Designing Data-Intensive Applications. Each expands to show what the interviewer is really probing for, a model answer, and the follow-up traps. Answer all of these out loud and you have mastered this module.

Q1 · Does a successful write() mean my data is safe on disk?
What they're really probing

The single most important storage insight — the difference between acknowledged and durable.

Model answer

No. A successful write() only means the data reached the kernel's page cache in RAM; the kernel returns success immediately and flushes to disk later (write-back caching). If the machine loses power between the write() and the flush, the data is lost despite the success return. To guarantee durability you must call fsync(), which forces the dirty pages to physical disk before returning — at a real latency cost. "Acknowledged" and "durable" are separate guarantees, and the gap between them is the danger window.

Follow-up traps
  • "Why don't databases fsync every write?" — too slow; they batch into a sequential WAL and fsync that.
  • "Disk write caches?" — disks have their own volatile caches too, so serious setups use battery-backed caches or disable them.
Q2 · B-tree vs LSM-tree: how do they differ and when would you pick each?
What they're really probing

Deep storage-engine knowledge from DDIA — the read/write trade-off and real systems.

Model answer

A B-tree keeps data sorted on disk and updates in place: lookups walk a few tree levels (fast, predictable reads), while writes may seek to modify a block. Great for read-heavy and range queries — Postgres, MySQL/InnoDB. An LSM-tree is append-only: writes go to memory + a sequential log, get flushed as sorted files, and are compacted in the background; writes are sequential and fast, but a read may check several files (mitigated by Bloom filters). Great for write-heavy workloads — Cassandra, RocksDB. Pick B-tree for read-heavy/relational; LSM for write-heavy/high-ingest. It all follows from "sequential writes are far cheaper than random ones."

Follow-up traps
  • "Why are LSM writes fast?" — they turn random writes into sequential appends.
  • "LSM downside?" — read amplification and background compaction cost/latency spikes.
  • "What do both need for crash safety?" — a write-ahead log.
Q3 · Explain the CAP theorem. Give a real CP and AP system.
What they're really probing

Distributed-systems maturity — that the choice is C vs A *during a partition*.

Model answer

In a system spread across machines, network partitions (nodes can't reach each other) are inevitable. CAP says that during a partition you must choose between Consistency (every read sees the latest write; all copies agree) and Availability (every request gets an answer) — you can't have both. A CP system refuses requests on the minority side to avoid stale reads (etcd/ZooKeeper — which is why Kubernetes' store is CP). An AP system keeps answering and reconciles later, accepting temporary disagreement / eventual consistency (Cassandra, DynamoDB, DNS). When the network is healthy you can have both; the trade-off only bites during a partition.

Follow-up traps
  • "Is there a CA system?" — not really in practice; the network will partition, so you must plan for P.
  • "Strong vs eventual consistency?" — a spectrum; choose per feature (bank balance vs like count).
Q4 · A disk is full but du can't find the space. What's happening?
What they're really probing

The deleted-but-open-file trap — connects filesystem internals to the OS fd model.

Model answer

Almost certainly a deleted-but-still-open file. Because an open file descriptor is a reference to the inode, deleting a file that a process still has open removes its name (so ls/du can't see it) but keeps the inode and its data blocks alive until the last fd closes — so the space stays consumed. Classic with a rotated log a daemon still holds open. Find it with lsof +L1 (lists files with 0 links still open); fix by restarting/signalling the holding process, or truncating via /proc/<pid>/fd/<n>. Also rule out the other "full" — inode exhaustion (df -i), where millions of tiny files fill the inode pool while bytes are free.
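
The mechanism is easy to reproduce. The sketch below assumes a POSIX system such as Linux (on Windows, unlinking an open file generally fails); the function name `demo_deleted_but_open` is made up for this example.

```python
import os
import tempfile


def demo_deleted_but_open():
    """Delete a file while holding it open, then inspect it through the fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 4096)
        os.unlink(path)                        # removes the name, not the inode
        name_gone = not os.path.exists(path)   # ls/du can no longer see it
        st = os.fstat(fd)                      # the inode is still reachable via the fd
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 4096)               # the data blocks are still there
        return name_gone, st.st_nlink, st.st_size, len(data)
    finally:
        os.close(fd)                           # last reference dropped: space is freed


name_gone, link_count, size, bytes_read = demo_deleted_but_open()
```

A link count of 0 on a file that is still open is the condition `lsof +L1` looks for.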

Follow-up traps
  • "df says full, du says not — space or inodes?" — check df -i.
  • "Why doesn't deleting the log free space?" — the daemon still holds the fd; you must make it release it.
Q5 · Why is sequential access so much faster than random, and why does it matter?
What they're really probing

The physics that underpins all storage design.

Model answer

On an HDD, a random read forces the head to physically seek to a new track and wait for rotation (~10 ms each), while sequential reads stream adjacent blocks with no seeks — orders of magnitude faster. Even on SSDs (no seek), sequential I/O is still faster due to larger transfers and less overhead. This is the reason storage engines work so hard to turn random writes into sequential ones — write-ahead logs, LSM-tree appends, log-structured filesystems — and why "add an index" (which scatters writes) has a real cost. Matching your access pattern to the media is the root of storage performance.
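
A back-of-the-envelope calculation makes "orders of magnitude" concrete. The ~10 ms per random read comes from the answer above; the 4 KiB block size and the 150 MB/s sequential rate are assumptions chosen as plausible HDD figures, not measurements.

```python
def random_read_throughput(seek_ms, block_bytes):
    """Bytes per second when every read pays a full seek + rotation delay."""
    ops_per_second = 1000 / seek_ms
    return ops_per_second * block_bytes


SEEK_MS = 10                   # from the text: ~10 ms per random read
BLOCK_BYTES = 4096             # assumption: 4 KiB per read
SEQUENTIAL_BPS = 150_000_000   # assumption: 150 MB/s streaming rate

random_bps = random_read_throughput(SEEK_MS, BLOCK_BYTES)   # 100 reads/s * 4096 bytes
ratio = SEQUENTIAL_BPS / random_bps

print(f"random:     {random_bps / 1e6:.2f} MB/s")
print(f"sequential: {SEQUENTIAL_BPS / 1e6:.2f} MB/s")
print(f"sequential is about {ratio:.0f}x faster")
```

Under these assumptions the disk delivers roughly 0.4 MB/s of random 4 KiB reads, a few hundred times less than it can stream sequentially.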

Follow-up traps
  • "How does an LSM exploit this?" — appends sequentially, compacts in the background.
  • "IOPS vs throughput?" — many small random ops vs bulk MB/s; know which bounds your workload.
Q6 · How does a filesystem stay consistent across a crash?
What they're really probing

Journaling / write-ahead logging — the atomicity pattern.

Model answer

A single logical operation (rename, append) touches multiple blocks that must change together; a crash in the middle would leave the filesystem inconsistent. The fix is journaling: before applying a multi-block change, the filesystem writes the whole intended change to a sequential on-disk journal and marks it committed, then applies it in place. On recovery it replays the journal: committed entries are re-applied and uncommitted (partial) ones are discarded, so every operation ends up either fully applied or not applied at all. Most journaling filesystems journal metadata by default (protecting structure) and can optionally journal data too. It's the same write-ahead idea databases use for their WAL: log your intent first, then act.
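
The recovery rule can be shown with a toy in-memory journal. The record shapes (`("write", txid, block, value)` and `("commit", txid)`), the block names, and the `recover` function are invented for illustration; a real filesystem journal has its own on-disk format and checksums.

```python
def recover(journal, disk):
    """Replay a journal into `disk` (a dict of block name -> contents).

    Only transactions that reached their commit record are applied;
    writes from a transaction with no commit record are dropped.
    """
    pending = {}
    for record in journal:
        kind, txid = record[0], record[1]
        if kind == "write":
            _, _, block, value = record
            pending.setdefault(txid, []).append((block, value))
        elif kind == "commit":
            for block, value in pending.pop(txid, []):
                disk[block] = value
    return disk


# Transaction 1 committed; the machine crashed before transaction 2 did.
journal = [
    ("write", 1, "inode_7", "size=10"),
    ("write", 1, "bitmap", "block 42 used"),
    ("commit", 1),
    ("write", 2, "inode_7", "size=20"),
]
disk = {"inode_7": "size=0", "bitmap": "block 42 free"}
recovered = recover(journal, dict(disk))
```

Replaying the same journal twice gives the same result, which is why recovery can safely be interrupted and restarted.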

Follow-up traps
  • "Does journalling protect my data bytes?" — by default often only metadata; enable data journalling for full protection.
  • "Cost?" — writes happen twice; a deliberate durability/perf trade-off.
Q7 · Synchronous vs asynchronous replication: trade-offs?
What they're really probing

Whether the durability-vs-latency trade-off is understood across machines.

Model answer

Synchronous replication acks a write only after replicas confirm it — so if the primary dies, no acknowledged data is lost (strong durability) — but it's slower (waits on the network) and stalls if a replica is down. Asynchronous acks immediately and replicates in the background — fast and tolerant of slow replicas — but a primary crash before propagation loses the last few writes. It's the exact "acknowledged ≠ durable" trade-off from the page-cache section, lifted across machines. Many systems use a hybrid (e.g. ack after one synchronous replica, the rest async) to balance the two.
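
The loss window can be simulated with a toy primary and one replica. Everything here (`Primary`, `ship`, `acked_writes_lost_on_failover`) is a hypothetical model with no real networking; it only tracks which acknowledged writes the replica has at the moment the primary dies.

```python
class Primary:
    """Toy primary with a single replica."""

    def __init__(self, mode):
        self.mode = mode         # "sync" or "async"
        self.log = []            # writes on the primary
        self.replica_log = []    # writes the replica has confirmed
        self.unsent = []         # async writes not yet shipped

    def write(self, value):
        self.log.append(value)
        if self.mode == "sync":
            # Sync: the replica must have the write before we acknowledge.
            self.replica_log.append(value)
        else:
            # Async: acknowledge now, ship in the background later.
            self.unsent.append(value)
        return "ack"

    def ship(self):
        # Background replication catching up.
        self.replica_log.extend(self.unsent)
        self.unsent.clear()


def acked_writes_lost_on_failover(primary, acked):
    """Acknowledged writes missing from the replica if the primary dies now."""
    return [v for v in acked if v not in primary.replica_log]


def simulate(mode):
    primary = Primary(mode)
    acked = []
    for value in ("a", "b"):
        primary.write(value)
        acked.append(value)
    primary.ship()
    primary.write("c")           # acknowledged, then the primary crashes
    acked.append("c")
    return acked_writes_lost_on_failover(primary, acked)


lost_async = simulate("async")
lost_sync = simulate("sync")
```

In async mode the client was told "c" succeeded, yet promoting the replica loses it; in sync mode nothing acknowledged is lost, at the cost of waiting on the replica for every write.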

Follow-up traps
  • "Default choice?" — usually async for latency, accepting a small loss window.
  • "Failover risk with async?" — promoting a lagging replica loses un-replicated writes (and can cause split-brain).
Q8 · Is RAID a backup? Explain RAID 0/1/5/10.
What they're really probing

A deliberate trap plus knowledge of what each level protects against.

Model answer

RAID is not a backup. It protects against disk hardware failure, not against deletion, corruption, or ransomware (those changes are written to every mirror instantly), and not against a disaster that destroys the whole array. You still need real, ideally off-site, backups. The levels: RAID 0 stripes for speed with no redundancy (any disk lost = all data lost). RAID 1 mirrors (survives one disk failure, 50% usable capacity). RAID 5 stripes with distributed parity (survives one disk failure, costs one disk of capacity, slow rebuilds). RAID 10 mirrors then stripes (fast, survives one disk failure per mirrored pair, 50% usable capacity) and is the usual database choice.
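
Two small helpers make the capacity and parity claims concrete. This is a sketch under simplifying assumptions: `usable_disks` and `xor_blocks` are illustrative names, RAID 1 and RAID 10 are assumed to use two-way mirrors, and the parity demo uses a single fixed parity block, whereas RAID 5 rotates parity across the disks.

```python
from functools import reduce


def usable_disks(level, n):
    """Disks' worth of usable capacity for n equal disks (two-way mirrors assumed)."""
    if level == 0:
        return n          # pure striping, no redundancy
    if level == 1:
        return n // 2     # every block stored twice
    if level == 5:
        return n - 1      # one disk's worth of parity
    if level == 10:
        return n // 2     # striped mirrored pairs
    raise ValueError(f"unsupported RAID level: {level}")


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Parity is the XOR of the data blocks in a stripe.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# Lose the disk holding d1: XOR the survivors with parity to rebuild it.
rebuilt_d1 = xor_blocks([d0, d2, parity])
```

The rebuild needs every surviving block in the stripe, which is why a RAID 5 rebuild reads the remaining disks in full and why a second failure during it is fatal.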

Follow-up traps
  • "RAID 5 rebuild risk?" — rebuilds stress remaining disks and are slow; a second failure during rebuild loses everything.
  • "What does RAID not protect against?" — human error, corruption, fire — hence backups.
Q9 · Design a durable, high-throughput write path for a database. What do you use?
What they're really probing

Synthesis — WAL, sequential writes, fsync, batching, replication, all together.

Model answer

Combine the whole module. Accept writes into an in-memory structure, and for durability append each to a write-ahead log on disk: a sequential append (fast, from Section 2) that you fsync before acknowledging (durable, from Section 5). To keep throughput high, batch many writes into one fsync (group commit) rather than fsyncing each. Periodically flush the in-memory data to sorted on-disk files and compact them (LSM style) so reads stay fast. For crash recovery, replay the WAL. For surviving machine failure, replicate the WAL to other nodes (sync for zero-loss, async for speed). Real systems are built from these same pieces: RocksDB follows this LSM design closely, Postgres pairs a WAL with in-place page updates, and Kafka is itself a replicated append-only log.
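
The WAL and group-commit steps can be sketched as follows. This is a minimal illustration, not how any particular database does it: the `GroupCommitWAL` class, the JSON-lines record format, and the `replay` helper are assumptions, and real logs add checksums, log rotation, and an fsync of the containing directory when the file is first created.

```python
import json
import os
import tempfile


class GroupCommitWAL:
    """Append-only log that makes a whole batch durable with one fsync."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "ab")
        self.fsync_count = 0

    def commit_batch(self, records):
        # Sequential appends: one JSON line per [key, value] record.
        for record in records:
            self.file.write(json.dumps(record).encode() + b"\n")
        self.file.flush()               # Python buffer -> OS page cache
        os.fsync(self.file.fileno())    # page cache -> storage device
        self.fsync_count += 1
        return len(records)             # only now is it safe to acknowledge

    def close(self):
        self.file.close()


def replay(path):
    """Rebuild the in-memory state from the log after a crash."""
    state = {}
    with open(path, "rb") as f:
        for line in f:
            if not line.endswith(b"\n"):
                break                   # torn final record: it was never acknowledged
            key, value = json.loads(line)
            state[key] = value
    return state


wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = GroupCommitWAL(wal_path)
acked = wal.commit_batch([["a", 1], ["b", 2], ["a", 3]])   # three writes, one fsync
wal.close()
state = replay(wal_path)
```

Three writes cost a single fsync here, which is the throughput gain of group commit: the fsync latency is shared across the batch while every acknowledged write is still on disk.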

Follow-up traps
  • "Why a WAL instead of writing data in place?" — sequential + atomic recovery; in-place random writes are slow and crash-unsafe.
  • "How to raise throughput without losing durability?" — group commit (batch fsyncs).
  • "Zero data loss on machine failure?" — synchronous replication (at a latency cost).
Q10 · Everything is slow, load average is high, but CPU is idle. Storage angle?
What they're really probing

Connecting storage saturation to the OS D-state / load-average lesson.

Model answer

This is very often a saturated disk. Processes blocked on slow I/O sit in uninterruptible sleep (D), which counts toward Linux load average — so load can be huge while the CPU does nothing (from the OS module). Confirm with iostat -x: a device pinned near 100% %util with rising await (I/O latency), and ps -eo stat | grep -c D showing many D-state processes. Root causes: an IOPS-bound workload (many small random I/Os), a failing/degraded disk, a RAID rebuild, or a noisy neighbour. Fixes: faster storage, more caching, batching I/O, or moving the hot workload.
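
The D-state count from that ps pipeline can also be computed in a script, for example inside a monitoring check. The helper name `count_d_state` and the sample output are invented for illustration; the function matches only lines whose state begins with D, which is slightly stricter than `grep -c D`.

```python
def count_d_state(ps_stat_output):
    """Count processes in uninterruptible sleep from `ps -eo stat` output."""
    lines = ps_stat_output.splitlines()
    return sum(1 for line in lines if line.strip().startswith("D"))


# Invented sample of `ps -eo stat` output: a header line, then one state per process.
sample_output = "STAT\nSs\nD\nR+\nD+\nS\nSl\n"
blocked = count_d_state(sample_output)
```

A count that stays high across several samples, together with a device near 100% %util in iostat -x, points at storage rather than CPU.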

Follow-up traps
  • "Why does the disk affect load average?" — D-state (I/O wait) is included in load, unlike pure CPU metrics.
  • "IOPS-bound vs throughput-bound?" — small random ops vs bulk transfer; iostat tells you which.
Q11 · Block vs file vs object storage: when do you use each?
What they're really probing

Whether the three abstractions and their trade-offs are clear.

Model answer

Block storage is a raw array of blocks (a bare disk / cloud volume) with no file concept — you build a filesystem or database on it for maximum control and low-latency random I/O (databases, boot volumes). File storage is a named hierarchy managed by a filesystem, optionally shared over the network (NAS/NFS) — familiar POSIX semantics for applications and shared access, but harder to scale massively. Object storage is a flat key→blob namespace over HTTP with rich metadata and near-infinite, cheap, durable capacity, but no in-place edits and higher latency (S3) — ideal for backups, media, logs, and data lakes. Rule of thumb: block for databases, file for apps/shared access, object for massive cheap durable blobs.
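
The difference in access pattern shows up even in toy interfaces. Both classes below are hypothetical in-memory stand-ins (not the API of any cloud service), and file storage is omitted because the operating system's own file API already illustrates it.

```python
class BlockDevice:
    """Raw array of fixed-size blocks addressed by number; no names, no files."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def write_block(self, n, data):
        if len(data) != self.block_size:
            raise ValueError("must write exactly one block")
        self.blocks[n] = data          # in-place update of a single block

    def read_block(self, n):
        return self.blocks[n]


class ObjectStore:
    """Flat key -> blob namespace; objects are replaced whole, never edited."""

    def __init__(self):
        self.objects = {}

    def put(self, key, blob):
        self.objects[key] = bytes(blob)

    def get(self, key):
        return self.objects[key]


# Changing 4 bytes on a block device rewrites just one block.
dev = BlockDevice(num_blocks=3)
dev.write_block(1, b"DATA")

# Changing 4 bytes in an object means fetching and re-uploading the whole blob.
store = ObjectStore()
store.put("logs/app.log", b"AAAABBBBCCCC")
blob = bytearray(store.get("logs/app.log"))
blob[4:8] = b"DATA"
store.put("logs/app.log", blob)
```

A database issues many small in-place updates like the first case, which is why it sits on block storage rather than on an object store.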

Follow-up traps
  • "Why not object storage for a database?" — no low-latency random in-place writes; wrong access pattern.
  • "Why is object storage so scalable?" — flat namespace, immutable objects, no POSIX consistency guarantees to maintain.