Storage
This assumes you know only that "computers save files," and takes you to where you can reason about durability, design a storage layer, and debug any disk problem out loud. Every idea builds on the one before it. Grounded in Kleppmann's Designing Data-Intensive Applications and the Google SRE book. Read it top to bottom; the labs and Interview Gauntlet will then feel like review.
1 · What storage actually is
Start with the one fact that makes storage a distinct, hard problem: RAM forgets. The memory your programs use (from the OS module) is volatile — cut the power and every byte is gone. Storage is about non-volatile media: disks that keep your data across reboots and power loss. That single requirement — surviving a crash — is the source of nearly every complication in this module.
This immediately creates the central tension. Programs want RAM speed but disk durability, and you can't fully have both. Writing straight to disk on every change is safe but slow; buffering writes in RAM and flushing later is fast but risks losing data if the machine dies before the flush. The entire stack you'll learn — the page cache, fsync, journaling, write-ahead logs, replication — exists to manage exactly this trade-off between speed and durability. Keep it in mind: "is this actually on disk yet, or still in volatile RAM?" is the question behind most data-loss bugs.
One more foundation: storage is organised in fixed-size blocks (historically 512-byte sectors, now commonly 4 KB), not individual bytes. The disk can only read or write a whole block at a time. This granularity, together with the physics of the media covered next, is why the way you access data matters as much as the amount of data you access.
2 · The physical layer: why access patterns matter
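To reason about storage performance you must know what the hardware is actually doing. Two technologies dominate, and their physics are opposite:
- HDD (spinning disk): a mechanical head must seek to the right track and wait for the platter to rotate, roughly 10 ms per random access; adjacent blocks then stream with no further seeks.
- SSD (flash): no moving parts and no seek, so a read takes roughly 100 µs; sequential I/O is still faster thanks to larger transfers and less per-operation overhead.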
The single most important consequence, true on both media (dramatically so on HDDs, still meaningfully on SSDs): sequential access is far faster than random access. Reading a megabyte of adjacent blocks in one sweep beats reading a thousand scattered blocks by orders of magnitude. This one fact explains an enormous amount of database design — it's why databases go to great lengths to turn random writes into sequential ones (the write-ahead log, LSM-trees), and why "just add an index" isn't free.
Recall the latency pyramid from OS: an SSD read is ~100 µs and an HDD seek ~10 ms — roughly 1,000× and 100,000× slower than RAM's ~100 ns. So every trip to disk is precious, which is why the OS aggressively caches disk data in spare RAM (the page cache, Section 4) and why the two metrics that describe a disk are IOPS (I/O operations per second — how many separate reads/writes) and throughput (MB/s — bulk transfer). A workload of many tiny random reads is IOPS-bound; a bulk copy is throughput-bound. Knowing which you're up against is half of storage performance debugging.
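A back-of-envelope sketch of that gap on an HDD, using the ~10 ms seek figure quoted above. The 150 MB/s sequential transfer rate and the 4 KB block size are illustrative assumptions, not measurements of any particular drive.

```python
# Compare 1,000 scattered 4 KB reads with one sequential sweep of the same bytes.
SEEK_S = 0.010            # ~10 ms per random seek (figure from the text)
SEQ_BYTES_PER_S = 150e6   # assumed sequential transfer rate (hypothetical)
BLOCK = 4096              # assumed block size in bytes
N_BLOCKS = 1000

def random_read_seconds(n_blocks: int) -> float:
    """Each scattered block pays one seek plus its own transfer time."""
    return n_blocks * (SEEK_S + BLOCK / SEQ_BYTES_PER_S)

def sequential_read_seconds(n_blocks: int) -> float:
    """One seek to reach the start, then a single streaming transfer."""
    return SEEK_S + n_blocks * BLOCK / SEQ_BYTES_PER_S

rand_s = random_read_seconds(N_BLOCKS)
seq_s = sequential_read_seconds(N_BLOCKS)
print(f"random: {rand_s:.2f} s, sequential: {seq_s:.3f} s, ratio: {rand_s / seq_s:.0f}x")
```

Under these assumptions the seek time dominates the random case almost entirely, which is the IOPS-bound situation described above; the sequential case is limited by transfer rate, which is the throughput-bound one.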
3 · Block, file, and object: three ways to see storage
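Raw blocks are hard to use directly, so storage is exposed through three abstractions. Interviewers ask you to compare them because each maps to different real systems (local disks, NAS, S3):
- Block: a raw array of fixed-size blocks with no file concept (a bare disk or cloud volume).
- File: a named hierarchy managed by a filesystem, optionally shared over the network (NAS/NFS).
- Object: a flat key→blob namespace over HTTP with rich metadata (S3).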
The mental model: block is the foundation, file and object are conveniences built for different goals. A database often wants block storage (it manages its own layout for performance). Applications want file storage for familiarity. Anything needing massive, cheap, durable capacity for whole-object reads/writes (backups, images, data lakes) wants object storage. The rest of this module drills into the file/filesystem layer and then into how databases lay bytes on block storage — because that's where the deep interview questions live.
4 · The filesystem: inodes, names, and links
A filesystem is the bookkeeping that turns a flat array of blocks into named files in directories. Its central data structure is the inode ("index node"), and understanding it dissolves a whole class of confusing behaviours.
Here is the key separation, which surprises almost everyone: a file's name is not part of the file. The inode holds everything about the file — its size, permissions, timestamps, and pointers to the data blocks — but not its name. The name lives in the directory, which is simply a table mapping names to inode numbers (a directory entry, or "dirent"). So a directory is just a list like "report.txt → inode 4021."
You ask to open a path. The kernel resolves it one component at a time, walking down directories: look up home in /, then user in home, then file.txt in user. Each step is a directory lookup.
In the final directory, the entry file.txt maps to an inode number (4021). That's all the directory holds — a name and a number. The name and the file are two separate things joined here.
The kernel reads inode 4021, which holds the file's metadata — size, owner, permissions, timestamps — and, crucially, pointers to the data blocks. The name isn't here. Neither is the data — just where to find it.
Following the inode's pointers, the kernel reads the actual data blocks off the disk (possibly scattered — hence sequential vs random matters). Now you have the file's contents. Path → dirent → inode → blocks: that chain is the entire lookup.
This name/inode split explains links. A hard link is a second directory entry pointing at the same inode: two names, one file, indistinguishable. The inode keeps a link count, and the data is freed only when the count hits zero and no process still has the file open. A symbolic (soft) link is a tiny file that just contains another path, a pointer to a name, which breaks if the target is renamed. And the famous trap: because an open file descriptor (from OS) is another reference to the inode, deleting a file that's still open removes its name but not the inode, so the data lives on, invisible, until the last fd closes. This is why df can report a full disk while du finds nothing large enough to explain it: a deleted-but-open logfile is still consuming blocks. You'll reproduce exactly that in the labs.
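The link-count behaviour described above can be observed from Python's standard library. This is a minimal sketch assuming a POSIX filesystem that supports hard links; the file names are arbitrary and live in a throwaway temp directory.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "file.txt")
alias = os.path.join(workdir, "hard.txt")

with open(original, "w") as f:
    f.write("hello")

os.link(original, alias)                          # second directory entry, same inode
same_inode = os.stat(original).st_ino == os.stat(alias).st_ino
links_after_link = os.stat(original).st_nlink     # two names, one file

os.unlink(original)                               # remove one name; the inode survives
links_after_unlink = os.stat(alias).st_nlink

fd = os.open(alias, os.O_RDONLY)                  # an open fd is another reference
os.unlink(alias)                                  # last name gone
name_exists = os.path.exists(alias)               # no directory entry left
links_while_open = os.fstat(fd).st_nlink          # link count seen through the fd
data_still_readable = os.read(fd, 5)              # contents are still there
os.close(fd)                                      # only now can the blocks be freed
os.rmdir(workdir)
```

After the second unlink the file has no name at all, yet the read through the still-open descriptor succeeds; that is the deleted-but-open state in miniature.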
5 · The page cache, buffering, and fsync
Now the most important — and most misunderstood — topic in storage. When your program write()s to a file, the data does not go straight to disk. It goes into the page cache in RAM (the same reclaimable cache from OS), and the kernel reports success immediately. The actual disk write happens later, in the background. This is write-back caching, and it's what makes I/O feel fast — but it means a successful write() does not guarantee your data survived.
Your program calls write(). It expects the data to be saved. The syscall crosses into the kernel (from OS) carrying your bytes.
The kernel copies the bytes into the page cache in RAM and immediately returns success — the disk hasn't been touched yet. The page is marked dirty (modified, not yet written back). Your program thinks the data is saved. It is not — it's in volatile memory.
Here is the danger window. If the machine crashes or loses power now — after write() returned but before the kernel flushed the dirty page — the data is lost, even though your program was told it succeeded. This is the source of countless "we lost the last few seconds of data" incidents.
To guarantee durability, the program must call fsync(), which forces the dirty pages to physical disk and only returns once they're truly persisted. This is slow (a real disk write) — which is exactly why databases don't fsync on every write, and instead batch writes into a sequential log (next section). Durability is a choice you pay for.
This trade-off has a name and a knob everywhere. Databases give you durability settings; the OS lets you fsync/fdatasync; even disks have their own volatile write caches (which is why databases sometimes disable them or require battery-backed caches). The mental model to carry into every design: "acknowledged" and "durable" are different guarantees, separated by an fsync. A system that acks before fsync is fast but can lose recently-acked data on crash; a system that fsyncs before acking is durable but slower. Knowing where a given database sits on that line is exactly what interviewers probe.
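The acknowledged-versus-durable line can be made concrete with two small helpers. This is a sketch assuming a POSIX system; the function names are illustrative, and the directory fsync reflects the common practice of also persisting the new directory entry when a file is created.

```python
import os
import tempfile

def write_acknowledged(path: str, data: bytes) -> None:
    """Returns once the kernel has the bytes; they may still be only in the page cache."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)    # note: may write fewer bytes; production code would loop
    finally:
        os.close(fd)

def write_durable(path: str, data: bytes) -> None:
    """Returns only after fsync() has flushed the file's dirty pages."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # block until the kernel reports the data flushed
    finally:
        os.close(fd)
    # A newly created name is itself a directory update; fsync the directory too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

path = os.path.join(tempfile.mkdtemp(), "record.txt")
write_durable(path, b"committed\n")
with open(path, "rb") as f:
    stored = f.read()
```

Both helpers leave the same bytes in the file; the only difference is what is guaranteed if the power fails the instant the function returns.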
6 · Crash consistency and journaling
Buffering raises a scarier problem than losing recent data: corruption. Many operations touch multiple blocks that must change together. Renaming a file, or appending data, might update the inode, a directory entry, and a free-space map. If the machine crashes between those block writes, the filesystem is left in an inconsistent, half-updated state — a dangling pointer, a lost block, a directory referencing a freed inode. On disk, there's no "undo."
The fix is journaling (a write-ahead log applied to the filesystem). Before performing a multi-block change, the filesystem first writes a description of the entire change to a dedicated on-disk journal and marks it committed, then applies it to the real locations, then marks the journal entry done. The payoff comes at crash recovery: at mount, the filesystem replays the journal. Any change that was fully journalled but not fully applied is re-applied; any change not fully journalled is discarded. Either way the filesystem returns to a consistent state, never a half-broken one.
This is the same idea that makes databases crash-safe: a write-ahead log (WAL). Note the deliberate trade-off — journalling means writing the data (or at least its metadata) twice, so most filesystems journal only metadata by default (protecting structure, not necessarily your latest data bytes) and let you choose full data journalling if you need it. Recognise the recurring shape: to make a multi-step change atomic and crash-safe, write your intent to a sequential log first, then apply it. You will see it again in the database section, in replication, and in distributed systems generally.
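The log-first, apply-later shape can be sketched as a toy journal. This is an illustration only: real journals use commit blocks and checksums, whereas here a complete newline-terminated JSON line stands in for "fully journalled", and the key names are made up.

```python
import json
import os
import tempfile

def journal_append(journal_path: str, change: dict) -> None:
    """Write the whole intended change as one line, then fsync before applying it."""
    with open(journal_path, "a", encoding="utf-8") as j:
        j.write(json.dumps(change) + "\n")
        j.flush()
        os.fsync(j.fileno())

def replay(journal_path: str) -> dict:
    """Rebuild state from the journal: apply complete entries, discard a torn tail."""
    state: dict = {}
    if not os.path.exists(journal_path):
        return state
    with open(journal_path, encoding="utf-8") as j:
        for line in j:
            if not line.endswith("\n"):
                break                     # crash mid-write: entry never fully journalled
            try:
                change = json.loads(line)
            except json.JSONDecodeError:
                break                     # corrupt entry: keep the consistent prefix
            state.update(change)          # all keys of one entry change together
    return state

path = os.path.join(tempfile.mkdtemp(), "journal.log")
journal_append(path, {"inode_4021.size": 5, "dir.report.txt": 4021})
journal_append(path, {"inode_4021.size": 9})
with open(path, "a", encoding="utf-8") as j:
    j.write('{"inode_4021.size": 99, "dir.tm')   # simulate a crash mid-entry
recovered = replay(path)
```

The torn third entry is ignored on replay, so the recovered state reflects the two complete changes and nothing of the half-written one.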
7 · RAID: surviving disk failure
Everything so far assumed the disk works. But disks fail all the time, often mechanically. RAID (Redundant Array of Independent Disks) combines several physical disks into one logical volume to buy speed, redundancy, or both. Know the common levels and, precisely, what each protects against:
- RAID 0: stripes for speed with no redundancy; losing any disk loses all data.
- RAID 1: mirrors; survives one disk failure at 50% usable capacity.
- RAID 5: stripes with distributed parity; survives one disk failure and costs one disk of capacity, with slow rebuilds.
- RAID 10: mirrors, then stripes; fast and survives disk loss at 50% capacity.
Two lessons that separate the fluent from the memorisers. First, RAID is not a backup. It protects against disk hardware failure, not against rm -rf, corruption, or ransomware (the array faithfully writes those changes to every mirror), nor against a fire that destroys the whole machine. You still need backups (ideally off-site). Second, RAID is the local, single-machine version of an idea that goes much further: redundancy through copies. Scale it across machines and datacentres and it becomes replication (Section 9): the same principle, applied to survive not just a dead disk but a dead server or a dead region.
8 · How databases store data: B-trees vs LSM-trees
Now the deepest storage-interview territory, straight from Kleppmann. A database must store key-value data on disk so it can be looked up quickly and written to quickly. There are two great families of storage engine, and understanding their opposite trade-offs is what senior interviews probe.
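The core tension, again from the physics of Section 2: fast reads want data sorted and in place; fast writes want to avoid random disk seeks. The two engine families resolve it differently:
- B-tree: keeps data sorted on disk and updates in place; reads walk a few tree levels, while writes may seek to modify a block (Postgres, MySQL/InnoDB).
- LSM-tree: append-only; writes go to memory plus a sequential log, are flushed as sorted files, and are compacted in the background (Cassandra, RocksDB).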
Everything ties back to earlier sections. LSM-trees are fast at writes precisely because they turn random writes into sequential ones (Section 2) — appending to a log and merging later, rather than seeking to update in place. Both families rely on a write-ahead log for crash safety (Section 6): the write is appended to the WAL and fsync'd (Section 5) before being acknowledged, so a crash can be recovered by replaying the log. And an index is just a secondary sorted structure that trades write cost and space for read speed — which is why "just add an index" always has a price. The one-sentence takeaway interviewers want: B-trees optimise reads by keeping data sorted in place; LSM-trees optimise writes by appending sequentially and merging later.
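The append-and-merge idea can be sketched as a toy LSM-tree held entirely in memory. This is a simplification under stated assumptions: the class and method names are invented for illustration, "runs" are plain sorted lists rather than on-disk files, and there is no WAL or Bloom filter.

```python
import bisect

class TinyLSM:
    """Toy LSM-tree: an in-memory memtable plus immutable sorted runs (newest first)."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable: dict[str, str] = {}
        self.runs: list[list[tuple[str, str]]] = []   # each run is sorted by key
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value                     # no seek: an in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Emit the memtable as one sorted run (a sequential write on a real disk)."""
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:                          # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self) -> None:
        """Merge all runs into one, keeping only the newest value per key."""
        merged: dict[str, str] = {}
        for run in reversed(self.runs):                # oldest first, newer overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []

db = TinyLSM(memtable_limit=2)
for k, v in [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4")]:
    db.put(k, v)
runs_before = len(db.runs)
db.compact()
runs_after = len(db.runs)
```

Before compaction a lookup may have to check several runs, which is the read cost the text mentions; compaction trades background work for fewer runs to search.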
9 · Replication, consistency, and CAP
One machine, however well-tuned, can die, fill up, or fall behind. To survive that and to scale, real systems keep copies of data on multiple machines — replication. It's RAID's idea (redundant copies) lifted to the network, and it introduces the hardest trade-offs in all of computing. This is the heart of Kleppmann and of distributed-systems interviews.
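The first choice is when the write is acknowledged:
- Synchronous: ack only after replicas confirm the write; no acknowledged data is lost if the primary dies, but writes wait on the network and stall if a replica is down.
- Asynchronous: ack immediately and replicate in the background; fast and tolerant of slow replicas, but a primary crash before propagation loses the last few writes.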
The CAP theorem
Once data is spread across machines connected by a network, an unavoidable law applies. A network partition — machines unable to reach each other — will happen. When it does, a distributed system must choose between two things it cannot both keep: consistency (every read sees the latest write — all copies agree) and availability (every request still gets an answer). That's the CAP theorem: under a Partition, pick Consistency or Availability.
Two nuances that mark a strong answer. CAP is about partition time: when the network is healthy, a system can offer both consistency and availability; the trade-off only bites during a partition. And "consistency" isn't binary: there's a spectrum from strong (reads always see the latest write) to eventual (copies converge given time), and picking the right point on that spectrum for a given feature (a bank balance vs a like count) is exactly the judgement that interviews for senior data-systems roles test.
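The choice during a partition can be illustrated with a deliberately tiny model. The "CP" and "AP" mode strings and the class below are labels invented for this sketch; real systems decide reachability through quorums and timeouts rather than a boolean flag.

```python
class Replica:
    """Toy replica holding one value; `reachable` models whether it can see its peers."""

    def __init__(self, mode: str):
        self.mode = mode          # "CP" or "AP" (labels for this sketch only)
        self.value = "v1"
        self.reachable = True     # False simulates a network partition

    def read(self) -> str:
        if not self.reachable and self.mode == "CP":
            # Cannot confirm this is the latest write, so refuse rather than risk staleness.
            raise RuntimeError("unavailable during partition")
        return self.value         # AP: answer anyway, possibly stale

def read_from_isolated_replica(mode: str) -> str:
    """A write of "v2" lands on the other side of the partition; this replica never sees it."""
    isolated = Replica(mode)
    isolated.reachable = False
    try:
        return isolated.read()
    except RuntimeError as err:
        return f"error: {err}"

ap_result = read_from_isolated_replica("AP")
cp_result = read_from_isolated_replica("CP")
```

The AP replica answers with the old value and the CP replica refuses; with `reachable` left at True both behave identically, which mirrors the point that the trade-off only appears during a partition.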
10 · Storage in production: the failures you'll actually hit
Theory meets the pager here. A handful of storage failure modes recur constantly; recognising them instantly is what the operational interview questions reward.
No space left on device. Find the culprit with df -h (per-filesystem usage) then du -sh * (what's big). Beware: a big deleted-but-open file (Section 4) consumes space that du can't see; restart the holder or truncate via /proc/<pid>/fd.
No space left but df -h shows free space? You've exhausted inodes: too many tiny files (each needs one) drawn from a fixed pool set at format time. Check with df -i. Classic on mail spools and cache dirs full of small files.
D state (from OS). The disk is saturated. iostat -x shows %util near 100 and rising await. You're IOPS- or throughput-bound; the fix is faster storage, caching, or fewer/larger I/Os.
The through-line back to earlier modules: a saturated disk shows up as processes stuck in D state and a load average far above CPU usage (OS, Section 7), because those processes are blocked in uninterruptible I/O. So "load average 200, CPU idle" and "everything is slow" often trace straight to storage. Storage problems wear an OS costume, which is exactly why these two modules sit next to each other.
Underneath every filesystem is a block device — a raw array of fixed-size blocks. lsblk shows the disks, partitions, and mounted filesystems built on top.
lsblk reveals the stack: physical disk → partitions → filesystems → mount points, and any RAID/LVM layers in between. blockdev --getbsz shows the block size. Seeing this hierarchy makes "block storage is the foundation, filesystems are built on it" concrete.
List block devices and their mount points, and see the filesystem types.
$ lsblk -f
You see each disk/partition, its filesystem type (ext4, xfs…), and where it's mounted: the block→filesystem stack from Section 3, live.
Solution:
$ lsblk
$ lsblk -f    # with filesystem types and mountpoints
$ df -hT      # usage per mounted filesystem, with type
A file's name lives in its directory; the inode holds the metadata and block pointers. Name and file are separate things.
stat shows a file's inode number, size, blocks, and link count. ls -i shows inode numbers next to names. Two names with the same inode number are hard links to one file. This separation is what makes links and the deleted-but-open trap possible.
Create a file, view its inode and link count, then make a hard link and watch the count rise.
$ stat file.txt
stat shows Inode: and Links: 1. After ln file.txt hard.txt, the link count becomes 2: two names, one inode. ls -i confirms both names share the inode number.
Solution:
$ echo hi > file.txt
$ stat file.txt
$ ln file.txt hard.txt
$ stat file.txt                # Links: now 2
$ ls -i file.txt hard.txt      # same inode number
Because an open fd is a reference to the inode, deleting a file that's still open removes its name but keeps the data alive — invisible to ls, still consuming disk. A full disk with "no big files" is often this.
The inode's link count drops to 0 when unlinked, but the data isn't freed until the open-fd count also hits 0. lsof +L1 lists deleted-but-open files. The fix: restart or signal the process holding the fd, or truncate it via /proc/<pid>/fd. This is the storage face of the fd chain from the OS module.
Open a file, delete it while it's held open, and prove the space is still used until the holder lets go.
$ lsof +L1 2>/dev/null | head
The deleted-but-open file appears in lsof +L1 with a link count of 0 while its process still holds it: space consumed by a file ls can't see. Killing/restarting the holder frees it.
Solution:
When a disk is full but du finds nothing, look for deleted-but-open files with lsof +L1 before anything else.
$ exec 4>/tmp/ghost; echo data >&4
$ rm /tmp/ghost                            # name gone, fd 4 still open
$ ls -l /tmp/ghost                         # gone from the directory
$ lsof +L1 2>/dev/null | grep ghost        # still open, 0 links, using space
$ exec 4>&-                                # close it -> space freed
A filesystem can run out of space (bytes) or inodes (file count) independently. Both report "no space left on device", but the fix is completely different.
df -h shows byte usage; df -i shows inode usage. A directory full of millions of tiny files can exhaust inodes (each file needs one, from a fixed pool set at format time) while bytes are nearly empty — bewildering until you check df -i.
Check both the space and inode usage of your filesystems.
$ df -i
df -h shows % of bytes used; df -i shows % of inodes used. If -i is at 100% while -h isn't, you're out of inodes, not space: delete small files rather than buying a bigger disk.
Solution:
$ df -h    # space (bytes)
$ df -i    # inodes (file count)
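The same two numbers can be read programmatically. This is a rough sketch using os.statvfs, which is available on Unix-like systems; the helper name is illustrative, and the percentages may differ slightly from df because df accounts for reserved blocks.

```python
import os

def usage_percent(path: str = "/") -> tuple[float, float]:
    """Return (percent of blocks used, percent of inodes used) for the filesystem at path."""
    s = os.statvfs(path)
    # Some filesystems report 0 total inodes; guard against dividing by zero.
    bytes_used = 100.0 * (s.f_blocks - s.f_bfree) / s.f_blocks if s.f_blocks else 0.0
    inodes_used = 100.0 * (s.f_files - s.f_ffree) / s.f_files if s.f_files else 0.0
    return bytes_used, inodes_used

space_pct, inode_pct = usage_percent("/")
print(f"space used: {space_pct:.1f}%  inodes used: {inode_pct:.1f}%")
```

A monitoring check built on this would alert on either figure independently, since either one reaching 100% produces the same "no space left on device" error.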
Writes land in the page cache (RAM) and return instantly; only a flush/fsync forces them to disk. The gap between the two is the durability window.
dd with and without conv=fsync exposes the difference: buffered writes are fast because they hit RAM; forcing fsync makes them slow because they hit the physical disk. free -h shows the page cache under buff/cache — memory that's reclaimable, not lost.
Write with and without forcing a sync and compare the reported speed; observe the page cache.
$ dd if=/dev/zero of=/tmp/t bs=1M count=256 conv=fsync
The conv=fsync run reports a lower MB/s than a buffered run (drop conv=fsync to compare) because it waits for real disk, not just RAM. That slowdown is the price of durability.
Solution:
$ dd if=/dev/zero of=/tmp/t bs=1M count=256                # buffered: fast (hits page cache)
$ dd if=/dev/zero of=/tmp/t bs=1M count=256 conv=fsync     # forced to disk: slower
$ free -h                                                  # page cache shown under buff/cache
$ rm -f /tmp/t
A saturated disk makes processes block in uninterruptible sleep (D), driving load average up while the CPU sits idle — the OS module's lesson, caused by storage.
iostat -x shows per-device %util (how busy the disk is) and await (average I/O latency). When %util pins near 100 and await climbs, the disk is the bottleneck and I/O-bound processes pile into D — visible in ps, and reflected in a load average far above CPU usage.
Watch disk utilisation and latency while generating I/O, and correlate with process state.
$ iostat -x 1 3
Under load, the busy device shows high %util and rising await. ps -eo stat,comm | grep ^D shows processes stuck in D, the storage cause of "high load, idle CPU".
Solution:
$ sudo apt install -y sysstat      # for iostat, if needed
$ iostat -x 1 3                    # in another shell, generate I/O, then:
$ ps -eo stat,comm | grep "^D"     # processes blocked on I/O
You now understand persistence end to end: why RAM's volatility makes durability hard, why sequential beats random on real media, the block/file/object abstractions, inodes and links and the deleted-but-open trap, the page cache and the acknowledged-vs-durable gap that fsync closes, journaling and write-ahead logs for crash safety, RAID and replication for surviving failure, B-trees vs LSM-trees for storing data, and CAP for the limits of distributed data. This is the foundation under every database, container volume, and stateful system you'll operate — and it connects straight back to the OS: a slow disk is a D-state, and a lost write is a page that never got fsync'd.
The questions actually asked for SRE, systems, and data-engineering roles — conceptual, debugging, and design prompts, many straight out of Designing Data-Intensive Applications. Each expands to show what the interviewer is really probing for, a model answer, and the follow-up traps. Answer all of these out loud and you have mastered this module.
Q1 · Does a successful write() mean my data is safe on disk?
The single most important storage insight — the difference between acknowledged and durable.
No. A successful write() only means the data reached the kernel's page cache in RAM; the kernel returns success immediately and flushes to disk later (write-back caching). If the machine loses power between the write() and the flush, the data is lost despite the success return. To guarantee durability you must call fsync(), which forces the dirty pages to physical disk before returning — at a real latency cost. "Acknowledged" and "durable" are separate guarantees, and the gap between them is the danger window.
- "Why don't databases fsync every write?" — too slow; they batch into a sequential WAL and fsync that.
- "Disk write caches?" — disks have their own volatile caches too, so serious setups use battery-backed caches or disable them.
Q2 · B-tree vs LSM-tree: how do they differ and when would you pick each?
Deep storage-engine knowledge from DDIA — the read/write trade-off and real systems.
A B-tree keeps data sorted on disk and updates in place: lookups walk a few tree levels (fast, predictable reads), while writes may seek to modify a block. Great for read-heavy and range queries — Postgres, MySQL/InnoDB. An LSM-tree is append-only: writes go to memory + a sequential log, get flushed as sorted files, and are compacted in the background; writes are sequential and fast, but a read may check several files (mitigated by Bloom filters). Great for write-heavy workloads — Cassandra, RocksDB. Pick B-tree for read-heavy/relational; LSM for write-heavy/high-ingest. It all follows from "sequential writes are far cheaper than random ones."
- "Why are LSM writes fast?" — they turn random writes into sequential appends.
- "LSM downside?" — read amplification and background compaction cost/latency spikes.
- "What do both need for crash safety?" — a write-ahead log.
Q3 · Explain the CAP theorem. Give a real CP and AP system.
Distributed-systems maturity — that the choice is C vs A *during a partition*.
In a system spread across machines, network partitions (nodes can't reach each other) are inevitable. CAP says that during a partition you must choose between Consistency (every read sees the latest write; all copies agree) and Availability (every request gets an answer) — you can't have both. A CP system refuses requests on the minority side to avoid stale reads (etcd/ZooKeeper — which is why Kubernetes' store is CP). An AP system keeps answering and reconciles later, accepting temporary disagreement / eventual consistency (Cassandra, DynamoDB, DNS). When the network is healthy you can have both; the trade-off only bites during a partition.
- "Is there a CA system?" — not really in practice; the network will partition, so you must plan for P.
- "Strong vs eventual consistency?" — a spectrum; choose per feature (bank balance vs like count).
Q4 · A disk is full but du can't find the space. What's happening?
The deleted-but-open-file trap — connects filesystem internals to the OS fd model.
Almost certainly a deleted-but-still-open file. Because an open file descriptor is a reference to the inode, deleting a file that a process still has open removes its name (so ls/du can't see it) but keeps the inode and its data blocks alive until the last fd closes — so the space stays consumed. Classic with a rotated log a daemon still holds open. Find it with lsof +L1 (lists files with 0 links still open); fix by restarting/signalling the holding process, or truncating via /proc/<pid>/fd/<n>. Also rule out the other "full" — inode exhaustion (df -i), where millions of tiny files fill the inode pool while bytes are free.
- "df says full, du says not: space or inodes?" Check df -i.
- "Why doesn't deleting the log free space?" The daemon still holds the fd; you must make it release it.
Q5 · Why is sequential access so much faster than random, and why does it matter?
The physics that underpins all storage design.
On an HDD, a random read forces the head to physically seek to a new track and wait for rotation (~10 ms each), while sequential reads stream adjacent blocks with no seeks — orders of magnitude faster. Even on SSDs (no seek), sequential I/O is still faster due to larger transfers and less overhead. This is the reason storage engines work so hard to turn random writes into sequential ones — write-ahead logs, LSM-tree appends, log-structured filesystems — and why "add an index" (which scatters writes) has a real cost. Matching your access pattern to the media is the root of storage performance.
- "How does an LSM exploit this?" — appends sequentially, compacts in the background.
- "IOPS vs throughput?" — many small random ops vs bulk MB/s; know which bounds your workload.
Q6 · How does a filesystem stay consistent across a crash?
Journaling / write-ahead logging — the atomicity pattern.
A single logical operation (rename, append) touches multiple blocks that must change together; a crash in the middle would corrupt the filesystem. The fix is journaling: before applying a multi-block change, the filesystem writes the whole intended change to a sequential on-disk journal and marks it committed, then applies it in place. On recovery it replays the journal — fully-journalled changes are re-applied, partial ones discarded — so the filesystem is always consistent, never half-done. Most filesystems journal metadata by default (protecting structure) and can optionally journal data too. It's the same write-ahead idea databases use for their WAL: log your intent first, then act.
- "Does journalling protect my data bytes?" — by default often only metadata; enable data journalling for full protection.
- "Cost?" — writes happen twice; a deliberate durability/perf trade-off.
Q7 · Synchronous vs asynchronous replication: what are the trade-offs?
Whether the durability-vs-latency trade-off is understood across machines.
Synchronous replication acks a write only after replicas confirm it — so if the primary dies, no acknowledged data is lost (strong durability) — but it's slower (waits on the network) and stalls if a replica is down. Asynchronous acks immediately and replicates in the background — fast and tolerant of slow replicas — but a primary crash before propagation loses the last few writes. It's the exact "acknowledged ≠ durable" trade-off from the page-cache section, lifted across machines. Many systems use a hybrid (e.g. ack after one synchronous replica, the rest async) to balance the two.
- "Default choice?" — usually async for latency, accepting a small loss window.
- "Failover risk with async?" — promoting a lagging replica loses un-replicated writes (and can cause split-brain).
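The loss window on failover can be shown with a toy primary and a single replica. Everything here is a hypothetical in-memory model: the class, its fields, and the `ship_pending` step stand in for real network replication.

```python
class Primary:
    """Toy primary with one replica; `sync` decides when a write is acknowledged."""

    def __init__(self, sync: bool):
        self.sync = sync
        self.log: list[str] = []
        self.replica_log: list[str] = []
        self.pending: list[str] = []          # acked locally, not yet shipped (async only)

    def write(self, record: str) -> str:
        self.log.append(record)
        if self.sync:
            self.replica_log.append(record)   # wait for the replica before acking
        else:
            self.pending.append(record)       # ack now, ship later
        return "ack"

    def ship_pending(self) -> None:
        """Background replication for the async case."""
        self.replica_log.extend(self.pending)
        self.pending.clear()

def acked_writes_lost_on_failover(primary: Primary) -> list[str]:
    """If the primary dies now and the replica is promoted, which acked writes vanish?"""
    return [r for r in primary.log if r not in primary.replica_log]

sync_primary = Primary(sync=True)
sync_primary.write("a")
sync_primary.write("b")

async_primary = Primary(sync=False)
async_primary.write("a")
async_primary.ship_pending()
async_primary.write("b")                      # acked, but not shipped before the "crash"

lost_sync = acked_writes_lost_on_failover(sync_primary)
lost_async = acked_writes_lost_on_failover(async_primary)
```

In the synchronous case nothing acknowledged is missing from the replica; in the asynchronous case the last write was acknowledged to the client but never left the primary.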
Q8 · Is RAID a backup? Explain RAID 0/1/5/10.
A deliberate trap plus knowledge of what each level protects against.
RAID is not a backup. It protects against disk hardware failure, not against deletion, corruption, or ransomware (the array faithfully writes those changes to every mirror), nor against a disaster that destroys the whole machine. You still need real, ideally off-site, backups. The levels: RAID 0 stripes for speed with no redundancy (any disk lost = all data lost). RAID 1 mirrors (survives one disk, 50% capacity). RAID 5 stripes with distributed parity (survives one disk, loses one disk of capacity, slow rebuilds). RAID 10 mirrors then stripes (fast + survives disk loss, 50% capacity) and is the usual database choice.
- "RAID 5 rebuild risk?" — rebuilds stress remaining disks and are slow; a second failure during rebuild loses everything.
- "What does RAID not protect against?" — human error, corruption, fire — hence backups.
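How parity lets RAID 5 survive one lost disk can be shown in a few lines. This sketch assumes XOR parity over a single stripe, which is the scheme commonly described for RAID 5; the block contents and the three-data-disk layout are made up for illustration.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"     # one stripe across three data disks
parity = xor_blocks(d0, d1, d2)            # stored on a fourth disk

# Disk 1 dies: XOR the surviving blocks with the parity to rebuild its block.
rebuilt_d1 = xor_blocks(d0, d2, parity)
```

The rebuild has to read every surviving disk in full, which is why rebuilds are slow and why a second failure during one is fatal.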
Q9 · Design a durable, high-throughput write path for a database. What do you use?
Synthesis — WAL, sequential writes, fsync, batching, replication, all together.
Combine the whole module. Accept writes into an in-memory structure, and for durability append each to a write-ahead log on disk — a sequential append (fast, from Section 2) that you fsync before acknowledging (durable, from Section 5). To keep throughput high, batch many writes into one fsync (group commit) rather than fsyncing each. Periodically flush the in-memory data to sorted on-disk files and compact them (LSM style) so reads stay fast. For crash recovery, replay the WAL. For surviving machine failure, replicate the WAL to other nodes (sync for zero-loss, async for speed). This is exactly how real databases (Postgres WAL, RocksDB, Kafka) are built.
- "Why a WAL instead of writing data in place?" — sequential + atomic recovery; in-place random writes are slow and crash-unsafe.
- "How to raise throughput without losing durability?" — group commit (batch fsyncs).
- "Zero data loss on machine failure?" — synchronous replication (at a latency cost).
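Group commit can be sketched as a log that makes a whole batch durable with a single fsync. This is a simplified, single-threaded illustration: the class name and the fixed batch size are assumptions, and real engines typically batch by time or by concurrent waiters instead.

```python
import os
import tempfile

class GroupCommitLog:
    """Toy WAL that buffers records and makes a whole batch durable with one fsync."""

    def __init__(self, path: str, batch_size: int = 3):
        self.file = open(path, "ab")
        self.batch_size = batch_size
        self.buffered = 0
        self.fsync_calls = 0

    def append(self, record: bytes) -> bool:
        """Append a record; returns True once it (and its batch) has been fsync'd."""
        self.file.write(record + b"\n")
        self.buffered += 1
        if self.buffered >= self.batch_size:
            self.commit()
            return True
        return False                          # caller must not acknowledge this write yet

    def commit(self) -> None:
        if self.buffered:
            self.file.flush()                 # Python buffer -> page cache
            os.fsync(self.file.fileno())      # page cache -> disk, once per batch
            self.fsync_calls += 1
            self.buffered = 0

    def close(self) -> None:
        self.commit()
        self.file.close()

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = GroupCommitLog(log_path, batch_size=3)
acks = [wal.append(f"write-{i}".encode()) for i in range(7)]
wal.close()
```

Seven writes cost three fsyncs instead of seven here; the price is that a write is only safe to acknowledge once its batch has been committed.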
Q10 · Everything is slow, load average is high, but CPU is idle. Storage angle?
Connecting storage saturation to the OS D-state / load-average lesson.
This is very often a saturated disk. Processes blocked on slow I/O sit in uninterruptible sleep (D), which counts toward Linux load average — so load can be huge while the CPU does nothing (from the OS module). Confirm with iostat -x: a device pinned near 100% %util with rising await (I/O latency), and ps -eo stat | grep -c D showing many D-state processes. Root causes: an IOPS-bound workload (many small random I/Os), a failing/degraded disk, a RAID rebuild, or a noisy neighbour. Fixes: faster storage, more caching, batching I/O, or moving the hot workload.
- "Why does the disk affect load average?" — D-state (I/O wait) is included in load, unlike pure CPU metrics.
- "IOPS-bound vs throughput-bound?" Small random ops vs bulk transfer; iostat tells you which.
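Counting D-state processes can also be done by parsing the per-process stat lines that Linux exposes under /proc/<pid>/stat. The sketch below works on sample strings rather than the live /proc tree, and the sample lines are fabricated for illustration.

```python
def state_from_proc_stat(stat_line: str) -> str:
    """Extract the one-letter process state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain spaces
    or ')', so the state is the first field after the LAST ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]

def count_d_state(stat_lines: list[str]) -> int:
    """Number of processes in uninterruptible sleep (state D)."""
    return sum(1 for line in stat_lines if state_from_proc_stat(line) == "D")

# Fabricated sample lines, truncated to the first few fields.
sample_lines = [
    "1234 (dd) D 1 1234 1234 0 -1 4194304",
    "42 (my (odd) proc) S 1 42 42 0 -1 4194304",
    "77 (kworker/u8:1) D 2 0 0 0 -1 69238880",
]
blocked = count_d_state(sample_lines)
```

A count that stays high across several samples, alongside high %util in iostat, points at storage rather than CPU as the cause of the load.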
Q11 · Block vs file vs object storage: when do you use each?
Whether the three abstractions and their trade-offs are clear.
Block storage is a raw array of blocks (a bare disk / cloud volume) with no file concept — you build a filesystem or database on it for maximum control and low-latency random I/O (databases, boot volumes). File storage is a named hierarchy managed by a filesystem, optionally shared over the network (NAS/NFS) — familiar POSIX semantics for applications and shared access, but harder to scale massively. Object storage is a flat key→blob namespace over HTTP with rich metadata and near-infinite, cheap, durable capacity, but no in-place edits and higher latency (S3) — ideal for backups, media, logs, and data lakes. Rule of thumb: block for databases, file for apps/shared access, object for massive cheap durable blobs.
- "Why not object storage for a database?" — no low-latency random in-place writes; wrong access pattern.
- "Why is object storage so scalable?" — flat namespace, immutable objects, no POSIX consistency guarantees to maintain.