Implementation Keystone, ~180 min. Requires: Operating Systems, Networking, Storage

Linux

This is the keystone. A container is not a virtualization primitive — it is an ordinary process the kernel shows a private view of the world to. Two mechanisms do all the work: namespaces (what a process can see) and cgroups (what it can use). You will build both by hand, then assemble container networking from veth pairs, a Linux bridge and netfilter — the exact machinery Docker’s docker0 and Kubernetes’ CNI automate. By the end you can reason about a pod’s packet path from the kernel up, and debug it when it breaks.

Before you start. You need a real Linux kernel with sudo. On native Ubuntu you are set. On macOS or Windows, do not use Docker Desktop or WSL2 for these labs — the kernel networking pieces break there. Spin up a real VM: multipass launch --name lab then multipass shell lab. A few labs use conntrack (sudo apt install -y conntrack).

What a network namespace actually is

When you run ip netns add net1, the kernel allocates a brand-new network namespace — an isolated copy of the entire network stack. That's not a metaphor: the new namespace gets its own interfaces, its own routing tables, its own ARP/neighbour tables, its own netfilter (iptables) rule set, its own conntrack table, and its own /proc/net and /sys/class/net. Two namespaces on one host are as isolated, network-wise, as two physical machines.

A namespace is a kernel object that lives only as long as something references it. A running process references its namespaces through its nsproxy; when the last process exits, the namespace is destroyed. ip netns keeps one alive without a process by bind-mounting its handle at /var/run/netns/<name> — which is why you can create it, walk away, and it's still there. nsenter and ip netns exec simply run a command with its namespace pointers swapped to that handle.
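One way to see the handle described above is to read it straight from /proc. This is a minimal sketch, assuming a Linux host with /proc mounted; `ns_id` is a hypothetical helper name, not a standard tool.

```shell
# Print the namespace handle a process holds, in the form "net:[<inode>]".
# Two processes share a namespace exactly when these strings are equal.
ns_id() {
    # $1: a PID; $2: namespace type (net, pid, mnt, uts, ipc, ...)
    readlink "/proc/$1/ns/$2"
}

ns_id $$ net    # network namespace of the current shell
ns_id $$ pid    # PID namespace of the current shell
```

Running `sudo ip netns exec net1 readlink /proc/self/ns/net` should print a different inode from the host shell's, which is a quick way to confirm a command really ran inside the namespace.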

This explains the two things beginners trip over. First, a fresh namespace has only a loopback interface, and it starts DOWN — the kernel gives you the bare minimum, so even 127.0.0.1 fails until you bring lo up. Second, isolation is total: the namespace cannot see the host's eth0 or its routes at all. To give it connectivity you must explicitly build a link into it — which is exactly what a veth pair is for.

veth pairs: the virtual cable

A veth pair is the kernel's virtual patch cable. You never create a single veth; they come in twos and behave like the two ends of one Ethernet cable: every frame transmitted on one end appears on the receive path of the other. There is no socket and no user-space copy — the kernel hands the packet straight from one device's TX to the other's RX in a softirq.

  host netns                  net1 netns
 ┌──────────────┐           ┌──────────────┐
 │  veth1-br    │ ◄═══════► │  veth1       │
 │  (on bridge) │   cable   │  10.10.0.2   │
 └──────────────┘           └──────────────┘
       TX on one end  ==  RX on the other end

Move one end into a namespace with ip link set veth1 netns net1 and you've joined two otherwise-isolated stacks: a packet the container sends on veth1 pops out on veth1-br in the host. Both ends must be UP or nothing flows — the single most common fault is bringing up the container end and forgetting the host end, leaving the cable dangling. Every Docker container and every Kubernetes pod gets exactly this: one veth end inside the pod namespace, the other in the host, plugged into a bridge or handed to a routing daemon.
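When it is unclear which host-side veth belongs to which namespace, the interface indexes in sysfs can pair them up. The sketch below assumes sysfs is mounted at /sys; `list_links` is a hypothetical helper name.

```shell
# List every interface with its own index and its link index.
# For a veth, iflink holds the ifindex of the peer at the other end of the cable;
# for an ordinary device such as lo the two numbers are equal.
list_links() {
    for d in /sys/class/net/*; do
        [ -r "$d/ifindex" ] || continue    # skip non-device entries
        echo "${d##*/} ifindex=$(cat "$d/ifindex") iflink=$(cat "$d/iflink")"
    done
}

list_links
```

Reading `iflink` for veth1 inside the namespace (`sudo ip netns exec net1 cat /sys/class/net/veth1/iflink`) and matching it against the `ifindex` column on the host should identify veth1-br as the other end.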

The Linux bridge is a real software switch

Wiring two namespaces together is useful; to let many of them talk you need a switch. The Linux bridge is an L2 switch built into the kernel, and it behaves exactly like the physical switch under your desk. Enslave a port with ip link set veth1-br master br0 and the bridge takes over that interface's receive path.

The bridge keeps a forwarding database (the FDB, visible with bridge fdb show): a table of MAC address → port. On each frame it learns the source MAC's port, then decides where to send: known destination MAC → forward to just that port; unknown or broadcast → flood to every port except the one the frame arrived on. That learn-and-forward loop is the entire job of a switch, done in software over your veth ports.

Give the bridge an IP (ip addr add 10.10.0.1/24 dev br0) and it becomes an L3 gateway too: the host now has a foot in the container subnet, which is why each namespace's default route points at the bridge IP. The bridge is simultaneously the switch the containers share and the router that gets them off-subnet. This object, named docker0, is exactly what Docker creates on install; Kubernetes CNI plugins build the same thing per node (often cni0).

How the kernel makes the forwarding decision

Every time the kernel sends a packet it answers one question: which interface and next-hop reach this destination? It consults the routing table of the current namespace and applies longest-prefix match — the most specific route containing the destination wins; if nothing else matches, the default route (0.0.0.0/0) catches it.

You never have to guess what it will pick. ip route get 10.10.0.3 asks the kernel to run the real forwarding decision and print the result — the most useful routing command there is, because it shows reality instead of your mental model. A gateway must be on-link (directly reachable on an attached subnet); that's why default via 10.10.0.1 only works once veth1 holds an address in 10.10.0.0/24. Otherwise the kernel has no interface on which the gateway is reachable and you get the classic Network is unreachable.
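To make longest-prefix match concrete, here is a toy model of the lookup in plain shell arithmetic. It is only an illustration of the rule, not how the kernel implements it; the routing table (including `eth0` and `tun0`) and the helper names are made up for the example.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip2int() {
    oldifs=$IFS; IFS=.
    set -- $1
    IFS=$oldifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 falls inside prefix $2 (for example 10.10.0.0/24).
in_prefix() {
    net=${2%/*}; len=${2#*/}
    if [ "$len" -eq 0 ]; then return 0; fi          # 0.0.0.0/0 matches everything
    mask=$(( (0xFFFFFFFF << (32 - $len)) & 0xFFFFFFFF ))
    [ $(( $(ip2int "$1") & $mask )) -eq $(( $(ip2int "$net") & $mask )) ]
}

# Read "prefix dev" lines on stdin and print the most specific match for $1.
lookup() {
    best_len=-1; best=""
    while read -r prefix dev; do
        if in_prefix "$1" "$prefix" && [ "${prefix#*/}" -gt "$best_len" ]; then
            best_len=${prefix#*/}; best="$prefix dev $dev"
        fi
    done
    echo "$best"
}

# Hypothetical routing table, for illustration only.
routes='0.0.0.0/0 eth0
10.10.0.0/24 br0
10.0.0.0/8 tun0'

echo "$routes" | lookup 10.10.0.3    # 10.10.0.0/24 dev br0 (all three match, /24 is longest)
echo "$routes" | lookup 10.99.1.1    # 10.0.0.0/8 dev tun0
echo "$routes" | lookup 8.8.8.8      # 0.0.0.0/0 dev eth0 (only the default matches)
```

On a real host, `ip route get` remains the authority; the model only shows why the /24 wins over the /8 and the default.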

netfilter: the five hooks and how NAT really works

So far packets only move at L2/L3. netfilter is where the kernel lets you inspect and rewrite them — the machinery behind port publishing, container internet access, and Kubernetes Services. It exposes five hook points along a packet's path; iptables/nftables rules attach there, grouped into tables (raw, mangle, nat, filter).

                     ┌──────────┐
 in ─► PREROUTING ─► │ routing  │ ─► INPUT ─► local process
        (DNAT)       │ decision │
                     └────┬─────┘
                          ▼
                       FORWARD              local process
                          │                     │
 out ◄─ POSTROUTING ◄─────┴─────── OUTPUT ◄─────┘
         (SNAT / MASQUERADE)

DNAT runs at PREROUTING, rewriting the destination before routing — that's what docker run -p 8080:80 installs: a rule turning host:8080 into container-ip:80. SNAT/MASQUERADE runs at POSTROUTING, rewriting the source on the way out — that's how a private 10.10.0.0/24 container reaches the internet: its source is masqueraded to the host's real address so replies have a way back.

Critically, NAT is stateful. The first packet of a flow is translated and the kernel records the mapping in the conntrack table; every later packet, replies included, is translated automatically from that entry, which is why you write a rule for only one direction. Watch it live with conntrack -L. This also explains a classic outage: at very high connection rates the conntrack table fills, the kernel logs nf_conntrack: table full, dropping packet, and new connections are dropped with no error returned to the application until entries expire. And none of this forwarding happens unless net.ipv4.ip_forward=1: without it the kernel refuses to route between interfaces and your namespaces stay islands.
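A rough way to watch for the table-full condition is to compare the live entry count with the table's capacity. This sketch assumes the nf_conntrack module is loaded, so that its sysctl files exist; `conntrack_pct` is a hypothetical helper name.

```shell
# Integer percentage of the conntrack table currently in use.
conntrack_pct() {
    # $1: current entries, $2: table capacity
    echo $(( $1 * 100 / $2 ))
}

dir=/proc/sys/net/netfilter
if [ -r "$dir/nf_conntrack_count" ] && [ -r "$dir/nf_conntrack_max" ] \
   && [ "$(cat "$dir/nf_conntrack_max")" -gt 0 ]; then
    count=$(cat "$dir/nf_conntrack_count")
    max=$(cat "$dir/nf_conntrack_max")
    echo "conntrack: $count / $max entries ($(conntrack_pct "$count" "$max")%)"
else
    echo "nf_conntrack sysctls not available here; nothing to measure"
fi
```

A value that stays close to 100% under load is a hint that new flows are about to be dropped.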

A debugging method that always works

When container networking breaks, don't guess — bisect the path and find the last hop that works. Every layer has a command answering a yes/no question; walk them in order until one says no.

1. ip addr / ip link       → is the interface UP and does it have an IP?
2. ip route get DST        → what route will the kernel actually use?
3. ping GATEWAY            → is the next hop reachable at L3?
4. bridge link / bridge fdb → is the veth really enslaved to the bridge?
5. iptables -t nat -L -nv  → are NAT rules present and counting hits?
6. conntrack -L            → is the flow being tracked and translated?
7. tcpdump -i DEV -n       → where in the path do packets actually stop?

The discipline matters more than the commands. Nine of ten container-networking incidents are one of: an interface that's DOWN, a veth never enslaved to the bridge, a missing or wrong default route, ip_forward off, or a NAT rule that isn't matching. tcpdump is the tiebreaker: run it on each device along the path and you can see exactly where a packet vanishes, turning "the network is broken" into "the packet reaches veth1-br but never shows up on br0, so the host end was never enslaved to the bridge." Producing that sentence in the first five minutes is what marks a senior engineer, and it comes straight from walking this ladder.
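The first six rungs of the ladder can be collected into one function. This is a sketch rather than a finished tool: the `run` wrapper and `ladder` are hypothetical names, and by default the script only prints each command (DRY_RUN=1) so it can be read safely before anything executes.

```shell
# DRY_RUN=1 (the default here) prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# ladder NS DEV GATEWAY DST: walk the checks in order for one namespace.
ladder() {
    ns=$1 dev=$2 gw=$3 dst=$4
    run sudo ip netns exec "$ns" ip -br addr show dev "$dev"   # 1. UP, has an IP?
    run sudo ip netns exec "$ns" ip route get "$dst"           # 2. route the kernel picks
    run sudo ip netns exec "$ns" ping -c1 -W1 "$gw"            # 3. next hop reachable?
    run bridge link show                                       # 4. veth enslaved?
    run sudo iptables -t nat -L -nv                            # 5. NAT rules counting hits?
    run sudo conntrack -L                                      # 6. flow tracked?
}

ladder net1 veth1 10.10.0.1 8.8.8.8
```

tcpdump is left out on purpose because it blocks until interrupted; it is easier to run by hand on the device the earlier rungs point at.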

Now build it by hand

Q1. Build a container with your bare hands (core)

A container is a normal process wearing two costumes: namespaces (what it can see) and cgroups (what it can use). unshare puts on the namespace costume.

Under the hood

In the kernel every process points at a struct nsproxy holding one pointer per namespace type (pid, net, mnt, uts, ipc, cgroup, and the user namespace held separately). A namespace lives as long as something references it — a running process, or an open file in nsfs under /proc/<pid>/ns/. That is why a named netns can outlive every process: ip netns add bind-mounts the nsfs file into /var/run/netns/ to pin it. Unsharing the PID namespace only affects future children, so --fork spawns a child that becomes PID 1 of the new namespace.

Task

Enter a new PID + mount namespace so the process becomes PID 1 and can only see itself.

Verify it yourself

$ echo $$   # inside the new shell

Inside, echo $$ prints 1 and ps aux shows only your own processes. You are PID 1 of a private process world — the core of a container.

Solution

$ sudo unshare --pid --fork --mount-proc bash
# now inside:
$ echo $$
$ ps aux

Commonly taught wrong

unshare --pid alone does not move your current shell into the new PID namespace — the shell keeps its old PID and only its children are born into the new namespace. That is why --fork is required, not optional, and why --mount-proc is needed so ps reads the new namespace’s /proc instead of the host’s.

Q2. Cap its memory with a cgroup (meet the OOM killer) (core)

Namespaces isolate; cgroups limit. cgroup v2 is a single unified tree under /sys/fs/cgroup. Set a memory ceiling and the kernel enforces it by killing the offender.

Under the hood

In cgroup v2, controllers (memory, cpu, io) are switched on for a subtree by writing to cgroup.subtree_control. memory.max is a hard cap: when usage would exceed it and reclaim fails, the kernel’s cgroup-aware OOM killer picks a victim inside that cgroup — not a random host process. memory.events counts oom and oom_kill. Note the v2 “no internal processes” rule: a cgroup that has child cgroups cannot also hold processes directly.

Task

Create a cgroup capped at 50 MB, put a shell in it, and try to allocate 200 MB.

Verify it yourself

$ cat /sys/fs/cgroup/demo/memory.events

The allocation is Killed instantly and oom_kill increments — the kill was scoped to your cgroup, not the host.

Solution

In production, systemd-run --scope -p MemoryMax=50M yourcmd is the correct one-liner form of all this.

$ echo +memory | sudo tee /sys/fs/cgroup/cgroup.subtree_control
$ sudo mkdir -p /sys/fs/cgroup/demo
$ echo 50M | sudo tee /sys/fs/cgroup/demo/memory.max
$ echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs
$ python3 -c 'import time; x = bytearray(200*1024*1024); time.sleep(60)'
$ cat /sys/fs/cgroup/demo/memory.events

Commonly taught wrong

systemd owns /sys/fs/cgroup and manages the hierarchy. Hand-writing raw cgroup files works for learning but can fight systemd on a real host — which is why the production answer is a transient systemd-run scope, not mkdir under the root cgroup.
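Two details of the cgroup files are easy to check from a script: the K/M/G suffixes accepted by memory.max are powers of 1024, and memory.events is a list of "name value" lines. The sketch below assumes the demo cgroup from this lab; `cg_event` is a hypothetical helper name.

```shell
# 50M written to memory.max reads back as this many bytes.
echo $(( 50 * 1024 * 1024 ))    # 52428800

# Print one counter (default: oom_kill) from a cgroup's memory.events file.
cg_event() {
    # $1: path to memory.events, $2: counter name
    awk -v key="${2:-oom_kill}" '$1 == key { print $2 }' "$1"
}

events=/sys/fs/cgroup/demo/memory.events    # the demo cgroup created in this lab
if [ -r "$events" ]; then
    echo "oom_kill=$(cg_event "$events")"
else
    echo "$events not found; create the demo cgroup first"
fi
```

Comparing the counter before and after the allocation shows whether the kill really came from this cgroup's limit.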

Q3. A network namespace is a whole network stack (core)

A container’s network is a network namespace — and it starts almost empty.

Under the hood

A network namespace is a complete, independent copy of the network stack: its own interfaces, routing tables, ARP/neighbour tables, conntrack table, /proc/net, and its own iptables/nftables rules. Kernel-side it is a struct net. A fresh one has exactly one interface — loopback — and it comes up DOWN. ip netns add creates a named namespace (pinned via an nsfs bind-mount in /var/run/netns/), which is a different lifecycle from the anonymous namespace you get from unshare --net.

Task

Create a namespace net1 and look inside.

Verify it yourself

$ sudo ip netns exec net1 ip link show

Only lo, and it is DOWN. That empty stack is your blank container network.

Solution

$ sudo ip netns add net1

Commonly taught wrong

ip netns and Docker/unshare namespaces are the same kernel object, but ip netns list only shows the named ones it created under /var/run/netns/. A Docker container’s netns will not appear there unless you bind-mount its /proc/<pid>/ns/net into that directory first — a common “where did my namespace go” confusion.
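The bind-mount mentioned above might look like the following. It is a sketch under assumptions: `pin_netns` and `run` are hypothetical helpers, the PID 12345 and the name web are placeholders, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
# DRY_RUN=1 (the default here) prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# pin_netns PID NAME: expose PID's network namespace as /var/run/netns/NAME.
pin_netns() {
    run sudo mkdir -p /var/run/netns
    run sudo touch "/var/run/netns/$2"                           # file used as mount target
    run sudo mount --bind "/proc/$1/ns/net" "/var/run/netns/$2"  # pin the nsfs handle
}

# A real PID could come from: docker inspect -f '{{.State.Pid}}' <container>
pin_netns 12345 web
```

Once the handle is pinned, `ip netns list` should show the name and `sudo ip netns exec web ip addr` should run inside the container's stack.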

Q4. Build the bridge (this is docker0) (core)

To connect multiple namespaces you need a virtual switch — a bridge. This is precisely what docker0 is.

Under the hood

A Linux bridge is a software L2 switch. It keeps a forwarding database (FDB, viewable with bridge fdb show) mapping MAC → port, learned by watching source MACs on incoming frames. Unknown-unicast and broadcast frames are flooded to every port except the ingress port. Giving the bridge an IP makes the host itself a station on that L2 segment: the bridge device is simultaneously the switch and the host’s port onto it. STP is normally off for container bridges (single node, no loops to break).

Task

Create bridge br0, give it 10.10.0.1/24, bring it up.

Verify it yourself

$ ping -c1 10.10.0.1

0% packet loss — the bridge IP now lives on your host.

Solution

$ sudo ip link add br0 type bridge
$ sudo ip addr add 10.10.0.1/24 dev br0
$ sudo ip link set br0 up

Commonly taught wrong

A bridge with no active ports often shows state UNKNOWN, not UP. That is normal for virtual interfaces — operstate/carrier semantics differ from physical NICs. Do not chase UNKNOWN as a bug.

Q5. Plug a namespace in with a veth pair (core)

A veth pair is a virtual cable: one end inside the namespace, the other on the bridge. Exactly the wiring Docker makes per container.

Under the hood

A veth pair is two linked net_devices: a frame transmitted on one is received on the other. There is no wire — it is a direct in-kernel hand-off between the two devices. Moving one end with ip link set X netns net1 reassigns that net_device to the target struct net: it vanishes from the host and appears inside the namespace. Both ends should share an MTU; a mismatch bites later (see the MTU lab).

Task

Wire net1 to br0 as 10.10.0.2/24, bring loopback up, add a default route via the gateway.

Verify it yourself

$ sudo ip netns exec net1 ping -c1 10.10.0.1

0% packet loss — the namespace reaches its gateway across the bridge you built.

Solution

$ sudo ip link add veth1 type veth peer name veth1-br
$ sudo ip link set veth1 netns net1
$ sudo ip link set veth1-br master br0
$ sudo ip link set veth1-br up
$ sudo ip netns exec net1 ip link set lo up
$ sudo ip netns exec net1 ip addr add 10.10.0.2/24 dev veth1
$ sudo ip netns exec net1 ip link set veth1 up
$ sudo ip netns exec net1 ip route add default via 10.10.0.1

Commonly taught wrong

The single most common real bug: people bring up the namespace end and forget the host end. The peer sitting on the bridge must also be ip link set ... up, or the cable is dead even though everything inside the namespace looks perfect.

Q6. Two namespaces talking: the payoff (core)

Add a second namespace on the same bridge and they can talk — through a switch you built by hand. You just recreated container-to-container networking.

Under the hood

This is same-node pod-to-pod networking. With a bridge CNI, each pod gets a veth into a shared bridge; both MACs land in the FDB and frames switch directly on L2 — no routing, no NAT for same-subnet traffic. (Route-based CNIs like Calico skip the bridge and instead put a per-pod route on the host, but the veth-into-netns half is identical.)

Task

Create net2 as 10.10.0.3/24 on br0, then ping between the two.

Verify it yourself

$ sudo ip netns exec net1 ping -c1 10.10.0.3

0% loss both ways. This is a Docker bridge network, assembled from primitives.

Solution

$ sudo ip netns add net2
$ sudo ip link add veth2 type veth peer name veth2-br
$ sudo ip link set veth2 netns net2
$ sudo ip link set veth2-br master br0
$ sudo ip link set veth2-br up
$ sudo ip netns exec net2 ip link set lo up
$ sudo ip netns exec net2 ip addr add 10.10.0.3/24 dev veth2
$ sudo ip netns exec net2 ip link set veth2 up
$ sudo ip netns exec net2 ip route add default via 10.10.0.1
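The Q5 and Q6 solutions differ only in a number and an address, so the wiring can be sketched as one parameterized function. `attach_ns` and `run` are hypothetical helpers; the bridge br0 and gateway 10.10.0.1 are taken from the labs, and nothing executes unless DRY_RUN=0 is set.

```shell
# DRY_RUN=1 (the default here) prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# attach_ns N CIDR: build namespace netN and cable it to br0 as vethN <-> vethN-br.
attach_ns() {
    n=$1 cidr=$2
    run sudo ip netns add "net$n"
    run sudo ip link add "veth$n" type veth peer name "veth$n-br"
    run sudo ip link set "veth$n" netns "net$n"
    run sudo ip link set "veth$n-br" master br0
    run sudo ip link set "veth$n-br" up                   # host end, easy to forget
    run sudo ip netns exec "net$n" ip link set lo up
    run sudo ip netns exec "net$n" ip addr add "$cidr" dev "veth$n"
    run sudo ip netns exec "net$n" ip link set "veth$n" up
    run sudo ip netns exec "net$n" ip route add default via 10.10.0.1
}

attach_ns 2 10.10.0.3/24
```

Conceptually this is the per-pod work a bridge CNI plugin automates: one call per namespace, same nine steps each time.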

Q7. Break-and-fix: the dangling cable (debug)

Paste the setup: it builds net3 that should work but cannot reach the gateway. Diagnose it.

Under the hood

The frame leaves net3 and reaches its host-side veth, but if that peer is not enslaved to the bridge it has nowhere to go — the bridge never sees the frame. bridge link show lists which interfaces are enslaved to which bridge, and ip -d link show veth3-br shows its master. This is exactly how you localize a real CNI wiring failure.

Setup — paste to create the broken state

sudo ip netns add net3
sudo ip link add veth3 type veth peer name veth3-br
sudo ip link set veth3 netns net3
sudo ip netns exec net3 ip link set lo up
sudo ip netns exec net3 ip addr add 10.10.0.4/24 dev veth3
sudo ip netns exec net3 ip link set veth3 up
sudo ip netns exec net3 ip route add default via 10.10.0.1
# (something is missing on the host side...)

Task

The namespace side looks complete. Check the host end — bridge link and ip -d link show veth3-br.

Verify it yourself

$ sudo ip netns exec net3 ping -c1 10.10.0.1

0% loss once fixed.

Solution

The host end was never enslaved to the bridge or brought up — the cable dangled. In Kubernetes this is the classic “pod has an IP but no connectivity” caused by a half-finished CNI attach.

$ sudo ip link set veth3-br master br0
$ sudo ip link set veth3-br up
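The same diagnosis can be read from sysfs without parsing command output: an enslaved device has a `master` symlink, and `operstate` reports its link state. A minimal sketch, assuming sysfs at /sys; `link_state` is a hypothetical helper name.

```shell
# Show whether a device is enslaved to a bridge and what its link state is.
link_state() {
    d=/sys/class/net/$1
    if [ ! -d "$d" ]; then
        echo "$1: no such device in this namespace"
        return 1
    fi
    if [ -e "$d/master" ]; then
        m=$(basename "$(readlink "$d/master")")    # name of the bridge it belongs to
    else
        m=none
    fi
    echo "$1: master=$m operstate=$(cat "$d/operstate")"
}

link_state veth3-br || true
```

Before the fix the device should report `master=none`; after it, `master=br0`.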

Q8. Break-and-fix: rp_filter eats the replies (debug)

You add a second path or route and a namespace’s replies silently vanish — no error, just gone. Reverse-path filtering is dropping them.

Under the hood

rp_filter (RFC 3704) makes the kernel drop a packet if the route back to its source would not leave via the interface the packet arrived on, an anti-spoofing check. In asymmetric or multi-homed container setups it drops legitimate traffic. Mode 0 disables the check, 1 is strict, 2 is loose (accept if any route reaches the source). The effective value for an interface is the maximum of conf.all.rp_filter and conf.<iface>.rp_filter. Drops surface in nstat / /proc/net/netstat as the IPReversePathFilter counter.

Task

Inspect the current rp_filter values, then relax them to loose mode and confirm traffic flows.

Verify it yourself

$ cat /proc/sys/net/ipv4/conf/all/rp_filter

After setting loose mode, previously-dropped asymmetric replies get through. Watch the drop counter with nstat -az | grep -i ReversePathFilter.

Solution

$ sysctl net.ipv4.conf.all.rp_filter
$ sudo sysctl -w net.ipv4.conf.all.rp_filter=2
$ sudo sysctl -w net.ipv4.conf.br0.rp_filter=2

Commonly taught wrong

Setting an interface’s rp_filter to 0 does nothing if conf.all.rp_filter is 1 — the kernel takes the max of the two. Almost everyone forgets this and edits only the interface value.
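The max rule is simple enough to encode directly and then apply to the live values. This is a sketch; `effective_rp_filter` is a hypothetical helper name, and the loop only reads /proc.

```shell
# Effective rp_filter mode is the larger of the "all" and per-interface values.
effective_rp_filter() {
    # $1: conf.all.rp_filter, $2: conf.<iface>.rp_filter
    if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

effective_rp_filter 1 0    # 1: strict still applies although the interface says 0
effective_rp_filter 0 2    # 2: loose
effective_rp_filter 2 1    # 2: loose, because all=2 outranks the interface's 1

# Apply the rule to every interface entry visible in this namespace.
for f in /proc/sys/net/ipv4/conf/*/rp_filter; do
    [ -r "$f" ] || continue
    iface=${f%/rp_filter}; iface=${iface##*/}
    all=$(cat /proc/sys/net/ipv4/conf/all/rp_filter)
    echo "$iface: effective=$(effective_rp_filter "$all" "$(cat "$f")")"
done
```

Running the loop inside a namespace (`sudo ip netns exec net1 sh script.sh`, where script.sh is wherever the snippet was saved) shows that namespace's own values, since the sysctls are per network namespace.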

Q9. Break-and-fix: the br_netfilter surprise (debug)

Your bridged (same-L2) container traffic is unexpectedly hitting the host firewall / iptables FORWARD rules meant for routed traffic.

Under the hood

Once the br_netfilter module is loaded, bridged frames are pushed up through iptables’ FORWARD chain when net.bridge.bridge-nf-call-iptables=1. Kubernetes requires this so Service and NetworkPolicy iptables rules apply to pod traffic — but it also surprises anyone whose host firewall then filters intra-bridge traffic. The knob does not even exist until the module is loaded.

Task

Load br_netfilter, inspect the sysctl, and understand when bridged traffic is subject to iptables.

Verify it yourself

$ sysctl net.bridge.bridge-nf-call-iptables

With the module loaded and the sysctl at 1, your L2 bridge traffic traverses iptables FORWARD — the mechanism that lets kube-proxy and NetworkPolicy work at all.

Solution

$ sudo modprobe br_netfilter
$ sudo sysctl -w net.bridge.bridge-nf-call-iptables=1
$ lsmod | grep br_netfilter

Commonly taught wrong

Guides that say “just set the sysctl” fail silently if br_netfilter is not loaded first — the key does not exist yet, so the write errors or is ignored. This is the classic kubeadm preflight gotcha: modprobe br_netfilter then set the sysctl.
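A provisioning script can guard against the ordering problem by checking that the key exists before relying on it. A minimal sketch; `bridge_nf_state` is a hypothetical helper, and its optional argument exists only so the function can be pointed at a different file for testing.

```shell
# Report bridge-nf-call-iptables, or explain why the key is missing.
bridge_nf_state() {
    # $1: sysctl file to read (defaults to the real key under /proc/sys)
    key=${1:-/proc/sys/net/bridge/bridge-nf-call-iptables}
    if [ -r "$key" ]; then
        echo "bridge-nf-call-iptables=$(cat "$key")"
    else
        echo "key missing: load the module with 'sudo modprobe br_netfilter' first"
    fi
}

bridge_nf_state
```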

Q10. Reach the internet: NAT, netfilter hooks & conntrack (core)

Your namespaces talk to each other but not the outside world: 10.10.0.0/24 is private and the internet has no route back. NAT fixes this.

Under the hood

netfilter has fixed hook points; NAT happens at PREROUTING (DNAT) and POSTROUTING (SNAT/MASQUERADE). MASQUERADE is SNAT that picks the outgoing interface’s IP dynamically — convenient when the uplink IP can change. Crucially, NAT only works because of conntrack: the kernel tracks each flow so reply packets are automatically un-NAT’d back to the namespace. And net.ipv4.ip_forward must be 1 or the host will not route between interfaces at all.

Task

Enable forwarding, MASQUERADE the subnet, then ping a public IP from net1 and watch the tracked flow.

Verify it yourself

$ sudo ip netns exec net1 ping -c1 8.8.8.8

0% loss (needs host internet). Then run sudo conntrack -L | grep 10.10.0 and you will see the exact flow being tracked and source-NAT’d.

Solution

If it still fails, your host FORWARD policy may be DROP: sudo iptables -P FORWARD ACCEPT. Install conntrack with sudo apt install -y conntrack.

$ sudo sysctl -w net.ipv4.ip_forward=1
$ sudo iptables -t nat -A POSTROUTING -s 10.10.0.0/24 ! -o br0 -j MASQUERADE
$ sudo ip netns exec net1 ping -c1 8.8.8.8
$ sudo conntrack -L | grep 10.10.0

Commonly taught wrong

“MASQUERADE equals SNAT” is only half true. Plain SNAT --to-source with a fixed IP is cheaper and preferable on a stable uplink; MASQUERADE recomputes the source per flow. And NAT is not a firewall — it is conntrack state, not a security boundary.
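For comparison with the MASQUERADE rule in the solution, a fixed-address SNAT rule could look like this. It is a sketch under assumptions: `snat_subnet` and `run` are hypothetical helpers, eth0 is a placeholder uplink name, 192.0.2.10 is a documentation address standing in for the host's real IP, and the command is only printed unless DRY_RUN=0 is set.

```shell
# DRY_RUN=1 (the default here) prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# snat_subnet SUBNET UPLINK ADDR: source-NAT SUBNET to a fixed ADDR on UPLINK.
snat_subnet() {
    run sudo iptables -t nat -A POSTROUTING -s "$1" -o "$2" -j SNAT --to-source "$3"
}

# Placeholders: substitute the real uplink interface and its address.
snat_subnet 10.10.0.0/24 eth0 192.0.2.10
```

Use one rule or the other for a given subnet, not both; whichever matches first in POSTROUTING decides the translation.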

Q11. Break-and-fix: the MTU black hole (debug)

Small pings work, the TCP handshake works, but large transfers (a big HTTP response, git clone) hang forever. This is the classic MTU/MSS black hole.

Under the hood

If a hop on the path has a smaller MTU than the endpoints assume, large packets must fragment; if the Don’t-Fragment bit is set and the ICMP “fragmentation needed” message is blocked (common on overlays and in cloud networks), the sender never learns — Path MTU Discovery black-holes and the flow stalls after the handshake. Overlays like VXLAN add ~50 bytes of encapsulation, so pod MTU is often 1450, not 1500. The standard fix CNIs apply is TCP MSS clamping: -j TCPMSS --clamp-mss-to-pmtu.

Task

Shrink a link’s MTU to reproduce the black hole, prove it with a sized ping, then clamp MSS to fix it.

Verify it yourself

$ sudo ip netns exec net1 ping -M do -s 1472 -c1 10.10.0.1

A normal ping still succeeds, but ping -M do -s 1472 (1472 bytes of payload plus 28 bytes of ICMP and IP headers, a full 1500-byte IP packet) fails once a hop MTU is smaller: that is the black hole made visible. After MSS clamping, large TCP transfers complete.

Solution

# shrink one side to simulate a smaller-MTU hop:
$ sudo ip link set veth1-br mtu 1400
# prove a full-size frame is dropped (DF set):
$ sudo ip netns exec net1 ping -M do -s 1472 -c2 10.10.0.1
# the CNI-style fix: clamp TCP MSS to the path MTU
$ sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

Commonly taught wrong

“Ping works, so the network is fine” is how people miss this for hours. Ping uses tiny packets; the handshake is tiny too. MTU black holes pass both and only hang on the first large payload. When a link “works but hangs,” always test with ping -M do -s <big>.
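The sizes used in this lab follow from simple header arithmetic, sketched below for IPv4 without options. `ping_payload` and `tcp_mss` are hypothetical helper names.

```shell
# Largest ICMP echo payload that fits one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
ping_payload() { echo $(( $1 - 20 - 8 )); }

# Largest TCP segment: MTU minus 20 bytes of IPv4 and 20 bytes of TCP header.
tcp_mss() { echo $(( $1 - 20 - 20 )); }

ping_payload 1500    # 1472, the -s value used in this lab
ping_payload 1400    # 1372, fits the 1400-byte MTU set on veth1-br
tcp_mss 1450         # 1410, for a pod behind a 50-byte VXLAN overlay
```

So on a path suspected of a smaller MTU, `ping -M do -s $(ping_payload <mtu>)` is the probe that should just fit.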

Q12. The debugging ladder: leaves but never returns (debug)

The SRE skill that ties it all together: methodically locate exactly which hop drops a packet, instead of guessing.

Under the hood

The escalation ladder, in order: (1) ip netns exec <ns> ip route get <dst> — does the kernel even choose a sane route and source address? (2) ss -tan — is the socket stuck in SYN-SENT (nothing answering) versus ESTABLISHED? (3) tcpdump -ni <iface> on both sides of each hop — on the veth inside the namespace, then on its host peer, then on the bridge — to find where the packet last appears. (4) conntrack -L — is the flow tracked and NAT’d as expected? (5) nstat -az / /proc/net/snmp for drop counters. Capturing on both ends of a hop is what localizes the drop precisely.

Task

Run the ladder against one of your namespaces: route-get, socket state, and a two-point tcpdump across a veth.

Verify it yourself

$ sudo ip netns exec net1 ip route get 8.8.8.8

You should be able to point at the exact interface where a packet last appears — that interface, and the hop just after it, is where the fault lives.

Solution

$ sudo ip netns exec net1 ip route get 8.8.8.8
$ sudo ip netns exec net1 ss -tan
# in two terminals, capture both ends of the veth:
$ sudo ip netns exec net1 tcpdump -ni veth1
$ sudo tcpdump -ni veth1-br

Commonly taught wrong

tcpdump on the bridge shows frames after L2 switching; capture on the specific veth to see traffic before and after a single hop. Watching the wrong interface is why people conclude “the packet just disappears” — it did not, they were not looking at the hop where it died.
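For the socket-state rung of the ladder, a per-state count is often quicker to read than the raw `ss` listing. A minimal sketch, assuming an iproute2 `ss` that supports `-H` (no header); `count_states` is a hypothetical helper name.

```shell
# Count sockets per TCP state from "ss -Htan"-style lines on stdin.
count_states() {
    awk '{ n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

if command -v ss >/dev/null 2>&1; then
    ss -Htan | count_states
else
    echo "ss not found (it ships with iproute2)"
fi
```

Inside a namespace the same pipeline is `sudo ip netns exec net1 ss -Htan | count_states`; a pile of SYN-SENT entries there points at a path where nothing is answering.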

What you just built

Namespaces (struct net) + cgroups + veth + a bridge + netfilter hooks: that is the entire substrate, and you built it by hand. Docker’s docker0 is your br0; -p is a DNAT rule in PREROUTING; docker-proxy is a userspace fallback. Kubernetes puts one veth end in each pod’s netns, wires the other into the node (a bridge, or a per-pod host route via CNI), and kube-proxy programs these same netfilter hooks to implement Services. You can now read a pod’s packet path from the kernel up, and debug it when it breaks. Clean up (ip netns del takes one name per call): for ns in net1 net2 net3; do sudo ip netns del "$ns"; done; sudo ip link del br0.