Fault-Tolerant Systems with Erlang/OTP Supervision Trees
In 2007, as web traffic explodes, everyone is talking about scalability. But in the Erlang world, we have been solving these problems at Ericsson since the 1980s. The secret isn't just "concurrency"; it's fault tolerance. In Erlang, we don't try to write perfect, bug-free code. Instead, we follow the "Let It Crash" philosophy: when a process hits an error it cannot handle, it dies, and something else is responsible for recovering.
The Supervisor Pattern
The heart of an OTP (Open Telecom Platform) application is the Supervision Tree. A supervisor is a process whose sole job is to watch its children (workers or other supervisors) and restart them if they crash.
Defining a Simple Supervisor
In Erlang R11B, a supervisor implements the supervisor behavior. Here is how we define a supervisor for a hypothetical TCP server:
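-module(my_app_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Restart strategy: one_for_one.
    %% If a child crashes, only that child is restarted.
    %% If more than 5 restarts occur within 10 seconds, the supervisor
    %% gives up, terminates its children, and exits itself.
    RestartStrategy = {one_for_one, 5, 10},
    %% Child spec: {Id, {Module, Function, Args}, Restart, Shutdown, Type, Modules}.
    %% permanent = always restart; 2000 = milliseconds allowed for shutdown.
    Server = {my_server, {my_server, start_link, []},
              permanent, 2000, worker, [my_server]},
    {ok, {RestartStrategy, [Server]}}.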
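The child spec above names my_server:start_link/0 but the article does not show that module. As a minimal sketch of what such a worker could look like, here is a gen_server with the TCP handling left out. The ping/0 function and the counter state are assumptions made purely for illustration; only the module name and start_link/0 come from the child spec.

```erlang
-module(my_server).
-behaviour(gen_server).

%% Public API
-export([start_link/0, ping/0]).
%% gen_server callbacks
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Called by the supervisor through the child spec
%% {my_server, start_link, []}. It returns {ok, Pid} and links the
%% new process to the caller, which is what a supervisor expects.
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Hypothetical request, used only to show that the server is alive.
ping() ->
    gen_server:call(?MODULE, ping).

init([]) ->
    %% Every start and every restart begins from this known good state.
    {ok, 0}.

%% Only 'ping' is handled. Any other request fails with function_clause
%% and crashes the server, which is exactly the case the supervisor covers.
handle_call(ping, _From, Count) ->
    {reply, {pong, Count + 1}, Count + 1}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```

With both modules compiled, my_app_sup:start_link() starts the worker, and my_server:ping() returns {pong, 1} on the first call. After a crash and restart the counter starts again from 0, because init/1 runs again.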
Restart Strategies
The first element of the RestartStrategy tuple defines how the supervisor reacts to a crash:
- one_for_one: Only the crashed process is restarted.
- one_for_all: If one child crashes, all the other children are terminated and then every child is restarted. Use this if your processes are tightly coupled.
- rest_for_one: If a child crashes, the children started after it in the list are terminated, and then the crashed child and those later children are restarted.
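To make the difference between the strategies concrete, the following sketch sets up a rest_for_one supervisor with three children. The module name rest_demo_sup, the child ids a, b, and c, and the helpers start_worker/0 and child_pid/1 are all hypothetical, and the workers are bare processes standing in for real servers.

```erlang
-module(rest_demo_sup).
-behaviour(supervisor).

-export([start_link/0, init/1, start_worker/0, child_pid/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Hypothetical stand-in for a real worker: a linked process that
%% waits until it is told to stop.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

init([]) ->
    %% Children are started in list order: a, then b, then c.
    Spec = fun(Id) ->
               {Id, {?MODULE, start_worker, []},
                permanent, 2000, worker, [?MODULE]}
           end,
    {ok, {{rest_for_one, 5, 10}, [Spec(a), Spec(b), Spec(c)]}}.

%% Helper: look up the current pid of a child by its id.
child_pid(Id) ->
    {Id, Pid, _Type, _Modules} =
        lists:keyfind(Id, 1, supervisor:which_children(?MODULE)),
    Pid.
```

After rest_demo_sup:start_link(), killing b with exit(rest_demo_sup:child_pid(b), kill) should leave the pid of a unchanged while b and c both come back with new pids. Changing rest_for_one to one_for_one would replace only b, and one_for_all would replace all three.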
Why This Works
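When an Erlang process crashes, for example from an unhandled exception or a "badmatch" error, the runtime sends an exit signal to every process linked to it, including its supervisor. The supervisor traps exits, so the signal arrives as an ordinary message rather than killing it. Because the supervisor is isolated and keeps very simple state, it is unlikely to crash itself. It simply calls the child's start function again, producing a fresh worker in a known good state.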
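The link and exit-signal mechanism can be seen without a supervisor at all. The sketch below is not how the OTP supervisor is implemented; it only shows the primitive it builds on. The module name exit_demo and the function crash/1 are made up for this example.

```erlang
-module(exit_demo).
-export([run/0]).

%% Spawn a linked process that crashes with a badmatch and return the
%% exit reason as the parent sees it.
run() ->
    %% Trapping exits converts incoming exit signals into ordinary
    %% {'EXIT', Pid, Reason} messages instead of killing this process.
    process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> crash(2) end),
    receive
        {'EXIT', Pid, {Reason, _StackTrace}} ->
            Reason
    after 1000 ->
        timeout
    end.

%% The match below fails unless N is 1, raising {badmatch, N}.
crash(N) ->
    1 = N,
    ok.
```

Calling exit_demo:run() returns {badmatch, 2}: the child died, and the parent learned why without being taken down with it. A supervisor reacts to the same kind of message by restarting the child. Note that run/0 leaves trap_exit switched on in the calling process.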
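This hierarchical approach is what lets Erlang systems reach very high availability; the figure usually quoted is the "nine nines" (99.9999999%) reported for Ericsson's AXD301 switch. If a low-level worker crashes, its supervisor handles it. If restarts keep failing and the supervisor exceeds its restart limit (5 restarts in 10 seconds in the example above), the supervisor itself terminates and its own supervisor handles that. The crash propagates only as far as necessary to clear the corrupted state.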
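A tree is built by listing a supervisor as the child of another supervisor. As a sketch, a hypothetical top-level supervisor my_top_sup could sit above the my_app_sup module defined earlier. The module name and the restart limit of 1 restart in 60 seconds are assumptions chosen for illustration.

```erlang
-module(my_top_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% The child is itself a supervisor, so its type is 'supervisor'
    %% and its shutdown value is 'infinity', giving the subtree as long
    %% as it needs to stop its own children.
    AppSup = {my_app_sup, {my_app_sup, start_link, []},
              permanent, infinity, supervisor, [my_app_sup]},
    %% Assumed limit: if my_app_sup has to be restarted more than once
    %% in 60 seconds, this supervisor gives up as well.
    {ok, {{one_for_one, 1, 60}, [AppSup]}}.
```

Starting this tree needs my_app_sup and my_server to be compiled too. When my_app_sup exceeds its own limit and exits, my_top_sup restarts the whole subtree from scratch.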