Fault-Tolerant Systems with Erlang/OTP Supervision Trees

#Erlang#OTP#Supervision Trees#Fault Tolerance#Concurrency


In 2007, as web traffic explodes, everyone is talking about scalability. But in the Erlang world, we've been solving these problems since the 1980s at Ericsson. The secret isn't just "concurrency"; it's fault tolerance. In Erlang, we don't try to write perfect, bug-free code. Instead, we follow the "Let It Crash" philosophy: when a process hits an error it cannot handle, we let it die and have another process restart it.

The Supervisor Pattern

The heart of an OTP (Open Telecom Platform) application is the Supervision Tree. A supervisor is a process whose sole job is to watch its children (workers or other supervisors) and restart them if they crash.

Defining a Simple Supervisor

In Erlang R11B, a supervisor implements the supervisor behavior. Here is how we define a supervisor for a hypothetical TCP server:

-module(my_app_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
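    %% Restart strategy: one_for_one.
    %% If a child crashes, only that child is restarted.
    %% Intensity 5, period 10: if more than 5 restarts happen within
    %% 10 seconds, the supervisor gives up and terminates itself.
    RestartStrategy = {one_for_one, 5, 10},

    %% Child spec: {Id, {Module, Function, Args}, Restart, Shutdown, Type, Modules}.
    %% permanent = always restart; 2000 = milliseconds allowed for a clean shutdown.
    Server = {my_server, {my_server, start_link, []},
              permanent, 2000, worker, [my_server]},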

    {ok, {RestartStrategy, [Server]}}.
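
The child spec above expects a my_server module that exports start_link/0. The article does not show that module, so the following is only a minimal sketch of what such a worker could look like: a gen_server with a hypothetical ping request and a request counter as its state. It is a stand-in and does not open any TCP sockets.

```erlang
-module(my_server).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Called by the supervisor; must return {ok, Pid} and link to the caller.
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% The state is a request counter (an assumption made for this sketch).
%% After a restart it is back at 0, the known good state.
init([]) ->
    {ok, 0}.

%% Hypothetical request: reply pong and count it.
%% Any other request fails to match, crashes the worker,
%% and leaves recovery to the supervisor.
handle_call(ping, _From, Count) ->
    {reply, pong, Count + 1}.

handle_cast(_Msg, Count) ->
    {noreply, Count}.

handle_info(_Info, Count) ->
    {noreply, Count}.

terminate(_Reason, _Count) ->
    ok.

code_change(_OldVsn, Count, _Extra) ->
    {ok, Count}.
```

With both modules compiled, my_app_sup:start_link() starts the worker, and gen_server:call(my_server, ping) should reply pong.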

Restart Strategies

The RestartStrategy defines how the supervisor reacts to a crash:

one_for_one: Only the crashed process is restarted.
one_for_all: If one child crashes, the remaining children are terminated and then all of them are restarted. Use this if your processes are tightly coupled.
rest_for_one: If a child crashes, it and any children started after it in the list are terminated and restarted.
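
To make the difference between the three strategies concrete, here is a small model of which children get restarted after a crash. This is not how OTP implements supervisors; restart_model and affected/3 are hypothetical names used only for illustration, and the child list is assumed to be in start order.

```erlang
-module(restart_model).
-export([affected/3]).

%% affected(Strategy, CrashedId, ChildIdsInStartOrder) -> ids to restart.
%% Illustrative model only; the real logic lives in the supervisor module.

%% one_for_one: only the crashed child.
affected(one_for_one, Crashed, _Children) ->
    [Crashed];
%% one_for_all: every child of this supervisor.
affected(one_for_all, _Crashed, Children) ->
    Children;
%% rest_for_one: the crashed child and everything started after it.
affected(rest_for_one, Crashed, Children) ->
    lists:dropwhile(fun(Id) -> Id =/= Crashed end, Children).
```

For children [a, b, c] where b crashes, the model returns [b] for one_for_one, [a, b, c] for one_for_all, and [b, c] for rest_for_one.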

Why This Works

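When a process in Erlang crashes due to an unhandled exception such as a "badmatch," an exit signal is sent to every process it is linked to, including its supervisor. The supervisor traps exits, so the signal arrives as an ordinary message instead of taking the supervisor down with it. Because the supervisor is isolated and has a very simple state, it's unlikely to crash itself. It simply spawns a fresh version of the worker process in a known good state.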
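
The mechanism can be seen without a supervisor at all. The sketch below (exit_demo and worker/1 are made-up names) has a process trap exits, link to a worker that dies with a badmatch, and receive the resulting exit signal as a message. A real supervisor does much more bookkeeping than this, but it is built on the same primitives.

```erlang
-module(exit_demo).
-export([run/0]).

run() ->
    %% Trap exits so a linked process crashing sends us a message
    %% instead of killing us. Remember the old flag to restore it.
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> worker(not_ok) end),
    Result =
        receive
            {'EXIT', Pid, Reason} -> {worker_crashed, Reason}
        after 1000 ->
            timeout
        end,
    process_flag(trap_exit, Old),
    Result.

%% Hypothetical worker: the match fails for anything but ok,
%% which raises a badmatch error and crashes the process.
worker(Input) ->
    ok = Input,
    done.
```

Calling exit_demo:run() should return a tuple of the form {worker_crashed, {{badmatch, not_ok}, Stacktrace}}. The runtime will also log an error report for the crashed process.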

This hierarchical approach is what lets Erlang systems aim for the often-quoted "Nine Nines" (99.9999999%) availability. If a low-level worker crashes, its supervisor handles it. If a supervisor gives up because its children keep crashing faster than its restart limit allows, it terminates and its own supervisor handles that. The crash only propagates as far as necessary to clear the corrupted state.
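
A tree is built by listing one supervisor as the child of another. As a sketch, a hypothetical top-level supervisor (my_top_sup is an invented name, and the restart limits are arbitrary) could supervise my_app_sup like this:

```erlang
-module(my_top_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% The child is itself a supervisor, so its type is 'supervisor'
    %% and its shutdown time is 'infinity', which gives the subtree
    %% time to stop its own children first.
    AppSup = {my_app_sup, {my_app_sup, start_link, []},
              permanent, infinity, supervisor, [my_app_sup]},

    %% If my_app_sup exceeds its own restart limit and terminates,
    %% this supervisor restarts the whole subtree, up to 3 times
    %% in 60 seconds (values chosen only for illustration).
    {ok, {{one_for_one, 3, 60}, [AppSup]}}.
```

Starting my_top_sup then brings up my_app_sup, which in turn starts its workers.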

