DevOps & Infrastructure | November 5, 2007 | Updated: June 22, 2026

Fault-Tolerant Systems with Erlang/OTP Supervision Trees (2007)

Aunimeda

In 2007, as web traffic explodes, everyone is talking about scalability. But in the Erlang world, we've been solving these problems since the 80s at Ericsson. The secret isn't just "concurrency"; it's fault tolerance. In Erlang, we don't try to write perfect, bug-free code. Instead, we use the "Let It Crash" philosophy: let a failing process die, and have another process restart it in a clean state.

The Supervisor Pattern

The heart of an OTP (Open Telecom Platform) application is the Supervision Tree. A supervisor is a process whose sole job is to watch its children (workers or other supervisors) and restart them if they crash.

Defining a Simple Supervisor

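In Erlang R11B, a supervisor is a callback module that implements the supervisor behaviour. It exports init/1, which returns the restart strategy and the list of child specifications. Here is how we define a supervisor for a hypothetical TCP server: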

-module(my_app_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
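    % Restart strategy: one_for_one
    % If a child crashes, only that child is restarted.
    % At most 5 restarts are allowed within 10 seconds; beyond that the
    % supervisor terminates its children and then itself.
    RestartStrategy = {one_for_one, 5, 10},

    % Child spec: {Id, {Module, Function, Args}, Restart, Shutdown, Type, Modules}
    % permanent = always restart; 2000 = ms allowed for a clean shutdown.
    Server = {my_server, {my_server, start_link, []},
              permanent, 2000, worker, [my_server]},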

    {ok, {RestartStrategy, [Server]}}.
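
The supervisor above refers to a my_server module that the article does not show. As a minimal sketch of what such a worker could look like, here is a gen_server with a trivial counter as its state. The ping/0 and crash/0 functions are hypothetical helpers added only so the restart behaviour can be observed; a real TCP server would hold a listening socket instead.

```erlang
-module(my_server).
-behaviour(gen_server).

%% API (ping/0 and crash/0 are illustrative assumptions, not from the article)
-export([start_link/0, ping/0, crash/0]).
%% gen_server callbacks
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Called by the supervisor; registers the process locally as my_server.
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Synchronous health check.
ping() ->
    gen_server:call(?MODULE, ping).

%% Asks the server to crash on purpose.
crash() ->
    gen_server:cast(?MODULE, crash).

%% State is the number of pings served since the last (re)start.
init([]) ->
    {ok, 0}.

handle_call(ping, _From, Count) ->
    {reply, pong, Count + 1}.

%% Raising an error here terminates the process; the supervisor sees the exit.
handle_cast(crash, _Count) ->
    erlang:error(deliberate_crash).

handle_info(_Msg, Count) ->
    {noreply, Count}.

terminate(_Reason, _Count) ->
    ok.

code_change(_OldVsn, Count, _Extra) ->
    {ok, Count}.
```

With both modules compiled, calling my_app_sup:start_link() and then my_server:crash() should, shortly afterwards, leave whereis(my_server) returning a different pid than before, because the supervisor has started a fresh worker whose counter is back at 0.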

Restart Strategies

The first element of the RestartStrategy tuple defines how the supervisor reacts to a crash:

  1. one_for_one: Only the crashed process is restarted.
  2. one_for_all: If one child crashes, restart all of them. Use this if your processes are tightly coupled.
  3. rest_for_one: If a child crashes, restart it and any children started after it in the list.
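
To make the difference between the three strategies concrete, the following sketch models which children get restarted when one of them crashes. It is a toy model for illustration only, not how OTP's supervisor is implemented; the module name restart_scope and the child ids in the example are hypothetical.

```erlang
-module(restart_scope).
-export([affected/3]).

%% affected(Strategy, Crashed, Children) returns the child ids that a
%% supervisor using Strategy would restart when Crashed dies.
%% Children is the list of child ids in start order.

%% one_for_one: only the crashed child.
affected(one_for_one, Crashed, Children) ->
    [C || C <- Children, C =:= Crashed];

%% one_for_all: every child, provided the crashed one is actually ours.
affected(one_for_all, Crashed, Children) ->
    case lists:member(Crashed, Children) of
        true  -> Children;
        false -> []
    end;

%% rest_for_one: the crashed child and everything started after it.
affected(rest_for_one, Crashed, Children) ->
    lists:dropwhile(fun(C) -> C =/= Crashed end, Children).
```

For example, with children started in the order [listener, db, cache], a crash of db gives [db] under one_for_one, [listener, db, cache] under one_for_all, and [db, cache] under rest_for_one.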

Why This Works

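When an Erlang process crashes, whether from an unhandled exception or a failed pattern match (a "badmatch"), an exit signal is sent to every process linked to it, including its supervisor. The supervisor traps exits, so the signal arrives as an ordinary message instead of killing it. Because the supervisor is isolated from its children and keeps very simple state, it's unlikely to crash itself. It simply starts a fresh instance of the worker process in a known good state.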
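
The mechanism underneath this is plain links plus the trap_exit flag, and it can be tried without any OTP behaviour at all. The sketch below is a simplified stand-in for what a supervisor does, not the supervisor's actual code; the module name exit_demo is hypothetical. It links to a process that dies of a badmatch and receives the exit signal as a message.

```erlang
-module(exit_demo).
-export([run/0]).

run() ->
    %% Trap exits so a linked process's exit signal is delivered as a
    %% {'EXIT', Pid, Reason} message instead of terminating us.
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> crash(2) end),
    Result =
        receive
            {'EXIT', Pid, Reason} ->
                %% A real supervisor would start a replacement child here.
                {worker_died, Reason}
        after 1000 ->
            timeout
        end,
    %% Restore the caller's original trap_exit setting.
    process_flag(trap_exit, Old),
    Result.

%% Fails with a badmatch error whenever X is not 1.
crash(X) ->
    1 = X.
```

Calling exit_demo:run() returns a tuple of the form {worker_died, {{badmatch, 2}, Stacktrace}}: the exit reason carries both the error and where it happened, which is the information a supervisor logs before restarting the child.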

This hierarchical approach is what lets Erlang systems aim for "nine nines" (99.9999999%) availability. If a low-level worker crashes, its supervisor handles it. If the supervisor itself fails, or gives up because its children are crashing faster than its restart limit allows, its own supervisor handles that. The crash only propagates as far as necessary to clear the corrupted state.
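
One level up, the tree is built the same way: a parent supervisor lists my_app_sup as one of its children. The sketch below assumes a hypothetical top-level module named my_top_sup, and the restart limits are arbitrary illustrative values.

```erlang
-module(my_top_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Assumed limits: tolerate 3 restarts of the subtree within 60 seconds.
    RestartStrategy = {one_for_one, 3, 60},

    %% The child is itself a supervisor, so its type is supervisor and its
    %% shutdown value is infinity, giving it time to stop its own children.
    AppSup = {my_app_sup, {my_app_sup, start_link, []},
              permanent, infinity, supervisor, [my_app_sup]},

    {ok, {RestartStrategy, [AppSup]}}.
```

If my_server crashes more than 5 times in 10 seconds, my_app_sup exits, and my_top_sup restarts the whole subtree. Only if that also keeps failing does the problem reach the top of the tree.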

