Distributed Systems | November 5, 2007 | 2 min read

Fault-Tolerant Systems with Erlang/OTP Supervision Trees (2007)

Aunimeda

In 2007, as web traffic explodes, everyone is talking about scalability. But in the Erlang world, we've been solving these problems at Ericsson since the 1980s. The secret isn't just concurrency; it's fault tolerance. In Erlang, we don't try to write perfect, bug-free code. Instead, we follow the "Let It Crash" philosophy: code the happy path, let a process die when something unexpected happens, and have another process restart it in a clean state.
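To make "Let It Crash" concrete, here is a minimal sketch. The module name let_it_crash and the function read_config/1 are hypothetical and exist only for illustration. The function asserts the result it expects and writes no error-handling branch, so any other result raises a badmatch and ends the calling process.

```erlang
-module(let_it_crash).
-export([read_config/1]).

%% Hypothetical helper: read a file and return its contents.
%% The match on {ok, Bin} is the only "error handling". If the file is
%% missing, file:read_file/1 returns {error, enoent}, the match fails
%% with a badmatch, and the calling process crashes.
read_config(Path) ->
    {ok, Bin} = file:read_file(Path),
    Bin.
```

The point of the design is that recovery does not live in this function at all. It is left to whichever process supervises the caller.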

The Supervisor Pattern

The heart of an OTP (Open Telecom Platform) application is the Supervision Tree. A supervisor is a process whose sole job is to watch its children (workers or other supervisors) and restart them if they crash.

Defining a Simple Supervisor

In Erlang R11B, a supervisor implements the supervisor behavior. Here is how we define a supervisor for a hypothetical TCP server:

-module(my_app_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
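    %% Restart strategy: one_for_one.
    %% If a child crashes, only that child is restarted.
    %% Allow at most 5 restarts within 10 seconds. Beyond that, the
    %% supervisor terminates its children and exits itself.
    RestartStrategy = {one_for_one, 5, 10},

    %% Child spec: {Id, {Module, Function, Args}, Restart,
    %%              Shutdown timeout in ms, Type, Modules}
    Server = {my_server, {my_server, start_link, []},
              permanent, 2000, worker, [my_server]},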

    {ok, {RestartStrategy, [Server]}}.
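The supervisor above expects a my_server module that exports start_link/0, which the article does not show. The following is a minimal sketch of such a worker, assuming a gen_server that keeps a simple counter in place of a real TCP socket. The bump request and the counter state are hypothetical stand-ins.

```erlang
-module(my_server).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Called by the supervisor through the {my_server, start_link, []}
%% entry in the child spec. gen_server:start_link/4 links the new
%% process to the caller, which is the supervisor.
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Every (re)start begins from the same known good state.
init([]) ->
    {ok, 0}.

%% Only bump is handled. Any other request fails with function_clause
%% and crashes the server, which is left to the supervisor.
handle_call(bump, _From, Count) ->
    {reply, Count + 1, Count + 1}.

handle_cast(_Msg, Count) ->
    {noreply, Count}.

handle_info(_Info, Count) ->
    {noreply, Count}.

terminate(_Reason, _Count) ->
    ok.

code_change(_OldVsn, Count, _Extra) ->
    {ok, Count}.
```

With both modules compiled, a shell session might look like this:

```erlang
1> my_app_sup:start_link().
2> gen_server:call(my_server, bump).
3> exit(whereis(my_server), kill).
4> gen_server:call(my_server, bump).
```

After the kill in step 3, the supervisor should have started a fresh my_server, so the counter in step 4 starts again from its initial state.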

Restart Strategies

The RestartStrategy defines how the supervisor reacts to a crash:

  1. one_for_one: Only the crashed process is restarted.
  2. one_for_all: If one child crashes, restart all of them. Use this if your processes are tightly coupled.
  3. rest_for_one: If a child crashes, restart it and any children started after it in the list.
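The difference between the strategies is easiest to see by killing one child and checking which process identifiers changed. The following is a small sketch for that purpose. The module strategy_demo, the child names a, b and c, and the helper restarted_after_killing/1 are all hypothetical, and the workers are bare linked processes rather than real servers.

```erlang
-module(strategy_demo).
-behaviour(supervisor).

-export([start_link/1, init/1, start_child/1, restarted_after_killing/1]).

%% Strategy is one_for_one, one_for_all or rest_for_one.
start_link(Strategy) ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, Strategy).

init(Strategy) ->
    %% Children are started in list order: a, then b, then c.
    Children = [child(Name) || Name <- [a, b, c]],
    {ok, {{Strategy, 5, 10}, Children}}.

child(Name) ->
    {Name, {?MODULE, start_child, [Name]},
     permanent, 2000, worker, [?MODULE]}.

%% A do-nothing worker: a process linked to the supervisor and
%% registered under Name so that it is easy to look up.
start_child(Name) ->
    Pid = spawn_link(fun idle/0),
    register(Name, Pid),
    {ok, Pid}.

idle() ->
    receive after infinity -> ok end.

%% Kill one child and return the names whose pid changed.
restarted_after_killing(Victim) ->
    Before = pids(),
    exit(whereis(Victim), kill),
    timer:sleep(100),  % crude: give the supervisor time to restart
    After = pids(),
    [Name || {{Name, Old}, {_, New}} <- lists:zip(Before, After),
             Old =/= New].

pids() ->
    [{Name, whereis(Name)} || Name <- [a, b, c]].
```

Started with rest_for_one, strategy_demo:restarted_after_killing(b) should return [b, c], because c was started after b. With one_for_one it should return [b], and with one_for_all it should return [a, b, c]. The fixed sleep is a simplification for the demo and not something to rely on in production code.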

Why This Works

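When an Erlang process crashes, for example on an unhandled exception or a failed pattern match (a "badmatch"), the runtime sends an exit signal to every process linked to it, including its supervisor. The supervisor traps exits, so the signal arrives as an ordinary message instead of killing it. Because the supervisor is isolated and keeps very little state, it is unlikely to crash itself. It simply starts a fresh worker process in a known good state.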
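The mechanism underneath can be shown without OTP at all. The following is a sketch of the raw primitives (links and exit trapping), not the actual supervisor implementation. The module exit_demo and its functions are hypothetical.

```erlang
-module(exit_demo).
-export([run/0, worker/1]).

%% Spawn a linked worker that crashes with a badmatch and report the
%% exit reason that reaches the linked parent.
run() ->
    %% With trap_exit set, exit signals from linked processes arrive
    %% as {'EXIT', Pid, Reason} messages instead of killing us.
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(?MODULE, worker, [{error, enoent}]),
    Result = receive
                 {'EXIT', Pid, {Error, _Stacktrace}} -> {crashed, Error}
             after 1000 ->
                 timeout
             end,
    process_flag(trap_exit, Old),
    Result.

%% The worker only accepts {ok, Value}; anything else is a badmatch.
worker(Input) ->
    {ok, Value} = Input,
    Value.
```

Calling exit_demo:run() returns {crashed, {badmatch, {error, enoent}}}. A real supervisor reacts to the same kind of message by consulting its restart strategy and calling the child's start function again.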

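This hierarchical approach is what lets us aim for "nine nines" availability (99.9999999%, or roughly 32 milliseconds of downtime per year), the figure often cited for Ericsson's AXD301 switch. If a low-level worker crashes, its supervisor handles it. If that supervisor gives up because the worker keeps crashing, it exits and its own supervisor handles that. The crash only propagates as far as necessary to clear the corrupted state.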
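A two-level tree shows this escalation in action. The following is a sketch under stated assumptions: the module tree_demo, the registered names tree_top, tree_sub and tree_worker, and the helper escalate/0 are hypothetical, and one module serves as the callback for both supervisors to keep the example in a single file.

```erlang
-module(tree_demo).
-behaviour(supervisor).

-export([start_link/0, start_sub/0, start_worker/0, init/1, escalate/0]).

start_link() ->
    supervisor:start_link({local, tree_top}, ?MODULE, top).

start_sub() ->
    supervisor:start_link({local, tree_sub}, ?MODULE, sub).

%% A do-nothing worker linked to whichever supervisor starts it.
start_worker() ->
    Pid = spawn_link(fun() -> receive after infinity -> ok end end),
    register(tree_worker, Pid),
    {ok, Pid}.

%% Top level: supervises the lower supervisor. A child of type
%% supervisor gets shutdown = infinity so it can stop its own subtree.
init(top) ->
    Sub = {tree_sub, {?MODULE, start_sub, []},
           permanent, infinity, supervisor, [?MODULE]},
    {ok, {{one_for_one, 3, 10}, [Sub]}};
%% Lower level: tolerates only 1 worker restart per 5 seconds.
init(sub) ->
    Worker = {tree_worker, {?MODULE, start_worker, []},
              permanent, 2000, worker, [?MODULE]},
    {ok, {{one_for_one, 1, 5}, [Worker]}}.

%% Crash the worker twice in quick succession. The second crash
%% exceeds the lower supervisor's limit, so it exits and the top
%% supervisor restarts the whole subtree.
escalate() ->
    Top = whereis(tree_top),
    Sub = whereis(tree_sub),
    crash_worker(),
    crash_worker(),
    {Top =:= whereis(tree_top), Sub =/= whereis(tree_sub)}.

crash_worker() ->
    exit(whereis(tree_worker), kill),
    timer:sleep(100).  % crude: wait for the restart to finish
```

After tree_demo:start_link(), calling tree_demo:escalate() should return {true, true}: the top supervisor is the same process as before, while the lower supervisor has been replaced along with its worker.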
