Apache Storm: Spouts, Bolts, and Topologies
Nathan Marz and the team at BackType (acquired by Twitter in 2011) created Storm to solve the "real-time" problem: processing unbounded streams of data with low latency. In Storm, you don't run jobs that finish; you deploy "Topologies" that run forever (or until you kill them).
Spouts: The Source
A Spout is a source of streams. It typically reads from Kafka or Kestrel and emits tuples into the topology.
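public void nextTuple() {
    // `source` stands in for your queue client (e.g. a Kafka or Kestrel consumer); it is not part of Storm.
    String line = source.readFromSource();
    if (line != null) {
        // Emitted without a message ID, so Storm will not track or replay this tuple.
        _collector.emit(new Values(line));
    }
}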
Bolts: The Processor
Bolts process the input streams and produce new streams. They do everything: filtering, functions, aggregations, and talking to databases.
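public void execute(Tuple tuple) {
    String word = tuple.getString(0);
    // `counts` is an in-memory Map<String, Integer> field; it is lost if the worker restarts.
    Integer count = counts.get(word);
    if (count == null) count = 0;
    count++;
    counts.put(word, count);
    // Anchor the new tuple to the input so a downstream failure can trigger a replay from the spout.
    _collector.emit(tuple, new Values(word, count));
    // Ack last, once this bolt has finished its work on the input tuple.
    _collector.ack(tuple);
}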
Reliability and Acker
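One of Storm's coolest features is its reliability mechanism. By "acking" a tuple, you tell Storm that the message was processed successfully. If a bolt fails a tuple, or the tuple's tree is not fully acked before the message timeout expires, Storm calls fail() on the spout that emitted the original tuple, and the spout can replay it. This only works for spout tuples emitted with a message ID and for bolt output that is anchored to its input.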
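To make the replay idea concrete, here is a minimal sketch of the bookkeeping a reliable spout might keep. It is plain Java and does not use the Storm API: the class name `ReplaySketch` and its methods are made up for illustration, and they only mirror the roles of a spout's nextTuple(), ack(msgId), and fail(msgId) callbacks. A real spout would more likely leave failed messages in Kafka or Kestrel and re-read them from there.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy stand-in for a reliable spout's bookkeeping; hypothetical, not Storm code. */
class ReplaySketch {
    private final Queue<String> source = new ArrayDeque<>();   // stands in for the external queue
    private final Map<Long, String> pending = new HashMap<>(); // emitted but not yet acked
    private long nextId = 0;

    /** Adds a line to the pretend source. */
    void offer(String line) {
        source.add(line);
    }

    /** Mirrors nextTuple(): takes one line, remembers it under a message id, returns the id (or -1 if idle). */
    long nextTuple() {
        String line = source.poll();
        if (line == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, line);
        return id;
    }

    /** Mirrors ack(msgId): the tuple tree completed, so the message can be forgotten. */
    void ack(long id) {
        pending.remove(id);
    }

    /** Mirrors fail(msgId): puts the message back so a later nextTuple() emits it again. */
    void fail(long id) {
        String line = pending.remove(id);
        if (line != null) {
            source.add(line);
        }
    }

    /** Number of messages emitted but neither acked nor failed yet. */
    int pendingCount() {
        return pending.size();
    }
}
```

The replayed message gets a fresh id here, which keeps the sketch simple; the point is only that the spout, not the bolt, owns the copy of the data needed for a retry.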
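This is tracked by an "Acker" bolt that keeps a single 64-bit value per spout tuple and XORs every tuple ID in the tree into it: once when the tuple is created and once when it is acked. Since any value XORed with itself is zero, the running value returns to zero exactly when every tuple in the tree has been acked, and the tree is complete. It's an incredibly elegant way to track millions of messages with roughly 20 bytes of overhead per spout tuple, no matter how large the tree grows.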
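The XOR trick can be sketched in a few lines of plain Java. This is a toy model under simplifying assumptions, not Storm's actual acker: the class name `XorAckSketch`, the `update` method, and the fixed tuple IDs are all made up for illustration (Storm uses random 64-bit IDs, so an accidental zero is extremely unlikely).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of XOR-based tuple-tree tracking; hypothetical, not Storm's acker code. */
class XorAckSketch {
    // One running 64-bit XOR value per spout tuple, keyed by a root id.
    private final Map<Long, Long> pending = new HashMap<>();

    /** XORs a value into the running checksum; creating and acking a tuple are the same operation. */
    void update(long rootId, long value) {
        pending.merge(rootId, value, (a, b) -> a ^ b);
    }

    /** The tree is complete once every id has been XORed in twice, leaving zero. */
    boolean isComplete(long rootId) {
        Long v = pending.get(rootId);
        return v != null && v == 0L;
    }

    public static void main(String[] args) {
        XorAckSketch acker = new XorAckSketch();
        long root = 1L;
        long a = 0x1111L, b = 0x2222L, c = 0x3333L; // fixed ids for a repeatable demo

        acker.update(root, a);          // spout emits A
        acker.update(root, a ^ b ^ c);  // bolt 1 acks A and reports its new children B and C
        acker.update(root, b);          // bolt 2 acks B
        System.out.println(acker.isComplete(root)); // false: C is still outstanding
        acker.update(root, c);          // bolt 2 acks C
        System.out.println(acker.isComplete(root)); // true: every id was XORed in twice
    }
}
```

Note that the tracker never stores the individual tuple IDs, only the running XOR, which is why the memory cost per spout tuple stays constant.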
Storm is the "Hadoop of Real-time," and if you're still doing micro-batches in 2012, you're already behind.