storm
Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more
Top Related Projects
Quick Overview
Apache Storm is a distributed real-time computation system for processing large volumes of data with high fault tolerance and guaranteed data processing. It is designed to handle unbounded streams of data and can be used for real-time analytics, online machine learning, continuous computation, and more.
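The "guaranteed data processing" mentioned above rests on tracking every tuple derived from a spout tuple. Storm's documentation describes acker tasks that keep a 64-bit value per tuple tree, formed by XOR-ing the IDs of tuples as they are created and again as they are acked; the tree is complete when that value returns to zero. The sketch below models only that bookkeeping in plain Java. The class and method names are hypothetical and nothing here is Storm's actual code.

```java
// Hypothetical, Storm-free model of XOR-based ack tracking for one tuple tree.
class AckTrackerSketch {
    // XOR of every tuple ID seen so far (once on emit, once on ack).
    private long ackVal = 0L;

    // Record that a tuple with this ID was emitted into the tree.
    void emitted(long tupleId) {
        ackVal ^= tupleId;
    }

    // Record that a bolt acked the tuple with this ID.
    void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    // Each ID cancels itself when XOR-ed twice, so zero means nothing is pending.
    // (A tracker that has seen no tuples also reports true.)
    boolean fullyProcessed() {
        return ackVal == 0L;
    }

    public static void main(String[] args) {
        AckTrackerSketch tracker = new AckTrackerSketch();
        tracker.emitted(0x1111L);
        tracker.emitted(0x2222L);
        tracker.acked(0x1111L);
        System.out.println(tracker.fullyProcessed()); // false: one tuple still pending
        tracker.acked(0x2222L);
        System.out.println(tracker.fullyProcessed()); // true: every emit was acked
    }
}
```

The appeal of this design is constant memory per spout tuple regardless of how large the tuple tree grows. The sketch uses small fixed IDs for readability; the real system uses random 64-bit IDs to make accidental cancellation very unlikely.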
Pros
- Scalable and fault-tolerant architecture
- Low latency processing with high throughput
- Supports multiple programming languages
- Easy to set up and operate
Cons
- Steep learning curve for complex topologies
- Limited built-in support for stateful processing
- Can be resource-intensive for large-scale deployments
- Requires careful tuning for optimal performance
Code Examples
- Creating a basic topology:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-spout", new RandomSentenceSpout());
builder.setBolt("word-splitter", new SplitSentenceBolt()).shuffleGrouping("word-spout");
builder.setBolt("word-counter", new WordCountBolt()).fieldsGrouping("word-splitter", new Fields("word"));
- Defining a custom bolt:
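import java.util.HashMap;
import java.util.Map;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountBolt extends BaseBasicBolt {
    // Running count per word, local to this bolt task
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null)
            count = 0;
        count++;
        counts.put(word, count);
        // Emit the updated count downstream
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}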
- Submitting a topology to a Storm cluster:
Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(3);
StormSubmitter.submitTopology("word-count-topology", conf, builder.createTopology());
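The word-count examples above depend on fields grouping: every occurrence of a given word must reach the same counter task, or the per-task counts would be split. The following sketch models that routing in plain Java with no Storm dependency. The class name `GroupingSketch`, its methods, and the hash-modulo rule are assumptions chosen for illustration, not Storm's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Storm-free model of the split -> count stages of the topology.
class GroupingSketch {

    // Simplified fields grouping: equal words always map to the same task index.
    static int taskFor(String word, int numTasks) {
        return Math.floorMod(word.hashCode(), numTasks);
    }

    // Splits each sentence into words and routes every word to one of numTasks counters.
    static List<Map<String, Integer>> run(List<String> sentences, int numTasks) {
        List<Map<String, Integer>> perTaskCounts = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            perTaskCounts.add(new HashMap<>());
        }
        for (String sentence : sentences) {
            for (String word : sentence.split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // The chosen task keeps its own local count, like WordCountBolt does.
                perTaskCounts.get(taskFor(word, numTasks)).merge(word, 1, Integer::sum);
            }
        }
        return perTaskCounts;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> counts =
                run(Arrays.asList("the cow jumped", "the moon"), 3);
        for (int task = 0; task < counts.size(); task++) {
            System.out.println("task " + task + " -> " + counts.get(task));
        }
    }
}
```

With shuffle grouping, by contrast, tuples are spread across tasks without regard to content, which suits the stateless splitter but would scatter a word's occurrences across several counters.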
Getting Started
- Add Storm dependency to your project:
<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>2.4.0</version>
</dependency>
- Create a topology with spouts and bolts:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout-id", new YourCustomSpout());
builder.setBolt("bolt-id", new YourCustomBolt()).shuffleGrouping("spout-id");
- Configure and submit the topology:
Config conf = new Config();
conf.setNumWorkers(2);
StormSubmitter.submitTopology("topology-name", conf, builder.createTopology());
Competitor Comparisons
Apache Flink
Pros of Flink
- Higher throughput and lower latency for large-scale data processing
- Built-in support for stateful computations and exactly-once semantics
- More flexible windowing operations and event time processing
Cons of Flink
- Steeper learning curve due to more complex API and concepts
- Less mature ecosystem compared to Storm's long-standing community
- Requires more resources and careful tuning for optimal performance
Code Comparison
Storm topology definition:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12).fieldsGrouping("split", new Fields("word"));
Flink job definition:
DataStream<String> text = env.addSource(new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), properties));
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .sum(1);
Both frameworks offer distributed stream processing capabilities, but Flink provides more advanced features for complex event processing and stateful computations. Storm's simplicity makes it easier to get started, while Flink's power comes with a steeper learning curve. The code examples show the difference in API design, with Flink offering a more declarative approach to stream processing.
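The Flink job above sums counts inside 5-second tumbling windows, which are fixed-size and non-overlapping. As a rough illustration of what that windowing step does, the sketch below assigns each event to a window by rounding its timestamp down to a multiple of the window size. It is plain Java with hypothetical names, takes explicit timestamps instead of processing time, and is not how Flink implements windows internally.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of per-window word counting with tumbling windows.
class TumblingWindowSketch {
    private final long windowSizeMillis;
    // Window start time -> (word -> occurrences inside that window)
    private final Map<Long, Map<String, Integer>> counts = new TreeMap<>();

    TumblingWindowSketch(long windowSizeMillis) {
        this.windowSizeMillis = windowSizeMillis;
    }

    // Start of the window that contains the given timestamp.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    // Adds one occurrence of the word to the window its timestamp falls in.
    void add(String word, long timestampMillis) {
        long start = windowStart(timestampMillis, windowSizeMillis);
        counts.computeIfAbsent(start, k -> new TreeMap<>()).merge(word, 1, Integer::sum);
    }

    // Occurrences of the word in the window beginning at windowStart (0 if none).
    int count(long windowStart, String word) {
        return counts
                .getOrDefault(windowStart, Collections.<String, Integer>emptyMap())
                .getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        TumblingWindowSketch windows = new TumblingWindowSketch(5000);
        windows.add("storm", 1000);
        windows.add("storm", 4999);
        windows.add("storm", 5000); // first event of the next window
        System.out.println(windows.count(0, "storm"));    // 2
        System.out.println(windows.count(5000, "storm")); // 1
    }
}
```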
Apache Spark
A unified analytics engine for large-scale data processing.
Pros of Spark
- Higher-level APIs and better support for batch processing
- More efficient memory usage and faster performance for large-scale data processing
- Wider ecosystem with libraries for machine learning, graph processing, and SQL
Cons of Spark
- Steeper learning curve due to more complex architecture
- Higher resource requirements, especially for smaller datasets
- Less suitable for real-time stream processing compared to Storm
Code Comparison
Storm example:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
Spark example:
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream.outputMode("complete").format("console").start()
Both examples demonstrate basic stream processing, but Spark's code is more concise and uses higher-level abstractions. Storm's topology is explicitly defined, while Spark's processing is more declarative.
Apache Beam
A unified programming model for batch and streaming data processing.
Pros of Beam
- Unified programming model for batch and streaming data processing
- Supports multiple execution engines (Flink, Spark, Dataflow)
- Extensive set of built-in transforms and I/O connectors
Cons of Beam
- Steeper learning curve due to more complex API
- Less mature ecosystem compared to Storm
- Potentially higher resource requirements for small-scale applications
Code Comparison
Storm:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
Beam:
Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("input.txt"))
    .apply(ParDo.of(new ExtractWordsFn()))
    .apply(Count.<String>perElement())
    .apply(MapElements.into(TypeDescriptors.strings()).via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
    .apply(TextIO.write().to("output.txt"));
// Nothing executes until the pipeline is run
p.run().waitUntilFinish();
Both Storm and Beam offer powerful distributed data processing capabilities, but they cater to different use cases. Storm excels in real-time stream processing with low latency, while Beam provides a more versatile approach for both batch and streaming scenarios across multiple execution engines.
Apache NiFi
Pros of NiFi
- More user-friendly with a web-based UI for designing and managing data flows
- Supports a wider range of data formats and protocols out-of-the-box
- Better suited for ETL and data integration tasks
Cons of NiFi
- Generally slower processing speed compared to Storm
- Less suitable for real-time stream processing at massive scale
- Steeper learning curve for complex data flow configurations
Code Comparison
NiFi uses a declarative approach with XML-based flow configurations:
<processor>
    <id>abc123</id>
    <name>GenerateFlowFile</name>
    <position x="0" y="0"/>
    <config>
        <properties>
            <entry>
                <key>File Size</key>
                <value>1 MB</value>
            </entry>
        </properties>
    </config>
</processor>
Storm uses Java-based topologies with a more programmatic approach:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8)
.shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12)
.fieldsGrouping("split", new Fields("word"));
Both projects have their strengths, with NiFi excelling in data integration and Storm in high-throughput stream processing.
Apache Kafka
Pros of Kafka
- Higher throughput and better scalability for large-scale data streaming
- Built-in partitioning and replication for fault tolerance and high availability
- Longer data retention capabilities, allowing for replay and batch processing
Cons of Kafka
- More complex setup and configuration compared to Storm
- Fewer real-time processing capabilities, as it's primarily designed for data streaming
- Limited built-in processing functionality, often requiring additional tools for data transformation
Code Comparison
Storm (Topology definition):
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
Kafka (Producer example):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("topic", "key", "value"));
Both Storm and Kafka are powerful tools for distributed data processing, but they serve different primary purposes. Storm excels in real-time stream processing, while Kafka is optimized for high-throughput data streaming and storage.
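The replay capability listed among Kafka's strengths follows from its storage model: records are appended to a log and addressed by offset, and reading does not remove them. The sketch below is a minimal in-memory model of that idea in plain Java. The class `ReplayLogSketch` and its methods are hypothetical, and it omits the partitioning, replication, and retention limits that Kafka itself provides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of an append-only log addressed by offset.
class ReplayLogSketch {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns the offset it was stored at.
    long append(String value) {
        records.add(value);
        return records.size() - 1;
    }

    // Returns a copy of all records from the offset onward; nothing is consumed.
    List<String> readFrom(long offset) {
        return new ArrayList<>(records.subList((int) offset, records.size()));
    }

    public static void main(String[] args) {
        ReplayLogSketch log = new ReplayLogSketch();
        log.append("a");
        log.append("b");
        log.append("c");
        System.out.println(log.readFrom(1)); // [b, c]
        System.out.println(log.readFrom(0)); // [a, b, c], the full history is still there
    }
}
```

A stream processor such as Storm can use this property to recover from failures by re-reading from an earlier offset rather than relying on the source to resend data.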
README
IMPORTANT NOTE!!!
Storm has moved to Apache. The official Storm git repository is now hosted by Apache and is mirrored on GitHub here:
https://github.com/apache/incubator-storm
Contributing
Source code contributions can be submitted either by submitting a pull request or by creating an issue in JIRA and attaching patches.
Migrating Git Repos from nathanmarz/storm to apache/incubator-storm
If you have an existing fork/clone of nathanmarz/storm, you can migrate to apache/incubator-storm by doing the following:
- Create a new fork of apache/incubator-storm
- Point your existing clone to the new fork:
git remote remove origin
git remote add origin git@github.com:username/incubator-storm.git
Issue Tracking
The official issue tracker for Storm is Apache JIRA:
https://issues.apache.org/jira/browse/STORM
User Mailing List
Storm users should send messages and subscribe to user@storm.incubator.apache.org.
You can subscribe to this list by sending an email to user-subscribe@storm.incubator.apache.org. Likewise, you can cancel a subscription by sending an email to user-unsubscribe@storm.incubator.apache.org.
You can view the archives of the mailing list here.
Developer Mailing List
Storm developers should send messages and subscribe to dev@storm.incubator.apache.org.
You can subscribe to this list by sending an email to dev-subscribe@storm.incubator.apache.org. Likewise, you can cancel a subscription by sending an email to dev-unsubscribe@storm.incubator.apache.org.
You can view the archives of the mailing list here.
Which list should I send/subscribe to?
If you are using a pre-built binary distribution of Storm, then chances are you should send questions, comments, Storm-related announcements, etc. to user@storm.incubator.apache.org.
If you are building Storm from source, developing new features, or otherwise hacking on the Storm source code, then dev@storm.incubator.apache.org is more appropriate.
What will happen with storm-user@googlegroups.com?
All existing messages will remain archived there, and can be accessed/searched here.
New messages sent to storm-user@googlegroups.com will either be rejected/bounced or replied to with a message to direct the email to the appropriate Apache-hosted group.