storm

Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more

8,815

1,663

Top Related Projects

spark

42,015

Apache Spark - A unified analytics engine for large-scale data processing

beam

8,228

Apache Beam is a unified programming model for Batch and Streaming data processing.

Quick Overview

Apache Storm is a distributed real-time computation system for processing large volumes of data with high fault tolerance and guaranteed data processing. It is designed to handle unbounded streams of data and can be used for real-time analytics, online machine learning, continuous computation, and more.

Pros

Scalable and fault-tolerant architecture
Low latency processing with high throughput
Supports multiple programming languages
Easy to set up and operate

Cons

Steep learning curve for complex topologies
Limited built-in support for stateful processing
Can be resource-intensive for large-scale deployments
Requires careful tuning for optimal performance

Code Examples

Creating a basic topology:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-spout", new RandomSentenceSpout());
builder.setBolt("word-splitter", new SplitSentenceBolt()).shuffleGrouping("word-spout");
builder.setBolt("word-counter", new WordCountBolt()).fieldsGrouping("word-splitter", new Fields("word"));
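
The reason the counter is wired with fieldsGrouping on "word" can be shown with a small plain-Java sketch. It assumes a simple hash-modulo rule and does not use Storm classes; Storm's internal routing may differ in detail, and the class name, method name, and task count here are hypothetical. What matters is the property that equal field values always reach the same task, so each word's count lives in exactly one place.

```java
/** Conceptual sketch of fields grouping; this is not Storm's actual routing code. */
final class FieldsGroupingSketch {

    /** Picks a task index for a grouping-field value; equal values always map to the same index. */
    static int taskFor(String fieldValue, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even when hashCode() is negative.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4; // hypothetical parallelism of the "word-counter" bolt
        String[] words = {"storm", "bolt", "storm", "spout", "bolt"};
        for (String word : words) {
            // Repeated words print the same task index every time.
            System.out.println(word + " -> task " + taskFor(word, numTasks));
        }
    }
}
```

With shuffleGrouping, by contrast, tuples are spread across tasks without regard to their content, which suits the stateless splitter but would scatter the counts for a single word across several counter tasks.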

Defining a custom bolt:

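import java.util.HashMap;
import java.util.Map;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountBolt extends BaseBasicBolt {
    // Running count per word, held in memory by this bolt task only.
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // The upstream bolt emits a single field containing the word.
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null)
            count = 0;
        count++;
        counts.put(word, count);
        // Emit the word together with its updated count.
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Field names must match the order of values passed to emit().
        declarer.declare(new Fields("word", "count"));
    }
}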
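
To see what WordCountBolt would emit for a given input without running a cluster, the counting logic can be exercised in isolation. The following is a minimal sketch in plain Java with no Storm classes; the class name RunningWordCount and the sample sentence are hypothetical, and Map.merge is used as a compact equivalent of the get/increment/put sequence.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java stand-in for the counting logic of a word-count bolt (no Storm classes). */
final class RunningWordCount {
    private final Map<String, Integer> counts = new HashMap<>();

    /** Records one occurrence of the word and returns its updated count. */
    int add(String word) {
        // merge() stores 1 for a new word, otherwise adds 1 to the stored count.
        return counts.merge(word, 1, Integer::sum);
    }

    public static void main(String[] args) {
        RunningWordCount counter = new RunningWordCount();
        for (String word : "the cow jumped over the moon".split(" ")) {
            // Each line corresponds to one (word, count) pair the bolt would emit.
            System.out.println(word + " " + counter.add(word));
        }
    }
}
```

Because the map lives in the memory of a single task, the counts are lost if that task restarts; durable counts would need external storage.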

Submitting a topology to a Storm cluster:

Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(3);

StormSubmitter.submitTopology("word-count-topology", conf, builder.createTopology());

Getting Started

Add the Storm dependency to your project (Maven):

<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>2.4.0</version>
</dependency>

Create a topology with spouts and bolts:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout-id", new YourCustomSpout());
builder.setBolt("bolt-id", new YourCustomBolt()).shuffleGrouping("spout-id");

Configure and submit the topology:

Config conf = new Config();
conf.setNumWorkers(2);
StormSubmitter.submitTopology("topology-name", conf, builder.createTopology());

Competitor Comparisons

flink

25,110

Apache Flink

Pros of Flink

Higher throughput and lower latency for large-scale data processing
Built-in support for stateful computations and exactly-once semantics
More flexible windowing operations and event time processing

Cons of Flink

Steeper learning curve due to more complex API and concepts
Less mature ecosystem compared to Storm's long-standing community
Requires more resources and careful tuning for optimal performance

Code Comparison

Storm topology definition:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12).fieldsGrouping("split", new Fields("word"));

Flink job definition:

DataStream<String> text = env.addSource(new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), properties));
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .sum(1);

Both frameworks offer distributed stream processing capabilities, but Flink provides more advanced features for complex event processing and stateful computations. Storm's simplicity makes it easier to get started, while Flink's power comes with a steeper learning curve. The code examples show the difference in API design, with Flink offering a more declarative approach to stream processing.
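
The tumbling window in the Flink snippet groups events into fixed, non-overlapping 5-second intervals. As a rough illustration of that idea only, the plain-Java sketch below assigns each timestamp to the window that contains it and counts events per window. It omits keying by word and does not use Flink classes; the class name, method names, and sample timestamps are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Conceptual sketch of tumbling-window assignment; it does not use Flink classes. */
final class TumblingWindowSketch {

    /** Start (inclusive) of the fixed-size window that contains the timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Counts events per window; keys are window start times in milliseconds. */
    static Map<Long, Integer> countPerWindow(long[] timestampsMillis, long sizeMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMillis) {
            counts.merge(windowStart(ts, sizeMillis), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] arrivals = {0L, 1200L, 4999L, 5000L, 7300L, 10001L};
        // prints {0=3, 5000=2, 10000=1}
        System.out.println(countPerWindow(arrivals, 5000L));
    }
}
```

In the Storm topology shown earlier there is no such window: the counter bolt emits an updated running total for every incoming word.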

spark

42,015

Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

Higher-level APIs and better support for batch processing
More efficient memory usage and faster performance for large-scale data processing
Wider ecosystem with libraries for machine learning, graph processing, and SQL

Cons of Spark

Steeper learning curve due to more complex architecture
Higher resource requirements, especially for smaller datasets
Higher latency for real-time stream processing than Storm, since Spark's streaming engine processes data in micro-batches by default

Code Comparison

Storm example:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12).fieldsGrouping("split", new Fields("word"));

Spark example:

val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream.outputMode("complete").format("console").start()

Both examples demonstrate basic stream processing, but Spark's code is more concise and uses higher-level abstractions. Storm's topology is explicitly defined, while Spark's processing is more declarative.

beam

8,228

Apache Beam is a unified programming model for Batch and Streaming data processing.

Pros of Beam

Unified programming model for batch and streaming data processing
Supports multiple execution engines (Flink, Spark, Dataflow)
Extensive set of built-in transforms and I/O connectors

Cons of Beam

Steeper learning curve due to more complex API
Less mature ecosystem compared to Storm
Potentially higher resource requirements for small-scale applications

Code Comparison

Storm:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12).fieldsGrouping("split", new Fields("word"));

Beam:

Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("input.txt"))
 .apply(ParDo.of(new ExtractWordsFn()))
 .apply(Count.<String>perElement())
 .apply(MapElements.into(TypeDescriptors.strings()).via(kv -> kv.getKey() + ": " + kv.getValue()))
 .apply(TextIO.write().to("output.txt"));
// Nothing executes until the pipeline is run.
p.run().waitUntilFinish();

Both Storm and Beam offer powerful distributed data processing capabilities, but they cater to different use cases. Storm excels in real-time stream processing with low latency, while Beam provides a more versatile approach for both batch and streaming scenarios across multiple execution engines.

nifi

5,494

Apache NiFi

Pros of NiFi

More user-friendly with a web-based UI for designing and managing data flows
Supports a wider range of data formats and protocols out-of-the-box
Better suited for ETL and data integration tasks

Cons of NiFi

Generally slower processing speed compared to Storm
Less suitable for real-time stream processing at massive scale
Steeper learning curve for complex data flow configurations

Code Comparison

NiFi uses a declarative approach with XML-based flow configurations:

<processor>
  <id>abc123</id>
  <name>GenerateFlowFile</name>
  <position x="0" y="0"/>
  <config>
    <properties>
      <entry>
        <key>File Size</key>
        <value>1 MB</value>
      </entry>
    </properties>
  </config>
</processor>

Storm uses Java-based topologies with a more programmatic approach:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8)
       .shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12)
       .fieldsGrouping("split", new Fields("word"));

Both projects have their strengths, with NiFi excelling in data integration and Storm in high-throughput stream processing.

kafka

30,603

Mirror of Apache Kafka

Pros of Kafka

Higher throughput and better scalability for large-scale data streaming
Built-in partitioning and replication for fault tolerance and high availability
Longer data retention capabilities, allowing for replay and batch processing

Cons of Kafka

More complex setup and configuration compared to Storm
Fewer built-in real-time processing capabilities, since it is primarily designed for transporting and storing data streams
Limited built-in processing functionality, often requiring additional tools for data transformation

Code Comparison

Storm (Topology definition):

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12).fieldsGrouping("split", new Fields("word"));

Kafka (Producer example):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("topic", "key", "value"));

Both Storm and Kafka are powerful tools for distributed data processing, but they serve different primary purposes. Storm excels in real-time stream processing, while Kafka is optimized for high-throughput data streaming and storage.

README

IMPORTANT NOTE!!!

Storm has moved to Apache. The official Storm Git repository is now hosted by Apache and is mirrored on GitHub here:

https://github.com/apache/incubator-storm

Contributing

Source code contributions can be submitted either by submitting a pull request or by creating an issue in JIRA and attaching patches.

Migrating Git Repos from nathanmarz/storm to apache/incubator-storm

If you have an existing fork/clone of nathanmarz/storm, you can migrate to apache/incubator-storm by doing the following:

Create a new fork of apache/incubator-storm

Point your existing clone to the new fork:

 git remote remove origin
 git remote add origin git@github.com:username/incubator-storm.git

Issue Tracking

The official issue tracker for Storm is Apache JIRA:

https://issues.apache.org/jira/browse/STORM

User Mailing List

Storm users should send messages and subscribe to user@storm.incubator.apache.org.

You can subscribe to this list by sending an email to user-subscribe@storm.incubator.apache.org. Likewise, you can cancel a subscription by sending an email to user-unsubscribe@storm.incubator.apache.org.

You can view the archives of the mailing list here.

Developer Mailing List

Storm developers should send messages and subscribe to dev@storm.incubator.apache.org.

You can subscribe to this list by sending an email to dev-subscribe@storm.incubator.apache.org. Likewise, you can cancel a subscription by sending an email to dev-unsubscribe@storm.incubator.apache.org.

You can view the archives of the mailing list here.

Which list should I send/subscribe to?

If you are using a pre-built binary distribution of Storm, then chances are you should send questions, comments, Storm-related announcements, etc. to user@storm.incubator.apache.org.

If you are building Storm from source, developing new features, or otherwise hacking on the Storm source code, then dev@storm.incubator.apache.org is more appropriate.

What will happen with storm-user@googlegroups.com?

All existing messages will remain archived there, and can be accessed/searched here.

New messages sent to storm-user@googlegroups.com will either be rejected/bounced or replied to with a message to direct the email to the appropriate Apache-hosted group.
