manuzhang/awesome-streaming

a curated list of awesome streaming frameworks, applications, etc

Top Related Projects

  • awesome-spark - A curated list of awesome Apache Spark packages and resources.
  • interviews - Everything you need to know to get the job.
  • system-design-primer - Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards.
  • awesome - 😎 Awesome lists about all kinds of interesting topics
  • awesome-python - An opinionated list of awesome Python frameworks, libraries, software and resources.
  • awesome-go - A curated list of awesome Go frameworks, libraries and software

Quick Overview

Awesome-streaming is a curated list of awesome streaming frameworks, applications, and resources. It serves as a comprehensive collection of tools, libraries, and platforms related to stream processing and real-time data analytics. The repository aims to be a valuable resource for developers, data engineers, and anyone interested in streaming technologies.

Pros

  • Extensive collection of streaming-related resources in one place
  • Regularly updated with new entries and contributions from the community
  • Well-organized into categories, making it easy to find specific types of tools or resources
  • Includes both open-source and commercial solutions, providing a broad overview of the streaming ecosystem

Cons

  • May be overwhelming for beginners due to the large number of resources listed
  • Some listed projects or resources may become outdated or inactive over time
  • Lacks detailed comparisons or recommendations between different tools
  • Primarily focuses on listing resources rather than providing in-depth explanations or tutorials

Note: As this is not a code library, the code example and quick start sections have been omitted.

Competitor Comparisons

A curated list of awesome Apache Spark packages and resources.

Pros of awesome-spark

  • More focused and comprehensive coverage of Apache Spark ecosystem
  • Better organization with clear categories for different Spark components
  • Includes Spark-specific resources like books, courses, and conferences

Cons of awesome-spark

  • Limited scope, only covers Apache Spark-related topics
  • May not include broader streaming technologies outside of Spark Streaming
  • Less frequent updates compared to awesome-streaming

Code Comparison

While both repositories are curated lists and don't contain actual code, here's an example of how they might differ in structure:

awesome-spark:

## Official Resources
- [Apache Spark Homepage](http://spark.apache.org/)
- [Apache Spark Documentation](http://spark.apache.org/docs/latest/)

## External Resources
- [Spark Packages](https://spark-packages.org/)

awesome-streaming:

## Streaming Engine

### Apache Spark Streaming

- [Apache Spark Streaming](https://spark.apache.org/streaming/)
- [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html)

Both repositories serve as valuable resources for developers interested in streaming technologies. awesome-spark provides a deep dive into the Apache Spark ecosystem, while awesome-streaming offers a broader overview of various streaming technologies, including but not limited to Spark Streaming.

Everything you need to know to get the job.

Pros of interviews

  • Focused on interview preparation and coding challenges
  • Includes solutions to common algorithmic problems
  • Provides a comprehensive list of computer science topics for study

Cons of interviews

  • Limited to interview-specific content, not covering broader streaming technologies
  • May not be as frequently updated as awesome-streaming
  • Lacks information on real-world applications and industry trends

Code comparison

interviews:

public ListNode reverseList(ListNode head) {
    ListNode prev = null;
    while (head != null) {
        ListNode next = head.next;
        head.next = prev;
        prev = head;
        head = next;
    }
    return prev;
}

awesome-streaming:

No direct code examples available. This repository primarily contains links to streaming-related projects and resources.

Summary

interviews is a repository focused on interview preparation, offering solutions to coding challenges and a comprehensive list of computer science topics. It's ideal for those preparing for technical interviews but may not be as relevant for professionals seeking information on streaming technologies.

awesome-streaming, on the other hand, is a curated list of resources related to stream processing and real-time analytics. It provides links to various streaming frameworks, tools, and projects, making it more suitable for developers working with streaming technologies in production environments.

While interviews offers concrete code examples and solutions, awesome-streaming serves as a directory of resources without providing direct code samples. The choice between these repositories depends on whether you're preparing for interviews or seeking information on streaming technologies for practical applications.

Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards.

Pros of system-design-primer

  • Comprehensive coverage of system design concepts and principles
  • Includes interactive coding challenges and exercises
  • Provides visual aids and diagrams to illustrate complex concepts

Cons of system-design-primer

  • Focuses on general system design rather than streaming-specific topics
  • May not cover the latest trends in streaming technologies
  • Less curated list of resources compared to awesome-streaming

Code comparison

While neither repository focuses primarily on code examples, system-design-primer does include some code snippets for illustration:

system-design-primer:

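from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # insertion order tracks recency

    def get(self, key):
        if key not in self.cache:
            return -1  # cache miss
        val = self.cache[key]
        self.cache.move_to_end(key)  # mark as most recently used
        return val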
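
The `LRUCache` snippet above shows only the read path. As a minimal sketch of how a matching write path could look, the version below adds a `put` method that evicts the least recently used entry once the capacity is exceeded. The method name `put` and the eviction details are assumptions for illustration and are not taken from system-design-primer.

```python
from collections import OrderedDict


class LRUCache:
    """Illustrative LRU cache; `put` is an assumed name, not from the source."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # oldest entry first, newest entry last

    def get(self, key):
        if key not in self.cache:
            return -1  # cache miss
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)  # refresh recency on update
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
```

Reading a key counts as a use, so an entry that was just read survives the next eviction.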

awesome-streaming doesn't typically include code snippets, as it's primarily a curated list of resources.

Summary

system-design-primer is a comprehensive guide to system design principles with interactive elements and visual aids. It covers a broad range of topics but may not delve deeply into streaming-specific concepts. awesome-streaming, on the other hand, is a curated list of resources focused specifically on streaming technologies, providing links to relevant tools, frameworks, and articles in the streaming domain.

😎 Awesome lists about all kinds of interesting topics

Pros of awesome

  • Much broader scope, covering a wide range of topics and technologies
  • Larger community with more contributors and frequent updates
  • Well-organized structure with clear categories and subcategories

Cons of awesome

  • Can be overwhelming due to the sheer volume of information
  • Less focused on specific domains, making it harder to find niche resources
  • May include some outdated or less relevant links due to its size

Code comparison

Not applicable for these repositories as they are curated lists without significant code content.

Key differences

awesome-streaming:

  • Focused specifically on streaming data processing and related technologies
  • Provides a more curated and targeted list for streaming enthusiasts
  • Easier to navigate for those specifically interested in streaming topics

awesome:

  • Acts as a meta-list, linking to many other "awesome" lists
  • Covers a vast array of topics beyond just technology
  • Serves as a starting point for exploring various fields and interests

Both repositories serve valuable purposes, with awesome-streaming offering a deep dive into streaming technologies, while awesome provides a broader overview of numerous topics and resources across different domains.

An opinionated list of awesome Python frameworks, libraries, software and resources.

Pros of awesome-python

  • Broader scope, covering the entire Python ecosystem
  • Larger community and more frequent updates
  • More comprehensive categorization of resources

Cons of awesome-python

  • Less focused on a specific domain (streaming)
  • May be overwhelming for users looking for streaming-specific resources
  • Requires more filtering to find relevant tools for streaming applications

Code Comparison

While both repositories are curated lists and don't contain actual code, here's an example of how they might differ in their markdown structure:

awesome-python:

## Web Frameworks
*Full stack web frameworks.*

* [Django](https://www.djangoproject.com/) - The most popular web framework in Python.
* [Flask](https://flask.palletsprojects.com/) - A lightweight WSGI web application framework.

awesome-streaming:

## Stream Processing Frameworks
* [Apache Apex](https://apex.apache.org/) - Enterprise-grade unified stream and batch processing engine.
* [Apache Flink](https://flink.apache.org/) - System for high-throughput, low-latency data stream processing.

Both repositories use similar markdown structures, but awesome-python tends to have more detailed categorizations and descriptions, while awesome-streaming focuses specifically on streaming-related technologies.

A curated list of awesome Go frameworks, libraries and software

Pros of awesome-go

  • Broader scope, covering the entire Go ecosystem
  • Larger community and more frequent updates
  • More comprehensive categorization of resources

Cons of awesome-go

  • Less focused on a specific domain (streaming)
  • May be overwhelming for users looking for streaming-specific resources
  • Requires more time to navigate and find relevant information

Code comparison

While both repositories are curated lists and don't contain actual code, here's an example of how they structure their content:

awesome-go:

## Database

*Databases implemented in Go.*

* [BadgerDB](https://github.com/dgraph-io/badger) - Fast key-value DB in Go.
* [BigCache](https://github.com/allegro/bigcache) - Efficient key/value cache for gigabytes of data.

awesome-streaming:

## Streaming Library

* [Apache Flink](https://flink.apache.org/) [Java] - stateful computations over data streams.
* [Apache Samza](http://samza.apache.org/) [Scala/Java] - distributed stream processing framework.

awesome-go provides a more extensive list of resources across various Go-related topics, while awesome-streaming focuses specifically on streaming technologies across different programming languages. The choice between the two depends on whether you're looking for Go-specific resources or streaming-specific information.

README

Awesome Streaming

A curated list of awesome streaming (stream processing) frameworks, applications, readings and other resources. Inspired by other awesome projects.

Website

https://manuzhang.github.io/awesome-streaming/ is a more dynamic website where you can find updates of the awesome projects here.

Streaming Engine

  • Apache Apex [Java] - unified platform for big data stream and batch processing.
  • Apache Ballista [Rust] - distributed compute platform powered by Apache Arrow.
  • Apache Flink [Java] - system for high-throughput, low-latency data stream processing that supports stateful computation, data-driven windowing semantics and iterative stream processing.
  • Apache Heron (incubating) [Java] - a realtime, distributed, fault-tolerant stream processing engine from Twitter.
  • Apache Samza [Scala/Java] - distributed stream processing framework that builds on Kafka (messaging, storage) and YARN (fault tolerance, processor isolation, security and resource management).
  • Apache Spark Streaming [Scala] - makes it easy to build scalable fault-tolerant streaming applications.
  • Apache Storm [Clojure/Java] - distributed real-time computation system. Storm is to stream processing what Hadoop is to batch processing.
  • AthenaX [Java] - Uber's Stream Analytics Framework used in production
  • Bytewax [Python] - data parallel, distributed, stateful stream processing framework.
  • Faust [Python] - stream processing library, porting the ideas from Kafka Streams to Python
  • Gearpump [Scala] - lightweight real-time distributed streaming engine built on Akka.
  • Hazelcast Jet [Java] - A general purpose distributed data processing engine, built on top of Hazelcast.
  • hailstorm [Haskell] - distributed stream processing with exactly-once semantics based on Storm.
  • Maki Nage [Python] - A stream processing framework for data scientists, based on Kafka and ReactiveX.
  • mantis [Java] - Netflix's platform to build an ecosystem of realtime stream processing applications
  • mupd8(muppet) [Scala/Java] - mapReduce-style framework for processing fast/streaming data.
  • Numaflow [Java/Python/Go/Rust] - Kubernetes native stream processing platform with language agnostic framework. Scalable and cost-efficient
  • Onyx [Clojure] - Distributed, masterless, high performance, fault tolerant data processing.
  • Pathway [Python] - The fastest data processing engine supporting unified workflows for batch, streaming data, and LLM applications.
  • s4 [Java] - general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
  • SABER [Java/C] - Window-Based Hybrid CPU/GPU Stream Processing Engine.
  • Scramjet Cloud Platform [Python/JavaScript/Node.js] - data processing engine for running multiple data processing apps (sequences) written in Python, JavaScript or TypeScript
  • SPQR [Java] - dynamic framework for processing high volume data streams through pipelines.
  • tigon [C++/Java] - high throughput real-time streaming processing framework built on Hadoop and HBase.
  • Teknek [Java] - Simple elegant stream processing with interactive prototyping shell SOL (Stream Operator Language) Mesos, designed for high performance data processing jobs that require flexibility & control.
  • Trill [.NET/C#] - Trill is a high-performance one-pass in-memory streaming analytics engine from Microsoft Research.
  • Wallaroo [Python] - A fast, stream-processing framework. Wallaroo makes it easy to react to data in real-time. By eliminating infrastructure complexity, going from prototype to production has never been simpler.
  • LightSaber [C++] - Multi-core Window-Based Stream Processing Engine. LightSaber uses code generation for efficient window aggregation.
  • HStreamDB [Haskell] - The streaming database built for IoT data storage and real-time processing.
  • Kuiper [Golang] - A lightweight IoT data analytics/streaming software for the edge, implemented in Go, that can run on all kinds of resource-constrained edge devices.
  • WindFlow [C++] - A C++17 Data Stream Processing Parallel Library for Multicores and GPUs.
  • RisingWave [Rust] - A PostgreSQL-compatible streaming database that is designed to build event-driven applications, real-time ETL pipelines, continuous analytics services, and feature stores for AI applications. It excels in extracting fresh and consistent insights from real-time event streams, database CDC, and time series data within sub-seconds. It unifies streaming and batch processing, enabling users to ingest, join, and analyze both live and historical data at a cloud scale.
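
Several engines in this list are described in terms of windowed, stateful computation. As a rough illustration of that idea only, the sketch below counts events per key in fixed (tumbling) time windows using plain Python. The `Event` type, the `tumbling_window_counts` function, and the 10-second window are assumptions made for this example and do not correspond to the API of any engine listed above.

```python
from collections import Counter, defaultdict
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    timestamp: float  # event time in seconds
    key: str


def tumbling_window_counts(events: Iterable[Event], size: float) -> dict:
    """Count events per key within non-overlapping windows of `size` seconds."""
    windows = defaultdict(Counter)
    for event in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = (event.timestamp // size) * size
        windows[window_start][event.key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


if __name__ == "__main__":
    sample = [
        Event(1.0, "click"),
        Event(4.5, "view"),
        Event(9.9, "click"),
        Event(12.0, "click"),
    ]
    print(tumbling_window_counts(sample, size=10.0))
```

This version consumes a finite batch for simplicity. A real engine would typically update the per-window state incrementally over an unbounded stream and decide when a window is complete.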

Streaming Library

  • Apache Kafka Streams [Java] - lightweight stream processing library included in Apache Kafka (since 0.10 version).
  • Streamiz [C#] - a .Net Stream Processing Library for Apache Kafka
  • Akka Streams [Scala] - stream processing library on Akka Actors.
  • Daggy [C++] - real-time streams aggregation and catching.
  • Benthos [Go] - Benthos is a high performance and resilient message streaming service, able to connect various sources and sinks and perform arbitrary actions, transformations and filters on payloads
  • FS2(prev. 'Scalaz-Stream') [Scala] - Compositional, streaming I/O library for Scala.
  • FastStream [Python] - powerful and easy-to-use Python library simplifying the process of writing producers and consumers for message queues, handling all the parsing, networking and documentation generation automatically. Supports multiple protocols such as Apache Kafka, RabbitMQ and alike.
  • monix [Scala] - high-performance Scala / Scala.js library for composing asynchronous and event-based programs.
  • Quix Streams [Python] - a streaming library originally designed for the McLaren Formula 1 racing team that can process high volumes of time-series data with up to nanosecond precision using Apache Kafka as a message broker.
  • Scramjet Node.js [Node.js] - functional reactive stream programming framework written on top of Node.js object streams + the legacy Scramjet.js version
  • Scramjet Python [Python] - functional reactive stream programming framework written from scratch operating on object, string and buffer streams.
  • Scramjet C++ [C++] - functional reactive stream programming framework written on top of Node.js object streams.
  • Streamline [Java] - Stream Analytics Framework by Hortonworks, designed as a wrapper around existing streaming solutions like Storm. Aimed to allow users to drag-and-drop streaming components to focus on business logic.
  • StreamAlert [Python] - Airbnb's Real-time Data Analysis and Alerting.
  • Swave [Scala] - A lightweight Reactive Streams Infrastructure Toolkit for Scala.
  • Streamz [Python] - A lightweight library for building pipelines to manage continuous streams of data; supports complex pipelines that involve branching, joining, flow control, feedback, back pressure, and so on.
  • Stream Ops [Java] - A fully embeddable data streaming engine and stream processing API for Java.
  • Substation [Go] - Substation is a cloud native data pipeline and transformation toolkit written in Go.
  • SwimOS [Rust] - A framework for building real-time streaming data processing applications written in Rust.
  • Tributary [Python] - A python library for constructing dataflow graphs. Supports synchronous, reactive data streams built using python generators that mimic complex event processors, as well as lazily-evaluated acyclic graphs and functional currying streams.
  • YoMo [Go] - An open source Streaming Serverless Framework for building low-latency geo-distributed systems. YoMo is built atop the QUIC transport protocol and a Functional Reactive Programming interface.
  • Mediapipe - Cross-platform, customizable ML solutions for live and streaming media.

Streaming Application

  • javactrl-kafka [Java] - An application of a stateful stream processing for workflow as Java code (microservices orchestration, business process automation, and more).
  • straw [Python/Java] - A platform for real-time streaming search.
  • storm-crawler [Java] - Web crawler SDK based on Apache Storm.
  • Zilla [Java] - Cross-platform, API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka protocol.

IoT

  • sensorbee [Go] - lightweight stream processing engine for IoT.
  • Apache Edgent [Java] - a programming model and runtime that enables continuous streaming analytics on gateways and edge devices which can work with centralized systems to provide efficient and timely analytics across the whole IoT ecosystem: from the center to the edge, open sourced by IBM.
  • Apache StreamPipes [Java] - a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams.

DSL

  • Apache Beam [Java, Python, SQL, Scala, Go] - unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs), open sourced by Google.
  • coast [Scala] - a DSL that builds DAGs on top of Samza and provides exactly-once semantics.
  • Esper [Java] - component for complex event processing (CEP) and event series analysis.
  • Streamparse [Python] - lets you run Python code against real-time streams of data via Apache Storm.
  • summingbird [Scala] - library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.

Data Pipeline

  • Apache Kafka [Scala/Java] - distributed, partitioned, replicated commit log service, which provides the functionality of a messaging system, but with a unique design.
  • Apache Pulsar [Java] - distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
  • Apache RocketMQ [Java] - distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.
  • AutoMQ [Scala/Java] - cloud-first alternative to Kafka by decoupling durability to S3 and EBS. 100% Kafka compatible. 10x cost-effective. Autoscale in seconds. Single-digit ms latency.
  • brooklin [Java] - a distributed system intended for streaming data between various heterogeneous source and destination systems with high reliability and throughput at scale from LinkedIn (replaced databus).
  • camus [Java] - LinkedIn's Kafka -> HDFS pipeline.
  • databus [Java] - LinkedIn's source-agnostic distributed change data capture system.
  • flume [Java] - distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
  • fluvio [Rust/WASM] - Real-time programmable data streaming platform with in-line computation capabilities.
  • Gazette [golang] - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
  • LogDevice [C++] - a high-performance distributed system by Facebook for streaming and storing sequential data, using a log structure.
  • metaq [Java] - Taobao's highly available, high performance distributed messaging system
  • NATS streaming [Go] - fast disk-backed messaging solution
  • nsq [Go] - realtime distributed messaging platform designed to operate at scale, handling billions of messages per day.
  • Redpanda [C++] - Redpanda is Kafka compatible, ZooKeeper-free, JVM-free and source available.
  • RudderStack [Go] - an open source customer data infrastructure (segment, mparticle alternative).
  • suro [Java] - data pipeline service for collecting, aggregating, and dispatching large volume of application events including log data.
  • StreamSets Data Collector [Java] - continuous big data ingestion infrastructure that reads from and writes to a large number of end-points, including S3, JDBC, Hadoop, Kafka, Cassandra and many others.

Online Machine Learning

  • Apache Samoa [Java] - distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms.
  • DataSketches [Java] - sketches library from Yahoo!.
  • Numalogic (https://github.com/numaproj/numalogic) [Python] - Collection of ML models and libraries for real-time anomaly detection and forecasting on time series data. Built on Numaflow, a K8s native stream processing platform
  • River [Python] - online machine learning library.
  • streamDM [Scala] - mining Big Data streams using Spark Streaming from Huawei.
  • StreamingBandit [Python] - Provides a webserver to quickly setup and evaluate possible solutions to contextual multi-armed bandit (cMAB) problems.
  • StormCV [Java] - enables the use of Apache Storm for video processing by adding computer vision (CV) specific operations and data model.
  • trident-ml [Java] - realtime online machine learning library based on Trident.
  • yurita [Scala] - Anomaly detection framework built on Spark Structured Streaming from Paypal.

Streaming SQL

  • pipelinedb [C] - An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
  • squall [Java] - Squall executes SQL queries on top of Storm for doing online processing.
  • StreamCQL [Java] - Continuous Query Language on RealTime Computation System.
  • ksqlDB [Java] - A cloud-native, source-available database purpose-built for stream processing applications
  • Materialize [Rust] - A source-available streaming SQL engine for maintaining materialized views on data from message brokers and databases.
  • Siddhi [Java] - A cloud native Streaming and Complex Event Processing engine that understands Streaming SQL queries in order to capture events from diverse data sources, process them, detect complex conditions, and publish output to various endpoints in real time.
  • Proton [C++] - A unified streaming and historical data analytics database in a single binary, powered by ClickHouse.

Benchmark

  • storm-perf-test [Java] - a simple storm performance/stress test.
  • streaming-benchmarks [Java] - Benchmarks for Low Latency (Streaming) solutions including Apache Storm, Apache Spark, Apache Flink, etc.
  • flotilla [Go] - Automated message queue orchestration for scaled-up benchmarking.

Toolkit

  • akka [Scala] - toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.
  • Apache Pekko [Scala, Java] - Fork of Akka 2.6.x, prior to the Akka project's adoption of the Business Source License.
  • pulsar [Python] - Actor based event driven concurrent framework for Python.
  • aeron [Java/C++] - efficient reliable unicast and multicast message transport.
  • StreamFlow [Java] - stream processing tool designed to help build and monitor processing workflows.
  • samza-luwak [Java] - uses Luwak, a stored-query engine built on Lucene, to implement full-text search on streams.
  • Streamdal [Go/Node.js/Python] - A tool to embed privacy controls in your application code to detect PII as it enters and leaves your systems, preventing it from reaching unintended data streams or pipelines.
  • Turbine [Java] - tool for aggregating streams of Server-Sent Event (SSE) JSON data into a single stream.
  • Nussknacker [Scala] - A visual tool to define and run real-time decision algorithms.

Closed Source

  • Amazon Kinesis Streams [Java] - real-time, fully managed and scalable data stream engine provided by AWS.
  • Azure Stream Analytics [.NET] - a massively scalable, fully managed, real-time, data stream engine provided by Microsoft Azure.
  • Cloud Dataflow [Java, Python, SQL, Scala] - Google's managed stream and batch data processing engine. Supports running Beam pipelines.
  • concord [C++] - a distributed stream processing framework built in C++ on top of Apache.
  • IBM Streams [Python/Java/Scala] - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.
  • jubatus [C++] - distributed processing framework and streaming machine learning library.
  • millwheel - framework for building low-latency data-processing applications that is widely used at Google.
  • NVIDIA Deep Stream [Python/C/C++] - a platform for real-time image, video and audio processing, preferably on edge devices or in the cloud.

Readings

  1. In-Stream Big Data Processing
  2. The world beyond batch: Streaming 101 by Tyler Akidau.
  3. Real Time Analytics: Algorithms and Systems (VLDB 2015)
  4. Grokking Streaming Systems by Josh Fischer & Ning Wang
  5. Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing by Reuven Lax, Slava Chernyak, and Tyler Akidau
  6. Data Pipelines with Apache Airflow by Bas P. Harenslak and Julian Rutger de Ruiter

License

Licensed under a Creative Commons Attribution-ShareAlike 4.0 International License