beam
Apache Beam is a unified programming model for Batch and Streaming data processing.
Top Related Projects
TensorFlow - An Open Source Machine Learning Framework for Everyone
PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
Dask - Parallel computing with task scheduling
Apache Spark - A unified analytics engine for large-scale data processing
Apache Flink
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Quick Overview
Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines, including ETL, batch, and stream processing jobs. It provides a portable API layer for creating data processing workflows that can run on various distributed processing backends, such as Apache Flink, Apache Spark, and Google Cloud Dataflow.
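Portability here is concrete: the pipeline definition stays the same, and the execution backend is selected through pipeline options. A minimal Java sketch (the flag values are illustrative; each runner also needs its artifact on the classpath):
// Pass e.g. --runner=FlinkRunner or --runner=DataflowRunner to switch backends.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);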
Pros
- Unified programming model for batch and stream processing
- Portability across multiple execution engines
- Rich set of built-in transforms and I/O connectors
- Strong community support and active development
Cons
- Steep learning curve for beginners
- Performance overhead compared to native execution engines
- Limited support for some advanced streaming features
- Complexity in debugging and monitoring pipelines
Code Examples
- Reading from a file and writing to another:
// Copy the contents of input.txt to output.txt
Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("input.txt"))
 .apply(TextIO.write().to("output.txt"));
p.run().waitUntilFinish();
- Performing a word count:
PCollection<String> words = p.apply(TextIO.read().from("input.txt"))
.apply(FlatMapElements.into(TypeDescriptors.strings())
.via((String line) -> Arrays.asList(line.split("\\s+"))));
PCollection<KV<String, Long>> wordCounts = words.apply(Count.perElement());
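To materialize the result, wordCounts can be formatted and written out, for example (a sketch; the "counts" output prefix is a placeholder):
wordCounts
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
    .apply(TextIO.write().to("counts"));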
- Windowing and triggering in streaming:
PCollection<KV<String, Integer>> windowed = input
    // 5-minute fixed (tumbling) windows
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(5)))
        // on-time firing when the watermark passes the end of the window
        .triggering(AfterWatermark.pastEndOfWindow()
            // speculative result one minute (processing time) after the first element in a pane
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            // additional firings for data that arrives behind the watermark
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(5))))
        // keep window state around for up to 30 minutes of late data
        .withAllowedLateness(Duration.standardMinutes(30))
        // each firing re-emits the full contents of the window so far
        .accumulatingFiredPanes());
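With accumulatingFiredPanes(), each firing re-emits everything in the window so far, so downstream consumers see progressively more complete results; the alternative, discardingFiredPanes(), emits only the elements that arrived since the last firing, which avoids double counting when downstream stages sum the panes.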
Getting Started
To get started with Apache Beam, follow these steps:
- Add the Beam SDK to your project (Maven example):
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.41.0</version>
</dependency>
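A runner must also be on the classpath at execution time; for local runs this is typically the beam-runners-direct-java artifact with a matching version.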
- Create a simple pipeline:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
public class SimplePipeline {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);
    // Copy input.txt to output.txt as a minimal end-to-end test
    p.apply(TextIO.read().from("input.txt"))
     .apply(TextIO.write().to("output.txt"));
    p.run().waitUntilFinish();
  }
}
- Run the pipeline using a local runner or a distributed execution engine of your choice.
Competitor Comparisons
TensorFlow - An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- More comprehensive machine learning ecosystem with tools for model development, training, and deployment
- Stronger support for deep learning and neural networks
- Larger community and more extensive documentation
Cons of TensorFlow
- Steeper learning curve, especially for beginners
- Can be overkill for simpler data processing tasks
- Less flexible for general-purpose data processing compared to Beam
Code Comparison
TensorFlow example (basic neural network):
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
Beam example (basic data processing pipeline):
import apache_beam as beam
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('input.txt')
    counts = (lines
              | beam.FlatMap(lambda x: x.split())
              | beam.Map(lambda x: (x, 1))
              | beam.CombinePerKey(sum))
    counts | beam.io.WriteToText('output.txt')
Both TensorFlow and Beam are powerful tools, but they serve different purposes. TensorFlow excels in machine learning tasks, particularly deep learning, while Beam is more suited for large-scale data processing and ETL operations. The choice between them depends on the specific requirements of your project.
PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- More user-friendly and intuitive API for deep learning tasks
- Dynamic computational graph allows for easier debugging and flexibility
- Stronger community support and ecosystem for AI/ML applications
Cons of PyTorch
- Less suitable for large-scale data processing and ETL workflows
- Limited support for distributed computing compared to Beam
- Narrower focus on deep learning, while Beam is more versatile for general data processing
Code Comparison
PyTorch example (neural network definition):
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # e.g. a flattened 28x28 input mapped to 10 class logits
        return self.fc2(torch.relu(self.fc1(x)))
Beam example (data processing pipeline):
import apache_beam as beam
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('input.txt')
    counts = (lines
              | beam.FlatMap(lambda x: x.split())
              | beam.Map(lambda x: (x, 1))
              | beam.CombinePerKey(sum))
Dask - Parallel computing with task scheduling
Pros of Dask
- Lightweight and flexible, integrating seamlessly with existing Python ecosystems
- Excellent for scaling pandas and NumPy operations to larger-than-memory datasets
- Simpler learning curve for Python developers already familiar with pandas/NumPy
Cons of Dask
- Less robust for complex, multi-step data processing pipelines
- Limited support for streaming data compared to Beam
- Smaller ecosystem and fewer connectors for external systems
Code Comparison
Dask example:
import dask.dataframe as dd
df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()
Beam example:
import apache_beam as beam
with beam.Pipeline() as p:
    result = (p
              | beam.io.ReadFromText('large_file.csv')
              # assumes a two-column CSV of (key, numeric value)
              | beam.Map(lambda line: line.split(','))
              | beam.Map(lambda row: (row[0], float(row[1])))
              | beam.GroupByKey()
              | beam.CombineValues(beam.combiners.MeanCombineFn()))
Both Dask and Beam offer powerful data processing capabilities, but they cater to different use cases. Dask excels in scaling familiar Python data structures, while Beam provides a more comprehensive framework for complex data processing pipelines across various languages and execution engines.
Apache Spark - A unified analytics engine for large-scale data processing
Pros of Spark
- More mature and widely adopted in industry
- Better performance for large-scale data processing
- Richer ecosystem with built-in libraries for ML, graph processing, etc.
Cons of Spark
- Steeper learning curve, especially for complex use cases
- Less flexibility in terms of execution engines and runtime environments
- Higher resource requirements for cluster setup and management
Code Comparison
Beam (Python):
with beam.Pipeline() as p:
    lines = p | 'Read' >> beam.io.ReadFromText('input.txt')
    counts = (lines
              | 'Split' >> beam.FlatMap(lambda x: x.split())
              | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
              | 'GroupAndSum' >> beam.CombinePerKey(sum))
    counts | 'Write' >> beam.io.WriteToText('output.txt')
Spark (Python):
from pyspark import SparkContext

sc = SparkContext("local", "Word Count")
lines = sc.textFile("input.txt")
counts = lines.flatMap(lambda x: x.split()) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("output.txt")
Both examples demonstrate a simple word count operation, showcasing the similar programming model but different syntax and execution approaches between Beam and Spark.
Apache Flink
Pros of Flink
- Native support for stateful stream processing
- Lower latency and higher throughput for streaming workloads
- More mature ecosystem for stream processing use cases
Cons of Flink
- Steeper learning curve, especially for complex streaming scenarios
- Less flexibility for batch processing compared to Beam
- Smaller community and fewer language APIs (primarily Java and Scala)
Code Comparison
Flink:
// Splitter is a user-defined FlatMapFunction that emits (word, count) records
DataStream<String> stream = env.addSource(new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), properties));
stream.flatMap(new Splitter())
      .keyBy("word")
      .sum("count")
      .print();
Beam:
// KafkaIO.read() yields key/value pairs; withoutMetadata() drops the Kafka record metadata
PCollection<KV<String, String>> input = pipeline.apply(KafkaIO.<String, String>read()
    .withBootstrapServers("localhost:9092")
    .withTopic("topic")
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .withoutMetadata());
Both Flink and Beam are powerful data processing frameworks, but they have different strengths. Flink excels in stream processing with its native stateful processing capabilities, while Beam offers more flexibility across various processing paradigms. The code examples showcase how both frameworks can interact with Kafka, demonstrating their integration capabilities with popular data sources.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Pros of Airflow
- More user-friendly web UI for monitoring and managing workflows
- Extensive ecosystem of plugins and integrations
- Better suited for orchestrating complex, interdependent tasks
Cons of Airflow
- Steeper learning curve due to its complex architecture
- Less efficient for processing large-scale data compared to Beam
- More resource-intensive, especially for large DAGs
Code Comparison
Airflow DAG definition:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

dag = DAG('example_dag', start_date=datetime(2023, 1, 1))

def task_function():
    print("Executing task")

task = PythonOperator(
    task_id='example_task',
    python_callable=task_function,
    dag=dag
)
Beam pipeline definition:
import apache_beam as beam
with beam.Pipeline() as pipeline:
    (pipeline
     | 'Create' >> beam.Create([1, 2, 3, 4, 5])
     | 'Square' >> beam.Map(lambda x: x * x)
     | 'Print' >> beam.Map(print))
Both Apache Beam and Airflow are powerful tools for data processing and workflow management, respectively. Beam excels in large-scale data processing and offers a unified programming model, while Airflow provides robust scheduling and orchestration capabilities for complex workflows. The choice between them depends on the specific requirements of your project.
README
Apache Beam
Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends, including Apache Flink, Apache Spark, Google Cloud Dataflow, and Hazelcast Jet.
Overview
Beam provides a general approach to expressing embarrassingly parallel data processing pipelines and supports three categories of users, each of which has relatively disparate backgrounds and needs.
- End Users: Writing pipelines with an existing SDK and running them on an existing runner. These users want to focus on writing their application logic and have everything else just work.
- SDK Writers: Developing a Beam SDK targeted at a specific user community (Java, Python, Scala, Go, R, graphical, etc). These users are language geeks and would prefer to be shielded from all the details of various runners and their implementations.
- Runner Writers: Have an execution environment for distributed processing and would like to support programs written against the Beam Model. Would prefer to be shielded from details of multiple SDKs.
The Beam Model
The model behind Beam evolved from several internal Google data processing projects, including MapReduce, FlumeJava, and MillWheel. This model was originally known as the "Dataflow Model".
To learn more about the Beam Model (though still under the original name of Dataflow), see the World Beyond Batch: Streaming 101 and Streaming 102 posts on O'Reilly's Radar site, and the VLDB 2015 paper.
The key concepts in the Beam programming model are:
- PCollection: represents a collection of data, which could be bounded or unbounded in size.
- PTransform: represents a computation that transforms input PCollections into output PCollections.
- Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
- PipelineRunner: specifies where and how the pipeline should execute.
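A minimal Java sketch tying these concepts together (the file name is a placeholder):
// Pipeline: the DAG of transforms, created from options that pick a PipelineRunner.
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
// PCollection: the (bounded or unbounded) data flowing through the graph.
PCollection<String> lines = p.apply(TextIO.read().from("input.txt"));
// PTransform: a computation from input PCollections to output PCollections, here Count.globally().
PCollection<Long> lineCount = lines.apply(Count.globally());
p.run().waitUntilFinish();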
SDKs
Beam supports multiple language-specific SDKs for writing pipelines against the Beam Model.
Currently, this repository contains SDKs for Java, Python and Go.
Have ideas for new SDKs or DSLs? See the sdk-ideas label.
Runners
Beam supports executing programs on multiple distributed processing backends through PipelineRunners. Currently, the following PipelineRunners are available:
- The DirectRunner runs the pipeline on your local machine.
- The DataflowRunner submits the pipeline to Google Cloud Dataflow.
- The FlinkRunner runs the pipeline on an Apache Flink cluster. The code has been donated from dataArtisans/flink-dataflow and is now part of Beam.
- The SparkRunner runs the pipeline on an Apache Spark cluster.
- The JetRunner runs the pipeline on a Hazelcast Jet cluster. The code has been donated from hazelcast/hazelcast-jet and is now part of Beam.
- The Twister2Runner runs the pipeline on a Twister2 cluster. The code has been donated from DSC-SPIDAL/twister2 and is now part of Beam.
Have ideas for new Runners? See the runner-ideas label.
Instructions for building and testing Beam itself are in the contribution guide.
Learn More
Here are some resources actively maintained by the Beam community to help you get started:
| Resource | Details |
|---|---|
| Apache Beam Website | Our website discussing the project and its specifics. |
| Java Quickstart | A guide to getting started with the Java SDK. |
| Python Quickstart | A guide to getting started with the Python SDK. |
| Go Quickstart | A guide to getting started with the Go SDK. |
| Tour of Beam | A comprehensive, interactive learning experience covering Beam concepts in depth. |
| Beam Quest | A certification granted by Google Cloud, certifying proficiency in Beam. |
| Community Metrics | Beam's Git Community Metrics. |
Contact Us
To get involved with Apache Beam:
- Subscribe to or e-mail the user@beam.apache.org list.
- Subscribe to or e-mail the dev@beam.apache.org list.
- Join the ASF Slack in the #beam channel.
- Report an issue.