Convert Figma logo to code with AI

apache logobeam

Apache Beam is a unified programming model for Batch and Streaming data processing.

7,828
4,241
7,828
4,475

Top Related Projects

186,879

An Open Source Machine Learning Framework for Everyone

85,015

Tensors and Dynamic neural networks in Python with strong GPU acceleration

12,495

Parallel computing with task scheduling

40,184

Apache Spark - A unified analytics engine for large-scale data processing

23,929

Apache Flink

36,684

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Quick Overview

Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines, including ETL, batch, and stream processing jobs. It provides a portable API layer for creating data processing workflows that can run on various distributed processing backends, such as Apache Flink, Apache Spark, and Google Cloud Dataflow.

Pros

  • Unified programming model for batch and stream processing
  • Portability across multiple execution engines
  • Rich set of built-in transforms and I/O connectors
  • Strong community support and active development

Cons

  • Steep learning curve for beginners
  • Performance overhead compared to native execution engines
  • Limited support for some advanced streaming features
  • Complexity in debugging and monitoring pipelines

Code Examples

  1. Reading from a file and writing to another:
Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("input.txt"))
 .apply(TextIO.write().to("output.txt"));
p.run();
  1. Performing a word count:
PCollection<String> words = p.apply(TextIO.read().from("input.txt"))
    .apply(FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split("\\s+"))));
PCollection<KV<String, Long>> wordCounts = words.apply(Count.perElement());
  1. Windowing and triggering in streaming:
PCollection<KV<String, Integer>> windowed = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(5)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(5))))
        .withAllowedLateness(Duration.standardMinutes(30))
        .accumulatingFiredPanes());

Getting Started

To get started with Apache Beam, follow these steps:

  1. Add the Beam SDK to your project (Maven example):
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.41.0</version>
</dependency>
  1. Create a simple pipeline:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SimpleWordCount {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.read().from("input.txt"))
     .apply(TextIO.write().to("output.txt"));
    p.run();
  }
}
  1. Run the pipeline using a local runner or a distributed execution engine of your choice.

Competitor Comparisons

186,879

An Open Source Machine Learning Framework for Everyone

Pros of TensorFlow

  • More comprehensive machine learning ecosystem with tools for model development, training, and deployment
  • Stronger support for deep learning and neural networks
  • Larger community and more extensive documentation

Cons of TensorFlow

  • Steeper learning curve, especially for beginners
  • Can be overkill for simpler data processing tasks
  • Less flexible for general-purpose data processing compared to Beam

Code Comparison

TensorFlow example (basic neural network):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Beam example (basic data processing pipeline):

import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('input.txt')
    counts = (lines
              | beam.FlatMap(lambda x: x.split())
              | beam.Map(lambda x: (x, 1))
              | beam.CombinePerKey(sum))
    counts | beam.io.WriteToText('output.txt')

Both TensorFlow and Beam are powerful tools, but they serve different purposes. TensorFlow excels in machine learning tasks, particularly deep learning, while Beam is more suited for large-scale data processing and ETL operations. The choice between them depends on the specific requirements of your project.

85,015

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Pros of PyTorch

  • More user-friendly and intuitive API for deep learning tasks
  • Dynamic computational graph allows for easier debugging and flexibility
  • Stronger community support and ecosystem for AI/ML applications

Cons of PyTorch

  • Less suitable for large-scale data processing and ETL workflows
  • Limited support for distributed computing compared to Beam
  • Narrower focus on deep learning, while Beam is more versatile for general data processing

Code Comparison

PyTorch example (neural network definition):

import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

Beam example (data processing pipeline):

import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('input.txt')
    counts = (lines
              | beam.FlatMap(lambda x: x.split())
              | beam.Map(lambda x: (x, 1))
              | beam.CombinePerKey(sum))
12,495

Parallel computing with task scheduling

Pros of Dask

  • Lightweight and flexible, integrating seamlessly with existing Python ecosystems
  • Excellent for scaling pandas and NumPy operations to larger-than-memory datasets
  • Simpler learning curve for Python developers already familiar with pandas/NumPy

Cons of Dask

  • Less robust for complex, multi-step data processing pipelines
  • Limited support for streaming data compared to Beam
  • Smaller ecosystem and fewer connectors for external systems

Code Comparison

Dask example:

import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()

Beam example:

import apache_beam as beam

with beam.Pipeline() as p:
    result = (p | beam.io.ReadFromText('large_file.csv')
                | beam.Map(lambda x: x.split(','))
                | beam.GroupByKey()
                | beam.CombineValues(beam.combiners.MeanCombineFn()))

Both Dask and Beam offer powerful data processing capabilities, but they cater to different use cases. Dask excels in scaling familiar Python data structures, while Beam provides a more comprehensive framework for complex data processing pipelines across various languages and execution engines.

40,184

Apache Spark - A unified analytics engine for large-scale data processing

Pros of Spark

  • More mature and widely adopted in industry
  • Better performance for large-scale data processing
  • Richer ecosystem with built-in libraries for ML, graph processing, etc.

Cons of Spark

  • Steeper learning curve, especially for complex use cases
  • Less flexibility in terms of execution engines and runtime environments
  • Higher resource requirements for cluster setup and management

Code Comparison

Beam (Python):

with beam.Pipeline() as p:
    lines = p | 'Read' >> beam.io.ReadFromText('input.txt')
    counts = (lines
              | 'Split' >> beam.FlatMap(lambda x: x.split())
              | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
              | 'GroupAndSum' >> beam.CombinePerKey(sum))
    counts | 'Write' >> beam.io.WriteToText('output.txt')

Spark (Python):

sc = SparkContext("local", "Word Count")
lines = sc.textFile("input.txt")
counts = lines.flatMap(lambda x: x.split()) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("output.txt")

Both examples demonstrate a simple word count operation, showcasing the similar programming model but different syntax and execution approaches between Beam and Spark.

23,929

Apache Flink

Pros of Flink

  • Native support for stateful stream processing
  • Lower latency and higher throughput for streaming workloads
  • More mature ecosystem for stream processing use cases

Cons of Flink

  • Steeper learning curve, especially for complex streaming scenarios
  • Less flexibility for batch processing compared to Beam
  • Smaller community and fewer language APIs (primarily Java and Scala)

Code Comparison

Flink:

DataStream<String> stream = env.addSource(new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), properties));
stream.flatMap(new Splitter())
      .keyBy("word")
      .sum("count")
      .print();

Beam:

PCollection<String> input = pipeline.apply(KafkaIO.<String, String>read()
    .withBootstrapServers("localhost:9092")
    .withTopic("topic")
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .withoutMetadata());

Both Flink and Beam are powerful data processing frameworks, but they have different strengths. Flink excels in stream processing with its native stateful processing capabilities, while Beam offers more flexibility across various processing paradigms. The code examples showcase how both frameworks can interact with Kafka, demonstrating their integration capabilities with popular data sources.

36,684

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Pros of Airflow

  • More user-friendly web UI for monitoring and managing workflows
  • Extensive ecosystem of plugins and integrations
  • Better suited for orchestrating complex, interdependent tasks

Cons of Airflow

  • Steeper learning curve due to its complex architecture
  • Less efficient for processing large-scale data compared to Beam
  • More resource-intensive, especially for large DAGs

Code Comparison

Airflow DAG definition:

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

dag = DAG('example_dag', start_date=datetime(2023, 1, 1))

def task_function():
    print("Executing task")

task = PythonOperator(
    task_id='example_task',
    python_callable=task_function,
    dag=dag
)

Beam pipeline definition:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
     | 'Create' >> beam.Create([1, 2, 3, 4, 5])
     | 'Square' >> beam.Map(lambda x: x * x)
     | 'Print' >> beam.Map(print))

Both Apache Beam and Airflow are powerful tools for data processing and workflow management, respectively. Beam excels in large-scale data processing and offers a unified programming model, while Airflow provides robust scheduling and orchestration capabilities for complex workflows. The choice between them depends on the specific requirements of your project.

Convert Figma logo designs to code with AI

Visual Copilot

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

README

Apache Beam

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends, including Apache Flink, Apache Spark, Google Cloud Dataflow, and Hazelcast Jet.

Status

Maven Version PyPI version Go version Python coverage Build python source distribution and wheels Python tests Java tests

Overview

Beam provides a general approach to expressing embarrassingly parallel data processing pipelines and supports three categories of users, each of which have relatively disparate backgrounds and needs.

  1. End Users: Writing pipelines with an existing SDK, running it on an existing runner. These users want to focus on writing their application logic and have everything else just work.
  2. SDK Writers: Developing a Beam SDK targeted at a specific user community (Java, Python, Scala, Go, R, graphical, etc). These users are language geeks and would prefer to be shielded from all the details of various runners and their implementations.
  3. Runner Writers: Have an execution environment for distributed processing and would like to support programs written against the Beam Model. Would prefer to be shielded from details of multiple SDKs.

The Beam Model

The model behind Beam evolved from several internal Google data processing projects, including MapReduce, FlumeJava, and Millwheel. This model was originally known as the “Dataflow Model”.

To learn more about the Beam Model (though still under the original name of Dataflow), see the World Beyond Batch: Streaming 101 and Streaming 102 posts on O’Reilly’s Radar site, and the VLDB 2015 paper.

The key concepts in the Beam programming model are:

  • PCollection: represents a collection of data, which could be bounded or unbounded in size.
  • PTransform: represents a computation that transforms input PCollections into output PCollections.
  • Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
  • PipelineRunner: specifies where and how the pipeline should execute.

SDKs

Beam supports multiple language-specific SDKs for writing pipelines against the Beam Model.

Currently, this repository contains SDKs for Java, Python and Go.

Have ideas for new SDKs or DSLs? See the sdk-ideas label.

Runners

Beam supports executing programs on multiple distributed processing backends through PipelineRunners. Currently, the following PipelineRunners are available:

  • The DirectRunner runs the pipeline on your local machine.
  • The PrismRunner runs the pipeline on your local machine using Beam Portability.
  • The DataflowRunner submits the pipeline to the Google Cloud Dataflow.
  • The FlinkRunner runs the pipeline on an Apache Flink cluster. The code has been donated from dataArtisans/flink-dataflow and is now part of Beam.
  • The SparkRunner runs the pipeline on an Apache Spark cluster.
  • The JetRunner runs the pipeline on a Hazelcast Jet cluster. The code has been donated from hazelcast/hazelcast-jet and is now part of Beam.
  • The Twister2Runner runs the pipeline on a Twister2 cluster. The code has been donated from DSC-SPIDAL/twister2 and is now part of Beam.

Have ideas for new Runners? See the runner-ideas label.

Instructions for building and testing Beam itself are in the contribution guide.

📚 Learn More

Here are some resources actively maintained by the Beam community to help you get started:

Resource Details
Apache Beam Website Our website discussing the project, and it's specifics.
Java Quickstart A guide to getting started with the Java SDK.
Python Quickstart A guide to getting started with the Python SDK.
Go Quickstart A guide to getting started with the Go SDK.
Tour of Beam A comprehensive, interactive learning experience covering Beam concepts in depth.
Beam Quest A certification granted by Google Cloud, certifying proficiency in Beam.
Community Metrics Beam's Git Community Metrics.

Contact Us

To get involved with Apache Beam: