Top Related Projects
- Koalas: pandas API on Apache Spark
- Dask: Parallel computing with task scheduling
- Ray: A unified framework for scaling AI and Python applications, consisting of a core distributed runtime and a set of AI libraries for accelerating ML workloads
- Polars: DataFrames powered by a multithreaded, vectorized query engine, written in Rust
- Vaex: Out-of-core hybrid Apache Arrow/NumPy DataFrame for Python, for ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Quick Overview
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. Spark is designed for both batch and streaming data processing, offering speed, ease of use, and sophisticated analytics.
Pros
- High performance: Spark can be up to 100x faster than Hadoop MapReduce for certain in-memory workloads
- Versatility: Supports multiple programming languages and various data processing tasks (batch, streaming, machine learning, graph processing)
- Rich ecosystem: Integrates well with other big data tools and has a wide range of libraries and extensions
- In-memory computing: Allows data to be cached in memory for faster iterative algorithms (see the caching sketch after this list)
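As a minimal, hedged sketch of the in-memory computing point above: caching a DataFrame keeps it in executor memory after the first action, so repeated computations avoid re-reading the source. The file path and column name below are placeholders.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CachingSketch").getOrCreate()
# Placeholder path and column name; adjust for your data
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
# cache() marks the DataFrame for in-memory storage; the first action materializes it
df.cache()
print(df.count())  # first action reads the CSV and populates the cache
df.groupBy("column_name").count().show()  # reuses the cached data instead of re-reading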
Cons
- Steep learning curve: Requires understanding of distributed computing concepts
- Memory-intensive: Caching large datasets can require substantial RAM across the cluster
- Complexity in tuning: Optimal performance often requires careful configuration and tuning (see the configuration sketch after this list)
- No built-in storage layer: Unlike Hadoop, Spark doesn't ship its own distributed file system and relies on external storage such as HDFS or cloud object stores
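To make the tuning point concrete, here is a minimal sketch of setting a couple of common configuration properties when creating a session; the values are illustrative only, not recommendations, and appropriate settings depend on cluster size and workload.
from pyspark.sql import SparkSession
# Illustrative values only; tune for your own cluster and workload
spark = (
    SparkSession.builder
    .appName("TuningSketch")
    .config("spark.executor.memory", "4g")          # memory allocated per executor
    .config("spark.sql.shuffle.partitions", "200")  # number of partitions used for shuffles
    .getOrCreate()
)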
Code Examples
- Reading a CSV file and performing a simple aggregation:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleAggregation").getOrCreate()
# Load the CSV with a header row, letting Spark infer column types
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
# Sum value_column within each group of column_name
result = df.groupBy("column_name").agg({"value_column": "sum"})
result.show()
- Creating a streaming query:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split
spark = SparkSession.builder.appName("StreamingExample").getOrCreate()
# Read lines of text from a socket source (e.g. started with `nc -lk 9999`)
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
# Split each line into words and compute a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
wordCounts = words.groupBy("word").count()
# Stream the complete, updated counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
- Training a machine learning model:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
# Assume 'df' is a DataFrame with features and label columns
assembler = VectorAssembler(inputCols=["feature1", "feature2", "feature3"], outputCol="features")
df_assembled = assembler.transform(df)
# Train a logistic regression model on the assembled feature vector
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(df_assembled)
# Score the same data; transform() adds prediction and probability columns
predictions = model.transform(df_assembled)
predictions.select("label", "prediction", "probability").show()
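As a follow-up sketch (not part of the original example), the predictions can be scored with one of Spark's built-in evaluators; the column names match those used above, and in practice you would evaluate on a held-out split rather than the training data.
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Default metric is area under the ROC curve; rawPrediction is produced by LogisticRegression
evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction")
print(evaluator.evaluate(predictions))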
Getting Started
- Install Spark: Download a pre-built package from the Apache Spark website and set up environment variables (e.g. SPARK_HOME and PATH), or install the Python package with pip install pyspark.
- Start a Spark session in Python:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyFirstSparkApp").getOrCreate()
# Read a CSV file
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
# Perform operations
df.show()
df.printSchema()
# Stop the Spark session
spark.stop()
Competitor Comparisons
Koalas: pandas API on Apache Spark
Pros of Koalas
- Provides a pandas-like API for PySpark, making it easier for data scientists familiar with pandas to transition to big data processing
- Offers better performance for certain operations compared to native PySpark DataFrames
- Seamless integration with existing pandas code and ecosystem
Cons of Koalas
- Limited functionality compared to the full Apache Spark ecosystem
- May introduce additional overhead for some operations
- Smaller community and less extensive documentation compared to Spark
Code Comparison
Koalas:
import databricks.koalas as ks
df = ks.read_csv('data.csv')
result = df.groupby('category').agg({'value': 'mean'})
Spark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)
result = df.groupBy('category').agg({'value': 'mean'})
Both Koalas and Spark aim to process large-scale data, but Koalas provides a more familiar interface for pandas users. While Koalas offers easier adoption for data scientists, Spark provides a more comprehensive set of features and better scalability for complex big data processing tasks. The choice between the two depends on the specific use case, team expertise, and project requirements.
Dask: Parallel computing with task scheduling
Pros of Dask
- Lightweight and easy to install, integrates seamlessly with existing Python ecosystems
- Flexible scheduling options, including local, distributed, and cloud-based execution
- Native support for NumPy and Pandas data structures
Cons of Dask
- Less mature ecosystem compared to Spark, with fewer high-level APIs and tools
- Limited support for machine learning workflows
- Smaller community and fewer enterprise-grade features
Code Comparison
Dask:
import dask.dataframe as dd
df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()
Spark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('large_file.csv', header=True, inferSchema=True)
result = df.groupBy('column').mean().collect()
Both Dask and Spark provide distributed computing capabilities for large-scale data processing. Dask excels in its simplicity and integration with the Python ecosystem, making it easier for data scientists to adopt. Spark, on the other hand, offers a more comprehensive set of features and a mature ecosystem, making it suitable for enterprise-level big data applications. The choice between the two depends on the specific requirements of the project, team expertise, and existing infrastructure.
Ray: A unified framework for scaling AI and Python applications
Pros of Ray
- More flexible and general-purpose distributed computing framework
- Better support for machine learning and reinforcement learning tasks
- Easier to scale from a single machine to a cluster
Cons of Ray
- Less mature ecosystem compared to Spark
- Smaller community and fewer available resources
- May require more low-level programming for some tasks
Code Comparison
Ray example:
import ray
@ray.remote
def f(x):
return x * x
futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))
Spark example:
from pyspark import SparkContext
sc = SparkContext("local", "Square App")
rdd = sc.parallelize(range(4))
result = rdd.map(lambda x: x * x).collect()
print(result)
Both examples demonstrate a simple distributed computation of squaring numbers. Ray uses its remote function decorator and futures, while Spark uses RDDs and transformations. Ray's approach is more flexible and can be easily extended to more complex distributed tasks, while Spark's approach is more specialized for data processing workflows.
Polars: DataFrames powered by a multithreaded, vectorized query engine, written in Rust
Pros of Polars
- Faster performance for many operations, especially on single-node systems
- Lower memory usage due to efficient data representation
- Simpler API and easier to get started for data analysis tasks
Cons of Polars
- Less mature ecosystem and fewer integrations compared to Spark
- Limited distributed computing capabilities
- Smaller community and fewer learning resources available
Code Comparison
Polars:
import polars as pl
df = pl.read_csv("data.csv")
result = df.group_by("category").agg(pl.col("value").sum())
Spark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)
result = df.groupBy("category").sum("value")
Both Polars and Spark are powerful data processing libraries, but they serve different use cases. Polars excels in single-node performance and ease of use for data analysis tasks, while Spark shines in distributed computing and big data processing. The choice between them depends on the specific requirements of your project, such as data size, processing needs, and scalability requirements.
Vaex: Out-of-core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Pros of Vaex
- Designed for out-of-core processing, handling datasets larger than RAM efficiently
- Faster performance for certain operations on large datasets
- Simpler API and easier to get started with for data scientists
Cons of Vaex
- Less mature ecosystem and community support compared to Spark
- More limited in terms of distributed computing capabilities
- Fewer integrations with other big data tools and frameworks
Code Comparison
Vaex:
import vaex
df = vaex.open('large_dataset.hdf5')
result = df.mean(df.column)
Spark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('large_dataset.parquet')
result = df.agg({"column": "mean"}).collect()[0][0]
Summary
Vaex excels in handling large datasets on a single machine with its out-of-core processing capabilities and simpler API. It's particularly useful for data scientists working with datasets that don't fit in memory. However, Spark offers a more comprehensive ecosystem for distributed computing and big data processing, with broader integration capabilities and community support. The choice between the two depends on the specific use case, dataset size, and available computing resources.
README
Apache Spark
Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
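As a small, hedged illustration of two of the higher-level tools listed above (the column names and data are made up for the example), the snippet below runs the same aggregation through the pandas API on Spark and through Spark SQL.
import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas API on Spark: pandas-style syntax executed by the Spark engine
psdf = ps.DataFrame({"category": ["a", "b", "a"], "value": [1, 2, 3]})
print(psdf.groupby("category").sum())

# Spark SQL: the same data registered as a temporary view and queried with SQL
psdf.to_spark().createOrReplaceTempView("t")
spark.sql("SELECT category, SUM(value) AS total FROM t GROUP BY category").show()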
Online Documentation
You can find the latest Spark documentation, including a programming guide, on the project web page. This README file only contains basic setup instructions.
Building Spark
Spark is built using Apache Maven. To build Spark and its example programs, run:
./build/mvn -DskipTests clean package
(You do not need to do this if you downloaded a pre-built package.)
More detailed documentation is available from the project site, at "Building Spark".
For general development tips, including info on developing Spark using an IDE, see "Useful Developer Tools".
Interactive Scala Shell
The easiest way to start using Spark is through the Scala shell:
./bin/spark-shell
Try the following command, which should return 1,000,000,000:
scala> spark.range(1000 * 1000 * 1000).count()
Interactive Python Shell
Alternatively, if you prefer Python, you can use the Python shell:
./bin/pyspark
And run the following command, which should also return 1,000,000,000:
>>> spark.range(1000 * 1000 * 1000).count()
Example Programs
Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:
./bin/run-example SparkPi
will run the Pi example locally.
You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a spark:// URL, "yarn" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:
MASTER=spark://host:7077 ./bin/run-example SparkPi
Many of the example programs print usage help if no params are given.
Running Tests
Testing first requires building Spark. Once Spark is built, tests can be run using:
./dev/run-tests
Please see the guidance on how to run tests for a module, or individual tests.
There is also a Kubernetes integration test; see resource-managers/kubernetes/integration-tests/README.md.
A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at "Specifying the Hadoop Version and Enabling YARN" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.
Configuration
Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.
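As a small sketch of how configuration can also be inspected and adjusted at runtime from PySpark (the property shown is just one example; the Configuration Guide lists the full set and which ones are runtime-settable):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read the current value of a property and change a runtime-settable one
print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "64")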
Contributing
Please review the Contribution to Spark guide for information on how to get started contributing to the project.