apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

Top Related Projects

3,345

Koalas: pandas API on Apache Spark

12,495

Dask: Parallel computing with task scheduling

34,860

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

29,748

Polars: Dataframes powered by a multithreaded, vectorized query engine, written in Rust

8,280

Vaex: Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Quick Overview

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. Spark is designed for both batch and streaming data processing, offering speed, ease of use, and sophisticated analytics.

Pros

  • High performance: In-memory execution can make Spark up to 100x faster than Hadoop MapReduce on some workloads, though the gain depends heavily on the job
  • Versatility: Supports multiple programming languages and various data processing tasks (batch, streaming, machine learning, graph processing)
  • Rich ecosystem: Integrates well with other big data tools and has a wide range of libraries and extensions
  • In-memory computing: Allows data to be cached in memory for faster iterative algorithms

Cons

  • Steep learning curve: Requires understanding of distributed computing concepts
  • Memory-intensive: Can be expensive in terms of RAM requirements, especially for large datasets
  • Complexity in tuning: Optimal performance often requires careful configuration and tuning
  • No built-in storage layer: Unlike Hadoop, Spark doesn't have its own distributed file system and relies on external storage such as HDFS

Code Examples

  1. Reading a CSV file and performing a simple aggregation:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleAggregation").getOrCreate()

df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
result = df.groupBy("column_name").agg({"value_column": "sum"})
result.show()
  2. Creating a streaming query:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingExample").getOrCreate()

lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
words = lines.select(explode(split(lines.value, " ")).alias("word"))
wordCounts = words.groupBy("word").count()

query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
  3. Training a machine learning model:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Assume 'df' is a DataFrame with features and label columns
assembler = VectorAssembler(inputCols=["feature1", "feature2", "feature3"], outputCol="features")
df_assembled = assembler.transform(df)

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(df_assembled)

predictions = model.transform(df_assembled)
predictions.select("label", "prediction", "probability").show()
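
The streaming query in the second example uses outputMode("complete"), which writes the entire updated result table after every trigger rather than only the rows that changed. The following is a plain-Python analogy of that behavior, not Spark code: the function name run_complete_mode and the sample batches are hypothetical, and a Counter stands in for the aggregation state Spark keeps between micro-batches.

```python
from collections import Counter


def run_complete_mode(batches):
    """Yield the full word-count table after each micro-batch of lines."""
    totals = Counter()  # state carried across batches
    for lines in batches:
        for line in lines:
            # Same tokenization as split(lines.value, " ") in the example
            totals.update(line.split(" "))
        # "complete" mode emits the whole table, not just the new counts
        yield dict(totals)


# Hypothetical input: two micro-batches of text lines
batches = [["spark streaming"], ["spark sql", "streaming"]]
snapshots = list(run_complete_mode(batches))

for i, table in enumerate(snapshots):
    print(f"batch {i}: {sorted(table.items())}")
```

The second snapshot still contains every word seen so far, which is why complete mode suits small aggregated tables better than large, ever-growing ones.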

Getting Started

  1. Install Spark: Download a release from the Apache Spark website and set up environment variables such as SPARK_HOME, or install PySpark from PyPI with pip install pyspark.
  2. Start a Spark session in Python:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyFirstSparkApp").getOrCreate()

# Read a CSV file
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Perform operations
df.show()
df.printSchema()

# Stop the Spark session
spark.stop()
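
The snippets above read "path/to/file.csv" with header=True, so they need a CSV file that starts with a header row. As a minimal sketch using only the Python standard library, the code below writes such a file to a temporary directory and computes the per-group sums that the aggregation example would be expected to produce. The sample rows are made up for illustration; the column names column_name and value_column are taken from the first code example.

```python
import csv
import os
import tempfile
from collections import defaultdict

# Hypothetical sample data; column names match the aggregation example
rows = [
    {"column_name": "a", "value_column": 1},
    {"column_name": "b", "value_column": 2},
    {"column_name": "a", "value_column": 3},
]

# Write the CSV with a header row so that header=True applies
csv_path = os.path.join(tempfile.mkdtemp(), "file.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["column_name", "value_column"])
    writer.writeheader()
    writer.writerows(rows)

# Compute the per-group sums without Spark, for comparison with result.show()
expected = defaultdict(int)
with open(csv_path, newline="") as f:
    for record in csv.DictReader(f):
        expected[record["column_name"]] += int(record["value_column"])

print(csv_path)
print(dict(expected))
```

Passing the printed path to spark.read.csv in place of "path/to/file.csv" gives the earlier examples something concrete to run against.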

Competitor Comparisons

3,345

Koalas: pandas API on Apache Spark

Pros of Koalas

  • Provides a pandas-like API for PySpark, making it easier for data scientists familiar with pandas to transition to big data processing
  • Scales pandas-style code beyond a single machine by running it on Spark's distributed engine
  • Seamless integration with existing pandas code and ecosystem

Cons of Koalas

  • Limited functionality compared to the full Apache Spark ecosystem
  • May introduce additional overhead for some operations
  • Smaller community and less extensive documentation compared to Spark

Code Comparison

Koalas:

import databricks.koalas as ks

df = ks.read_csv('data.csv')
result = df.groupby('category').agg({'value': 'mean'})

Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)
result = df.groupBy('category').agg({'value': 'mean'})

Both Koalas and Spark aim to process large-scale data, but Koalas provides a more familiar interface for pandas users. While Koalas offers easier adoption for data scientists, Spark provides a more comprehensive set of features and better scalability for complex big data processing tasks. The choice between the two depends on the specific use case, team expertise, and project requirements.

12,495

Parallel computing with task scheduling

Pros of Dask

  • Lightweight and easy to install, integrates seamlessly with existing Python ecosystems
  • Flexible scheduling options, including local, distributed, and cloud-based execution
  • Native support for NumPy and Pandas data structures

Cons of Dask

  • Less mature ecosystem compared to Spark, with fewer high-level APIs and tools
  • Limited support for machine learning workflows
  • Smaller community and fewer enterprise-grade features

Code Comparison

Dask:

import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()

Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
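# header=True and inferSchema=True give named, typed columns for groupBy
df = spark.read.csv('large_file.csv', header=True, inferSchema=True)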
result = df.groupBy('column').mean().collect()

Both Dask and Spark provide distributed computing capabilities for large-scale data processing. Dask excels in its simplicity and integration with the Python ecosystem, making it easier for data scientists to adopt. Spark, on the other hand, offers a more comprehensive set of features and a mature ecosystem, making it suitable for enterprise-level big data applications. The choice between the two depends on the specific requirements of the project, team expertise, and existing infrastructure.

34,860

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Pros of Ray

  • More flexible and general-purpose distributed computing framework
  • Better support for machine learning and reinforcement learning tasks
  • Easier to scale from a single machine to a cluster

Cons of Ray

  • Less mature ecosystem compared to Spark
  • Smaller community and fewer available resources
  • May require more low-level programming for some tasks

Code Comparison

Ray example:

import ray

@ray.remote
def f(x):
    return x * x

futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))

Spark example:

from pyspark import SparkContext

sc = SparkContext("local", "Square App")
rdd = sc.parallelize(range(4))
result = rdd.map(lambda x: x * x).collect()
print(result)

Both examples demonstrate a simple distributed computation of squaring numbers. Ray uses its remote function decorator and futures, while Spark uses RDDs and transformations. Ray's approach is more flexible and can be easily extended to more complex distributed tasks, while Spark's approach is more specialized for data processing workflows.
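
For readers new to the futures pattern in the Ray example, here is a single-machine analogy using Python's standard-library concurrent.futures. It is neither Ray nor Spark: a local thread pool is swapped in for the cluster, and the function name square is chosen for illustration. Submitting work returns a future immediately, and the result is fetched later, much as f.remote(i) and ray.get(futures) do above.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


# A local thread pool stands in for the distributed workers
with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a Future right away, like f.remote(i) in the Ray example
    futures = [pool.submit(square, i) for i in range(4)]
    # result() blocks until each value is ready, like ray.get(futures)
    results = [future.result() for future in futures]

print(results)  # prints [0, 1, 4, 9]
```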

29,748

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Pros of Polars

  • Faster performance for many operations, especially on single-node systems
  • Lower memory usage due to efficient data representation
  • Simpler API and easier to get started for data analysis tasks

Cons of Polars

  • Less mature ecosystem and fewer integrations compared to Spark
  • Limited distributed computing capabilities
  • Smaller community and fewer learning resources available

Code Comparison

Polars:

import polars as pl

df = pl.read_csv("data.csv")
# group_by replaces the older groupby spelling in current Polars releases
result = df.group_by("category").agg(pl.col("value").sum())

Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)
result = df.groupBy("category").sum("value")

Both Polars and Spark are powerful data processing libraries, but they serve different use cases. Polars excels in single-node performance and ease of use for data analysis tasks, while Spark shines in distributed computing and big data processing. The choice between them depends on the specific requirements of your project, such as data size, processing needs, and scalability requirements.

8,280

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Pros of Vaex

  • Designed for out-of-core processing, handling datasets larger than RAM efficiently
  • Faster performance for certain operations on large datasets
  • Simpler API and easier to get started with for data scientists

Cons of Vaex

  • Less mature ecosystem and community support compared to Spark
  • More limited in terms of distributed computing capabilities
  • Fewer integrations with other big data tools and frameworks

Code Comparison

Vaex:

import vaex
df = vaex.open('large_dataset.hdf5')
result = df.mean(df.column)

Spark:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('large_dataset.parquet')
result = df.agg({"column": "mean"}).collect()[0][0]

Summary

Vaex excels in handling large datasets on a single machine with its out-of-core processing capabilities and simpler API. It's particularly useful for data scientists working with datasets that don't fit in memory. However, Spark offers a more comprehensive ecosystem for distributed computing and big data processing, with broader integration capabilities and community support. The choice between the two depends on the specific use case, dataset size, and available computing resources.

README

Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page. This README file only contains basic setup instructions.

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

./build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

More detailed documentation is available from the project site, at "Building Spark".

For general development tips, including info on developing Spark using an IDE, see "Useful Developer Tools".

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

./bin/spark-shell

Try the following command, which should return 1,000,000,000:

scala> spark.range(1000 * 1000 * 1000).count()

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

./bin/pyspark

And run the following command, which should also return 1,000,000,000:

>>> spark.range(1000 * 1000 * 1000).count()

Example Programs

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit them to a cluster. This can be a spark:// URL, "yarn" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPi
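
The MASTER forms listed above can be summarized in a small helper. This is an illustrative sketch only: describe_master is a hypothetical function, not part of Spark, and it recognizes just the four forms this README names. Spark itself accepts other master values that this sketch would reject.

```python
import re


def describe_master(master):
    """Classify a MASTER value using only the forms listed in the README."""
    if master == "local":
        return "local, 1 thread"
    match = re.fullmatch(r"local\[(\d+)\]", master)
    if match:
        return f"local, {int(match.group(1))} threads"
    if master == "yarn":
        return "YARN"
    if master.startswith("spark://"):
        # Everything after the scheme is the host:port of the cluster master
        return f"standalone cluster at {master[len('spark://'):]}"
    raise ValueError(f"unrecognized MASTER value: {master!r}")


for value in ["local", "local[4]", "yarn", "spark://host:7077"]:
    print(f"{value} -> {describe_master(value)}")
```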

Many of the example programs print usage help if no params are given.

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

./dev/run-tests

Please see the guidance on how to run tests for a module, or individual tests.

There is also a Kubernetes integration test; see resource-managers/kubernetes/integration-tests/README.md

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at "Specifying the Hadoop Version and Enabling YARN" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.

Configuration

Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.

Contributing

Please review the Contribution to Spark guide for information on how to get started contributing to the project.