Top Related Projects
- Modin: Scale your Pandas workflows by changing a single line of code
- Polars: Dataframes powered by a multithreaded, vectorized query engine, written in Rust
- Dask: Parallel computing with task scheduling
- Vaex: Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
- cuDF: GPU DataFrame Library
Quick Overview
Koalas is an open-source project that provides a pandas-like API on top of Apache Spark. It aims to make it easier for data scientists familiar with pandas to transition to working with large-scale data processing using Spark, by offering a similar interface and functionality.
Pros
- Familiar pandas-like API for easier adoption by data scientists
- Seamless integration with Spark's distributed computing capabilities
- Improved performance for large-scale data processing compared to pandas
- Supports both Spark DataFrame and pandas DataFrame interoperability
Cons
- Not all pandas functions are implemented or fully supported
- Some operations may be slower than native Spark operations
- Limited support for certain data types and advanced pandas features
- Potential overhead when working with smaller datasets compared to pandas
Code Examples
- Creating a Koalas DataFrame:
```python
import databricks.koalas as ks

df = ks.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)
```
- Reading a CSV file and performing operations:
```python
df = ks.read_csv('data.csv')
result = df.groupby('category').agg({'sales': 'sum'})
print(result)
```
- Converting between Koalas and Spark DataFrames:
```python
import databricks.koalas as ks
from pyspark.sql import SparkSession

# Spark DataFrames are created through a SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(7,), (8,), (9,)], ['C'])

# Importing Koalas adds to_koalas() to Spark DataFrames
koalas_df = spark_df.to_koalas()
print(koalas_df)

spark_df_back = koalas_df.to_spark()
print(spark_df_back)
```
Getting Started
To get started with Koalas, follow these steps:
- Install Koalas using pip:
```bash
pip install koalas
```
- Import Koalas in your Python script:
```python
import databricks.koalas as ks
```
- Create a Koalas DataFrame or read data:
```python
df = ks.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# or
df = ks.read_csv('data.csv')
```
- Start using Koalas with familiar pandas-like operations:
```python
result = df.groupby('A').sum()
print(result)
```
Competitor Comparisons
Modin: Scale your Pandas workflows by changing a single line of code
Pros of Modin
- Seamless integration with existing pandas code, requiring minimal changes
- Supports both Ray and Dask as execution engines, offering flexibility
- Generally faster than Koalas for large datasets due to its distributed computing approach
Cons of Modin
- Less mature and stable compared to Koalas, with some pandas functions not fully implemented
- Limited support for certain data types and operations that are available in Koalas
- May require more system resources for optimal performance
Code Comparison
Modin:
```python
import modin.pandas as pd

df = pd.read_csv("large_file.csv")
result = df.groupby("column").mean()
```
Koalas:
```python
import databricks.koalas as ks

df = ks.read_csv("large_file.csv")
result = df.groupby("column").mean()
```
Both libraries aim to provide a pandas-like API for distributed computing, but Modin focuses on maintaining full pandas compatibility, while Koalas is designed specifically for Apache Spark integration. Modin's approach allows for easier adoption in existing pandas workflows, whereas Koalas may be more suitable for users already working with Spark environments.
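As a hedged illustration of the engine flexibility noted above (the file and column names are placeholders): Modin selects its backend through the MODIN_ENGINE environment variable, which must be set before modin.pandas is first imported.

```python
import os

# Must be set before the first modin import; "ray" also works.
os.environ["MODIN_ENGINE"] = "dask"

import modin.pandas as pd

df = pd.read_csv("large_file.csv")   # placeholder file name
print(df.groupby("column").mean())
```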
Dataframes powered by a multithreaded, vectorized query engine, written in Rust
Pros of Polars
- Faster performance due to Rust implementation and efficient memory usage
- More flexible data manipulation capabilities with lazy evaluation
- Supports both in-memory and out-of-memory (memory-mapped) operations
Cons of Polars
- Less mature ecosystem and community support compared to Koalas
- Steeper learning curve for users familiar with pandas/Koalas syntax
- Limited integration with big data frameworks like Apache Spark
Code Comparison
Koalas:
```python
import databricks.koalas as ks

df = ks.read_csv("data.csv")
result = df.groupby("category").agg({"value": "mean"})
```
Polars:
```python
import polars as pl

df = pl.read_csv("data.csv")
# group_by() in current Polars; earlier releases spelled it groupby()
result = df.group_by("category").agg(pl.col("value").mean())
```
Both libraries aim to provide DataFrame functionality similar to pandas, but with different underlying implementations and performance characteristics. Koalas focuses on Spark integration and distributed computing, while Polars emphasizes speed and memory efficiency through its Rust implementation. The choice between them depends on specific use cases, performance requirements, and existing infrastructure.
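To make the lazy-evaluation point above concrete, here is a minimal sketch (file and column names are illustrative): pl.scan_csv records a query plan instead of reading the file, Polars optimizes the plan as a whole, and nothing executes until collect().

```python
import polars as pl

# scan_csv is lazy: it builds a plan rather than reading the file now.
lazy = (
    pl.scan_csv("data.csv")                 # illustrative file name
    .group_by("category")
    .agg(pl.col("value").mean())
)

result = lazy.collect()  # the optimized plan runs here, in one pass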
Parallel computing with task scheduling
Pros of Dask
- More flexible and can handle a wider variety of data processing tasks beyond just DataFrame operations
- Better suited for large-scale distributed computing across clusters
- Integrates well with other Python libraries in the scientific computing ecosystem
Cons of Dask
- Steeper learning curve compared to Koalas, especially for users familiar with pandas
- Less direct compatibility with Apache Spark ecosystem
- May require more manual optimization for certain operations
Code Comparison
Koalas:
```python
import databricks.koalas as ks

df = ks.read_csv('large_file.csv')
result = df.groupby('category').agg({'value': 'mean'})
```
Dask:
```python
import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
result = df.groupby('category').agg({'value': 'mean'}).compute()
```
Both libraries aim to provide DataFrame functionality for large datasets, but Dask offers a more comprehensive distributed computing framework, while Koalas focuses on providing a pandas-like API for Spark operations.
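A small sketch of that broader scope (function bodies and file names are stand-ins): dask.delayed turns plain Python functions into nodes of a task graph, which Dask's scheduler can then execute in parallel.

```python
import dask

@dask.delayed
def load(path):
    # Stand-in for real I/O work on `path`
    return len(path)

@dask.delayed
def combine(x, y):
    return x + y

# Nothing has run yet; `total` is a task graph with two parallel loads.
total = combine(load("a.csv"), load("b.csv"))
print(total.compute())  # the scheduler executes the graph here
```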
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Pros of Vaex
- Designed for handling large datasets (up to 1 billion rows) efficiently
- Supports out-of-core computing, allowing processing of data larger than RAM
- Offers advanced visualization capabilities for big data exploration
Cons of Vaex
- Less compatibility with existing Pandas code and ecosystem
- Smaller community and fewer third-party integrations compared to Koalas
- Steeper learning curve for users familiar with Pandas
Code Comparison
Vaex:
```python
import vaex

df = vaex.from_csv('large_dataset.csv')
result = df.groupby('category').agg({'value': 'mean'})
```
Koalas:
```python
import databricks.koalas as ks

df = ks.read_csv('large_dataset.csv')
result = df.groupby('category')['value'].mean()
```
Both Vaex and Koalas aim to provide solutions for working with large datasets, but they take different approaches. Vaex focuses on out-of-core computing and efficient handling of massive datasets, while Koalas emphasizes compatibility with the Pandas API and integration with Apache Spark. The choice between the two depends on specific use cases, dataset sizes, and existing infrastructure.
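As a brief, hedged sketch of Vaex's out-of-core model (the file name is hypothetical): vaex.open memory-maps columnar files such as HDF5 or Arrow, so the data is never fully loaded into RAM and aggregations stream over the mapped columns.

```python
import vaex

# Memory-maps the file instead of reading it into RAM
# ('big.hdf5' is a hypothetical file name).
df = vaex.open('big.hdf5')

# Aggregations stream over the mapped columns in chunks.
print(df.mean(df.value))
```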
cuDF - GPU DataFrame Library
Pros of cuDF
- Leverages GPU acceleration for faster data processing
- Scales to larger-than-memory datasets when paired with Dask-cuDF
- Provides seamless integration with other RAPIDS ecosystem libraries
Cons of cuDF
- Requires NVIDIA GPU hardware
- Limited compatibility with some pandas functions
- Steeper learning curve for users unfamiliar with GPU computing
Code Comparison
Koalas:
```python
import databricks.koalas as ks

df = ks.read_csv('data.csv')
result = df.groupby('category').agg({'value': 'mean'})
```
cuDF:
```python
import cudf

df = cudf.read_csv('data.csv')
result = df.groupby('category').agg({'value': 'mean'})
```
Key Differences
- Koalas aims to provide a pandas-like API for Apache Spark, focusing on distributed computing
- cudf is designed for GPU-accelerated data processing, offering significant speed improvements for compatible operations
- Koalas integrates well with Spark ecosystem, while cudf works seamlessly with other RAPIDS libraries
- Koalas has a more familiar API for pandas users, whereas cudf may require adjustments to leverage GPU capabilities fully
Both libraries aim to improve data processing performance, but they target different hardware and use cases. Koalas is better suited for distributed computing on CPU clusters, while cudf excels in GPU-accelerated operations on a single machine or GPU cluster.
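Because cuDF mirrors much of the pandas API, moving data between host and GPU is a one-call round trip; a minimal sketch (column names are illustrative):

```python
import cudf
import pandas as pd

pdf = pd.DataFrame({'category': ['a', 'b', 'a'], 'value': [1.0, 2.0, 3.0]})

gdf = cudf.from_pandas(pdf)              # copy host data into GPU memory
result = gdf.groupby('category').mean()  # runs on the GPU

print(result.to_pandas())                # copy the result back to the host
```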
README
DEPRECATED: Koalas supports Apache Spark 3.1 and below, as it was officially included in PySpark as of Apache Spark 3.2. This repository is now in maintenance mode. For Apache Spark 3.2 and above, please use PySpark directly.
pandas API on Apache Spark
Explore Koalas docs »
Live notebook · Issues · Mailing list
Help Thirsty Koalas Devastated by Recent Fires
The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.
pandas is the de facto standard (single-node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing. With this package, you can:
- Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
- Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets), as sketched below.
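A minimal sketch of the single-codebase idea (the function and column names are illustrative): the same function body runs unchanged against a pandas DataFrame or a Koalas DataFrame.

```python
import pandas as pd
import databricks.koalas as ks

def summarize(df):
    # Identical code path for pandas and Koalas DataFrames
    return df.groupby('y').sum()

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b']})

print(summarize(pdf))                  # pandas: runs on a single node
print(summarize(ks.from_pandas(pdf)))  # Koalas: runs distributed on Spark
```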
We would love to have you try it and give us feedback, through our mailing lists or GitHub issues.
Try the Koalas 10 minutes tutorial on a live Jupyter notebook here. The initial launch can take up to several minutes.
Getting Started
Koalas can be installed in many ways such as Conda and pip.
```bash
# Conda
conda install koalas -c conda-forge

# pip
pip install koalas
```
See Installation for more details.
For Databricks Runtime, Koalas is pre-installed in Databricks Runtime 7.1 and above. Try Databricks Community Edition for free. You can also follow these steps to manually install a library on Databricks.
Lastly, if your PyArrow version is 0.15+ and your PySpark version is lower than 3.0, set the ARROW_PRE_0_15_IPC_FORMAT environment variable to 1 manually. Koalas will try its best to set it for you, but it cannot do so if a Spark context has already been launched.
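For example, a script can export the variable at the very top, before any Spark context starts (a minimal sketch):

```python
import os

# Must happen before the JVM/Spark context is launched.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

import databricks.koalas as ks  # import Koalas only after the variable is set
```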
Now you can turn a pandas DataFrame into a Koalas DataFrame that is API-compliant with the former:
```python
import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})

# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']

# Do some operations in place:
df['x2'] = df.x * df.x
```
For more details, see Getting Started and Dependencies in the official documentation.
Contributing Guide
See Contributing Guide and Design Principles in the official documentation.
FAQ
See FAQ in the official documentation.
Best Practices
See Best Practices in the official documentation.
Koalas Talks and Blogs
See Koalas Talks and Blogs in the official documentation.