
databricks/koalas

Koalas: pandas API on Apache Spark


Top Related Projects

  • Modin: Scale your Pandas workflows by changing a single line of code
  • Polars: Dataframes powered by a multithreaded, vectorized query engine, written in Rust
  • Dask: Parallel computing with task scheduling
  • Vaex: Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
  • cuDF: GPU DataFrame Library

Quick Overview

Koalas is an open-source project that provides a pandas-like API on top of Apache Spark. It aims to make it easier for data scientists familiar with pandas to transition to large-scale data processing with Spark by offering a similar interface and functionality.

Pros

  • Familiar pandas-like API for easier adoption by data scientists
  • Seamless integration with Spark's distributed computing capabilities
  • Improved performance for large-scale data processing compared to pandas
  • Supports both Spark DataFrame and pandas DataFrame interoperability

Cons

  • Not all pandas functions are implemented or fully supported
  • Some operations may be slower than native Spark operations
  • Limited support for certain data types and advanced pandas features
  • Potential overhead when working with smaller datasets compared to pandas

Code Examples

  1. Creating a Koalas DataFrame:

import databricks.koalas as ks

df = ks.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)

  2. Reading a CSV file and performing operations:

df = ks.read_csv('data.csv')
result = df.groupby('category').agg({'sales': 'sum'})
print(result)

  3. Converting between Koalas and Spark DataFrames:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(7,), (8,), (9,)], ['C'])

# Importing Koalas adds a to_koalas() method to Spark DataFrames
koalas_df = spark_df.to_koalas()
print(koalas_df)

spark_df_back = koalas_df.to_spark()
print(spark_df_back)

Getting Started

To get started with Koalas, follow these steps:

  1. Install Koalas using pip:

pip install koalas

  2. Import Koalas in your Python script:

import databricks.koalas as ks

  3. Create a Koalas DataFrame or read data:

df = ks.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# or
df = ks.read_csv('data.csv')

  4. Start using Koalas with familiar pandas-like operations:

result = df.groupby('A').sum()
print(result)
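
If the aggregated result is small enough to fit in driver memory, it can be brought back as a plain pandas DataFrame; a minimal sketch:

# to_pandas() collects all rows to the driver, so only use it on small results
pdf = result.to_pandas()
print(type(pdf))  # <class 'pandas.core.frame.DataFrame'>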

Competitor Comparisons

Modin: Scale your Pandas workflows by changing a single line of code

Pros of Modin

  • Seamless integration with existing pandas code, requiring minimal changes
  • Supports both Ray and Dask as execution engines, offering flexibility
  • Generally faster than Koalas for large in-memory datasets, thanks to its Ray- or Dask-based distributed execution

Cons of Modin

  • Less mature and stable compared to Koalas, with some pandas functions not fully implemented
  • Limited support for certain data types and operations that are available in Koalas
  • May require more system resources for optimal performance

Code Comparison

Modin:

import modin.pandas as pd

df = pd.read_csv("large_file.csv")
result = df.groupby("column").mean()

Koalas:

import databricks.koalas as ks

df = ks.read_csv("large_file.csv")
result = df.groupby("column").mean()

Both libraries aim to provide a pandas-like API for distributed computing, but Modin focuses on maintaining full pandas compatibility, while Koalas is designed specifically for Apache Spark integration. Modin's approach allows for easier adoption in existing pandas workflows, whereas Koalas may be more suitable for users already working with Spark environments.
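
As a sketch of the engine flexibility noted above, Modin reads the MODIN_ENGINE environment variable at import time; treat the exact option values as version-dependent:

import os

# Choose the execution engine before the first modin.pandas import;
# "ray" and "dask" are the commonly supported values.
os.environ["MODIN_ENGINE"] = "dask"

import modin.pandas as pd

df = pd.read_csv("large_file.csv")  # now executed on the Dask engine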

Polars: Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Pros of Polars

  • Faster performance due to Rust implementation and efficient memory usage
  • More flexible data manipulation capabilities with lazy evaluation
  • Supports both in-memory and out-of-memory (memory-mapped) operations

Cons of Polars

  • Less mature ecosystem and community support compared to Koalas
  • Steeper learning curve for users familiar with pandas/Koalas syntax
  • Limited integration with big data frameworks like Apache Spark

Code Comparison

Koalas:

import databricks.koalas as ks

df = ks.read_csv("data.csv")
result = df.groupby("category").agg({"value": "mean"})

Polars:

import polars as pl

df = pl.read_csv("data.csv")
result = df.groupby("category").agg(pl.col("value").mean())
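
The lazy-evaluation advantage listed above can be sketched with Polars' lazy API (method names have shifted across Polars versions, so treat this as illustrative):

import polars as pl

# scan_csv builds a lazy query plan; nothing is read or computed yet
lazy_result = (
    pl.scan_csv("data.csv")
      .groupby("category")
      .agg(pl.col("value").mean())
)

# collect() runs the optimized plan and materializes the result
result = lazy_result.collect()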

Both libraries aim to provide DataFrame functionality similar to pandas, but with different underlying implementations and performance characteristics. Koalas focuses on Spark integration and distributed computing, while Polars emphasizes speed and memory efficiency through its Rust implementation. The choice between them depends on specific use cases, performance requirements, and existing infrastructure.

Dask: Parallel computing with task scheduling

Pros of Dask

  • More flexible and can handle a wider variety of data processing tasks beyond just DataFrame operations
  • Better suited for large-scale distributed computing across clusters
  • Integrates well with other Python libraries in the scientific computing ecosystem

Cons of Dask

  • Steeper learning curve compared to Koalas, especially for users familiar with pandas
  • Less direct compatibility with Apache Spark ecosystem
  • May require more manual optimization for certain operations

Code Comparison

Koalas:

import databricks.koalas as ks

df = ks.read_csv('large_file.csv')
result = df.groupby('category').agg({'value': 'mean'})

Dask:

import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
result = df.groupby('category').agg({'value': 'mean'}).compute()

Both libraries aim to provide DataFrame functionality for large datasets, but Dask offers a more comprehensive distributed computing framework, while Koalas focuses on providing a pandas-like API for Spark operations.
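
Dask's flexibility beyond DataFrames can be sketched with dask.delayed, which turns ordinary Python functions into a lazy task graph (file names here are hypothetical):

from dask import delayed

@delayed
def load(path):
    return open(path).read()

@delayed
def count_lines(text):
    return text.count("\n")

# Nothing runs yet; summing delayed values builds a task graph
total = sum(count_lines(load(p)) for p in ["a.txt", "b.txt"])

# compute() executes the graph, scheduling tasks in parallel
print(total.compute())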

Vaex: Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Pros of Vaex

  • Designed for handling large datasets (up to 1 billion rows) efficiently
  • Supports out-of-core computing, allowing processing of data larger than RAM
  • Offers advanced visualization capabilities for big data exploration

Cons of Vaex

  • Less compatibility with existing Pandas code and ecosystem
  • Smaller community and fewer third-party integrations compared to Koalas
  • Steeper learning curve for users familiar with Pandas

Code Comparison

Vaex:

import vaex
df = vaex.from_csv('large_dataset.csv')
result = df.groupby('category').agg({'value': 'mean'})

Koalas:

import databricks.koalas as ks
df = ks.read_csv('large_dataset.csv')
result = df.groupby('category')['value'].mean()

Both Vaex and Koalas aim to provide solutions for working with large datasets, but they take different approaches. Vaex focuses on out-of-core computing and efficient handling of massive datasets, while Koalas emphasizes compatibility with the Pandas API and integration with Apache Spark. The choice between the two depends on specific use cases, dataset sizes, and existing infrastructure.
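
Vaex's out-of-core model can be sketched as follows: vaex.open memory-maps column-oriented files such as HDF5 instead of loading them, and expressions become lazily evaluated virtual columns (file and column names here are hypothetical):

import numpy as np
import vaex

# Memory-maps the file; no data is loaded into RAM up front
df = vaex.open('big_dataset.hdf5')

# A virtual column: defined by an expression, computed on the fly
df['log_value'] = np.log(df.value)

# Aggregations stream over the data in chunks
print(df.log_value.mean())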

cuDF - GPU DataFrame Library

Pros of cudf

  • Leverages GPU acceleration for faster data processing
  • Scales beyond a single GPU's memory when paired with Dask-cuDF
  • Provides seamless integration with other RAPIDS ecosystem libraries

Cons of cudf

  • Requires NVIDIA GPU hardware
  • Limited compatibility with some pandas functions
  • Steeper learning curve for users unfamiliar with GPU computing

Code Comparison

Koalas:

import databricks.koalas as ks

df = ks.read_csv('data.csv')
result = df.groupby('category').agg({'value': 'mean'})

cudf:

import cudf

df = cudf.read_csv('data.csv')
result = df.groupby('category').agg({'value': 'mean'})

Key Differences

  • Koalas aims to provide a pandas-like API for Apache Spark, focusing on distributed computing
  • cudf is designed for GPU-accelerated data processing, offering significant speed improvements for compatible operations
  • Koalas integrates well with Spark ecosystem, while cudf works seamlessly with other RAPIDS libraries
  • Koalas has a more familiar API for pandas users, whereas cudf may require adjustments to leverage GPU capabilities fully

Both libraries aim to improve data processing performance, but they target different hardware and use cases. Koalas is better suited for distributed computing on CPU clusters, while cudf excels in GPU-accelerated operations on a single machine or GPU cluster.
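
For moving data between the CPU and GPU, a minimal sketch using cuDF's pandas interop (requires an NVIDIA GPU with cuDF installed):

import pandas as pd
import cudf

pdf = pd.DataFrame({'category': ['a', 'b', 'a'], 'value': [1.0, 2.0, 3.0]})

# Copy the pandas DataFrame into GPU memory
gdf = cudf.from_pandas(pdf)
result = gdf.groupby('category').agg({'value': 'mean'})

# Bring the (small) result back to the host as a pandas DataFrame
print(result.to_pandas())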


README

DEPRECATED: Koalas supports Apache Spark 3.1 and below, as it has been officially included in PySpark since Apache Spark 3.2. This repository is now in maintenance mode. For Apache Spark 3.2 and above, please use PySpark directly.

pandas API on Apache Spark
Explore Koalas docs »

Live notebook · Issues · Mailing list
Help Thirsty Koalas Devastated by Recent Fires

The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.

pandas is the de facto standard (single-node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing. With this package, you can:

  • Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
  • Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).
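
To illustrate the single-codebase point, the same function can accept either a pandas or a Koalas DataFrame, since both expose the pandas API (a minimal sketch):

import pandas as pd
import databricks.koalas as ks

def summarize(df):
    # Identical code path for a local pandas frame and a distributed Koalas frame
    return df.groupby('y').sum()

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b']})
print(summarize(pdf))                   # runs locally on pandas
print(summarize(ks.from_pandas(pdf)))   # runs on Spark via Koalas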

We would love to have you try it and give us feedback, through our mailing lists or GitHub issues.

Try the Koalas 10 minutes tutorial on a live Jupyter notebook here. The initial launch can take up to several minutes.


Getting Started

Koalas can be installed in many ways such as Conda and pip.

# Conda
conda install koalas -c conda-forge
# pip
pip install koalas

See Installation for more details.

For Databricks Runtime, Koalas is pre-installed in Databricks Runtime 7.1 and above. Try Databricks Community Edition for free. You can also follow these steps to manually install a library on Databricks.

Lastly, if your PyArrow version is 0.15+ and your PySpark version is lower than 3.0, you should set the ARROW_PRE_0_15_IPC_FORMAT environment variable to 1 manually. Koalas will try to set it for you, but it cannot do so once a Spark context has already been launched.
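
A minimal way to set the variable before any Spark context starts (the variable name comes from the note above):

import os

# Must be set before the JVM / Spark context is launched
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

import databricks.koalas as ks  # safe to import and use Koalas afterwards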

Now you can turn a pandas DataFrame into a Koalas DataFrame that is API-compliant with the former:

import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})

# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']

# Do some operations in place:
df['x2'] = df.x * df.x

For more details, see Getting Started and Dependencies in the official documentation.

Contributing Guide

See Contributing Guide and Design Principles in the official documentation.

FAQ

See FAQ in the official documentation.

Best Practices

See Best Practices in the official documentation.

Koalas Talks and Blogs

See Koalas Talks and Blogs in the official documentation.