polars
Dataframes powered by a multithreaded, vectorized query engine, written in Rust
Top Related Projects
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Dask: Parallel computing with task scheduling
Modin: Scale your Pandas workflows by changing a single line of code
Vaex: Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
datatable: A Python package for manipulating 2-dimensional tabular data structures
Quick Overview
Polars is a fast, efficient, and expressive DataFrame library for Rust and Python. It leverages Arrow's columnar format for high-performance data processing and analysis, offering a powerful alternative to pandas for large-scale data manipulation tasks.
Pros
- Extremely fast performance, often outperforming pandas and other DataFrame libraries
- Memory-efficient due to its use of Apache Arrow's columnar format
- Supports both eager and lazy execution modes for flexibility in data processing
- Provides a rich API with intuitive methods for data manipulation and analysis
Cons
- Steeper learning curve compared to pandas, especially for those new to Rust
- Smaller ecosystem and fewer third-party integrations compared to more established libraries
- Documentation, while improving, can be less comprehensive than more mature projects
- Some advanced features found in pandas may not be available or implemented differently
Code Examples
- Creating a DataFrame and performing basic operations:
import polars as pl
df = pl.DataFrame({
"A": [1, 2, 3, 4, 5],
"B": ["a", "b", "c", "d", "e"],
"C": [1.1, 2.2, 3.3, 4.4, 5.5]
})
filtered_df = df.filter(pl.col("A") > 2).select(["A", "B"])
print(filtered_df)
- Using lazy execution for complex operations:
import polars as pl
df = pl.scan_csv("large_file.csv")
result = (
df.filter(pl.col("age") > 30)
.group_by("city")
.agg([
pl.col("salary").mean().alias("avg_salary"),
pl.col("age").max().alias("max_age")
])
.sort("avg_salary", descending=True)
.collect()
)
print(result)
- Performing time series operations:
import polars as pl
import numpy as np
from datetime import date

df = pl.DataFrame({
    "date": pl.date_range(date(2023, 1, 1), date(2023, 12, 31), interval="1d", eager=True),
    "value": np.random.randn(365),  # 365 days in 2023
})
monthly_avg = df.sort("date").group_by_dynamic("date", every="1mo").agg(
    pl.col("value").mean().alias("monthly_avg")
)
print(monthly_avg)
Getting Started
To get started with Polars in Python:
1. Install Polars:
pip install polars
2. Import Polars in your Python script:
import polars as pl
3. Create a DataFrame and start working with data:
df = pl.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "b", "c", "d", "e"]})
print(df)
For more advanced usage and detailed documentation, visit the official Polars documentation at https://pola-rs.github.io/polars-book/
Competitor Comparisons
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Pros of Arrow
- Broader ecosystem support across multiple programming languages
- More mature project with longer development history
- Extensive documentation and community resources
Cons of Arrow
- Steeper learning curve for beginners
- Can be more complex to set up and use for simple data tasks
- Larger codebase and dependencies
Code Comparison
Arrow (Python):
import pyarrow as pa
data = [
pa.array([1, 2, 3, 4]),
pa.array(['a', 'b', 'c', 'd'])
]
table = pa.Table.from_arrays(data, names=['numbers', 'letters'])
Polars (Rust):
use polars::prelude::*;
let df = df! [
"numbers" => [1, 2, 3, 4],
"letters" => ["a", "b", "c", "d"]
].unwrap();
Both Arrow and Polars are powerful data processing libraries, but they serve different purposes. Arrow focuses on providing a standardized columnar memory format and interoperability across languages, while Polars is a high-performance DataFrame library built on top of Arrow. Polars offers a more user-friendly API for data manipulation tasks, making it easier for developers to work with tabular data efficiently. However, Arrow's broader ecosystem and language support make it a better choice for projects requiring cross-language data exchange or integration with various big data tools.
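Because Polars stores its data in Arrow memory already, moving a table between the two libraries is cheap. A minimal sketch of a round trip through a pyarrow Table (requires pyarrow to be installed; to_arrow and from_arrow are part of the public Polars API):

import polars as pl

df = pl.DataFrame({"numbers": [1, 2, 3, 4], "letters": ["a", "b", "c", "d"]})

# Polars -> Arrow: typically zero-copy, since the memory layout is already Arrow
table = df.to_arrow()

# ...hand the table to any Arrow-aware tool, then convert back
df2 = pl.from_arrow(table)
print(df2)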
pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Pros of pandas
- Extensive documentation and large community support
- Wide range of built-in functions for data manipulation and analysis
- Seamless integration with other Python libraries in the data science ecosystem
Cons of pandas
- Slower performance, especially for large datasets
- Higher memory usage due to its design and implementation in Python
- Less efficient handling of null values compared to Polars
Code Comparison
pandas
import pandas as pd
df = pd.read_csv("data.csv")
result = df.groupby("category").agg({"value": ["mean", "sum"]})
Polars
import polars as pl
df = pl.read_csv("data.csv")
result = df.group_by("category").agg([
pl.col("value").mean(),
pl.col("value").sum()
])
The code comparison shows that both libraries have similar syntax for basic operations, but Polars tends to be more explicit in column selection and aggregation functions. Polars is designed to be more memory-efficient and faster, especially for larger datasets, while pandas offers a wider range of built-in functions and better integration with the Python data science ecosystem.
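The two libraries also interoperate directly, which makes incremental adoption possible. A minimal sketch of handing data between pandas and Polars (both conversions require pyarrow):

import pandas as pd
import polars as pl

pdf = pd.DataFrame({"category": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})

# pandas -> Polars
df = pl.from_pandas(pdf)

# do the aggregation in Polars, then hand the result back to pandas
result = df.group_by("category").agg(pl.col("value").mean())
print(result.to_pandas())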
Dask: Parallel computing with task scheduling
Pros of Dask
- Seamless integration with the PyData ecosystem (NumPy, Pandas)
- Built-in support for distributed computing and scaling
- Flexible task scheduling for complex workflows
Cons of Dask
- Generally slower performance for single-machine operations
- Higher memory usage compared to more optimized solutions
- Steeper learning curve for advanced features
Code Comparison
Dask:
import dask.dataframe as dd
df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()
Polars:
import polars as pl
df = pl.read_csv('large_file.csv')
result = df.group_by('column').mean()
Both Dask and Polars are powerful data processing libraries, but they have different strengths. Dask excels in distributed computing and integrating with existing PyData tools, while Polars focuses on high-performance operations on a single machine. Dask's syntax is more familiar to Pandas users, but Polars often provides better performance for local data processing tasks. The choice between them depends on specific use cases, data sizes, and computational requirements.
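As a rough sketch of Polars' single-machine counterpart to Dask's out-of-core processing, the same query can be written lazily and collected with the streaming engine (the numeric 'value' column here is a hypothetical addition):

import polars as pl

result = (
    pl.scan_csv('large_file.csv')      # lazy: the file is not read yet
    .group_by('column')
    .agg(pl.col('value').mean())       # hypothetical numeric column
    .collect(streaming=True)           # execute in RAM-sized batches
)
print(result)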
Modin: Scale your Pandas workflows by changing a single line of code
Pros of Modin
- Seamless integration with existing pandas code, requiring minimal changes
- Distributed computing capabilities for handling large datasets
- Supports multiple execution engines (Ray, Dask, etc.) for flexibility
Cons of Modin
- Performance improvements may be less significant for smaller datasets
- Not all pandas functions are fully optimized or supported
- Potential overhead for small operations due to distributed nature
Code Comparison
Modin:
import modin.pandas as pd
df = pd.read_csv("large_file.csv")
result = df.groupby("column").mean()
Polars:
import polars as pl
df = pl.read_csv("large_file.csv")
result = df.group_by("column").mean()
Key Differences
- Modin aims to be a drop-in replacement for pandas, while Polars is a separate DataFrame library
- Polars is generally faster for single-machine operations, especially on larger datasets
- Modin focuses on distributed computing, while Polars emphasizes single-machine performance
- Polars has a more modern API design, while Modin maintains pandas compatibility
Both libraries offer improved performance over pandas for large datasets, but they take different approaches. Modin is ideal for those looking to scale existing pandas code, while Polars may be better for new projects or those willing to adapt to a new API for better performance.
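Modin's engine flexibility mentioned above is itself a one-line configuration. A minimal sketch, assuming the Dask backend is installed:

import modin.config as cfg
cfg.Engine.put("dask")  # or "ray"; set before the first DataFrame operation

import modin.pandas as pd

df = pd.read_csv("large_file.csv")  # hypothetical file
result = df.groupby("column").mean()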
Vaex: Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Pros of Vaex
- Specialized in handling large datasets (>1 billion rows) efficiently
- Supports lazy evaluation and out-of-core processing
- Offers advanced visualization capabilities
Cons of Vaex
- Slower performance for smaller datasets compared to Polars
- Less extensive data manipulation functionality
- Smaller community and ecosystem
Code Comparison
Vaex:
import vaex
df = vaex.from_csv('data.csv')
result = df[df.age > 30].groupby('category').agg({'value': 'mean'})
Polars:
import polars as pl
df = pl.read_csv('data.csv')
result = df.filter(pl.col('age') > 30).group_by('category').agg(pl.col('value').mean())
Both Vaex and Polars are data manipulation libraries designed to handle large datasets efficiently. Vaex excels at processing extremely large datasets and offers advanced visualization features. However, Polars generally provides faster performance for smaller to medium-sized datasets and has a more comprehensive set of data manipulation functions. The code comparison shows that both libraries have similar syntax for basic operations, but Polars tends to be more concise and intuitive for common data manipulation tasks.
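Vaex's lazy evaluation is most visible in its virtual columns: an expression is recorded rather than materialized, and is only computed, chunk by chunk, when a result is requested. A minimal sketch, assuming data.csv contains numeric columns x and y:

import vaex

df = vaex.from_csv('data.csv')

# A virtual column: the expression is stored, no new array is allocated
df['ratio'] = df.x / df.y

# The division runs lazily, in chunks, only when the mean is requested
print(df['ratio'].mean())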
datatable: A Python package for manipulating 2-dimensional tabular data structures
Pros of datatable
- Written in C++, potentially offering better performance for certain operations
- Supports out-of-memory processing, allowing for handling datasets larger than available RAM
- Provides a Python API that closely mimics pandas, easing the transition for pandas users
Cons of datatable
- Less actively maintained compared to Polars
- Smaller community and ecosystem
- Limited support for advanced data manipulation operations
Code Comparison
datatable:
import datatable as dt
from datatable import f  # f references columns inside datatable expressions
df = dt.fread("data.csv")
result = df[f.A > 0, :]
Polars:
import polars as pl
df = pl.read_csv("data.csv")
result = df.filter(pl.col("A") > 0)
Both libraries offer similar functionality for basic operations, but Polars generally provides a more expressive and flexible API for complex data manipulations. Polars also has better support for lazy evaluation and query optimization, which can lead to improved performance for large datasets and complex operations.
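This difference is easy to inspect in Polars: build a LazyFrame and ask for the plan the optimizer will actually run. A minimal sketch (data.csv and column A are placeholders):

import polars as pl

lazy = (
    pl.scan_csv("data.csv")     # lazy scan: the file is not read yet
    .filter(pl.col("A") > 0)
    .select("A")
)

# Show the optimized plan; the filter and projection are pushed into the scan
print(lazy.explain())

result = lazy.collect()         # only collect() triggers I/O and compute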
While datatable has some unique features like out-of-memory processing, Polars has gained significant popularity due to its performance, ease of use, and active development. The choice between the two may depend on specific use cases and requirements.
README
Documentation: Python - Rust - Node.js - R | StackOverflow: Python - Rust - Node.js - R | User guide | Discord
Polars: Blazingly fast DataFrames in Rust, Python, Node.js, R, and SQL
Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.
- Lazy | eager execution
- Multi-threaded
- SIMD
- Query optimization
- Powerful expression API
- Hybrid Streaming (larger-than-RAM datasets)
- Rust | Python | NodeJS | R | ...
To learn more, read the user guide.
Python
>>> import polars as pl
>>> df = pl.DataFrame(
... {
... "A": [1, 2, 3, 4, 5],
... "fruits": ["banana", "banana", "apple", "apple", "banana"],
... "B": [5, 4, 3, 2, 1],
... "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
... }
... )
# embarrassingly parallel execution & very expressive query language
>>> df.sort("fruits").select(
... "fruits",
... "cars",
... pl.lit("fruits").alias("literal_string_fruits"),
... pl.col("B").filter(pl.col("cars") == "beetle").sum(),
... pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),
... pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
... pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),
... pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),
... )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        ┆ _by_fruits  │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘
SQL
>>> df = pl.scan_csv("docs/data/iris.csv")
>>> ## OPTION 1
>>> # run SQL queries on frame-level
>>> df.sql("""
... SELECT species,
... AVG(sepal_length) AS avg_sepal_length
... FROM self
... GROUP BY species
... """).collect()
shape: (3, 2)
┌────────────┬──────────────────┐
│ species    ┆ avg_sepal_length │
│ ---        ┆ ---              │
│ str        ┆ f64              │
╞════════════╪══════════════════╡
│ Virginica  ┆ 6.588            │
│ Versicolor ┆ 5.936            │
│ Setosa     ┆ 5.006            │
└────────────┴──────────────────┘
>>> ## OPTION 2
>>> # use pl.sql() to operate on the global context
>>> df2 = pl.LazyFrame({
... "species": ["Setosa", "Versicolor", "Virginica"],
... "blooming_season": ["Spring", "Summer", "Fall"]
...})
>>> pl.sql("""
... SELECT df.species,
... AVG(df.sepal_length) AS avg_sepal_length,
... df2.blooming_season
... FROM df
... LEFT JOIN df2 ON df.species = df2.species
... GROUP BY df.species, df2.blooming_season
... """).collect()
SQL commands can also be run directly from your terminal using the Polars CLI:
# run an inline SQL query
> polars -c "SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/data/iris.csv') GROUP BY species;"
# run interactively
> polars
Polars CLI v0.3.0
Type .help for help.
> SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/data/iris.csv') GROUP BY species;
Refer to the Polars CLI repository for more information.
Performance 🚀🚀
Blazingly fast
Polars is very fast. In fact, it is one of the best performing solutions available. See the PDS-H benchmark results.
Lightweight
Polars is also very lightweight. It comes with zero required dependencies, and this shows in the import times:
- polars: 70ms
- numpy: 104ms
- pandas: 520ms
Handles larger-than-RAM data
If you have data that does not fit into memory, Polars' query engine is able to process your query (or parts of your query) in a streaming fashion.
This drastically reduces memory requirements, so you might be able to process your 250GB dataset on your laptop.
Use collect(streaming=True) to run the query in streaming mode. (This might be a little slower, but it is still very fast!)
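As a rough sketch, assuming a CSV far larger than RAM with a hypothetical numeric "value" column:

import polars as pl

total = (
    pl.scan_csv("very_large_file.csv")   # only metadata is read up front
    .select(pl.col("value").sum())       # hypothetical numeric column
    .collect(streaming=True)             # processed in batches, not all at once
)
print(total)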
Setup
Python
Install the latest Polars version with:
pip install polars
We also have a conda package (conda install -c conda-forge polars), however pip is the preferred way to install Polars.
Install Polars with all optional dependencies.
pip install 'polars[all]'
You can also install a subset of all optional dependencies.
pip install 'polars[numpy,pandas,pyarrow]'
See the User Guide for more details on optional dependencies.
To see the current Polars version and a full list of its optional dependencies, run:
pl.show_versions()
Releases happen quite often (weekly / every few days) at the moment, so updating Polars regularly to get the latest bugfixes / features might not be a bad idea.
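For example:
pip install --upgrade polars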
Rust
You can take the latest release from crates.io, or if you want to use the latest features / performance improvements, point to the main branch of this repo:

polars = { git = "https://github.com/pola-rs/polars", rev = "<optional git tag>" }

Requires Rust version >=1.80.
Contributing
Want to contribute? Read our contributing guide.
Python: compile Polars from source
If you want a bleeding edge release or maximal performance you should compile Polars from source.
This can be done by going through the following steps in sequence:
1. Install the latest Rust compiler
2. Install maturin:
pip install maturin
3. cd py-polars and choose one of the following:
- make build-release, fastest binary, very long compile times
- make build-opt, fast binary with debug symbols, long compile times
- make build-debug-opt, medium-speed binary with debug assertions and symbols, medium compile times
- make build, slow binary with debug assertions and symbols, fast compile times

Append -native (e.g. make build-release-native) to enable further optimizations specific to your CPU. This produces a non-portable binary/wheel, however.

Note that the Rust crate implementing the Python bindings is called py-polars to distinguish it from the wrapped Rust crate polars itself. However, both the Python package and the Python module are named polars, so you can pip install polars and import polars.
Using custom Rust functions in Python
Extending Polars with UDFs compiled in Rust is easy. We expose PyO3 extensions for DataFrame and Series data structures. See more in https://github.com/pola-rs/pyo3-polars.
Going big...
Do you expect more than 2^32 (~4.2 billion) rows? Compile Polars with the bigidx feature flag or, for Python users, install pip install polars-u64-idx.

Don't use this unless you hit the row boundary, as the default build of Polars is faster and consumes less memory.
Legacy
Do you want Polars to run on an old CPU (e.g. dating from before 2011), or on an x86-64 build of Python on Apple Silicon under Rosetta? Install pip install polars-lts-cpu. This version of Polars is compiled without AVX target features.
Sponsors