polars
Dataframes powered by a multithreaded, vectorized query engine, written in Rust
Top Related Projects
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Parallel computing with task scheduling
Modin: Scale your Pandas workflows by changing a single line of code
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
A Python package for manipulating 2-dimensional tabular data structures
Quick Overview
Polars is a fast, efficient, and expressive DataFrame library for Rust and Python. It leverages Arrow's columnar format for high-performance data processing and analysis, offering a powerful alternative to pandas for large-scale data manipulation tasks.
Pros
- Extremely fast performance, often outperforming pandas and other DataFrame libraries
- Memory-efficient due to its use of Apache Arrow's columnar format
- Supports both eager and lazy execution modes for flexibility in data processing
- Provides a rich API with intuitive methods for data manipulation and analysis
Cons
- Steeper learning curve compared to pandas, especially for those unfamiliar with its expression-based API
- Smaller ecosystem and fewer third-party integrations compared to more established libraries
- Documentation, while improving, can be less comprehensive than more mature projects
- Some advanced features found in pandas may not be available or implemented differently
Code Examples
- Creating a DataFrame and performing basic operations:
import polars as pl
df = pl.DataFrame({
"A": [1, 2, 3, 4, 5],
"B": ["a", "b", "c", "d", "e"],
"C": [1.1, 2.2, 3.3, 4.4, 5.5]
})
filtered_df = df.filter(pl.col("A") > 2).select(["A", "B"])
print(filtered_df)
- Using lazy execution for complex operations:
import polars as pl
df = pl.scan_csv("large_file.csv")
result = (
df.filter(pl.col("age") > 30)
.group_by("city")
.agg([
pl.col("salary").mean().alias("avg_salary"),
pl.col("age").max().alias("max_age")
])
.sort("avg_salary", descending=True)
.collect()
)
print(result)
- Performing time series operations:
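import random
from datetime import date
import polars as pl
df = pl.DataFrame({
# eager=True returns a Series of dates instead of an expression
"date": pl.date_range(date(2023, 1, 1), date(2023, 12, 31), interval="1d", eager=True),
# one standard-normal sample per day of 2023
"value": [random.gauss(0, 1) for _ in range(365)]
})
monthly_avg = df.group_by_dynamic("date", every="1mo").agg(
pl.col("value").mean().alias("monthly_avg")
)
print(monthly_avg)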
Getting Started
To get started with Polars in Python:
- Install Polars:
pip install polars
- Import Polars in your Python script:
import polars as pl
- Create a DataFrame and start working with data:
df = pl.DataFrame({ "A": [1, 2, 3, 4, 5], "B": ["a", "b", "c", "d", "e"] })
print(df)
For more advanced usage and detailed documentation, visit the official Polars documentation at https://pola-rs.github.io/polars-book/
Competitor Comparisons
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Pros of Arrow
- Broader ecosystem support across multiple programming languages
- More mature project with longer development history
- Extensive documentation and community resources
Cons of Arrow
- Steeper learning curve for beginners
- Can be more complex to set up and use for simple data tasks
- Larger codebase and dependencies
Code Comparison
Arrow (Python):
import pyarrow as pa
data = [
pa.array([1, 2, 3, 4]),
pa.array(['a', 'b', 'c', 'd'])
]
table = pa.Table.from_arrays(data, names=['numbers', 'letters'])
Polars (Rust):
use polars::prelude::*;
let df = df! [
"numbers" => [1, 2, 3, 4],
"letters" => ["a", "b", "c", "d"]
].unwrap();
Both Arrow and Polars are powerful data processing libraries, but they serve different purposes. Arrow focuses on providing a standardized columnar memory format and interoperability across languages, while Polars is a high-performance DataFrame library built on top of Arrow. Polars offers a more user-friendly API for data manipulation tasks, making it easier for developers to work with tabular data efficiently. However, Arrow's broader ecosystem and language support make it a better choice for projects requiring cross-language data exchange or integration with various big data tools.
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Pros of pandas
- Extensive documentation and large community support
- Wide range of built-in functions for data manipulation and analysis
- Seamless integration with other Python libraries in the data science ecosystem
Cons of pandas
- Slower performance, especially for large datasets
- Higher memory usage due to its design and implementation in Python
- Less efficient handling of null values compared to Polars
Code Comparison
pandas
import pandas as pd
df = pd.read_csv("data.csv")
result = df.groupby("category").agg({"value": ["mean", "sum"]})
Polars
import polars as pl
df = pl.read_csv("data.csv")
result = df.group_by("category").agg([
pl.col("value").mean(),
pl.col("value").sum()
])
The code comparison shows that both libraries have similar syntax for basic operations, but Polars tends to be more explicit in column selection and aggregation functions. Polars is designed to be more memory-efficient and faster, especially for larger datasets, while pandas offers a wider range of built-in functions and better integration with the Python data science ecosystem.
Parallel computing with task scheduling
Pros of Dask
- Seamless integration with the PyData ecosystem (NumPy, Pandas)
- Built-in support for distributed computing and scaling
- Flexible task scheduling for complex workflows
Cons of Dask
- Generally slower performance for single-machine operations
- Higher memory usage compared to more optimized solutions
- Steeper learning curve for advanced features
Code Comparison
Dask:
import dask.dataframe as dd
df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()
Polars:
import polars as pl
df = pl.read_csv('large_file.csv')
result = df.group_by('column').mean()
Both Dask and Polars are powerful data processing libraries, but they have different strengths. Dask excels in distributed computing and integrating with existing PyData tools, while Polars focuses on high-performance operations on a single machine. Dask's syntax is more familiar to Pandas users, but Polars often provides better performance for local data processing tasks. The choice between them depends on specific use cases, data sizes, and computational requirements.
Modin: Scale your Pandas workflows by changing a single line of code
Pros of Modin
- Seamless integration with existing pandas code, requiring minimal changes
- Distributed computing capabilities for handling large datasets
- Supports multiple execution engines (Ray, Dask, etc.) for flexibility
Cons of Modin
- Performance improvements may be less significant for smaller datasets
- Not all pandas functions are fully optimized or supported
- Potential overhead for small operations due to distributed nature
Code Comparison
Modin:
import modin.pandas as pd
df = pd.read_csv("large_file.csv")
result = df.groupby("column").mean()
Polars:
import polars as pl
df = pl.read_csv("large_file.csv")
result = df.group_by("column").mean()
Key Differences
- Modin aims to be a drop-in replacement for pandas, while Polars is a separate DataFrame library
- Polars is generally faster for single-machine operations, especially on larger datasets
- Modin focuses on distributed computing, while Polars emphasizes single-machine performance
- Polars has a more modern API design, while Modin maintains pandas compatibility
Both libraries offer improved performance over pandas for large datasets, but they take different approaches. Modin is ideal for those looking to scale existing pandas code, while Polars may be better for new projects or those willing to adapt to a new API for better performance.
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Pros of Vaex
- Specialized in handling large datasets (>1 billion rows) efficiently
- Supports lazy evaluation and out-of-core processing
- Offers advanced visualization capabilities
Cons of Vaex
- Slower performance for smaller datasets compared to Polars
- Less extensive data manipulation functionality
- Smaller community and ecosystem
Code Comparison
Vaex:
import vaex
df = vaex.from_csv('data.csv')
result = df[df.age > 30].groupby('category').agg({'value': 'mean'})
Polars:
import polars as pl
df = pl.read_csv('data.csv')
result = df.filter(pl.col('age') > 30).group_by('category').agg(pl.col('value').mean())
Both Vaex and Polars are data manipulation libraries designed to handle large datasets efficiently. Vaex excels at processing extremely large datasets and offers advanced visualization features. However, Polars generally provides faster performance for smaller to medium-sized datasets and has a more comprehensive set of data manipulation functions. The code comparison shows that both libraries have similar syntax for basic operations, but Polars tends to be more concise and intuitive for common data manipulation tasks.
A Python package for manipulating 2-dimensional tabular data structures
Pros of datatable
- Written in C++, potentially offering better performance for certain operations
- Supports out-of-memory processing, allowing for handling datasets larger than available RAM
- Provides a Python API modeled on R's data.table, easing the transition for data.table users
Cons of datatable
- Less actively maintained compared to Polars
- Smaller community and ecosystem
- Limited support for advanced data manipulation operations
Code Comparison
datatable:
import datatable as dt
from datatable import f
df = dt.fread("data.csv")
result = df[f.A > 0, :]
Polars:
import polars as pl
df = pl.read_csv("data.csv")
result = df.filter(pl.col("A") > 0)
Both libraries offer similar functionality for basic operations, but Polars generally provides a more expressive and flexible API for complex data manipulations. Polars also has better support for lazy evaluation and query optimization, which can lead to improved performance for large datasets and complex operations.
While datatable has some unique features like out-of-memory processing, Polars has gained significant popularity due to its performance, ease of use, and active development. The choice between the two may depend on specific use cases and requirements.
README
Documentation: Python - Rust - Node.js - R | StackOverflow: Python - Rust - Node.js - R | User guide | Discord
Polars: Blazingly fast DataFrames in Rust, Python, Node.js, R, and SQL
Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.
- Lazy | eager execution
- Multi-threaded
- SIMD
- Query optimization
- Powerful expression API
- Hybrid Streaming (larger-than-RAM datasets)
- Rust | Python | NodeJS | R | ...
To learn more, read the user guide.
Python
>>> import polars as pl
>>> df = pl.DataFrame(
... {
... "A": [1, 2, 3, 4, 5],
... "fruits": ["banana", "banana", "apple", "apple", "banana"],
... "B": [5, 4, 3, 2, 1],
... "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
... }
... )
# embarrassingly parallel execution & very expressive query language
>>> df.sort("fruits").select(
... "fruits",
... "cars",
... pl.lit("fruits").alias("literal_string_fruits"),
... pl.col("B").filter(pl.col("cars") == "beetle").sum(),
... pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),
... pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
... pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),
... pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),
... )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        ┆ _by_fruits  │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘
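To make the window expressions in the query above more concrete, here is a rough pure-Python sketch of what pl.col("A").sum().over("fruits") computes: a per-group sum that is broadcast back to every row. The helper name sum_over is hypothetical and this is not how Polars implements windows; it only mimics the semantics on the sample data from the example.

```python
from collections import defaultdict

def sum_over(values, groups):
    """Per-group sum, repeated for every row of that group (window-style)."""
    totals = defaultdict(int)
    for value, group in zip(values, groups):
        totals[group] += value
    # One output value per input row, in the original row order.
    return [totals[group] for group in groups]

# Columns "A" and "fruits" from the DataFrame defined in the example.
A = [1, 2, 3, 4, 5]
fruits = ["banana", "banana", "apple", "apple", "banana"]

sum_a_by_fruits = sum_over(A, fruits)
print(sum_a_by_fruits)
```

Unlike a group-by aggregation, the result keeps one value per input row: the apple rows get 7 and the banana rows get 8, which is what the sum_A_by_fruits column in the output shows.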
SQL
>>> df = pl.scan_csv("docs/assets/data/iris.csv")
>>> ## OPTION 1
>>> # run SQL queries on frame-level
>>> df.sql("""
... SELECT species,
... AVG(sepal_length) AS avg_sepal_length
... FROM self
... GROUP BY species
... """).collect()
shape: (3, 2)
┌────────────┬──────────────────┐
│ species    ┆ avg_sepal_length │
│ ---        ┆ ---              │
│ str        ┆ f64              │
╞════════════╪══════════════════╡
│ Virginica  ┆ 6.588            │
│ Versicolor ┆ 5.936            │
│ Setosa     ┆ 5.006            │
└────────────┴──────────────────┘
>>> ## OPTION 2
>>> # use pl.sql() to operate on the global context
>>> df2 = pl.LazyFrame({
... "species": ["Setosa", "Versicolor", "Virginica"],
... "blooming_season": ["Spring", "Summer", "Fall"]
... })
>>> pl.sql("""
... SELECT df.species,
... AVG(df.sepal_length) AS avg_sepal_length,
... df2.blooming_season
... FROM df
... LEFT JOIN df2 ON df.species = df2.species
... GROUP BY df.species, df2.blooming_season
... """).collect()
SQL commands can also be run directly from your terminal using the Polars CLI:
# run an inline SQL query
> polars -c "SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/assets/data/iris.csv') GROUP BY species;"
# run interactively
> polars
Polars CLI v0.3.0
Type .help for help.
> SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/assets/data/iris.csv') GROUP BY species;
Refer to the Polars CLI repository for more information.
Performance 🚀🚀
Blazingly fast
Polars is very fast. In fact, it is one of the best performing solutions available. See the PDS-H benchmark results.
Lightweight
Polars is also very lightweight. It comes with zero required dependencies, and this shows in the import times:
- polars: 70ms
- numpy: 104ms
- pandas: 520ms
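Import times like these depend on the machine and on what is already cached, so it can be worth measuring them locally. The sketch below is one simple way to do that with the standard library; time_import is a hypothetical helper, and the module names in the loop are stand-ins that can be swapped for "polars", "numpy", or "pandas" if those are installed.

```python
import importlib
import sys
import time

def time_import(module_name):
    """Import module_name and return (elapsed_seconds, was_already_loaded)."""
    # A module that is already in sys.modules imports almost instantly,
    # so the timing is only meaningful for the first import in a process.
    already_loaded = module_name in sys.modules
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    return elapsed, already_loaded

# Stand-in standard-library modules; replace with the packages to compare.
for name in ("json", "decimal"):
    elapsed, cached = time_import(name)
    note = " (already loaded)" if cached else ""
    print(f"{name}: {elapsed * 1000:.1f}ms{note}")
```

For a fair comparison, run each measurement in a fresh interpreter. Python's own python -X importtime flag gives a more detailed per-module breakdown.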
Handles larger-than-RAM data
If you have data that does not fit into memory, Polars' query engine is able to process your query (or parts of your query) in a streaming fashion. This drastically reduces memory requirements, so you might be able to process your 250GB dataset on your laptop. Collect with collect(streaming=True) to run the query streaming. (This might be a little slower, but it is still very fast!)
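The idea behind streaming execution can be illustrated without Polars at all. The following is a conceptual pure-Python sketch, not the Polars engine: it computes a per-group mean while holding only one row and a small dictionary of running totals in memory. The function name and the sample rows are made up for illustration; the column names are borrowed from the iris example.

```python
import csv
import io
from collections import defaultdict

def streaming_group_mean(lines, key, value):
    """Mean of `value` per `key`, reading the CSV one row at a time."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        # Only running totals are kept; the rows themselves are discarded.
        sums[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {group: sums[group] / counts[group] for group in sums}

# Made-up sample rows; a real input would be an open file handle.
sample = io.StringIO(
    "species,sepal_length\n"
    "Setosa,5.0\n"
    "Setosa,5.2\n"
    "Virginica,6.5\n"
)
means = streaming_group_mean(sample, "species", "sepal_length")
print(means)
```

Memory use here grows with the number of distinct groups rather than the number of rows, which is the property that makes larger-than-RAM aggregation feasible.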
Setup
Python
Install the latest Polars version with:
pip install polars
We also have a conda package (conda install -c conda-forge polars), however pip is the preferred way to install Polars.
Install Polars with all optional dependencies.
pip install 'polars[all]'
You can also install a subset of all optional dependencies.
pip install 'polars[numpy,pandas,pyarrow]'
See the User Guide for more details on optional dependencies.
To see the current Polars version and a full list of its optional dependencies, run:
pl.show_versions()
Releases happen quite often (weekly / every few days) at the moment, so updating Polars regularly to get the latest bugfixes / features might not be a bad idea.
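With releases this frequent, it can help to check which version is installed before deciding whether to upgrade. The sketch below uses only the standard library's importlib.metadata rather than a Polars API, so it also works when Polars is missing; installed_version is a hypothetical helper name.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version("polars")
print(version if version is not None else "polars is not installed")
```

Upgrading is then a matter of running pip install --upgrade polars.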
Rust
You can take the latest release from crates.io, or if you want to use the latest features / performance improvements, point to the main branch of this repo:
polars = { git = "https://github.com/pola-rs/polars", rev = "<optional git tag>" }
Requires Rust version >=1.80.
Contributing
Want to contribute? Read our contributing guide.
Python: compile Polars from source
If you want a bleeding edge release or maximal performance you should compile Polars from source.
This can be done by going through the following steps in sequence:
- Install the latest Rust compiler
- Install maturin: pip install maturin
- cd py-polars and choose one of the following:
  - make build, slow binary with debug assertions and symbols, fast compile times
  - make build-release, fast binary without debug assertions, minimal debug symbols, long compile times
  - make build-nodebug-release, same as build-release but without any debug symbols, slightly faster to compile
  - make build-debug-release, same as build-release but with full debug symbols, slightly slower to compile
  - make build-dist-release, fastest binary, extreme compile times
By default the binary is compiled with optimizations turned on for a modern CPU. Specify LTS_CPU=1 with the command if your CPU is older and does not support e.g. AVX2.
Note that the Rust crate implementing the Python bindings is called py-polars to distinguish it from the wrapped Rust crate polars itself. However, both the Python package and the Python module are named polars, so you can pip install polars and import polars.
Using custom Rust functions in Python
Extending Polars with UDFs compiled in Rust is easy. We expose PyO3 extensions for DataFrame and Series data structures. See more in https://github.com/pola-rs/pyo3-polars.
Going big...
Do you expect more than 2^32 (~4.3 billion) rows? Compile Polars with the bigidx feature flag or, for Python users, run pip install polars-u64-idx.
Don't use this unless you hit the row boundary as the default build of Polars is faster and consumes less memory.
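As a quick sanity check on that boundary, the arithmetic can be written out directly. This sketch takes the 2^32 threshold stated above at face value (the exact cutoff in Polars may differ by one), and needs_big_index is a hypothetical helper, not part of Polars.

```python
# Row limit of the default build, as stated above: 2^32 rows.
U32_ROW_LIMIT = 2**32

def needs_big_index(n_rows):
    """True when n_rows exceeds what a 32-bit row index can address."""
    return n_rows > U32_ROW_LIMIT

print(f"{U32_ROW_LIMIT:,}")
print(needs_big_index(3_000_000_000))
print(needs_big_index(5_000_000_000))
```

A 3-billion-row dataset still fits the default build, while a 5-billion-row dataset would call for the 64-bit index variant.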
Legacy
Do you want Polars to run on an old CPU (e.g. dating from before 2011), or on an x86-64 build of Python on Apple Silicon under Rosetta? Run pip install polars-lts-cpu. This version of Polars is compiled without AVX target features.
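On x86-64 Linux, one rough way to guess which package applies is to look for the avx2 flag in /proc/cpuinfo. The sketch below is a heuristic under that assumption only: both helper names are hypothetical, the sample text is made up, and AVX2 is just the one feature named in this README, so the default build may rely on other features as well.

```python
def parse_cpu_flags(cpuinfo_text):
    """Collect the feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            flags.update(value.split())
    return flags

def suggest_package(flags):
    """Heuristic: suggest the LTS build when the avx2 flag is missing."""
    return "polars" if "avx2" in flags else "polars-lts-cpu"

# Made-up sample; on Linux, pass open("/proc/cpuinfo").read() instead.
sample = "processor : 0\nflags : fpu sse sse2 avx avx2\n"
print(suggest_package(parse_cpu_flags(sample)))
```

This check does not apply to ARM machines or to other operating systems, where the flags are reported differently.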