
pola-rs/polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust


Top Related Projects

  • apache/arrow (14,246 ⭐): Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
  • pandas-dev/pandas (43,205 ⭐): Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
  • dask/dask (12,378 ⭐): Parallel computing with task scheduling
  • modin-project/modin (9,742 ⭐): Modin: Scale your Pandas workflows by changing a single line of code
  • vaexio/vaex (8,249 ⭐): Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
  • h2oai/datatable: A Python package for manipulating 2-dimensional tabular data structures

Quick Overview

Polars is a fast, efficient, and expressive DataFrame library for Rust and Python. It leverages Arrow's columnar format for high-performance data processing and analysis, offering a powerful alternative to pandas for large-scale data manipulation tasks.

Pros

  • Extremely fast performance, often outperforming pandas and other DataFrame libraries
  • Memory-efficient due to its use of Apache Arrow's columnar format
  • Supports both eager and lazy execution modes for flexibility in data processing (contrasted in the sketch after this list)
  • Provides a rich API with intuitive methods for data manipulation and analysis
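
A minimal sketch of the two execution modes on a toy DataFrame (df.lazy() converts an eager frame into a lazy one; names are illustrative):

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3]})

# Eager: the expression is evaluated immediately
eager_result = df.select(pl.col("x") * 2)

# Lazy: build a query plan first, then execute it with collect()
lazy_result = df.lazy().select(pl.col("x") * 2).collect()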

Cons

  • Steeper learning curve compared to pandas, especially for those new to Rust
  • Smaller ecosystem and fewer third-party integrations compared to more established libraries
  • Documentation, while improving, can be less comprehensive than more mature projects
  • Some advanced features found in pandas may not be available or implemented differently

Code Examples

  1. Creating a DataFrame and performing basic operations:
import polars as pl

df = pl.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": ["a", "b", "c", "d", "e"],
    "C": [1.1, 2.2, 3.3, 4.4, 5.5]
})

filtered_df = df.filter(pl.col("A") > 2).select(["A", "B"])
print(filtered_df)
  2. Using lazy execution for complex operations:
import polars as pl

df = pl.scan_csv("large_file.csv")
result = (
    df.filter(pl.col("age") > 30)
    .group_by("city")
    .agg([
        pl.col("salary").mean().alias("avg_salary"),
        pl.col("age").max().alias("max_age")
    ])
    .sort("avg_salary", descending=True)
    .collect()
)
print(result)
  3. Performing time series operations:
from datetime import date

import numpy as np
import polars as pl

df = pl.DataFrame({
    "date": pl.date_range(date(2023, 1, 1), date(2023, 12, 31), interval="1d", eager=True),
    "value": np.random.randn(365),
})

# group_by_dynamic expects the "date" column to be sorted, which it is here
monthly_avg = df.group_by_dynamic("date", every="1mo").agg(
    pl.col("value").mean().alias("monthly_avg")
)
print(monthly_avg)

Getting Started

To get started with Polars in Python:

  1. Install Polars:

    pip install polars
    
  2. Import Polars in your Python script:

    import polars as pl
    
  3. Create a DataFrame and start working with data:

    df = pl.DataFrame({
        "A": [1, 2, 3, 4, 5],
        "B": ["a", "b", "c", "d", "e"]
    })
    print(df)
    

For more advanced usage and detailed documentation, visit the official Polars documentation at https://pola-rs.github.io/polars-book/

Competitor Comparisons

apache/arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Pros of Arrow

  • Broader ecosystem support across multiple programming languages
  • More mature project with longer development history
  • Extensive documentation and community resources

Cons of Arrow

  • Steeper learning curve for beginners
  • Can be more complex to set up and use for simple data tasks
  • Larger codebase and dependencies

Code Comparison

Arrow (Python):

import pyarrow as pa

data = [
    pa.array([1, 2, 3, 4]),
    pa.array(['a', 'b', 'c', 'd'])
]
table = pa.Table.from_arrays(data, names=['numbers', 'letters'])

Polars (Rust):

use polars::prelude::*;

let df = df! [
    "numbers" => [1, 2, 3, 4],
    "letters" => ["a", "b", "c", "d"]
].unwrap();

Both Arrow and Polars are powerful data processing libraries, but they serve different purposes. Arrow focuses on providing a standardized columnar memory format and interoperability across languages, while Polars is a high-performance DataFrame library built on top of Arrow. Polars offers a more user-friendly API for data manipulation tasks, making it easier for developers to work with tabular data efficiently. However, Arrow's broader ecosystem and language support make it a better choice for projects requiring cross-language data exchange or integration with various big data tools.
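
Because Polars uses Arrow's memory format internally, data can move between the two libraries cheaply, often without copying. A minimal sketch of this interop, assuming both pyarrow and polars are installed (column names are illustrative):

import polars as pl
import pyarrow as pa

# Build an Arrow table, then view it as a Polars DataFrame (typically zero-copy)
table = pa.table({"numbers": [1, 2, 3, 4], "letters": ["a", "b", "c", "d"]})
df = pl.from_arrow(table)

# Convert back to Arrow to hand the data to other Arrow-aware tools
table_again = df.to_arrow()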

pandas-dev/pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Pros of pandas

  • Extensive documentation and large community support
  • Wide range of built-in functions for data manipulation and analysis
  • Seamless integration with other Python libraries in the data science ecosystem

Cons of pandas

  • Slower performance, especially for large datasets
  • Higher memory usage due to its design and implementation in Python
  • Less efficient handling of null values compared to Polars

Code Comparison

pandas

import pandas as pd

df = pd.read_csv("data.csv")
result = df.groupby("category").agg({"value": ["mean", "sum"]})

Polars

import polars as pl

df = pl.read_csv("data.csv")
result = df.group_by("category").agg([
    pl.col("value").mean(),
    pl.col("value").sum()
])

The code comparison shows that both libraries have similar syntax for basic operations, but Polars tends to be more explicit in column selection and aggregation functions. Polars is designed to be more memory-efficient and faster, especially for larger datasets, while pandas offers a wider range of built-in functions and better integration with the Python data science ecosystem.
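
Because the two libraries often coexist in one codebase, converting between them is a common task. A minimal sketch, assuming pandas and pyarrow are installed alongside polars (the data is illustrative):

import pandas as pd
import polars as pl

pdf = pd.DataFrame({"category": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})

# pandas -> Polars (data is copied into Arrow memory)
df = pl.from_pandas(pdf)

# Polars -> pandas, e.g. to hand results to a pandas-based plotting library
pdf_roundtrip = df.to_pandas()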

dask/dask

Parallel computing with task scheduling

Pros of Dask

  • Seamless integration with the PyData ecosystem (NumPy, Pandas)
  • Built-in support for distributed computing and scaling
  • Flexible task scheduling for complex workflows

Cons of Dask

  • Generally slower performance for single-machine operations
  • Higher memory usage compared to more optimized solutions
  • Steeper learning curve for advanced features

Code Comparison

Dask:

import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()

Polars:

import polars as pl

df = pl.read_csv('large_file.csv')
result = df.group_by('column').mean()

Both Dask and Polars are powerful data processing libraries, but they have different strengths. Dask excels in distributed computing and integrating with existing PyData tools, while Polars focuses on high-performance operations on a single machine. Dask's syntax is more familiar to Pandas users, but Polars often provides better performance for local data processing tasks. The choice between them depends on specific use cases, data sizes, and computational requirements.
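
The closest Polars analog to Dask's deferred .compute() is the lazy API, where collect() triggers execution of an optimized query plan. A hedged sketch (file and column names are illustrative):

import polars as pl

# Nothing is read yet; this only builds a query plan
lazy = pl.scan_csv("large_file.csv").group_by("column").agg(pl.col("value").mean())

# collect() plays the role of Dask's compute(): optimize the plan, then execute it
result = lazy.collect()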

modin-project/modin

Modin: Scale your Pandas workflows by changing a single line of code

Pros of Modin

  • Seamless integration with existing pandas code, requiring minimal changes
  • Distributed computing capabilities for handling large datasets
  • Supports multiple execution engines (Ray, Dask, etc.) for flexibility

Cons of Modin

  • Performance improvements may be less significant for smaller datasets
  • Not all pandas functions are fully optimized or supported
  • Potential overhead for small operations due to distributed nature

Code Comparison

Modin:

import modin.pandas as pd

df = pd.read_csv("large_file.csv")
result = df.groupby("column").mean()

Polars:

import polars as pl

df = pl.read_csv("large_file.csv")
result = df.group_by("column").mean()

Key Differences

  • Modin aims to be a drop-in replacement for pandas, while Polars is a separate DataFrame library
  • Polars is generally faster for single-machine operations, especially on larger datasets
  • Modin focuses on distributed computing, while Polars emphasizes single-machine performance
  • Polars has a more modern API design, while Modin maintains pandas compatibility

Both libraries offer improved performance over pandas for large datasets, but they take different approaches. Modin is ideal for those looking to scale existing pandas code, while Polars may be better for new projects or those willing to adapt to a new API for better performance.

vaexio/vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Pros of Vaex

  • Specialized in handling large datasets (>1 billion rows) efficiently
  • Supports lazy evaluation and out-of-core processing
  • Offers advanced visualization capabilities

Cons of Vaex

  • Slower performance for smaller datasets compared to Polars
  • Less extensive data manipulation functionality
  • Smaller community and ecosystem

Code Comparison

Vaex:

import vaex
df = vaex.from_csv('data.csv')
result = df[df.age > 30].groupby('category').agg({'value': 'mean'})

Polars:

import polars as pl
df = pl.read_csv('data.csv')
result = df.filter(pl.col('age') > 30).group_by('category').agg(pl.col('value').mean())

Both Vaex and Polars are data manipulation libraries designed to handle large datasets efficiently. Vaex excels at processing extremely large datasets and offers advanced visualization features. However, Polars generally provides faster performance for smaller to medium-sized datasets and has a more comprehensive set of data manipulation functions. The code comparison shows that both libraries have similar syntax for basic operations, but Polars tends to be more concise and intuitive for common data manipulation tasks.

h2oai/datatable

A Python package for manipulating 2-dimensional tabular data structures

Pros of datatable

  • Written in C++, potentially offering better performance for certain operations
  • Supports out-of-memory processing, allowing for handling datasets larger than available RAM
  • Provides a Python API that closely mimics pandas, easing the transition for pandas users

Cons of datatable

  • Less actively maintained compared to Polars
  • Smaller community and ecosystem
  • Limited support for advanced data manipulation operations

Code Comparison

datatable:

import datatable as dt
from datatable import f  # f is datatable's column-expression namespace

df = dt.fread("data.csv")
result = df[f.A > 0, :]

Polars:

import polars as pl
df = pl.read_csv("data.csv")
result = df.filter(pl.col("A") > 0)

Both libraries offer similar functionality for basic operations, but Polars generally provides a more expressive and flexible API for complex data manipulations. Polars also has better support for lazy evaluation and query optimization, which can lead to improved performance for large datasets and complex operations.

While datatable has some unique features like out-of-memory processing, Polars has gained significant popularity due to its performance, ease of use, and active development. The choice between the two may depend on specific use cases and requirements.
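
Polars' query optimization, mentioned above, can be inspected directly: a lazy query will print its optimized plan on request. A small sketch, assuming a recent Polars version where LazyFrame.explain() is available (file and column names are illustrative):

import polars as pl

lazy = (
    pl.scan_csv("data.csv")
    .filter(pl.col("A") > 0)
    .select(["A", "B"])
)

# Prints the optimized plan; the filter is pushed down into the CSV scan
print(lazy.explain())

result = lazy.collect()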


README

Polars logo

Documentation: Python - Rust - Node.js - R | StackOverflow: Python - Rust - Node.js - R | User guide | Discord

Polars: Blazingly fast DataFrames in Rust, Python, Node.js, R, and SQL

Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.

  • Lazy | eager execution
  • Multi-threaded
  • SIMD
  • Query optimization
  • Powerful expression API
  • Hybrid Streaming (larger-than-RAM datasets)
  • Rust | Python | NodeJS | R | ...

To learn more, read the user guide.

Python

>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "fruits": ["banana", "banana", "apple", "apple", "banana"],
...         "B": [5, 4, 3, 2, 1],
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
...     }
... )

# embarrassingly parallel execution & very expressive query language
>>> df.sort("fruits").select(
...     "fruits",
...     "cars",
...     pl.lit("fruits").alias("literal_string_fruits"),
...     pl.col("B").filter(pl.col("cars") == "beetle").sum(),
...     pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),
...     pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
...     pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),
...     pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),
... )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        ┆ _by_fruits  │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘

SQL

>>> df = pl.scan_csv("docs/data/iris.csv")
>>> ## OPTION 1
>>> # run SQL queries on frame-level
>>> df.sql("""
...     SELECT species,
...       AVG(sepal_length) AS avg_sepal_length
...     FROM self
...     GROUP BY species
... """).collect()
shape: (3, 2)
┌────────────┬──────────────────┐
│ species    ┆ avg_sepal_length │
│ ---        ┆ ---              │
│ str        ┆ f64              │
╞════════════╪══════════════════╡
│ Virginica  ┆ 6.588            │
│ Versicolor ┆ 5.936            │
│ Setosa     ┆ 5.006            │
└────────────┴──────────────────┘
>>> ## OPTION 2
>>> # use pl.sql() to operate on the global context
>>> df2 = pl.LazyFrame({
...    "species": ["Setosa", "Versicolor", "Virginica"],
...    "blooming_season": ["Spring", "Summer", "Fall"]
...})
>>> pl.sql("""
... SELECT df.species,
...     AVG(df.sepal_length) AS avg_sepal_length,
...     df2.blooming_season
... FROM df
... LEFT JOIN df2 ON df.species = df2.species
... GROUP BY df.species, df2.blooming_season
... """).collect()

SQL commands can also be run directly from your terminal using the Polars CLI:

# run an inline SQL query
> polars -c "SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/data/iris.csv') GROUP BY species;"

# run interactively
> polars
Polars CLI v0.3.0
Type .help for help.

> SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/data/iris.csv') GROUP BY species;

Refer to the Polars CLI repository for more information.

Performance 🚀🚀

Blazingly fast

Polars is very fast. In fact, it is one of the best performing solutions available. See the PDS-H benchmark results.

Lightweight

Polars is also very lightweight. It comes with zero required dependencies, and this shows in the import times:

  • polars: 70ms
  • numpy: 104ms
  • pandas: 520ms
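
These numbers vary by machine and version, so it is worth measuring on your own setup. A minimal sketch (CPython's -X importtime flag gives a more detailed breakdown):

import time

start = time.perf_counter()
import polars  # the import being timed

elapsed_ms = (time.perf_counter() - start) * 1000
print(f"import polars took {elapsed_ms:.0f} ms")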

Handles larger-than-RAM data

If you have data that does not fit into memory, Polars' query engine is able to process your query (or parts of your query) in a streaming fashion. This drastically reduces memory requirements, so you might be able to process your 250GB dataset on your laptop. Run the query in streaming mode with collect(streaming=True). (This might be a little slower, but it is still very fast!)
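
For example, a sketch of a streaming aggregation over a file that may not fit in RAM (file and column names are illustrative):

import polars as pl

result = (
    pl.scan_csv("very_large.csv")  # lazy scan; nothing is loaded yet
    .group_by("city")
    .agg(pl.col("salary").mean().alias("avg_salary"))
    .collect(streaming=True)  # execute the optimized plan in streaming mode
)
print(result)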

Setup

Python

Install the latest Polars version with:

pip install polars

We also have a conda package (conda install -c conda-forge polars), but pip is the preferred way to install Polars.

Install Polars with all optional dependencies:

pip install 'polars[all]'

You can also install a subset of all optional dependencies:

pip install 'polars[numpy,pandas,pyarrow]'

See the User Guide for more details on optional dependencies.

To see the current Polars version and a full list of its optional dependencies, run:

pl.show_versions()

Releases happen quite often (weekly / every few days) at the moment, so updating Polars regularly to get the latest bugfixes / features might not be a bad idea.

Rust

You can take the latest release from crates.io, or, if you want to use the latest features / performance improvements, point to the main branch of this repo.

polars = { git = "https://github.com/pola-rs/polars", rev = "<optional git tag>" }

Requires Rust version >=1.80.

Contributing

Want to contribute? Read our contributing guide.

Python: compile Polars from source

If you want a bleeding edge release or maximal performance you should compile Polars from source.

This can be done by going through the following steps in sequence:

  1. Install the latest Rust compiler

  2. Install maturin: pip install maturin

  3. cd py-polars and choose one of the following:

    • make build-release, fastest binary, very long compile times
    • make build-opt, fast binary with debug symbols, long compile times
    • make build-debug-opt, medium-speed binary with debug assertions and symbols, medium compile times
    • make build, slow binary with debug assertions and symbols, fast compile times

    Append -native (e.g. make build-release-native) to enable further optimizations specific to your CPU. This produces a non-portable binary/wheel however.

Note that the Rust crate implementing the Python bindings is called py-polars to distinguish from the wrapped Rust crate polars itself. However, both the Python package and the Python module are named polars, so you can pip install polars and import polars.

Using custom Rust functions in Python

Extending Polars with UDFs compiled in Rust is easy. We expose PyO3 extensions for DataFrame and Series data structures. See more at https://github.com/pola-rs/pyo3-polars.
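
As a point of contrast with the compiled-Rust route above, Polars also accepts plain Python UDFs via map_elements; these are much slower than Rust UDFs but require no compilation. A hedged sketch (column name and function are illustrative):

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3]})

# A pure-Python UDF; for performance-critical work, prefer a Rust UDF via pyo3-polars
result = df.with_columns(
    pl.col("x").map_elements(lambda v: v * 2, return_dtype=pl.Int64).alias("x_doubled")
)
print(result)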

Going big...

Do you expect more than 2^32 (~4.2 billion) rows? Compile Polars with the bigidx feature flag or, for Python users, run pip install polars-u64-idx.

Don't use this unless you hit the row boundary, as the default build of Polars is faster and consumes less memory.

Legacy

Do you want Polars to run on an old CPU (e.g. dating from before 2011), or on an x86-64 build of Python on Apple Silicon under Rosetta? Install pip install polars-lts-cpu. This version of Polars is compiled without AVX target features.

Sponsors

JetBrains logo