cudf

cuDF - GPU DataFrame Library

Top Related Projects

dask

13,158

Parallel computing with task scheduling

pandas

45,255

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

modin

10,129

Modin: Scale your Pandas workflows by changing a single line of code

vaex

8,377

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

polars

33,322

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Quick Overview

cuDF is a GPU DataFrame library that is part of the RAPIDS ecosystem. It provides a pandas-like API for manipulating large datasets on NVIDIA GPUs and can be substantially faster than CPU-based tools for data science and machine learning workloads.

Pros

Accelerated data processing using GPU power
Familiar pandas-like API for easy adoption
Seamless integration with other RAPIDS libraries
Handles large tabular datasets, within the limits of available GPU memory

Cons

Requires NVIDIA GPU hardware
The native cuDF API covers less functionality than pandas (cudf.pandas falls back to pandas for unsupported operations)
Potential learning curve for GPU-based data processing
May not be cost-effective for smaller datasets or simpler operations

Code Examples

Creating a DataFrame and performing basic operations:

import cudf

# Create a DataFrame
df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Perform operations
result = df['A'] + df['B']
print(result)

Reading a CSV file and filtering data:

import cudf

# Read CSV file
df = cudf.read_csv('data.csv')

# Filter data
filtered_df = df[df['column_name'] > 100]
print(filtered_df.head())

Joining two DataFrames:

import cudf

# Create two DataFrames
df1 = cudf.DataFrame({'key': [1, 2, 3], 'value': ['a', 'b', 'c']})
df2 = cudf.DataFrame({'key': [2, 3, 4], 'other_value': ['x', 'y', 'z']})

# Perform inner join
joined_df = df1.merge(df2, on='key')
print(joined_df)
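
The join above relies on merge's default inner join. As a hedged sketch of how the how argument changes the result, the snippet below compares an inner join and a left join on the same frames. It imports cuDF when available and otherwise falls back to pandas, on the assumption that merge behaves the same in both for this simple case; the to_list helper is hypothetical and exists only for the illustration. The results are sorted explicitly because cuDF does not guarantee join output order by default.

```python
# Prefer cuDF; fall back to pandas if cuDF cannot be imported (assumption:
# both expose the same merge API for this simple case).
try:
    import cudf as xdf
except Exception:  # e.g. no GPU, no cuDF install, driver mismatch
    import pandas as xdf


def to_list(series):
    """Hypothetical helper: plain Python list from a cuDF or pandas Series."""
    if hasattr(series, "to_pandas"):  # cuDF objects expose to_pandas()
        series = series.to_pandas()
    return series.tolist()


df1 = xdf.DataFrame({"key": [1, 2, 3], "value": ["a", "b", "c"]})
df2 = xdf.DataFrame({"key": [2, 3, 4], "other_value": ["x", "y", "z"]})

# Inner join (the default): only keys present in both frames are kept.
inner = df1.merge(df2, on="key", how="inner").sort_values("key")

# Left join: every row of df1 is kept; unmatched rows get nulls on the right.
left = df1.merge(df2, on="key", how="left").sort_values("key")

print(to_list(inner["key"]))
print(to_list(left["key"]))
```

The left join keeps key 1 even though df2 has no match for it, which is the usual reason to choose it over the default.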

Getting Started

To get started with cuDF, follow these steps:

Install CUDA 12.0 or later with a compatible NVIDIA driver
Install cuDF using conda:

conda install -c rapidsai -c conda-forge cudf=25.08

Import cuDF in your Python script:

import cudf

# Create a sample DataFrame
df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)

This will create a simple cuDF DataFrame and print it, confirming that the library is working correctly.

Competitor Comparisons

dask

13,158

Parallel computing with task scheduling

Pros of Dask

Supports both CPU and GPU processing, offering flexibility across hardware
Works with various data formats and storage systems beyond just DataFrames
More mature project with a larger ecosystem and community support

Cons of Dask

Generally slower performance for GPU operations compared to cuDF
Requires more complex setup and configuration for distributed computing
Less optimized for GPU-specific operations and memory management

Code Comparison

Dask:

import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()

cuDF:

import cudf

df = cudf.read_csv('large_file.csv')
result = df.groupby('column').mean()

Both libraries offer similar high-level APIs for data manipulation, but cuDF is specifically designed for GPU acceleration, resulting in potentially faster execution for compatible operations. Dask provides a more general-purpose distributed computing framework, while cuDF focuses on GPU-accelerated DataFrame operations within the RAPIDS ecosystem.

pandas

45,255

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Pros of pandas

Mature and widely adopted library with extensive documentation and community support
Works on CPU, making it accessible for a broader range of hardware configurations
Supports a wide variety of data manipulation and analysis operations

Cons of pandas

Performance limitations on large datasets due to single-threaded CPU operations
Higher memory usage compared to GPU-accelerated alternatives
Limited scalability for big data processing tasks

Code Comparison

pandas:

import pandas as pd

df = pd.read_csv('data.csv')
result = df.groupby('category')['value'].mean()

cuDF:

import cudf

gdf = cudf.read_csv('data.csv')
result = gdf.groupby('category')['value'].mean()

The code structure is nearly identical, but cuDF runs the work on the GPU, which can be much faster on large datasets. pandas is more versatile and widely supported, while cuDF offers significant speedups for supported operations on CUDA-capable GPUs.

modin

10,129

Modin: Scale your Pandas workflows by changing a single line of code

Pros of Modin

Easier to adopt: Modin aims to be a drop-in replacement for pandas, requiring minimal code changes
Supports both CPU and GPU acceleration, offering flexibility in hardware usage
Works with existing pandas-based workflows and libraries

Cons of Modin

Generally slower than cuDF for GPU-accelerated operations
Less mature and feature-complete compared to cuDF for GPU workloads
May not fully utilize GPU capabilities for all operations

Code Comparison

Modin:

import modin.pandas as pd

df = pd.read_csv("large_file.csv")
result = df.groupby("column").mean()

cuDF:

import cudf

df = cudf.read_csv("large_file.csv")
result = df.groupby("column").mean()

Both libraries aim to provide DataFrame functionality with improved performance. Modin focuses on ease of adoption and closely mimics the pandas API, while cuDF is designed specifically for GPU acceleration. As part of the RAPIDS ecosystem, cuDF offers more GPU-optimized operations. Code written against the native cudf API may need changes when migrating from pandas, whereas the cudf.pandas accelerator mode is intended to run existing pandas code unchanged.

vaex

8,377

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Pros of Vaex

Works out-of-core, allowing processing of data larger than RAM
Supports various file formats including HDF5, CSV, and Parquet
Can run on standard CPUs, making it more accessible for users without GPUs

Cons of Vaex

Generally slower than cuDF for in-memory operations
Less integration with other GPU-accelerated libraries
Smaller ecosystem and community compared to RAPIDS

Code Comparison

Vaex:

import vaex
df = vaex.open('large_dataset.hdf5')
result = df.mean(df.column)

cuDF:

import cudf
df = cudf.read_csv('large_dataset.csv')
result = df['column'].mean()

Key Differences

Vaex focuses on out-of-core processing, while cuDF is designed for in-memory GPU acceleration
cuDF is part of the larger RAPIDS ecosystem, offering integration with other GPU-accelerated tools
Vaex provides more flexibility in terms of hardware requirements, while cuDF requires NVIDIA GPUs
cuDF generally offers faster performance for in-memory operations, especially on large datasets
Vaex supports a wider range of file formats out-of-the-box

polars

33,322

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Pros of Polars

Cross-platform compatibility (works on CPU, no GPU required)
Easier installation and setup process
More flexible and user-friendly API

Cons of Polars

Generally slower performance for large datasets compared to GPU-accelerated cuDF
Less integration with other RAPIDS ecosystem libraries
Fewer advanced features for complex data manipulation

Code Comparison

Polars:

import polars as pl

df = pl.read_csv("data.csv")
# group_by is the current name; older Polars releases spelled it groupby
result = df.filter(pl.col("age") > 30).group_by("city").agg(pl.col("salary").mean())

cuDF:

import cudf

df = cudf.read_csv("data.csv")
result = df[df.age > 30].groupby("city").salary.mean().reset_index()

Both libraries offer similar functionality for basic data manipulation tasks. Polars provides a more expressive API with method chaining, while cuDF closely mimics pandas syntax. cuDF leverages GPU acceleration for faster processing of large datasets, but requires CUDA-compatible hardware. Polars, being CPU-based, offers broader compatibility and easier setup, making it more accessible for general use cases. However, for big data applications requiring high-performance computing, cuDF's GPU acceleration can provide significant speed advantages.

README

cuDF - GPU DataFrames

📢 cuDF can now be used as a no-code-change accelerator for pandas! To learn more, see here!

cuDF (pronounced "KOO-dee-eff") is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF leverages libcudf, a blazing-fast C++/CUDA dataframe library and the Apache Arrow columnar format to provide a GPU-accelerated pandas API.

You can import cudf directly and use it like pandas:

import cudf

tips_df = cudf.read_csv("https://github.com/plotly/datasets/raw/master/tips.csv")
tips_df["tip_percentage"] = tips_df["tip"] / tips_df["total_bill"] * 100

# display average tip by dining party size
print(tips_df.groupby("size").tip_percentage.mean())

Or, you can use cuDF as a no-code-change accelerator for pandas, using cudf.pandas. cudf.pandas supports 100% of the pandas API, utilizing cuDF for supported operations and falling back to pandas when needed:

%load_ext cudf.pandas  # pandas operations now use the GPU!

import pandas as pd

tips_df = pd.read_csv("https://github.com/plotly/datasets/raw/master/tips.csv")
tips_df["tip_percentage"] = tips_df["tip"] / tips_df["total_bill"] * 100

# display average tip by dining party size
print(tips_df.groupby("size").tip_percentage.mean())
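
The %load_ext magic only works in IPython or Jupyter. For a plain Python script, the cuDF documentation also describes running it with python -m cudf.pandas script.py, or calling cudf.pandas.install() before pandas is imported. A minimal sketch of the second form follows. The inline sample rows are made up for the illustration, and the try/except guard is an assumption added so the script still runs on plain pandas when cuDF is unavailable.

```python
# Enable the accelerator before pandas is imported; fall back to plain
# pandas when cuDF is unavailable (an assumption made for this sketch).
try:
    import cudf.pandas
    cudf.pandas.install()
    accelerated = True
except Exception:  # e.g. no GPU, no cuDF install, driver mismatch
    accelerated = False

import pandas as pd

# Made-up sample rows standing in for the tips dataset.
tips_df = pd.DataFrame({
    "total_bill": [16.0, 16.0, 32.0],
    "tip": [2.0, 4.0, 8.0],
    "size": [2, 2, 4],
})
tips_df["tip_percentage"] = tips_df["tip"] / tips_df["total_bill"] * 100

# Average tip percentage by dining party size.
result = tips_df.groupby("size")["tip_percentage"].mean()
print("accelerated:", accelerated)
print(result)
```

The order matters: pandas must be imported after install() is called, otherwise the already-imported module is not accelerated.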

Resources

Try cudf.pandas now: Explore cudf.pandas on a free GPU enabled instance on Google Colab!
Install: Instructions for installing cuDF and other RAPIDS libraries.
cudf (Python) documentation
libcudf (C++/CUDA) documentation
RAPIDS Community: Get help, contribute, and collaborate.

See the RAPIDS install page for the most up-to-date information and commands for installing cuDF and other RAPIDS packages.

Installation

CUDA/GPU requirements

CUDA 12.0+ with a compatible NVIDIA driver
Volta architecture or better (Compute Capability >=7.0)

Pip

cuDF can be installed via pip from the NVIDIA Python Package Index. Be sure to select the appropriate cuDF package depending on the major version of CUDA available in your environment:

pip install cudf-cu12
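
As a rough sketch of that selection step, the hypothetical helper below maps a CUDA version string to a package name. The cudf-cu<major> naming pattern is an assumption extrapolated from the cudf-cu12 example above, and cudf_pip_package is not part of cuDF. Check the RAPIDS install page for the packages that actually exist.

```python
MIN_CUDA_MAJOR = 12  # this README lists CUDA 12.0+ as the requirement


def cudf_pip_package(cuda_version: str) -> str:
    """Hypothetical helper: guess the pip package for a CUDA version string."""
    major = int(cuda_version.strip().split(".")[0])
    if major < MIN_CUDA_MAJOR:
        raise ValueError(
            f"CUDA {cuda_version} is older than the required {MIN_CUDA_MAJOR}.0"
        )
    # Assumed naming pattern, extrapolated from 'cudf-cu12'.
    return f"cudf-cu{major}"


print("pip install " + cudf_pip_package("12.4"))
```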

Conda

cuDF can be installed with conda (via miniforge) from the rapidsai channel:

conda install -c rapidsai -c conda-forge cudf=25.08

We also provide nightly Conda packages built from the HEAD of our latest development branch.

Note: cuDF is supported only on Linux, and with Python versions 3.10 and later.

See the RAPIDS installation guide for more OS and version info.
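
The requirements stated in this section (CUDA 12.0+, Compute Capability 7.0 or higher, Linux, Python 3.10 or later) can be sketched as a small pre-install check. The function name and its parameters are hypothetical. The CUDA version and compute capability are passed in as tuples rather than detected, since detection depends on tools this README does not describe.

```python
import platform
import sys

# Minimums taken from the requirements listed in this README.
MIN_CUDA = (12, 0)
MIN_COMPUTE_CAPABILITY = (7, 0)  # Volta or newer
MIN_PYTHON = (3, 10)


def unmet_requirements(cuda, compute_capability, python=None, os_name=None):
    """Hypothetical check: return a list of requirement violations."""
    # Default to the running interpreter and operating system.
    python = tuple(sys.version_info[:2]) if python is None else tuple(python)
    os_name = platform.system() if os_name is None else os_name

    problems = []
    if tuple(cuda) < MIN_CUDA:
        problems.append("CUDA 12.0 or later is required")
    if tuple(compute_capability) < MIN_COMPUTE_CAPABILITY:
        problems.append("GPU compute capability must be at least 7.0")
    if python < MIN_PYTHON:
        problems.append("Python 3.10 or later is required")
    if os_name != "Linux":
        problems.append("cuDF is supported only on Linux")
    return problems


# Example: an older GPU (compute capability 6.1) on an otherwise valid setup.
print(unmet_requirements((12, 4), (6, 1), python=(3, 11), os_name="Linux"))
```

An empty list means none of the four documented constraints is violated; it does not guarantee that a given driver and cuDF release are compatible.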

Build/Install from Source

See build instructions.

Contributing

Please see our guide for contributing to cuDF.
