Top Related Projects
- Dask: Parallel computing with task scheduling
- pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
- Modin: Scale your Pandas workflows by changing a single line of code
- Vaex: Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
- Polars: Dataframes powered by a multithreaded, vectorized query engine, written in Rust
Quick Overview
cuDF is a GPU DataFrame library built on the RAPIDS AI platform. It provides a pandas-like API for manipulating large datasets on NVIDIA GPUs, offering significant performance improvements over CPU-based solutions for data science and machine learning tasks.
Pros
- Accelerated data processing using GPU power
- Familiar pandas-like API for easy adoption
- Seamless integration with other RAPIDS libraries
- Processes large datasets quickly, provided the working set fits in GPU memory (which is typically smaller than CPU memory)
Cons
- Requires NVIDIA GPU hardware
- Does not natively cover the full pandas API (cudf.pandas falls back to pandas for unsupported operations)
- Potential learning curve for GPU-based data processing
- May not be cost-effective for smaller datasets or simpler operations
Code Examples
- Creating a DataFrame and performing basic operations:
import cudf
# Create a DataFrame
df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Perform operations
result = df['A'] + df['B']
print(result)
- Reading a CSV file and filtering data:
import cudf
# Read CSV file
df = cudf.read_csv('data.csv')
# Filter data
filtered_df = df[df['column_name'] > 100]
print(filtered_df.head())
- Joining two DataFrames:
import cudf
# Create two DataFrames
df1 = cudf.DataFrame({'key': [1, 2, 3], 'value': ['a', 'b', 'c']})
df2 = cudf.DataFrame({'key': [2, 3, 4], 'other_value': ['x', 'y', 'z']})
# Perform inner join
joined_df = df1.merge(df2, on='key')
print(joined_df)
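For readers without a GPU at hand, the inner-join semantics of the merge above can be reproduced in plain Python. This is only a sketch of what an inner join on `key` returns for the same two tables; `inner_join` is a hypothetical helper written for illustration, not part of cuDF, and it says nothing about how cuDF executes the join on the GPU.

```python
# Same data as the two DataFrames above, as lists of row dictionaries.
left = [
    {"key": 1, "value": "a"},
    {"key": 2, "value": "b"},
    {"key": 3, "value": "c"},
]
right = [
    {"key": 2, "other_value": "x"},
    {"key": 3, "other_value": "y"},
    {"key": 4, "other_value": "z"},
]


def inner_join(left_rows, right_rows, on):
    """Hash join: keep only rows whose `on` value appears in both inputs."""
    # Index the right-hand rows by join key.
    index = {}
    for row in right_rows:
        index.setdefault(row[on], []).append(row)
    # Emit one merged row per matching (left, right) pair.
    return [
        {**l_row, **r_row}
        for l_row in left_rows
        for r_row in index.get(l_row[on], [])
    ]


joined = inner_join(left, right, on="key")
for row in joined:
    print(row)
```

Only keys 2 and 3 occur in both inputs, so rows with key 1 and key 4 are dropped.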
Getting Started
To get started with cuDF, follow these steps:
- Install CUDA and compatible NVIDIA drivers
- Install cuDF using conda:
conda install -c rapidsai -c conda-forge -c nvidia cudf=24.10 python=3.12 cuda-version=12.5
- Import cuDF in your Python script:
import cudf
# Create a sample DataFrame
df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)
This will create a simple cuDF DataFrame and print it, confirming that the library is working correctly.
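Because cuDF needs an NVIDIA GPU, code that must also run on CPU-only machines sometimes picks its DataFrame library at import time. The following is a minimal sketch of that pattern, not an official cuDF mechanism: `load_backend` is a hypothetical helper, and it assumes that a successful `import cudf` is a good enough signal that the GPU stack is usable.

```python
import importlib


def load_backend(candidates=("cudf", "pandas")):
    """Return (name, module) for the first candidate that imports cleanly.

    Returns (None, None) if none of the candidates can be imported.
    """
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except Exception:
            # ImportError if the package is missing; other exceptions can
            # occur if the package is installed but fails to initialise.
            continue
    return None, None


backend_name, xdf = load_backend()
if xdf is None:
    print("No DataFrame backend available")
else:
    print(f"Using {backend_name} as the DataFrame backend")
```

Code written against `xdf` should stick to the subset of the API that both libraries share.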
Competitor Comparisons
Dask: Parallel computing with task scheduling
Pros of Dask
- Supports both CPU and GPU processing, offering flexibility across hardware
- Works with various data formats and storage systems beyond just DataFrames
- More mature project with a larger ecosystem and community support
Cons of Dask
- Generally slower performance for GPU operations compared to cuDF
- Requires more complex setup and configuration for distributed computing
- Less optimized for GPU-specific operations and memory management
Code Comparison
Dask:
import dask.dataframe as dd
df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()
cuDF:
import cudf
df = cudf.read_csv('large_file.csv')
result = df.groupby('column').mean()
Both libraries offer similar high-level APIs for data manipulation, but cuDF is specifically designed for GPU acceleration, resulting in potentially faster execution for compatible operations. Dask provides a more general-purpose distributed computing framework, while cuDF focuses on GPU-accelerated DataFrame operations within the RAPIDS ecosystem.
pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Pros of pandas
- Mature and widely adopted library with extensive documentation and community support
- Works on CPU, making it accessible for a broader range of hardware configurations
- Supports a wide variety of data manipulation and analysis operations
Cons of pandas
- Performance limitations on large datasets due to single-threaded CPU operations
- Higher memory usage compared to GPU-accelerated alternatives
- Limited scalability for big data processing tasks
Code Comparison
pandas:
import pandas as pd
df = pd.read_csv('data.csv')
result = df.groupby('category')['value'].mean()
cuDF:
import cudf
gdf = cudf.read_csv('data.csv')
result = gdf.groupby('category')['value'].mean()
The code structure is similar, but cuDF leverages GPU acceleration for faster processing on large datasets. While pandas is more versatile and widely supported, cuDF can offer significant performance improvements for compatible operations on CUDA-enabled GPUs.
Modin: Scale your Pandas workflows by changing a single line of code
Pros of Modin
- Easier to adopt: Modin aims to be a drop-in replacement for pandas, requiring minimal code changes
- Supports both CPU and GPU acceleration, offering flexibility in hardware usage
- Works with existing pandas-based workflows and libraries
Cons of Modin
- Generally slower than cuDF for GPU-accelerated operations
- Less mature and feature-complete compared to cuDF for GPU workloads
- May not fully utilize GPU capabilities for all operations
Code Comparison
Modin:
import modin.pandas as pd
df = pd.read_csv("large_file.csv")
result = df.groupby("column").mean()
cuDF:
import cudf
df = cudf.read_csv("large_file.csv")
result = df.groupby("column").mean()
Both libraries aim to provide DataFrame functionality with improved performance. Modin focuses on ease of adoption and compatibility with pandas, while cuDF is designed specifically for GPU acceleration. The code usage is similar, with Modin aiming to mimic the pandas API closely. cuDF, being part of the RAPIDS ecosystem, offers more GPU-optimized operations. Code that imports cudf directly may need changes when migrating from pandas, whereas cudf.pandas is intended as a no-code-change accelerator.
Vaex: Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Pros of Vaex
- Works with out-of-core datasets, allowing processing of data larger than RAM
- Supports various file formats including HDF5, CSV, and Parquet
- Can run on standard CPUs, making it more accessible for users without GPUs
Cons of Vaex
- Generally slower than cuDF for in-memory operations
- Less integration with other GPU-accelerated libraries
- Smaller ecosystem and community compared to RAPIDS
Code Comparison
Vaex:
import vaex
df = vaex.open('large_dataset.hdf5')
result = df.mean(df.column)
cuDF:
import cudf
df = cudf.read_csv('large_dataset.csv')
result = df['column'].mean()
Key Differences
- Vaex focuses on out-of-core processing (data larger than RAM), while cuDF is designed for in-memory GPU acceleration
- cuDF is part of the larger RAPIDS ecosystem, offering integration with other GPU-accelerated tools
- Vaex provides more flexibility in terms of hardware requirements, while cuDF requires NVIDIA GPUs
- cuDF generally offers faster performance for in-memory operations, especially on large datasets
- Vaex supports a wider range of file formats out-of-the-box
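The difference between processing data larger than RAM and processing it fully in memory can be illustrated with a streaming aggregate. The sketch below uses only the Python standard library and is not how Vaex or cuDF work internally; `streaming_mean` is a hypothetical helper, and the in-memory `io.StringIO` sample stands in for a large file on disk.

```python
import csv
import io


def streaming_mean(rows, column):
    """Compute a column mean one row at a time, without loading all rows."""
    total = 0.0
    count = 0
    for row in rows:
        total += float(row[column])
        count += 1
    if count == 0:
        raise ValueError("no rows to average")
    return total / count


# Stand-in for a large CSV file; csv.DictReader yields rows lazily.
sample = io.StringIO("column\n1\n2\n3\n4\n")
print(streaming_mean(csv.DictReader(sample), "column"))
```

Only the running total and count are kept in memory, which is what lets this style of computation handle inputs larger than RAM. An in-memory engine instead loads the whole column before aggregating.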
Polars: Dataframes powered by a multithreaded, vectorized query engine, written in Rust
Pros of Polars
- Cross-platform compatibility (works on CPU, no GPU required)
- Easier installation and setup process
- More flexible and user-friendly API
Cons of Polars
- Generally slower performance for large datasets compared to GPU-accelerated cuDF
- Less integration with other RAPIDS ecosystem libraries
- Fewer advanced features for complex data manipulation
Code Comparison
Polars:
import polars as pl
df = pl.read_csv("data.csv")
result = df.filter(pl.col("age") > 30).group_by("city").agg(pl.col("salary").mean())
cuDF:
import cudf
df = cudf.read_csv("data.csv")
result = df[df.age > 30].groupby("city").salary.mean().reset_index()
Both libraries offer similar functionality for basic data manipulation tasks. Polars provides a more expressive API with method chaining, while cuDF closely mimics pandas syntax. cuDF leverages GPU acceleration for faster processing of large datasets, but requires CUDA-compatible hardware. Polars, being CPU-based, offers broader compatibility and easier setup, making it more accessible for general use cases. However, for big data applications requiring high-performance computing, cuDF's GPU acceleration can provide significant speed advantages.
README
cuDF - GPU DataFrames
📢 cuDF can now be used as a no-code-change accelerator for pandas!
cuDF (pronounced "KOO-dee-eff") is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF leverages libcudf, a blazing-fast C++/CUDA dataframe library and the Apache Arrow columnar format to provide a GPU-accelerated pandas API.
You can import cudf directly and use it like pandas:
import cudf
tips_df = cudf.read_csv("https://github.com/plotly/datasets/raw/master/tips.csv")
tips_df["tip_percentage"] = tips_df["tip"] / tips_df["total_bill"] * 100
# display average tip by dining party size
print(tips_df.groupby("size").tip_percentage.mean())
Or, you can use cuDF as a no-code-change accelerator for pandas, using cudf.pandas. cudf.pandas supports 100% of the pandas API, utilizing cuDF for supported operations and falling back to pandas when needed:
%load_ext cudf.pandas # pandas operations now use the GPU!
import pandas as pd
tips_df = pd.read_csv("https://github.com/plotly/datasets/raw/master/tips.csv")
tips_df["tip_percentage"] = tips_df["tip"] / tips_df["total_bill"] * 100
# display average tip by dining party size
print(tips_df.groupby("size").tip_percentage.mean())
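The "use the accelerated path when supported, otherwise fall back" idea can be sketched in plain Python. This is a toy illustration of the dispatch pattern only and makes no claim about how cudf.pandas is implemented; `with_fallback`, `fast_mean`, and `slow_mean` are hypothetical names, and the "floats only" restriction on the fast path is invented for the example.

```python
from decimal import Decimal


def with_fallback(fast, slow):
    """Wrap two implementations: try `fast`, use `slow` if it is unsupported."""
    def run(*args, **kwargs):
        try:
            return fast(*args, **kwargs), "fast"
        except NotImplementedError:
            return slow(*args, **kwargs), "slow"
    return run


def fast_mean(values):
    # Stand-in for an accelerated path that supports only float inputs.
    if not all(isinstance(v, float) for v in values):
        raise NotImplementedError("fast path handles floats only")
    return sum(values) / len(values)


def slow_mean(values):
    # Stand-in for the general path: works for any numeric type.
    values = list(values)
    return sum(values) / len(values)


mean = with_fallback(fast_mean, slow_mean)
print(mean([1.0, 2.0, 3.0]))                    # handled by the fast path
print(mean([Decimal("1.5"), Decimal("2.5")]))   # falls back to the slow path
```

The caller sees a single `mean` function either way, mirroring how the pandas code above stays unchanged whether or not a given operation runs on the GPU.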
Resources
- Try cudf.pandas now: Explore cudf.pandas on a free GPU enabled instance on Google Colab!
- Install: Instructions for installing cuDF and other RAPIDS libraries.
- cudf (Python) documentation
- libcudf (C++/CUDA) documentation
- RAPIDS Community: Get help, contribute, and collaborate.
See the RAPIDS install page for the most up-to-date information and commands for installing cuDF and other RAPIDS packages.
Installation
CUDA/GPU requirements
- CUDA 11.2+
- NVIDIA driver 450.80.02+
- Volta architecture or better (Compute Capability >=7.0)
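The minimums listed above can be checked with simple version-tuple comparisons. This is a rough sketch under stated assumptions: `unmet_requirements` and `parse_version` are hypothetical helpers, the versions are passed in as dotted numeric strings, and how you obtain them from your system is left out.

```python
# Minimum versions taken from the requirements list above.
MINIMUMS = {
    "cuda": (11, 2),
    "driver": (450, 80, 2),
    "compute_capability": (7, 0),
}


def parse_version(text):
    """Turn a dotted version string such as '450.80.02' into (450, 80, 2)."""
    return tuple(int(part) for part in text.split("."))


def unmet_requirements(cuda, driver, compute_capability):
    """Return the names of the requirements that are below the minimum."""
    found = {
        "cuda": cuda,
        "driver": driver,
        "compute_capability": compute_capability,
    }
    return [
        name
        for name, minimum in MINIMUMS.items()
        if parse_version(found[name]) < minimum
    ]


problems = unmet_requirements(cuda="12.5", driver="535.104.05", compute_capability="8.0")
print("OK" if not problems else f"Below minimum: {problems}")
```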
Pip
cuDF can be installed via pip from the NVIDIA Python Package Index. Be sure to select the appropriate cuDF package depending on the major version of CUDA available in your environment:
For CUDA 11.x:
pip install --extra-index-url=https://pypi.nvidia.com cudf-cu11
For CUDA 12.x:
pip install --extra-index-url=https://pypi.nvidia.com cudf-cu12
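In an automated setup script, the choice between the two packages can be derived from the CUDA version string. A minimal sketch, assuming only the two package names shown above; `cudf_pip_package` and `pip_command` are hypothetical helpers, not part of any RAPIDS tooling.

```python
# Package names as given in the pip instructions above, keyed by CUDA major version.
PACKAGES = {11: "cudf-cu11", 12: "cudf-cu12"}


def cudf_pip_package(cuda_version):
    """Map a CUDA version string such as '12.5' to the matching pip package."""
    major = int(cuda_version.split(".")[0])
    if major not in PACKAGES:
        raise ValueError(f"no cuDF pip package listed for CUDA {cuda_version}")
    return PACKAGES[major]


def pip_command(cuda_version):
    """Build the full install command for the given CUDA version."""
    package = cudf_pip_package(cuda_version)
    return f"pip install --extra-index-url=https://pypi.nvidia.com {package}"


print(pip_command("12.5"))
```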
Conda
cuDF can be installed with conda (via miniconda or the full Anaconda distribution) from the rapidsai channel:
conda install -c rapidsai -c conda-forge -c nvidia \
cudf=24.10 python=3.12 cuda-version=12.5
We also provide nightly Conda packages built from the HEAD of our latest development branch.
Note: cuDF is supported only on Linux, and with Python versions 3.10 and later.
See the RAPIDS installation guide for more OS and version info.
Build/Install from Source
See build instructions.
Contributing
Please see our guide for contributing to cuDF.