rapidsai / cudf

cuDF - GPU DataFrame Library

Top Related Projects

  • Dask (12,495): Parallel computing with task scheduling
  • pandas (43,524): Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
  • Modin (9,845): Scale your Pandas workflows by changing a single line of code
  • Vaex (8,280): Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
  • Polars (29,748): Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Quick Overview

cuDF is a GPU DataFrame library built on the RAPIDS AI platform. It provides a pandas-like API for manipulating large datasets on NVIDIA GPUs, offering significant performance improvements over CPU-based solutions for data science and machine learning tasks.

Pros

  • Accelerated data processing using GPU power
  • Familiar pandas-like API for easy adoption
  • Seamless integration with other RAPIDS libraries
  • Handles large datasets quickly, as long as the working set fits in GPU memory

Cons

  • Requires NVIDIA GPU hardware
  • Limited functionality compared to pandas
  • Potential learning curve for GPU-based data processing
  • May not be cost-effective for smaller datasets or simpler operations

Code Examples

  1. Creating a DataFrame and performing basic operations:
import cudf

# Create a DataFrame
df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Perform operations
result = df['A'] + df['B']
print(result)
  2. Reading a CSV file and filtering data:
import cudf

# Read CSV file
df = cudf.read_csv('data.csv')

# Filter data
filtered_df = df[df['column_name'] > 100]
print(filtered_df.head())
  3. Joining two DataFrames:
import cudf

# Create two DataFrames
df1 = cudf.DataFrame({'key': [1, 2, 3], 'value': ['a', 'b', 'c']})
df2 = cudf.DataFrame({'key': [2, 3, 4], 'other_value': ['x', 'y', 'z']})

# Perform inner join
joined_df = df1.merge(df2, on='key')
print(joined_df)
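
As a rough mental model of what the inner join in the last example returns, the following pure-Python sketch keeps only the keys that appear in both inputs. It does not use cuDF and is not how cuDF implements joins; `inner_join` is a hypothetical helper, and it assumes each key occurs at most once per input.

```python
# Pure-Python illustration of inner-join semantics; not cuDF's implementation.
left = {1: "a", 2: "b", 3: "c"}    # key -> value (mirrors df1)
right = {2: "x", 3: "y", 4: "z"}   # key -> other_value (mirrors df2)


def inner_join(left: dict, right: dict) -> list[tuple]:
    """Return (key, value, other_value) rows for keys present in both inputs."""
    return [(key, left[key], right[key]) for key in left if key in right]


rows = inner_join(left, right)
print(rows)  # [(2, 'b', 'x'), (3, 'c', 'y')]
```

Keys 1 and 4 are dropped because each appears in only one of the two inputs.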

Getting Started

To get started with cuDF, follow these steps:

  1. Install CUDA and compatible NVIDIA drivers
  2. Install cuDF using conda:
conda install -c rapidsai -c conda-forge -c nvidia cudf=24.10 python=3.12 cuda-version=12.5
  3. Import cuDF in your Python script:
import cudf

# Create a sample DataFrame
df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)

This will create a simple cuDF DataFrame and print it, confirming that the library is working correctly.

Competitor Comparisons

Pros of Dask

  • Supports both CPU and GPU processing, offering flexibility across hardware
  • Works with various data formats and storage systems beyond just DataFrames
  • More mature project with a larger ecosystem and community support

Cons of Dask

  • Generally slower performance for GPU operations compared to cuDF
  • Requires more complex setup and configuration for distributed computing
  • Less optimized for GPU-specific operations and memory management

Code Comparison

Dask:

import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()

cuDF:

import cudf

df = cudf.read_csv('large_file.csv')
result = df.groupby('column').mean()

Both libraries offer similar high-level APIs for data manipulation, but cuDF is specifically designed for GPU acceleration, resulting in potentially faster execution for compatible operations. Dask provides a more general-purpose distributed computing framework, while cuDF focuses on GPU-accelerated DataFrame operations within the RAPIDS ecosystem.

Pros of pandas

  • Mature and widely adopted library with extensive documentation and community support
  • Works on CPU, making it accessible for a broader range of hardware configurations
  • Supports a wide variety of data manipulation and analysis operations

Cons of pandas

  • Performance limitations on large datasets due to single-threaded CPU operations
  • Higher memory usage compared to GPU-accelerated alternatives
  • Limited scalability for big data processing tasks

Code Comparison

pandas:

import pandas as pd

df = pd.read_csv('data.csv')
result = df.groupby('category')['value'].mean()

cuDF:

import cudf

gdf = cudf.read_csv('data.csv')
result = gdf.groupby('category')['value'].mean()

The code structure is similar, but cuDF leverages GPU acceleration for faster processing on large datasets. While pandas is more versatile and widely supported, cuDF offers significant performance improvements for compatible operations on CUDA-enabled GPUs.

Pros of Modin

  • Easier to adopt: Modin aims to be a drop-in replacement for pandas, requiring minimal code changes
  • Supports both CPU and GPU acceleration, offering flexibility in hardware usage
  • Works with existing pandas-based workflows and libraries

Cons of Modin

  • Generally slower than cuDF for GPU-accelerated operations
  • Less mature and feature-complete compared to cuDF for GPU workloads
  • May not fully utilize GPU capabilities for all operations

Code Comparison

Modin:

import modin.pandas as pd

df = pd.read_csv("large_file.csv")
result = df.groupby("column").mean()

cuDF:

import cudf

df = cudf.read_csv("large_file.csv")
result = df.groupby("column").mean()

Both libraries aim to provide DataFrame functionality with improved performance. Modin focuses on ease of adoption and compatibility with pandas, while cuDF is designed specifically for GPU acceleration. Usage is similar, since Modin mirrors the pandas API closely. cuDF, being part of the RAPIDS ecosystem, offers more GPU-optimized operations but may require more significant code changes when migrating from pandas.

Pros of Vaex

  • Works with out-of-memory datasets, allowing processing of data larger than RAM
  • Supports various file formats including HDF5, CSV, and Parquet
  • Can run on standard CPUs, making it more accessible for users without GPUs

Cons of Vaex

  • Generally slower than cuDF for in-memory operations
  • Less integration with other GPU-accelerated libraries
  • Smaller ecosystem and community compared to RAPIDS

Code Comparison

Vaex:

import vaex
df = vaex.open('large_dataset.hdf5')
result = df.mean(df.column)

cuDF:

import cudf
df = cudf.read_csv('large_dataset.csv')
result = df['column'].mean()

Key Differences

  • Vaex focuses on out-of-memory processing, while cuDF is designed for in-memory GPU acceleration
  • cuDF is part of the larger RAPIDS ecosystem, offering integration with other GPU-accelerated tools
  • Vaex provides more flexibility in terms of hardware requirements, while cuDF requires NVIDIA GPUs
  • cuDF generally offers faster performance for in-memory operations, especially on large datasets
  • Vaex supports a wider range of file formats out-of-the-box

Pros of Polars

  • Cross-platform compatibility (works on CPU, no GPU required)
  • Easier installation and setup process
  • More flexible and user-friendly API

Cons of Polars

  • Generally slower performance for large datasets compared to GPU-accelerated cuDF
  • Less integration with other RAPIDS ecosystem libraries
  • Fewer advanced features for complex data manipulation

Code Comparison

Polars:

import polars as pl

df = pl.read_csv("data.csv")
result = df.filter(pl.col("age") > 30).group_by("city").agg(pl.col("salary").mean())

cuDF:

import cudf

df = cudf.read_csv("data.csv")
result = df[df.age > 30].groupby("city").salary.mean().reset_index()

Both libraries offer similar functionality for basic data manipulation tasks. Polars provides a more expressive API with method chaining, while cuDF closely mimics pandas syntax. cuDF leverages GPU acceleration for faster processing of large datasets, but requires CUDA-compatible hardware. Polars, being CPU-based, offers broader compatibility and easier setup, making it more accessible for general use cases. However, for big data applications requiring high-performance computing, cuDF's GPU acceleration can provide significant speed advantages.

README

cuDF - GPU DataFrames

📢 cuDF can now be used as a no-code-change accelerator for pandas! To learn more, see the cudf.pandas documentation.

cuDF (pronounced "KOO-dee-eff") is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF leverages libcudf, a blazing-fast C++/CUDA dataframe library and the Apache Arrow columnar format to provide a GPU-accelerated pandas API.

You can import cudf directly and use it like pandas:

import cudf

tips_df = cudf.read_csv("https://github.com/plotly/datasets/raw/master/tips.csv")
tips_df["tip_percentage"] = tips_df["tip"] / tips_df["total_bill"] * 100

# display average tip by dining party size
print(tips_df.groupby("size").tip_percentage.mean())

Or, you can use cuDF as a no-code-change accelerator for pandas, using cudf.pandas. cudf.pandas supports 100% of the pandas API, utilizing cuDF for supported operations and falling back to pandas when needed:

%load_ext cudf.pandas  # pandas operations now use the GPU!

import pandas as pd

tips_df = pd.read_csv("https://github.com/plotly/datasets/raw/master/tips.csv")
tips_df["tip_percentage"] = tips_df["tip"] / tips_df["total_bill"] * 100

# display average tip by dining party size
print(tips_df.groupby("size").tip_percentage.mean())
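
The `%load_ext` magic above only works in IPython or Jupyter. In a plain Python script, the accelerator is typically enabled either from the command line with `python -m cudf.pandas script.py` or programmatically with `cudf.pandas.install()` before pandas is imported. The sketch below wraps the programmatic route in a hypothetical helper, `enable_cudf_pandas`, that leaves pandas untouched when cuDF is missing; the helper name and its fallback behaviour are assumptions for illustration, not part of cuDF.

```python
import importlib.util


def enable_cudf_pandas() -> bool:
    """Try to turn on the cudf.pandas accelerator; return True on success.

    Hypothetical helper: returns False (leaving pandas on the CPU) when
    cuDF is not installed or cannot be initialised on this machine.
    """
    if importlib.util.find_spec("cudf") is None:
        return False  # cuDF is not installed
    try:
        import cudf.pandas

        cudf.pandas.install()  # must run before pandas is imported
    except (ImportError, RuntimeError):
        return False  # e.g. no usable GPU or driver
    return True


accelerated = enable_cudf_pandas()
print("cudf.pandas enabled:", accelerated)
```

Call the helper at the very top of the script, before any `import pandas as pd`, so that later pandas imports pick up the accelerated module when it is available.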

Resources

See the RAPIDS install page for the most up-to-date information and commands for installing cuDF and other RAPIDS packages.

Installation

CUDA/GPU requirements

  • CUDA 11.2+
  • NVIDIA driver 450.80.02+
  • Volta architecture or better (Compute Capability >=7.0)
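
The minimums above can be checked mechanically once the installed versions are known. The following is a minimal sketch that compares version strings against those minimums; it does not query the GPU itself, so the strings must be supplied by the caller (for example, copied from the driver tooling). The function names and the sample version strings are illustrative assumptions.

```python
# Minimum versions taken from the requirements list above.
MIN_CUDA = (11, 2)
MIN_DRIVER = (450, 80, 2)
MIN_COMPUTE_CAPABILITY = (7, 0)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '450.80.02' into (450, 80, 2)."""
    return tuple(int(part) for part in text.split("."))


def meets_requirements(cuda: str, driver: str, compute_capability: str) -> bool:
    """Return True if all three versions satisfy the documented minimums."""
    return (
        parse_version(cuda) >= MIN_CUDA
        and parse_version(driver) >= MIN_DRIVER
        and parse_version(compute_capability) >= MIN_COMPUTE_CAPABILITY
    )


# Sample values for illustration only.
print(meets_requirements("12.5", "555.42.06", "8.6"))  # True
print(meets_requirements("11.0", "450.80.02", "7.0"))  # False: CUDA is older than 11.2
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically, where "11.10" would sort before "11.2".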

Pip

cuDF can be installed via pip from the NVIDIA Python Package Index. Be sure to select the appropriate cuDF package depending on the major version of CUDA available in your environment:

For CUDA 11.x:

pip install --extra-index-url=https://pypi.nvidia.com cudf-cu11

For CUDA 12.x:

pip install --extra-index-url=https://pypi.nvidia.com cudf-cu12
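
Because the package name depends only on the CUDA major version, the choice can be scripted. The sketch below builds the matching pip command from a CUDA version string; `cudf_pip_command` is a hypothetical helper, and it only knows about the two package names listed above.

```python
NVIDIA_INDEX = "https://pypi.nvidia.com"


def cudf_pip_command(cuda_version: str) -> str:
    """Return the pip command for the cuDF package matching a CUDA version."""
    major = int(cuda_version.split(".")[0])
    if major not in (11, 12):
        # Only the cudf-cu11 and cudf-cu12 packages are described above.
        raise ValueError(f"no cuDF pip package listed for CUDA {cuda_version}")
    return f"pip install --extra-index-url={NVIDIA_INDEX} cudf-cu{major}"


print(cudf_pip_command("12.5"))
print(cudf_pip_command("11.8"))
```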

Conda

cuDF can be installed with conda (via miniconda or the full Anaconda distribution) from the rapidsai channel:

conda install -c rapidsai -c conda-forge -c nvidia \
    cudf=24.10 python=3.12 cuda-version=12.5

We also provide nightly Conda packages built from the HEAD of our latest development branch.

Note: cuDF is supported only on Linux, and with Python versions 3.10 and later.
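
A script that depends on cuDF can verify these two constraints up front and fail with a clear message rather than an import error. This is a minimal sketch using only the standard library; `platform_supported` is a hypothetical helper that encodes the note above and checks nothing about the GPU or CUDA.

```python
import platform
import sys


def platform_supported(system: str, python_version: tuple[int, int]) -> bool:
    """Return True for Linux with Python 3.10 or later, per the note above."""
    return system == "Linux" and python_version >= (3, 10)


# Check the interpreter that is running this script.
supported = platform_supported(platform.system(), sys.version_info[:2])
print("cuDF-supported platform:", supported)
```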

See the RAPIDS installation guide for more OS and version info.

Build/Install from Source

See build instructions.

Contributing

Please see our guide for contributing to cuDF.