NVIDIA/thrust

[ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl

Top Related Projects

  • Faiss: A library for efficient similarity search and clustering of dense vectors.
  • HIP: C++ Heterogeneous-Compute Interface for Portability
  • Kokkos: C++ Performance Portability Programming Ecosystem: The Programming Model - Parallel Execution and Memory Abstraction
  • ArrayFire: a general purpose GPU library.

Quick Overview

Thrust is a C++ template library for CUDA that resembles the Standard Template Library (STL). It provides a rich collection of data parallel primitives such as scan, sort, and reduce, which can be used to rapidly develop performance-portable parallel applications.

Pros

  • High-Performance: Thrust is designed to provide high-performance parallel algorithms that can take advantage of the massive parallelism available in CUDA-enabled GPUs.
  • Ease of Use: Thrust provides a familiar and intuitive interface that is similar to the STL, making it easy for C++ developers to get started with GPU programming.
  • Portability: Thrust code can be compiled for both CPU and GPU backends, allowing developers to write a single codebase that runs on a variety of hardware platforms (see the sketch after this list).
  • Extensive Functionality: Thrust includes a wide range of parallel algorithms and data structures, covering a broad range of use cases.
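
To make the portability point concrete, the same Thrust source can be retargeted at compile time by selecting a different device system. This is a minimal sketch, assuming a hypothetical file sum.cu containing one of the examples below, with nvcc and g++ installed; the THRUST_DEVICE_SYSTEM macro is described later on this page:

# Build for the default CUDA device system:
nvcc -O2 sum.cu -o sum_gpu

# Build the same source for a multicore CPU via the OpenMP backend:
g++ -O2 -x c++ sum.cu -o sum_cpu -fopenmp \
    -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -I/path/to/thrust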

Cons

  • Limited to CUDA: Thrust is primarily designed for CUDA-enabled GPUs and may not be as well-suited for other GPU architectures or CPU-based parallel programming.
  • Steep Learning Curve: While Thrust aims to simplify GPU programming, developers still need to have a good understanding of CUDA and parallel programming concepts to use it effectively.
  • Dependency on CUDA: Thrust is tightly coupled with the CUDA ecosystem, which may limit its adoption in environments where CUDA is not available or preferred.
  • Potential Performance Overhead: The abstraction and flexibility provided by Thrust may introduce some performance overhead compared to hand-tuned CUDA code.

Code Examples

Here are a few examples of how to use Thrust:

  1. Parallel Reduction:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
    thrust::device_vector<int> data(1000000, 1);
    int result = thrust::reduce(data.begin(), data.end());
    // result will be 1000000
    return 0;
}
  2. Parallel Sort:
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main() {
    thrust::device_vector<int> data(1000000);
    // Populate data with random values
    thrust::sort(data.begin(), data.end());
    // data is now sorted in ascending order
    return 0;
}
  3. Parallel Transform:
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    thrust::device_vector<int> input(1000000, 2);
    thrust::device_vector<int> output(1000000);

    // Use the binary overload of transform: multiply each element by itself.
    thrust::transform(input.begin(), input.end(), input.begin(), output.begin(),
                      thrust::multiplies<int>());
    // output now contains the squares of the input values (here, all 4s)
    return 0;
}
  4. Parallel Scan (Prefix Sum):
#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main() {
    thrust::device_vector<int> input(1000000, 1);
    thrust::device_vector<int> output(1000000);

    thrust::inclusive_scan(input.begin(), input.end(), output.begin());
    // output now contains the prefix sum of the input values
    return 0;
}

Getting Started

To get started with Thrust, you'll need to have a CUDA-enabled GPU and the CUDA toolkit installed on your system. Here's a quick guide to set up and use Thrust:

  1. Install the CUDA Toolkit from the NVIDIA Developer website.
  2. Clone the Thrust repository from GitHub:
git clone https://github.com/NVIDIA/thrust.git
  3. Add the Thrust include directory to your project's include path:
-I/path/to/thrust/include
  4. Link against the CUDA runtime library in your project:
-lcudart
  5. Start using Thrust in your C++ code by including the headers you need (for example, #include <thrust/device_vector.h>), as in the minimal program below.
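
A minimal sketch putting these steps together (the file name and include path are illustrative):

// main.cu
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
    // Sum 100 ones on the device; exit code 0 on success.
    thrust::device_vector<int> v(100, 1);
    return thrust::reduce(v.begin(), v.end()) == 100 ? 0 : 1;
}

Compile and run with nvcc:

nvcc -I/path/to/thrust/include main.cu -o main && ./main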

Competitor Comparisons

Faiss: A library for efficient similarity search and clustering of dense vectors.

Pros of Faiss

  • Specialized for efficient similarity search and clustering of dense vectors
  • Supports GPU acceleration for faster processing of large datasets
  • Includes advanced indexing techniques like HNSW and IVF for improved performance

Cons of Faiss

  • More focused on vector search, less versatile for general-purpose parallel computing
  • Steeper learning curve due to its specialized nature
  • Interfaces are limited to C++ and Python

Code Comparison

Faiss (vector search):

#include <faiss/IndexFlat.h>

// Flat L2 index: add n database vectors xb, then find the k nearest neighbors of the nq queries xq.
faiss::IndexFlatL2 index(d);
index.add(n, xb);
index.search(nq, xq, k, D, I);

Thrust (parallel reduction):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

thrust::device_vector<int> d_vec(data.begin(), data.end());
int sum = thrust::reduce(d_vec.begin(), d_vec.end());

Key Differences

  • Thrust is a general-purpose parallel algorithms library, while Faiss focuses on vector similarity search
  • Faiss provides more specialized tools for working with high-dimensional vectors and similarity metrics
  • Thrust offers broader backend support (CUDA, OpenMP, TBB) and tighter integration with the CUDA ecosystem

Use Cases

  • Faiss: Recommendation systems, image similarity search, clustering large datasets
  • Thrust: General parallel computing tasks, sorting, searching, and reduction operations

HIP: C++ Heterogeneous-Compute Interface for Portability

Pros of HIP

  • Supports multiple hardware platforms (AMD GPUs, NVIDIA GPUs)
  • Easier porting of CUDA code to run on AMD hardware
  • More flexible and open ecosystem

Cons of HIP

  • Smaller community and less mature ecosystem compared to Thrust
  • May lag behind native CUDA performance on NVIDIA hardware
  • Fewer high-level abstractions for parallel algorithms

Code Comparison

Thrust example:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

thrust::device_vector<int> d_vec(1000);
int sum = thrust::reduce(d_vec.begin(), d_vec.end());

HIP example:

#include <hip/hip_runtime.h>
#include <hip/device_functions.h>

int* d_vec;
hipMalloc(&d_vec, 1000 * sizeof(int));
// Custom reduction kernel implementation required

Both Thrust and HIP provide GPU acceleration capabilities, but Thrust offers higher-level abstractions for parallel algorithms, while HIP focuses on providing a more hardware-agnostic approach. Thrust is primarily designed for NVIDIA GPUs, whereas HIP supports both AMD and NVIDIA hardware. The code comparison illustrates that Thrust provides more concise, algorithm-focused code, while HIP requires more low-level implementation details.
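
To make the "custom reduction kernel" point concrete, here is a minimal, unoptimized sketch of what that extra work looks like in HIP; it uses a single atomicAdd-based kernel and is illustrative rather than a tuned implementation:

#include <hip/hip_runtime.h>
#include <cstdio>

// Each thread adds one element into a global accumulator.
// Simple but slow; a real reduction would use shared memory and tree sums.
__global__ void sum_kernel(const int* data, int n, int* result) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomicAdd(result, data[i]);
}

int main() {
  const int n = 1000;
  int *d_vec, *d_sum;
  hipMalloc(&d_vec, n * sizeof(int));
  hipMalloc(&d_sum, sizeof(int));
  hipMemset(d_vec, 0, n * sizeof(int)); // all zeros, so the sum is 0
  hipMemset(d_sum, 0, sizeof(int));
  hipLaunchKernelGGL(sum_kernel, dim3((n + 255) / 256), dim3(256), 0, 0,
                     d_vec, n, d_sum);
  int sum = 0;
  hipMemcpy(&sum, d_sum, sizeof(int), hipMemcpyDeviceToHost);
  printf("sum = %d\n", sum);
  hipFree(d_vec);
  hipFree(d_sum);
  return 0;
}

The single thrust::reduce call above replaces all of this boilerplate.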

Kokkos: C++ Performance Portability Programming Ecosystem: The Programming Model - Parallel Execution and Memory Abstraction

Pros of Kokkos

  • Broader hardware support, including CPUs, GPUs, and other accelerators
  • More flexible programming model with support for various execution spaces
  • Better abstraction for performance portability across different architectures

Cons of Kokkos

  • Steeper learning curve due to more complex API
  • Potentially higher overhead for simple operations compared to Thrust
  • Less mature ecosystem and fewer pre-built algorithms

Code Comparison

Thrust:

thrust::device_vector<float> d_vec(1000);
thrust::fill(d_vec.begin(), d_vec.end(), 1.0f);
float sum = thrust::reduce(d_vec.begin(), d_vec.end());

Kokkos:

Kokkos::View<float*> d_vec("d_vec", 1000);
Kokkos::parallel_for(1000, KOKKOS_LAMBDA(const int i) {
  d_vec(i) = 1.0f;
});
float sum = 0.0f;
// parallel_reduce returns void; it writes the result into the final argument.
Kokkos::parallel_reduce(1000, KOKKOS_LAMBDA(const int i, float& lsum) {
  lsum += d_vec(i);
}, sum);

Both Thrust and Kokkos are parallel programming libraries, but they have different focuses and strengths. Thrust is primarily designed for CUDA GPUs, while Kokkos aims for broader hardware support and performance portability. Thrust offers a simpler API for common parallel operations, making it easier to use for straightforward tasks. Kokkos provides more flexibility and control over execution spaces, making it better suited for complex, heterogeneous computing environments.

ArrayFire: a general purpose GPU library.

Pros of ArrayFire

  • Supports multiple backends (CUDA, OpenCL, CPU) for broader hardware compatibility
  • Provides a higher-level API with more built-in functions for complex operations
  • Offers better support for image processing and signal processing tasks

Cons of ArrayFire

  • Larger library size and potentially higher memory footprint
  • May have a steeper learning curve due to its more extensive API
  • Less tightly integrated with CUDA-specific optimizations compared to Thrust

Code Comparison

Thrust:

thrust::device_vector<float> d_vec(1000);
thrust::fill(d_vec.begin(), d_vec.end(), 1.0f);
float sum = thrust::reduce(d_vec.begin(), d_vec.end());

ArrayFire:

#include <arrayfire.h>

af::array arr = af::constant(1.0f, 1000);
float sum = af::sum<float>(arr);

Both libraries aim to simplify GPU programming, but ArrayFire provides a more comprehensive set of functions at the cost of increased complexity. Thrust is more focused on fundamental parallel algorithms and is more closely tied to CUDA, while ArrayFire offers greater flexibility across different hardware backends. The choice between them depends on the specific requirements of the project and the desired level of abstraction.

README

:warning: The Thrust repository has been archived and is now part of the unified nvidia/cccl repository. See the announcement here for more information. Please visit the new repository for the latest updates. :warning:

Thrust: The C++ Parallel Algorithms Library

Examples | Godbolt | Documentation

Thrust is the C++ parallel algorithms library which inspired the introduction of parallel algorithms to the C++ Standard Library. Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs. It builds on top of established parallel programming frameworks (such as CUDA, TBB, and OpenMP). It also provides a number of general-purpose facilities similar to those found in the C++ Standard Library.

Thrust is an open source project; it is available on GitHub and included in the NVIDIA HPC SDK and the CUDA Toolkit. If you have one of those SDKs installed, no additional installation or compiler flags are needed to use Thrust.

Examples

Thrust is best learned through examples.

The following example generates random numbers serially and then transfers them to a parallel device where they are sorted.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/random.h>

int main() {
  // Generate 32M random numbers serially.
  thrust::default_random_engine rng(1337);
  thrust::uniform_int_distribution<int> dist;
  thrust::host_vector<int> h_vec(32 << 20);
  thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });

  // Transfer data to the device.
  thrust::device_vector<int> d_vec = h_vec;

  // Sort data on the device.
  thrust::sort(d_vec.begin(), d_vec.end());

  // Transfer data back to host.
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
}

See it on Godbolt

This example demonstrates computing the sum of some random numbers in parallel:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>

int main() {
  // Generate random data serially.
  thrust::default_random_engine rng(1337);
  thrust::uniform_real_distribution<double> dist(-50.0, 50.0);
  thrust::host_vector<double> h_vec(32 << 20);
  thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });

  // Transfer to device and compute the sum.
  thrust::device_vector<double> d_vec = h_vec;
  double x = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0, thrust::plus<double>());
}

See it on Godbolt

This example shows how to perform such a reduction asynchronously:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/async/copy.h>
#include <thrust/async/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <numeric>

int main() {
  // Generate 32M random numbers serially.
  thrust::default_random_engine rng(123456);
  thrust::uniform_real_distribution<double> dist(-50.0, 50.0);
  thrust::host_vector<double> h_vec(32 << 20);
  thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });

  // Asynchronously transfer to the device.
  thrust::device_vector<double> d_vec(h_vec.size());
  thrust::device_event e = thrust::async::copy(h_vec.begin(), h_vec.end(),
                                               d_vec.begin());

  // After the transfer completes, asynchronously compute the sum on the device.
  thrust::device_future<double> f0 = thrust::async::reduce(thrust::device.after(e),
                                                           d_vec.begin(), d_vec.end(),
                                                           0.0, thrust::plus<double>());

  // While the sum is being computed on the device, compute the sum serially on
  // the host.
  double f1 = std::accumulate(h_vec.begin(), h_vec.end(), 0.0, thrust::plus<double>());
}

See it on Godbolt

Getting The Thrust Source Code

Thrust is a header-only library; there is no need to build or install the project unless you want to run the Thrust unit tests.

The CUDA Toolkit provides a recent release of the Thrust source code in include/thrust. This will be suitable for most users.

Users that wish to contribute to Thrust or try out newer features should recursively clone the Thrust GitHub repository:

git clone --recursive https://github.com/NVIDIA/thrust.git

Using Thrust From Your Project

For CMake-based projects, we provide a CMake package for use with find_package. See the CMake README for more information. Thrust can also be added via add_subdirectory or tools like the CMake Package Manager.
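
For example, a minimal CMakeLists.txt might look like the following; this is a sketch assuming the Thrust package is discoverable by find_package, and the project, target, and file names are illustrative:

cmake_minimum_required(VERSION 3.15)
project(my_thrust_app CUDA CXX)

# Locate Thrust and create an interface target for the CUDA device system.
find_package(Thrust REQUIRED CONFIG)
thrust_create_target(ThrustCUDA)

add_executable(my_app main.cu)
target_link_libraries(my_app ThrustCUDA)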

For non-CMake projects, compile with:

  • The Thrust include path (-I<thrust repo root>)
  • The libcu++ include path (-I<thrust repo root>/dependencies/libcudacxx/)
  • The CUB include path, if using the CUDA device system (-I<thrust repo root>/dependencies/cub/)
  • By default, the CPP host system and CUDA device system are used. These can be changed using compiler definitions:
    • -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_XXX, where XXX is CPP (serial, default), OMP (OpenMP), or TBB (Intel TBB)
    • -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_XXX, where XXX is CPP, OMP, TBB, or CUDA (default).

Developing Thrust

Thrust uses the CMake build system to build unit tests, examples, and header tests. To build Thrust as a developer, it is recommended that you use our containerized development system:

# Clone Thrust and CUB repos recursively:
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust

# Build and run tests and examples:
ci/local/build.bash

That does the equivalent of the following, but in a clean containerized environment which has all dependencies installed:

# Clone Thrust and CUB repos recursively:
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust

# Create build directory:
mkdir build
cd build

# Configure -- use one of the following:
cmake ..   # Command line interface.
ccmake ..  # ncurses GUI (Linux only).
cmake-gui  # Graphical UI, set source/build directories in the app.

# Build:
cmake --build . -j ${NUM_JOBS} # Invokes make (or ninja, etc).

# Run tests and examples:
ctest

By default, a serial CPP host system, a CUDA-accelerated device system, and the C++14 standard are used. These can be changed in CMake or via flags to ci/local/build.bash.

More information on configuring your Thrust build and creating a pull request can be found in the contributing section.

Licensing

Thrust is an open source project developed on GitHub. Thrust is distributed under the Apache License v2.0 with LLVM Exceptions; some parts are distributed under the Apache License v2.0 and the Boost License v1.0.
