thrust
[ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl
Top Related Projects
A library for efficient similarity search and clustering of dense vectors.
HIP: C++ Heterogeneous-Compute Interface for Portability
Kokkos C++ Performance Portability Programming Ecosystem: The Programming Model - Parallel Execution and Memory Abstraction
ArrayFire: a general purpose GPU library.
Quick Overview
Thrust is a C++ template library for CUDA that resembles the Standard Template Library (STL). It provides a rich collection of data parallel primitives such as scan, sort, and reduce, which can be used to rapidly develop performance-portable parallel applications.
Pros
- High-Performance: Thrust is designed to provide high-performance parallel algorithms that can take advantage of the massive parallelism available in CUDA-enabled GPUs.
- Ease of Use: Thrust provides a familiar and intuitive interface that is similar to the STL, making it easy for C++ developers to get started with GPU programming.
- Portability: Thrust code can be compiled for both CPU and GPU execution, allowing developers to write a single codebase that runs on a variety of hardware platforms (see the sketch after this list).
- Extensive Functionality: Thrust includes a wide range of parallel algorithms and data structures, covering a broad range of use cases.
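A minimal sketch of the portability point above: the same thrust::reduce call can run serially on the host or in parallel on the device, selected by an execution policy (all APIs shown are standard Thrust):
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <vector>
int main() {
  std::vector<int> h(1000, 1);
  // Serial reduction on the CPU via the host execution policy.
  int host_sum = thrust::reduce(thrust::host, h.begin(), h.end());
  // Parallel reduction on the GPU: device_vector iterators dispatch to the device.
  thrust::device_vector<int> d = h;
  int device_sum = thrust::reduce(thrust::device, d.begin(), d.end());
  return (host_sum == device_sum) ? 0 : 1;
}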
Cons
- Limited to CUDA: Thrust is primarily designed for CUDA-enabled GPUs and may not be as well-suited for other GPU architectures or CPU-based parallel programming.
- Steep Learning Curve: While Thrust aims to simplify GPU programming, developers still need to have a good understanding of CUDA and parallel programming concepts to use it effectively.
- Dependency on CUDA: Thrust is tightly coupled with the CUDA ecosystem, which may limit its adoption in environments where CUDA is not available or preferred.
- Potential Performance Overhead: The abstraction and flexibility provided by Thrust may introduce some performance overhead compared to hand-tuned CUDA code.
Code Examples
Here are a few examples of how to use Thrust:
- Parallel Reduction:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
int main() {
thrust::device_vector<int> data(1000000, 1);
int result = thrust::reduce(data.begin(), data.end());
// result will be 1000000
return 0;
}
- Parallel Sort:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <thrust/random.h>
#include <thrust/sort.h>
int main() {
  // Populate a host vector with random values, then copy it to the device.
  thrust::default_random_engine rng;
  thrust::host_vector<int> h_data(1000000);
  thrust::generate(h_data.begin(), h_data.end(), [&] { return static_cast<int>(rng()); });
  thrust::device_vector<int> data = h_data;
  thrust::sort(data.begin(), data.end());
  // data is now sorted in ascending order
  return 0;
}
- Parallel Transform:
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
int main() {
thrust::device_vector<int> input(1000000, 2);
thrust::device_vector<int> output(1000000);
thrust::transform(input.begin(), input.end(), input.begin(),
                  output.begin(), thrust::multiplies<int>());
// output now contains the squares of the input values (each element is 2 * 2 = 4)
return 0;
}
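- Parallel Transform with a device lambda (a variant sketch of the example above; the unary overload of thrust::transform applies a one-argument functor, and device lambdas require nvcc's --extended-lambda flag):
#include <thrust/device_vector.h>
#include <thrust/transform.h>
int main() {
  thrust::device_vector<int> input(1000000, 2);
  thrust::device_vector<int> output(1000000);
  // Unary overload: one input range, one functor taking a single argument.
  thrust::transform(input.begin(), input.end(), output.begin(),
                    [] __device__ (int x) { return x * x; });
  // output now contains the squares of the input values
  return 0;
}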
- Parallel Scan (Prefix Sum):
#include <thrust/device_vector.h>
#include <thrust/scan.h>
int main() {
thrust::device_vector<int> input(1000000, 1);
thrust::device_vector<int> output(1000000);
thrust::inclusive_scan(input.begin(), input.end(), output.begin());
// output now contains the prefix sum of the input values
return 0;
}
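- Exclusive Scan (a complementary sketch: unlike inclusive_scan, each output element excludes the corresponding input, so the first element is the initial value, 0 by default):
#include <thrust/device_vector.h>
#include <thrust/scan.h>
int main() {
  thrust::device_vector<int> input(4, 1);  // {1, 1, 1, 1}
  thrust::device_vector<int> output(4);
  thrust::exclusive_scan(input.begin(), input.end(), output.begin());
  // output is {0, 1, 2, 3}; inclusive_scan would produce {1, 2, 3, 4}
  return 0;
}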
Getting Started
To get started with Thrust, you'll need to have a CUDA-enabled GPU and the CUDA toolkit installed on your system. Here's a quick guide to set up and use Thrust:
- Install the CUDA Toolkit from the NVIDIA Developer website.
- Clone the Thrust repository from GitHub:
git clone https://github.com/NVIDIA/thrust.git
- Add the Thrust include directory to your project's include path:
-I/path/to/thrust/include
- Link against the CUDA runtime library in your project:
-lcudart
- Start using Thrust in your C++ code by including the headers you need, such as thrust/device_vector.h; a minimal build command is sketched below.
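For example (a sketch; my_app.cu is a placeholder file name, and the -I flag is unnecessary if you use the Thrust headers bundled with the CUDA Toolkit):
nvcc -I/path/to/thrust/include -O2 my_app.cu -o my_app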
Competitor Comparisons
A library for efficient similarity search and clustering of dense vectors.
Pros of Faiss
- Specialized for efficient similarity search and clustering of dense vectors
- Supports GPU acceleration for faster processing of large datasets
- Includes advanced indexing techniques like HNSW and IVF for improved performance
Cons of Faiss
- More focused on vector search, less versatile for general-purpose parallel computing
- Steeper learning curve due to its specialized nature
- APIs limited to C++ and Python, and focused on a narrower problem domain than a general-purpose algorithms library like Thrust
Code Comparison
Faiss (vector search):
// d = vector dimension; xb = n database vectors; xq = nq query vectors
faiss::IndexFlatL2 index(d);    // exact L2 (Euclidean distance) index
index.add(n, xb);               // add the database vectors
index.search(nq, xq, k, D, I);  // k nearest neighbors: distances in D, indices in I
Thrust (parallel reduction):
thrust::device_vector<int> d_vec(data.begin(), data.end());
int sum = thrust::reduce(d_vec.begin(), d_vec.end());
Key Differences
- Thrust is a general-purpose parallel algorithms library, while Faiss focuses on vector similarity search
- Faiss provides more specialized tools for working with high-dimensional vectors and similarity metrics
- Thrust offers tighter integration with the CUDA ecosystem and standard C++ idioms
Use Cases
- Faiss: Recommendation systems, image similarity search, clustering large datasets
- Thrust: General parallel computing tasks, sorting, searching, and reduction operations
HIP: C++ Heterogeneous-Compute Interface for Portability
Pros of HIP
- Supports multiple hardware platforms (AMD GPUs, NVIDIA GPUs)
- Easier porting of CUDA code to run on AMD hardware
- More flexible and open ecosystem
Cons of HIP
- Smaller community and less mature ecosystem compared to Thrust
- May have performance differences on NVIDIA hardware
- Fewer high-level abstractions for parallel algorithms
Code Comparison
Thrust example:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
thrust::device_vector<int> d_vec(1000);
int sum = thrust::reduce(d_vec.begin(), d_vec.end());
HIP example:
#include <hip/hip_runtime.h>
#include <hip/device_functions.h>
int* d_vec;
hipMalloc(&d_vec, 1000 * sizeof(int));
// Custom reduction kernel implementation required
Both Thrust and HIP provide GPU acceleration capabilities, but Thrust offers higher-level abstractions for parallel algorithms, while HIP focuses on providing a more hardware-agnostic approach. Thrust is primarily designed for NVIDIA GPUs, whereas HIP supports both AMD and NVIDIA hardware. The code comparison illustrates that Thrust provides more concise, algorithm-focused code, while HIP requires more low-level implementation details.
Kokkos C++ Performance Portability Programming Ecosystem: The Programming Model - Parallel Execution and Memory Abstraction
Pros of Kokkos
- Broader hardware support, including CPUs, GPUs, and other accelerators
- More flexible programming model with support for various execution spaces
- Better abstraction for performance portability across different architectures
Cons of Kokkos
- Steeper learning curve due to more complex API
- Potentially higher overhead for simple operations compared to Thrust
- Less mature ecosystem and fewer pre-built algorithms
Code Comparison
Thrust:
thrust::device_vector<float> d_vec(1000);
thrust::fill(d_vec.begin(), d_vec.end(), 1.0f);
float sum = thrust::reduce(d_vec.begin(), d_vec.end());
Kokkos:
Kokkos::View<float*> d_vec("d_vec", 1000);
Kokkos::parallel_for(1000, KOKKOS_LAMBDA(const int i) {
d_vec(i) = 1.0f;
});
float sum = 0.0f;
Kokkos::parallel_reduce(1000, KOKKOS_LAMBDA(const int i, float& lsum) {
  lsum += d_vec(i);
}, sum);
Both Thrust and Kokkos are parallel programming libraries, but they have different focuses and strengths. Thrust is primarily designed for CUDA GPUs, while Kokkos aims for broader hardware support and performance portability. Thrust offers a simpler API for common parallel operations, making it easier to use for straightforward tasks. Kokkos provides more flexibility and control over execution spaces, making it better suited for complex, heterogeneous computing environments.
ArrayFire: a general purpose GPU library.
Pros of ArrayFire
- Supports multiple backends (CUDA, OpenCL, CPU) for broader hardware compatibility
- Provides a higher-level API with more built-in functions for complex operations
- Offers better support for image processing and signal processing tasks
Cons of ArrayFire
- Larger library size and potentially higher memory footprint
- May have a steeper learning curve due to its more extensive API
- Less tightly integrated with CUDA-specific optimizations compared to Thrust
Code Comparison
Thrust:
thrust::device_vector<float> d_vec(1000);
thrust::fill(d_vec.begin(), d_vec.end(), 1.0f);
float sum = thrust::reduce(d_vec.begin(), d_vec.end());
ArrayFire:
af::array arr = af::constant(1.0f, 1000);
float sum = af::sum<float>(arr);
Both libraries aim to simplify GPU programming, but ArrayFire provides a more comprehensive set of functions at the cost of increased complexity. Thrust is more focused on fundamental parallel algorithms and is more closely tied to CUDA, while ArrayFire offers greater flexibility across different hardware backends. The choice between them depends on the specific requirements of the project and the desired level of abstraction.
README
:warning: The Thrust repository has been archived and is now part of the unified nvidia/cccl repository. See the announcement here for more information. Please visit the new repository for the latest updates. :warning:
Thrust: The C++ Parallel Algorithms Library
Examples | Godbolt | Documentation
Thrust is the C++ parallel algorithms library which inspired the introduction of parallel algorithms to the C++ Standard Library. Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs. It builds on top of established parallel programming frameworks (such as CUDA, TBB, and OpenMP). It also provides a number of general-purpose facilities similar to those found in the C++ Standard Library.
Thrust is an open source project; it is available on GitHub and included in the NVIDIA HPC SDK and the CUDA Toolkit. If you have one of those installed, no additional installation or compiler flags are needed to use Thrust.
Examples
Thrust is best learned through examples.
The following example generates random numbers serially and then transfers them to a parallel device where they are sorted.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/random.h>
int main() {
// Generate 32M random numbers serially.
thrust::default_random_engine rng(1337);
thrust::uniform_int_distribution<int> dist;
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });
// Transfer data to the device.
thrust::device_vector<int> d_vec = h_vec;
// Sort data on the device.
thrust::sort(d_vec.begin(), d_vec.end());
// Transfer data back to host.
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
}
This example demonstrates computing the sum of some random numbers in parallel:
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
int main() {
// Generate random data serially.
thrust::default_random_engine rng(1337);
thrust::uniform_real_distribution<double> dist(-50.0, 50.0);
thrust::host_vector<double> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });
// Transfer to device and compute the sum.
thrust::device_vector<double> d_vec = h_vec;
double x = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0, thrust::plus<double>());
}
This example shows how to perform such a reduction asynchronously:
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/async/copy.h>
#include <thrust/async/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <numeric>
int main() {
// Generate 32M random numbers serially.
thrust::default_random_engine rng(123456);
thrust::uniform_real_distribution<double> dist(-50.0, 50.0);
thrust::host_vector<double> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });
// Asynchronously transfer to the device.
thrust::device_vector<double> d_vec(h_vec.size());
thrust::device_event e = thrust::async::copy(h_vec.begin(), h_vec.end(),
d_vec.begin());
// After the transfer completes, asynchronously compute the sum on the device.
thrust::device_future<double> f0 = thrust::async::reduce(thrust::device.after(e),
d_vec.begin(), d_vec.end(),
0.0, thrust::plus<double>());
// While the sum is being computed on the device, compute the sum serially on
// the host.
double f1 = std::accumulate(h_vec.begin(), h_vec.end(), 0.0, thrust::plus<double>());
}
Getting The Thrust Source Code
Thrust is a header-only library; there is no need to build or install the project unless you want to run the Thrust unit tests.
The CUDA Toolkit provides a recent release of the Thrust source code in include/thrust. This will be suitable for most users.
Users that wish to contribute to Thrust or try out newer features should recursively clone the Thrust Github repository:
git clone --recursive https://github.com/NVIDIA/thrust.git
Using Thrust From Your Project
For CMake-based projects, we provide a CMake package for use with find_package. See the CMake README for more information. Thrust can also be added via add_subdirectory or tools like the CMake Package Manager. A minimal usage sketch follows.
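As a sketch (the MyProgram target name is a placeholder; thrust_create_target comes from the Thrust CMake package):
find_package(Thrust REQUIRED CONFIG)
thrust_create_target(Thrust)
target_link_libraries(MyProgram Thrust)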
For non-CMake projects, compile with:
- The Thrust include path (-I<thrust repo root>)
- The libcu++ include path (-I<thrust repo root>/dependencies/libcudacxx/)
- The CUB include path, if using the CUDA device system (-I<thrust repo root>/dependencies/cub/)
By default, the CPP host system and the CUDA device system are used. These can be changed using compiler definitions:
- -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_XXX, where XXX is CPP (serial, default), OMP (OpenMP), or TBB (Intel TBB)
- -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_XXX, where XXX is CPP, OMP, TBB, or CUDA (default)
An example invocation is sketched after this list.
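For instance (a sketch; the source file name is a placeholder, and linking TBB with -ltbb is an assumption about your TBB installation), a host-only build using TBB as the device system might look like:
g++ -std=c++14 -I<thrust repo root> -I<thrust repo root>/dependencies/libcudacxx/ -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_TBB my_app.cpp -ltbb -o my_app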
Developing Thrust
Thrust uses the CMake build system to build unit tests, examples, and header tests. To build Thrust as a developer, it is recommended that you use our containerized development system:
# Clone Thrust and CUB repos recursively:
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust
# Build and run tests and examples:
ci/local/build.bash
That does the equivalent of the following, but in a clean containerized environment which has all dependencies installed:
# Clone Thrust and CUB repos recursively:
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust
# Create build directory:
mkdir build
cd build
# Configure -- use one of the following:
cmake .. # Command line interface.
ccmake .. # ncurses GUI (Linux only).
cmake-gui # Graphical UI, set source/build directories in the app.
# Build:
cmake --build . -j ${NUM_JOBS} # Invokes make (or ninja, etc).
# Run tests and examples:
ctest
By default, a serial CPP host system, a CUDA accelerated device system, and the C++14 standard are used. This can be changed in CMake and via flags to ci/local/build.bash.
More information on configuring your Thrust build and creating a pull request can be found in the contributing section.
Licensing
Thrust is an open source project developed on GitHub. Thrust is distributed under the Apache License v2.0 with LLVM Exceptions; some parts are distributed under the Apache License v2.0 and the Boost License v1.0.