thrust
[ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl
Top Related Projects
A library for efficient similarity search and clustering of dense vectors.
HIP: C++ Heterogeneous-Compute Interface for Portability
Kokkos C++ Performance Portability Programming Ecosystem: The Programming Model - Parallel Execution and Memory Abstraction
ArrayFire: a general purpose GPU library.
Quick Overview
Thrust is a C++ template library for CUDA that resembles the Standard Template Library (STL). It provides a rich collection of data parallel primitives such as scan, sort, and reduce, which can be used to rapidly develop performance-portable parallel applications.
Pros
- High-Performance: Thrust is designed to provide high-performance parallel algorithms that can take advantage of the massive parallelism available in CUDA-enabled GPUs.
- Ease of Use: Thrust provides a familiar and intuitive interface that is similar to the STL, making it easy for C++ developers to get started with GPU programming.
- Portability: Thrust code can be compiled for both CPU and GPU execution, allowing developers to write a single codebase that runs on a variety of hardware platforms (see the sketch after this list).
- Extensive Functionality: Thrust includes a wide range of parallel algorithms and data structures, covering a broad range of use cases.
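A minimal sketch of the portability point above: the same thrust::reduce call can run serially on the host or in parallel on the device, selected by an execution policy (all APIs shown are standard Thrust):
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <vector>
int main() {
  std::vector<int> h(1000, 1);
  // Serial reduction on the CPU via the host execution policy.
  int host_sum = thrust::reduce(thrust::host, h.begin(), h.end());
  // Parallel reduction on the GPU: device_vector iterators dispatch to the device.
  thrust::device_vector<int> d = h;
  int device_sum = thrust::reduce(thrust::device, d.begin(), d.end());
  return (host_sum == device_sum) ? 0 : 1;
}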
Cons
- Limited to CUDA: Thrust is primarily designed for CUDA-enabled GPUs and may not be as well-suited for other GPU architectures or CPU-based parallel programming.
- Steep Learning Curve: While Thrust aims to simplify GPU programming, developers still need to have a good understanding of CUDA and parallel programming concepts to use it effectively.
- Dependency on CUDA: Thrust is tightly coupled with the CUDA ecosystem, which may limit its adoption in environments where CUDA is not available or preferred.
- Potential Performance Overhead: The abstraction and flexibility provided by Thrust may introduce some performance overhead compared to hand-tuned CUDA code.
Code Examples
Here are a few examples of how to use Thrust:
- Parallel Reduction:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
int main() {
thrust::device_vector<int> data(1000000, 1);
int result = thrust::reduce(data.begin(), data.end());
// result will be 1000000
return 0;
}
- Parallel Sort:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <thrust/random.h>
#include <thrust/sort.h>
int main() {
  // Populate a host vector with random values, then copy it to the device.
  thrust::default_random_engine rng;
  thrust::host_vector<int> h_data(1000000);
  thrust::generate(h_data.begin(), h_data.end(), [&] { return static_cast<int>(rng()); });
  thrust::device_vector<int> data = h_data;
  thrust::sort(data.begin(), data.end());
  // data is now sorted in ascending order
  return 0;
}
- Parallel Transform:
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
int main() {
thrust::device_vector<int> input(1000000, 2);
thrust::device_vector<int> output(1000000);
thrust::transform(input.begin(), input.end(), input.begin(),
                  output.begin(), thrust::multiplies<int>());
// output now contains the squares of the input values (each element is 2 * 2 = 4)
return 0;
}
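- Parallel Transform with a device lambda (a variant sketch of the example above; the unary overload of thrust::transform applies a one-argument functor, and device lambdas require nvcc's --extended-lambda flag):
#include <thrust/device_vector.h>
#include <thrust/transform.h>
int main() {
  thrust::device_vector<int> input(1000000, 2);
  thrust::device_vector<int> output(1000000);
  // Unary overload: one input range, one functor taking a single argument.
  thrust::transform(input.begin(), input.end(), output.begin(),
                    [] __device__ (int x) { return x * x; });
  // output now contains the squares of the input values
  return 0;
}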
- Parallel Scan (Prefix Sum):
#include <thrust/device_vector.h>
#include <thrust/scan.h>
int main() {
thrust::device_vector<int> input(1000000, 1);
thrust::device_vector<int> output(1000000);
thrust::inclusive_scan(input.begin(), input.end(), output.begin());
// output now contains the prefix sum of the input values
return 0;
}
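- Exclusive Scan (a complementary sketch: unlike inclusive_scan, each output element excludes the corresponding input, so the first element is the initial value, 0 by default):
#include <thrust/device_vector.h>
#include <thrust/scan.h>
int main() {
  thrust::device_vector<int> input(4, 1);  // {1, 1, 1, 1}
  thrust::device_vector<int> output(4);
  thrust::exclusive_scan(input.begin(), input.end(), output.begin());
  // output is {0, 1, 2, 3}; inclusive_scan would produce {1, 2, 3, 4}
  return 0;
}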
Getting Started
To get started with Thrust, you'll need to have a CUDA-enabled GPU and the CUDA toolkit installed on your system. Here's a quick guide to set up and use Thrust:
- Install the CUDA Toolkit from the NVIDIA Developer website.
- Clone the Thrust repository from GitHub:
git clone https://github.com/NVIDIA/thrust.git
- Add the Thrust include directory to your project's include path:
-I/path/to/thrust/include
- Link against the CUDA runtime library in your project:
-lcudart
- Start using Thrust in your C++ code by including the headers you need, such as thrust/device_vector.h; a minimal build command is sketched below.
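For example (a sketch; my_app.cu is a placeholder file name, and the -I flag is unnecessary if you use the Thrust headers bundled with the CUDA Toolkit):
nvcc -I/path/to/thrust/include -O2 my_app.cu -o my_app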
Competitor Comparisons
A library for efficient similarity search and clustering of dense vectors.
Pros of Faiss
- Specialized for efficient similarity search and clustering of dense vectors
- Supports GPU acceleration for faster processing of large datasets
- Includes advanced indexing techniques like HNSW and IVF for improved performance
Cons of Faiss
- More focused on vector search, less versatile for general-purpose parallel computing
- Steeper learning curve due to its specialized nature
- APIs limited to C++ and Python, and focused on a narrower problem domain than a general-purpose algorithms library like Thrust
Code Comparison
Faiss (vector search):
// d = vector dimension; xb = n database vectors; xq = nq query vectors
faiss::IndexFlatL2 index(d);    // exact L2 (Euclidean distance) index
index.add(n, xb);               // add the database vectors
index.search(nq, xq, k, D, I);  // k nearest neighbors: distances in D, indices in I
Thrust (parallel reduction):
thrust::device_vector<int> d_vec(data.begin(), data.end());
int sum = thrust::reduce(d_vec.begin(), d_vec.end());
Key Differences
- Thrust is a general-purpose parallel algorithms library, while Faiss focuses on vector similarity search
- Faiss provides more specialized tools for working with high-dimensional vectors and similarity metrics
- Thrust offers tighter integration with the CUDA ecosystem and standard C++ idioms
Use Cases
- Faiss: Recommendation systems, image similarity search, clustering large datasets
- Thrust: General parallel computing tasks, sorting, searching, and reduction operations
HIP: C++ Heterogeneous-Compute Interface for Portability
Pros of HIP
- Supports multiple hardware platforms (AMD GPUs, NVIDIA GPUs)
- Easier porting of CUDA code to run on AMD hardware
- More flexible and open ecosystem
Cons of HIP
- Smaller community and less mature ecosystem compared to Thrust
- May have performance differences on NVIDIA hardware
- Fewer high-level abstractions for parallel algorithms
Code Comparison
Thrust example:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
thrust::device_vector<int> d_vec(1000);
int sum = thrust::reduce(d_vec.begin(), d_vec.end());
HIP example:
#include <hip/hip_runtime.h>
#include <hip/device_functions.h>
int* d_vec;
hipMalloc(&d_vec, 1000 * sizeof(int));
// Custom reduction kernel implementation required
Both Thrust and HIP provide GPU acceleration capabilities, but Thrust offers higher-level abstractions for parallel algorithms, while HIP focuses on providing a more hardware-agnostic approach. Thrust is primarily designed for NVIDIA GPUs, whereas HIP supports both AMD and NVIDIA hardware. The code comparison illustrates that Thrust provides more concise, algorithm-focused code, while HIP requires more low-level implementation details.
Kokkos C++ Performance Portability Programming Ecosystem: The Programming Model - Parallel Execution and Memory Abstraction
Pros of Kokkos
- Broader hardware support, including CPUs, GPUs, and other accelerators
- More flexible programming model with support for various execution spaces
- Better abstraction for performance portability across different architectures
Cons of Kokkos
- Steeper learning curve due to more complex API
- Potentially higher overhead for simple operations compared to Thrust
- Less mature ecosystem and fewer pre-built algorithms
Code Comparison
Thrust:
thrust::device_vector<float> d_vec(1000);
thrust::fill(d_vec.begin(), d_vec.end(), 1.0f);
float sum = thrust::reduce(d_vec.begin(), d_vec.end());
Kokkos:
Kokkos::View<float*> d_vec("d_vec", 1000);
Kokkos::parallel_for(1000, KOKKOS_LAMBDA(const int i) {
d_vec(i) = 1.0f;
});
float sum = 0.0f;
Kokkos::parallel_reduce(1000, KOKKOS_LAMBDA(const int i, float& lsum) {
  lsum += d_vec(i);
}, sum);
Both Thrust and Kokkos are parallel programming libraries, but they have different focuses and strengths. Thrust is primarily designed for CUDA GPUs, while Kokkos aims for broader hardware support and performance portability. Thrust offers a simpler API for common parallel operations, making it easier to use for straightforward tasks. Kokkos provides more flexibility and control over execution spaces, making it better suited for complex, heterogeneous computing environments.
ArrayFire: a general purpose GPU library.
Pros of ArrayFire
- Supports multiple backends (CUDA, OpenCL, CPU) for broader hardware compatibility
- Provides a higher-level API with more built-in functions for complex operations
- Offers better support for image processing and signal processing tasks
Cons of ArrayFire
- Larger library size and potentially higher memory footprint
- May have a steeper learning curve due to its more extensive API
- Less tightly integrated with CUDA-specific optimizations compared to Thrust
Code Comparison
Thrust:
thrust::device_vector<float> d_vec(1000);
thrust::fill(d_vec.begin(), d_vec.end(), 1.0f);
float sum = thrust::reduce(d_vec.begin(), d_vec.end());
ArrayFire:
af::array arr = af::constant(1.0f, 1000);
float sum = af::sum<float>(arr);
Both libraries aim to simplify GPU programming, but ArrayFire provides a more comprehensive set of functions at the cost of increased complexity. Thrust is more focused on fundamental parallel algorithms and is more closely tied to CUDA, while ArrayFire offers greater flexibility across different hardware backends. The choice between them depends on the specific requirements of the project and the desired level of abstraction.
README
:warning: The Thrust repository has been archived and is now part of the unified nvidia/cccl repository. See the announcement here for more information. Please visit the new repository for the latest updates. :warning:
Thrust: The C++ Parallel Algorithms Library
Examples | Godbolt | Documentation
Thrust is the C++ parallel algorithms library which inspired the introduction of parallel algorithms to the C++ Standard Library. Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs. It builds on top of established parallel programming frameworks (such as CUDA, TBB, and OpenMP). It also provides a number of general-purpose facilities similar to those found in the C++ Standard Library.
Thrust is an open source project; it is available on GitHub and included in the NVIDIA HPC SDK and the CUDA Toolkit. If you have one of those installed, no additional installation or compiler flags are needed to use Thrust.
Examples
Thrust is best learned through examples.
The following example generates random numbers serially and then transfers them to a parallel device where they are sorted.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/random.h>
int main() {
// Generate 32M random numbers serially.
thrust::default_random_engine rng(1337);
thrust::uniform_int_distribution<int> dist;
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });
// Transfer data to the device.
thrust::device_vector<int> d_vec = h_vec;
// Sort data on the device.
thrust::sort(d_vec.begin(), d_vec.end());
// Transfer data back to host.
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
}
This example demonstrates computing the sum of some random numbers in parallel:
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
int main() {
// Generate random data serially.
thrust::default_random_engine rng(1337);
thrust::uniform_real_distribution<double> dist(-50.0, 50.0);
thrust::host_vector<double> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });
// Transfer to device and compute the sum.
thrust::device_vector<double> d_vec = h_vec;
double x = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0, thrust::plus<double>());
}
This example shows how to perform such a reduction asynchronously:
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/async/copy.h>
#include <thrust/async/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <numeric>
int main() {
// Generate 32M random numbers serially.
thrust::default_random_engine rng(123456);
thrust::uniform_real_distribution<double> dist(-50.0, 50.0);
thrust::host_vector<double> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), [&] { return dist(rng); });
// Asynchronously transfer to the device.
thrust::device_vector<double> d_vec(h_vec.size());
thrust::device_event e = thrust::async::copy(h_vec.begin(), h_vec.end(),
d_vec.begin());
// After the transfer completes, asynchronously compute the sum on the device.
thrust::device_future<double> f0 = thrust::async::reduce(thrust::device.after(e),
d_vec.begin(), d_vec.end(),
0.0, thrust::plus<double>());
// While the sum is being computed on the device, compute the sum serially on
// the host.
double f1 = std::accumulate(h_vec.begin(), h_vec.end(), 0.0, thrust::plus<double>());
}
Getting The Thrust Source Code
Thrust is a header-only library; there is no need to build or install the project unless you want to run the Thrust unit tests.
The CUDA Toolkit provides a recent release of the Thrust source code in include/thrust. This will be suitable for most users.
Users that wish to contribute to Thrust or try out newer features should recursively clone the Thrust Github repository:
git clone --recursive https://github.com/NVIDIA/thrust.git
Using Thrust From Your Project
For CMake-based projects, we provide a CMake package for use with find_package. See the CMake README for more information. Thrust can also be added via add_subdirectory or tools like the CMake Package Manager. A minimal usage sketch follows.
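As a sketch (the MyProgram target name is a placeholder; thrust_create_target comes from the Thrust CMake package):
find_package(Thrust REQUIRED CONFIG)
thrust_create_target(Thrust)
target_link_libraries(MyProgram Thrust)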
For non-CMake projects, compile with:
- The Thrust include path (-I<thrust repo root>)
- The libcu++ include path (-I<thrust repo root>/dependencies/libcudacxx/)
- The CUB include path, if using the CUDA device system (-I<thrust repo root>/dependencies/cub/)
By default, the CPP host system and the CUDA device system are used. These can be changed using compiler definitions:
- -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_XXX, where XXX is CPP (serial, default), OMP (OpenMP), or TBB (Intel TBB)
- -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_XXX, where XXX is CPP, OMP, TBB, or CUDA (default)
An example invocation is sketched after this list.
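For instance (a sketch; the source file name is a placeholder, and linking TBB with -ltbb is an assumption about your TBB installation), a host-only build using TBB as the device system might look like:
g++ -std=c++14 -I<thrust repo root> -I<thrust repo root>/dependencies/libcudacxx/ -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_TBB my_app.cpp -ltbb -o my_app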
Developing Thrust
Thrust uses the CMake build system to build unit tests, examples, and header tests. To build Thrust as a developer, it is recommended that you use our containerized development system:
# Clone Thrust and CUB repos recursively:
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust
# Build and run tests and examples:
ci/local/build.bash
That does the equivalent of the following, but in a clean containerized environment which has all dependencies installed:
# Clone Thrust and CUB repos recursively:
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust
# Create build directory:
mkdir build
cd build
# Configure -- use one of the following:
cmake .. # Command line interface.
ccmake .. # ncurses GUI (Linux only).
cmake-gui # Graphical UI, set source/build directories in the app.
# Build:
cmake --build . -j ${NUM_JOBS} # Invokes make (or ninja, etc).
# Run tests and examples:
ctest
By default, a serial CPP host system, a CUDA accelerated device system, and the C++14 standard are used. This can be changed in CMake and via flags to ci/local/build.bash.
More information on configuring your Thrust build and creating a pull request can be found in the contributing section.
Licensing
Thrust is an open source project developed on GitHub. Thrust is distributed under the Apache License v2.0 with LLVM Exceptions; some parts are distributed under the Apache License v2.0 and the Boost License v1.0.