rigtorp/SPSCQueue

A bounded single-producer single-consumer wait-free and lock-free queue written in C++11

Top Related Projects

A fast single-producer, single-consumer lock-free queue for C++

C++14 lock-free queue.

An open-source C++ library developed and used at Facebook.

Concurrent data structures in C++

Concurrency primitives, safe memory reclamation mechanisms and non-blocking (including lock-free) data structures designed to aid in the research, design and implementation of high performance concurrent systems developed in C99+.

Quick Overview

SPSCQueue is a single-producer/single-consumer lock-free queue implementation for C++. It's designed for high-performance inter-thread communication, particularly useful in low-latency systems. The queue is implemented as a ring buffer and is cache-friendly.

Pros

  • Lock-free implementation for high performance
  • Cache-friendly design
  • Header-only library for easy integration
  • Supports custom allocators

Cons

  • Limited to single-producer/single-consumer scenarios
  • Requires C++11 or later
  • Fixed capacity, cannot grow dynamically
  • May not be suitable for all memory models

Code Examples

Example 1: Basic usage

#include "SPSCQueue.h"
#include <thread>

rigtorp::SPSCQueue<int> queue(100);

void producer() {
    for (int i = 0; i < 1000; ++i) {
        queue.push(i); // blocks while the queue is full
    }
}

void consumer() {
    for (int i = 0; i < 1000; ++i) {
        while (!queue.front()); // busy-wait until an item is available
        // Process *queue.front()
        queue.pop();
    }
}

int main() {
    std::thread p(producer);
    std::thread c(consumer);
    p.join();
    c.join();
    return 0;
}

Example 2: Using custom allocator

#include "SPSCQueue.h"
#include <cstdlib>
#include <new>

template <typename T> struct MyAllocator {
    using value_type = T;

    T *allocate(size_t n) {
        // aligned_alloc requires the size to be a multiple of the alignment
        size_t bytes = (sizeof(T) * n + 63) / 64 * 64;
        void *p = std::aligned_alloc(64, bytes);
        if (p == nullptr) {
            throw std::bad_alloc();
        }
        return static_cast<T *>(p);
    }

    void deallocate(T *p, size_t) { std::free(p); }
};

rigtorp::SPSCQueue<int, MyAllocator<int>> queue(100);

Example 3: Non-blocking pop

#include "SPSCQueue.h"

rigtorp::SPSCQueue<int> queue(100);

void process() {
    while (int *value = queue.front()) {
        // Process *value
        queue.pop();
    }
    // Queue is empty, do something else
}

Getting Started

To use SPSCQueue in your project:

  1. Download the SPSCQueue.h file from the GitHub repository.
  2. Include the header in your C++ project:
    #include "SPSCQueue.h"
    
  3. Create a queue with a specified capacity:
    rigtorp::SPSCQueue<int> queue(100);
    
  4. Use push() to add elements, and front()/pop() to read and remove them:
    queue.push(42);
    if (int *value = queue.front()) {
        // Process *value
        queue.pop();
    }
    

Remember to compile with C++11 or later and link against pthread on Unix-like systems.

Competitor Comparisons

A fast single-producer, single-consumer lock-free queue for C++

Pros of readerwriterqueue

  • Can grow beyond its initial capacity: enqueue allocates more memory when needed, while try_enqueue never allocates
  • Provides a blocking variant (BlockingReaderWriterQueue) with timed waits, enhancing usability
  • Header-only and widely used, with a well-documented API

Cons of readerwriterqueue

  • Support for dynamic growth adds some overhead compared to a fixed-size ring buffer
  • Somewhat more complex implementation, which could impact maintainability

Code Comparison

SPSCQueue:

SPSCQueue<int> q(100);
q.push(42);
int value = *q.front();
q.pop();

readerwriterqueue:

ReaderWriterQueue<int> q(100);
q.enqueue(42);
int value;
q.try_dequeue(value);

Performance Considerations

Both queues target the single-producer/single-consumer case. SPSCQueue is a fixed-size ring buffer optimized for minimum latency, while readerwriterqueue trades a little overhead for extra features such as dynamic growth and blocking waits.

Use Case Recommendations

Choose SPSCQueue for high-performance, low-latency applications with a fixed upper bound on queue size. Opt for readerwriterqueue when you need blocking operations with timeouts or a queue that can grow on demand.

C++14 lock-free queue.

Pros of atomic_queue

  • Supports multiple producers and consumers, offering more flexibility than SPSCQueue's single-producer, single-consumer design
  • Provides a range of queue variants, including compile-time and runtime fixed capacities
  • Performs well in high-contention, multi-producer scenarios

Cons of atomic_queue

  • More complex implementation, which may make it harder to understand and maintain
  • Lacks some of the simplicity and guarantees provided by SPSCQueue's focused design
  • May have higher memory overhead due to its more versatile structure

Code Comparison

SPSCQueue:

template <typename T, typename Allocator = std::allocator<T>>
class SPSCQueue {
  alignas(64) std::atomic<size_t> writeIdx_{0};
  alignas(64) std::atomic<size_t> readIdx_{0};
  // ...
};

atomic_queue:

template <typename T, unsigned N>
class atomic_queue {
  alignas(64) std::atomic<unsigned> head_ = {0};
  alignas(64) std::atomic<unsigned> tail_ = {0};
  alignas(64) T elements_[N];
  // ...
};

Both implementations use atomic operations for thread-safety, but atomic_queue employs a more complex structure to support multiple producers and consumers. SPSCQueue focuses on a simpler, single-producer, single-consumer design, which may be more efficient for specific use cases but less versatile overall.

An open-source C++ library developed and used at Facebook.

Pros of Folly

  • Comprehensive library with a wide range of utilities and data structures
  • Actively maintained by Facebook with frequent updates and improvements
  • Extensive documentation and examples for various use cases

Cons of Folly

  • Larger codebase and potential overhead for projects only needing specific components
  • Steeper learning curve due to the breadth of features and abstractions
  • May introduce dependencies on other Folly components

Code Comparison

SPSCQueue:

SPSCQueue<int> q(100);
q.push(42);
int value = *q.front();
q.pop();

Folly (using MPMCQueue as an example):

folly::MPMCQueue<int> q(100);
q.write(42);
int value;
q.read(value);

Summary

SPSCQueue is a focused, single-producer-single-consumer queue implementation, while Folly is a comprehensive C++ library with various utilities. SPSCQueue may be more suitable for projects requiring a specific, lightweight queue implementation, whereas Folly offers a broader set of tools and data structures for larger-scale applications. The choice between the two depends on project requirements, performance needs, and the desired level of abstraction.

Concurrent data structures in C++

Pros of Junction

  • Provides concurrent hash maps that support multiple readers and writers, offering more flexibility
  • Includes several map implementations (Linear, Leapfrog, Grampa) with different scalability trade-offs
  • Offers better scalability for shared key-value state in complex multi-threaded scenarios

Cons of Junction

  • Solves a different problem (concurrent maps rather than queues), so it is not a drop-in replacement
  • Heavier to integrate: it depends on the Turf library and uses a CMake build, unlike a single header
  • Might have a larger memory footprint due to its broader feature set

Code Comparison

SPSCQueue:

SPSCQueue<int> q(100);
q.push(42);
int value = *q.front();
q.pop();

Junction (a concurrent map rather than a queue):

junction::ConcurrentMap_Leapfrog<turf::u32, int *> map;
map.assign(1, new int(42));
int *value = map.get(1);

Key Differences

SPSCQueue is specifically designed for single-producer, single-consumer queues, offering high performance and low latency for that communication pattern.

Junction provides concurrent hash maps that support multiple readers and writers. It's better suited for multi-threaded applications that need shared key-value state rather than point-to-point message passing.

The choice between the two depends on the problem. If you need a simple, high-performance queue between one producer and one consumer, SPSCQueue is the better fit. If you need a lookup table shared across many threads, Junction is the appropriate tool.

Concurrency primitives, safe memory reclamation mechanisms and non-blocking (including lock-free) data structures designed to aid in the research, design and implementation of high performance concurrent systems developed in C99+.

Pros of ck

  • Broader scope: ck is a comprehensive concurrency toolkit with various data structures and synchronization primitives
  • More mature project with a larger community and wider adoption
  • Supports multiple platforms and architectures

Cons of ck

  • Higher complexity due to its broader scope, potentially steeper learning curve
  • May include unnecessary features for projects only needing a simple queue
  • Potentially larger memory footprint due to additional functionality

Code Comparison

SPSCQueue:

template <typename T, typename Allocator = std::allocator<T>>
class SPSCQueue {
public:
  explicit SPSCQueue(const size_t capacity);
  bool try_push(const T &v);
  T *front();
  void pop();
};

ck:

struct ck_ring {
    unsigned int c_head;
    char pad[CK_MD_CACHELINE - sizeof(unsigned int)];
    unsigned int p_tail;
    char _pad[CK_MD_CACHELINE - sizeof(unsigned int)];
    unsigned int size;
    unsigned int mask;
};

SPSCQueue focuses specifically on a single-producer, single-consumer queue implementation, while ck provides a more general-purpose ring buffer (ck_ring) that can be used in various concurrency scenarios. SPSCQueue offers a simpler, more specialized API, whereas ck's implementation is part of a larger toolkit with more flexibility but potentially more complexity.


README

SPSCQueue.h

A single producer single consumer wait-free and lock-free fixed size queue written in C++11. This implementation is faster than both boost::lockfree::spsc_queue and folly::ProducerConsumerQueue.

Example

SPSCQueue<int> q(1);
auto t = std::thread([&] {
  while (!q.front());
  std::cout << *q.front() << std::endl;
  q.pop();
});
q.push(1);
t.join();

See src/SPSCQueueExample.cpp for the full example.

Usage

  • SPSCQueue<T>(size_t capacity);

    Create an SPSCQueue holding items of type T with the given capacity. The capacity needs to be at least 1.

  • void emplace(Args &&... args);

    Enqueue an item using inplace construction. Blocks if queue is full.

  • bool try_emplace(Args &&... args);

    Try to enqueue an item using inplace construction. Returns true on success and false if queue is full.

  • void push(const T &v);

    Enqueue an item using copy construction. Blocks if queue is full.

  • template <typename P> void push(P &&v);

    Enqueue an item using move construction. Participates in overload resolution only if std::is_constructible<T, P&&>::value == true. Blocks if queue is full.

  • bool try_push(const T &v);

    Try to enqueue an item using copy construction. Returns true on success and false if queue is full.

  • template <typename P> bool try_push(P &&v);

    Try to enqueue an item using move construction. Returns true on success and false if queue is full. Participates in overload resolution only if std::is_constructible<T, P&&>::value == true.

  • T *front();

    Return pointer to front of queue. Returns nullptr if queue is empty.

  • void pop();

    Dequeue the first item of the queue. You must ensure that the queue is non-empty before calling pop. This means that front() must have returned a non-nullptr value before each call to pop(). Requires std::is_nothrow_destructible<T>::value == true.

  • size_t size();

    Return the number of items available in the queue.

  • bool empty();

    Return true if queue is currently empty.

Only a single writer thread can perform enqueue operations and only a single reader thread can perform dequeue operations. Any other usage is invalid.

Huge page support

In addition to supporting custom allocation through the standard custom allocator interface this library also supports standard proposal P0401R3 Providing size feedback in the Allocator interface. This allows convenient use of huge pages without wasting any allocated space. Using size feedback is only supported when C++17 is enabled.

The library currently doesn't include a huge page allocator since the APIs for allocating huge pages are platform dependent and handling of huge page size and NUMA awareness is application specific.

Below is an example huge page allocator for Linux:

#include <sys/mman.h>

template <typename T> struct Allocator {
  using value_type = T;

  struct AllocationResult {
    T *ptr;
    size_t count;
  };

  size_t roundup(size_t n) { return (((n - 1) >> 21) + 1) << 21; }

  AllocationResult allocate_at_least(size_t n) {
    size_t count = roundup(sizeof(T) * n);
    auto p = static_cast<T *>(mmap(nullptr, count, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                                   -1, 0));
    if (p == MAP_FAILED) {
      throw std::bad_alloc();
    }
    return {p, count / sizeof(T)};
  }

  void deallocate(T *p, size_t n) { munmap(p, roundup(sizeof(T) * n)); }
};

See src/SPSCQueueExampleHugepages.cpp for the full example on how to use huge pages on Linux.
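The roundup helper in the allocator above rounds the requested byte count up to the next multiple of the 2 MiB huge page size (1 << 21 bytes), so the allocation never wastes a partial page. A quick standalone sanity check of that arithmetic (note the expression assumes n is at least 1):

```cpp
#include <cassert>
#include <cstddef>

// Same rounding as the huge page allocator above: round n up to the next
// multiple of 2 MiB (1 << 21 bytes). Assumes n >= 1.
size_t roundup(size_t n) { return (((n - 1) >> 21) + 1) << 21; }
```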

Implementation

Memory layout

The underlying implementation is based on a ring buffer.

Care has been taken to avoid any issues with false sharing. The head and tail indices are aligned and padded to the false sharing range (cache line size). Additionally, the slots buffer is padded with the false sharing range at the beginning and end; this prevents false sharing with any adjacent allocations.
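The alignment idea can be sketched as follows. This is an illustration with made-up names, not the library's code, and it assumes a 64-byte cache line:

```cpp
#include <atomic>
#include <cstddef>

// Giving each index its own 64-byte-aligned slot guarantees that the
// producer's writes to writeIdx and the consumer's writes to readIdx land
// on different cache lines, so neither side invalidates the other's line.
struct AlignedIndices {
  alignas(64) std::atomic<size_t> writeIdx{0};
  alignas(64) std::atomic<size_t> readIdx{0};
};

// Each alignas(64) member starts a new 64-byte block, so the struct spans
// at least two cache lines.
static_assert(sizeof(AlignedIndices) >= 128, "indices must not share a line");
static_assert(alignof(AlignedIndices) == 64, "struct aligned to a cache line");
```

C++17 code can derive the constant from std::hardware_destructive_interference_size instead of hard-coding 64.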

This implementation has higher throughput than a typical concurrent ring buffer by locally caching the head and tail indices in the writer and reader respectively. The caching increases throughput by reducing the amount of cache coherency traffic.

To understand how that works first consider a read operation in absence of caching: the head index (read index) needs to be updated and thus that cache line is loaded into the L1 cache in exclusive state. The tail (write index) needs to be read in order to check that the queue is not empty and is thus loaded into the L1 cache in shared state. Since a queue write operation needs to read the head index it's likely that a write operation requires some cache coherency traffic to bring the head index cache line back into exclusive state. In the worst case there will be one cache line transition from shared to exclusive for every read and write operation.

Next consider a queue reader that caches the tail index: if the cached tail index indicates that the queue is empty, the reader refreshes the cache by loading the tail index. If the queue was in fact non-empty, multiple read operations up to the cached tail index can then complete without stealing the writer's exclusive hold on the tail index cache line. Cache coherency traffic is therefore reduced. An analogous argument can be made for the queue write operation.

This implementation allows for arbitrary non-power-of-two capacities, at the cost of allocating one extra queue slot to distinguish a full queue from an empty one. If you don't want to waste storage on an extra queue slot, you should use a different implementation.
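The two ideas above (locally cached indices, plus one spare slot to tell a full queue from an empty one) can be sketched in a few dozen lines. This is a simplified illustration with made-up names, not the library's code, and it omits the cache line alignment and padding discussed earlier:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

template <typename T> class TinySpscQueue {
public:
  // One extra slot: writeIdx_ == readIdx_ means empty, and the queue is
  // full when advancing writeIdx_ would make the two indices equal.
  explicit TinySpscQueue(size_t capacity) : slots_(capacity + 1) {}

  bool try_push(const T &v) {  // producer thread only
    size_t w = writeIdx_.load(std::memory_order_relaxed);
    size_t next = (w + 1 == slots_.size()) ? 0 : w + 1;
    if (next == readIdxCache_) {  // looks full: refresh the cached index
      readIdxCache_ = readIdx_.load(std::memory_order_acquire);
      if (next == readIdxCache_) return false;  // really full
    }
    slots_[w] = v;
    writeIdx_.store(next, std::memory_order_release);
    return true;
  }

  bool try_pop(T &v) {  // consumer thread only
    size_t r = readIdx_.load(std::memory_order_relaxed);
    if (r == writeIdxCache_) {  // looks empty: refresh the cached index
      writeIdxCache_ = writeIdx_.load(std::memory_order_acquire);
      if (r == writeIdxCache_) return false;  // really empty
    }
    v = slots_[r];
    readIdx_.store((r + 1 == slots_.size()) ? 0 : r + 1,
                   std::memory_order_release);
    return true;
  }

private:
  std::vector<T> slots_;
  std::atomic<size_t> writeIdx_{0};  // written by the producer only
  size_t readIdxCache_{0};           // producer's private copy of readIdx_
  std::atomic<size_t> readIdx_{0};   // written by the consumer only
  size_t writeIdxCache_{0};          // consumer's private copy of writeIdx_
};
```

As long as the cached index still shows free room (or pending items), each side touches only its own atomic, so no cache coherency traffic crosses between the cores until a cached copy actually has to be refreshed.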

Testing

Testing lock-free algorithms is hard. I'm using two approaches to test the implementation:

  • A single-threaded test verifies that the functionality works as intended, including that item constructors and destructors are invoked correctly.
  • A multi-threaded fuzz test verifies that all items are enqueued and dequeued correctly under heavy contention.

Benchmarks

Throughput benchmark measures throughput between 2 threads for a queue of int items.

Latency benchmark measures round trip time between 2 threads communicating using 2 queues of int items.

Benchmark results for an AMD Ryzen 9 3900X 12-Core Processor; the 2 threads are running on different cores on the same chiplet:

| Queue                        | Throughput (ops/ms) | Latency RTT (ns) |
| ---------------------------- | ------------------- | ---------------- |
| SPSCQueue                    | 362723              | 133              |
| boost::lockfree::spsc_queue  | 209877              | 222              |
| folly::ProducerConsumerQueue | 148818              | 147              |

Cited by

SPSCQueue has been cited by the following papers:

  • Peizhao Ou and Brian Demsky. 2018. Towards understanding the costs of avoiding out-of-thin-air results. Proc. ACM Program. Lang. 2, OOPSLA, Article 136 (October 2018), 29 pages. DOI: https://doi.org/10.1145/3276506

About

This project was created by Erik Rigtorp <erik@rigtorp.se>.