bloom

Go package implementing Bloom filters, used by Milvus and Beego.

2,614

251

2,614

View on GitHub

Top Related Projects

BoomFilters

1,615

Probabilistic data structures for processing continuous, unbounded streams.

Quick Overview

Bits-and-blooms/bloom is a Go package that implements various Bloom filter variants and other probabilistic data structures. It provides efficient and space-optimized solutions for set membership testing, counting, and related problems. The library is designed to be fast, memory-efficient, and easy to use in Go applications.

Pros

High performance and memory efficiency
Multiple Bloom filter variants and probabilistic data structures
Thread-safe implementations
Well-documented and actively maintained

Cons

Limited to Go programming language
Probabilistic nature may lead to false positives (inherent to Bloom filters)
Learning curve for understanding different filter types and their use cases
May require fine-tuning for optimal performance in specific scenarios

Code Examples

Creating a standard Bloom filter:

import "github.com/bits-and-blooms/bloom/v3"

filter := bloom.NewWithEstimates(1000000, 0.01)
filter.Add([]byte("example"))
exists := filter.Test([]byte("example"))

Using a counting Bloom filter:

import "github.com/bits-and-blooms/bloom/v3"

cbf := bloom.NewCountingBloomFilter(1000000, 0.01)
cbf.Add([]byte("item"))
cbf.Add([]byte("item"))
count := cbf.Count([]byte("item"))

Creating a scalable Bloom filter:

import "github.com/bits-and-blooms/bloom/v3"

sbf := bloom.NewDefaultScalableBloomFilter(0.01)
sbf.Add([]byte("scalable"))
exists := sbf.Test([]byte("scalable"))

Getting Started

To use the bits-and-blooms/bloom package in your Go project, follow these steps:

Install the package:

go get -u github.com/bits-and-blooms/bloom/v3

Import the package in your Go code:

import "github.com/bits-and-blooms/bloom/v3"

Create and use a Bloom filter:

filter := bloom.NewWithEstimates(1000000, 0.01)
filter.Add([]byte("example"))
exists := filter.Test([]byte("example"))
fmt.Printf("Item exists: %v\n", exists)

This example creates a Bloom filter optimized for 1 million items with a false positive rate of 1%, adds an item, and tests for its existence.

Competitor Comparisons

BoomFilters

1,615

Probabilistic data structures for processing continuous, unbounded streams.

Pros of BoomFilters

Supports multiple filter types (Stable Bloom Filter, Scalable Bloom Filter, Counting Bloom Filter)
Implements partitioned filters for better cache efficiency
Provides a benchmarking suite for performance testing

Cons of BoomFilters

Less actively maintained (last commit over 2 years ago)
Fewer contributors and stars on GitHub
Limited to Go language implementation

Code Comparison

BoomFilters:

filter := boom.NewBloomFilter(1000, 0.01)
filter.Add([]byte("item"))
if filter.Test([]byte("item")) {
    fmt.Println("Item probably in set")
}

bloom:

filter := bloom.NewWithEstimates(1000, 0.01)
filter.Add([]byte("item"))
if filter.Test([]byte("item")) {
    fmt.Println("Item probably in set")
}

Both libraries offer similar basic functionality for Bloom filters, with BoomFilters providing additional filter types and bloom focusing on a more streamlined API. The bloom library is more actively maintained and has a larger community, while BoomFilters offers more diverse filter implementations and benchmarking tools. Choose based on your specific requirements and the level of community support you need.

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

README

Bloom filters

This library is used by popular systems such as Milvus and beego.

A Bloom filter is a concise/compressed representation of a set, where the main requirement is to make membership queries; i.e., whether an item is a member of a set. A Bloom filter will always correctly report the presence of an element in the set when the element is indeed present. A Bloom filter can use much less storage than the original set, but it allows for some 'false positives': it may sometimes report that an element is in the set whereas it is not.

When you construct, you need to know how many elements you have (the desired capacity), and what is the desired false positive rate you are willing to tolerate. A common false-positive rate is 1%. The lower the false-positive rate, the more memory you are going to require. Similarly, the higher the capacity, the more memory you will use. You may construct the Bloom filter capable of receiving 1 million elements with a false-positive rate of 1% in the following manner.

    filter := bloom.NewWithEstimates(1000000, 0.01)

You should call NewWithEstimates conservatively: if you specify a number of elements that it is too small, the false-positive bound might be exceeded. A Bloom filter is not a dynamic data structure: you must know ahead of time what your desired capacity is.

Our implementation accepts keys for setting and testing as []byte. Thus, to add a string item, "Love":

    filter.Add([]byte("Love"))

Similarly, to test if "Love" is in bloom:

    if filter.Test([]byte("Love"))

For numerical data, we recommend that you look into the encoding/binary library. But, for example, to add a uint32 to the filter:

    i := uint32(100)
    n1 := make([]byte, 4)
    binary.BigEndian.PutUint32(n1, i)
    filter.Add(n1)

Godoc documentation: https://pkg.go.dev/github.com/bits-and-blooms/bloom/v3

Installation

go get -u github.com/bits-and-blooms/bloom/v3

Verifying the False Positive Rate

Sometimes, the actual false positive rate may differ (slightly) from the theoretical false positive rate. We have a function to estimate the false positive rate of a Bloom filter with m bits and k hashing functions for a set of size n:

    if bloom.EstimateFalsePositiveRate(20*n, 5, n) > 0.001 ...

You can use it to validate the computed m, k parameters:

    m, k := bloom.EstimateParameters(n, fp)
    ActualfpRate := bloom.EstimateFalsePositiveRate(m, k, n)

    f := bloom.NewWithEstimates(n, fp)
    ActualfpRate := bloom.EstimateFalsePositiveRate(f.m, f.k, n)

You would expect ActualfpRate to be close to the desired false-positive rate fp in these cases.

The EstimateFalsePositiveRate function creates a temporary Bloom filter. It is also relatively expensive and only meant for validation.

Serialization

You can read and write the Bloom filters as follows:

	f := New(1000, 4)
	var buf bytes.Buffer
	bytesWritten, err := f.WriteTo(&buf)
	if err != nil {
		t.Fatal(err.Error())
	}
	var g BloomFilter
	bytesRead, err := g.ReadFrom(&buf)
	if err != nil {
		t.Fatal(err.Error())
	}
	if bytesRead != bytesWritten {
		t.Errorf("read unexpected number of bytes %d != %d", bytesRead, bytesWritten)
	}

Performance tip: When reading and writing to a file or a network connection, you may get better performance by wrapping your streams with bufio instances.

E.g.,

	f, err := os.Create("myfile")
	w := bufio.NewWriter(f)

	f, err := os.Open("myfile")
	r := bufio.NewReader(f)

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow")

This project includes a Makefile that allows you to test and build the project with simple commands. To see all available options:

make help

Running all tests

Before committing the code, please check if it passes all tests using (note: this will install some dependencies):

make deps
make qa

Design

A Bloom filter has two parameters: m, the number of bits used in storage, and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.

In this implementation, the hashing functions used is murmurhash, a non-cryptographic hashing function.

Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.

Goroutine safety

In general, it not safe to access the same filter using different goroutines--they are unsynchronized for performance. Should you want to access a filter from more than one goroutine, you should provide synchronization. Typically this is done by using channels (in Go style; so there is only ever one owner), or by using sync.Mutex to serialize operations. Exceptionally, you may access the same filter from different goroutines if you never modify the content of the filter.

Top Related Projects

BoomFilters

1,615

Probabilistic data structures for processing continuous, unbounded streams.

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot