Top Related Projects
Go package implementing Bloom filters, used by Milvus and Beego.
Probabilistic data structures for processing continuous, unbounded streams.
A modern text/numeric/geo-spatial/vector indexing library for go
Quick Overview
Bloomd is a high-performance C server for Bloom filters, which are probabilistic data structures used for efficient set membership testing. It allows for the creation and management of multiple Bloom filters, providing a network protocol for interaction and supporting multiple concurrent clients.
Pros
- High performance and low memory footprint due to its implementation in C
- Supports multiple concurrent clients and Bloom filters
- Provides a simple network protocol for easy integration with various systems
- Offers both in-memory and disk-persistent storage options
Cons
- Limited to Bloom filter functionality, not suitable for more complex data structures
- Requires separate client libraries for different programming languages
- May have a steeper learning curve compared to higher-level implementations
- Lacks built-in security features, requiring additional measures for secure deployments
Code Examples
Since Bloomd is a server implementation and not a code library, there are no direct code examples for using it as a library. Instead, here are examples of how to interact with the Bloomd server using its protocol:
# Create a new filter named "test" with a capacity of 1000000 and a false positive rate of 0.01
create test 1000000 0.01
# Check if the key "foo" is in the "test" filter
check test foo
# Add the key "bar" to the "test" filter
set test bar
# Check multiple keys in the "test" filter
multi test foo bar baz
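The protocol lines above can also be generated programmatically. Here is a minimal sketch in Python; the helper name `build_command` is illustrative, not part of any official client library:

```python
def build_command(cmd, *args):
    """Format a bloomd protocol line: command, space-separated args, newline."""
    return " ".join([cmd, *map(str, args)]) + "\n"

# The four example commands from above:
create_line = build_command("create", "test", 1000000, 0.01)
check_line = build_command("check", "test", "foo")
set_line = build_command("set", "test", "bar")
multi_line = build_command("multi", "test", "foo", "bar", "baz")
```

Each line would then be written to a TCP socket connected to the bloomd port.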
Getting Started
To get started with Bloomd:
- Clone the repository:
git clone https://github.com/armon/bloomd.git
- Build the project:
cd bloomd && make
- Run the server:
./bloomd
- Connect to the server using a client library or telnet:
telnet localhost 8673
- Start using Bloomd commands to create and manage Bloom filters
For more detailed instructions and configuration options, refer to the project's README and documentation.
Competitor Comparisons
Go package implementing Bloom filters, used by Milvus and Beego.
Pros of bloom
- Written in Go, making it easy to embed directly in Go applications, with native concurrency support
- More actively maintained with recent updates and contributions
- Provides a wider range of Bloom filter implementations (e.g., counting, stable, scalable)
Cons of bloom
- Lacks built-in network server functionality
- Does not offer persistence out of the box
- May require more setup and integration effort for distributed systems
Code Comparison
bloomd (C):
int bloomf_add(bloom_filter *filter, char *key) {
uint64_t hash = bloomf_calculate_hash(filter, key);
return set_bit(filter, hash);
}
bloom (Go):
func (f *Filter) Add(item []byte) {
for i := uint(0); i < f.k; i++ {
f.setBit(f.location(item, i))
}
}
Both implementations add an item to the Bloom filter, but bloom uses multiple hash functions (k) for better distribution, while bloomd uses a single hash function. The Go implementation in bloom is more concise and leverages language features for simplicity.
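The k-hash-function scheme used by the Go library can be emulated with the double-hashing trick from Kirsch and Mitzenmacher (cited in the references at the end of this page): derive k locations from two base hashes. A toy Python sketch, not taken from either project:

```python
import hashlib

class SimpleBloom:
    """Toy Bloom filter using double hashing: g_i(x) = h1(x) + i*h2(x) mod m."""

    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _locations(self, item):
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so it cycles
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for loc in self._locations(item):
            self.bits[loc // 8] |= 1 << (loc % 8)

    def test(self, item):
        return all(self.bits[loc // 8] >> (loc % 8) & 1
                   for loc in self._locations(item))
```

With k locations per item, the false-positive rate can be tuned independently of the hash function count, which is the distribution benefit described above.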
Probabilistic data structures for processing continuous, unbounded streams.
Pros of BoomFilters
- Written in Go, making it easy to embed directly in Go applications, with native concurrency support
- Provides a wider variety of probabilistic data structures (e.g., Counting Bloom Filter, Cuckoo Filter)
- More actively maintained with recent updates
Cons of BoomFilters
- Lacks a network server component, primarily a library for integration
- May require more setup and configuration for distributed use cases
Code Comparison
BoomFilters (Go):
filter := boom.NewBloomFilter(1000, 0.01)
filter.Add([]byte("item"))
exists := filter.Test([]byte("item"))
bloomd (C):
bloom_filter_t *filter = bf_create(1000, 0.01);
bf_add(filter, "item", strlen("item"));
int exists = bf_contains(filter, "item", strlen("item"));
Summary
BoomFilters offers a more diverse set of probabilistic data structures and is implemented in Go, which can provide performance benefits. However, bloomd includes a network server component, making it potentially easier to set up for distributed systems. The choice between the two depends on specific use cases and language preferences.
A modern text/numeric/geo-spatial/vector indexing library for go
Pros of Bleve
- Full-text search engine with advanced querying capabilities
- Supports multiple languages and custom text analysis
- Offers faceting, highlighting, and other search-specific features
Cons of Bleve
- More complex to set up and use compared to Bloomd
- Higher memory and storage requirements
- May be overkill for simple membership testing use cases
Code Comparison
Bloomd (C):
int bloom_add(struct bloom *bloom, const char *buffer, int buffer_len) {
uint64_t hash = bloomd_hash(buffer, buffer_len, bloom->hash_seeds[0]);
setbit(bloom->bitmap, (hash % bloom->bitmap_size));
return 0;
}
Bleve (Go):
func (i *IndexImpl) Index(id string, data interface{}) error {
doc := document.NewDocument(id)
err := i.analysisQueue.Queue(func() {
i.docAnalyzer.Analyze(data, doc)
})
return i.Update(doc)
}
Summary
Bloomd is a lightweight Bloom filter implementation focused on efficient membership testing. Bleve, on the other hand, is a full-featured text search engine with more advanced capabilities. Bloomd is simpler and more memory-efficient for basic set membership operations, while Bleve offers powerful search functionality at the cost of increased complexity and resource usage.
README
Bloomd
Bloomd is a high-performance C server which is used to expose bloom filters and operations over them to networked clients. It uses a simple ASCII protocol which is human readable, and similar to memcached.
Bloom filters are a type of sketching or approximate data structure. They trade exactness for efficiency of representation. Bloom filters provide a set-like abstraction, with the important caveat that they may contain false-positives, meaning they may claim an element is part of the set when it was never in fact added. The rate of false positives can be tuned to meet application demands, but reducing the error rate rapidly increases the amount of memory required for the representation.
TL;DR: Bloom filters enable you to represent 1MM items with a false positive rate of 0.1% in 2.4MB of RAM.
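The memory figure follows from the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions. A quick check in Python (the helper name `bloom_bits` is illustrative):

```python
import math

def bloom_bits(n, p):
    """Bits needed to hold n items at false-positive rate p (standard formula)."""
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)

n, p = 1_000_000, 0.001
bits = bloom_bits(n, p)       # ~14.4M bits
mb = bits / 8 / 1e6           # ~1.8 MB for a single flat filter
k = round(bits / n * math.log(2))  # ~10 hash functions
```

A flat filter for 1MM items at 0.1% needs about 1.8MB; bloomd's scalable filters carry extra overhead, which accounts for the ~2.4MB figure above.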
Features
- Scalable non-blocking core allows for many connected clients and concurrent operations
- Implements scalable bloom filters, allowing dynamic filter sizes
- Supports asynchronous flushes to disk for persistence
- Supports non-disk backed bloom filters for high I/O
- Automatically faults cold filters out of memory to save resources
- Dead simple to start and administer
- FAST, FAST, FAST
Install
Download and build from source:
$ git clone https://armon@github.com/armon/bloomd.git
$ cd bloomd
$ pip install SCons # Uses the Scons build system, may not be necessary
$ scons
$ ./bloomd
This will generate some errors related to building the test code as it depends on libcheck. To build the test code successfully, do the following:
$ cd deps/check-0.9.8/
$ ./configure
$ make
# make install
# ldconfig (necessary on some Linux distros)
Then re-build bloomd. At this point, the test code should build successfully.
For CentOS or RHEL users, the kind folks from Vortex RPM have made a repo available with RPMs.
- Repo: http://vortex-rpm.org/
- Bloomd RPM spec: https://github.com/vortex-rpm/bloomd
Usage
Bloomd can be configured using a file which is in INI format. Here is an example configuration file:
# Settings for bloomd
[bloomd]
tcp_port = 8673
data_dir = /mnt/bloomd
log_level = INFO
flush_interval = 300
workers = 2
Then run bloomd, pointing it to that file:
bloomd -f /etc/bloomd.conf
A full list of configuration options is below.
Clients
Here is a list of known client implementations:
- Python : https://github.com/kiip/bloom-python-driver
- Ruby : https://github.com/SponsorPay/bloomrb
- Erlang : https://github.com/armon/erl-bloomd
- Go : https://github.com/geetarista/go-bloomd
- Perl : https://github.com/dams/Bloomd-Client
- Node.js : https://github.com/majelbstoat/node-bloomd
- PHP: https://github.com/mdlayher/php-bloomd
- Java: https://github.com/casidiablo/java-bloomd-client
- Lua/OpenResty: https://github.com/jie123108/lua-resty-bloomd
Here is a list of "best-practices" for client implementations:
- Maintain a set of open connections to the server to minimize connection time
- Make use of the bulk operations when possible, as they are more efficient.
- For long keys, it is better to do a client-side hash (SHA1 at least), and send the hash as the key to minimize network traffic.
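The last point takes only a couple of lines client-side. A sketch (SHA1 follows the suggestion above; the hex encoding and the 64-character threshold are assumptions, not a bloomd convention):

```python
import hashlib

def normalize_key(key, max_len=64):
    """Hash keys longer than max_len so only a fixed-size digest goes over the wire."""
    if len(key) <= max_len:
        return key
    return hashlib.sha1(key.encode()).hexdigest()  # 40 hex characters
```

Since Bloom filters only ever hash the key anyway, substituting a digest does not change membership semantics, as long as every client applies the same rule.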
Configuration Options
Each configuration option is documented below:
- tcp_port : Integer, sets the tcp port to listen on. Default 8673.
- port : Same as above. For compatibility.
- udp_port : Integer, sets the udp port. Currently listened on but otherwise unused. Default 8674.
- bind_address : The IP to bind to. Defaults to 0.0.0.0.
- data_dir : The data directory that is used. Defaults to /tmp/bloomd.
- log_level : The logging level that bloomd should use. One of: DEBUG, INFO, WARN, ERROR, or CRITICAL. All logs go to syslog, and to stderr if that is a TTY. Default is DEBUG.
- workers : Controls the number of worker threads that are used. Defaults to 1. If many different filters are used, it can be advantageous to increase this to the number of CPU cores. If only a few filters are used, the increased lock contention may reduce throughput, and a single worker may be better.
- flush_interval : The time interval in seconds at which filters are flushed to disk. Defaults to 60 seconds. Set to 0 to disable.
- cold_interval : If a filter is not accessed (check or set) for this amount of time, it is eligible to be removed from memory and left only on disk. If a filter is accessed, it will automatically be faulted back into memory. Set to 3600 seconds by default (1 hour). Set to 0 to disable cold faulting.
- in_memory : If set to 1, all filters are in-memory ONLY by default. This means they are not persisted to disk, and are not eligible for cold fault out. Defaults to 0.
- initial_capacity : If a create command does not provide an initial capacity for a filter, this value is used. Defaults to 100K items.
- default_probability : If a create command does not provide a false-positive probability rate, this value is used. Defaults to 1/10K.
- use_mmap : If set to 1, bloomd's internal buffer management is disabled, and buffers instead use a plain mmap() and rely on the kernel for all management. This increases data safety in the case that bloomd crashes, but has adverse effects on performance if the total memory utilization of the system is high. In general, this should be left at 0, which is the default.
- scale_size : When a bloom filter is "scaled" up, this is the multiplier that is used. It should be either 2 or 4. Setting it to 2 will conserve memory, but is slower due to the increased number of filters that need to be checked. Defaults to 4.
- probability_reduction : A subtle control value that affects the scaling of bloom filters. It should probably not be modified. Defaults to 0.9.
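To see how scale_size and probability_reduction interact: in a scalable Bloom filter, successive sub-filters grow geometrically while each gets a tighter error budget, so the total error stays bounded. A sketch using the defaults above (bloomd's exact internals may differ; this follows the Almeida et al. paper cited in the references):

```python
def scaling_plan(initial_capacity, p0, scale_size=4,
                 probability_reduction=0.9, levels=4):
    """Capacity and per-level error budget for the first few sub-filters."""
    plan = []
    cap, p = initial_capacity, p0
    for _ in range(levels):
        p *= probability_reduction  # each level gets a tighter error budget
        plan.append((cap, p))
        cap *= scale_size           # next sub-filter is scale_size times larger
    return plan

plan = scaling_plan(100_000, 0.0001)
# The total error is bounded because the sum p0 * r + p0 * r^2 + ...
# converges for probability_reduction r < 1.
```

This also shows why scale_size=2 is slower: reaching the same total capacity requires more sub-filters, and every check must probe all of them.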
Protocol
By default, Bloomd will listen for TCP connections on port 8673. It uses a simple ASCII protocol that is very similar to memcached.
A command has the following syntax:
cmd [args][\r]\n
Each line starts with a command, followed by optional arguments, and ends with a newline (the carriage return is optional).
There are a total of 11 commands:
- create - Create a new filter (a filter is a named bloom filter)
- list - List all filters or those matching a prefix
- drop - Drop a filter (Deletes from disk)
- close - Closes a filter (Unmaps from memory, but still accessible)
- clear - Clears a filter from the lists (Removes memory, left on disk)
- check|c - Check if a key is in a filter
- multi|m - Checks if a list of keys are in a filter
- set|s - Set an item in a filter
- bulk|b - Set many items in a filter at once
- info - Gets info about a filter
- flush - Flushes all filters or just a specified one
For the create command, the format is:
create filter_name [capacity=initial_capacity] [prob=max_prob] [in_memory=0|1]
Note:
- capacity must be > 10,000 (1e4, 10K)
- capacity is suggested to be <= 1,000,000,000 (1e9, 1G)
- prob must be < 0.1 (1e-1)
- prob is suggested to be <= 0.01 (1e-2)
Where filter_name is the name of the filter, and can contain the characters a-z, A-Z, 0-9, ., _. If an initial capacity is provided, the filter will be created to store at least that many items in the initial filter; otherwise the configured default value will be used. If a maximum false positive probability is provided, that will be used; otherwise the configured default is used. You can optionally specify in_memory to force the filter to not be persisted to disk.
As an example:
create foobar capacity=1000000 prob=0.001
This will create a filter foobar that has a 1M initial capacity, and a 1/1000 probability of generating false positives. Valid responses are either "Done", "Exists", or "Delete in progress". The last response occurs if a filter of the same name was recently deleted, and bloomd has not yet completed the delete operation. If so, a client should retry the create in a few seconds.
The list command takes either no arguments or a prefix, and returns information about the matching filters. Here is an example response to a command:
> list foo
START
foobar 0.001 1797211 1000000 0
END
With the list prefix "foo", this indicates a single filter named foobar, with a probability of 0.001 of false positives, a 1.79MB size, a current capacity of 1M items, and 0 current items. The size and capacity automatically scale as more items are added.
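The fields in each list line are positional, in the order just described (name, probability, size in bytes, capacity, item count). A hypothetical parser:

```python
def parse_list_line(line):
    """Split one line of a bloomd `list` response into named fields."""
    name, prob, size, capacity, items = line.split()
    return {
        "name": name,
        "probability": float(prob),
        "storage_bytes": int(size),
        "capacity": int(capacity),
        "items": int(items),
    }

info = parse_list_line("foobar 0.001 1797211 1000000 0")
```

The field names in the returned dict are illustrative; the protocol itself only guarantees the order.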
The drop, close and clear commands are like create, but only take a filter name. They either return "Done" or "Filter does not exist". clear can also return "Filter is not proxied. Close it first.", which means the filter is still in-memory and not eligible to be cleared. This can be resolved by first closing the filter.
Check and set look similar; both take the form:
[check|set] filter_name key
The command must specify a filter and a key to use. They will either return "Yes", "No" or "Filter does not exist".
The bulk and multi commands are similar to check/set but allow many keys to be set or checked at once. Keys must be separated by a space:
[multi|bulk] filter_name key1 [key2 [key3 ... [keyN]]]
The check, multi, set and bulk commands can also be called by their aliases c, m, s and b respectively.
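A multi or bulk command is just space-joined keys on one line, and the reply is a matching run of Yes/No tokens. A sketch of building one and splitting the reply (function names are illustrative, not part of any client library):

```python
def build_multi(cmd, filter_name, keys):
    """Build a multi/bulk line: command, filter name, then space-separated keys."""
    return " ".join([cmd, filter_name, *keys]) + "\n"

def parse_bool_reply(reply):
    """Turn a 'Yes No No' reply into Python booleans, one per key."""
    return [tok == "Yes" for tok in reply.split()]

line = build_multi("multi", "test", ["foo", "bar", "baz"])
flags = parse_bool_reply("Yes No No")
```

Batching keys this way amortizes the round-trip cost, which is why the client best-practices above recommend the bulk operations.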
The info command takes a filter name, and returns information about the filter. Here is an example output:
START
capacity 1000000
checks 0
check_hits 0
check_misses 0
page_ins 0
page_outs 0
probability 0.001
sets 0
set_hits 0
set_misses 0
size 0
storage 1797211
END
The command may also return "Filter does not exist" if the filter does not exist.
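Responses bracketed by START/END can be parsed generically. A sketch, assuming one key-value pair per line as in the info output above (the numeric-coercion rule is an assumption for illustration):

```python
def parse_info_block(lines):
    """Parse the lines between START and END of an `info` response into a dict."""
    out = {}
    for line in lines:
        line = line.strip()
        if line in ("START", "END") or not line:
            continue
        key, value = line.split(None, 1)
        out[key] = float(value) if "." in value else int(value)
    return out

stats = parse_info_block(
    ["START", "capacity 1000000", "probability 0.001", "size 0", "END"]
)
```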
The flush command may be called without any arguments, which causes all filters to be flushed. If a filter name is provided, then only that filter will be flushed. This will either return "Done" or "Filter does not exist".
Example
Here is an example of a client flow, assuming bloomd is running on the default port using just telnet:
$ telnet localhost 8673
> list
START
END
> create foobar
Done
> check foobar zipzab
No
> set foobar zipzab
Yes
> check foobar zipzab
Yes
> multi foobar zipzab blah boo
Yes No No
> bulk foobar zipzab blah boo
No Yes Yes
> multi foobar zipzab blah boo
Yes Yes Yes
> list
START
foobar 0.000100 300046 100000 3
END
> drop foobar
Done
> list
START
END
Performance
Although extensive performance evaluations have not been done, casual testing on a 2012 MBP with pure set/check operations allows for a throughput of at least 600K ops/sec. On Linux, response times can be as low as 1.5 μs.
Bloomd also supports multi-core systems for scalability, so it is important to tune it for the given workload. The number of worker threads can be configured either in the configuration file or by providing a -w flag. This should be set to at most 2 * CPU count. By default, only a single worker is used.
References
Here are some related works which we make use of:
- Space/Time Trade-offs in Hash Coding with Allowable Errors (Bloom): https://dl.acm.org/doi/10.1145/362686.362692
- Scalable Bloom Filters (Almeida et al.): http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
- Less Hashing, Same Performance: Building a Better Bloom Filter (Kirsch and Mitzenmacher): https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf