16 个版本

0.5.9	2023 年 12 月 4 日
0.5.8	2023 年 11 月 27 日
0.5.7	2023 年 10 月 7 日
0.5.5	2023 年 7 月 24 日
0.3.0	2022 年 6 月 6 日

#114 在数据结构中

2,918 每月下载量
在 4 个 crate 中使用 (3 个直接使用)

Apache-2.0

65KB
1K SLoC

fastbloom

一个快速的 bloom 过滤器 | 计数 bloom 过滤器，由 Rust 为 Rust 和 Python 实现！

语言：简体中文

设置
- Python
  - 需求
  - 安装
- Rust
- Java
示例
- BloomFilter
  - Python
  - Rust
- CountingBloomFilter
  - Python
  - Rust
基准测试

设置

Python

需求

Python >= 3.7

安装

使用以下命令安装最新版本的 fastbloom：

pip install fastbloom-rs

Rust

fastbloom-rs = "{latest}"

Java

maven

<dependency>
    <groupId>io.github.yankun1992</groupId>
    <artifactId>fastbloom</artifactId>
    <version>{latest-version}</version>
</dependency>

示例

BloomFilter

bloom 过滤器是一种由 Burton Howard Bloom 在 1970 年提出的空间效率高的概率数据结构，用于测试一个元素是否是集合的成员。可能存在假阳性匹配，但不存在假阴性匹配。

参考文献：Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426. 全文文章

Python

基本用法

from fastbloom_rs import BloomFilter

bloom = BloomFilter(100_000_000, 0.01)

bloom.add_str('hello')
bloom.add_bytes(b'world')
bloom.add_int(9527)

assert bloom.contains('hello')
assert bloom.contains(b'world')
assert bloom.contains(9527)

assert not bloom.contains('hello world')

从字节或列表构建 bloom 过滤器

from fastbloom_rs import BloomFilter

bloom = BloomFilter(100_000_000, 0.01)
bloom.add_str('hello')
assert bloom.contains('hello')

bloom2 = BloomFilter.from_bytes(bloom.get_bytes(), bloom.hashes())
assert bloom2.contains('hello')

bloom3 = BloomFilter.from_int_array(bloom.get_int_array(), bloom.hashes())
assert bloom3.contains('hello')

有一些批量 API 用于 Python，以减少 Python 和 Rust 之间的 FFI 成本

bloom = BloomFilter(100_000_000, 0.01)
inserts = [1, 2, 3, 4, 5, 6, 7, 9, 18, 68, 90, 100]
checks = [1, 2, 3, 4, 5, 6, 7, 9, 18, 68, 90, 100, 190, 290, 390]
results = [True, True, True, True, True, True, True, True, True, True, True, True, False, False, False]

bloom.add_int_batch(inserts)
contains = bloom.contains_int_batch(checks)
assert contains == results

bloom.add_str_batch(list(map(lambda x: str(x), inserts)))
assert bloom.contains_str_batch(list(map(lambda x: str(x), checks))) == results

bloom.add_bytes_batch(list(map(lambda x: bytes(x), inserts)))
assert bloom.contains_bytes_batch(list(map(lambda x: bytes(x), checks))) == results

更多示例在 py_tests。

Rust

use fastbloom_rs::{BloomFilter, FilterBuilder};

let mut bloom = FilterBuilder::new(100_000_000, 0.01).build_bloom_filter();

bloom.add(b"helloworld");
assert_eq!(bloom.contains(b"helloworld"), true);
assert_eq!(bloom.contains(b"helloworld!"), false);

更多示例在 docs.rs。

CountingBloomFilter

计数 bloom 过滤器与普通 bloom 过滤器的工作方式类似；然而，它能够跟踪插入和删除操作。在计数 bloom 过滤器中，每个 bloom 过滤器的条目都与一个基本 bloom 过滤器位关联的小计数器。

参考文献：F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “An Improved Construction for Counting Bloom Filters,” in 14th Annual European Symposium on Algorithms, LNCS 4168, 2006

Python

from fastbloom_rs import CountingBloomFilter

cbf = CountingBloomFilter(1000_000, 0.01)
cbf.add('hello')
cbf.add('hello')
assert 'hello' in cbf
cbf.remove('hello')
assert 'hello' in cbf  # because 'hello' added twice. 
# If add same element larger than 15 times, then remove 15 times the filter will not contain the element.
cbf.remove('hello')
assert 'hello' not in cbf

计数BloomFilter有一个四位的计数器来保存哈希索引，因此当重复插入一个元素时，计数器会很快溢出。因此，您可以设置 enable_repeat_insert 为 False 来检查元素是否已添加。如果已添加，则不会再次添加。enable_repeat_insert 默认设置为 True。

from fastbloom_rs import CountingBloomFilter

cbf = CountingBloomFilter(1000_000, 0.01, False)
cbf.add('hello')
cbf.add('hello')  # because enable_repeat_insert=False, this addition will not take effect. 
assert 'hello' in cbf
cbf.remove('hello')
assert 'hello' not in cbf

更多示例请见 py_tests。

Rust

use fastbloom_rs::{CountingBloomFilter, FilterBuilder};

let mut builder = FilterBuilder::new(100_000, 0.01);
let mut cbf = builder.build_counting_bloom_filter();
cbf.add(b"helloworld");
assert_eq!(bloom.contains(b"helloworld"), true);

基准测试

计算机信息

CPU	内存	操作系统
AMD Ryzen 7 5800U 带Radeon显卡	16G	Windows 10

向 bloom 过滤器添加一个字符串

基准测试将一个字符串插入BloomFilter

bloom_add_test          time:   [41.168 ns 41.199 ns 41.233 ns]
                        change: [-0.4891% -0.0259% +0.3417%] (p = 0.91 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) high mild
  12 (12.00%) high severe

向 bloom 过滤器添加一百万个元素

基准测试循环插入 (1..1_000_000).map(|n| { n.to_string() }) 到BloomFilter

bloom_add_all_test      time:   [236.24 ms 236.86 ms 237.55 ms]
                        change: [-3.4346% -2.9050% -2.3524%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

检查一个元素是否在 bloom 过滤器中

bloom_contains_test     time:   [42.065 ns 42.102 ns 42.156 ns]
                        change: [-0.7830% -0.5901% -0.4029%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  9 (9.00%) high severe

检查一个元素是否不在 bloom 过滤器中

bloom_not_contains_test time:   [22.695 ns 22.727 ns 22.773 ns]
                        change: [-3.1948% -2.9695% -2.7268%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

向计数 bloom 过滤器添加一个字符串

counting_bloom_add_test time:   [60.822 ns 60.861 ns 60.912 ns]
                        change: [+0.2427% +0.3772% +0.5579%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

向计数 bloom 过滤器添加一百万个元素

基准测试循环插入 (1..1_000_000).map(|n| { n.to_string() }) 到计数BloomFilter

counting_bloom_add_million_test
                        time:   [272.48 ms 272.58 ms 272.68 ms]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

依赖项

~1.6–2.3MB
~42K SLoC