#bloom-filter #bloom #filter #counting-bloom-filter

bin+lib fastbloom-rs

用 Rust 为 Python 和 Rust 实现的一些快速 bloom 过滤器!

16 个版本

0.5.9 2023 年 12 月 4 日
0.5.8 2023 年 11 月 27 日
0.5.7 2023 年 10 月 7 日
0.5.5 2023 年 7 月 24 日
0.3.0 2022 年 6 月 6 日

#114数据结构

Download history 1094/week @ 2024-04-14 1142/week @ 2024-04-21 1273/week @ 2024-04-28 1620/week @ 2024-05-05 1511/week @ 2024-05-12 1378/week @ 2024-05-19 1178/week @ 2024-05-26 1215/week @ 2024-06-02 1363/week @ 2024-06-09 857/week @ 2024-06-16 925/week @ 2024-06-23 1009/week @ 2024-06-30 880/week @ 2024-07-07 731/week @ 2024-07-14 635/week @ 2024-07-21 581/week @ 2024-07-28

2,918 每月下载量
4 个 crate 中使用 (3 个直接使用)

Apache-2.0

65KB
1K SLoC

fastbloom

OSCS Status docs.rs Test Rust Test Python Benchmark Crates Latest Release PyPI Latest Release Sonatype Nexus (Snapshots)

一个快速的 bloom 过滤器 | 计数 bloom 过滤器,由 Rust 为 Rust 和 Python 实现!

语言: 简体中文

设置

Python

需求

Python >= 3.7

安装

使用以下命令安装最新版本的 fastbloom:

pip install fastbloom-rs

Rust

fastbloom-rs = "{latest}"

Java

maven

<dependency>
    <groupId>io.github.yankun1992</groupId>
    <artifactId>fastbloom</artifactId>
    <version>{latest-version}</version>
</dependency>

示例

BloomFilter

bloom 过滤器是一种由 Burton Howard Bloom 在 1970 年提出的空间效率高的概率数据结构,用于测试一个元素是否是集合的成员。可能存在假阳性匹配,但不存在假阴性匹配。

参考文献:Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426. 全文文章

Python

基本用法

from fastbloom_rs import BloomFilter

bloom = BloomFilter(100_000_000, 0.01)

bloom.add_str('hello')
bloom.add_bytes(b'world')
bloom.add_int(9527)

assert bloom.contains('hello')
assert bloom.contains(b'world')
assert bloom.contains(9527)

assert not bloom.contains('hello world')

从字节或列表构建 bloom 过滤器

from fastbloom_rs import BloomFilter

bloom = BloomFilter(100_000_000, 0.01)
bloom.add_str('hello')
assert bloom.contains('hello')

bloom2 = BloomFilter.from_bytes(bloom.get_bytes(), bloom.hashes())
assert bloom2.contains('hello')

bloom3 = BloomFilter.from_int_array(bloom.get_int_array(), bloom.hashes())
assert bloom3.contains('hello')

有一些批量 API 用于 Python,以减少 Python 和 Rust 之间的 FFI 成本

bloom = BloomFilter(100_000_000, 0.01)
inserts = [1, 2, 3, 4, 5, 6, 7, 9, 18, 68, 90, 100]
checks = [1, 2, 3, 4, 5, 6, 7, 9, 18, 68, 90, 100, 190, 290, 390]
results = [True, True, True, True, True, True, True, True, True, True, True, True, False, False, False]

bloom.add_int_batch(inserts)
contains = bloom.contains_int_batch(checks)
assert contains == results

bloom.add_str_batch(list(map(lambda x: str(x), inserts)))
assert bloom.contains_str_batch(list(map(lambda x: str(x), checks))) == results

bloom.add_bytes_batch(list(map(lambda x: bytes(x), inserts)))
assert bloom.contains_bytes_batch(list(map(lambda x: bytes(x), checks))) == results

更多示例在 py_tests

Rust

use fastbloom_rs::{BloomFilter, FilterBuilder};

let mut bloom = FilterBuilder::new(100_000_000, 0.01).build_bloom_filter();

bloom.add(b"helloworld");
assert_eq!(bloom.contains(b"helloworld"), true);
assert_eq!(bloom.contains(b"helloworld!"), false);

更多示例在 docs.rs

CountingBloomFilter

计数 bloom 过滤器与普通 bloom 过滤器的工作方式类似;然而,它能够跟踪插入和删除操作。在计数 bloom 过滤器中,每个 bloom 过滤器的条目都与一个基本 bloom 过滤器位关联的小计数器。

参考文献:F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “An Improved Construction for Counting Bloom Filters,” in 14th Annual European Symposium on Algorithms, LNCS 4168, 2006

Python

from fastbloom_rs import CountingBloomFilter

cbf = CountingBloomFilter(1000_000, 0.01)
cbf.add('hello')
cbf.add('hello')
assert 'hello' in cbf
cbf.remove('hello')
assert 'hello' in cbf  # because 'hello' added twice. 
# If add same element larger than 15 times, then remove 15 times the filter will not contain the element.
cbf.remove('hello')
assert 'hello' not in cbf

计数BloomFilter有一个四位的计数器来保存哈希索引,因此当重复插入一个元素时,计数器会很快溢出。因此,您可以设置 enable_repeat_insertFalse 来检查元素是否已添加。如果已添加,则不会再次添加。enable_repeat_insert 默认设置为 True

from fastbloom_rs import CountingBloomFilter

cbf = CountingBloomFilter(1000_000, 0.01, False)
cbf.add('hello')
cbf.add('hello')  # because enable_repeat_insert=False, this addition will not take effect. 
assert 'hello' in cbf
cbf.remove('hello')
assert 'hello' not in cbf 

更多示例请见 py_tests

Rust

use fastbloom_rs::{CountingBloomFilter, FilterBuilder};

let mut builder = FilterBuilder::new(100_000, 0.01);
let mut cbf = builder.build_counting_bloom_filter();
cbf.add(b"helloworld");
assert_eq!(bloom.contains(b"helloworld"), true);

基准测试

计算机信息

CPU 内存 操作系统
AMD Ryzen 7 5800U 带Radeon显卡 16G Windows 10

向 bloom 过滤器添加一个字符串

基准测试将一个字符串插入BloomFilter

bloom_add_test          time:   [41.168 ns 41.199 ns 41.233 ns]
                        change: [-0.4891% -0.0259% +0.3417%] (p = 0.91 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) high mild
  12 (12.00%) high severe

向 bloom 过滤器添加一百万个元素

基准测试循环插入 (1..1_000_000).map(|n| { n.to_string() }) 到BloomFilter

bloom_add_all_test      time:   [236.24 ms 236.86 ms 237.55 ms]
                        change: [-3.4346% -2.9050% -2.3524%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

检查一个元素是否在 bloom 过滤器中

bloom_contains_test     time:   [42.065 ns 42.102 ns 42.156 ns]
                        change: [-0.7830% -0.5901% -0.4029%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  9 (9.00%) high severe

检查一个元素是否不在 bloom 过滤器中

bloom_not_contains_test time:   [22.695 ns 22.727 ns 22.773 ns]
                        change: [-3.1948% -2.9695% -2.7268%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

向计数 bloom 过滤器添加一个字符串

counting_bloom_add_test time:   [60.822 ns 60.861 ns 60.912 ns]
                        change: [+0.2427% +0.3772% +0.5579%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

向计数 bloom 过滤器添加一百万个元素

基准测试循环插入 (1..1_000_000).map(|n| { n.to_string() }) 到计数BloomFilter

counting_bloom_add_million_test
                        time:   [272.48 ms 272.58 ms 272.68 ms]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

依赖项

~1.6–2.3MB
~42K SLoC