4 releases

0.2.0 Jun 23, 2023
0.1.2 Jul 15, 2022
0.1.1 Mar 5, 2022
0.1.0 Feb 8, 2022

#467 in Algorithms

1,229 downloads per month
Used in 2 crates

MIT license

125KB
3K SLoC

gaoya

About

This project implements Locality Sensitive Hashing algorithms and data structures for indexing and querying text documents. The primary use cases for gaoya are deduplication and clustering.

Main Features

  • 64, 32, 16, and 8-bit minhash
  • 64 and 128-bit simhash
  • Fast implementation in Rust
  • Multi-threaded, thanks to rayon
  • Python bindings

Python Example

>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32, 
                                             jaccard_threshold=0.5, 
                                             num_bands=42, 
                                             band_size=3,
                                             num_hashes=42*3,
                                             analyzer='word', 
                                             lowercase=True, 
                                             ngram_range=(1,1))
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third document.',
...     'Is this the first document?',
...     'This not the first nor the second nor the third, but the fourth document'
... ]
>>> 
>>> for i, doc in enumerate(corpus): index.insert_document(i, doc)
... 
>>> index.query('This is the first document.')
[0, 1, 2, 3]
>>> 
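
Why the query matches all of the first four documents: under banded MinHash LSH, two documents with Jaccard similarity s land in the same bucket in at least one band with probability roughly 1 - (1 - s^band_size)^num_bands. The snippet below evaluates that standard estimate for the parameters used above; it is a back-of-the-envelope sketch in plain Python, not a call into gaoya.

# Probability that two documents with Jaccard similarity s
# agree on at least one of the 42 bands of 3 hashes each.
num_bands, band_size = 42, 3

def collision_probability(s: float) -> float:
    # A single band matches only if all band_size hashes agree: s ** band_size.
    return 1.0 - (1.0 - s ** band_size) ** num_bands

for s in (0.2, 0.3, 0.5, 0.7):
    print(f's = {s:.1f} -> P(collision) = {collision_probability(s):.3f}')

With 42 bands of 3 hashes, the curve rises steeply with s, so pairs well above the 0.5 threshold become candidates almost surely, while far less similar pairs collide with sharply lower probability.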

Installation

$ pip3 install gaoya

Examples

Document deduplication with Gaoya
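
The walkthrough above builds on a loop small enough to sketch here: keep the first document seen from each near-duplicate cluster and flag the rest. This is a minimal sketch, assuming only the insert_document and query calls shown in the session above (and that query returns an empty list when nothing matches); the corpus is the same toy data.

import gaoya

index = gaoya.minhash.MinHashStringIndex(
    hash_size=32, jaccard_threshold=0.5, num_bands=42, band_size=3,
    num_hashes=42 * 3, analyzer='word', lowercase=True, ngram_range=(1, 1))

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third document.',
    'Is this the first document?',
    'This not the first nor the second nor the third, but the fourth document',
]

kept, duplicates = [], []
for i, doc in enumerate(corpus):
    if index.query(doc):
        # A non-empty result means a near-duplicate is already indexed.
        duplicates.append(i)
    else:
        # First document of its cluster: keep it as the representative.
        index.insert_document(i, doc)
        kept.append(i)

print(kept)        # representatives, one per near-duplicate cluster
print(duplicates)  # documents flagged as near-duplicates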

Rust Example

use gaoya::minhash::{MinHashIndex, MinHasher32, MinHasher};
use gaoya::text::whitespace_split;
use fxhash::FxHashSet;
let corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third document.",
    "Is this the first document?",
    "This not the first nor the second nor the third, but the fourth document"];
// 42 bands of 3 hashes each; each signature contains 42 * 3 = 126 hashes
let (num_bands, band_width) = (42, 3);
let minhasher = MinHasher32::new(num_bands * band_width);
// Index with a Jaccard similarity threshold of 0.5
let mut index = MinHashIndex::new(num_bands, band_width, 0.5);
for (i, doc) in corpus.iter().enumerate() {
    index.insert(i, minhasher.create_signature(whitespace_split(&doc.to_lowercase())));
}
for (i, doc) in corpus.iter().enumerate() {
    if i < 4 {
        let mut expected = FxHashSet::default();
        expected.extend(vec![0, 1, 2, 3].into_iter());
        let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
        assert_eq!(index.query_owned(&signature), expected);
    } else {
        let mut expected = FxHashSet::default();
        expected.insert(4);
        let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
        assert_eq!(index.query_owned(&signature), expected);
    }
}
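
The assertions mirror the Python session above: the first four documents form a single near-duplicate cluster, so querying any one of them returns the IDs {0, 1, 2, 3}, while the fifth document matches only itself.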

Dependencies

~4.5MB
~76K SLoC