4 versions
Version | Release date |
---|---|
0.2.0 | June 23, 2023 |
0.1.2 | July 15, 2022 |
0.1.1 | March 5, 2022 |
0.1.0 | February 8, 2022 |
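Gaoya
About
This project implements locality-sensitive hashing (LSH) algorithms and data structures for indexing and querying text documents. Gaoya's primary use cases are deduplication and clustering.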
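To make the idea behind MinHash-based LSH concrete, here is a minimal pure-Python sketch. It does not use Gaoya and is not the library's implementation: the helper names (`token_hash`, `minhash_signature`, `estimate_jaccard`), the choice of BLAKE2b as the hash, and the 256-hash signature length are all assumptions made for illustration. It shows that the fraction of matching signature positions approximates the Jaccard similarity of two token sets.

```python
import hashlib


def token_hash(token: str, seed: int) -> int:
    """64-bit hash of a token; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=256):
    """For every hash function keep the minimum hash over the token set.

    Assumes the token collection is non-empty.
    """
    unique = set(tokens)
    return [min(token_hash(t, seed) for t in unique) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where both signatures hold the same minimum."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


doc_a = "this is the first document.".split()
doc_b = "is this the first document?".split()
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)

# The exact value is 4/6; the estimate should land close to it.
print(exact_jaccard(doc_a, doc_b), estimate_jaccard(sig_a, sig_b))
```

Longer signatures give a tighter estimate at the cost of more hashing work per document.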
Main features
- 64-, 32-, 16-, and 8-bit minhash
- 64- and 128-bit simhash
- Fast implementation in Rust
- Multi-threaded, thanks to rayon
- Python bindings
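For readers unfamiliar with simhash, the following is a rough, library-independent sketch of a 64-bit fingerprint in pure Python. The functions `hash64`, `simhash64`, and `hamming_distance` are hypothetical names for this illustration, every token is given equal weight, and BLAKE2b is an arbitrary choice of hash; Gaoya's own simhash may differ in all of these respects.

```python
import hashlib


def hash64(token: str) -> int:
    """64-bit hash of a single token."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def simhash64(tokens) -> int:
    """Each token votes +1 or -1 on every bit; the sign of the total sets the bit."""
    votes = [0] * 64
    for token in tokens:
        h = hash64(token)
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, total in enumerate(votes):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


tokens_a = "this is the first document.".split()
tokens_b = "is this the first document?".split()
fp_a = simhash64(tokens_a)
fp_b = simhash64(tokens_b)

# A small Hamming distance suggests near-duplicate documents.
print(f"{fp_a:016x} {fp_b:016x} distance={hamming_distance(fp_a, fp_b)}")
```

Unlike a minhash signature, the fingerprint is a single integer, so comparing two documents is one XOR and a bit count.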
Python example
>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32,
...                                          jaccard_threshold=0.5,
...                                          num_bands=42,
...                                          band_size=3,
...                                          num_hashes=42*3,
...                                          analyzer='word',
...                                          lowercase=True,
...                                          ngram_range=(1,1))
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third document.',
... 'Is this the first document?',
... 'This not the first nor the second nor the third, but the fourth document'
... ]
>>>
>>> for i, doc in enumerate(corpus): index.insert_document(i, doc)
...
>>> index.query('This is the first document.')
[0, 1, 2, 3]
>>>
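As a sanity check on the query result above, the same answer can be reproduced with an exact, brute-force Jaccard comparison in plain Python. This sketch does not call Gaoya. It assumes lowercase whitespace tokenization, which may differ from what the library's `'word'` analyzer does, and `tokenize`, `jaccard`, and `brute_force_query` are hypothetical helper names.

```python
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third document.',
    'Is this the first document?',
    'This not the first nor the second nor the third, but the fourth document',
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed tokenization)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Exact Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)


def brute_force_query(query, documents, threshold=0.5):
    """Ids of all documents whose exact Jaccard similarity meets the threshold."""
    query_tokens = tokenize(query)
    return [
        i for i, doc in enumerate(documents)
        if jaccard(query_tokens, tokenize(doc)) >= threshold
    ]


print(brute_force_query('This is the first document.', corpus))
# prints [0, 1, 2, 3]
```

The brute-force scan compares the query against every document, which is the cost an LSH index is meant to avoid on large corpora.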
Installation
$ pip3 install gaoya
Examples
Rust example
use gaoya::minhash::{MinHashIndex, MinHasher32, MinHasher};
use gaoya::text::whitespace_split;
use fxhash::FxHashSet;

fn main() {
    let corpus = [
        "This is the first document.",
        "This document is the second document.",
        "And this is the third document.",
        "Is this the first document?",
        "This not the first nor the second nor the third, but the fourth document",
    ];
    // 42 bands of 3 hashes each, so every signature holds 126 hashes.
    let (num_bands, band_width) = (42, 3);
    let minhasher = MinHasher32::new(num_bands * band_width);
    // 0.5 is the Jaccard similarity threshold.
    let mut index = MinHashIndex::new(num_bands, band_width, 0.5);
    // Index every document under its position in the corpus.
    for (i, doc) in corpus.iter().enumerate() {
        index.insert(i, minhasher.create_signature(whitespace_split(&doc.to_lowercase())));
    }
    // Query with each document: the first four match each other, the fifth only itself.
    for (i, doc) in corpus.iter().enumerate() {
        if i < 4 {
            let mut expected = FxHashSet::default();
            expected.extend(vec![0, 1, 2, 3].into_iter());
            let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
            assert_eq!(index.query_owned(&signature), expected);
        } else {
            let mut expected = FxHashSet::default();
            expected.insert(4);
            let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
            assert_eq!(index.query_owned(&signature), expected);
        }
    }
}
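Both examples configure 42 bands of 3 hashes each. The effect of that choice can be sketched with the standard LSH banding analysis: two documents with Jaccard similarity s share at least one identical band with probability 1 - (1 - s^r)^b, where b is the number of bands and r the band width. The Python snippet below is a textbook calculation, not taken from Gaoya; `candidate_probability` is a hypothetical helper, and it makes no claim about how the library applies its 0.5 threshold internally.

```python
def candidate_probability(similarity: float, num_bands: int, band_width: int) -> float:
    """Probability that two documents collide in at least one band.

    A single band matches with probability similarity ** band_width, so all
    bands miss with probability (1 - similarity ** band_width) ** num_bands.
    """
    return 1.0 - (1.0 - similarity ** band_width) ** num_bands


# Same banding parameters as in the examples above.
for s in (0.1, 0.25, 0.5, 0.75):
    print(f"similarity {s:.2f} -> candidate probability {candidate_probability(s, 42, 3):.3f}")
```

More bands raise the chance of catching moderately similar pairs, while wider bands make each band stricter and cut down on false candidates.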