1个不稳定版本

0.1.0 2022年3月15日

#1503算法

无授权许可

16KB
127

Crates.io License Test Status Documentation

Rust最小shingle哈希实现

该crate实现了简单的最小shingle哈希算法。更多信息请参阅W-shingling

算法

Shingle哈希(或w-shingle)是由n-gram组成的集合,每个n-gram由输入序列中的连续标记组成,每个标记后移一个元素。例如,文档

to be or not to be that is the question

有以下2-gram(shingles)集合

(to, be)(be, or)(or, not)(not, to)(be, that)(that, is)(is, the)(the, question)

注意:2-gram (to, be) 出现了两次。

2-gram集合是文档shingle哈希。该哈希可以用来测量两个文档的相似度,使用Jaccard系数

R(doc1,doc2) = (H(doc1)⋂ H(doc2)) / (H(doc1)⋃ H(doc2))

其中

  • R - 相似度
  • H - shingle哈希

前面的算法不适用于大型文档,因为n-gram集合可能增长非常快。例如,如果使用3-gram,并且输入序列的字母表是255个符号,那么集合的大小可能是255 ^ 3~16 * 10 ^ 6(最坏情况),这会消耗大量的内存。

为了解决这个问题,使用了min-shingle算法。它利用了一种特殊的优化技术:不是存储所有序列n-gram,而是计算n-gram的哈希值并保存最小哈希值。因为数据流的最低值可以即时计算(无需保存所有值),内存消耗大大减少。使用多个哈希函数(或多个哈希函数种子)重复该过程,产生shingle哈希。与shingle哈希一样,min-shingle哈希可以用来衡量文档之间的距离(或相似度)。

基本示例

schindel依赖项添加到Cargo.toml

[dependencies]
schindel = "^0.1.0"

将以下代码添加到您的main.rs

use schindel::shingles::{MinShingleHash, Murmur3Hasher};

fn main() {
    let original = "\
        “My sight is failing,” she said finally. “Even when I was young I could not have read what was written there. \
        But it appears to me that that wall looks different. Are the Seven Commandments the same as they used to be, \
        Benjamin?” For once Benjamin consented to break his rule, and he read out to her what was written on the wall. \
        There was nothing there now except a single Commandment. It ran:\
        ALL ANIMALS ARE EQUAL BUT SOME ANIMALS ARE MORE EQUAL THAN OTHERS";

    let plagiarism = "\
        “My sight is failing,” she said finally. “When I was young I could not have read what was written there. \
        But it appears to me that that wall looks different. Are the Seven Commandments the same as they used to be” \
        Benjamin read out to her what was written. There was nothing there now except a single Commandment. \
        It ran: ALL ANIMALS ARE EQUAL BUT SOME ANIMALS ARE MORE EQUAL THAN OTHERS";

    let other = "\
        Throughout the spring and summer they worked a sixty-hour week, and in August Napoleon announced that there \
        would be work on Sunday afternoons as well. This work was strictly voluntary, but any animal who absented \
        himself from it would have his rations reduced by half. Even so, it was found necessary to leave certain \
        tasks undone. The harvest was a little less successful than in the previous year, and two fields which \
        should have been sown with roots in the early summer were not sown because the ploughing had not been \
        completed early enough. It was possible to foresee that the coming winter would be a hard one.";

    const HASH_LEN: usize = 100;
    const NGRAM_LEN: usize = 5;

    let original_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(original.chars());

    let plagiarism_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(plagiarism.chars());
    println!("plagiarism similarity: {}", original_hash.compare(&plagiarism_hash));

    let other_hash = MinShingleHash::<Murmur3Hasher, HASH_LEN, NGRAM_LEN>::new(other.chars());
    println!("other text similarity: {}", original_hash.compare(&other_hash));
}

依赖项

~13KB