10 versions

0.3.10	Jun 20, 2024
0.3.8	May 16, 2024
0.3.4	Mar 29, 2024
0.2.7	~~Dec 31, 2023~~
0.2.3	~~Jul 1, 2023~~

GPL-2.0-or-later


ffuzzy: ssdeep-compatible Fuzzy Hashing Library in pure Rust

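ssdeep is a program for computing context triggered piecewise hashes (CTPH). Also called fuzzy hashes, CTPH can match inputs that have homologies. Such inputs contain sequences of identical bytes in the same order, although the bytes in between these sequences may differ in both content and length.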
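To make the idea of context triggered piecewise hashing concrete, here is a deliberately simplified sketch. It is not ssdeep-compatible and it is not how this crate is implemented: the name `toy_ctph`, the 7-byte window, the plain-sum rolling value and the FNV-style piece hash are all assumptions chosen for brevity. A rolling value over the last few bytes decides where a piece ends, and each piece contributes one character to the digest.

```rust
// Toy CTPH sketch (hypothetical, NOT ssdeep-compatible).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
const WINDOW: usize = 7; // assumed rolling window length
const PIECE_INIT: u32 = 0x811c_9dc5; // assumed starting value for each piece
const PIECE_PRIME: u32 = 0x0100_0193; // 32-bit FNV prime

/// Returns one digest character per piece of `data`.
/// A piece ends whenever the rolling sum of the last `WINDOW` bytes
/// satisfies `sum % block_size == block_size - 1`.
fn toy_ctph(data: &[u8], block_size: u32) -> String {
    assert!(block_size > 0, "block_size must be non-zero");
    let mut digest = String::new();
    let mut window = [0u8; WINDOW];
    let mut window_sum: u32 = 0;
    let mut piece_hash = PIECE_INIT;

    for (i, &byte) in data.iter().enumerate() {
        // Rolling value: depends only on the last WINDOW bytes (the "context").
        let slot = i % WINDOW;
        window_sum = window_sum - u32::from(window[slot]) + u32::from(byte);
        window[slot] = byte;

        // Piece hash: depends on every byte since the previous boundary.
        piece_hash = piece_hash.wrapping_mul(PIECE_PRIME) ^ u32::from(byte);

        // The context triggers a boundary: emit one character and reset.
        if window_sum % block_size == block_size - 1 {
            digest.push(ALPHABET[(piece_hash & 63) as usize] as char);
            piece_hash = PIECE_INIT;
        }
    }
    // The trailing piece always contributes a final character.
    digest.push(ALPHABET[(piece_hash & 63) as usize] as char);
    digest
}
```

Because boundaries depend only on the last few bytes, two inputs that share a long run of identical bytes place the same boundaries inside that run, so the digest characters for those pieces match even when the surrounding bytes differ.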

You can generate, parse and compare (ssdeep-compatible) fuzzy hashes with this crate.

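Along with "easy" functions, it provides fuzzy hashing-related structs for high-performance and advanced use cases. If you understand both the properties of fuzzy hashes and this crate well, you can cluster fuzzy hashes over 5 times faster than libfuzzy.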

Usage: Basic

Hashing a File

// Required Features: "std" and "easy-functions" (default enabled)
fn main() -> Result<(), ssdeep::GeneratorOrIOError> {
    let fuzzy_hash = ssdeep::hash_file("data/examples/hello.txt")?;
    let fuzzy_hash_str = fuzzy_hash.to_string();
    assert_eq!(fuzzy_hash_str, "3:aaX8v:aV");
    Ok(())
}

Comparing Two Fuzzy Hashes

// Required Feature: "easy-functions" (default enabled)
let score = ssdeep::compare(
    "6:3ll7QzDkmJmMHkQoO/llSZEnEuLszmbMAWn:VqDk5QtLbW",
    "6:3ll7QzDkmQjmMoDHglHOxPWT0lT0lT0lB:VqDk+n"
).unwrap();
assert_eq!(score, 46);
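The strings passed to `ssdeep::compare` above follow the `block_size:block1:block2` layout, and ssdeep-style comparison is only meaningful when the two block sizes are equal or differ by a factor of two. The sketch below illustrates that layout with the standard library only. The helpers `split_fuzzy_hash` and `block_sizes_comparable` are hypothetical names for this illustration and are not part of the crate's API; the crate's own parser performs more validation than this.

```rust
/// Splits "block_size:block1:block2" into its three parts.
/// Hypothetical helper: it does not validate the alphabet or block lengths.
fn split_fuzzy_hash(s: &str) -> Option<(u64, &str, &str)> {
    let mut parts = s.splitn(3, ':');
    let block_size = parts.next()?.parse::<u64>().ok()?;
    let block1 = parts.next()?;
    let block2 = parts.next()?;
    Some((block_size, block1, block2))
}

/// Two fuzzy hashes can share a block only if their block sizes are
/// equal or one is exactly double the other.
fn block_sizes_comparable(a: u64, b: u64) -> bool {
    a == b || a.checked_mul(2) == Some(b) || b.checked_mul(2) == Some(a)
}
```

For the two hashes in the example above, both block sizes are 6, so the first blocks (and the second blocks) are compared against each other directly.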

Usage: Advanced

Hashing a Buffer

// Requires the "alloc" feature to use the `to_string()` method (default enabled).
use ssdeep::{Generator, RawFuzzyHash};

let mut generator = Generator::new();
let buf1: &[u8]    = b"Hello, ";
let buf2: &[u8; 6] = b"World!";

// Optional but supplying the *total* input size first improves the performance.
// This is the total size of three update calls below.
generator.set_fixed_input_size_in_usize(buf1.len() + buf2.len() + 1).unwrap();

// Update the internal state of the generator.
// Of course, you can update multiple times.
if true {
    // Option 1: `+=` operator overload
    generator += buf1;
    generator += buf2;
    generator += b'\n';
}
else {
    // Option 2: `update()`-family functions
    //           (unlike `+=`, iterators are supported)
    generator.update(buf1);
    generator.update_by_iter((*buf2).into_iter());
    generator.update_by_byte(b'\n');
}

// Retrieve the fuzzy hash and convert to the string.
let hash: RawFuzzyHash = generator.finalize().unwrap();
assert_eq!(hash.to_string(), "3:aaX8v:aV");

Comparing Fuzzy Hashes

// Requires either the "alloc" feature or std environment on your crate
// to use the `to_string()` method (default enabled).
use ssdeep::{FuzzyHash, FuzzyHashCompareTarget};

// These fuzzy hash strings are "normalized" so that they are easier to compare.
let str1 = "12288:+ySwl5P+C5IxJ845HYV5sxOH/cccccccei:+Klhav84a5sxJ";
let str2 = "12288:+yUwldx+C5IxJ845HYV5sxOH/cccccccex:+glvav84a5sxK";
// A FuzzyHash object can be used to avoid parser / normalization overhead
// and helps improve performance.
let hash1: FuzzyHash = str::parse(str1).unwrap();
let hash2: FuzzyHash = str::parse(str2).unwrap();

// Note that converting the (normalized) fuzzy hash object back to a string
// may not reproduce the original string.  To preserve the original fuzzy hash
// string as well, consider using dual fuzzy hashes (such as DualFuzzyHash),
// which keep the original string in a compressed format.
// *   str1:  "12288:+ySwl5P+C5IxJ845HYV5sxOH/cccccccei:+Klhav84a5sxJ"
// *   hash1: "12288:+ySwl5P+C5IxJ845HYV5sxOH/cccei:+Klhav84a5sxJ"
assert_ne!(hash1.to_string(), str1);

// If we have a number of fuzzy hashes and a hash is compared more than once,
// storing those hashes as FuzzyHash objects is faster.
assert_eq!(hash1.compare(&hash2), 88);

// But there's another way of comparison.
// If you compare "a fuzzy hash" with "other many fuzzy hashes", this method
// (using FuzzyHashCompareTarget as "a fuzzy hash") is much, much faster.
let target: FuzzyHashCompareTarget = FuzzyHashCompareTarget::from(&hash1);
assert_eq!(target.compare(&hash2), 88);

// If you reuse the same `target` object repeatedly for multiple fuzzy hashes,
// `new()` and `init_from()` will be helpful.
let mut target: FuzzyHashCompareTarget = FuzzyHashCompareTarget::new();
target.init_from(&hash1);
assert_eq!(target.compare(&hash2), 88);

Introduction to Dual Fuzzy Hashes

This example shows only one property of dual fuzzy hashes. Dual fuzzy hash objects are really useful in much more complex cases.

// Requires either the "alloc" feature or std environment on your crate
// to use the `to_string()` method (default enabled).
use ssdeep::{FuzzyHash, DualFuzzyHash};

// "Normalization" would change the contents.
let str1      = "12288:+ySwl5P+C5IxJ845HYV5sxOH/cccccccei:+Klhav84a5sxJ";
let str2      = "12288:+yUwldx+C5IxJ845HYV5sxOH/cccccccex:+glvav84a5sxK";
let str2_norm = "12288:+yUwldx+C5IxJ845HYV5sxOH/cccex:+glvav84a5sxK";
let hash1: FuzzyHash = str::parse(str1).unwrap();
let hash2: DualFuzzyHash = str::parse(str2).unwrap();

// Note that a dual fuzzy hash object efficiently preserves both raw and
// normalized contents of the fuzzy hash.
// *   raw:        "12288:+yUwldx+C5IxJ845HYV5sxOH/cccccccex:+glvav84a5sxK"
// *   normalized: "12288:+yUwldx+C5IxJ845HYV5sxOH/cccex:+glvav84a5sxK"
assert_eq!(hash2.to_raw_form_string(),   str2);
assert_eq!(hash2.to_normalized_string(), str2_norm);

// You can use the dual fuzzy hash object
// just like regular fuzzy hashes on some methods.
assert_eq!(hash1.compare(&hash2), 88);
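The "normalization" seen in the examples above (`cccccccex` becoming `cccex`) shortens any run of more than three identical characters down to three. The following standard-library sketch reproduces that transformation on plain strings. `normalize_block` and `normalize_hash` are hypothetical helper names for illustration only and are not the crate's API; the crate performs normalization internally when it parses into `FuzzyHash`.

```rust
/// Shortens every run of identical characters to at most three.
/// Hypothetical helper mirroring the normalization visible in the examples.
fn normalize_block(block: &str) -> String {
    const MAX_RUN: usize = 3;
    let mut out = String::with_capacity(block.len());
    let mut prev: Option<char> = None;
    let mut run = 0usize;
    for ch in block.chars() {
        if Some(ch) == prev {
            run += 1;
        } else {
            prev = Some(ch);
            run = 1;
        }
        // Keep the first three characters of a run, drop the rest.
        if run <= MAX_RUN {
            out.push(ch);
        }
    }
    out
}

/// Normalizes both blocks of a "block_size:block1:block2" string,
/// leaving the block size untouched.
fn normalize_hash(hash: &str) -> Option<String> {
    let mut parts = hash.splitn(3, ':');
    let size = parts.next()?;
    let block1 = parts.next()?;
    let block2 = parts.next()?;
    Some(format!(
        "{}:{}:{}",
        size,
        normalize_block(block1),
        normalize_block(block2)
    ))
}
```

Applied to `str2` from the example, `normalize_hash` yields the same text as `str2_norm`, which is why a dual fuzzy hash is needed when the raw form must be kept as well.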

Crate Features

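alloc and std (default)
This crate supports no_std (by disabling both), and alloc and std are built on top of the minimal no_std implementation. These features enable the implementations that depend on alloc and std, respectively.
easy-functions (default)
It provides easy-to-use high-level functions.
strict-parser
It enables the strict parser, which rejects fuzzy hash strings that would cause an error on the "raw" variants but not on the "normalized" variants (with the default parser). It is disabled by default because it slows down the parser, but enabling it makes the parser less confusing and more robust.
unsafe (fast but unsafe)
This crate is optionally unsafe. By default, it is built with 100% safe Rust. Enabling this feature enables unsafe Rust code (although the unsafe and safe code paths share as much as possible through macros).
unchecked
This feature exposes unsafe functions and methods that do not check the validity of their input. It is a subset of what the unsafe feature exposes, but it does not switch the program to use unsafe Rust.
unstable
This feature enables some features specific to Nightly Rust. Note that it depends heavily on the version of rustc and should not be considered stable (do not expect SemVer-compatible semantics).
opt-reduce-fnv-table (enabling this is not recommended)
ssdeep uses a partial FNV hash (the lowest 6 bits). By default, a table lookup replaces the full FNV hash computation, which is faster in most cases, although it makes little difference to performance on some configurations. Enabling this option turns off the precomputed FNV hash table (4KiB). Note that enabling it is not recommended even if you want to reduce the memory footprint: a generator is about 2KiB in size and a temporary object used for fuzzy hash comparison is about 1KiB, so saving 4KiB brings little benefit.
tests-slow and tests-very-slow
They enable "slow" tests (which may take seconds or even a few minutes) and "very slow" tests (which may take longer than that), respectively.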
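The table lookup mentioned under opt-reduce-fnv-table works because only the lowest 6 bits of the FNV state are ever needed, and those bits of the next state depend only on the lowest 6 bits of the current state and of the input byte. The sketch below demonstrates this with a 64 x 64 byte table (4KiB). It assumes the FNV-1 style update `h * prime ^ byte` with the 32-bit FNV prime; the function names and the table layout are hypothetical and may differ from what the crate actually does.

```rust
const FNV_PRIME: u32 = 0x0100_0193; // 32-bit FNV prime

/// Full-width update (assumed form): multiply by the prime, then XOR the byte.
fn fnv_update_full(h: u32, byte: u8) -> u32 {
    h.wrapping_mul(FNV_PRIME) ^ u32::from(byte)
}

/// Precomputes the low 6 bits of the next state for every combination of
/// (low 6 bits of state, low 6 bits of byte): 64 * 64 = 4096 bytes.
fn build_fnv_table() -> [[u8; 64]; 64] {
    let mut table = [[0u8; 64]; 64];
    for h in 0..64u32 {
        for c in 0..64u32 {
            table[h as usize][c as usize] = ((h.wrapping_mul(FNV_PRIME) ^ c) & 63) as u8;
        }
    }
    table
}

/// Table-based update that tracks only the low 6 bits of the state.
fn fnv_update_partial(table: &[[u8; 64]; 64], h: u8, byte: u8) -> u8 {
    table[(h & 63) as usize][(byte & 63) as usize]
}
```

Feeding the same bytes through `fnv_update_full` and `fnv_update_partial` keeps the partial state equal to the low 6 bits of the full state at every step, which is the property that lets a lookup stand in for the multiplication.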

History and Main Contributors of ssdeep

Andrew Tridgell wrote a program named "spamsum" to detect mail similar to known spam.

Jesse Kornblum wrote a program named "ssdeep" based on spamsum, extending Andrew's work by adding a solid engine. Jesse kept improving ssdeep for years.

Helmut Grohne authored a rewritten and optimized streaming fuzzy hashing engine, which can run multi-threaded and can process files without seeking.

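Tsukasa OI first helped resolve the license issue with the edit distance code (which was not open source), then further optimized the engine and introduced bit-parallel string processing functions. He has written ssdeep-compatible engines multiple times, including ffuzzy++.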

License (GNU GPL v2 or later)

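This crate (as a whole library) is licensed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.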

However, some portions are licensed under more permissive licenses (see the source code for details).

References

Jesse Kornblum (2006) "Identifying almost identical files using context triggered piecewise hashing" (doi:10.1016/j.diin.2006.06.015)
