6 releases (3 breaking)

0.4.0	Aug 13, 2020
0.3.1	Aug 4, 2020
0.2.1	Aug 3, 2020
0.2.0	Jul 31, 2020
0.1.0	Jul 27, 2020

#1473 in Filesystem

MIT license

20KB
320 lines

recursum

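A Rust script to quickly hash many files.

There are 3 modes of operation.

1. The original mode, and the root of the crate name: recursively hash all (non-symlink) files in a single given directory tree.
2. Hash any number of files given as arguments.
3. Read a list of files from stdin and hash each of them.

File discovery (in mode 1) and hashing are parallelised. The default hasher is not cryptographically secure.

By default, {path}\t{hex_digest} is printed to stdout. This is reversed compared to most hashing utilities (md5sum, sha1sum etc.), with the intention of making the output easier to sort by file name, and because a tab (disallowed by many file system interfaces) is a more reliable split point than a double space (an easy typo in a file name). However, the --compatible switch exists to print {hex_digest}  {path} instead.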
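As an illustration of why the tab-separated layout is convenient, here is a minimal sketch of consuming that output in Rust. The function names `parse_line` and `to_compatible` and the sample records are assumptions made up for this example; they are not part of recursum.

```rust
/// Split one line of default-format output into (path, hex_digest).
/// Splits on the LAST tab: a hex digest never contains a tab, so a path
/// that happens to contain one still parses.
fn parse_line(line: &str) -> Option<(&str, &str)> {
    line.rsplit_once('\t')
}

/// Render a record in the md5sum-style layout: digest, two spaces, path.
fn to_compatible(path: &str, digest: &str) -> String {
    format!("{}  {}", digest, path)
}

fn main() {
    // Made-up sample records, not real recursum output.
    let sample = "b/file.txt\t00ff\na/file.txt\t12ab";
    let mut records: Vec<(&str, &str)> =
        sample.lines().filter_map(parse_line).collect();
    // Tuples compare by their first field first, so this sorts by path.
    records.sort();
    for (path, digest) in &records {
        println!("{}", to_compatible(path, digest));
    }
}
```

Because the path comes first, a plain lexicographic sort of the lines (or of the parsed tuples, as here) orders the records by file name without any extra key extraction.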

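Progress information, and the final time and rate, are printed to stderr.

Note that most hashers, especially fast non-cryptographic ones, will be faster than slower storage media (such as disks), so the benefit of using multiple hashing threads may saturate quickly.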
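The saturation effect can be pictured with a deliberately simplified model: aggregate hashing speed grows with the thread count until the storage read rate caps it. The function `throughput_mb_s` and every number below are illustrative assumptions, not measurements of recursum or of any real hasher or disk.

```rust
/// Estimated aggregate throughput in MB/s under a simple min() model:
/// hashing scales linearly with threads until storage becomes the limit.
fn throughput_mb_s(threads: u32, hash_mb_s_per_thread: f64, storage_mb_s: f64) -> f64 {
    (threads as f64 * hash_mb_s_per_thread).min(storage_mb_s)
}

fn main() {
    // Assumed rates: 200 MB/s per hashing thread, 500 MB/s from storage.
    for threads in [1, 2, 4, 8] {
        println!(
            "{} thread(s): {} MB/s",
            threads,
            throughput_mb_s(threads, 200.0, 500.0)
        );
    }
}
```

Under these assumed rates, adding threads beyond the third changes nothing, because the storage device is already the bottleneck.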

Contributions welcome.

Installation

With cargo installed (get it with rustup):

cargo install recursum

Usage

recursum
Hash lots of files fast, in parallel.

USAGE:
    recursum [FLAGS] [OPTIONS] <input>...

FLAGS:
    -c, --compatible    "Compatible mode", which prints the hash first and changes the default separator to double-
                        space, as used by system utilities like md5sum
    -h, --help          Prints help information
    -q, --quiet         Do not show progress information
    -V, --version       Prints version information

OPTIONS:
    -d, --digest-length <digest-length>    Maximum length of output hash digests
    -s, --separator <separator>            Separator. Defaults to tab unless --compatible is given. Use "\t" for tab and
                                           "\0" for null (cannot be mixed with other characters)
    -t, --threads <threads>                Hashing threads
    -w, --walkers <walkers>                Directory-walking threads, if <input> is a directory

ARGS:
    <input>...    One or more file names, one directory name (every file recursively will be hashed, in depth first
                  order), or '-' for getting list of files from stdin (order is conserved)

Example

fd --threads 1 --type file | recursum --threads 10 --digest 64 - > my_checksums.txt

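This may be more efficient than using --exec or | xargs, and has better logging.

Note that --separator does not understand escape sequences. To pass e.g. a tab as the separator, use recursum -s $(echo '\t') -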

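Operation

Broadly speaking, recursum uses >= 1 thread to populate a queue of files to hash, in one of three ways:

- lazily, by recursively walking a directory
  - the internal queue is bounded, minimising wasted RAM through back pressure
  - this queue is much longer than the number of hashing threads, so they should not have to wait for it to be populated
- taking them as an argument list
- eagerly reading them from stdin
  - this prevents the pipe's buffer from filling up and blocking the source, which may not handle such blocking gracefully
  - the internal queue is unbounded, and so can become very large if files are piped in faster than they are hashed

Simultaneously, items are popped off the queue and executed using tokio's threaded scheduler. There should be no context switches within each task; tasks are processed in the order they are received. The main thread fetches the results (in the same order) and prints them to stdout.

Alternatives

find (or fd) with -exec (--exec), e.g.

find . -type f -exec md5sum {} \;

find is single-threaded, and -exec flattens the list of found files, passing each as an additional argument to the hashing utility. This can break if the number of files is large. Additionally, many built-in hashing utilities are not multi-threaded; furthermore, the utility is not actually called until the file list has been populated.

Alternatively, you can pipe the argument list to xargs, which can parallelise with -P and restrict the number of arguments given with -n:

find . -type f -print0 | xargs -0 -P 8 -n 1 -I _ md5sum "_"

This spawns a new shell for each invocation, which may be problematic, and may not make full use of the CPU, since the processes do not communicate with each other.

Better would be to use parallel in "xargs mode". There may be some CPU overhead from executing the checksum tool many times, and RAM usage may grow because parallel buffers the output.

find . -type f | parallel -X md5sum

These tools are far more mature than recursum, so they may work better for you.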
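Returning to the bounded queue described under Operation: back pressure of that kind can be sketched with the standard library alone. This is not recursum's implementation (which the text says uses tokio's scheduler and several hashing threads); it swaps in `std::sync::mpsc::sync_channel` with a single consumer, and `toy_digest` hashes the path string rather than file contents. All names here are assumptions for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stand-in for hashing a file: hashes the path string, not file contents.
fn toy_digest(path: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    hasher.finish()
}

/// Push `paths` through a bounded queue holding at most `capacity` items
/// and return (path, digest) pairs in the order they were sent.
fn hash_in_order(paths: Vec<String>, capacity: usize) -> Vec<(String, u64)> {
    let (tx, rx) = sync_channel::<String>(capacity);
    let producer = thread::spawn(move || {
        for p in paths {
            // send() blocks while the queue is full: this is the back pressure.
            if tx.send(p).is_err() {
                break; // receiver went away, stop producing
            }
        }
        // tx is dropped here, which ends the consumer's iteration below.
    });
    let results: Vec<(String, u64)> = rx
        .iter()
        .map(|p| {
            let d = toy_digest(&p);
            (p, d)
        })
        .collect();
    producer.join().expect("producer thread panicked");
    results
}

fn main() {
    let paths = vec!["a.txt".to_string(), "b.txt".to_string(), "c.txt".to_string()];
    for (path, digest) in hash_in_order(paths, 2) {
        println!("{}\t{:016x}", path, digest);
    }
}
```

With a bounded channel the producer can never run more than `capacity` items ahead of the consumer, which is what keeps memory use flat during a large directory walk. An unbounded channel (`std::sync::mpsc::channel`) would correspond to the stdin mode, where the source is never blocked but the queue may grow.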

Dependencies

~11–21MB
~250K SLoC

digest 0.9
hex
indicatif 0.15
jwalk 0.5.1
meowhash
num_cpus
structopt 0.3
tokio 0.2+rt-threaded+sync+stream+io-std+io-util

dev
cargo-release 0.13.5