39 releases (4 stable)

1.3.0	Jul 30, 2024
1.2.0	Feb 28, 2024
1.1.0	Feb 13, 2023
1.0.0	Jan 23, 2022
0.8.3	Nov 20, 2020

#105 in Filesystem

274 downloads per month

MIT license

56KB
1.5K SLoC

YADF: Yet Another Dupes Finder

It's fast on my machine.

You should probably use fclones.

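Installation

Prebuilt packages

Executable binaries for some platforms are available in the releases section.

Building from source

Install the Rust toolchain
Run cargo install --locked yadf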

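Usage

yadf defaults:

Search the current working directory ($PWD)
Output format is the same as the "standard" fdupes: newline-separated groups
Descend automatically into subdirectories
Search includes every file (including empty files)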

yadf # find duplicate files in current directory
yadf ~/Documents ~/Pictures # find duplicate files in two directories
yadf --depth 0 file1 file2 # compare two files
yadf --depth 1 # find duplicates in current directory without descending
fd --type d a | yadf --depth 1 # find directories with an "a" and search them for duplicates without descending
fd --type f a | yadf # find files with an "a" and check them for duplicates

Filtering

yadf --min 100M # find duplicate files of at least 100 MB
yadf --max 100M # find duplicate files of at most 100 MB
yadf --pattern '*.jpg' # find duplicate jpg files
yadf --regex '^g' # find duplicate files whose name starts with 'g'
yadf --rfactor over:10 # find files with more than 10 copies
yadf --rfactor under:10 # find files with fewer than 10 copies
yadf --rfactor equal:1 # find unique files
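
The replication factor examples above can be read as a simple predicate on the number of copies in each group. The following is a minimal sketch of that reading, not yadf's actual code: the `ReplicationFactor` type and its `accepts` method are hypothetical names introduced here for illustration, and the strict comparisons follow the comments above ("more than", "fewer than").

```rust
use std::str::FromStr;

/// Hypothetical model of the `[under|equal|over]:n` syntax (not yadf's own type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ReplicationFactor {
    Under(usize),
    Equal(usize),
    Over(usize),
}

impl FromStr for ReplicationFactor {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Split "over:10" into the kind ("over") and the count ("10").
        let (kind, n) = s
            .split_once(':')
            .ok_or_else(|| format!("expected [under|equal|over]:n, got {s:?}"))?;
        let n: usize = n
            .parse()
            .map_err(|e| format!("invalid count {n:?}: {e}"))?;
        match kind {
            "under" => Ok(Self::Under(n)),
            "equal" => Ok(Self::Equal(n)),
            "over" => Ok(Self::Over(n)),
            other => Err(format!("unknown replication factor {other:?}")),
        }
    }
}

impl ReplicationFactor {
    /// Returns true if a group with `copies` identical files passes the filter.
    fn accepts(self, copies: usize) -> bool {
        match self {
            Self::Under(n) => copies < n,
            Self::Equal(n) => copies == n,
            Self::Over(n) => copies > n,
        }
    }
}
```

Under this reading, `equal:1` keeps only groups with a single member, which is why it reports unique files.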

Formatting

See the help for a list of output formats: yadf -h.

yadf -f json
yadf -f fdupes
yadf -f csv
yadf -f ld-json
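
For readers unfamiliar with the fdupes layout used by default, here is a rough sketch of how groups of duplicates map to text: one path per line, with a blank line between groups. The function name `format_fdupes` is hypothetical, and whether yadf emits a trailing blank line after the last group is not covered here.

```rust
use std::path::PathBuf;

/// Render groups of duplicate paths in an fdupes-like layout:
/// one path per line, groups separated by a blank line.
fn format_fdupes(groups: &[Vec<PathBuf>]) -> String {
    groups
        .iter()
        .map(|group| {
            group
                .iter()
                .map(|path| path.display().to_string())
                .collect::<Vec<_>>()
                .join("\n")
        })
        .collect::<Vec<_>>()
        .join("\n\n")
}
```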

Help output.

Yet Another Dupes Finder

Usage: yadf [OPTIONS] [PATHS]...

Arguments:
  [PATHS]...  Directories to search

Options:
  -f, --format <FORMAT>        Output format [default: fdupes] [possible values: csv, fdupes, json, json-pretty, ld-json, machine]
  -a, --algorithm <ALGORITHM>  Hashing algorithm [default: ahash] [possible values: ahash, highway, metrohash, seahash, xxhash]
  -n, --no-empty               Excludes empty files
      --min <size>             Minimum file size
      --max <size>             Maximum file size
  -d, --depth <depth>          Maximum recursion depth
  -H, --hard-links             Treat hard links to same file as duplicates
  -R, --regex <REGEX>          Check files with a name matching a Perl-style regex, see: https://docs.rs/regex/1.4.2/regex/index.html#syntax
  -p, --pattern <glob>         Check files with a name matching a glob pattern, see: https://docs.rs/globset/0.4.6/globset/index.html#syntax
  -v, --verbose...             Increase logging verbosity
  -q, --quiet...               Decrease logging verbosity
      --rfactor <RFACTOR>      Replication factor [under|equal|over]:n
  -o, --output <OUTPUT>        Optional output file
  -h, --help                   Print help (see more with '--help')
  -V, --version                Print version

For sizes, K/M/G/T[B|iB] suffixes can be used (case-insensitive).

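Notes on the algorithm

Most¹ dupe finders follow a 3-step algorithm:

1. group files by their size
2. group files by their first few bytes
3. group files by their entire content

yadf skips the first step and only does steps 2 and 3, preferring hashing to byte comparison. In my tests, running the first step on an SSD actually slowed the program down. yadf makes heavy use of the standard library BTreeMap, whose cache-aware implementation avoids too many cache misses. yadf uses the parallel walker provided by ignore (with its ignore features disabled) and rayon's parallel iterators to run each of these two steps in parallel.

¹: some need a different algorithm to support different features or different performance trade-offs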
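
The two hashing steps can be sketched as follows. This is a simplified, sequential illustration under several assumptions rather than yadf's implementation: it uses the standard library's `DefaultHasher` in place of ahash, skips the parallel walker and rayon, reads files fully into memory, and picks an arbitrary `PREFIX_LEN` of 4096 bytes. All function names are hypothetical. Equal hashes are treated as equal content, so a hash collision would produce a false positive.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::{Path, PathBuf};

/// Assumed prefix length for the cheap first pass; not taken from yadf.
const PREFIX_LEN: u64 = 4096;

/// Hash at most `limit` bytes of a file (`None` hashes the whole file).
/// Reads into memory for simplicity; a real tool would stream the bytes.
fn hash_file(path: &Path, limit: Option<u64>) -> io::Result<u64> {
    let mut bytes = Vec::new();
    File::open(path)?
        .take(limit.unwrap_or(u64::MAX))
        .read_to_end(&mut bytes)?;
    let mut hasher = DefaultHasher::new();
    hasher.write(&bytes);
    Ok(hasher.finish())
}

/// Group paths by content hash, dropping groups with a single member.
fn group_by_hash(paths: Vec<PathBuf>, limit: Option<u64>) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut groups: BTreeMap<u64, Vec<PathBuf>> = BTreeMap::new();
    for path in paths {
        let key = hash_file(&path, limit)?;
        groups.entry(key).or_default().push(path);
    }
    Ok(groups.into_values().filter(|g| g.len() > 1).collect())
}

/// Step 2 (prefix hash) narrows the candidates; step 3 (full hash) confirms them.
fn find_duplicates(paths: Vec<PathBuf>) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut duplicates = Vec::new();
    for candidates in group_by_hash(paths, Some(PREFIX_LEN))? {
        duplicates.extend(group_by_hash(candidates, None)?);
    }
    Ok(duplicates)
}
```

Only files that share a prefix hash are read in full, which is what keeps the second pass cheap when most files differ early on.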

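Design goals

I set out to build a high-performance artifact by assembling libraries that do the actual work. Nothing here is custom-made; it is all "off-the-shelf" software.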

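Benchmarks

yadf's performance is heavily tied to the hardware, specifically NVMe SSDs. I recommend fclones, as it has more hardware heuristics and more features in general. yadf on HDDs is terrible.

My home directory contains upwards of 700k paths and 39 GB of data, and is probably a pathological case of file duplication, with all the node_modules, Python virtual environments, Rust target directories, etc. Arguably, the most important measure here is the mean time when the filesystem cache is cold.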

Program (warm filesystem cache)	Version	Mean [s]	Min [s]	Max [s]
`fclones`	0.29.3	7.435 ± 1.609	4.622	9.317
`jdupes`	1.14.0	16.787 ± 0.208	16.484	17.178
`ddh`	0.13	12.703 ± 1.547	10.814	14.793
`dupe-krill`	1.4.7	15.555 ± 1.633	12.486	16.959
`fddf`	1.7.0	18.441 ± 1.947	15.097	22.389
`yadf`	1.1.0	3.157 ± 0.638	2.362	4.175

Program (cold filesystem cache)	Version	Mean [s]	Min [s]	Max [s]
`fclones`	0.29.3	68.950 ± 3.694	63.165	73.534
`jdupes`	1.14.0	303.907 ± 11.578	277.618	314.226
`yadf`	1.1.0	52.481 ± 1.125	50.412	54.265

Fewer programs were tested here because a run takes several hours.

The script used for benchmarking can be read here.

Hardware used.

Extracted from neofetch and hwinfo --disk:

OS: Ubuntu 20.04.1 LTS x86_64
Host: XPS 15 9570
Kernel: 5.4.0-42-generic
CPU: Intel i9-8950HK (12) @ 4.800GHz
Memory: 4217MiB / 31755MiB
Disks:
- Model: "SK hynix Disk"
- Driver: "nvme"

Dependencies

~5–16MB
~196K SLoC