69 releases (19 breaking)

0.28.0	April 11, 2024
0.26.1	April 8, 2024
0.21.2	March 31, 2024
0.18.1	December 30, 2023
0.4.7	~~July 31, 2023~~

#180 in Command line utilities

7,978 downloads per month

BSD-3-Clause

78KB
1.5K SLoC

New project name

This project has been renamed to find-identical-files.

Old project name: find_duplicate_files

find_duplicate_files

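Find identical files according to their size and hashing algorithm.

"A hash function is a mathematical algorithm that takes an input (in this case, a file) and produces a fixed-size string of characters, known as a hash value or checksum. The hash value acts as a summary representation of the original input. This hash value is unique to the input data (disregarding unlikely collisions), meaning that even a slight change in the input will result in a completely different hash value."

Hash algorithm options are:

ahash (used by hashbrown)
blake3 (default)
fxhash (used by Firefox and rustc)
sha256
sha512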
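The size-then-hash idea described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the function names `hash_file` and `find_identical` are hypothetical, and the standard library's `DefaultHasher` stands in for the algorithms listed above so that the sketch needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::{Path, PathBuf};

/// Hash a file's bytes (hypothetical helper).
/// DefaultHasher is an assumption standing in for blake3, sha256, etc.
fn hash_file(path: &Path) -> io::Result<u64> {
    let bytes = fs::read(path)?; // read-only access to the file
    let mut hasher = DefaultHasher::new();
    hasher.write(&bytes);
    Ok(hasher.finish())
}

/// Return groups of paths that share both size and content hash.
fn find_identical(paths: &[PathBuf]) -> io::Result<Vec<Vec<PathBuf>>> {
    // Pass 1: bucket by file size, which only needs metadata.
    let mut by_size: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    for p in paths {
        let size = fs::metadata(p)?.len();
        by_size.entry(size).or_default().push(p.clone());
    }

    // Pass 2: hash only the files whose size is shared with another file.
    let mut groups = Vec::new();
    for (_, same_size) in by_size {
        if same_size.len() < 2 {
            continue;
        }
        let mut by_hash: HashMap<u64, Vec<PathBuf>> = HashMap::new();
        for p in same_size {
            by_hash.entry(hash_file(&p)?).or_default().push(p);
        }
        // Keep only hash buckets that contain two or more files.
        groups.extend(by_hash.into_values().filter(|g| g.len() >= 2));
    }
    Ok(groups)
}
```

Comparing sizes first is cheap and means files with a unique size never need to be hashed at all.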

find_duplicate_files only reads files and never changes their contents. See the function fn open_file() to verify.
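As an illustration of what read-only access looks like in Rust, here is a small sketch. It is not the crate's fn open_file(); the names `open_read_only` and `read_all` are hypothetical. It relies only on the documented behaviour of `std::fs::File::open`, which opens a file in read-only mode.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical helper: open a file for reading only.
fn open_read_only(path: &Path) -> io::Result<File> {
    // File::open never requests write access, so the handle
    // cannot be used to modify the file's contents.
    File::open(path)
}

/// Read the whole file through the read-only handle.
fn read_all(path: &Path) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    open_read_only(path)?.read_to_end(&mut buf)?;
    Ok(buf)
}
```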

Usage examples

1. To find duplicate files in the current directory, run:

find_duplicate_files

2. To find groups of at least 5 identical files in the current directory, run:

find_duplicate_files -n 5

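Use the --min_number (or -n) option to set the 'minimum number of identical files'.

Use the --max_number (or -N) option to set the 'maximum number of identical files'.

If n = 0 or n = 1, all files are reported.

If n = 2 (the default), only groups of two or more identical files are reported.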
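The effect of these two options can be sketched as a filter over groups of identical files. This is an assumed model of the behaviour described above, not the crate's code; the function name `filter_by_count` and its parameters are hypothetical.

```rust
/// Keep only the groups whose number of identical files lies
/// between `min_number` and `max_number` (if a maximum is given).
fn filter_by_count<T>(
    groups: Vec<Vec<T>>,
    min_number: usize,
    max_number: Option<usize>,
) -> Vec<Vec<T>> {
    groups
        .into_iter()
        // With min_number = 0 or 1, every group passes, including single files.
        .filter(|g| g.len() >= min_number)
        // No maximum means no upper bound.
        .filter(|g| max_number.map_or(true, |max| g.len() <= max))
        .collect()
}
```

Under this model, `-n 5` corresponds to `filter_by_count(groups, 5, None)`, and the default corresponds to a minimum of 2.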

3. To find duplicate files using the `fxhash` algorithm and the `yaml` output format, run:

find_duplicate_files -twa fxhash -r yaml

4. To find duplicate files in the `Downloads` directory and redirect the output to a `json` file for further analysis, run:

find_duplicate_files -vi ~/Downloads -r json > fdf.json

5. To find duplicate files of at least 8 bytes in the current directory, run:

find_duplicate_files -b 8

6. To find duplicate files of at most 1024 bytes in the current directory, run:

find_duplicate_files -B 1024

7. To find duplicate files between 8 and 1024 bytes (inclusive) in the current directory, run:

find_duplicate_files -b 8 -B 1024

8. To find duplicate files of exactly 1024 bytes in the current directory, run:

find_duplicate_files -b 1024 -B 1024
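Examples 5 to 8 suggest that both size bounds are inclusive and that each one is optional. A minimal sketch of such a check follows; the function name `size_in_range` is hypothetical and the crate's own logic may be organised differently.

```rust
/// Return true when `size` (in bytes) satisfies the optional
/// inclusive bounds, mirroring the -b (minimum) and -B (maximum) options.
fn size_in_range(size: u64, min_size: Option<u64>, max_size: Option<u64>) -> bool {
    // A missing bound places no restriction on that side.
    let above_min = min_size.map_or(true, |min| size >= min);
    let below_max = max_size.map_or(true, |max| size <= max);
    above_min && below_max
}
```

Setting both bounds to the same value, as in example 8, selects only files of exactly that size.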

9. To export duplicate file information from the current directory to a CSV file (fdf.csv):

9.1 Save the CSV file in the current directory:

find_duplicate_files -c .

9.2 Save the CSV file in the /tmp directory:

find_duplicate_files --csv_dir=/tmp

10. To export duplicate file information from the current directory to an XLSX file (fdf.xlsx):

10.1 Save the XLSX file in the ~/Downloads directory:

find_duplicate_files -x ~/Downloads

10.2 Save the XLSX file in the /tmp directory:

find_duplicate_files --xlsx_dir=/tmp

11. To find duplicate files in the `Downloads` directory using the `ahash` algorithm and export the result to `/tmp/fdf.xlsx`, run:

find_duplicate_files -twi ~/Downloads -x /tmp -a ahash

Help

Type find_duplicate_files -h in the terminal to see the help message and all available options:

find identical files according to their size and hashing algorithm

Usage: find_duplicate_files [OPTIONS]

Options:
  -a, --algorithm <ALGORITHM>
          Choose the hash algorithm [default: blake3] [possible values: ahash, blake3, fxhash, sha256, sha512]
  -b, --min_size <MIN_SIZE>
          Set a minimum file size (in bytes) to search for duplicate files
  -B, --max_size <MAX_SIZE>
          Set a maximum file size (in bytes) to search for duplicate files
  -c, --csv_dir <CSV_DIR>
          Set the output directory for the CSV file (fdf.csv)
  -d, --min_depth <MIN_DEPTH>
          Set the minimum depth to search for duplicate files
  -D, --max_depth <MAX_DEPTH>
          Set the maximum depth to search for duplicate files
  -f, --full_path
          Prints full path of duplicate files, otherwise relative path
  -g, --generate <GENERATOR>
          If provided, outputs the completion file for given shell [possible values: bash, elvish, fish, powershell, zsh]
  -i, --input_dir <INPUT_DIR>
          Set the input directory where to search for duplicate files [default: current directory]
  -n, --min_number <MIN_NUMBER>
          Minimum 'number of identical files' to be reported
  -N, --max_number <MAX_NUMBER>
          Maximum 'number of identical files' to be reported
  -o, --omit_hidden
          Omit hidden files (starts with '.'), otherwise search all files
  -r, --result_format <RESULT_FORMAT>
          Print the result in the chosen format [default: personal] [possible values: json, yaml, personal]
  -s, --sort
          Sort result by number of duplicate files, otherwise sort by file size
  -t, --time
          Show total execution time
  -v, --verbose
          Show intermediate runtime messages
  -w, --wipe_terminal
          Wipe (Clear) the terminal screen before listing the duplicate files
  -x, --xlsx_dir <XLSX_DIR>
          Set the output directory for the XLSX file (fdf.xlsx)
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version

Building

To build and install from source, run:

cargo install find_duplicate_files

Alternatively, install from GitHub:

cargo install --git https://github.com/claudiofsr/find_duplicate_files.git

Mutually exclusive features

Recursive directory traversal: jwalk or walkdir.

In general, jwalk (the default) is faster than walkdir.

If you prefer walkdir, run:

cargo install --features walkdir find_duplicate_files

Dependencies

~15–25MB
~347K SLoC