21 unstable releases (3 breaking)

new 0.33.0	Aug 23, 2024
0.32.2	May 29, 2024
0.31.9	May 26, 2024
0.31.5	Apr 27, 2024
0.30.6	Apr 15, 2024

BSD-3-Clause

find-identical-files

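Find identical files according to their size and hashing algorithm.

Two files are therefore treated as identical when they have the same size and the same hash.

"A hash function is a mathematical algorithm that takes an input (in this case, a file) and produces a fixed-size string of characters, known as a hash value or checksum. The hash value acts as a summary representation of the original input. This hash value is unique (disregarding unlikely collisions) to the input data, meaning even a slight change in the input will result in a completely different hash value."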

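To find identical files, the following 3 procedures are performed:

Procedure 1. Group files by size.

Procedure 2. Group files by hash(first_bytes) with the ahash algorithm.

Procedure 3. Group files by hash(entire_file) with the chosen algorithm.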
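The three procedures above can be sketched in Python as follows. This is only an illustration of the grouping idea, not the crate's implementation: sha256 from the standard library stands in for both ahash and the chosen algorithm, and the prefix length FIRST_BYTES and all function names are assumptions.

```python
import hashlib
import os
from collections import defaultdict

FIRST_BYTES = 1024  # assumed prefix length; the real tool may use another value


def hash_file(path, limit=None):
    """Hash the first `limit` bytes of a file, or the whole file if limit is None."""
    hasher = hashlib.sha256()  # stand-in for ahash / the chosen algorithm
    remaining = limit
    with open(path, "rb") as handle:  # read-only access
        while remaining is None or remaining > 0:
            size = 65536 if remaining is None else min(65536, remaining)
            chunk = handle.read(size)
            if not chunk:
                break
            hasher.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return hasher.hexdigest()


def regroup(groups, key):
    """Split every group by `key` and drop groups left with a single file."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(g for g in buckets.values() if len(g) > 1)
    return result


def find_identical(paths):
    groups = regroup([list(paths)], os.path.getsize)                 # procedure 1
    groups = regroup(groups, lambda p: hash_file(p, FIRST_BYTES))    # procedure 2
    groups = regroup(groups, hash_file)                              # procedure 3
    return groups
```

Each stage only looks at files that survived the previous one, so the expensive full-file hash runs only on files that already share a size and a prefix hash.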

The hashing algorithm options are:

ahash (used by hashbrown)
blake version 3 (default)
fxhash (used in Firefox and rustc)
sha256
sha512

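find-identical-files only reads the files and never changes their contents. See the open_file function to verify.

Usage examples

1. To find identical files in the current directory, run the command: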

find-identical-files

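The number of identical files is the number of times the same file is found (number of repetitions, or frequency).

By default, only identical files with a frequency of two (duplicates) or more are selected.

2. To search the current directory for files with at least N identical copies, run the command: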

find-identical-files -f N

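where N is an integer greater than or equal to 1 (N >= 1).

Use the -f (or --min_frequency) option to set the minimum frequency (number of identical files).

Use the -F (or --max_frequency) option to set the maximum frequency (number of identical files).

To report all files

Useful for getting hash information for all files in the current directory.

find-identical-files -f 1

To find files with a frequency of two (duplicates) or more (default)

find-identical-files

or

find-identical-files -f 2

To find files whose frequency is exactly 4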

find-identical-files -f 4 -F 4
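As a rough illustration of how the two frequency bounds combine, the sketch below filters groups of identical files by their size. The function name and the list-of-lists representation are assumptions made for this example, not the crate's internals.

```python
def filter_by_frequency(groups, min_frequency=2, max_frequency=None):
    """Keep groups whose number of identical files lies within the bounds.

    Both bounds are inclusive; max_frequency=None means no upper bound.
    """
    selected = []
    for group in groups:
        frequency = len(group)
        if frequency < min_frequency:
            continue
        if max_frequency is not None and frequency > max_frequency:
            continue
        selected.append(group)
    return selected


# Hypothetical groups of identical files (paths are made up).
groups = [["a"], ["b", "c"], ["d", "e", "f", "g"]]
default_selection = filter_by_frequency(groups)        # like the default, -f 2
exactly_four = filter_by_frequency(groups, 4, 4)       # like -f 4 -F 4
everything = filter_by_frequency(groups, 1)            # like -f 1
```

Setting both bounds to the same value selects exactly that frequency, which mirrors the -f 4 -F 4 example.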

3. To find identical files in the current directory whose size is greater than or equal to N bytes

find-identical-files -b N

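where N is an integer (N >= 0).

Use the -b (or --min_size) option to set the minimum size (in bytes).

Use the -B (or --max_size) option to set the maximum size (in bytes).

To find identical files whose size is greater than or equal to 8 bytes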

find-identical-files -b 8

To find identical files whose size is less than or equal to 1024 bytes

find-identical-files -B 1024

To find identical files whose size is between 8 and 1024 bytes

find-identical-files -b 8 -B 1024

To find identical files whose size is exactly 1024 bytes

find-identical-files -b 1024 -B 1024
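The size bounds act as an inclusive range applied before any hashing. A minimal Python sketch of that selection step, assuming a plain recursive walk (the function name and the decision to skip symbolic links are choices made for this example only):

```python
import os


def files_in_size_range(root, min_size=0, max_size=None):
    """Return (path, size) pairs under `root` with min_size <= size <= max_size.

    max_size=None means no upper bound.
    """
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size < min_size:
                continue
            if max_size is not None and size > max_size:
                continue
            selected.append((path, size))
    return selected
```

For instance, files_in_size_range(".", 8, 1024) selects the same size range as the -b 8 -B 1024 example.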

4. To find identical files with the `fxhash` algorithm and `yaml` format

find-identical-files -twa fxhash -r yaml

5. To find identical files in the current directory and export the result to a CSV file (fif.csv), pass the output directory to -c.

find-identical-files -c .

find-identical-files -c /tmp

or

find-identical-files --csv_dir=/tmp

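6. To find identical files in the current directory and export the result to an XLSX file (fif.xlsx), pass the output directory to -x.

The XLSX file will be saved in the ~/Downloads directory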

find-identical-files -x ~/Downloads

find-identical-files -x /tmp

or

find-identical-files --xlsx_dir=/tmp

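7. To find identical files in the `Downloads` directory with the `ahash` algorithm, redirect the output to a `json` file (/tmp/fif.json), and export the result to an XLSX file (/tmp/fif.xlsx) for further analysis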

find-identical-files -tvi ~/Downloads -a ahash -r json > /tmp/fif.json -x /tmp

8. To get information with jq

Print all hashes

find-identical-files -r json | jq -sr '.[:-1].[].["File information"].hash'

Get information about the first identical file

find-identical-files -r json | jq -s '.[0]'

Get information about the 15th identical file (if it exists)

find-identical-files -r json | jq -s '.[14]'

Get information from the range [a,b), which includes the start (a) and excludes the end (b).

For a = 2 and b = 5

find-identical-files -r json | jq -s '.[2:5]'

Get the summary information

find-identical-files -r json | jq -s '.[-1]'

Another option is to redirect the result to a temporary file and read specific information from it

find-identical-files -vr json > /tmp/fif

jq -sr '.[:-1].[].["File information"].hash' /tmp/fif
jq -s '.[0]' /tmp/fif
jq -s '.[-2]' /tmp/fif
jq -s '.[-1]' /tmp/fif
jq -s '.[-1]["Total number of identical files"]' /tmp/fif
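The jq -s calls above slurp a stream of JSON values into one array. For readers without jq, the same idea can be sketched in Python with only the standard library. The sample text below is made up and contains only the keys that appear in the jq filters above; the real output very likely has more fields.

```python
import json


def load_json_stream(text):
    """Parse a stream of concatenated JSON values into a list (like jq -s)."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


# Hypothetical sample shaped after the jq filters; hash values are placeholders.
sample = """
{"File information": {"hash": "aaa"}}
{"File information": {"hash": "bbb"}}
{"Total number of identical files": 4}
"""

records = load_json_stream(sample)
hashes = [r["File information"]["hash"] for r in records[:-1]]  # like .[:-1]
summary = records[-1]                                           # like .[-1]
total = summary["Total number of identical files"]
```

To use it on real output, read /tmp/fif into a string and pass it to load_json_stream instead of the sample.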

Help

Type find-identical-files -h in the terminal to see the help message and all available options

find identical files according to their size and hashing algorithm

Usage: find-identical-files [OPTIONS]

Options:
  -a, --algorithm <ALGORITHM>
          Choose the hash algorithm [default: blake3] [possible values: ahash, blake3, fxhash, sha256, sha512]
  -b, --min_size <MIN_SIZE>
          Set a minimum file size (in bytes) to search for identical files [default: 0]
  -B, --max_size <MAX_SIZE>
          Set a maximum file size (in bytes) to search for identical files
  -c, --csv_dir <CSV_DIR>
          Set the output directory for the CSV file (fif.csv)
  -d, --min_depth <MIN_DEPTH>
          Set the minimum depth to search for identical files [default: 0]
  -D, --max_depth <MAX_DEPTH>
          Set the maximum depth to search for identical files
  -e, --extended_path
          Prints extended path of identical files, otherwise relative path
  -f, --min_frequency <MIN_FREQUENCY>
          Minimum frequency (number of identical files) to be filtered [default: 2]
  -F, --max_frequency <MAX_FREQUENCY>
          Maximum frequency (number of identical files) to be filtered
  -g, --generate <GENERATOR>
          If provided, outputs the completion file for given shell [possible values: bash, elvish, fish, powershell, zsh]
  -i, --input_dir <INPUT_DIR>
          Set the input directory where to search for identical files [default: current directory]
  -o, --omit_hidden
          Omit hidden files (starts with '.'), otherwise search all files
  -r, --result_format <RESULT_FORMAT>
          Print the result in the chosen format [default: personal] [possible values: json, yaml, personal]
  -s, --sort
          Sort result by number of identical files, otherwise sort by file size
  -t, --time
          Show total execution time
  -v, --verbose
          Show intermediate runtime messages
  -w, --wipe_terminal
          Wipe (Clear) the terminal screen before listing the identical files
  -x, --xlsx_dir <XLSX_DIR>
          Set the output directory for the XLSX file (fif.xlsx)
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version

Building

To build and install from source, run the following command

cargo install find-identical-files

Another option is to install from GitHub

cargo install --git https://github.com/claudiofsr/find-identical-files.git

Mutually exclusive features

Walking a directory recursively: jwalk or walkdir.

In general, jwalk (default) is faster than walkdir.

But if you prefer to use walkdir

cargo install --features walkdir find-identical-files
