3 stable releases

1.0.2	Aug 11, 2024

#130 in Command line utilities

295 downloads per month

GPL-3.0-or-later

70KB
1K SLoC

cuniq

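Pitch: cuniq is a command-line tool specialized for counting unique lines in text input. If you frequently run commands such as sort -u | wc -l or sort | uniq -c, cuniq can give you a performance boost.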
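To make the comparison concrete, here is a minimal Rust sketch of the hash-based approach to counting unique lines. It produces the same number as sort -u | wc -l without sorting anything. It is an illustration only, not cuniq's actual implementation, and the function name `count_unique_lines` is hypothetical.

```rust
use std::collections::HashSet;

/// Count the distinct lines in `input` with a hash set.
/// Hypothetical helper for illustration; not part of cuniq.
fn count_unique_lines(input: &str) -> usize {
    let mut seen: HashSet<&str> = HashSet::new();
    for line in input.lines() {
        // Inserting a line that is already present leaves the set unchanged.
        seen.insert(line);
    }
    seen.len()
}
```

Hashing needs only a single pass over the input, whereas the sort-based pipeline has to order every line before duplicates can be dropped.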

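Anti-pitch: for small inputs, sort and uniq are perfectly fine, because switching to cuniq would only save milliseconds. However, if you have been using sort | uniq | wc -l, you should switch to sort -u | wc -l: it is a free performance gain that does not require leaving standard POSIX commands.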

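Performance

cuniq has been benchmarked against several combinations of GNU coreutils (sort, uniq, and wc) and against other hash-based Rust tools (runiq, sortuniq, and huniq). As of this writing, you should not use runiq 2.0.0 or sortuniq 0.2.0 to count unique lines: they underperform in all cases, and in many cases they are on par with, or even slower than, sort -u | wc -l.

For counting, cuniq reliably outperforms GNU sort in all cases.

For reporting line occurrence counts, cuniq reliably outperforms GNU uniq in all cases, with one exception:

[!NOTE] If your input contains very few duplicates and you need a sorted report, you are better off using sort | uniq -c. With very few duplicates, both approaches must sort nearly the entire input, but cuniq additionally spends time building a hash table.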
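The trade-off in this note can be seen in a small Rust sketch of a hash-based report. It assumes a design in which occurrences are tallied in a hash map and the distinct lines are sorted afterwards only when a sorted report is requested. This is not cuniq's source; `line_report` and its `sorted` parameter are hypothetical names.

```rust
use std::collections::HashMap;

/// Tally how often each line occurs; optionally sort the report by line.
/// Hypothetical helper for illustration; not part of cuniq.
fn line_report(input: &str, sorted: bool) -> Vec<(usize, &str)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for line in input.lines() {
        *counts.entry(line).or_insert(0) += 1;
    }
    // Without sorting, the order follows the hash map's iteration order.
    let mut report: Vec<(usize, &str)> =
        counts.into_iter().map(|(line, n)| (n, line)).collect();
    if sorted {
        // Only the distinct lines are sorted, in byte order.
        report.sort_by(|a, b| a.1.cmp(b.1));
    }
    report
}
```

When nearly every line is distinct, the map ends up holding almost the whole input, so the final sort does about as much work as sorting the raw input, and the hashing comes on top of that.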

For technical details such as cuniq's benchmarks and profile-guided optimization, see PERFORMANCE.md.

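Compatibility

cuniq produces output compatible with the corresponding GNU coreutils commands.

GNU coreutils command	cuniq equivalent	Effect	Notes
`sort | uniq | wc -l`	`cuniq`	Count of unique lines
`sort -u | wc -l`	`cuniq`	Count of unique lines	This GNU coreutils command is more efficient than the one above
`sort | uniq -c`	`cuniq -c`	Unsorted report of occurrence counts per unique line	The two commands emit lines in a different order
`sort | uniq -c`	`cuniq -cs`	Sorted report of occurrence counts per unique line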

Installation

Install from source

Install Rust. A nightly toolchain is required until the hash_raw_entry feature is stabilized.
RUSTFLAGS="-C target-cpu=native" cargo +nightly install cuniq

Manual installation

Download cuniq from the latest release and save it to a folder of your choice.

Usage

cuniq accepts lines from stdin or from a list of files.

Usage: cuniq [OPTIONS] [FILES]...

Arguments:
  [FILES]...
          Files to process

Options:
  -c, --report
          Instead of printing total unique lines, print a report showing occurrence count of each
          line. This is only compatible with "exact" mode (the default)

  -s, --sort
          Sort report output alphabetically by line. Has no effect unless used with `--report`

  -t, --trim
          Remove leading and trailing whitespace from input

  -l, --lower
          Convert input to lowercase

  -m, --mode <MODE>
          Sets the algorithm used to count (or estimate) cardinality

          [default: exact]

          Possible values:
          - exact:      Uses a hash table to exactly count cardinality. The size of the hash table
            is proportional to the cardinality of the input. You may use the `--size` flag to set
            the initial capacity of the internal hash table. For very large inputs `--size` may help
            reduce expensive hash table reallocations. Avoid setting `--size` for small datasets
          - near-exact: Uses a hash table to exactly count cardinality, but does not store the
            original line. This mode is faster than "exact" mode, but hash collision will result in
            under-counting the cardinality by one. However, hash collisions for a 64-bit hash are
            exceedingly unlikely. The size of the hash table is proportional to the cardinality of
            the input. You may use the `--size` flag to set the initial capacity of the internal
            hash table. For very large inputs `--size` may help reduce expensive hash table
            reallocations. Avoid setting `--size` for small datasets. This mode is not compatible
            with `--report`
          - estimate:   Uses the HyperLogLog algorithm to estimate cardinality with fixed memory.
            Use the `--size` flag to specify the number of 1-byte registers to use. More registers
            will increase estimate accuracy. By default, 65536 is used. This mode is not compatible
            with `--report`

  -n, --size <SIZE>
          Set the size used by the selected counting mode. See the `--mode` documentation for how
          this affects each counting mode

      --threads <THREADS>
          Set the number of threads used to perform the count. By default, the number of logical
          cores is used. Not all counting modes support parallelism: see `--mode` for details

      --no-stdin
          Disable checking stdin for input. May yield a small performance improvement when only
          reading input from files

      --memmap
          Force reading files via memmap. This may yield improved performance for large files. If
          the binary was built without memmap support, using this flag will result in an error

      --no-memmap
          Disable reading files via memmap, instead falling back to normal reads. By default, cuniq
          will try to use memmap if it thinks it will be faster. Disabling memmap may yield improved
          performance for small files

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
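
The `near-exact` and `estimate` modes described in the help text above can be illustrated with a short Rust sketch. It assumes the standard library's `DefaultHasher` as the 64-bit hash and a textbook HyperLogLog with a linear-counting correction for small cardinalities; cuniq's actual hash function, register layout, and corrections may differ. The names `hash64`, `near_exact_count`, and `hll_estimate` are hypothetical, and the sketch takes the base-2 logarithm of the register count rather than the register count itself.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// 64-bit hash of a line (assumption: any well-mixed 64-bit hash would do).
fn hash64(line: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    line.hash(&mut hasher);
    hasher.finish()
}

/// "near-exact" idea: keep only the hashes, not the lines themselves.
/// Two different lines with the same hash would be counted once.
fn near_exact_count(input: &str) -> usize {
    let hashes: HashSet<u64> = input.lines().map(hash64).collect();
    hashes.len()
}

/// "estimate" idea: HyperLogLog with 2^bits one-byte registers.
fn hll_estimate(input: &str, bits: u32) -> f64 {
    // The alpha constant used below is the one for 128 or more registers.
    assert!((7..=16).contains(&bits));
    let m = 1usize << bits;
    let mut registers = vec![0u8; m];
    for line in input.lines() {
        let h = hash64(line);
        // The top `bits` bits select a register.
        let idx = (h >> (64 - bits)) as usize;
        // Rank: 1-based position of the first 1-bit in the remaining bits.
        let rest = h << bits;
        let rank = (rest.leading_zeros() + 1).min(64 - bits + 1) as u8;
        if rank > registers[idx] {
            registers[idx] = rank;
        }
    }
    let m_f = m as f64;
    let alpha = 0.7213 / (1.0 + 1.079 / m_f);
    let sum: f64 = registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
    let raw = alpha * m_f * m_f / sum;
    let zeros = registers.iter().filter(|&&r| r == 0).count();
    if raw <= 2.5 * m_f && zeros > 0 {
        // Small-range correction: fall back to linear counting.
        m_f * (m_f / zeros as f64).ln()
    } else {
        raw
    }
}
```

In this sketch, `bits = 16` corresponds to 65536 one-byte registers, so the memory used stays fixed however many distinct lines are seen, while the hash set in `near_exact_count` grows with the cardinality of the input.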

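License

cuniq is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

cuniq is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

The full list of dependencies can be found in Cargo.toml, or use cargo deny list to generate a breakdown of dependencies by license.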

Dependencies

~4–14MB
~175K SLoC