11 releases (7 stable)

1.0.6	Jan 26, 2022
1.0.5	May 27, 2021
1.0.4	Oct 22, 2020
1.0.3	Aug 3, 2020
0.1.0	Feb 5, 2018

#389 in Text processing

MPL-2.0 license

36KB
678 lines

uwc

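Like wc, but Unicode-aware, and with a line mode.

uwc can count

lines
words
bytes
grapheme clusters
Unicode code points

It can also operate in line mode, which reports the counts for each line of the input.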
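To make the difference between some of these counters concrete, here is a minimal sketch using only the Rust standard library. It is not uwc's implementation, and `byte_and_codepoint_counts` is a hypothetical helper name. Grapheme clusters and Unicode word boundaries need a segmentation library and are deliberately left out.

```rust
// Sketch only: bytes and code points via the standard library.
// Grapheme clusters and Unicode word boundaries are not covered here.

/// Returns (byte count, code point count) for a string.
/// Hypothetical helper, not part of uwc.
fn byte_and_codepoint_counts(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived
    // character, two code points, three UTF-8 bytes.
    let sample = "e\u{301}";
    let (bytes, codepoints) = byte_and_codepoint_counts(sample);
    println!("bytes={bytes} codepoints={codepoints}"); // bytes=3 codepoints=2
}
```

The three counts diverge as soon as the text leaves ASCII, which is why uwc reports them as separate columns.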

Usage examples

By default, uwc counts lines, words, and bytes. You can select specific counters, or pass the -a flag to request all of them.

$ uwc tests/fixtures/**/input
lines  words  bytes  filename
8      5      29     tests/fixtures/all_newlines/input
0      0      0      tests/fixtures/empty/input
0      0      0      tests/fixtures/empty_line_mode/input
1      9      97     tests/fixtures/flags_bp/input
1      9      97     tests/fixtures/flags_cl/input
1      9      97     tests/fixtures/flags_w/input
0      1      5      tests/fixtures/hello/input
1      9      97     tests/fixtures/i_can_eat_glass/input
8      8      29     tests/fixtures/line_mode/input
7      8      28     tests/fixtures/line_mode_no_trailing_newline/input
7      8      28     tests/fixtures/line_mode_no_trailing_newline_count_newlines/input
34     66     507    total

$ uwc -a tests/fixtures/**/input
lines  words  bytes  graphemes  codepoints  filename
8      5      29     23         24          tests/fixtures/all_newlines/input
0      0      0      0          0           tests/fixtures/empty/input
0      0      0      0          0           tests/fixtures/empty_line_mode/input
1      9      97     51         51          tests/fixtures/flags_bp/input
1      9      97     51         51          tests/fixtures/flags_cl/input
1      9      97     51         51          tests/fixtures/flags_w/input
0      1      5      5          5           tests/fixtures/hello/input
1      9      97     51         51          tests/fixtures/i_can_eat_glass/input
8      8      29     28         28          tests/fixtures/line_mode/input
7      8      28     27         27          tests/fixtures/line_mode_no_trailing_newline/input
7      8      28     27         27          tests/fixtures/line_mode_no_trailing_newline_count_newlines/input
34     66     507    314        315         total

You can also switch to line mode with the --mode flag:

$ uwc -a --mode line tests/fixtures/line_mode/input
lines  words  bytes  graphemes  codepoints  filename
0      1      1      1          1           tests/fixtures/line_mode/input:1
0      1      2      2          2           tests/fixtures/line_mode/input:2
0      1      3      3          3           tests/fixtures/line_mode/input:3
0      1      5      4          4           tests/fixtures/line_mode/input:4
0      1      1      1          1           tests/fixtures/line_mode/input:5
0      1      4      4          4           tests/fixtures/line_mode/input:6
0      1      2      2          2           tests/fixtures/line_mode/input:7
0      1      3      3          3           tests/fixtures/line_mode/input:8
0      8      21     20         20          tests/fixtures/line_mode/input:total

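Why?

The goal of this project is to count correctly according to Unicode rules. Specifically, it should

Count all newlines correctly. This includes lesser-known newlines such as NEL (U+0085), FF (U+000C), LS (U+2028), and PS (U+2029).
Count all words using the Unicode standard's word boundary rules.
Count all complete grapheme clusters correctly, so that even edge cases like Z҉͈͓͈͎a̘͈̠̭l̨̯g̶̬͇̭o̝̹̗͎̙ ͟t͖̙̟̹͇̥̝͡e̥͘x͚̺̭̻͘t͉͔̩̲̘ are counted correctly.

However, it does not aim to implement these Unicode algorithms itself, so it relies on the unicode-segmentation library for most of the work. Unicode support in the Rust ecosystem is not yet fully mature, which has some consequences for this project. See the caveats below.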
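As an illustration of the first goal in the list above, the following sketch counts line terminators with the standard library alone. It is an assumption-laden approximation, not uwc's code: `count_line_breaks` is a hypothetical helper, and the terminator set (LF, VT, FF, CR, CR LF, NEL, LS, PS) is taken from the Unicode newline guidelines rather than from this README, which names only NEL, FF, LS, and PS.

```rust
/// Counts line terminators, treating CR LF as a single break.
/// Hypothetical helper; the terminator set is an assumption based on
/// the Unicode newline guidelines (LF, VT, FF, CR, CR LF, NEL, LS, PS).
fn count_line_breaks(text: &str) -> usize {
    let mut count = 0;
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // CR LF is one break, so swallow the LF if it follows.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                count += 1;
            }
            '\n' | '\u{000B}' | '\u{000C}' | '\u{0085}' | '\u{2028}' | '\u{2029}' => {
                count += 1;
            }
            _ => {}
        }
    }
    count
}

fn main() {
    // LF, CR LF, NEL, FF, LS, PS: six breaks in total.
    let sample = "a\nb\r\nc\u{0085}d\u{000C}e\u{2028}f\u{2029}g";
    println!("{}", count_line_breaks(sample)); // 6
}
```

A counter that only looks for LF would report 2 for the same input.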

Installation

It is published on crates.io, so just run

$ cargo install uwc

Caveats

UTF-8

It only supports UTF-8 files. UTF-16 can go on my to-do list if there is demand. For now, you can use iconv to convert non-UTF-8 files first.
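If you are unsure whether a file needs converting, a quick validity check can tell you. The sketch below is a standalone illustration using the standard library, not something uwc provides; `first_invalid_utf8_offset` is a hypothetical helper name.

```rust
/// Returns None for valid UTF-8, or the byte offset of the first
/// invalid sequence. Hypothetical helper, not part of uwc.
fn first_invalid_utf8_offset(bytes: &[u8]) -> Option<usize> {
    std::str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}

fn main() {
    let utf8 = "héllo".as_bytes();
    // "hé" encoded as UTF-16LE: 0x68 0x00 0xE9 0x00.
    let utf16le = [0x68u8, 0x00, 0xE9, 0x00];

    println!("{:?}", first_invalid_utf8_offset(utf8)); // None
    println!("{:?}", first_invalid_utf8_offset(&utf16le)); // Some(2)
}
```

Note that pure-ASCII text stored as UTF-16 is still valid UTF-8 byte for byte (it just contains NUL bytes), so a passing check is a hint rather than proof of the encoding.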

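Memory usage

The current implementation always reads a complete line before counting it. Short of hand-writing a streaming implementation of the Unicode line segmentation algorithm, this is necessary for correctness in line mode. As a consequence, if you give it a file with very long lines, it uses memory proportional to the line length, and if you give it a file with no newlines at all, it reads the entire file into memory. Be warned.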
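The line-at-a-time pattern described above can be sketched as follows. This is only an illustration of why peak memory tracks the longest line; it is not uwc's reader, `longest_line_bytes` is a hypothetical helper, and `read_line` splits on LF only rather than on every Unicode newline.

```rust
use std::io::BufRead;

/// Reads one line at a time into a reused buffer and returns the byte
/// length of the longest line (terminator included). Hypothetical
/// helper; splits on LF only, unlike a Unicode-aware reader.
fn longest_line_bytes<R: BufRead>(mut reader: R) -> std::io::Result<usize> {
    let mut line = String::new();
    let mut longest = 0;
    loop {
        line.clear(); // keep the allocation, drop the previous contents
        let read = reader.read_line(&mut line)?;
        if read == 0 {
            break; // end of input
        }
        longest = longest.max(read);
    }
    Ok(longest)
}

fn main() -> std::io::Result<()> {
    let input = "short\na much longer line\n".as_bytes();
    println!("{}", longest_line_bytes(input)?); // 19
    Ok(())
}
```

The buffer never needs to hold more than one line, so an input with no newline at all ends up entirely in that buffer.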

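Speed

It is slower than wc. My profiling is not thorough yet, but as far as I can tell the reasons are:

It uses Unicode algorithms, which are inherently slower than ASCII-only processing.
I am not very experienced with Rust, so I am probably not doing things as efficiently as possible.
My free time is limited, and I am prioritizing correctness over speed (although speed is important too).

That said, it is parallelized, which helps. In tests on larger data sets on my laptop, it is within an order of magnitude of wc's speed. On an 18 MiB collection of text files, I measured uwc to be 1.5x slower than wc.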

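Localization

Rust does not have a localization library yet, which has some consequences. Some counts will be wrong, such as hyphenated words, whose handling is locale-specific and requires a language dictionary lookup to get right. There are also languages with no syntactic word separators, such as Japanese, so for example: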

私はガラスを食べられます。

should be 5 words, but without localization we cannot determine that.
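The absence of separators is easy to see with a naive whitespace split. The sketch below uses only the standard library and a hypothetical helper, `whitespace_word_count`; it does not reproduce uwc's Unicode word boundary counting and makes no claim about what uwc reports for this sentence.

```rust
/// Counts words by splitting on Unicode whitespace.
/// Hypothetical helper; this is not how uwc counts words.
fn whitespace_word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

fn main() {
    // No whitespace anywhere, so the whole sentence is one "word".
    println!("{}", whitespace_word_count("私はガラスを食べられます。")); // 1
    // The English equivalent splits cleanly on spaces.
    println!("{}", whitespace_word_count("I can eat glass.")); // 4
}
```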

Dependencies

~7–16MB
~196K SLoC