2 versions

0.1.1	Nov 29, 2019
0.1.0	Nov 29, 2019

#1407 in Text processing

Custom license

12KB
235 lines

corpus-count

A small utility that counts tokens and, optionally, character ngrams in a whitespace-tokenized corpus.

Outputs sorted lists.
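As a rough illustration of what counting tokens in a whitespace-tokenized corpus amounts to, here is a minimal Python sketch. It is not the tool's implementation. The function names `count_tokens` and `sorted_counts` are hypothetical, and the sort order (descending count, ties broken alphabetically) is an assumption, since the description above only says the output is sorted.

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # str.split() with no argument splits on any run of whitespace.
        counts.update(line.split())
    return counts


def sorted_counts(counts):
    # Assumption: most frequent first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


corpus = ["the cat sat", "the cat ran", "the end"]
result = sorted_counts(count_tokens(corpus))
for token, count in result:
    print(token, count)
```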

Usage

# read from file, write ngram and token counts to files
$ corpus-count -c /path/to/corpus.txt -n /path/to/ngram_output.txt \
    -t /path/to/token_output.txt

# read from file, don't count ngrams and write token counts to stdout
$ corpus-count -c /path/to/corpus.txt

# read from stdin, don't count ngrams and write token counts to stdout
$ corpus-count < /path/to/corpus.txt

# read from file, write ngram and token counts to files, filter out tokens and
# ngrams appearing fewer than 30 times. ngrams are counted **before** filtering
# tokens.
$ corpus-count -c /path/to/corpus.txt -n /path/to/ngram_output.txt \
    -t /path/to/token_output.txt --token_min 30 --ngram_min 30

# read from file, write ngram and token counts to files, filter out tokens and
# ngrams appearing fewer than 30 times. Count ngrams **after** filtering tokens.
$ corpus-count -c /path/to/corpus.txt -n /path/to/ngram_output.txt \
    -t /path/to/token_output.txt --token_min 30 --ngram_min 30 --filter_first

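Whether ngrams are counted depends on whether an argument is passed to the --ngram_count or -n flag. Without the --filter_first flag, ngram counts are determined before tokens are filtered, so tokens that appear fewer than --token_min times still contribute to ngram counts. If the flag is set, tokens are filtered first and only tokens in the vocabulary contribute to ngram counts.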
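The effect of the filtering order can be sketched in Python. This is an illustration under stated assumptions, not the tool's code: the helper names are hypothetical, ngrams have a single fixed length `n`, no bracketing is applied, and every occurrence of a token is assumed to contribute its ngrams once per position.

```python
from collections import Counter


def char_ngrams(token, n):
    """All character ngrams of length n in token (no bracketing here)."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def count(tokens, token_min, ngram_min, n, filter_first):
    token_counts = Counter(tokens)
    # Keep tokens that appear at least token_min times.
    vocab = {t: c for t, c in token_counts.items() if c >= token_min}

    ngram_counts = Counter()
    for token, c in token_counts.items():
        # With filter_first, rare tokens no longer contribute ngrams.
        if filter_first and token not in vocab:
            continue
        for ngram in char_ngrams(token, n):
            # Assumption: each token occurrence contributes its ngrams.
            ngram_counts[ngram] += c

    ngrams = {g: c for g, c in ngram_counts.items() if c >= ngram_min}
    return vocab, ngrams


tokens = ["aba"] * 3 + ["abc"]
vocab, before = count(tokens, token_min=2, ngram_min=1, n=2, filter_first=False)
_, after = count(tokens, token_min=2, ngram_min=1, n=2, filter_first=True)
print(vocab, before, after)
```

In this toy corpus, "abc" falls below the token threshold either way, but it only adds to the count of "ab" (and introduces "bc") when ngrams are counted before filtering.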

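By default, tokens are bracketed with "<" and ">" before ngrams are extracted. This does not affect the tokens themselves, only the ngrams, and can be toggled with the --no_bracket flag.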

The minimum and maximum ngram lengths can be set with the corresponding --min_n and --max_n flags.
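To make bracketing and the length range concrete, here is a small Python sketch of character ngram extraction. The function `bracketed_ngrams` and its parameters are hypothetical names chosen for illustration, and the lengths 3 and 4 used in the example are arbitrary values, not the tool's defaults.

```python
def bracketed_ngrams(token, min_n, max_n, bracket=True):
    """Character ngrams of length min_n..max_n, optionally bracketed."""
    # Bracketing marks token boundaries so that prefixes and suffixes
    # yield ngrams distinct from token-internal ones.
    s = f"<{token}>" if bracket else token
    return [
        s[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(s) - n + 1)
    ]


with_brackets = bracketed_ngrams("cat", 3, 4)
without_brackets = bracketed_ngrams("cat", 3, 4, bracket=False)
print(with_brackets)
print(without_brackets)
```

With brackets, "cat" yields "<ca", "cat", "at>", "<cat" and "cat>". Without them, only "cat" remains, because the token is too short for any ngram of length 4.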

Installation

Requires Rust, which is most easily installed through https://rustup.rs.

cargo install corpus-count

Dependencies

~755KB