4 stable releases

2.0.0	Feb 24, 2023
1.2.3	May 9, 2022
1.2.1	Mar 3, 2022
1.1.1	Mar 1, 2022
0.1.0	~~Feb 15, 2021~~

#701 in Text processing

30 downloads per month

Apache-2.0

530KB
4.5K SLoC

Ungoliant

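🕷️ Ungoliant is a high-performance pipeline that provides tools for building corpus generation pipelines from CommonCrawl. 🕷️

It is currently the pipeline used to generate the OSCAR corpus from CommonCrawl. Ungoliant replaces goclassy.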

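Installation

Installing/compiling the binary

Via cargo: cargo install ungoliant
Via git: cargo install --git https://github.com/oscar-corpus/ungoliant

Ungoliant pulls in many dependencies, which are compiled during installation. Because the project uses fasttext-rs, cmake and gcc may also need to be installed on the system.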
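Before running cargo install, it can help to check that the native build tools are on PATH. The following is a minimal sketch, not part of Ungoliant: the helper name missing_tools is hypothetical, and the tool list (cargo, cmake, gcc) is taken from the note above rather than from an official requirements list.

```shell
# Print the name of every listed tool that cannot be found on PATH.
# `missing_tools` is an illustrative helper, not an Ungoliant command.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running missing_tools cargo cmake gcc prints nothing when all three tools are available, and otherwise lists the ones that still need to be installed.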

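KenLM feature

The KenLM feature is optional because it relies on unsafe code that can break if the supplied model files are incorrect.

To enable it, install the KenLM requirements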

apt install -y libboost-all-dev libeigen3-dev

and then install with cargo install ungoliant --features kenlm, or with cargo b --features kenlm if you are building from source.

Getting a language identification file (for fastText)

Use curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin.
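Since the model file is large, a small wrapper can skip the download when the file is already present. This is a sketch under stated assumptions: fetch_lid_model is a hypothetical helper name, and the only URL used is the one given above.

```shell
# URL of the fastText language identification model, as given above.
LID_URL="https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"

# Download the model to the given path (default: lid.176.bin)
# unless a non-empty file already exists there.
# `fetch_lid_model` is an illustrative helper, not an Ungoliant command.
fetch_lid_model() {
    dest="${1:-lid.176.bin}"
    if [ -s "$dest" ]; then
        echo "$dest already present, skipping download"
        return 0
    fi
    # --fail: treat HTTP errors as failures; --location: follow redirects.
    curl --fail --location "$LID_URL" -o "$dest"
}
```

Calling fetch_lid_model with no argument writes to lid.176.bin in the current directory; passing a path stores the model elsewhere.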

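Usage

The usual way of generating a corpus is:

1. Fetch the wet.paths.gz file from the latest CommonCrawl dump and decompress it.
2. Download the files using the download command.
3. Generate the corpus using the pipeline command (this may take some time).
4. Head to oscar-tools for the packaging steps.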
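The first step above can be sketched as follows. This is an illustration rather than an official procedure: the helper names wet_paths_url and fetch_wet_paths are hypothetical, the crawl ID CC-MAIN-2023-06 is only an example, and the URL layout assumes CommonCrawl's public HTTPS endpoint at data.commoncrawl.org.

```shell
# Base URL of the CommonCrawl public data set (assumption: HTTPS endpoint).
CC_BASE="https://data.commoncrawl.org"

# Build the URL of the wet.paths.gz index for a crawl ID
# such as CC-MAIN-2023-06 (example ID, pick the dump you need).
wet_paths_url() {
    printf '%s/crawl-data/%s/wet.paths.gz\n' "$CC_BASE" "$1"
}

# Step 1: fetch and decompress the index, leaving ./wet.paths on disk.
fetch_wet_paths() {
    curl --fail --location -O "$(wet_paths_url "$1")" && gunzip -f wet.paths.gz
}

# Steps 2 and 3 use the download and pipeline subcommands. Their arguments
# are not reproduced here; consult the built-in help before running them:
#   ungoliant download --help
#   ungoliant pipeline --help
```

For example, fetch_wet_paths CC-MAIN-2023-06 would leave a wet.paths file in the current directory, ready to be passed to the download command.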

For more information on each command, see its --help output.

ungoliant 2
corpus generation tool.

USAGE:
    ungoliant <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    download    Download a CommonCrawl release
    help        Prints this message or the help of the given subcommand(s)
    pipeline    Run pipeline
    rebuild     Rebuild the corpus for a given language.

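Documentation

Ungoliant is not yet on docs.rs: use cargo doc --bins --open to build and open the documentation locally.

For more information about the project, see the OSCAR documentation.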

Dependencies

~29–45MB
~613K SLoC

avro-rs 0.13+snappy, bytes, csv, ctclib-pp (optional, kenlm feature), env_logger 0.8.3, fasttext, flate2, futures, futures-core, futures-util, glob, itertools 0.10, language-tags 0.3.2, lazy_static, log, oscar-io 0.2.2, oxilangtag+serde, rand 0.8.4, rayon, reqwest 0.11+rustls-tls+blocking+stream, runiq-lib, schemars, serde+derive, serde_json, sha2 0.9.5, structopt 0.3.21, tlsh-fixed, tokio+full, tokio-util 0.6.6+compat, twox-hash, unic-ucd, unicode-script, unicode-segmentation, url, ut1_blocklist, warc+with_serde

Dev dependencies: criterion 0.3, rand_distr, serial_test 0.5.1, sha-1 0.9, tempfile, test-log