#preprocessor #text #corpus #cli #mark #modifier

app corpus-preproc

Preprocessor for text and HTML corpora

Owned by dosjorge.

1 unstable release

0.1.0	Feb 6, 2022

#1303 in Text processing

MIT license

415KB
8K SLoC

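Corpus Preprocessor

CLI and HTTP API for preprocessing corpora for word embeddings and possibly other NLP tasks. The main goal is to convert many HTML or plain-text files into a single normalized plain-text corpus.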

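Features

Parallel processing of files in a directory (CLI only)
NFKC and whitespace normalization
Removal of modifiers and marks
Lowercase folding
Trimming of punctuation around words
Replacement of a word with the <unk> placeholder if it meets any of the following criteria
- The word contains an @ character
- The word lacks alphabetic characters
- The word has two consecutive punctuation characters, such as http://
HTML parsing, with optional CSS selectors
- Removal of unwanted elements
- Insertion of newlines after paragraphs and line breaks
- Extraction of the main content of an HTML document
Text is automatically converted to UTF-8 if the original encoding is in the Encoding Standard.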
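
The character and word normalization steps listed above can be approximated in a few lines of Python. This is only a rough sketch built on the standard `unicodedata` module, not the crate's actual implementation: the function names, the exact Unicode categories treated as "modifiers and marks", and the order of the checks (the `@` test before trimming, the consecutive-punctuation test after) are all assumptions.

```python
import unicodedata

UNK = "<unk>"

def is_punct(ch):
    # Unicode general categories starting with "P" are punctuation.
    return unicodedata.category(ch).startswith("P")

def normalize_chars(text, keep_modifiers_and_marks=False, lowercase=True):
    """NFKC-normalize, optionally drop marks/modifiers, case-fold, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    if not keep_modifiers_and_marks:
        # Assumption: decompose first so accents become separate combining marks,
        # then drop marks (M*) and modifier letters/symbols (Lm, Sk) and recompose.
        decomposed = unicodedata.normalize("NFD", text)
        kept = [
            ch for ch in decomposed
            if not unicodedata.category(ch).startswith("M")
            and unicodedata.category(ch) not in ("Lm", "Sk")
        ]
        text = unicodedata.normalize("NFC", "".join(kept))
    if lowercase:
        text = text.casefold()
    # Collapse runs of whitespace inside each line but keep the line breaks.
    return "\n".join(" ".join(line.split()) for line in text.splitlines())

def normalize_word(word):
    """Trim surrounding punctuation and apply the three <unk> rules."""
    if "@" in word:  # assumption: checked before trimming
        return UNK
    start, end = 0, len(word)
    while start < end and is_punct(word[start]):
        start += 1
    while end > start and is_punct(word[end - 1]):
        end -= 1
    word = word[start:end]
    if not any(ch.isalpha() for ch in word):
        return UNK  # also covers words that were only punctuation
    if any(is_punct(a) and is_punct(b) for a, b in zip(word, word[1:])):
        return UNK  # e.g. the "://" in a URL
    return word

def normalize_text(text):
    text = normalize_chars(text)
    return "\n".join(
        " ".join(normalize_word(w) for w in line.split())
        for line in text.splitlines()
    )

print(normalize_text("HELLo,   WORLD!!!"))  # hello world
```

Trimming before the consecutive-punctuation check is what lets `WORLD!!!` survive as `world` while `http://example.com` still becomes `<unk>`.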

Usage

Command-line interface (CLI)

# Install
$ cargo install corpus-preproc
# Run CLI help
$ corpus-preproc clean -h
Preprocess a file or directory

USAGE:
    corpus-preproc clean [OPTIONS] <INPUT> <OUTPUT>

ARGS:
    <INPUT>     
    <OUTPUT>    

OPTIONS:
    -c
            Clean HTML tags

        --content-selector <CONTENT_SELECTOR>
            CSS selector for main content

        --delete-selector <DELETE_SELECTOR>
            CSS selector for tag removal [default: "script, style, pre, svg, math, noscript, ref,
            table, tr, td, ol, ul, li, time, [aria-hidden], img, figure"]

    -h, --help
            Print help information

    -l
            Perform case-folding

    -m
            Keep modifiers and marks on normalization

    -n
            Perform NFKC and whitespace normalization

        --nl-append-selector <NL_APPEND_SELECTOR>
            CSS selector to append newline [default: "div, p, hr, br, h1, h2, h3, h4, h5, h6"]

    -p
            Trim punctuation surrounding words

    -t <THREADS>
            Number of threads to use [default: 4]
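
For scripted runs, the options above can be assembled programmatically. The following Python sketch only builds the argument list from the flags shown in the help text; the helper name `build_clean_command`, its keyword arguments, and the example paths are hypothetical, and it assumes each short flag switches its step on.

```python
import shlex

def build_clean_command(input_path, output_path, *, clean_html=True,
                        normalize=True, case_fold=True, trim_punctuation=True,
                        keep_marks=False, content_selector=None, threads=4):
    """Build the argv list for `corpus-preproc clean` (hypothetical helper)."""
    cmd = ["corpus-preproc", "clean"]
    if clean_html:
        cmd.append("-c")  # Clean HTML tags
    if normalize:
        cmd.append("-n")  # NFKC and whitespace normalization
    if case_fold:
        cmd.append("-l")  # Case-folding
    if trim_punctuation:
        cmd.append("-p")  # Trim punctuation surrounding words
    if keep_marks:
        cmd.append("-m")  # Keep modifiers and marks on normalization
    if content_selector is not None:
        cmd += ["--content-selector", content_selector]
    cmd += ["-t", str(threads), str(input_path), str(output_path)]
    return cmd

# Example paths are placeholders.
cmd = build_clean_command("raw_html/", "corpus.txt",
                          content_selector="article", threads=8)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, provided the `corpus-preproc` binary is on the `PATH`.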

HTTP API

Startup

$ corpus-preproc serve 127.0.0.1:8000

Python example

Requires the requests Python library.

import requests
import json

DEFAULT_CONFIG = {
  "htmlClean": {
    "enabled": True,
    "contentSelector": None,
    "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure",
    "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
  },
  "charNormalization": {
    "enabled": True,
    "keepModifiersAndMarks": False,
    "lowercase": True,
  },
  "wordNormalization": {
    "enabled": True,
    "replacePii": True,
  }
}

def clean_text(text):
    files = {
        'config': (None, json.dumps(DEFAULT_CONFIG), 'application/json'), # optional
        'data': (None, text, 'text/plain'),
    }
    # The host and port must match the address passed to `corpus-preproc serve`.
    response = requests.post('http://127.0.0.1:8000/preproc', files=files)
    response.raise_for_status()
    return response.text

clean = clean_text("<b>HELLo, WORLD!!!").rstrip()
assert clean == "hello world", f"unexpected output: {clean!r}"
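
To change a single setting without retyping the whole configuration, the defaults can be copied and patched before sending. The sketch below is one possible way to do that. The helper names `make_config`, `build_multipart`, and `clean_text_with` are hypothetical, the URL assumes the server was started on 127.0.0.1:8000, and a complete config is always sent because the example does not say whether the server accepts partial ones.

```python
import copy
import json

# Same defaults as in the example above.
DEFAULT_CONFIG = {
    "htmlClean": {
        "enabled": True,
        "contentSelector": None,
        "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, "
                          "tr, td, ol, ul, li, time, [aria-hidden], img, figure",
        "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
    },
    "charNormalization": {
        "enabled": True,
        "keepModifiersAndMarks": False,
        "lowercase": True,
    },
    "wordNormalization": {"enabled": True, "replacePii": True},
}

def make_config(**overrides):
    """Return a deep copy of DEFAULT_CONFIG with per-section overrides applied."""
    config = copy.deepcopy(DEFAULT_CONFIG)
    for section, values in overrides.items():
        if section not in config:
            raise KeyError(f"unknown config section: {section}")
        unknown = set(values) - set(config[section])
        if unknown:
            raise KeyError(f"unknown keys in {section}: {sorted(unknown)}")
        config[section].update(values)
    return config

def build_multipart(text, config=None):
    """Build the multipart fields in the shape expected by requests' `files=`."""
    files = {"data": (None, text, "text/plain")}
    if config is not None:
        files["config"] = (None, json.dumps(config), "application/json")
    return files

def clean_text_with(text, config=None, url="http://127.0.0.1:8000/preproc"):
    import requests  # imported lazily so the helpers above work without it
    response = requests.post(url, files=build_multipart(text, config), timeout=30)
    response.raise_for_status()
    return response.text

# Keep only the <article> element and leave the original casing intact.
custom = make_config(
    htmlClean={"contentSelector": "article"},
    charNormalization={"lowercase": False},
)
```

Calling `clean_text_with(html, custom)` then sends the patched configuration, while `DEFAULT_CONFIG` itself stays unchanged.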

TODO

Normalize or remove intra-word separators
Replace indicatif with linya
Export and load CLI options as a JSON file

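Wishlist

Speed

Efficient plain-text preprocessor using tokenizers
Use a better text data structure, such as ropey or tendril
Determine the feasibility of processing text as a stream instead of loading the whole file buffer into memory
- See lol-html and html5ever issue #149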

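Features

Implement quality control (minimum and maximum sentence length)
Implement a PDF text extractor using pdf-extract
Implement a docx/pptx/odt text extractor using dotext or docx
Implement a stemmer using rust-stemmers
Implement sentence filtering by desired language using fasttext-rs and a language identification model
Automatically join common multi-word expressions (MWEs) using MITIE (Rust bindings are missing) or phrase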
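
Until quality control lands in the tool itself, a length filter can be applied to the generated corpus as a post-processing step. This is a minimal sketch of what the first item in this list might look like, assuming one sentence per line and length measured in whitespace-separated words; the function name and thresholds are hypothetical.

```python
def filter_by_length(lines, min_words=3, max_words=100):
    """Yield only the lines whose word count lies within [min_words, max_words]."""
    for line in lines:
        n_words = len(line.split())
        if min_words <= n_words <= max_words:
            yield line

sample = [
    "ok",
    "this sentence is long enough",
    "one two three four five six seven",
]
kept = list(filter_by_length(sample, min_words=3, max_words=6))
print(kept)  # ['this sentence is long enough']
```

Because the function is a generator, it can wrap an open file object and filter a large corpus without loading it into memory.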

Interoperability

Python bindings

Dependencies

~23–35MB
~630K SLoC