FoLiA: Rust library (Lib.rs listing)

6 versions

0.0.6	Nov 16, 2020
0.0.5	Sep 29, 2020
0.0.3	Aug 12, 2020
0.0.2	Oct 3, 2019

#864 in Science

30 downloads per month
Used in deepfrog

GPL-3.0+

495KB
8K SLoC

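This is a high-performance Rust library for handling the FoLiA XML format, a rich format for linguistic annotation.

This library is currently in alpha. It can read FoLiA documents and create documents from scratch. Note: validation is not implemented yet! You will need to run another FoLiA validator to make sure your FoLiA documents are valid, because this library does not yet guarantee that it produces valid FoLiA.

For a comparison of FoLiA libraries and a list of implemented features, see FoLiA Implementations.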

Installation

Add folia to your project's Cargo.toml file.
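As a minimal sketch, the dependency entry could look like the fragment below. It assumes you want the most recent release in the version list above (0.0.6); adjust the version to the release you actually target.

```toml
[dependencies]
folia = "0.0.6"
```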

Usage

Reading from a file and querying for all words

extern crate folia; //only needed on the 2015 edition; a second `use folia;` would be redundant

//load document from file
let doc = folia::Document::from_file(filename, folia::DocumentProperties::default()).expect("parsing folia");
//Build a query, here you can match on any attribute
let query = folia::Query::select().element(folia::Cmp::Is(folia::ElementType::Word));
//Turn the query into a specific selector
let selector = folia::Selector::from_query(&doc, &query).expect("selector");

//Run the selector
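for word in doc.select(selector, folia::Recursion::Always) {
    //print the ID (or a placeholder when the word has none) and the text
    println!("{}\t{}",
        word.id().unwrap_or("No-ID"),
        //if text() returns a Result in your version, unwrap it first, e.g. with .expect(...)
        word.text(&folia::TextParameters::default())
    );
}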
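The split between a query and a selector can be illustrated with a toy model that uses only the standard library. Everything in it (Kind, Element, Query, Selector, Doc, sample_doc, the u16 set keys) is hypothetical and is not the folia crate's API. It only sketches one plausible reason for the two stages: the query names things in human-readable terms, and the selector is the same constraint resolved once against a specific document's internal keys, so that iteration compares cheap integers.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for illustration; these are NOT the folia crate's types.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Kind { Sentence, Word }

struct Element { kind: Kind, set: Option<u16>, text: String }

/// Stage 1: a document-independent description that uses readable names.
struct Query { kind: Kind, set: Option<String> }

/// Stage 2: the same constraint, resolved to one document's internal keys.
struct Selector { kind: Kind, set: Option<u16> }

struct Doc { elements: Vec<Element>, set_keys: HashMap<String, u16> }

impl Doc {
    /// Resolve names once, up front; fails if the document does not know the set.
    fn selector_from(&self, query: &Query) -> Result<Selector, String> {
        let set = match &query.set {
            Some(name) => Some(*self.set_keys.get(name)
                .ok_or_else(|| format!("unknown set: {}", name))?),
            None => None,
        };
        Ok(Selector { kind: query.kind, set })
    }

    /// Iterate over matching elements; only integer keys are compared here.
    fn select(&self, selector: Selector) -> impl Iterator<Item = &Element> + '_ {
        self.elements.iter().filter(move |e| {
            e.kind == selector.kind && (selector.set.is_none() || e.set == selector.set)
        })
    }
}

/// A tiny hand-built document: one sentence containing two words in set "adhoc".
fn sample_doc() -> Doc {
    let mut set_keys = HashMap::new();
    set_keys.insert("adhoc".to_string(), 0u16);
    Doc {
        elements: vec![
            Element { kind: Kind::Sentence, set: None, text: "hello world".into() },
            Element { kind: Kind::Word, set: Some(0), text: "hello".into() },
            Element { kind: Kind::Word, set: Some(0), text: "world".into() },
        ],
        set_keys,
    }
}

fn main() {
    let doc = sample_doc();
    let query = Query { kind: Kind::Word, set: Some("adhoc".to_string()) };
    let selector = doc.selector_from(&query).expect("selector");
    for word in doc.select(selector) {
        println!("{}", word.text);
    }
}
```

In this model, a query that names an unknown set fails when the selector is built rather than silently matching nothing during iteration.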

A common pattern is to query in two stages; methods such as get_annotation() and get_annotations() provide shortcuts to select(). Let's print the part-of-speech tags:

//Run the selector
for word in doc.select(selector, folia::Recursion::Always) {
    if let Some(pos) = word.get_annotation(folia::AnnotationType::POS, folia::Cmp::Any, folia::Recursion::No) {
        println!("{}", pos.class().unwrap());
    }
}

We can also create a document from scratch; all new elements can be added with the high-level annotate() method:

//the document must be mutable, since annotate() modifies it
let mut doc = folia::Document::new("example", folia::DocumentProperties::default()).expect("instantiating folia");
let root: folia::ElementKey = 0; //root element always has key 0
//add a sentence, returns its key
let sentence = doc.annotate(root,
                    folia::ElementData::new(folia::ElementType::Sentence).
                    with_attrib(folia::Attribute::Id("s.1".to_string())) ).expect("Adding sentence");

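use folia::{Attribute, ElementData, ElementType}; //bring the unqualified names used below into scope

doc.annotate(sentence,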
             ElementData::new(ElementType::Word)
             .with_attrib(Attribute::Id("word.1".to_string()))
             .with_text("hello".to_string())
            ).expect("Adding word 1");

doc.annotate(sentence,
             ElementData::new(ElementType::Word)
             .with_attrib(Attribute::Id("word.2".to_string()))
             .with_text("world".to_string())
            ).expect("Adding word 2");

Let's add a named entity that spans the two words above:

doc.annotate(sentence,
             ElementData::new(ElementType::Entity)
             .with_attrib(Attribute::Set("adhoc".to_string()))
             .with_attrib(Attribute::Class("greeting".to_string()))
             .with_span(&[ "word.1", "word.2" ])
).expect("adding entity");

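Note the role of the first argument (sentence) here: because the span is provided explicitly, annotate() will work out for itself where to add the layer (if one is needed).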

If you have an element's key (a numeric internal identifier), you can easily obtain a FoliaElement instance:

if let Some(element) = doc.get_element(key) {

}

If you have its public ID, you can do this instead:

if let Some(element) = doc.get_element_by_id("example.s.1.w.1") {

}
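The two lookups above, by numeric key and by ID, can be pictured with a small standard-library model. This is a sketch under assumptions, not the library's implementation: Store, Element, add, get and get_by_id are hypothetical names, and the model simply treats a key as a position in a vector (so the first element added gets key 0, in line with the root element noted earlier) and keeps a separate map from IDs to keys.

```rust
use std::collections::HashMap;

// Hypothetical model for illustration; NOT the folia crate's types or methods.
type ElementKey = usize;

struct Element { id: Option<String>, text: String }

#[derive(Default)]
struct Store {
    elements: Vec<Element>,                // an element's key is its position here
    id_index: HashMap<String, ElementKey>, // ID -> key
}

impl Store {
    /// Append an element and return its key; index the ID if there is one.
    fn add(&mut self, id: Option<&str>, text: &str) -> ElementKey {
        let key = self.elements.len();
        if let Some(id) = id {
            self.id_index.insert(id.to_string(), key);
        }
        self.elements.push(Element { id: id.map(str::to_string), text: text.to_string() });
        key
    }

    /// Lookup by key is a plain bounds-checked vector access.
    fn get(&self, key: ElementKey) -> Option<&Element> {
        self.elements.get(key)
    }

    /// Lookup by ID goes through the index first, then reuses the key lookup.
    fn get_by_id(&self, id: &str) -> Option<&Element> {
        self.id_index.get(id).and_then(|&key| self.get(key))
    }
}

fn main() {
    let mut store = Store::default();
    let root = store.add(None, ""); // first element, so its key is 0
    let word = store.add(Some("example.s.1.w.1"), "hello");
    println!("root key = {}, word key = {}", root, word);
    if let Some(element) = store.get_by_id("example.s.1.w.1") {
        println!("{:?} -> {}", element.id, element.text);
    }
}
```

Both lookups return an Option, which is why the snippets above wrap them in if let Some(...).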

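Declarations

All annotation types need to be declared in FoLiA, but the library does this for you automatically as long as you do not set DocumentProperties.autodeclare to false. Explicit declarations are made with Document.declare(). Here is a simple set-less declaration: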

doc.declare(folia::AnnotationType::SENTENCE, &None, &None, &None);

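Here is a more explicit example that declares a set:

//the set is passed by reference, like the other optional arguments
doc.declare(folia::AnnotationType::POS, &Some("https://somewhere/my/pos/set".to_string()), &None, &None);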

Provenance

FoLiA v2 has extensive support for provenance, and this library implements it as well. You can associate an active processor by setting it in folia::DocumentProperties.

let processor = Processor::new("test".to_string()).autofill();
let doc = Document::new("example", DocumentProperties::default().with_processor(processor)).expect("document");

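You can switch processors at any time with doc.active_processor(processor_key). Any declarations made after activating a processor are automatically assigned to that processor.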
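The rule that declarations inherit the currently active processor can be sketched with a toy model. The names below (Provenance, ProcKey, add_processor, activate, declare, and the processor names "tokenizer" and "tagger") are hypothetical and do not reflect the folia crate's API; the sketch only shows the bookkeeping that the description implies.

```rust
// Hypothetical model for illustration; NOT the folia crate's types or methods.
type ProcKey = usize;

#[derive(Default)]
struct Provenance {
    processors: Vec<String>,
    active: Option<ProcKey>,
    // each declaration remembers which processor was active when it was made
    declarations: Vec<(String, Option<ProcKey>)>,
}

impl Provenance {
    /// Register a processor and return its key (its position in the list).
    fn add_processor(&mut self, name: &str) -> ProcKey {
        self.processors.push(name.to_string());
        self.processors.len() - 1
    }

    /// Switch the active processor.
    fn activate(&mut self, key: ProcKey) {
        self.active = Some(key);
    }

    /// Record a declaration, stamped with whatever processor is active right now.
    fn declare(&mut self, annotation_type: &str) {
        self.declarations.push((annotation_type.to_string(), self.active));
    }
}

fn main() {
    let mut prov = Provenance::default();
    prov.declare("sentence"); // no processor is active yet
    let tokenizer = prov.add_processor("tokenizer");
    prov.activate(tokenizer);
    prov.declare("pos"); // assigned to "tokenizer"
    for (annotation_type, processor) in &prov.declarations {
        let name = processor.map(|key| prov.processors[key].as_str()).unwrap_or("(none)");
        println!("{} declared by {}", annotation_type, name);
    }
}
```

Declarations made before any processor is activated stay unassigned in this model; whether the real library behaves the same way in that case is not stated above.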

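Benchmarks

Since the main goal of this library is high performance, we ran some limited benchmarks against the more mature and more feature-complete FoLiA libraries: FoliaPy (written in Python) and libfolia (written in C++).

Tested on an Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz, Linux 5.3

Note: the folia-rust implementation performs only minimal validation, whereas the other libraries perform full shallow validation at parse time, including text consistency validation.

Benchmarks on a FoLiA document of roughly 100 MB

(bosb002gide03_01.nederlab.folia.xml)

Parsing the file into a full in-memory representation (DOM)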

Implementation	CPU	Memory	Peak memory
foliapy v2.2.1	60.9 s	2083 MB	-
libfolia v2.3	14.7 s	2656 MB	2681 MB
folia-rust v0.0.1	2.6 s	531 MB	622 MB

Selecting and iterating over all words

Implementation	CPU	Memory	Peak memory
foliapy v2.2.1	1.46 s	-	-
libfolia v2.3	0.84 s	-	-
folia-rust v0.0.1	0.122 s	-	-

Serialisation (without writing to disk)

Implementation	CPU	Memory	Peak memory
foliapy v2.2.1	77.7 s	-	-
libfolia v2.3	5.06 s	-	-
folia-rust v0.0.1	1.14 s	-	-
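To put the three tables side by side, the CPU times can be turned into speed-up factors. The helper below is only arithmetic on the numbers reported above (the speedup function and the task labels are names chosen for this sketch); it says nothing beyond what the tables already contain, and the caveat about folia-rust doing less validation still applies.

```rust
/// How many times faster the second timing is than the first (both in seconds).
fn speedup(other_secs: f64, ours_secs: f64) -> f64 {
    other_secs / ours_secs
}

fn main() {
    // (task, foliapy s, libfolia s, folia-rust s), copied from the tables above
    let rows = [
        ("parse", 60.9, 14.7, 2.6),
        ("select words", 1.46, 0.84, 0.122),
        ("serialise", 77.7, 5.06, 1.14),
    ];
    for (task, py, cpp, rs) in rows.iter() {
        println!(
            "{:<14} vs foliapy: {:.1}x   vs libfolia: {:.1}x",
            task,
            speedup(*py, *rs),
            speedup(*cpp, *rs)
        );
    }
}
```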

Dependencies

~5.5MB
~90K SLoC