2 个版本
0.1.1 | 2021 年 4 月 7 日 |
---|---|
0.1.0 | 2020 年 10 月 19 日 |
在 科学 分类中排名第 152
84KB
1.5K SLoC
Treebender
Treebender 是一个受 HDPSG 启发的 Rust 符号自然语言解析库。
这是什么?
这是一个用于将自然语言或构造语言解析成语法树和特征结构的库。它没有机器学习或概率模型,一切都是手工制作的,确定性的。
你可以在这篇博客文章中了解更多关于这个项目的动机。
但你是用它来做什么的?
我正在用它来解析我即将推出的外星语言学游戏 Themengi 的构造语言。
动机
使用下面教程中介绍的简单 80 行语法,我们可以解析英语的简单子集,检查反身代词约束、格和数的一致性。
$ cargo run --bin cli examples/reflexives.fgr
> she likes himself
Parsed 0 trees
> her likes herself
Parsed 0 trees
> she like herself
Parsed 0 trees
> she likes herself
Parsed 1 tree
(0..3: S
(0..1: N (0..1: she))
(1..2: TV (1..2: likes))
(2..3: N (2..3: herself)))
[
child-2: [
case: acc
pron: ref
needs_pron: #0 she
num: sg
child-0: [ word: herself ]
]
child-1: [
tense: nonpast
child-0: [ word: likes ]
num: #1 sg
]
child-0: [
child-0: [ word: she ]
case: nom
pron: #0
num: #1
]
]
低资源语言?没有问题!不需要在数十亿字节的文本上训练,只需用你的大脑编写语法。假设在美国手语中,主题化的名词(用扬起的眉毛表达)必须出现在句子的开头。我们可以编写一个小语法(18 行),并插入一些句子
$ cargo run --bin cli examples/asl-wordorder.fgr -n
> boy sit
Parsed 1 tree
(0..2: S
(0..1: NP ((0..1: N (0..1: boy))))
(1..2: IV (1..2: sit)))
> boy throw ball
Parsed 1 tree
(0..3: S
(0..1: NP ((0..1: N (0..1: boy))))
(1..2: TV (1..2: throw))
(2..3: NP ((2..3: N (2..3: ball)))))
> ball nm-raised-eyebrows boy throw
Parsed 1 tree
(0..4: S
(0..2: NP
(0..1: N (0..1: ball))
(1..2: Topic (1..2: nm-raised-eyebrows)))
(2..3: NP ((2..3: N (2..3: boy))))
(3..4: TV (3..4: throw)))
> boy throw ball nm-raised-eyebrows
Parsed 0 trees
教程
以一个例子为例,假设我们想要构建一个英语反身代词(他自己、她自己、他们自己、他们自己、它自己)的解析器。我们还将支持数量(“他喜欢 X” 与 “他们喜欢 X”)和简单的嵌入式子句(“他说他们喜欢 X”)。
语法文件是用类似于 BNF 的自定义语言编写的,称为特征语法 (.fgr)。有一个 VSCode 语法高亮扩展可用于这些文件,可在 fgr-syntax
找到。
我们将首先定义我们的词汇表。词汇表是语法将匹配的终端符号(实际输入中的符号)的集合。终端符号必须以小写字母开头,非终端符号必须以大写字母开头。
// pronouns
N -> he
N -> him
N -> himself
N -> she
N -> her
N -> herself
N -> they
N -> them
N -> themselves
N -> themself
// names, lowercase as they are terminals
N -> mary
N -> sue
N -> takeshi
N -> robert
// complementizer
Comp -> that
// verbs -- intransitive, transitive, and clausal
IV -> falls
IV -> fall
IV -> fell
TV -> likes
TV -> like
TV -> liked
CV -> says
CV -> say
CV -> said
接下来,我们可以添加我们的句子规则(它们必须添加在顶部,因为文件中的第一个规则被认为是顶级规则)
// sentence rules
S -> N IV
S -> N TV N
S -> N CV Comp S
// ... previous lexicon ...
假设这个文件保存为 examples/no-features.fgr
(是的,就是它 😉),我们可以使用内置的 CLI 测试此文件
$ cargo run --bin cli examples/no-features.fgr
> he falls
Parsed 1 tree
(0..2: S
(0..1: N (0..1: he))
(1..2: IV (1..2: falls)))
[
child-1: [ child-0: [ word: falls ] ]
child-0: [ child-0: [ word: he ] ]
]
> he falls her
Parsed 0 trees
> he likes her
Parsed 1 tree
(0..3: S
(0..1: N (0..1: he))
(1..2: TV (1..2: likes))
(2..3: N (2..3: her)))
[
child-2: [ child-0: [ word: her ] ]
child-1: [ child-0: [ word: likes ] ]
child-0: [ child-0: [ word: he ] ]
]
> he likes
Parsed 0 trees
> he said that he likes her
Parsed 1 tree
(0..6: S
(0..1: N (0..1: he))
(1..2: CV (1..2: said))
(2..3: Comp (2..3: that))
(3..6: S
(3..4: N (3..4: he))
(4..5: TV (4..5: likes))
(5..6: N (5..6: her))))
[
child-0: [ child-0: [ word: he ] ]
child-2: [ child-0: [ word: that ] ]
child-1: [ child-0: [ word: said ] ]
child-3: [
child-2: [ child-0: [ word: her ] ]
child-1: [ child-0: [ word: likes ] ]
child-0: [ child-0: [ word: he ] ]
]
]
> he said that he
Parsed 0 trees
该语法已经解析了一些正确的句子,并阻止了一些显而易见的错误句子。然而,目前它不考虑数字、大小写或反身代词
> she likes himself // unbound reflexive pronoun
Parsed 1 tree
(0..3: S
(0..1: N (0..1: she))
(1..2: TV (1..2: likes))
(2..3: N (2..3: himself)))
[
child-0: [ child-0: [ word: she ] ]
child-2: [ child-0: [ word: himself ] ]
child-1: [ child-0: [ word: likes ] ]
]
> him like her // incorrect case on the subject pronoun, should be nominative
// (he) instead of accusative (him)
Parsed 1 tree
(0..3: S
(0..1: N (0..1: him))
(1..2: TV (1..2: like))
(2..3: N (2..3: her)))
[
child-0: [ child-0: [ word: him ] ]
child-1: [ child-0: [ word: like ] ]
child-2: [ child-0: [ word: her ] ]
]
> he like her // incorrect verb number agreement
Parsed 1 tree
(0..3: S
(0..1: N (0..1: he))
(1..2: TV (1..2: like))
(2..3: N (2..3: her)))
[
child-2: [ child-0: [ word: her ] ]
child-1: [ child-0: [ word: like ] ]
child-0: [ child-0: [ word: he ] ]
]
要修复这个问题,我们需要向我们的词汇表中添加 特性,并基于特性限制句子规则。
特性以方括号添加,并且是逗号分隔的键值对。 **top**
是一个特殊特性值,基本上意味着“未指定”——我们稍后再讨论。未指定的特性也假定具有 **top**
值,但有时明确地指出 top 更清晰。
/// Pronouns
// The added features are:
// * num: sg or pl, whether this noun wants a singular verb (likes) or
// a plural verb (like). note this is grammatical number, so for example
// singular they takes plural agreement ("they like X", not *"they likes X")
// * case: nom or acc, whether this noun is nominative or accusative case.
// nominative case goes in the subject, and accusative in the object.
// e.g., "he fell" and "she likes him", not *"him fell" and *"her likes he"
// * pron: he, she, they, or ref -- what type of pronoun this is
// * needs_pron: whether this is a reflexive that needs to bind to another
// pronoun.
N[ num: sg, case: nom, pron: he ] -> he
N[ num: sg, case: acc, pron: he ] -> him
N[ num: sg, case: acc, pron: ref, needs_pron: he ] -> himself
N[ num: sg, case: nom, pron: she ] -> she
N[ num: sg, case: acc, pron: she ] -> her
N[ num: sg, case: acc, pron: ref, needs_pron: she] -> herself
N[ num: pl, case: nom, pron: they ] -> they
N[ num: pl, case: acc, pron: they ] -> them
N[ num: pl, case: acc, pron: ref, needs_pron: they ] -> themselves
N[ num: sg, case: acc, pron: ref, needs_pron: they ] -> themself
// Names
// The added features are:
// * num: sg, as people are singular ("mary likes her" / *"mary like her")
// * case: **top**, as names can be both subjects and objects
// ("mary likes her" / "she likes mary")
// * pron: whichever pronoun the person uses for reflexive agreement
// mary pron: she => mary likes herself
// sue pron: they => sue likes themself
// takeshi pron: he => takeshi likes himself
N[ num: sg, case: **top**, pron: she ] -> mary
N[ num: sg, case: **top**, pron: they ] -> sue
N[ num: sg, case: **top**, pron: he ] -> takeshi
N[ num: sg, case: **top**, pron: he ] -> robert
// Complementizer doesn't need features
Comp -> that
// Verbs -- intransitive, transitive, and clausal
// The added features are:
// * num: sg, pl, or **top** -- to match the noun numbers.
// **top** will match either sg or pl, as past-tense verbs in English
// don't agree in number: "he fell" and "they fell" are both fine
// * tense: past or nonpast -- this won't be used for agreement, but will be
// copied into the final feature structure, and the client code could do
// something with it
IV[ num: sg, tense: nonpast ] -> falls
IV[ num: pl, tense: nonpast ] -> fall
IV[ num: **top**, tense: past ] -> fell
TV[ num: sg, tense: nonpast ] -> likes
TV[ num: pl, tense: nonpast ] -> like
TV[ num: **top**, tense: past ] -> liked
CV[ num: sg, tense: nonpast ] -> says
CV[ num: pl, tense: nonpast ] -> say
CV[ num: **top**, tense: past ] -> said
现在,我们的词汇表已经更新了特性,我们可以更新我们的句子规则,以基于这些特性约束解析。这使用了两个新特性,标签和统一。标签允许在规则中的节点之间关联特性,而统一控制这些特性的兼容性。统一的规则如下:
- 字符串特性可以与具有相同值的字符串特性统一
- top 特性可以与任何内容统一,并且节点被合并
- 复杂特性([ ... ] 结构)递归地与另一个复杂特性统一。
如果统一失败,解析将中止并丢弃树。这允许程序员在特性不匹配时丢弃树。
// Sentence rules
// Intransitive verb:
// * Subject must be nominative case
// * Subject and verb must agree in number (copied through #1)
S -> N[ case: nom, num: #1 ] IV[ num: #1 ]
// Transitive verb:
// * Subject must be nominative case
// * Subject and verb must agree in number (copied through #2)
// * If there's a reflexive in the object position, make sure its `needs_pron`
// feature matches the subject's `pron` feature. If the object isn't a
// reflexive, then its `needs_pron` feature will implicitly be `**top**`, so
// will unify with anything.
S -> N[ case: nom, pron: #1, num: #2 ] TV[ num: #2 ] N[ case: acc, needs_pron: #1 ]
// Clausal verb:
// * Subject must be nominative case
// * Subject and verb must agree in number (copied through #1)
// * Reflexives can't cross clause boundaries (*"He said that she likes himself"),
// so we can ignore reflexives and delegate to inner clause rule
S -> N[ case: nom, num: #1 ] CV[ num: #1 ] Comp S
现在我们有了这个增强的语法(作为 examples/reflexives.fgr
提供),我们可以尝试它并看到它拒绝了一些之前接受的非法句子,同时仍然接受有效的句子
> he fell
Parsed 1 tree
(0..2: S
(0..1: N (0..1: he))
(1..2: IV (1..2: fell)))
[
child-1: [
child-0: [ word: fell ]
num: #0 sg
tense: past
]
child-0: [
pron: he
case: nom
num: #0
child-0: [ word: he ]
]
]
> he like him
Parsed 0 trees
> he likes himself
Parsed 1 tree
(0..3: S
(0..1: N (0..1: he))
(1..2: TV (1..2: likes))
(2..3: N (2..3: himself)))
[
child-1: [
num: #0 sg
child-0: [ word: likes ]
tense: nonpast
]
child-2: [
needs_pron: #1 he
num: sg
child-0: [ word: himself ]
pron: ref
case: acc
]
child-0: [
child-0: [ word: he ]
pron: #1
num: #0
case: nom
]
]
> he likes herself
Parsed 0 trees
> mary likes herself
Parsed 1 tree
(0..3: S
(0..1: N (0..1: mary))
(1..2: TV (1..2: likes))
(2..3: N (2..3: herself)))
[
child-0: [
pron: #0 she
num: #1 sg
case: nom
child-0: [ word: mary ]
]
child-1: [
tense: nonpast
child-0: [ word: likes ]
num: #1
]
child-2: [
child-0: [ word: herself ]
num: sg
pron: ref
case: acc
needs_pron: #0
]
]
> mary likes themself
Parsed 0 trees
> sue likes themself
Parsed 1 tree
(0..3: S
(0..1: N (0..1: sue))
(1..2: TV (1..2: likes))
(2..3: N (2..3: themself)))
[
child-0: [
pron: #0 they
child-0: [ word: sue ]
case: nom
num: #1 sg
]
child-1: [
tense: nonpast
num: #1
child-0: [ word: likes ]
]
child-2: [
needs_pron: #0
case: acc
pron: ref
child-0: [ word: themself ]
num: sg
]
]
> sue likes himself
Parsed 0 trees
如果您对此感兴趣并想了解更多,您可以查看我的博客系列 Symbolic Linguistics Part1、优秀的教科书 Syntactic Theory: A Formal Introduction (2nd ed.) 以及 DELPH-IN 项目,其 LKB 的工作启发了这个简化版本。
从代码中使用
我需要更详细地编写这部分内容,但如果您对 Rust 感到舒适,我建议查看代码库。它不是完美的,它是我的第一个 Rust 项目之一(在经历了从 F# -> TypeScript -> C 的迁移后,以寻找正确的性能/舒适度权衡),它还需要更多的测试,但总体来说还不错。
基本上,处理流程是
- 创建一个
Grammar
结构
Grammar
定义在rules.rs
。- 创建
Grammar
的最简单方法是Grammar::parse_from_file
,它主要是parse_grammar.rs
中的手写递归下降解析器。是的,我意识到这里的讽刺。
- 它接受输入(在
Grammar::parse
中,为你做所有事情,或者在Grammar::parse_chart
中,只做图) - 输入首先在
earley.rs
中进行图解析 - 然后,从图表中构建了一个森林,在
forest.rs
中,使用了我在一个非常有用的博客系列中找到的算法,因为我忘记那个博客的URL了,因为这个学术文献中的算法……很奇怪。 - 最后,使用特征统一来修剪森林,只保留有效的树。在解析过程中做这个会更有效率,但算了。
通过代码而不是通过命令行可以做最有趣的事情可能是获取原始特征有向图,因为这样你就可以做像代词指代消解这样的事情。有向图的代码在 featurestructure.rs
中,应该相对容易理解--围绕 Rc<RefCell<...>>
的Rust仪式有很多,因为使用一个游戏分配crate似乎 太过了,但 NodeRef
类型别名多少减轻了这一点。如果你在这方面需要帮助,请访问 https://vgel.me/contact!
许可证
遵循MIT许可证。
贡献
除非你明确声明,否则任何有意提交以包含在作品中的贡献都应按上述方式许可,没有其他额外条款或条件。
依赖项
~2.2–3MB
~54K SLoC