rusty-peg — Rust库 // Lib.rs • Rust 包仓库

6个版本 (3个重大更新)

使用旧的Rust 2015

0.4.0	2015年6月2日
0.3.0	2015年7月26日
0.2.0	2015年6月16日
0.0.3	2015年5月28日

1984 在 Rust模式中

每月65次下载
用于 perf-focus

Unlicense

49KB
1K SLoC

Rusty PEG是一套宏和库代码，旨在使Rust中编写解析器变得非常容易。其名称来源于该库旨在生成使用解析表达式语法方案的解析器。Rusty PEG可在Cargo中找到。

使用PEG

要使用PEG，将以下内容添加到您的Cargo.toml

[dependencies]
rusty-peg = 0.0

然后将其添加到项目中lib.rs文件中

#[macro_use]
extern crate rusty_peg;

最后，在您的代码中某处使用rusty_peg宏（如下所示）

rusty_peg! {
    parser MyParser<'input> { ... }
}

Rusty PEG简明教程

在Rusty PEG中定义语法是通过rusty_peg!宏完成的。本节通过计算器示例解释了该宏的工作原理：计算器接受类似3+5*11/3+1的字符串作为输入并计算结果（22），同时尊重运算顺序等。

每次使用rusty_peg!宏都会定义一个特定的类型，称为解析器（以及各种其他类型）。这个解析器可以被实例化并用于解析特定的输入。在这种情况下，我们希望将我们的解析器类型称为Calculator，如下所示

rusty_peg! {
    // Name of the type for the parser you are defining.
    // The macro will generate a struct named, in this case,
    // `Calculator`. This struct will have one lifetime parameter,
    // named `'input`, which represents the lifetime of the
    // input text. This lifetime may appear in the types of nonterminals
    // (which might match a slice of the input, for example) and other
    // such cases.
    parser Calculator<'input> {
        ... // symbol definitions come here; read on
    }
}

在解析器定义中，会有一系列的符号定义。一个符号仅仅是语法中的一部分，它匹配输入的一部分并从该结构中产生一些结果（你可能熟悉“终结符”和“非终结符”这两个术语——符号实际上是两者的并集，因为PEG语法不会在词法分析和语法分析之间做出区分）。这个结果可能是一个数据结构，例如编译器中的AST，或者其他东西。在我们的计算器例子中，输出是计算结果，所以结果类型将是u32。

rusty_peg! {
    parser Calculator<'input> {
    
        // Every symbol definition starts out out with a name (here, `EXPR`)
        // and a type (here, `u32`), an `=` sign, and then a definition.
        // There are various kinds of definitions you can do. This first
        // definition is the simplest; it just defines a kind of shorthand,
        // saying that an "EXPR" is the same as an "ADD_SUB_EXPR" (another
        // symbol, defined below).
        EXPR: u32 = ADD_SUB_EXPR;
        
        // This is an example of a "map" nonterminal. "Map" first match a
        // series of other strings and symbols, and then allow you to
        // write some arbitraryRust code that will execute and produce the
        // result. In this case, we are defining `PAREN_EXPR`, which will
        // match an open parentheses, an arbirtary expression, and then
        // a close parentheses. The notation `<e:EXPR>` defines a variable
        // (`e`) which can be referenced in the expression code.
        // In this case, the expression is just `e`, meaning that the value
        // of a paren expression is the same as the expression without the parens.
        PAREN_EXPR: u32 = ("(", <e:EXPR>, ")") => e;

        // This is an example of a "fold" nonterminal. Fold nonterminals are a
        // special form that first match one instance of some base form (in this case,
        // `<lhs:MUL_DIV_EXPR>`) and then match zero or more instances of various
        // extensions. This is a common pattern that arises in expressions
        // in particular. Each time an extension is parsed, a little bit of custom
        // code runs to produce a new result. In this case, `ADD_SUB_EXPR` is
        // basically parsing a series like:
        //
        //     MUL_DIV_EXPR + MUL_DIV_EXPR - MUL_DIV_EXPR
        //
        // The tiered structure you see here is intended to enforce the
        // order of operations: all multiplications and divisions will be performed
        // before we parse the `+` and `-` operators.
        ADD_SUB_EXPR: u32 =
            fold(<lhs:MUL_DIV_EXPR>,
                 ("+", <rhs:MUL_DIV_EXPR>) => { lhs + rhs },
                 ("-", <rhs:MUL_DIV_EXPR>) => { lhs - rhs });

        // Another fold definition, this time for `*` and `/`.
        MUL_DIV_EXPR: u32 =
            fold(<lhs:ATOM_EXPR>,
                 ("*", <rhs:ATOM_EXPR>) => { lhs * rhs },
                 ("/", <rhs:ATOM_EXPR>) => { lhs / rhs });

        // `ATOM_EXPR` is the base expression form. It uses the `/` operator, which
        // is called "ordered choice". Basically the parser will attempt the various
        // options listed, in order, and take the first option which matches.
        //
        // Note that `/` plays the same role as the `|` operator in traditional CFGs,
        // but it has quite different semantics.
        //
        // IT IS IMPORTANT TO ORDER YOUR CHOICES CORRECTLY!
        ATOM_EXPR: u32 =
            (NUMBER / PAREN_EXPR);

        // The next two symbols define a number. This is done by combining
        // a "regex" symbol (the final kind of symbol) with a map. Currently this
        // cannot be done in one step, though it seems like it'd be a nice addition. :)
        //
        // Regex symbols match a given regular expression against the input. They
        // always produce the type `&'input str` -- which is a slice from the input
        // string. They are based on the regex crate.
        NUMBER: u32 =
            (<s:NUMBER_STRING>) => u32::from_str(s).unwrap();

        NUMBER_STRING: &'input str =
            regex(r"[0-9]+");
    }
}

运行解析器

目前解析器是通过创建解析器类型的实例，并在符号类型之一上调用parse_complete方法来执行的，如本测试所示。

#[test]
fn parse_expr() {
    let mut classy = Calculator::new(());
    let result =
        EXPR.parse_complete(&mut classy, "3 + 5*11/3 + 1")
            .unwrap();
    assert_eq!(result, 22);
}

缓存

PEG工作原理的一部分是它们会缓存中间结果。这意味着所有符号类型都必须是可克隆的，并且克隆应该便宜，因为每个符号都会存储在缓存中，并且可能多次从该缓存中克隆出来。如果需要，请使用Rc。

符号类型更详细地说明

实际上没有多少比上面看到的更多的内容要说，但还有一些细节。首先，让我们定义一个符号定义的语法。我们将使用修改过的BNF形式。

SYMBOL = IDENTIFIER ":" TYPE "=" DEFINITION ";"
DEFINITION = ITEM ["=>" EXPR]
           | "regex" "(" EXPR ")" ["-" "[" EXPR {"," EXPR} "]"]
           | "fold" "(" "<" IDENTIFIER ":" ITEM ">",
                        ITEM "=>" EXPR,
                        {"," ITEM "=>" EXPR} ")"

项目和项目列表

上面的定义经常引用“ITEM”，它们是我们语法定义的构建块。各种项目类型如下：

像EXPR或NUMBER这样的标识符：符号类型的名称，要么是在语法中定义的，要么是由用户定义的（下面将介绍自定义符号类型）。
像"foo"这样的字符串：匹配那个确切的字符串。
(ITEMS)：匹配项目列表ITEMS（下面将介绍项目列表）。
[ITEMS]：匹配项目列表ITEMS，但可选。类型为Option<T>，其中T是项目列表的类型。
{ITEMS}：匹配项目列表ITEMS零次或多次，类型为Vec<T>，其中T是项目列表的类型。
{+ ITEMS}：匹配项目列表ITEMS一次或多次，类型为Vec<T>，其中T是项目列表的类型。

一个项目列表 ITEMS 的形式是一个单独的项目 ITEM，或者是一个逗号分隔的列表，例如 ITEM0, ITEM1, ITEM2。后者具有类型 (T0,T1,T2)，其中 Tn 是 ITEMn 的类型。

[ITEM]（或 [ITEM, ITEM]）是一个可选的项目匹配。它的类型是 Option<T>，其中 T 是 ITEM 的类型。
{ITEM,} 定义了零个或多个实例

定义自定义符号类型

可以通过手动实现 Symbol 特性来自定义符号类型。文档“即将推出”（或者可能不会那么快，具体情况而定），或者至少一个展示如何实现的测试。

正则表达式符号

正则表达式符号还可以包含一个可选的“排除列表”，如在 Classy 示例中所示。这主要用于从标识符列表中排除关键字。

        ID: &'input str =
            regex(r"^[a-zA-Z_][a-zA-Z_0-9]*") - ["class"];

未来方向/想要帮忙？

这个库非常不稳定。我很乐意得到反馈。如果你有关于如何改进 Rusty PEG 的想法，我很乐意听听（或者看到 PR，无论如何）。以下是我对可能在未来进行的更改的一些想法

编写更多的测试；测试套件远未详尽 :)
消除逗号操作符的需要；这被 rustc 中的一个错误阻止了，该错误已修复但尚未发布到稳定版本
将解析器构建与缓存构建分开。
也许让宏生成辅助函数来执行解析，这样用户就不必编写那么多无聊的代码来调用解析器。
更加关注错误消息，并允许用户自定义错误恢复。
也许提供一种停止在所有地方隐式跳过空白的方法。
添加 & 和 ! 操作符。
稍微泛化宏，例如允许在正则表达式和折叠的左侧统一使用 => 映射表达式。

提供一种更简单的方法来定义关键字列表。某种方式来检查库的歧义性（换句话说，检查 / 和 | 操作符是否等价）。

我也对探索自底向上的解析器和分割分词器等感兴趣，但这将是另一回事，可能是一个不同的项目。

依赖关系 ~3.5MB ~75K SLoC regex 0.1