#parser-combinator #grammar #combinator #parser #generator #syntax-tree #tdpl

mpl

Minimal Parsing Language (MPL) 的最小解析组合器,类似于 Top-Down Parsing Language (TDPL)

2 个不稳定版本

0.2.0 2022年2月13日
0.1.0 2021年6月29日

#75 in 解析工具


2 crates 中使用

MIT/Apache

49KB
1K SLoC

Minimal Parsing Language (MPL)

Crate API

这是类似于 Top-Down Parsing Language (TDPL) 的 Minimal Parsing Language (MPL) 的最小解析组合器。它为每个输入创建一个抽象语法树 (AST)。

入门

  1. 实现 Variable
  2. 将每个规则插入到 HashMap
  3. minimal_parse()
  • 可选
    • 实现 Input
      • 默认支持 [T]str
    • 实现 Position
      • 默认支持 u*, i*, 和 f*
    • 实现 Span
      • 默认支持 StartAndLenSpan
    • 实现 Terminal
      • 默认支持 SliceTerminal, StrTerminal, 和 U8SliceTerminal
    • 实现 Output
      • 默认支持 ()
    • 实现 Rules
      • 默认支持 HashMap
    • 实现 Parse
      • 默认支持 [T], str, 和 [u8]

示例

use crate::ParenthesesVariable::*;
use mpl::parser::Parser;
use mpl::rules::{RightRule, RightRuleKind::*, Rules};
use mpl::span::{StartAndLenSpan, Start, Len};
use mpl::output::Output;
use mpl::symbols::{StrTerminal, StrTerminal::*, Variable};
use mpl::trees::AST;
use std::collections::HashMap;

#[derive(Clone, Debug, Hash, Eq, PartialEq)]
enum ParenthesesVariable {
    Open,
    Parentheses,
    Close,
}

impl Variable for ParenthesesVariable {}

struct ParenthesesParser;

impl<'i, V, P, L, R, O> Parser<'i, str, StrTerminal<'i>, V, StartAndLenSpan<P, L>, P, R, O>
    for ParenthesesParser
where
    V: Variable,
    P: Start<str, L>,
    L: Len<str, P>,
    R: Rules<StrTerminal<'i>, V>,
    O: Output<'i, str, V, StartAndLenSpan<P, L>>,
{
}

/// ```
/// Open = '(' Parentheses / ()
/// Parentheses = Open Close / f
/// Close = ")" Open / f
/// ```
fn main() {
    let parser = ParenthesesParser;
    let mut rules = HashMap::new();

    rules.insert(
        Open,
        RightRule::from_right_rule_kind((T(Char('(')), V(Parentheses)), Empty),
    );
    rules.insert(
        Parentheses,
        RightRule::from_right_rule_kind((V(Open), V(Close)), Failure),
    );
    rules.insert(
        Close,
        RightRule::from_right_rule_kind((T(Str(")")), V(Open)), Failure),
    );

    let input = "(()(()))";

    // all of the span
    let all_of_the_span = StartAndLenSpan::<u32, u16>::from_start_len(0, input.len() as u16);

    let result: Result<
        AST<ParenthesesVariable, StartAndLenSpan<u32, u16>, ()>,
        AST<ParenthesesVariable, StartAndLenSpan<u32, u16>, ()>,
    > = parser.parse(input, &rules, &Open, &all_of_the_span);

    if let Ok(ast) = result {
        println!("{}", ast);
    }
}

测试示例

用 MPL 编写的解析器

  • WAV AST : RIFF 波形音频格式

MPL

MPL 语法定义

一个 MPL 语法 G 是一个元组 G = (V, Σ, R, S),其中

  • V 是一个有限变量集。
  • Σ 是一个有限的原终结符集合。
  • T 是 Σ 或 M 的并集(Σ ∪ M)(M = {(), f} 是一个有限的全称符号集合)。
  • R 是形式为的有限规则集合
    • A=B C/D
      A ∈ V,(A 属于 V),
      B, C, D ∈ E(E = T ∪ V)(T ∩ V = ∅)(B, C, D ∈ E)。
      对于任何变量 A,都有一个规则,其左侧是 =
  • S ∈ V(S 属于 V)是起始变量。

() 是一个总是成功而不消耗输入的全称符号。

Empty = () () / ()

失败

f 是一个总是失败而不消耗输入的全称符号。

Failure = f f / f

扩展 MPL

由于 MPL 的目标之一是创建一个 AST,因此它在易用性和速度方面也支持两个特性。

任何

? 是一个表示任何单个输入(如通配符字符)的全称符号。如果还有输入,则成功,如果没有输入,则失败。

Any = ? () / f

为了扩展 MPL 语法定义,让 ? ∈ M。

所有

* 是一个表示所有剩余输入(如通配符字符)的全称符号。即使剩余输入为零,也会成功。

All = * () / f

All = ? All / () 相同。

为了扩展 MPL 语法定义,让 * ∈ M。

TDPL 和 MPL 的区别

这两种语法的最大区别在于规则形式。TDPL 有两种规则形式。

A..BC/D,A, B, C, D 属于 V。
A..a,a ∈ Σ ∪ {ε, f},f 是不在 Σ 中的全称符号,ε 是空字符串。

另一方面,MPL 只有一种规则形式。

MPLG(MPL 语法)语法

在 MPL 语法中

// Hierarchical syntax
Mplg = ZeroOrMoreLines () / f
ZeroOrMoreLines = Line ZeroOrMoreLines / ()

Line = Line1 EndOfLine / f
Line1 = LineComment () / Line2
Line2 = Rule () / ()

Rule = Variable Rule1 / f
Rule1 = " = " Rule2 / f
Rule2 = E Rule3 / f
Rule3 = Space Rule4 / f
Rule4 = E Rule5 / f
Rule5 = " / " Rule6 / f
Rule6 = E () / f
E = TerminalSymbol () / Variable


// Lexical syntax
// Variable
Variable = Identifier () / f

// Terminal symbol
TerminalSymbol = MetasymbolLiteral () / OriginalSymbolExpr

// Expr
Expr = ExprWithoutBlock () / f

// Without Block
ExprWithoutBlock = LiteralExpr () / ExprWithoutBlock1
ExprWithoutBlock1 = StructExpr () / f

// Struct
StructExpr = StructExprStruct () / StructExpr1
StructExpr1 = StructExprTuple () / StructExprUnit

StructExprStruct = f f / f

StructExprTuple = PathInExpr StructExprTuple1 / f
StructExprTuple1 = '(' StructExprTuple2 / f
StructExprTuple2 = ZeroOrMoreExpr ')' / f
ZeroOrMoreExpr = Expr () / f

StructExprUnit = PathInExpr () / f

// PathInExpr
PathInExpr = ZeroOrOneDoubleColon OneOrMorePathExprSegment / f
ZeroOrOneDoubleColon = "::" () / ()
OneOrMorePathExprSegment = PathExprSegment () / f

PathExprSegment = PathIdentSegment PathExprSegment1 / f
PathExprSegment1 = "::" GenericArgs / ()

PathIdentSegment = Identifier () / f

GenericArgs = f f / f

// Literal
LiteralExpr = CharLiteral () / LiteralExpr1
LiteralExpr1 = StringLiteral () / LiteralExpr2
LiteralExpr2 = IntegerLiteral () / f

// Metasymbol
MetasymbolLiteral = EmptyLiteral () / MetasymbolLiteral1
MetasymbolLiteral1 = FailureLiteral () / MetasymbolLiteral2
MetasymbolLiteral2 = AnyLiteral () / MetasymbolLiteral3
MetasymbolLiteral3 = AllLiteral () / f
EmptyLiteral = "()" () / f
FailureLiteral = 'f' () / f
AnyLiteral = '?' ZeroOrMoreAny / f
ZeroOrMoreAny = '?' ZeroOrMoreAny / ()
AllLiteral = '*' () / f

// Original symbol
OriginalSymbolExpr = "{ " OriginalSymbolExpr1 / f
OriginalSymbolExpr1 = ExprWithoutBlock " }" / f

// Char
CharLiteral = '\'' CharLiteral1 / f
CharLiteral1 = InnerCharLiteral '\'' / f
InnerCharLiteral = NotCharLetter InnerCharLiteral1 / f
NotCharLetter = '\'' * / ()
InnerCharLiteral1 = QuoteEscape () / ?

// String
StringLiteral = '"' StringLiteral1 / f
StringLiteral1 = InnerStringLiteral '"' / f
InnerStringLiteral = InnerStringLiteralLetter InnerStringLiteral / ()
// InnerStringLiteralLetter
InnerStringLiteralLetter = NotStringLetter InnerStringLiteralLetter1 / f
NotStringLetter = '"' * / ()
InnerStringLiteralLetter1 = QuoteEscape () / ?

// Integer
IntegerLiteral = IntegerLiterals () / f
IntegerLiterals = DecLiteral () / f
DecLiteral = DecDigit ZeroOrMoreDecDigit / f
ZeroOrMoreDecDigit = DecDigitOrUnderscore ZeroOrMoreDecDigit / ()
DecDigitOrUnderscore = DecDigit () / '_'

// IDENTIFIER
Identifier = Uppercase ZeroOrMoreIdentifierContinue / f
ZeroOrMoreIdentifierContinue =  IdentifierContinue ZeroOrMoreIdentifierContinue / ()
IdentifierContinue =  Alphabet () / DecDigit


// Letters
Alphabet = Lowercase () / Uppercase
// Lowercase
LowercaseAToF = LowercaseAToF1 () / f
LowercaseAToF1 = 'a' () / LowercaseAToF2
LowercaseAToF2 = 'b' () / LowercaseAToF3
LowercaseAToF3 = 'c' () / LowercaseAToF4
LowercaseAToF4 = 'd' () / LowercaseAToF5
LowercaseAToF5 = 'e' () / LowercaseAToF6
LowercaseAToF6 = 'f' () / f
Lowercase = LowercaseAToF () / Lowercase1
Lowercase1 = 'g' () / Lowercase2
Lowercase2 = 'h' () / Lowercase3
Lowercase3 = 'i' () / Lowercase4
Lowercase4 = 'j' () / Lowercase5
Lowercase5 = 'k' () / Lowercase6
Lowercase6 = 'l' () / Lowercase7
Lowercase7 = 'm' () / Lowercase8
Lowercase8 = 'n' () / Lowercase9
Lowercase9 = 'o' () / Lowercase10
Lowercase10 = 'p' () / Lowercase11
Lowercase11 = 'q' () / Lowercase12
Lowercase12 = 'r' () / Lowercase13
Lowercase13 = 's' () / Lowercase14
Lowercase14 = 't' () / Lowercase15
Lowercase15 = 'u' () / Lowercase16
Lowercase16 = 'v' () / Lowercase17
Lowercase17 = 'w' () / Lowercase18
Lowercase18 = 'x' () / Lowercase19
Lowercase19 = 'y' () / Lowercase20
Lowercase20 = 'z' () / f
// Uppercase
UppercaseAToF = UppercaseAToF1 () / f
UppercaseAToF1 = 'A' () / UppercaseAToF2
UppercaseAToF2 = 'B' () / UppercaseAToF3
UppercaseAToF3 = 'C' () / UppercaseAToF4
UppercaseAToF4 = 'D' () / UppercaseAToF5
UppercaseAToF5 = 'E' () / UppercaseAToF6
UppercaseAToF6 = 'F' () / f
Uppercase = UppercaseAToF () / Uppercase1
Uppercase1 = 'G' () / Uppercase2
Uppercase2 = 'H' () / Uppercase3
Uppercase3 = 'I' () / Uppercase4
Uppercase4 = 'J' () / Uppercase5
Uppercase5 = 'K' () / Uppercase6
Uppercase6 = 'L' () / Uppercase7
Uppercase7 = 'M' () / Uppercase8
Uppercase8 = 'N' () / Uppercase9
Uppercase9 = 'O' () / Uppercase10
Uppercase10 = 'P' () / Uppercase11
Uppercase11 = 'Q' () / Uppercase12
Uppercase12 = 'R' () / Uppercase13
Uppercase13 = 'S' () / Uppercase14
Uppercase14 = 'T' () / Uppercase15
Uppercase15 = 'U' () / Uppercase16
Uppercase16 = 'V' () / Uppercase17
Uppercase17 = 'W' () / Uppercase18
Uppercase18 = 'X' () / Uppercase19
Uppercase19 = 'Y' () / Uppercase20
Uppercase20 = 'Z' () / f

QuoteEscape = "\\'" () / "\\\""
EndOfLine = "\r\n" () / '\n'
Space = ' ' () / f

// Digits
BinDigit = "0" () / "1"
OctDigit = BinDigit () / OctDigit1
OctDigit1 = "2" () / OctDigit2
OctDigit2 = "3" () / OctDigit3
OctDigit3 = "4" () / OctDigit4
OctDigit4 = "5" () / OctDigit5
OctDigit5 = "6" () / OctDigit6
OctDigit6 = "7" () / f
DecDigit = OctDigit () / DecDigit1
DecDigit1 = "8" () / DecDigit2
DecDigit2 = "9" () / f

// Comment
LineComment = "//" InnerLineComment / f
InnerLineComment = AnyExceptLF InnerLineComment / ()
AnyExceptLF = AnyExceptLF1 ? / f
AnyExceptLF1 = EndOfLine * / ()

待办事项

任务

  • CST 中的 into_first()

  • 在 mplg 中添加 { 原始 }
  • 添加从 AST 获取变量的函数
  • 添加 RowColSpan
  • 从 MPLG 文件创建解析器
  • 错误处理
  • Packrat 解析
  • 左递归

下一实现

  • 添加从 AST 获取变量的函数
  • 可以是叶节点中的变量
  • 错误处理

练习

序列

A<-e1 e2

A = e1 e2 / f

选择

A<-e1/e2

A = e1 () / e2

零个或多个

A<-e*

A = e A / ()

非谓词

A<- !e?

A = B ? / f
B = e * / ()

参考文献

  • Alexander Birman. The TMG Recognition Schema. PhD thesis, Princeton University, February 1970
  • Alfred V. Aho and Jeffrey D. Ullman. The Theory of Parsing, Translation and Compiling - Vol. I: Parsing. Prentice Hall, Englewood Cliffs, N.J., 1972.
  • Bryan Ford. 2002. Packrat parsing: a practical linear-time algorithm with backtracking. Ph.D. Dissertation. Massachusetts Institute of Technology.
  • Bryan Ford. 2004. Parsing expression grammars: a recognition-based syntactic foundation. In Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of programming languages. 111–122.
  • Hutchison, Luke AD. "Pika parsing: reformulating packrat parsing as a dynamic programming algorithm solves the left recursion and error recovery problems." arXiv preprint arXiv:2005.06444 (2020).

无运行时依赖