#lexer #file-path #tokenizer #lexical #scanlex

bin+lib lexical_scanner

This is a simple lexer that produces over 115 different token types modeled on the Rust programming language. This complete lexer/lexical scanner generates tokens from either a string or a file path.

19 releases

0.1.18 Apr 7, 2022
0.1.17 Apr 6, 2022

MIT license

2.5MB
1K SLoC


Documentation

This complete lexer/lexical scanner generates over 115 token types from a string or a file path. The output is a Vector the user can process as needed. All tokens are included (even whitespace), since deciding how to use them is left to the user.

If you have any questions or comments, or need help, please open a discussion here

https://github.com/mjehrhart/lexical_scanner/discussions

For example output, see the wiki page

https://github.com/mjehrhart/lexical_scanner/wiki/Example

Setup

Add this dependency to your TOML

[dependencies]
lexical_scanner = "0.1.18"

Basic Usage

There are two ways to run the lexical scanner: pass in a file path or pass in a string. Passing in a string is intended mainly for testing, while passing in a file path is meant for everyday work. The lexical scanner can generate many thousands of tokens very quickly, so using a file path is preferred.

use lexical_scanner;

fn main() {
    let path = "/Users/gues/my_file_to_read_into_tokens/temp.txt";
    let token_list = lexical_scanner::lexer(&path); 
}

input -> 
: :: > >= >> < <= << => += -= *= /= &= ^= &= |= == != + - * / % ^ & && | || !  >>= <<= -> /// //! // /* */ /*! /**
output ->
0. Colon
1. WhiteSpace
2. PathSep
3. WhiteSpace
4. Gt
5. WhiteSpace
6. Ge
7. WhiteSpace
8. Shr
9. WhiteSpace
10. Lt
11. WhiteSpace
12. Le
13. WhiteSpace
14. Shl
15. WhiteSpace
16. FatArrow
17. WhiteSpace
18. PlusEq
19. WhiteSpace
20. MinusEq
21. WhiteSpace
22. StarEq
23. WhiteSpace
24. SlashEq
25. WhiteSpace
26. AndEq
27. WhiteSpace
28. CaretEq
29. WhiteSpace
30. AndEq
31. WhiteSpace
32. OrEq
33. WhiteSpace
34. EqEq
35. WhiteSpace
36. NotEq
37. WhiteSpace
38. Plus
39. WhiteSpace
40. Minus
41. WhiteSpace
42. Star
43. WhiteSpace
44. Slash
45. WhiteSpace
46. Percent
47. WhiteSpace
48. Caret
49. WhiteSpace
50. And
51. WhiteSpace
52. AndAnd
53. WhiteSpace
54. Or
55. WhiteSpace
56. OrOr
57. WhiteSpace
58. Not
59. WhiteSpace
60. LineComment("//")
61. WhiteSpace
62. BlockCommentStart("/*")
63. WhiteSpace
64. BlockCommentStop("*/")
65. WhiteSpace
66. ShrEq
67. WhiteSpace
68. ShlEq
69. WhiteSpace
70. RArrow
71. WhiteSpace
72. OuterLineDoc("///")
73. WhiteSpace
74. InnerLineDoc("//!")
75. WhiteSpace
76. InnerBlockDoc("/*!")
77. WhiteSpace
78. OuterBlockDoc("/**")
79. Newline

That is all there is to it! The lexical scanner returns a Vec the user can process as needed.

To test with a string, simply call this method

use lexical_scanner;

fn main() {
    let text = "The number 5.0 is > 1;";
    let token_list = lexical_scanner::lexer_as_str(&text); 
}

Here is an easy way to view the tokens for unit testing

for (i, token) in token_list.iter().enumerate(){
    println!("{}. {:?}", i, token);
}

output -> 
0. Word("The")
1. WhiteSpace
2. Word("number")
3. WhiteSpace
4. Floating("5.0")
5. WhiteSpace
6. Word("is")
7. WhiteSpace
8. Gt
9. WhiteSpace
10. Numeric("1")
11. Semi

Custom Keywords

There is a way to add your own keyword identifiers. Doing so can help manage how the tokens are parsed.

use lexical_scanner;

fn main() {
    let text = "The number 5.0 is left and nor right of the up and down 1;";
    let user_keywords = ["up", "down", "left", "right"];
    let token_list = lexical_scanner::lexer_with_user_keywords(&text, user_keywords.to_vec()); 
}

Here is an easy way to view the tokens for unit testing. You can see that "up", "down", "left", and "right" have been tokenized as KW_UserDefined(String).

for (i, token) in token_list.iter().enumerate(){
    println!("{}. {:?}", i, token);
}

output -> 
0. Word("The")
1. WhiteSpace
2. Word("number")
3. WhiteSpace
4. Floating("5.0")
5. WhiteSpace
6. Word("is")
7. WhiteSpace
8. KW_UserDefined("left")
9. WhiteSpace
10. Word("and")
11. WhiteSpace
12. Word("nor")
13. WhiteSpace
14. KW_UserDefined("right")
15. WhiteSpace
16. Word("of")
17. WhiteSpace
18. Word("the")
19. WhiteSpace
20. KW_UserDefined("up")
21. WhiteSpace
22. Word("and")
23. WhiteSpace
24. KW_UserDefined("down")
25. WhiteSpace
26. Numeric("1")
27. Semi

Supported Tokens

& => And,
&& => AndAnd,
&= => AndEq,
@ => At,
\ => Backslash,
BitCharacterCode7(String),
BitCharacterCode8(String),
/* => BlockCommentStart(String),
*/ => BlockCommentStop(String),
[ => BracketLeft,
] => BracketRight,
b'H' => Byte(String),
b"Hello" => ByteString(String),
^ => Caret,
^= => CaretEq,
\r\n => CarriageReturn,
Character(String),
: => Colon,
, => Comma,
{ => CurlyBraceLeft,
} => CurlyBraceRight,
$ => Dollar,
. => Dot,
.. => DotDot,
... => DotDotDot,
..= => DotDotEq,
" => DoubleQuote,
= => Eq,
== => EqEq,
>= => Ge,
> => Gt,
=> => FatArrow,
//! => InnerLineDoc(String),
/*! => InnerBlockDoc(String),
<= => Le,
// => LineComment(String),
< => Lt,
- => Minus,
-= => MinusEq,
| => Or,
|= => OrEq,
|| => OrOr,
/** => OuterBlockDoc(String),
/// => OuterLineDoc(String),
\n => Newline,
! => Not,
!= => NotEq,
Null,
3.14 => Floating(String),
314 => Numeric(String),
( => ParenLeft,
) => ParenRight,
:: => PathSep,
% => Percent,
%= => PercentEq,
+ => Plus,
+= => PlusEq,
# => Pound,
? => Question,
-> => RArrow,
r#"Hello"# => RawString(String),
rb#"Hello"# => RawByteString(String),
; => Semi,
<< => Shl,
<<= => ShlEq,
>> => Shr,
>>= => ShrEq,
' => SingleQuote,
/ => Slash,
/= => SlashEq,
* => Star,
*= => StarEq,
Stopped(String), //for debugging
"Hello" => String(String),
\t => Tab,
Undefined,
_ => Underscore,
' ' => WhiteSpace,
Word(String),
KW_As,
KW_Async,
KW_Await,
KW_Break,
KW_Const,
KW_Contine,
KW_Crate,
KW_Dyn,
KW_Else,
KW_Enum,
KW_Extern,
KW_False,
KW_Fn,
KW_For,
KW_If,
KW_Impl,
KW_In,
KW_Let,
KW_Loop,
KW_Match,
KW_Mod,
KW_Move,
KW_Mut,
KW_Pub,
KW_Ref,
KW_Return,
KW_SELF,
KW_Self,
KW_Static,
KW_Struct,
KW_Super,
KW_Trait,
KW_True,
KW_Type,
KW_Union,
KW_Unsafe,
KW_Use,
KW_UserDefined(String),
KW_Where,
KW_While,

crates.io => https://crates.io/crates/lexical_scanner
github.com => https://github.com/mjehrhart/lexical_scanner

No runtime dependencies