9 versions

0.3.3	April 10, 2022
0.3.2	April 10, 2022
0.2.0	April 15, 2021
0.1.3	April 14, 2021
0.1.1	February 15, 2021

#936 in Parser implementations

MIT license

47KB
971 lines

Cylon

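Cylon is a library for reading robots.txt files.

Features

Cylon aims to stay compatible with the Robots Exclusion Protocol.

The following directives are supported (notably, Site-map is missing):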

User-agent
Allow
Disallow
Crawl-Delay (optional, enabled via the crawl-delay feature)

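The following special characters are supported:

* - a wildcard that matches any substring
$ - matches the end of the path
# - starts a comment, which Cylon ignores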
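
To make the semantics of * and $ concrete, here is a minimal sketch of a naive backtracking matcher. It is not how Cylon matches internally (Cylon compiles rules into an NFA), and the function name pattern_matches is hypothetical. The sketch assumes that a pattern without a trailing $ only has to match a prefix of the path.

```rust
/// Returns true if `path` matches the robots.txt-style `pattern`.
/// `*` matches any substring; a trailing `$` anchors the match at the
/// end of the path. Without `$`, matching a prefix of the path is enough
/// (an assumption of this sketch).
fn pattern_matches(pattern: &str, path: &str) -> bool {
    let (pat, anchored) = match pattern.strip_suffix('$') {
        Some(rest) => (rest.as_bytes(), true),
        None => (pattern.as_bytes(), false),
    };
    let text = path.as_bytes();
    let (mut p, mut t) = (0usize, 0usize);
    // Pattern index of the last `*` and the text index it currently ends at.
    let mut backtrack: Option<(usize, usize)> = None;

    loop {
        if p == pat.len() {
            // Pattern exhausted: success unless `$` demands the end of the path.
            if !anchored || t == text.len() {
                return true;
            }
        } else if pat[p] == b'*' {
            // Start by letting the wildcard match the empty string.
            backtrack = Some((p, t));
            p += 1;
            continue;
        } else if t < text.len() && pat[p] == text[t] {
            p += 1;
            t += 1;
            continue;
        }
        // Mismatch: let the most recent `*` absorb one more byte and retry.
        match backtrack {
            Some((star_p, star_t)) if star_t < text.len() => {
                backtrack = Some((star_p, star_t + 1));
                p = star_p + 1;
                t = star_t + 1;
            }
            _ => return false,
        }
    }
}
```

With this sketch, the pattern /*.php$ matches /a/b.php but not /a/b.php?x=1, because $ requires the path to end right after .php.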

Usage

Using Cylon is simple. Create a new compiler for your user agent, then compile the robots.txt file.

use cylon::Compiler;

// Run this inside an async function; `compile` returns a future.
// You can use something like hyper or reqwest to download
// the robots.txt file instead.
let example_robots = r#"
User-agent: googlebot
Allow: /

User-agent: *
Disallow: /
"#
.as_bytes();

// Create a new compiler that compiles a robots.txt file looking for
// rules that apply to the "googlebot" user agent.
let compiler = Compiler::new("googlebot");
let cylon = compiler.compile(example_robots).await.unwrap();
assert_eq!(true, cylon.allow("/index.html"));
assert_eq!(true, cylon.allow("/directory"));

// Create a new compiler that compiles a robots.txt file looking for
// rules that apply to the "bing" user agent.
let compiler = Compiler::new("bing");
let cylon = compiler.compile(example_robots).await.unwrap();
assert_eq!(false, cylon.allow("/index.html"));
assert_eq!(false, cylon.allow("/directory"));

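Contributing

Contributions are welcome! Please open a pull request. Issues may not be addressed promptly unless they expose a fundamental problem or a security concern.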

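Implementation

Async

This library uses an async API by default. It does not assume any async runtime, so you can use whichever you prefer (tokio, async-std, etc.).

A synchronous API may become an optional feature, but there are currently no plans to add one. If you need a synchronous API, consider adding it yourself (contributions are welcome).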
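
Because no particular runtime is assumed, a future can in principle be driven from synchronous code by any executor. The sketch below shows a minimal blocking executor built only from the standard library. It is an illustration, not part of Cylon: block_on and is_allowed are hypothetical names, and is_allowed merely stands in for an async call such as compiling a robots.txt file.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the blocked thread when the future signals progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Drives `fut` to completion on the current thread (hypothetical helper).
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            // Sleep until the waker unparks this thread, then poll again.
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical stand-in for an async operation.
async fn is_allowed(path: &str) -> bool {
    !path.starts_with("/private")
}
```

Calling block_on(is_allowed("/index.html")) then returns a plain bool in synchronous code. In a real crawler, the block_on provided by your chosen runtime is usually the better choice.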

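Performance

Cylon compiles robots.txt files into an NFA. This makes it well suited to web crawlers that need to check many URLs against the same robots.txt file. Reusing the same compiled Cylon NFA avoids repeated work.

An NFA usually matches more efficiently than a naive solution, because it can match multiple rules at the same time and therefore needs fewer comparisons. In general, the more repeated prefixes a robots.txt file contains, the more work the NFA saves by "compressing" them.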
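
The benefit of shared prefixes can be illustrated with a much simpler structure than Cylon's NFA: a prefix trie that stores each common prefix once and checks all rules in a single pass over the path. This is a simplified stand-in, not Cylon's implementation. The names PrefixRules, insert, allow and node_count are hypothetical, wildcards are not handled, and the sketch assumes that the longest matching rule wins and that a path with no matching rule is allowed.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Node {
    children: HashMap<u8, usize>,
    /// Some(true) if an Allow rule ends here, Some(false) for Disallow.
    verdict: Option<bool>,
}

/// All nodes live in one Vec, so following an edge is an index lookup.
struct PrefixRules {
    nodes: Vec<Node>,
}

impl PrefixRules {
    fn new() -> Self {
        PrefixRules { nodes: vec![Node::default()] }
    }

    /// Adds a rule; bytes shared with earlier rules reuse existing nodes.
    fn insert(&mut self, prefix: &str, allow: bool) {
        let mut cur = 0;
        for &b in prefix.as_bytes() {
            cur = match self.nodes[cur].children.get(&b).copied() {
                Some(next) => next,
                None => {
                    let next = self.nodes.len();
                    self.nodes.push(Node::default());
                    self.nodes[cur].children.insert(b, next);
                    next
                }
            };
        }
        self.nodes[cur].verdict = Some(allow);
    }

    /// Walks the path once; the last verdict seen belongs to the longest rule.
    fn allow(&self, path: &str) -> bool {
        let mut cur = 0;
        let mut verdict = self.nodes[0].verdict.unwrap_or(true);
        for &b in path.as_bytes() {
            match self.nodes[cur].children.get(&b) {
                Some(&next) => cur = next,
                None => break,
            }
            if let Some(v) = self.nodes[cur].verdict {
                verdict = v;
            }
        }
        verdict
    }

    fn node_count(&self) -> usize {
        self.nodes.len()
    }
}
```

For the rules Disallow: /private/, Allow: /private/public/ and Disallow: /tmp/, the trie needs 21 nodes, while the rule strings total 30 bytes. A naive matcher would compare the path against each of the three rules separately.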

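In some degenerate cases, it may perform worse than the naive approach. Special care is taken to avoid exponential running time when redundant wildcard matches are encountered: multiple consecutive wildcards are treated as a single wildcard.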
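
Collapsing repeated wildcards can be sketched as a small normalization pass over a pattern. This shows the idea only, not Cylon's actual code, and the function name collapse_wildcards is hypothetical.

```rust
/// Replaces every run of consecutive `*` characters with a single `*`.
/// A run of wildcards matches exactly the same paths as one wildcard,
/// so the meaning of the pattern is unchanged.
fn collapse_wildcards(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut prev_star = false;
    for ch in pattern.chars() {
        if ch == '*' && prev_star {
            // Redundant wildcard: skip it.
            continue;
        }
        prev_star = ch == '*';
        out.push(ch);
    }
    out
}
```

For example, /a***b** becomes /a*b*, which gives a backtracking or NFA-based matcher far fewer redundant choices to explore.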

When compiling and running the NFA, Cylon minimizes random memory accesses to maximize cache locality.

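(De-)serialization

This library uses serde to allow the compiled Cylon NFA struct to be serialized and deserialized. This is useful, for example, when you need to cache the NFA in something like Memcached or Redis. (Use a format such as bincode or msgpack to convert it to bytes.)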

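Error handling

robots.txt files are more like guidelines than actual rules.

In general, Cylon tries hard not to raise errors for things that could be considered an invalid robots.txt file, which means there are very few failure cases.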
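
A lenient parsing strategy of this kind can be sketched as follows: lines that are malformed, unknown or comment-only are skipped instead of producing an error. This is an illustration of the approach, not Cylon's parser. The Directive enum and the functions parse_line and parse_lenient are hypothetical, and matching directive names case-insensitively is an assumption of the sketch.

```rust
#[derive(Debug, PartialEq)]
enum Directive {
    UserAgent(String),
    Allow(String),
    Disallow(String),
}

/// Parses one line, returning None for anything that is not understood.
fn parse_line(line: &str) -> Option<Directive> {
    // Everything after `#` is a comment.
    let line = line.split('#').next().unwrap_or("").trim();
    // Lines without a `key: value` shape are silently skipped.
    let (key, value) = line.split_once(':')?;
    let value = value.trim().to_string();
    match key.trim().to_ascii_lowercase().as_str() {
        "user-agent" => Some(Directive::UserAgent(value)),
        "allow" => Some(Directive::Allow(value)),
        "disallow" => Some(Directive::Disallow(value)),
        // Unsupported directives are ignored rather than rejected.
        _ => None,
    }
}

/// Collects every directive that could be parsed; never fails.
fn parse_lenient(input: &str) -> Vec<Directive> {
    input.lines().filter_map(parse_line).collect()
}
```

Given a file that mixes valid rules with stray text, only the valid rules survive, and the caller never has to handle a parse error.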

License

MIT

Dependencies

~1–2MB
~41K SLoC