1 unstable release

0.2.1	Apr 1, 2022

#1459 in Parser implementations

MIT license

43KB
890 lines

Cylon

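Cylon is a library for reading robots.txt files.

Features

There is no universal standard for which rules a web crawler must support in a robots.txt file. Cylon supports the following directives (notably, Sitemap is missing):

User-agent
Allow
Disallow
Crawl-delay

In addition, Cylon supports * as a wildcard that matches a substring of any length (zero or more characters), and the $ character to match the end of a path.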
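To make the * and $ semantics concrete, here is a minimal standalone sketch of pattern matching over a single rule. It is not how Cylon works internally (Cylon compiles rules into a DFA); the function name `pattern_matches` is hypothetical, and treating a pattern without $ as a prefix match is an assumption based on common robots.txt practice.

```rust
/// Returns true if `path` matches a robots.txt-style `pattern`.
/// `*` matches any run of zero or more bytes; a trailing `$` anchors the
/// match to the end of the path. Without `$`, the pattern only has to
/// match a prefix of the path (an assumption of this sketch).
fn pattern_matches(pattern: &str, path: &str) -> bool {
    let (pattern, anchored) = match pattern.strip_suffix('$') {
        Some(rest) => (rest, true),
        None => (pattern, false),
    };
    let text = path.as_bytes();

    // reachable[i] is true when the pattern bytes consumed so far
    // can match exactly text[..i].
    let mut reachable = vec![false; text.len() + 1];
    reachable[0] = true;

    for &p in pattern.as_bytes() {
        if p == b'*' {
            // `*` extends every reachable position to all later positions.
            let mut seen = false;
            for slot in reachable.iter_mut() {
                seen |= *slot;
                *slot = seen;
            }
        } else {
            // A literal byte advances each reachable position by one on a match.
            for i in (0..text.len()).rev() {
                reachable[i + 1] = reachable[i] && text[i] == p;
            }
            reachable[0] = false;
        }
    }

    if anchored {
        reachable[text.len()]
    } else {
        reachable.iter().any(|&r| r)
    }
}
```

With this sketch, the pattern /*.php$ matches /index.php but not /index.php?x=1, while /*.php (no anchor) matches both.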

Usage

Using Cylon is simple. Create a new compiler for your user agent, then compile the robots.txt file.

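use cylon::Compiler;

// The `.await` calls below must run inside an async context (any runtime works).
// You can use something like hyper or reqwest to download
// the robots.txt file instead.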
let example_robots = r#"
User-agent: googlebot
Allow: /

User-agent: *
Disallow: /
"#
.as_bytes();

// Create a new compiler that compiles a robots.txt file looking for
// rules that apply to the "googlebot" user agent.
let compiler = Compiler::new("googlebot");
let cylon = compiler.compile(example_robots).await.unwrap();
assert_eq!(true, cylon.allow("/index.html"));
assert_eq!(true, cylon.allow("/directory"));

// Create a new compiler that compiles a robots.txt file looking for
// rules that apply to the "bing" user agent.
let compiler = Compiler::new("bing");
let cylon = compiler.compile(example_robots).await.unwrap();
assert_eq!(false, cylon.allow("/index.html"));
assert_eq!(false, cylon.allow("/directory"));

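Contributing

Contributions are welcome! Please open a pull request. Issues may not be addressed in a timely manner unless they expose fundamental problems or security concerns.

Implementation

Async

This library uses an async API by default. It does not assume any particular async runtime, so you can use it with any of them (tokio, async-std, etc.).

A synchronous API may be added as an optional feature in the future, but there are currently no plans to do so. If you need a synchronous API, consider adding one yourself (contributions are welcome).

Performance

Cylon compiles robots.txt files into very efficient DFAs. This makes it well suited to web crawlers that need to check many URLs against the same robots.txt file.

The compiler avoids any random memory access when compiling the DFA (for example, it does not use hashmaps or tree structures), so it has very good cache locality.

The DFA can match an input path in roughly O(n) time, where n is the length of the input path. (Compare that with the O(n * m) complexity of matching the input path against each of the m rules in the robots.txt file.)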
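The O(n) claim can be illustrated with a deliberately simplified, table-driven automaton. This is a stand-in and not Cylon's compiler: the type `PrefixDfa` and its methods are hypothetical, it handles only literal prefixes (no * or $), and letting the longest matching prefix decide is an assumption of the sketch.

```rust
/// Sentinel meaning "no transition".
const NONE: u32 = u32::MAX;

/// A table-driven automaton over path bytes, stored in flat vectors.
struct PrefixDfa {
    // transitions[state][byte] is the next state, or NONE.
    transitions: Vec<[u32; 256]>,
    // verdict[state] is Some(allow) if a rule ends exactly at this state.
    verdict: Vec<Option<bool>>,
}

impl PrefixDfa {
    fn new() -> Self {
        PrefixDfa { transitions: vec![[NONE; 256]], verdict: vec![None] }
    }

    /// Adds a literal-prefix rule; a later rule for the same prefix overwrites.
    fn add_rule(&mut self, prefix: &str, allow: bool) {
        let mut state = 0usize;
        for &b in prefix.as_bytes() {
            let next = self.transitions[state][b as usize];
            state = if next == NONE {
                // Allocate a fresh state and link it from the current one.
                self.transitions.push([NONE; 256]);
                self.verdict.push(None);
                let new_state = self.transitions.len() - 1;
                self.transitions[state][b as usize] = new_state as u32;
                new_state
            } else {
                next as usize
            };
        }
        self.verdict[state] = Some(allow);
    }

    /// Walks the path once, one table lookup per byte, regardless of how
    /// many rules were added. Paths matched by no rule are allowed.
    fn allow(&self, path: &str) -> bool {
        let mut state = 0usize;
        let mut result = self.verdict[0].unwrap_or(true);
        for &b in path.as_bytes() {
            let next = self.transitions[state][b as usize];
            if next == NONE {
                break;
            }
            state = next as usize;
            if let Some(v) = self.verdict[state] {
                result = v;
            }
        }
        result
    }
}
```

The point of the sketch is the shape of `allow`: its cost depends on the path length only, whereas a loop over every rule pays for each rule on every lookup.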

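(De-)serialization

This library uses serde to allow serializing and deserializing the compiled Cylon DFA structs. This is useful if you need to cache the DFA in something like Memcached or Redis. (Use a format such as bincode or msgpack to convert it to bytes.)

Error handling

robots.txt files are more like guidelines than actual rules.

In general, Cylon tries not to raise errors for things that might be considered an invalid robots.txt file, which means there are very few failure cases.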
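As an illustration of this lenient approach, the following standalone sketch skips anything it does not understand instead of returning an error. It is not Cylon's parser: the `Directive` enum and the functions `parse_line` and `parse_lenient` are hypothetical names, and details such as lowercasing the user agent are assumptions.

```rust
#[derive(Debug, PartialEq)]
enum Directive {
    UserAgent(String),
    Allow(String),
    Disallow(String),
    CrawlDelay(u64),
}

/// Parses one line leniently: returns None for anything not understood
/// instead of reporting an error.
fn parse_line(line: &str) -> Option<Directive> {
    // Drop a trailing comment, then surrounding whitespace.
    let line = line.split('#').next().unwrap_or("").trim();
    // Lines without a `key: value` shape are ignored.
    let (key, value) = line.split_once(':')?;
    let value = value.trim();
    match key.trim().to_ascii_lowercase().as_str() {
        "user-agent" => Some(Directive::UserAgent(value.to_ascii_lowercase())),
        "allow" => Some(Directive::Allow(value.to_string())),
        "disallow" => Some(Directive::Disallow(value.to_string())),
        // An unparsable delay is skipped rather than treated as fatal.
        "crawl-delay" => value.parse::<u64>().ok().map(Directive::CrawlDelay),
        // Unsupported directives (for example Sitemap) are ignored.
        _ => None,
    }
}

/// Collects every directive that could be understood, in file order.
fn parse_lenient(input: &str) -> Vec<Directive> {
    input.lines().filter_map(parse_line).collect()
}
```

A file containing unsupported directives, stray text, or a malformed Crawl-delay still yields the directives that could be read, which matches the idea that failure cases should be rare.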

License

MIT

Dependencies

~1.1–2MB
~41K SLoC