robotxt

13 unstable releases (5 breaking)

0.6.1	Mar 7, 2024
0.5.0	Jul 31, 2023
0.2.0	Mar 31, 2023

#2 in #scraper

390 downloads per month

MIT license

71KB
1.5K SLoC

See also the other spire-rs projects.

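An implementation of the robots.txt (or URL exclusion) protocol in the Rust programming language, with support for the crawl-delay, sitemap, and universal * match extensions (according to the RFC specification).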
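To make the universal * match concrete, here is a minimal, std-only sketch of the longest-match rule described in RFC 9309. It is not the crate's implementation: `Rule`, `pattern_matches`, and `is_allowed` are hypothetical names chosen for illustration, and the sketch skips percent-encoding normalization and user-agent group selection.

```rust
/// One rule from a robots.txt group (hypothetical type, not part of robotxt).
#[derive(Debug, Clone)]
pub struct Rule {
    /// `true` for an `Allow` line, `false` for a `Disallow` line.
    pub allow: bool,
    /// The path pattern, which may contain `*` and a trailing `$`.
    pub pattern: String,
}

/// Returns true if `path` matches `pattern`. A `*` matches any run of
/// characters; a trailing `$` anchors the pattern to the end of the path.
/// Without `$`, the pattern only has to match a prefix of the path.
pub fn pattern_matches(pattern: &str, path: &str) -> bool {
    let (pattern, anchored) = match pattern.strip_suffix('$') {
        Some(stripped) => (stripped, true),
        None => (pattern, false),
    };

    // Split into the literal pieces between wildcards.
    let mut segments = pattern.split('*');
    let first = segments.next().unwrap_or("");

    // The first literal piece must sit at the very start of the path.
    let Some(mut rest) = path.strip_prefix(first) else {
        return false;
    };

    let tail: Vec<&str> = segments.collect();
    let Some((last, middle)) = tail.split_last() else {
        // No `*` at all: prefix match, or exact match when anchored.
        return !anchored || rest.is_empty();
    };

    // Each middle piece is matched at its leftmost position, which leaves
    // as much of the path as possible for the pieces that follow.
    for &segment in middle {
        match rest.find(segment) {
            Some(index) => rest = &rest[index + segment.len()..],
            None => return false,
        }
    }

    if anchored {
        rest.ends_with(*last)
    } else {
        rest.contains(*last)
    }
}

/// Applies the longest-match rule: among all matching rules the longest
/// pattern wins, an `Allow` wins a tie, and a path with no matching rule
/// is allowed.
pub fn is_allowed(rules: &[Rule], path: &str) -> bool {
    rules
        .iter()
        .filter(|rule| pattern_matches(&rule.pattern, path))
        .max_by_key(|rule| (rule.pattern.len(), rule.allow))
        .map_or(true, |rule| rule.allow)
}
```

Comparing patterns by byte length is a simplification of the "most octets" wording in the RFC; a production matcher would normalize percent-encoded characters before comparing.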

Features

- parser to enable robotxt::{Robots}. Enabled by default.
- builder to enable robotxt::{RobotsBuilder, GroupBuilder}. Enabled by default.
- optimal to optimize overlapping and global rules, potentially improving matching speed at the cost of longer parsing times.
- serde to enable the serde::{Deserialize, Serialize} implementation, allowing the caching of related rules.

Examples

Parse the rules for the most specific user-agent in the provided robots.txt file:

```rust
use robotxt::Robots;

fn main() {
    // Disallow everything, then re-allow /example/ except for one file.
    let txt = r#"
      User-Agent: foobot
      Disallow: *
      Allow: /example/
      Disallow: /example/nope.txt
    "#;

    // Keep only the rules that apply to the "foobot" user-agent.
    let r = Robots::from_bytes(txt.as_bytes(), "foobot");
    assert!(r.is_relative_allowed("/example/yeah.txt"));
    assert!(!r.is_relative_allowed("/example/nope.txt"));
    assert!(!r.is_relative_allowed("/invalid/path.txt"));
}
```

Build a new robots.txt file in a declarative manner:

```rust
use robotxt::RobotsBuilder;

fn main() -> Result<(), url::ParseError> {
    let txt = RobotsBuilder::default()
        .header("Robots.txt: Start")
        // Rules for a single user-agent.
        .group(["foobot"], |u| {
            u.crawl_delay(5)
                .header("Rules for Foobot: Start")
                .allow("/example/yeah.txt")
                .disallow("/example/nope.txt")
                .footer("Rules for Foobot: End")
        })
        // Rules shared by several user-agents.
        .group(["barbot", "nombot"], |u| {
            u.crawl_delay(2)
                .disallow("/example/yeah.txt")
                .disallow("/example/nope.txt")
        })
        .sitemap("https://example.com/sitemap_1.xml".try_into()?)
        .sitemap("https://example.com/sitemap_1.xml".try_into()?)
        .footer("Robots.txt: End");

    println!("{}", txt.to_string());
    Ok(())
}
```

Links

- Request for Comments: 9309 on RFC-Editor.com
- Introduction to Robots.txt on Google.com
- How Google interprets Robots.txt on Google.com
- What is Robots.txt file on Moz.com

Notes

- The parser is based on Smerity/texting_robots.
- The Host directive is not supported.
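The builder example above sets a crawl delay per group, but the README does not show how a crawler should honor one. As a rough illustration, the std-only sketch below tracks the earliest time each host may be fetched again. `CrawlScheduler` and `reserve` are hypothetical names, not part of robotxt, and the delay is assumed to have already been read from the parsed rules by whatever means the crate provides.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks, per host, the earliest instant at which the next fetch may start.
/// Hypothetical helper for illustration; not part of robotxt.
#[derive(Default)]
pub struct CrawlScheduler {
    next_allowed: HashMap<String, Instant>,
}

impl CrawlScheduler {
    /// Reserves the next fetch slot for `host` and returns how long the
    /// caller should wait, measured from `now`, before starting that fetch.
    ///
    /// `crawl_delay` is the delay the site requested (assumed to come from
    /// its robots.txt). `now` is passed in explicitly so the logic can be
    /// exercised without sleeping.
    pub fn reserve(&mut self, host: &str, crawl_delay: Duration, now: Instant) -> Duration {
        // The fetch starts at the reserved slot if it is still in the
        // future, otherwise right away.
        let start = match self.next_allowed.get(host) {
            Some(&slot) if slot > now => slot,
            _ => now,
        };

        // The following fetch may start one crawl delay after this one.
        self.next_allowed.insert(host.to_string(), start + crawl_delay);

        start.duration_since(now)
    }
}
```

A caller would sleep for the returned duration before issuing the request. Keying the map by host keeps unrelated sites from delaying each other.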

Dependencies

~1.3–2.9MB
~79K SLoC

- bstr (parser)
- nom (parser)
- regex (parser)
- percent-encoding
- serde (optional)
- thiserror
- url
- serde_json (dev)

Other features: builder, full, optimal