10 versions

0.1.9	Jun 13, 2024
0.1.8	Sep 3, 2023
0.1.6	Aug 30, 2023

#197 in Concurrency

520 downloads per month

Custom license

19KB
262 lines

🕷️ crawly

A lightweight and efficient web crawler in Rust, optimized for concurrent scraping while respecting robots.txt rules.

🚀 Features

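Concurrent crawling: uses concurrency to scrape efficiently across multiple cores;
Respects robots.txt: automatically fetches and obeys each site's crawling rules;
DFS algorithm: follows web links using a depth-first search;
Configurable via the Builder pattern: adjust crawl depth, rate limiting, and other parameters with little effort;
Cloudflare detection: if the target URL is hosted behind Cloudflare and a mitigation is found, the URL is skipped without raising an error;
Built with Rust: memory safety and high speed.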
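
The depth-first strategy and the depth and page limits listed above can be illustrated with a minimal sketch. This is not crawly's implementation: the in-memory `links` map stands in for fetching pages over HTTP, and `dfs_crawl` and its parameters are hypothetical names chosen for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Depth-first traversal over an in-memory link graph (a hypothetical
/// stand-in for real page fetches), bounded by `max_depth` and `max_pages`.
fn dfs_crawl(
    links: &HashMap<&str, Vec<&str>>,
    start: &str,
    max_depth: usize,
    max_pages: usize,
) -> Vec<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut order = Vec::new();
    // Explicit stack of (url, depth): the most recently discovered link is explored first.
    let mut stack = vec![(start.to_string(), 0usize)];

    while let Some((url, depth)) = stack.pop() {
        if order.len() >= max_pages {
            break;
        }
        // Skip links that are too deep or were already visited.
        if depth > max_depth || !visited.insert(url.clone()) {
            continue;
        }
        order.push(url.clone());
        if let Some(children) = links.get(url.as_str()) {
            // Push in reverse so the first link on a page is visited first.
            for child in children.iter().rev() {
                stack.push((child.to_string(), depth + 1));
            }
        }
    }
    order
}

fn main() {
    let mut links = HashMap::new();
    links.insert("/", vec!["/a", "/b"]);
    links.insert("/a", vec!["/a/1", "/a/2"]);
    links.insert("/b", vec!["/a"]);

    // Visit at most 4 pages, following links at most 2 levels deep.
    let order = dfs_crawl(&links, "/", 2, 4);
    println!("{:?}", order);
}
```

In this sketch the whole `/a` branch is exhausted before `/b` is considered, which is what distinguishes depth-first from breadth-first crawling.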

📦 Installation

Add crawly to your Cargo.toml:

[dependencies]
crawly = "^0.1"

🛠️ Usage

A simple usage example:

use anyhow::Result;
use crawly::Crawler;

#[tokio::main]
async fn main() -> Result<()> {
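    // Create a crawler with the default settings.
    let crawler = Crawler::new()?;
    // Crawl from the given URL and collect the content of each visited page.
    let results = crawler.start("https://example.com").await?;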

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}

Using the Builder

For finer control over the crawler's behavior, use CrawlerBuilder:

use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
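    let crawler = CrawlerBuilder::new()
        .with_max_depth(10)                // maximum link depth to follow
        .with_max_pages(100)               // maximum number of pages to crawl
        .with_max_concurrent_requests(50)  // cap on requests in flight
        .with_rate_limit_wait_seconds(2)   // rate-limit wait, in seconds
        .with_robots(true)                 // respect robots.txt
        .build()?;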

    let results = crawler.start("https://www.example.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}
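
To make concrete what respecting robots.txt (enabled above with `with_robots(true)`) means, here is a simplified sketch of a `Disallow` check. It is not how crawly parses robots.txt: `disallowed_prefixes` and `is_allowed` are hypothetical helpers, and the sketch assumes only plain prefix rules under `User-agent: *`, ignoring `Allow` lines, wildcards, and multi-agent groups.

```rust
/// Collects the `Disallow` path prefixes that apply to all user agents (`*`).
/// Simplification: `Allow`, wildcards, and agent-specific groups are ignored.
fn disallowed_prefixes(robots_txt: &str) -> Vec<String> {
    let mut applies = false;
    let mut prefixes = Vec::new();
    for line in robots_txt.lines() {
        // Strip trailing comments and surrounding whitespace.
        let line = line.split('#').next().unwrap_or("").trim();
        let Some((key, value)) = line.split_once(':') else { continue };
        let value = value.trim();
        match key.trim().to_ascii_lowercase().as_str() {
            // Track whether the current group targets every crawler.
            "user-agent" => applies = value == "*",
            // An empty Disallow value means nothing is blocked.
            "disallow" if applies && !value.is_empty() => prefixes.push(value.to_string()),
            _ => {}
        }
    }
    prefixes
}

/// A path is allowed unless it starts with one of the disallowed prefixes.
fn is_allowed(path: &str, prefixes: &[String]) -> bool {
    !prefixes.iter().any(|p| path.starts_with(p.as_str()))
}

fn main() {
    let robots = "User-agent: *\nDisallow: /private/ # keep out\n\nUser-agent: otherbot\nDisallow: /\n";
    let rules = disallowed_prefixes(robots);
    for path in ["/index.html", "/private/data.html"] {
        println!("{path}: allowed = {}", is_allowed(path, &rules));
    }
}
```

A crawler would run a check like this before each request and silently drop any URL whose path is disallowed.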

🛡️ Cloudflare

This crate detects Cloudflare-hosted sites: if the cf-mitigated header is found in a response, the URL is skipped without raising an error.
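
The skip-on-mitigation behavior can be sketched as a header check. This is an illustration under assumptions, not the crate's code: headers are modeled as plain `(name, value)` pairs rather than an HTTP client's header map, and `is_cloudflare_mitigated` is a hypothetical helper name.

```rust
/// Returns true when a `cf-mitigated` header is present.
/// Header names are compared case-insensitively, as HTTP requires.
fn is_cloudflare_mitigated(headers: &[(&str, &str)]) -> bool {
    headers
        .iter()
        .any(|(name, _)| name.eq_ignore_ascii_case("cf-mitigated"))
}

fn main() {
    // Hypothetical responses: (url, response headers).
    let responses = [
        ("https://example.com/", vec![("content-type", "text/html")]),
        ("https://example.com/guarded", vec![("CF-Mitigated", "challenge")]),
    ];

    for (url, headers) in &responses {
        if is_cloudflare_mitigated(headers) {
            // Skip quietly instead of returning an error.
            println!("skipping {url}");
            continue;
        }
        println!("processing {url}");
    }
}
```

Treating a mitigated response as a skip rather than an error lets one protected page drop out of the results without aborting the rest of the crawl.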

📜 Tracing

Every function is instrumented; in addition, this crate emits some DEBUG messages that make the crawling flow easier to follow.

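🤝 Contributing

Contributions, issues, and feature requests are welcome!

Feel free to check the issues page. You can also take a look at the contributing guide.

📝 License

This project is MIT licensed.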

💌 Contact

Author: Dario Cancelliere
Email: dario@ai-chat.it
Company website: https://ai-chat.it

Dependencies

~10–24MB
~368K SLoC