76 breaking releases

0.82.0	Nov 27, 2021
0.80.0	Oct 31, 2021
0.72.3	Jul 4, 2021

#16 in #spider

202 downloads per month
Used in crusty

GPL-3.0 license

98KB
2.5K SLoC

Crusty-core - build your own web crawler!

Example - crawl a single website, collect information about `TITLE` tags

use crusty_core::{prelude::*, select_task_expanders::FollowLinks};

// Job-wide state, shared by all tasks of a job (accessed through a lock).
#[derive(Debug, Default)]
pub struct JobState {
    sum_title_len: usize,
}

// Per-task state, one instance for each crawled page.
#[derive(Debug, Clone, Default)]
pub struct TaskState {
    title: String,
}

// Task expander that extracts the page title from every parsed document.
pub struct DataExtractor {}
type Ctx = JobCtx<JobState, TaskState>;
impl TaskExpander<JobState, TaskState, Document> for DataExtractor {
    fn expand(
        &self,
        ctx: &mut Ctx,
        _: &Task,
        _: &HttpStatus,
        doc: &Document,
    ) -> task_expanders::Result {
        // Take the text of the first <title> element, if there is one.
        if let Some(title) = doc.find(Name("title")).next().map(|v| v.text()) {
            ctx.job_state.lock().unwrap().sum_title_len += title.len();
            ctx.task_state.title = title;
        }
        Ok(())
    }
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let crawler = Crawler::new_default()?;

    // Default settings and rules, plus two task expanders:
    // our title extractor and the built-in link follower.
    let settings = config::CrawlingSettings::default();
    let rules = CrawlingRules::new(CrawlingRulesOptions::default(), document_parser())
        .with_task_expander(|| DataExtractor {})
        .with_task_expander(|| FollowLinks::new(LinkTarget::HeadFollow));

    let job = Job::new("https://example.com", settings, rules, JobState::default())?;
    // Print every job update, and the accumulated job state once the job finishes.
    for r in crawler.iter(job) {
        println!("- {}, task state: {:?}", r, r.ctx.task_state);
        if let JobStatus::Finished(_) = r.status {
            println!("final job state: {:?}", r.ctx.job_state.lock().unwrap());
        }
    }
    Ok(())
}
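
The example keeps two kinds of state: a job-wide `JobState` that every task updates through `lock()`, and a per-task `TaskState`. The sketch below illustrates only the job-wide part using the standard library. It assumes the job state behaves like a value behind `Arc<Mutex<...>>`, which the `.lock().unwrap()` call suggests but which is not confirmed here. `sum_title_lengths` is a hypothetical helper and not part of crusty-core.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Job-wide state, mirroring `JobState` from the example above.
#[derive(Debug, Default)]
struct JobState {
    sum_title_len: usize,
}

/// Hypothetical helper (not a crusty-core API): handles each title on its
/// own thread and accumulates the total length in the shared job state.
fn sum_title_lengths(titles: Vec<String>) -> usize {
    let job_state = Arc::new(Mutex::new(JobState::default()));

    let handles: Vec<_> = titles
        .into_iter()
        .map(|title| {
            let job_state = Arc::clone(&job_state);
            thread::spawn(move || {
                // The guard is a temporary, so the lock is released
                // as soon as this statement finishes.
                job_state.lock().unwrap().sum_title_len += title.len();
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    // Bind the result first so the lock guard is dropped before `job_state`.
    let total = job_state.lock().unwrap().sum_title_len;
    total
}
```

Calling `sum_title_lengths` with a list of titles returns the sum of their lengths in bytes, the same quantity that `sum_title_len` accumulates in the example.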

If you want to get more elaborate, configure a few things, or control your imports more precisely

use crusty_core::{
    config,
    select::predicate::Name,
    select_task_expanders::{document_parser, Document, FollowLinks},
    task_expanders,
    types::{HttpStatus, Job, JobCtx, JobStatus, LinkTarget, Task},
    Crawler, CrawlingRules, CrawlingRulesOptions, ParserProcessor, TaskExpander,
};

#[derive(Debug, Default)]
pub struct JobState {
    sum_title_len: usize,
}

#[derive(Debug, Clone, Default)]
pub struct TaskState {
    title: String,
}

pub struct DataExtractor {}
type Ctx = JobCtx<JobState, TaskState>;
impl TaskExpander<JobState, TaskState, Document> for DataExtractor {
    fn expand(
        &self,
        ctx: &mut Ctx,
        _: &Task,
        _: &HttpStatus,
        doc: &Document,
    ) -> task_expanders::Result {
        let title = doc.find(Name("title")).next().map(|v| v.text());
        if let Some(title) = title {
            ctx.job_state.lock().unwrap().sum_title_len += title.len();
            ctx.task_state.title = title;
        }
        Ok(())
    }
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let concurrency_profile = config::ConcurrencyProfile::default();
    let parser_profile = config::ParserProfile::default();
    let tx_pp = ParserProcessor::spawn(concurrency_profile, parser_profile);

    let networking_profile = config::NetworkingProfile::default().resolve()?;
    let crawler = Crawler::new(networking_profile, tx_pp);

    let settings = config::CrawlingSettings::default();
    let rules_opt = CrawlingRulesOptions::default();
    let rules = CrawlingRules::new(rules_opt, document_parser())
        .with_task_expander(|| DataExtractor {})
        .with_task_expander(|| FollowLinks::new(LinkTarget::HeadFollow));

    let job = Job::new("https://example.com", settings, rules, JobState::default())?;
    for r in crawler.iter(job) {
        println!("- {}, task state: {:?}", r, r.ctx.task_state);
        if let JobStatus::Finished(_) = r.status {
            println!("final job state: {:?}", r.ctx.job_state.lock().unwrap());
        }
    }

    Ok(())
}

Install

Simply add this to your Cargo.toml

[dependencies]
crusty-core = {version = "~0.82.0", features=["select_rs"]}

If you need just the library, without the built-in select.rs task expanders (for links, images, etc.)

[dependencies]
crusty-core = "~0.82.0"

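Key features

Multi-threaded and async (built on top of tokio)
Highly customizable filtering at each and every step
- custom DNS resolver with built-in IP/subnet filtering
- status code/headers received (the built-in content-type filters work at this step),
- page downloaded (for example, we can decide not to parse the DOM),
- task filtering, with full control over -what- to follow and -how- (resolve DNS only, HEAD, HEAD+GET)
Built on top of hyper (HTTP/2 and gzip/deflate out of the box)
Rich content extraction with select
Observable with tracing, with custom metrics exposed to the user (such as HTML parsing duration and bytes sent/received)
Lots of options; almost everything is configurable, either through options or code
Suitable for both focused and broad crawling
Scales with ease when you want to crawl millions/billions of domains
It's fast, fast, fast!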
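
To make the IP/subnet filtering step more concrete, here is a minimal standard-library sketch of the kind of check a resolver-level filter might perform. It is not crusty-core's actual API: `is_crawlable` and `filter_resolved` are hypothetical names, and the set of rejected ranges is an assumption chosen for illustration.

```rust
use std::net::IpAddr;

/// Hypothetical check (not a crusty-core API): returns true when a resolved
/// address is not loopback, private, link-local, broadcast, or unspecified.
fn is_crawlable(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            !(v4.is_loopback()
                || v4.is_private()
                || v4.is_link_local()
                || v4.is_broadcast()
                || v4.is_unspecified())
        }
        // Only the checks that are stable in std are applied to IPv6.
        IpAddr::V6(v6) => !(v6.is_loopback() || v6.is_unspecified()),
    }
}

/// Keeps only the crawlable addresses from a resolver's answer.
fn filter_resolved(addrs: &[IpAddr]) -> Vec<IpAddr> {
    addrs.iter().copied().filter(|ip| is_crawlable(*ip)).collect()
}
```

Filtering at this stage means a task whose host resolves only to rejected addresses can be dropped before any connection is opened.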

Development

Make sure rustup is installed: https://rustup.rs/
Make sure pre-commit is installed: https://pre-commit.git-scm.cn/
Make sure markdown-pp is installed: https://github.com/jreese/markdown-pp
Run ./go setup
Run ./go check to run all pre-commit hooks and make sure everything is ready to go for git
Run ./go release minor to release the next minor version on crates.io

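Notes

For more elaborate usage, see the examples. This crawler is more verbose than some others, but it allows incredible customization at each and every step.

If you are interested in broad web crawling, see crusty, which is developed entirely on top of crusty-core and tries to tackle some of the challenges of broad web crawling.

Dependencies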

~13–26MB
~393K SLoC
