4 versions

0.2.1	Jan 12, 2022
0.1.2	Jul 24, 2021
0.1.1	Dec 30, 2020
0.1.0	Dec 29, 2020

#17 in #web-crawler

160 downloads per month

MIT/Apache

74KB
1.5K SLoC

voyager

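With voyager you can easily extract structured data from websites.

Write your own crawler/scraper with voyager, following a state machine model.

Example

The examples use tokio as their runtime, so your Cargo.toml could look like this: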

[dependencies]
voyager = { version = "0.1" }
tokio = { version = "1.8", features = ["full"] }

Declare your own scraper and model

// Declare your scraper, with all the selectors etc.
struct HackernewsScraper {
    post_selector: Selector,
    author_selector: Selector,
    title_selector: Selector,
    comment_selector: Selector,
    max_page: usize,
}

/// The state model
#[derive(Debug)]
enum HackernewsState {
    Page(usize),
    Post,
}

/// The output the scraper should eventually produce
#[derive(Debug)]
struct Entry {
    author: String,
    url: Url,
    link: Option<String>,
    title: String,
}

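Implement the `voyager::Scraper` trait

A Scraper consists of two associated types:

Output, the type the scraper eventually produces
State, the type the scraper can carry along across several requests that eventually lead to an Output

and the scrape callback, which is invoked after each received response.

Depending on the state attached to the response, you can supply the crawler with new URLs to visit, with or without a state attached to them.

Scraping is done with causal-agent/scraper.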

impl Scraper for HackernewsScraper {
    type Output = Entry;
    type State = HackernewsState;

    /// do your scraping
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>> {
        let html = response.html();

        if let Some(state) = response.state {
            match state {
                HackernewsState::Page(page) => {
                    // find all entries
                    for id in html
                        .select(&self.post_selector)
                        .filter_map(|el| el.value().attr("id"))
                    {
                        // submit an url to a post
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/item?id={}", id),
                            HackernewsState::Post,
                        );
                    }
                    if page < self.max_page {
                        // queue in next page
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/news?p={}", page + 1),
                            HackernewsState::Page(page + 1),
                        );
                    }
                }

                HackernewsState::Post => {
                    // scrape the entry
                    let entry = Entry {
                        // ...
                    };
                    return Ok(Some(entry))
                }
            }
        }

        Ok(None)
    }
}
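To make the control flow of the callback easier to picture, here is a minimal, dependency-free sketch of a driver loop feeding a state-carrying queue. It is not voyager's implementation: `Crawler`, `Response`, `State`, `scrape` and `run` below are simplified, hypothetical stand-ins, no HTTP requests are made, and every listing page is assumed to link to exactly one post.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the scraper's state.
#[derive(Debug)]
enum State {
    Page(usize),
    Post,
}

/// Hypothetical stand-in for a fetched response: the URL plus the state
/// that was attached when the URL was queued.
struct Response {
    url: String,
    state: Option<State>,
}

/// Hypothetical stand-in for the crawler: a FIFO queue of pending visits.
#[derive(Default)]
struct Crawler {
    queue: VecDeque<(String, Option<State>)>,
}

impl Crawler {
    fn visit_with_state(&mut self, url: &str, state: State) {
        self.queue.push_back((url.to_string(), Some(state)));
    }
}

/// The callback: inspect the state, queue follow-up visits, or emit an output.
fn scrape(response: Response, crawler: &mut Crawler, max_page: usize) -> Option<String> {
    match response.state? {
        State::Page(page) => {
            // Assumption: every listing page links to exactly one post.
            crawler.visit_with_state(&format!("post-from-page-{}", page), State::Post);
            if page < max_page {
                crawler.visit_with_state(&format!("page-{}", page + 1), State::Page(page + 1));
            }
            None
        }
        // A post page ends the chain and yields an output.
        State::Post => Some(response.url),
    }
}

/// Drive the queue until it is empty, collecting every output.
fn run(max_page: usize) -> Vec<String> {
    let mut crawler = Crawler::default();
    crawler.visit_with_state("page-1", State::Page(1));
    let mut outputs = Vec::new();
    while let Some((url, state)) = crawler.queue.pop_front() {
        // A real crawler would perform an HTTP request here.
        let response = Response { url, state };
        if let Some(output) = scrape(response, &mut crawler, max_page) {
            outputs.push(output);
        }
    }
    outputs
}
```

Calling `run(2)` walks two listing pages and collects one output per post. The state travels with each queued URL, so the callback needs no other bookkeeping to know what kind of page it is looking at.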

Set up and collect all the output

Configure the crawler with CrawlerConfig:

Allow/block list of domains
Delays between requests
Whether to respect the Robots.txt rules
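As a rough mental model of these three settings, the sketch below shows how an allow/block list and a per-domain delay could be represented and queried. `ToyConfig` and all of its methods except the name `allow_domain_with_delay` are hypothetical and exist only for illustration. This is not voyager's `CrawlerConfig`, and the robots.txt flag is only stored, not enforced.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical, simplified configuration. NOT voyager's `CrawlerConfig`.
#[derive(Default)]
struct ToyConfig {
    /// Allowed domains, each with an optional delay between requests.
    allowed: HashMap<String, Option<Duration>>,
    /// Domains that must never be requested.
    blocked: Vec<String>,
    /// Whether robots.txt rules should be honored (stored only in this sketch).
    respect_robots_txt: bool,
}

impl ToyConfig {
    fn allow_domain_with_delay(mut self, domain: &str, delay: Duration) -> Self {
        self.allowed.insert(domain.to_string(), Some(delay));
        self
    }

    fn block_domain(mut self, domain: &str) -> Self {
        self.blocked.push(domain.to_string());
        self
    }

    fn respect_robots_txt(mut self) -> Self {
        self.respect_robots_txt = true;
        self
    }

    /// A request is permitted if its domain is not blocked and either
    /// no allow list was configured or the domain is on it.
    fn is_permitted(&self, domain: &str) -> bool {
        if self.blocked.iter().any(|d| d == domain) {
            return false;
        }
        self.allowed.is_empty() || self.allowed.contains_key(domain)
    }

    /// The pause to insert before the next request to `domain`, if any.
    fn delay_for(&self, domain: &str) -> Option<Duration> {
        self.allowed.get(domain).copied().flatten()
    }
}

/// Example values; `ads.example.com` is a made-up domain.
fn example_config() -> ToyConfig {
    ToyConfig::default()
        .allow_domain_with_delay("news.ycombinator.com", Duration::from_millis(2_000))
        .block_domain("ads.example.com")
        .respect_robots_txt()
}
```

With `example_config()`, a request to `news.ycombinator.com` is permitted with a two second delay, while any domain that is blocked or missing from the allow list is rejected.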

Feed your config and an instance of your scraper to the Collector, which drives the Crawler and forwards the responses to your Scraper.

use voyager::scraper::Selector;
use voyager::*;
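// `StreamExt` comes from the futures crate (add `futures = "0.3"` to Cargo.toml);
// tokio 1.x no longer has a `tokio::stream` module.
use futures::StreamExt;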

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    
    // only fulfill requests to `news.ycombinator.com`
    let config = CrawlerConfig::default().allow_domain_with_delay(
        "news.ycombinator.com",
        // add a delay between requests
        RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
    );
    
    let mut collector = Collector::new(HackernewsScraper::default(), config);

    collector.crawler_mut().visit_with_state(
        "https://news.ycombinator.com/news",
        HackernewsState::Page(1),
    );

    while let Some(output) = collector.next().await {
        let post = output?;
        dbg!(post);
    }
    
    Ok(())
}

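See the examples for more.

Inject async calls

Sometimes it is helpful to execute some other calls first, for example to get a token. You can submit async closures to the crawler to manually fetch a response and inject a state, or to drive a state to completion.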


fn scrape(
    &mut self,
    response: Response<Self::State>,
    crawler: &mut Crawler<Self>,
) -> Result<Option<Self::Output>> {

    // inject your custom crawl function that produces a `reqwest::Response` and `Self::State` which will get passed to `scrape` when resolved.
    crawler.crawl(move |client| async move {
        let state = response.state;
        // `send()` returns a future, so it must be awaited before `?` is applied
        let auth = client.post("some auth end point ").send().await?.json().await?;
        // do other async tasks etc..
        let new_resp = client.get("the next html page").send().await?;
        Ok((new_resp, state))
    });
    
    // submit a crawling job that completes to `Self::Output` directly
    crawler.complete(move |client| async move {
        // do other async tasks to create a `Self::Output` instance
        let output = Self::Output{/*..*/};
        Ok(Some(output))
    });
    
    Ok(None)
}
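The difference between the two kinds of injected work can be sketched without any async machinery. The model below is synchronous and entirely hypothetical (`Client`, `Job`, `scrape`, `drive` and `example_jobs` are illustrative names, not voyager APIs). It only shows that one kind of job feeds its result back through the scrape callback, while the other yields an output directly.

```rust
/// Hypothetical stand-in for an HTTP client; it fabricates bodies instead of fetching.
struct Client;

impl Client {
    fn get(&self, url: &str) -> String {
        format!("<html>{}</html>", url)
    }
}

/// The two kinds of injected work described in the text.
enum Job {
    /// Yields a response body and a state; the result goes back through `scrape`.
    Crawl(Box<dyn FnOnce(&Client) -> (String, Option<u32>)>),
    /// Yields the final output directly, bypassing `scrape`.
    Complete(Box<dyn FnOnce(&Client) -> Option<String>>),
}

/// Stand-in for the scrape callback: turns a body and a state into an output.
fn scrape(body: String, state: Option<u32>) -> Option<String> {
    state.map(|page| format!("page {}: {}", page, body))
}

/// Run every job in order and collect whatever outputs they lead to.
fn drive(jobs: Vec<Job>) -> Vec<String> {
    let client = Client;
    let mut outputs = Vec::new();
    for job in jobs {
        let output = match job {
            Job::Crawl(fetch) => {
                let (body, state) = fetch(&client);
                scrape(body, state)
            }
            Job::Complete(finish) => finish(&client),
        };
        // `Option` is iterable, so `None` adds nothing.
        outputs.extend(output);
    }
    outputs
}

fn example_jobs() -> Vec<Job> {
    vec![
        Job::Crawl(Box::new(|client: &Client| (client.get("next"), Some(2u32)))),
        Job::Complete(Box::new(|_client: &Client| Some("done".to_string()))),
    ]
}
```

`drive(example_jobs())` produces one output that passed through `scrape` and one that was completed directly. In the real crate both kinds of work are async closures, which this sketch leaves out for brevity.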

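Recover a state that got lost

If the crawler encounters an error due to a failed or disallowed HTTP request, the error is reported as a CrawlError, which carries the last valid state. The error can then be downcast.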


let mut collector = Collector::new(HackernewsScraper::default(), config);

while let Some(output) = collector.next().await {
  match output {
    Ok(post) => {/**/}
    Err(err) => {
      // recover the state by downcasting the error
      if let Ok(err) = err.downcast::<CrawlError<<HackernewsScraper as Scraper>::State>>() {
        let last_state = err.state();
      }
    }
  }
}
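The downcasting pattern itself can be tried with the standard library alone. In the sketch below, `ToyCrawlError`, `State`, `failing_request` and `recover_state` are hypothetical names that mimic the idea of an error carrying the last state. They are not voyager's actual types, and the error is boxed as `Box<dyn Error>` here rather than whatever error type the crate uses.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error that carries the state of the failed request.
#[derive(Debug)]
struct ToyCrawlError<S> {
    reason: String,
    state: Option<S>,
}

impl<S: fmt::Debug> fmt::Display for ToyCrawlError<S> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "crawl failed: {}", self.reason)
    }
}

impl<S: fmt::Debug> Error for ToyCrawlError<S> {}

/// Hypothetical state model.
#[derive(Debug, PartialEq)]
enum State {
    Page(usize),
    Post,
}

/// Simulates a request that fails while page 3 was being fetched.
fn failing_request() -> Result<String, Box<dyn Error>> {
    Err(Box::new(ToyCrawlError {
        reason: "403 Forbidden".to_string(),
        state: Some(State::Page(3)),
    }))
}

/// Try to get the state back out of a type-erased error.
fn recover_state(err: Box<dyn Error>) -> Option<State> {
    match err.downcast::<ToyCrawlError<State>>() {
        Ok(crawl_err) => {
            // Unbox first, then move the state out.
            let crawl_err = *crawl_err;
            crawl_err.state
        }
        // Any other error type carries no state to recover.
        Err(_other) => None,
    }
}
```

Passing the error from `failing_request()` to `recover_state` returns the page state that was in flight, which could then be queued again. Errors of any other type simply yield `None`.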

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or https://apache.ac.cn/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or https://open-source.org.cn/licenses/MIT)

Dependencies

~7–19MB
~296K SLoC