voyager

With voyager you can easily extract structured data from websites.

Write your own crawler/scraper with voyager, following a state machine model.

Example

The examples use tokio as their runtime, so your Cargo.toml could look like this:

[dependencies]
voyager = { version = "0.1" }
tokio = { version = "1.8", features = ["full"] }
# StreamExt is used to poll the Collector stream in the examples below
futures = "0.3"

Declare your own scraper and model:

// Declare your scraper, with all the selectors etc.
struct HackernewsScraper {
    post_selector: Selector,
    author_selector: Selector,
    title_selector: Selector,
    comment_selector: Selector,
    max_page: usize,
}

/// The state model
#[derive(Debug)]
enum HackernewsState {
    Page(usize),
    Post,
}

/// The output the scraper should eventually produce
#[derive(Debug)]
struct Entry {
    author: String,
    url: Url,
    link: Option<String>,
    title: String,
}
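
The examples further down construct the scraper with HackernewsScraper::default(). A minimal sketch of what such an impl could look like, assuming Selector::parse from the re-exported scraper crate; the CSS selector strings are illustrative assumptions, not values taken from the crate:

// Hypothetical Default impl relied on by the later examples.
// The selector strings are illustrative assumptions.
impl Default for HackernewsScraper {
    fn default() -> Self {
        Self {
            post_selector: Selector::parse("tr.athing").unwrap(),
            author_selector: Selector::parse("a.hnuser").unwrap(),
            title_selector: Selector::parse(".titleline > a").unwrap(),
            comment_selector: Selector::parse(".commtext").unwrap(),
            max_page: 1,
        }
    }
}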

Implement the voyager::Scraper trait

A Scraper consists of two associated types:

  • Output, the type the scraper eventually produces
  • State, a type the scraper can attach to requests and carry across the several requests that eventually lead to an Output

and the scrape callback, which is invoked after each received response.

Depending on the state attached to the response, you can supply the crawler with new urls to visit, with or without attaching state to them.

Scraping is done with the causal-agent/scraper crate.

impl Scraper for HackernewsScraper {
    type Output = Entry;
    type State = HackernewsState;

    /// do your scraping
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>> {
        let html = response.html();

        if let Some(state) = response.state {
            match state {
                HackernewsState::Page(page) => {
                    // find all entries
                    for id in html
                        .select(&self.post_selector)
                        .filter_map(|el| el.value().attr("id"))
                    {
                        // submit an url to a post
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/item?id={}", id),
                            HackernewsState::Post,
                        );
                    }
                    if page < self.max_page {
                        // queue in next page
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/news?p={}", page + 1),
                            HackernewsState::Page(page + 1),
                        );
                    }
                }

                HackernewsState::Post => {
                    // scrape the entry
                    let entry = Entry {
                        // ...
                    };
                    return Ok(Some(entry))
                }
            }
        }

        Ok(None)
    }
}
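
The HackernewsState::Post arm above elides the actual field extraction. As a rough, hypothetical sketch of how the causal-agent/scraper API can be used to pull the fields of an Entry out of the parsed page (the helper name, the selector handling and the placeholder url are illustrative assumptions, not part of voyager's API):

use url::Url;
use voyager::scraper::{Html, Selector};

// Hypothetical helper: extract an Entry from a parsed post page.
// Takes the selectors stored on HackernewsScraper; the url is a placeholder
// because this sketch has no access to the response.
fn extract_entry(
    html: &Html,
    author_selector: &Selector,
    title_selector: &Selector,
) -> Option<Entry> {
    let author = html
        .select(author_selector)
        .next()
        .map(|el| el.text().collect::<String>())?;
    let title = html
        .select(title_selector)
        .next()
        .map(|el| el.text().collect::<String>())?;
    Some(Entry {
        author,
        url: Url::parse("https://news.ycombinator.com/").ok()?,
        link: None,
        title,
    })
}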

Setup and collect all the output

Configure the crawler with a CrawlerConfig:

  • allow/block-list of domains
  • delays between requests
  • whether to respect the robots.txt rules

Feed your config and your scraper instance to the Collector, which drives the Crawler and forwards the responses to your Scraper.

use voyager::scraper::Selector;
use voyager::*;
use futures::StreamExt; // provides .next() on the Collector stream (tokio 1.x no longer ships tokio::stream)

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    
    // only fulfill requests to `news.ycombinator.com`
    let config = CrawlerConfig::default().allow_domain_with_delay(
        "news.ycombinator.com",
        // add a delay between requests
        RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
    );
    
    let mut collector = Collector::new(HackernewsScraper::default(), config);

    collector.crawler_mut().visit_with_state(
        "https://news.ycombinator.com/news",
        HackernewsState::Page(1),
    );

    while let Some(output) = collector.next().await {
        let post = output?;
        dbg!(post);
    }
    
    Ok(())
}

See the examples for more.

Inject async calls

Sometimes it might be helpful to execute some other calls first, for example to fetch a token. You can submit async closures to the crawler to manually get a response and inject a state, or to drive a state to completion.


fn scrape(
    &mut self,
    response: Response<Self::State>,
    crawler: &mut Crawler<Self>,
) -> Result<Option<Self::Output>> {

    // inject your custom crawl function that produces a `reqwest::Response` and `Self::State` which will get passed to `scrape` when resolved.
    crawler.crawl(move |client| async move {
        let state = response.state;
        let auth = client.post("some auth end point").send().await?.json().await?;
        // do other async tasks etc..
        let new_resp = client.get("the next html page").send().await?;
        Ok((new_resp, state))
    });
    
    // submit a crawling job that completes to `Self::Output` directly
    crawler.complete(move |client| async move {
        // do other async tasks to create a `Self::Output` instance
        let output = Self::Output{/*..*/};
        Ok(Some(output))
    });
    
    Ok(None)
}

Recover a lost state

If the crawler encounters an error, for example due to a failed or disallowed http request, the error is reported as a CrawlError that carries the last valid state. The error can then be downcast to recover that state.


let mut collector = Collector::new(HackernewsScraper::default(), config);

while let Some(output) = collector.next().await {
  match output {
    Ok(post) => {/**/}
    Err(err) => {
      // recover the state by downcasting the error
      if let Ok(err) = err.downcast::<CrawlError<<HackernewsScraper as Scraper>::State>>() {
        let last_state = err.state();
      }
    }
  }
}

License

Licensed under either of the Apache License, Version 2.0 or the MIT license, at your option.
