#web-crawler #request #rate-limiting #completely #declarative #async #scrape

quick_crawler

QuickCrawler is a Rust crate that provides a completely async, declarative web crawler, with domain-specific request rate limiting built in.

3 releases

0.1.2 Jun 15, 2020
0.1.1 Apr 4, 2020
0.1.0 Feb 19, 2020

#744 in Asynchronous

Apache-2.0/MIT

52KB
729 lines

QuickCrawler

QuickCrawler is a Rust crate that provides a completely async, declarative web crawler, with domain-specific request rate limiting built in.

Example

Imagine you are trying to crawl a subset of pages for a given domain:

https://bike-site.com/search?q=red-bikes

and a regular GET request to that page returns:

<html>
    <body>
        <div>
            <a class="bike-item" href="https://bike-site.com/red-bike-1">
                cool red bike 1
            </a>
            <a class="bike-item" href="https://bike-site.com/red-bike-2">
                cool red bike 2
            </a>
            <a class="bike-item" href="https://bike-site.com/red-bike-3">
                cool red bike 3
            </a>
            <div>
                <a class="bike-other-item" href="https://bike-site.com/other-red-bike-4">
                    other cool red bike 4
                </a>
            </div>
        </div>
    </body>
</html>

and when you navigate to the first 3 links on that page, each of those pages returns:

<html>
    <body>
        <div class='awesome-bike'>
            <div class='bike-info'>
                The best bike ever.
            </div>
            <ul class='bike-specs'>
                <li>
                    Super fast.
                </li>
                <li>
                    Jumps high.
                </li>
            </ul>
        </div>
    </body>
</html>

and when you navigate to the last link on that page, it returns:

<html>
    <body>
        <div class='other-bike'>
            <div class='other-bike-info'>
                The best bike ever.
            </div>
            <ul class='other-bike-specs'>
                <li>
                    Super slow.
                </li>
                <li>
                    Doesn't jump.
                </li>
            </ul>
        </div>
    </body>
</html>

QuickCrawler helps you declaratively crawl and scrape data from each of those pages with ease:

use quick_crawler::{
    QuickCrawler,
    QuickCrawlerBuilder,
    QuickCrawlerError,
    RequestHandlerConfig,
    limiter::Limiter,
    scrape::{
        ResponseLogic::Parallel,
        StartUrl,
        Scrape,
        ElementUrlExtractor,
        ElementDataExtractor
    }
};


fn main() {
    let mut builder = QuickCrawlerBuilder::new();


    let start_urls = vec![
        StartUrl::new()
            .url("https://bike-site.com/search?q=red-bikes")
            .method("GET")
            .response_logic(Parallel(vec![
                // All Scrapers below will be provided the html page response body
                Scrape::new()
                    .find_elements_with_urls(".bike-item")
                    .extract_urls_from_elements(ElementUrlExtractor::Attr("href".to_string()))
                    // now setup the logic to execute on each of the return html pages
                    .response_logic(Parallel(vec![
                        Scrape::new()
                            .find_elements_with_data(".awesome-bike .bike-info")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store bike info in DB: {:?}", vec);
                            }),
                        Scrape::new()
                            .find_elements_with_data(".bike-specs li")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store bike specs in DB: {:?}", vec);
                            }),
                    ])),
                Scrape::new()
                    .find_elements_with_urls(".bike-other-item")
                    .extract_urls_from_elements(ElementUrlExtractor::Attr("href".to_string()))
                    .response_logic(Parallel(vec![
                        Scrape::new()
                            .find_elements_with_data(".other-bike .other-bike-info")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store other bike info in DB: {:?}", vec);
                            }),
                        Scrape::new()
                            .find_elements_with_data(".other-bike-specs li")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store other bike specs in DB: {:?}", vec);
                            }),
                    ]))
            ]))
        // more StartUrl::new 's if you feel ambitious
    ];

    // It's smart to use a limiter - for now it is automatically set to 3 requests per second per domain.
    // This will soon be configurable.

    let limiter = Limiter::new();

    builder
        .with_start_urls(
            start_urls
        )
        .with_limiter(
            limiter
        )
        // Optionally configure how to make a request and return an html string
        .with_request_handler(
            |config: RequestHandlerConfig| async move {
                // ... use any request library, like reqwest
                surf::get(config.url.clone()).recv_string().await.map_err(|_| QuickCrawlerError::RequestErr)
            }
        );
    let crawler = builder.finish().map_err(|_| "Builder could not finish").expect("no error");
    
    // QuickCrawler is async, so choose your favorite executor.
    // (Tested and working for both async-std and tokio)
    let res = async_std::task::block_on(async {
        crawler.process().await
    });

}
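As the comment in the example notes, the request handler can be backed by any HTTP client, as long as the closure resolves to the same Result<String, QuickCrawlerError> that the surf version does. A minimal sketch of the handler swapped over to reqwest (an illustration, not part of the crate; note that reqwest's async client expects to run inside a tokio runtime):

builder
    .with_request_handler(
        |config: RequestHandlerConfig| async move {
            // Hypothetical reqwest-based handler: fetch the page body and map
            // any failure to the crate's RequestErr, like the surf version above.
            match reqwest::get(config.url.clone()).await {
                Ok(resp) => resp.text().await.map_err(|_| QuickCrawlerError::RequestErr),
                Err(_) => Err(QuickCrawlerError::RequestErr),
            }
        }
    );

And since the crawler is executor-agnostic (the comment above notes it is tested on both async-std and tokio), the final step could equally be driven by a tokio runtime. A sketch, assuming tokio 1.x with the rt-multi-thread feature enabled and the same `crawler` value built above:

// Same final step as in the example, but on a tokio runtime instead of async-std.
let rt = tokio::runtime::Runtime::new().expect("failed to build tokio runtime");
let res = rt.block_on(async {
    crawler.process().await
});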

Contribute

Clone the repo.

Run the tests:

cargo watch -x check -x 'test -- --nocapture'
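
The command above uses cargo-watch to re-run checks and tests on every file change; if it isn't installed yet, add it first:

cargo install cargo-watch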

See the tests in src/lib.rs for example usage.

Thanks for using it!

If you use this crate and it helps your project, please star it!

License

MIT

Dependencies

~16–29MB
~466K SLoC