QuickCrawler
QuickCrawler is a Rust crate that provides a completely async, declarative web crawler with built-in domain-specific request rate limiting.
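To follow along with the example below, your Cargo.toml might look roughly like this (the version numbers are assumptions based on the crates the example uses; check crates.io for current releases):

[dependencies]
quick_crawler = "0.1.2"
async-std = "1"
surf = "1"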
Example
Suppose you are trying to crawl a subset of pages for a given domain:
https://bike-site.com/search?q=red-bikes
and a regular GET request will return:
<html>
<body>
<div>
<a class="bike-item" href="https://bike-site.com/red-bike-1">
cool red bike 1
</a><a class="bike-item" href="https://bike-site.com/red-bike-2">
cool red bike 2
</a>
<a class="bike-item" href="https://bike-site.com/red-bike-3">
cool red bike 3
</a>
<div>
<a class="bike-other-item" href="https://bike-site.com/other-red-bike-4">
other cool red bike 4
</a>
</div>
</div>
</body>
</html>
and when you navigate to links 1 through 3 on that page, each linked page will return:
<html>
<body>
<div class='awesome-bike'>
<div class='bike-info'>
The best bike ever.
</div>
<ul class='bike-specs'>
<li>
Super fast.
</li>
<li>
Jumps high.
</li>
</ul>
</div>
</body>
</html>
and when you navigate to the last link on that page, it will return:
<html>
<body>
<div class='other-bike'>
<div class='other-bike-info'>
The best bike ever.
</div>
<ul class='other-bike-specs'>
<li>
Super slow.
</li>
<li>
Doesn't jump.
</li>
</ul>
</div>
</body>
</html>
QuickCrawler declaratively helps you crawl and scrape data from each of those pages with ease. Note that the last link carries a different class (bike-other-item) and points to a page with a different structure, which is why the example below branches into two parallel Scrape definitions:
use quick_crawler::{
    QuickCrawler,
    QuickCrawlerBuilder,
    RequestHandlerConfig,
    QuickCrawlerError,
    limiter::Limiter,
    scrape::{
        ResponseLogic::Parallel,
        StartUrl,
        Scrape,
        ElementUrlExtractor,
        ElementDataExtractor
    }
};

fn main() {
    let mut builder = QuickCrawlerBuilder::new();

    let start_urls = vec![
        StartUrl::new()
            .url("https://bike-site.com/search?q=red-bikes")
            .method("GET")
            .response_logic(Parallel(vec![
                // All Scrapers below will be provided the html page response body
                Scrape::new()
                    .find_elements_with_urls(".bike-item")
                    .extract_urls_from_elements(ElementUrlExtractor::Attr("href".to_string()))
                    // now set up the logic to execute on each of the returned html pages
                    .response_logic(Parallel(vec![
                        Scrape::new()
                            .find_elements_with_data(".awesome-bike .bike-info")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store bike info in DB: {:?}", vec);
                            }),
                        Scrape::new()
                            .find_elements_with_data(".bike-specs li")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store bike specs in DB: {:?}", vec);
                            }),
                    ])),
                Scrape::new()
                    .find_elements_with_urls(".bike-other-item")
                    .extract_urls_from_elements(ElementUrlExtractor::Attr("href".to_string()))
                    .response_logic(Parallel(vec![
                        Scrape::new()
                            .find_elements_with_data(".other-bike .other-bike-info")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store other bike info in DB: {:?}", vec);
                            }),
                        Scrape::new()
                            .find_elements_with_data(".other-bike-specs li")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store other bike specs in DB: {:?}", vec);
                            }),
                    ]))
            ])
        )
        // more StartUrl::new's if you feel ambitious
    ];

    // It's smart to use a limiter - for now it's automatically set to 3 requests per second per domain.
    // This will soon be configurable.
    let limiter = Limiter::new();

    builder
        .with_start_urls(start_urls)
        .with_limiter(limiter)
        // Optionally configure how to make a request and return an html string
        .with_request_handler(|config: RequestHandlerConfig| async move {
            // ... use any request library, like surf (used here) or reqwest
            surf::get(config.url.clone()).recv_string().await.map_err(|_| QuickCrawlerError::RequestErr)
        });

    let crawler = builder.finish().map_err(|_| "Builder could not finish").expect("no error");

    // QuickCrawler is async, so choose your favorite executor.
    // (Tested and working for both async-std and tokio)
    let res = async_std::task::block_on(async {
        crawler.process().await
    });
}
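Since the crawler only needs some executor to block on its process() future, the same example can be driven by tokio instead of async-std. Here is a minimal sketch, assuming the crawler is built exactly as above and tokio (with its macros and runtime features) is added as a dependency:

// Sketch: driving the same crawl from a tokio runtime instead of async-std.
#[tokio::main]
async fn main() {
    // ... build `builder` and `crawler` exactly as in the example above ...
    let res = crawler.process().await;
}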
Contribute
Clone the repo.
Run the tests:
cargo watch -x check -x 'test -- --nocapture'
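This relies on the cargo-watch subcommand, which doesn't ship with Cargo; if you don't have it yet, install it once with:

cargo install cargo-watch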
See the tests in src/lib.rs for example usage.
Thanks for using it!
If you use this crate and it helps your project, please give it a star!
License
MIT