3 versions
new 0.3.2 | Aug 3, 2024 |
---|---|
0.3.1 | Jul 13, 2024 |
0.3.0 | Jul 13, 2024 |
#1562 in Procedural macros
362 downloads per month
Used in reqwest-scraper
20KB
507 lines
reqwest-scraper - Web scraping integration with reqwest
Extends reqwest to support multiple web scraping methods.
Features
Getting Started
- Add the dependencies
reqwest = { version = "0.12", features = ["json"] }
reqwest-scraper = "0.3.1"
- Use ScraperResponse
use reqwest_scraper::ScraperResponse;
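Importing `ScraperResponse` is what makes calls such as `.jsonpath()`, `.css_selector()` and `.xpath()` available on a reqwest response, which suggests it is an extension trait. The sketch below illustrates that pattern with the standard library only. `FakeResponse`, `WordDoc` and `ScrapeExt` are hypothetical stand-ins invented for illustration, not types from reqwest or reqwest-scraper.

```rust
// Illustrative sketch only: stand-in types, not the real definitions
// from reqwest or reqwest-scraper.

/// Stand-in for a response type that lives in another crate.
pub struct FakeResponse {
    body: String,
}

impl FakeResponse {
    pub fn new(body: &str) -> Self {
        Self { body: body.to_string() }
    }

    /// Consumes the response and returns its body.
    pub fn text(self) -> String {
        self.body
    }
}

/// Hypothetical parsed document: the whitespace-separated words of the body.
pub struct WordDoc {
    words: Vec<String>,
}

impl WordDoc {
    /// Returns every word that starts with `prefix`.
    pub fn select(&self, prefix: &str) -> Vec<&str> {
        self.words
            .iter()
            .map(String::as_str)
            .filter(|w| w.starts_with(prefix))
            .collect()
    }
}

/// Extension trait: adds a parsing method to a type we do not own.
pub trait ScrapeExt {
    fn words(self) -> WordDoc;
}

impl ScrapeExt for FakeResponse {
    fn words(self) -> WordDoc {
        let words = self
            .text()
            .split_whitespace()
            .map(str::to_string)
            .collect();
        WordDoc { words }
    }
}
```

With the trait in scope, `FakeResponse::new("rust reqwest rustls").words().select("ru")` chains in the same way the examples on this page chain a parsing call onto the response. Without the `use` line the method is not visible, which is why the import is listed as a setup step.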
JsonPath
Json::select<T: DeserializeOwned>(path: &str) -> Result<Vec<T>>
Json::select_one<T: DeserializeOwned>(path: &str) -> Result<T>
Json::select_as_str(path: &str) -> Result<String>
Example:
use reqwest_scraper::ScraperResponse;

pub async fn request() -> Result<()> {
    let json = reqwest::Client::builder()
        .build()?
        .get("https://api.github.com/search/repositories?q=rust")
        .header("User-Agent", "Rust Reqwest")
        .send()
        .await?
        .jsonpath()
        .await?;
    let total_count = json.select_as_str("$.total_count")?;
    let names: Vec<String> = json.select("$.items[*].full_name")?;
    println!("{}", total_count);
    println!("{}", names.join("\t"));
    Ok(())
}
CSS selector
Html::select(selector: &str) -> Result<Selectable>
Selectable::iter() -> impl Iterator<SelectItem>
Selectable::first() -> Option<SelectItem>
SelectItem::name() -> &str
SelectItem::id() -> Option<&str>
SelectItem::has_class(class: &str, case_sensitive: CaseSensitivity) -> bool
SelectItem::classes() -> Classes
SelectItem::attrs() -> Attrs
SelectItem::attr(attr: &str) -> Option<&str>
SelectItem::text() -> String
SelectItem::html() -> String
SelectItem::inner_html() -> String
SelectItem::children() -> impl Iterator<SelectItem>
SelectItem::find(selector: &str) -> Result<Selectable>
Example:
use reqwest_scraper::ScraperResponse;

async fn request() -> Result<()> {
    let html = reqwest::get("https://github.com/holmofy")
        .await?
        .css_selector()
        .await?;
    assert_eq!(
        html.select(".p-name")?.iter().nth(0).unwrap().text().trim(),
        "holmofy"
    );
    let select_result = html.select(".vcard-details > li.vcard-detail")?;
    for detail_item in select_result.iter() {
        println!("{}", detail_item.attr("aria-label").unwrap())
    }
    Ok(())
}
XPath
XHtml::select(xpath: &str) -> Result<XPathResult>
XPathResult::as_nodes() -> Vec<Node>
XPathResult::as_strs() -> Vec<String>
XPathResult::as_node() -> Option<Node>
XPathResult::as_str() -> Option<String>
Node::name() -> String
Node::id() -> Option<String>
Node::classes() -> HashSet<String>
Node::attr(attr: &str) -> Option<String>
Node::has_attr(attr: &str) -> bool
Node::text() -> String
- TODO: Node::html() -> String
- TODO: Node::inner_html() -> String
Node::children() -> Vec<Node>
Node::findnodes(relative_xpath: &str) -> Result<Vec<Node>>
Node::findvalues(relative_xpath: &str) -> Result<Vec<String>>
Node::findnode(relative_xpath: &str) -> Result<Option<Node>>
Node::findvalue(relative_xpath: &str) -> Result<Option<String>>
Example:
async fn request() -> Result<()> {
    let html = reqwest::get("https://github.com/holmofy")
        .await?
        .xpath()
        .await?;
    // simple extract element
    let name = html
        .select("//span[contains(@class,'p-name')]")?
        .as_node()
        .unwrap()
        .text();
    println!("{}", name);
    assert_eq!(name.trim(), "holmofy");
    // iterate elements
    let select_result = html
        .select("//ul[contains(@class,'vcard-details')]/li[contains(@class,'vcard-detail')]")?
        .as_nodes();
    println!("{}", select_result.len());
    for item in select_result.into_iter() {
        let attr = item.attr("aria-label").unwrap_or_else(|| "".into());
        println!("{}", attr);
        println!("{}", item.text());
    }
    // attribute extract
    let select_result = html
        .select("//ul[contains(@class,'vcard-details')]/li[contains(@class,'vcard-detail')]/@aria-label")?
        .as_strs();
    println!("{}", select_result.len());
    select_result.into_iter().for_each(|s| println!("{}", s));
    Ok(())
}
Derive macro extraction
Use FromCssSelector and selector to extract HTML elements into a struct
// define struct and derive the FromCssSelector trait
#[derive(Debug, FromCssSelector)]
#[selector(path = "#user-repositories-list > ul > li")]
struct Repo {
    #[selector(path = "a[itemprop~='name']", default = "<unname>", text)]
    name: String,
    #[selector(path = "span[itemprop~='programmingLanguage']", text)]
    program_lang: Option<String>,
    #[selector(path = "div.topics-row-container>a", text)]
    topics: Vec<String>,
}

// request
let html = reqwest::get("https://github.com/holmofy?tab=repositories")
    .await?
    .css_selector()
    .await?;
// Use the generated `from_html` method to extract data into the struct
let items = Repo::from_html(html)?;
items.iter().for_each(|item| println!("{:?}", item));
Use FromXPath and xpath to extract HTML elements into a struct
// define struct and derive the FromXPath trait
#[derive(Debug, FromXPath)]
#[xpath(path = "//div[@id='user-repositories-list']/ul/li")]
struct Repo {
    #[xpath(path = ".//a[contains(@itemprop,'name')]/text()", default = "<unname>")]
    name: String,
    #[xpath(path = ".//span[contains(@itemprop,'programmingLanguage')]/text()")]
    program_lang: Option<String>,
    #[xpath(path = ".//div[contains(@class,'topics-row-container')]/a/text()")]
    topics: Vec<String>,
}

let html = reqwest::get("https://github.com/holmofy?tab=repositories")
    .await?
    .xpath()
    .await?;
// Use the generated `from_xhtml` method to extract data into the struct
let items = Repo::from_xhtml(html)?;
items.iter().for_each(|item| println!("{:?}", item));
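The `Repo` structs above use three field shapes: a `String` with a `default`, an `Option<String>`, and a `Vec<String>`. The sketch below shows one plausible reading of how matched values could map onto those shapes. It is an assumption about the semantics, not the code the derive macros generate, and the helper names (`all_matches`, `first_match`, `first_or_default`, `build_repo`) are hypothetical.

```rust
// Sketch under stated assumptions: these helpers are hypothetical and are
// not part of reqwest-scraper or its generated code.

/// `Vec<String>` field: keep every matched value.
pub fn all_matches(matches: &[&str]) -> Vec<String> {
    matches.iter().map(|m| m.to_string()).collect()
}

/// `Option<String>` field: the first match, or `None` when nothing matched.
pub fn first_match(matches: &[&str]) -> Option<String> {
    matches.first().map(|m| m.to_string())
}

/// `String` field with `default = "..."`: the first match, else the default.
pub fn first_or_default(matches: &[&str], default: &str) -> String {
    first_match(matches).unwrap_or_else(|| default.to_string())
}

/// Same field layout as the `Repo` struct in the examples above.
#[derive(Debug, PartialEq)]
pub struct Repo {
    pub name: String,
    pub program_lang: Option<String>,
    pub topics: Vec<String>,
}

/// Builds one `Repo` from the values matched for each field.
pub fn build_repo(names: &[&str], langs: &[&str], topics: &[&str]) -> Repo {
    Repo {
        name: first_or_default(names, "<unname>"),
        program_lang: first_match(langs),
        topics: all_matches(topics),
    }
}
```

Under this reading, a list item with no matching name element would still produce a `Repo`, with `name` set to `<unname>`, while a missing language leaves `program_lang` as `None` and missing topics leave `topics` empty.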
Related projects
Dependencies
~3.5–9.5MB
~86K SLoC