#web-scraping #reqwest #integration #response #html #macro #css-selectors

macro reqwest-scraper-macros

Web scraping integration with reqwest

3 releases

new 0.3.2 Aug 3, 2024
0.3.1 Jul 13, 2024
0.3.0 Jul 13, 2024

#1562 in Procedural macros

Download history 227/week @ 2024-07-11 12/week @ 2024-07-18 4/week @ 2024-07-25 119/week @ 2024-08-01

362 downloads per month
Used in reqwest-scraper

MIT license

20KB
507 lines

reqwest-scraper - Web scraping integration with reqwest

Extends reqwest to support multiple web scraping methods.

Features

Getting Started

  • Add the dependencies
    reqwest = { version = "0.12", features = ["json"] }
    reqwest-scraper="0.3.1"

  • Use ScraperResponse
    use reqwest_scraper::ScraperResponse;


JsonPath

  • Json::select<T: DeserializeOwned>(path: &str) -> Result<Vec<T>>
  • Json::select_one<T: DeserializeOwned>(path: &str) -> Result<T>
  • Json::select_as_str(path: &str) -> Result<String>

Example:

use reqwest_scraper::ScraperResponse;

pub async fn request() -> Result<()> {
    let json = reqwest::Client::builder()
        .build()?
        .get("https://api.github.com/search/repositories?q=rust")
        .header("User-Agent", "Rust Reqwest")
        .send()
        .await?
        .jsonpath()
        .await?;

    let total_count = json.select_as_str("$.total_count")?;
    let names: Vec<String> = json.select("$.items[*].full_name")?;

    println!("{}", total_count);
    println!("{}", names.join("\t"));

    Ok(())
}
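
To make the two paths in the example concrete, here is a minimal sketch of how `$.total_count` and `$.items[*].full_name` walk a document. It uses only the standard library and a hypothetical `Json` enum and `select` function defined for illustration; it is not how reqwest-scraper evaluates JsonPath, and it supports only `$`, `.key`, and `[*]`.

```rust
use std::collections::BTreeMap;

/// A tiny JSON value model, just enough for this sketch (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

/// Evaluate a very small JSONPath subset: `$`, `.key`, and `key[*]`.
/// Returns every value the path matches, in document order.
fn select<'a>(root: &'a Json, path: &str) -> Vec<&'a Json> {
    let mut current = vec![root];
    let rest = path.strip_prefix('$').unwrap_or(path);
    // Splitting on '.' turns "$.items[*].full_name" into "items[*]" and "full_name".
    for segment in rest.split('.').filter(|s| !s.is_empty()) {
        let (key, wildcard) = match segment.strip_suffix("[*]") {
            Some(k) => (k, true),
            None => (segment, false),
        };
        let mut next = Vec::new();
        for value in current {
            // Step into the object member named `key`, if there is one.
            let member = match value {
                Json::Obj(map) => map.get(key),
                _ => None,
            };
            match (member, wildcard) {
                // `[*]` fans out over every element of an array.
                (Some(Json::Arr(items)), true) => next.extend(items.iter()),
                // A plain key keeps the member itself.
                (Some(other), false) => next.push(other),
                // Missing keys and `[*]` on non-arrays match nothing.
                _ => {}
            }
        }
        current = next;
    }
    current
}
```

The fan-out at `[*]` is why a path such as `$.items[*].full_name` naturally yields a list, which lines up with `select` returning `Result<Vec<T>>` in the API list, while a path that matches a single value fits `select_one` or `select_as_str`.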

CSS selector

  • Html::select(selector: &str) -> Result<Selectable>
  • Selectable::iter() -> impl Iterator<SelectItem>
  • Selectable::first() -> Option<SelectItem>
  • SelectItem::name() -> &str
  • SelectItem::id() -> Option<&str>
  • SelectItem::has_class(class: &str, case_sensitive: CaseSensitivity) -> bool
  • SelectItem::classes() -> Classes
  • SelectItem::attrs() -> Attrs
  • SelectItem::attr(attr: &str) -> Option<&str>
  • SelectItem::text() -> String
  • SelectItem::html() -> String
  • SelectItem::inner_html() -> String
  • SelectItem::children() -> impl Iterator<SelectItem>
  • SelectItem::find(selector: &str) -> Result<Selectable>

Example:

use reqwest_scraper::ScraperResponse;

async fn request() -> Result<()> {
    let html = reqwest::get("https://github.com/holmofy")
        .await?
        .css_selector()
        .await?;

    assert_eq!(
        html.select(".p-name")?.iter().nth(0).unwrap().text().trim(),
        "holmofy"
    );

    let select_result = html.select(".vcard-details > li.vcard-detail")?;

    for detail_item in select_result.iter() {
        println!("{}", detail_item.attr("aria-label").unwrap())
    }

    Ok(())
}
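
The loop in the example calls `.unwrap()` on `attr("aria-label")`, which panics if a matched item has no such attribute. Because `attr` is listed as returning `Option<&str>`, a fallback can be written with ordinary `Option` combinators. The sketch below uses only the standard library; `label_or` is a hypothetical helper name, not part of reqwest-scraper.

```rust
/// Return the trimmed attribute value, or `fallback` when the attribute is
/// missing or blank. `label_or` is a hypothetical helper for illustration.
fn label_or<'a>(attr: Option<&'a str>, fallback: &'a str) -> &'a str {
    attr.map(str::trim)
        // Treat whitespace-only values the same as a missing attribute.
        .filter(|value| !value.is_empty())
        .unwrap_or(fallback)
}
```

Inside the loop this would read `label_or(detail_item.attr("aria-label"), "<no label>")` in place of the `unwrap()` call.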

XPath

  • XHtml::select(xpath: &str) -> Result<XPathResult>
  • XPathResult::as_nodes() -> Vec<Node>
  • XPathResult::as_strs() -> Vec<String>
  • XPathResult::as_node() -> Option<Node>
  • XPathResult::as_str() -> Option<String>
  • Node::name() -> String
  • Node::id() -> Option<String>
  • Node::classes() -> HashSet<String>
  • Node::attr(attr: &str) -> Option<String>
  • Node::has_attr(attr: &str) -> bool
  • Node::text() -> String
  • TODO: Node::html() -> String
  • TODO: Node::inner_html() -> String
  • Node::children() -> Vec<Node>
  • Node::findnodes(relative_xpath: &str) -> Result<Vec<Node>>
  • Node::findvalues(relative_xpath: &str) -> Result<Vec<String>>
  • Node::findnode(relative_xpath: &str) -> Result<Option<Node>>
  • Node::findvalue(relative_xpath: &str) -> Result<Option<String>>

Example:

use reqwest_scraper::ScraperResponse;

async fn request() -> Result<()> {
    let html = reqwest::get("https://github.com/holmofy")
        .await?
        .xpath()
        .await?;

    // simple extract element
    let name = html
        .select("//span[contains(@class,'p-name')]")?
        .as_node()
        .unwrap()
        .text();
    println!("{}", name);
    assert_eq!(name.trim(), "holmofy");

    // iterate elements
    let select_result = html
        .select("//ul[contains(@class,'vcard-details')]/li[contains(@class,'vcard-detail')]")?
        .as_nodes();

    println!("{}", select_result.len());

    for item in select_result.into_iter() {
        let attr = item.attr("aria-label").unwrap_or_else(|| "".into());
        println!("{}", attr);
        println!("{}", item.text());
    }

    // attribute extract
    let select_result = html
        .select("//ul[contains(@class,'vcard-details')]/li[contains(@class,'vcard-detail')]/@aria-label")?
        .as_strs();

    println!("{}", select_result.len());
    select_result.into_iter().for_each(|s| println!("{}", s));

    Ok(())
}
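
One detail of the expressions above is worth spelling out: XPath 1.0 `contains()` is a substring test, so `contains(@class,'vcard-detail')` is also true for an element whose class is `vcard-details`. In the example the `li` step is nested under the `ul` step, so this does no harm, but in other expressions it can over-match. A common idiom for matching a whole class token is `contains(concat(' ', normalize-space(@class), ' '), ' vcard-detail ')`. The sketch below approximates both behaviours on plain strings using only the standard library; the function names are hypothetical and nothing here calls reqwest-scraper.

```rust
/// Mirrors XPath `contains(@class, needle)`: a plain substring test.
fn class_contains(class_attr: &str, needle: &str) -> bool {
    class_attr.contains(needle)
}

/// Approximates the idiom
/// `contains(concat(' ', normalize-space(@class), ' '), concat(' ', name, ' '))`:
/// true only when `name` is one whole whitespace-separated class token.
fn has_class_token(class_attr: &str, name: &str) -> bool {
    class_attr.split_whitespace().any(|token| token == name)
}
```

With the class string `vcard-details`, `class_contains` accepts `vcard-detail` while `has_class_token` rejects it, which is the difference between the two XPath forms.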

Derive macro extract

Use FromCssSelector and the selector attribute to extract HTML elements into a struct

// define struct and derive the FromCssSelector trait
#[derive(Debug, FromCssSelector)]
#[selector(path = "#user-repositories-list > ul > li")]
struct Repo {
    #[selector(path = "a[itemprop~='name']", default = "<unname>", text)]
    name: String,

    #[selector(path = "span[itemprop~='programmingLanguage']", text)]
    program_lang: Option<String>,

    #[selector(path = "div.topics-row-container>a", text)]
    topics: Vec<String>,
}

// request
let html = reqwest::get("https://github.com/holmofy?tab=repositories")
    .await?
    .css_selector()
    .await?;

// Use the generated `from_html` method to extract data into the struct
let items = Repo::from_html(html)?;
items.iter().for_each(|item| println!("{:?}", item));
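
The three field types in `Repo` suggest three cardinalities: a `String` with a `default` (one value, with a fallback), an `Option<String>` (zero or one), and a `Vec<String>` (every match). The sketch below shows that mapping over a list of already extracted texts. It is an assumption about the intended behaviour, not the code the macro generates; the function names are hypothetical, and taking the first match when several exist is also an assumption.

```rust
/// Zero or one: the first match, if any
/// (assumed behaviour for an `Option<String>` field).
fn optional(matches: &[String]) -> Option<String> {
    matches.first().cloned()
}

/// One value: the first match, or the `default = "..."` text when nothing
/// matched (assumed behaviour for a `String` field with a default).
fn required(matches: &[String], default: &str) -> String {
    optional(matches).unwrap_or_else(|| default.to_string())
}

/// Zero or more: every match, in document order
/// (assumed behaviour for a `Vec<String>` field).
fn many(matches: &[String]) -> Vec<String> {
    matches.to_vec()
}
```

Read this way, `name` falls back to `<unname>` when the link is missing, `program_lang` is `None` for repositories without a detected language, and `topics` is simply empty when no topic links are present.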

Use FromXPath and the xpath attribute to extract HTML elements into a struct

// define struct and derive the FromXPath trait
#[derive(Debug, FromXPath)]
#[xpath(path = "//div[@id='user-repositories-list']/ul/li")]
struct Repo {
    #[xpath(path = ".//a[contains(@itemprop,'name')]/text()", default = "<unname>")]
    name: String,

    #[xpath(path = ".//span[contains(@itemprop,'programmingLanguage')]/text()")]
    program_lang: Option<String>,

    #[xpath(path = ".//div[contains(@class,'topics-row-container')]/a/text()")]
    topics: Vec<String>,
}

let html = reqwest::get("https://github.com/holmofy?tab=repositories")
    .await?
    .xpath()
    .await?;

// Use the generated `from_xhtml` method to extract data into the struct
let items = Repo::from_xhtml(html)?;
items.iter().for_each(|item| println!("{:?}", item));

Dependencies

~3.5–9.5MB
~86K SLoC