8 versions

0.2.6 April 4, 2024
0.2.5 February 17, 2024
0.2.4 January 27, 2024
0.1.0 January 10, 2024

#1613 in Web programming


84 downloads per month

MIT/Apache

52KB
1K SLoC

DOM_FINDER


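dom_finder is a Rust crate for finding elements within the Document Object Model (DOM) of HTML documents. It lets you locate specific elements with CSS selectors, extract data from them, and transform that data before the result is returned.

At the moment, this functionality relies on a YAML configuration.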

Examples

General


use dom_finder::{Config, Finder, Value};

const CFG_YAML: &str = r"
name: root
base_path: html
children:
  - name: results
    base_path: div.serp__results div.result
    many: true
    children:
      - name: url
        base_path: h2.result__title > a[href]
        extract: href
      - name: title
        base_path: h2.result__title
        extract: text
      - name: snippet
        base_path: a.result__snippet
        extract: html
        pipeline: [ [ policy_highlight ] ]
";

const HTML_DOC: &str = include_str!("../test_data/page_0.html");


fn main() {
    // Load the config from a YAML string; the &str can come from a file or a buffer.
    let cfg = Config::from_yaml(CFG_YAML).unwrap();
    // Creating a new Finder instance
    let finder = Finder::new(&cfg).unwrap();

    // or in one line:
    // let finder: Finder = Config::from_yaml(CFG_YAML).unwrap().try_into().unwrap();
    
    // Parse the HTML string (a &str) and get the result as a `Value`.
    // The `Value` returned by `parse` is always a `Value::Object` with a single (String) key.
    let results: Value = finder.parse(HTML_DOC);

    // From the `Value` we can navigate to a nested value by path,
    // similar to `gjson`, although the path syntax supported by `Value` is simpler.
    // For more examples, see the `tests/` folder.

    // Get the number of results with the `from_path` method.
    // We know that `results` is a `Value::Array`,
    // because the config sets `many: true` for `results`.
    // If the value is an array (a vector internally), it can be queried with `#` or a non-negative index.
    let raw_count = results.from_path("root.results.#").unwrap();
    let count_opt: Option<i64> = raw_count.into();
    assert_eq!(count_opt.unwrap(), 21);


    // Get a specific value and convert it to a concrete type.
    // In the same way, all URLs inside the `results` array can be retrieved
    // with the path `root.results.#.url`.
    // If there is no `url` key, or its value is not a `Value::String`,
    // the result is `None`; otherwise it is `Some`.
    let url: String = results
        .from_path("root.results.0.url")
        .and_then(|v| v.into())
        .unwrap();
    assert_eq!(url, "https://ethereum.org/en/");

    // A `Value` can also be serialized with any serde serializer
    // (JSON or any other available format).
    // This is useful for sending parsed data in an HTTP response
    // or storing it in a database.
    let serialized = serde_json::to_string(&results).unwrap();
}
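
The path queries used above (`root.results.#`, `root.results.0.url`, `root.results.#.url`) can be illustrated with a small standalone model. This is only a sketch of the semantics described in the comments, not dom_finder's implementation: the `Node` type, the `lookup` function, and the `sample` data are hypothetical stand-ins, and the handling of missing keys is an assumption.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a parsed result tree (NOT dom_finder's `Value`).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Int(i64),
    List(Vec<Node>),
    Map(BTreeMap<String, Node>),
}

/// Resolve a dot-separated path against a `Node`.
/// On a list, `#` as the last segment yields the length, `#` followed by
/// more segments applies the rest of the path to every element, and a
/// number selects one element by index.
fn lookup(node: &Node, path: &str) -> Option<Node> {
    if path.is_empty() {
        return Some(node.clone());
    }
    // Split off the first segment; `rest` is empty on the last segment.
    let (head, rest) = match path.split_once('.') {
        Some((h, r)) => (h, r),
        None => (path, ""),
    };
    match node {
        Node::Map(map) => lookup(map.get(head)?, rest),
        Node::List(items) if head == "#" => {
            if rest.is_empty() {
                Some(Node::Int(items.len() as i64))
            } else {
                // Assumption: elements without the requested key are skipped.
                Some(Node::List(
                    items.iter().filter_map(|n| lookup(n, rest)).collect(),
                ))
            }
        }
        Node::List(items) => lookup(items.get(head.parse::<usize>().ok()?)?, rest),
        // Scalars have no children to descend into.
        _ => None,
    }
}

/// Hypothetical data shaped like the result of the example config.
fn sample() -> Node {
    let entry = |url: &str| {
        Node::Map(BTreeMap::from([(
            "url".to_string(),
            Node::Text(url.to_string()),
        )]))
    };
    let results = Node::List(vec![
        entry("https://example.com/a"),
        entry("https://example.com/b"),
    ]);
    let root = Node::Map(BTreeMap::from([("results".to_string(), results)]));
    Node::Map(BTreeMap::from([("root".to_string(), root)]))
}
```

With this model, `lookup(&sample(), "root.results.#")` gives the element count and `lookup(&sample(), "root.results.0.url")` gives the first URL, mirroring the two `from_path` calls in the example.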

Removing selections

use dom_finder::{Config, Finder};
use dom_query::Document;


const HTML_DOC: &str = include_str!("../test_data/page_0.html");


fn main() {

  // Create the finder, as in the previous example
  let cfg_yaml = r"
  name: root
  base_path: html
  children:
  - name: feedback
    base_path: div#links.results div.feedback-btn
    extract: html
    remove_selection: true
    pipeline: [ [ trim_space ] ]
  ";
  let cfg = Config::from_yaml(cfg_yaml).unwrap();
  let finder = Finder::new(&cfg).unwrap();

  // Create dom_query::Document
  let doc = Document::from(HTML_DOC);

  // Parse the document.
  // Because `remove_selection` is set, the matched selection is removed from the document,
  // but its value is still available in the result.
  let res = finder.parse_document(&doc);
  let feedback_caption: Option<String> = res.from_path("root.feedback").unwrap().into();
  assert_eq!(feedback_caption.unwrap(), "Feedback");

  let html = doc.html();
  // The HTML document no longer contains the feedback button.
  assert!(!html.contains("feedback-btn"));
}
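
The `pipeline` key in the configs above names steps that transform an extracted value before it is returned. The idea of chaining such steps can be sketched in plain Rust as follows. This is an illustration only: `Step`, `run_pipeline`, and `collapse_spaces` are hypothetical names, and the `trim_space` function here is an assumed approximation of what a step with that name might do, not the crate's own code.

```rust
/// A hypothetical pipeline step: takes the current text, returns the new text.
type Step = fn(&str) -> String;

/// Assumed behavior: strip leading and trailing whitespace.
fn trim_space(s: &str) -> String {
    s.trim().to_string()
}

/// Illustrative extra step: collapse every run of whitespace to one space.
fn collapse_spaces(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Apply the steps in order, feeding each step the previous step's output.
fn run_pipeline(input: &str, steps: &[Step]) -> String {
    steps
        .iter()
        .fold(input.to_string(), |acc, step| step(&acc))
}

/// Run a raw extracted string through both steps.
fn clean(raw: &str) -> String {
    let steps: [Step; 2] = [trim_space, collapse_spaces];
    run_pipeline(raw, &steps)
}
```

Under these assumptions, `clean("\n  Feedback  \n")` produces `"Feedback"`, which matches the shape of the assertion in the example above.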

More examples

Features

  • json_cfg: optional; allows loading the config from a JSON string.

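License

This project is licensed under either the MIT license or the Apache License, Version 2.0, at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.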

Dependencies

~10–17MB
~177K SLoC