8 versions
Version | Released |
---|---|
0.2.6 | April 4, 2024 |
0.2.5 | February 17, 2024 |
0.2.4 | January 27, 2024 |
0.1.0 | January 10, 2024 |
#1613 in Web programming
84 downloads per month
52KB
1K SLoC
DOM_FINDER
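dom_finder is a Rust crate for finding elements in the Document Object Model (DOM) of an HTML document. It lets you locate specific elements easily, based on various CSS criteria. With dom_finder, you can extract data from an HTML document and transform it before retrieving the result.
At present, this functionality relies on a YAML configuration.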
Examples
General
use dom_finder::{Config, Finder, Value};
const CFG_YAML: &str = r"
name: root
base_path: html
children:
  - name: results
    base_path: div.serp__results div.result
    many: true
    children:
      - name: url
        base_path: h2.result__title > a[href]
        extract: href
      - name: title
        base_path: h2.result__title
        extract: text
      - name: snippet
        base_path: a.result__snippet
        extract: html
        pipeline: [ [ policy_highlight ] ]
";
const HTML_DOC: &str = include_str!("../test_data/page_0.html");
fn main() {
    // Load the config from a YAML string; the &str can come from a file or a buffer.
    let cfg = Config::from_yaml(CFG_YAML).unwrap();
    // Create a new Finder instance.
    let finder = Finder::new(&cfg).unwrap();
    // Or in one line:
    // let finder: Finder = Config::from_yaml(CFG_YAML).unwrap().try_into().unwrap();

    // Parse the HTML string (a &str) and get the result as a `Value`.
    // The `Value` returned by `parse` is always a `Value::Object` with exactly one key (a String).
    let results: Value = finder.parse(HTML_DOC);

    // From this `Value` we can navigate to a descendant value by path,
    // similar to `gjson`, although the path syntax here is more primitive.
    // For more examples, see the `tests/` folder.

    // Get the number of results with the `from_path` method.
    // We know that `results` is a `Value::Array`,
    // because the config sets `many: true` for `results`.
    // When the Value is an Array (a Vec internally), it can be queried with `#` or a numeric index.
    let raw_count = results.from_path("root.results.#").unwrap();
    let count_opt: Option<i64> = raw_count.into();
    assert_eq!(count_opt.unwrap(), 21);

    // Get a specific Value and convert it to a concrete type.
    // In the same way, all urls inside the `results` array can be retrieved
    // with the path `root.results.#.url`.
    // If there is no `url` key, or its value is not a `Value::String`,
    // the conversion returns None; otherwise it returns Some.
    let url: String = results.from_path("root.results.0.url")
        .and_then(|v| v.into()).unwrap();
    assert_eq!(url, "https://ethereum.org/en/");

    // A `Value` can also be serialized with a serde serializer
    // (such as JSON or any other available format).
    // This is useful for sending parsed data in an HTTP response
    // or storing it in a database.
    let serialized = serde_json::to_string(&results).unwrap();
}
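To make the path syntax used with `from_path` more concrete, here is a minimal stand-alone sketch of how such a dot-separated path could be resolved. It is not dom_finder's implementation: the `Node` enum, the `lookup` function, and the `sample` helper are hypothetical names introduced only for illustration, and the behavior (a trailing `#` returns the element count, `#.key` collects `key` from every element, a number selects one element) is inferred from the comments in the example above.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for a parsed result tree; NOT dom_finder's `Value`.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Str(String),
    Int(i64),
    Array(Vec<Node>),
    Object(BTreeMap<String, Node>),
}

// Resolve a dot-separated path against a node (illustrative sketch only).
fn lookup(node: &Node, path: &str) -> Option<Node> {
    if path.is_empty() {
        // Nothing left to resolve: the current node is the answer.
        return Some(node.clone());
    }
    // Split off the first path segment; `rest` is empty for the last segment.
    let (head, rest) = match path.split_once('.') {
        Some((h, r)) => (h, r),
        None => (path, ""),
    };
    match node {
        // Objects are queried by key.
        Node::Object(map) => lookup(map.get(head)?, rest),
        Node::Array(items) if head == "#" => {
            if rest.is_empty() {
                // A trailing `#` yields the number of elements.
                Some(Node::Int(items.len() as i64))
            } else {
                // `#.key` collects `key` from every element that has it.
                Some(Node::Array(
                    items.iter().filter_map(|item| lookup(item, rest)).collect(),
                ))
            }
        }
        // Any other segment on an array must be a numeric index.
        Node::Array(items) => lookup(items.get(head.parse::<usize>().ok()?)?, rest),
        // Scalars have no children.
        _ => None,
    }
}

// Build a tiny tree shaped like the example result: root -> results -> [ {url}, {url} ].
fn sample() -> Node {
    let entry = |url: &str| {
        Node::Object(BTreeMap::from([("url".to_string(), Node::Str(url.to_string()))]))
    };
    let results = Node::Array(vec![entry("https://a.example/"), entry("https://b.example/")]);
    let root = Node::Object(BTreeMap::from([("results".to_string(), results)]));
    Node::Object(BTreeMap::from([("root".to_string(), root)]))
}
```

With this sketch, `lookup(&sample(), "root.results.#")` gives the element count, `lookup(&sample(), "root.results.0.url")` gives the first url, and a path that does not exist gives `None`, which mirrors the `Option`-based handling in the example above.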
Remove selection
use dom_finder::{Config, Finder};
use dom_query::Document;
const HTML_DOC: &str = include_str!("../test_data/page_0.html");
fn main() {
    // Create the finder, as in the previous example.
    let cfg_yaml = r"
    name: root
    base_path: html
    children:
      - name: feedback
        base_path: div#links.results div.feedback-btn
        extract: html
        remove_selection: true
        pipeline: [ [ trim_space ] ]
    ";
    let cfg = Config::from_yaml(cfg_yaml).unwrap();
    let finder = Finder::new(&cfg).unwrap();

    // Create a dom_query::Document.
    let doc = Document::from(HTML_DOC);

    // Parse the document.
    // Because `remove_selection` is set, the matched selection is removed from the document,
    // but its value is still available in the result.
    let res = finder.parse_document(&doc);
    let feedback_caption: Option<String> = res.from_path("root.feedback").unwrap().into();
    assert_eq!(feedback_caption.unwrap(), "Feedback");

    let html = doc.html();
    // The HTML document no longer contains the feedback button.
    assert!(!html.contains("feedback-btn"));
}
More examples
Features
json_cfg: optional; allows loading the config from a JSON string.
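Assuming `json_cfg` is an ordinary Cargo feature, it could be enabled in `Cargo.toml` roughly as follows. The version number is taken from the version list at the top of this page; adjust it to the release you actually use.

```toml
[dependencies]
# Enable the optional `json_cfg` feature (assumed to be off by default).
dom_finder = { version = "0.2.6", features = ["json_cfg"] }
```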
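License
This project is licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or https://apache.ac.cn/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
Dependencies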
~10–17MB
~177K SLoC