16 个版本 (9 个稳定版)
2.0.1 | 2024年5月3日 |
---|---|
2.0.0 | 2023年10月24日 |
2.0.0-alpha.1 | 2023年9月4日 |
1.6.0 | 2023年5月24日 |
0.1.3 | 2018年6月25日 |
#313 in 网页编程
5,141 每月下载量
用于 17 个 Crates (10 直接使用)
30KB
553 行
Webpage.rs
一个小型库,用于获取网页信息:标题、描述、语言、HTTP 信息、链接、RSS 源、Opengraph、Schema.org 以及更多
用法
use webpage::{Webpage, WebpageOptions};
let info = Webpage::from_url("https://rust-lang.net.cn/en-US/", WebpageOptions::default())
.expect("Could not read from URL");
// the HTTP transfer info
let http = info.http;
assert_eq!(http.ip, "54.192.129.71".to_string());
assert!(http.headers[0].starts_with("HTTP"));
assert!(http.body.starts_with("<!DOCTYPE html>"));
assert_eq!(http.url, "https://rust-lang.net.cn/en-US/".to_string()); // followed redirects (HTTPS)
assert_eq!(http.content_type, "text/html".to_string());
// the parsed HTML info
let html = info.html;
assert_eq!(html.title, Some("The Rust Programming Language".to_string()));
assert_eq!(html.description, Some("A systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.".to_string()));
assert_eq!(html.opengraph.og_type, "website".to_string());
您还可以获取本地数据的 HTML 信息
use webpage::HTML;
let html = HTML::from_file("index.html", None);
// or let html = HTML::from_string(input, None);
功能
序列化
如果您需要使用 serde 序列化库提供的功能,可以在 Cargo.toml
中声明依赖时包含指定 功能 serde
webpage = { version = "2.0", features = ["serde"] }
无 curl 依赖
默认启用 curl
功能,但它是可选的。如果您不需要 HTTP 客户端但已经有 HTML 数据,这很有用。
所有字段
pub struct Webpage {
pub http: HTTP, // info about the HTTP transfer
pub html: HTML, // info from the parsed HTML doc
}
pub struct HTTP {
pub ip: String,
pub transfer_time: Duration,
pub redirect_count: u32,
pub content_type: String,
pub response_code: u32,
pub headers: Vec<String>, // raw headers from final request
pub url: String, // effective url
pub body: String,
}
pub struct HTML {
pub title: Option<String>,
pub description: Option<String>,
pub url: Option<String>, // canonical url
pub feed: Option<String>, // RSS feed typically
pub language: Option<String>, // as specified, not detected
pub text_content: String, // all tags stripped from body
pub links: Vec<Link>, // all links in the document
pub meta: HashMap<String, String>, // flattened down list of meta properties
pub opengraph: Opengraph,
pub schema_org: Vec<SchemaOrg>,
}
pub struct Link {
pub url: String, // resolved url of the link
pub text: String, // anchor text
}
pub struct Opengraph {
pub og_type: String,
pub properties: HashMap<String, String>,
pub images: Vec<Object>,
pub videos: Vec<Object>,
pub audios: Vec<Object>,
}
// Facebook's Opengraph structured data
pub struct OpengraphObject {
pub url: String,
pub properties: HashMap<String, String>,
}
// Google's schema.org structured data
pub struct SchemaOrg {
pub schema_type: String,
pub value: serde_json::Value,
}
选项
以下 HTTP 配置可用
pub struct WebpageOptions {
allow_insecure: false,
follow_location: true,
max_redirections: 5,
timeout: Duration::from_secs(10),
useragent: "Webpage - Rust crate - https://crates.io/crates/webpage".to_string(),
headers: vec!["X-My-Header: 1234".to_string()],
}
// usage
let mut options = WebpageOptions::default();
options.allow_insecure = true;
let info = Webpage::from_url(&url, options).expect("Halp, could not fetch");
依赖
~11–19MB
~319K SLoC