
soup

Inspired by the Python library BeautifulSoup, this is a layer on top of html5ever that adds a different API for querying and manipulating HTML.

8 releases (4 breaking)

Uses the old Rust 2015 edition

0.5.1 Mar 25, 2021
0.5.0 Feb 14, 2020
0.4.1 Apr 29, 2019
0.3.0 Nov 14, 2018
0.1.1 Nov 2, 2018


CC-PDDC license



Docs (latest release)

Docs (master)

Installation

To use it, add the following to your Cargo.toml:

[dependencies]
soup = "0.5"

Usage

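// src/main.rs
extern crate reqwest;
extern crate soup;

use std::error::Error;

use soup::prelude::*;

fn main() -> Result<(), Box<dyn Error>> {
    // `reqwest::get` is the synchronous call of reqwest 0.9 and earlier;
    // from reqwest 0.10 on, the blocking equivalent is `reqwest::blocking::get`.
    let response = reqwest::get("https://google.com")?;
    // Reading from the response can fail with an I/O error, so propagate it.
    let soup = Soup::from_reader(response)?;
    // Text of the first <p class="hidden">, or `None` if no such tag exists.
    let some_text = soup.tag("p")
        .attr("class", "hidden")
        .find()
        .map(|p| p.text());
    println!("{:?}", some_text);
    Ok(())
}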


lib.rs:

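Inspired by the Python library "BeautifulSoup", soup is a layer on top of html5ever that aims to provide a slightly different API for querying and manipulating HTML.

Examples (inspired by the bs4 docs)

Here is the HTML document we will be using in the rest of the examples: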

const THREE_SISTERS: &'static str = r#"
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"#;

First, let's try searching for a tag with a specific name:

use soup::prelude::*;

let soup = Soup::new(THREE_SISTERS);

let title = soup.tag("title").find().expect("Couldn't find tag 'title'");
assert_eq!(title.display(), "<title>The Dormouse's story</title>");
assert_eq!(title.name(), "title");
assert_eq!(title.text(), "The Dormouse's story".to_string());
assert_eq!(title.parent().expect("Couldn't find parent of 'title'").name(), "head");

let p = soup.tag("p").find().expect("Couldn't find tag 'p'");
assert_eq!(
    p.display(),
    r#"<p class="title"><b>The Dormouse's story</b></p>"#
);
assert_eq!(p.get("class"), Some("title".to_string()));

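So .find returns the first element that matches the query, and we have seen some of the methods we can call on the result. But what if we want to retrieve more than one element with the query? For that, we use .find_all: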

let soup = Soup::new(THREE_SISTERS);
// .find returns only the first 'a' tag
let a = soup.tag("a").find().expect("Couldn't find tag 'a'");
assert_eq!(
    a.display(),
    r#"<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>"#
);
// but .find_all will return _all_ of them:
let a_s = soup.tag("a").find_all();
assert_eq!(
    a_s.map(|a| a.display())
       .collect::<Vec<_>>()
       .join("\n"),
    r#"<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>"#
);

Since .find_all returns an iterator, you can use it like any other iterator:

let soup = Soup::new(THREE_SISTERS);
let expected = [
    "http://example.com/elsie",
    "http://example.com/lacie",
    "http://example.com/tillie",
];

for (i, link) in soup.tag("a").find_all().enumerate() {
    let href = link.get("href").expect("Couldn't find link with 'href' attribute");
    assert_eq!(href, expected[i].to_string());
}
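To make the "like any other iterator" point concrete, here is a minimal, standard-library-only sketch. `Link`, `sample_links`, and this `find_all` are hypothetical stand-ins invented for illustration and are not part of the soup crate; the sketch only assumes that a query yields its matches lazily, in document order.

```rust
/// Hypothetical stand-in for a parsed `<a>` element (not a soup type).
struct Link {
    href: &'static str,
    text: &'static str,
}

/// Hypothetical sample data mirroring the three links in the document above.
fn sample_links() -> Vec<Link> {
    vec![
        Link { href: "http://example.com/elsie", text: "Elsie" },
        Link { href: "http://example.com/lacie", text: "Lacie" },
        Link { href: "http://example.com/tillie", text: "Tillie" },
    ]
}

/// Hypothetical `find_all`: lazily yields every link in order.
fn find_all<'a>(links: &'a [Link]) -> impl Iterator<Item = &'a Link> + 'a {
    links.iter()
}

fn main() {
    let links = sample_links();

    // Taking only the first item behaves like a "find first match" query.
    let first = find_all(&links).next().map(|link| link.text);
    assert_eq!(first, Some("Elsie"));

    // Standard adaptors compose as usual: map, filter, count, collect.
    let hrefs: Vec<&str> = find_all(&links).map(|link| link.href).collect();
    assert_eq!(hrefs.len(), 3);

    let starting_with_l = find_all(&links)
        .filter(|link| link.text.starts_with('L'))
        .count();
    assert_eq!(starting_with_l, 1);
}
```

Because nothing is collected until an adaptor asks for it, a caller that needs only the first few matches does not pay for the rest.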

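The top-level structure we have been working with, soup, implements the same methods as the query results, so you can call the same methods on it; it delegates those calls to the root node: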

let soup = Soup::new(THREE_SISTERS);
let text = soup.text();
assert_eq!(
    text,
    r#"The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.

...
"#
);
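The delegation described above can be sketched with a shared trait. This is a simplified model under stated assumptions, not the crate's actual implementation: `QueryExt`, `Node`, `Document`, and `sample_document` are hypothetical names, and the real crate's traits and types differ.

```rust
/// Hypothetical trait shared by nodes and the top-level document.
trait QueryExt {
    /// Concatenated text of this item and everything below it.
    fn text(&self) -> String;
}

/// Hypothetical tree node: its own text followed by its children.
struct Node {
    text: String,
    children: Vec<Node>,
}

impl QueryExt for Node {
    fn text(&self) -> String {
        let mut out = self.text.clone();
        for child in &self.children {
            out.push_str(&child.text());
        }
        out
    }
}

/// Hypothetical top-level wrapper that owns the root node.
struct Document {
    root: Node,
}

impl QueryExt for Document {
    // The wrapper does no work of its own: it forwards to the root node.
    fn text(&self) -> String {
        self.root.text()
    }
}

/// Builds a tiny tree shaped like `<p>some text, <b>Some bold text</b></p>`.
fn sample_document() -> Document {
    let b = Node { text: "Some bold text".to_string(), children: vec![] };
    let p = Node { text: "some text, ".to_string(), children: vec![b] };
    Document { root: Node { text: String::new(), children: vec![p] } }
}

fn main() {
    let doc = sample_document();
    // Calling through the wrapper and calling on the root give the same result.
    assert_eq!(doc.text(), doc.root.text());
    println!("{}", doc.text());
}
```

Putting the methods on a trait means code written against a query result works unchanged against the whole document.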

You can search with more than just strings, for example with regular expressions:

use regex::Regex;

let soup = Soup::new(r#"<body><p>some text, <b>Some bold text</b></p></body>"#);
let results = soup.tag(Regex::new("^b")?)
                  .find_all()
                  .map(|tag| tag.name().to_string())
                  .collect::<Vec<_>>();
assert_eq!(results, vec!["body".to_string(), "b".to_string()]);

Passing true will match everything:


let soup = Soup::new(r#"<body><p>some text, <b>Some bold text</b></p></body>"#);
let results = soup.tag(true)
                  .find_all()
                  .map(|tag| tag.name().to_string())
                  .collect::<Vec<_>>();
assert_eq!(results, vec![
    "html".to_string(),
    "head".to_string(),
    "body".to_string(),
    "p".to_string(),
    "b".to_string(),
]);

(Also, passing false will always return no results, though if that is useful to you, please let me know.)
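One way a single query method can accept strings, booleans, and pattern-like values is a small matching trait. The sketch below is a hypothetical model using only the standard library: `Pattern`, `StartsWith` (a stand-in for the `^b` regex above), and this `find_all` over a flat list of tag names are all assumptions made for illustration, not the crate's API.

```rust
/// Hypothetical trait: anything that can decide whether a tag name matches.
trait Pattern {
    fn matches(&self, name: &str) -> bool;
}

/// A string matches only the identical tag name.
impl<'a> Pattern for &'a str {
    fn matches(&self, name: &str) -> bool {
        *self == name
    }
}

/// `true` matches everything, `false` matches nothing.
impl Pattern for bool {
    fn matches(&self, _name: &str) -> bool {
        *self
    }
}

/// Hypothetical prefix matcher, standing in for a regex such as `^b`.
struct StartsWith<'a>(&'a str);

impl<'a> Pattern for StartsWith<'a> {
    fn matches(&self, name: &str) -> bool {
        name.starts_with(self.0)
    }
}

/// Hypothetical query over a flat list of tag names in document order.
fn find_all<'a, P: Pattern>(names: &[&'a str], pattern: P) -> Vec<&'a str> {
    names
        .iter()
        .cloned()
        .filter(|name| pattern.matches(name))
        .collect()
}

fn main() {
    let names = ["html", "head", "body", "p", "b"];
    assert_eq!(find_all(&names, "b"), vec!["b"]);
    assert_eq!(find_all(&names, StartsWith("b")), vec!["body", "b"]);
    assert_eq!(find_all(&names, true), names.to_vec());
    assert!(find_all(&names, false).is_empty());
}
```

With this shape, supporting a new kind of query only requires one more `impl Pattern`, and the query method itself stays unchanged.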

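So what can you do once you have the result of a query? Well, for one thing, you can traverse the tree in a few different ways. You can ascend the tree: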


let soup = Soup::new(r#"<body><p>some text, <b>Some bold text</b></p></body>"#);
let b = soup.tag("b")
            .find()
            .expect("Couldn't find tag 'b'");
let p = b.parent()
         .expect("Couldn't find parent of 'b'");
assert_eq!(p.name(), "p".to_string());
let body = p.parent()
            .expect("Couldn't find parent of 'p'");
assert_eq!(body.name(), "body".to_string());

Or descend it:


let soup = Soup::new(r#"<body><ul><li>ONE</li><li>TWO</li><li>THREE</li></ul></body>"#);
let ul = soup.tag("ul")
            .find()
            .expect("Couldn't find tag 'ul'");
let mut li_tags = ul.children().filter(|child| child.is_element());
assert_eq!(li_tags.next().map(|tag| tag.text().to_string()), Some("ONE".to_string()));
assert_eq!(li_tags.next().map(|tag| tag.text().to_string()), Some("TWO".to_string()));
assert_eq!(li_tags.next().map(|tag| tag.text().to_string()), Some("THREE".to_string()));
assert!(li_tags.next().is_none());
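The `is_element` filter matters because an HTML parser also produces text nodes (for example, the whitespace between tags in an indented document), so a node's children are not necessarily all elements. A minimal sketch of that filtering step, using a hypothetical `Child` enum and helper functions rather than the crate's node type:

```rust
/// Hypothetical child node: either an element (by tag name) or a text run.
enum Child {
    Element(&'static str),
    Text(&'static str),
}

impl Child {
    fn is_element(&self) -> bool {
        match *self {
            Child::Element(_) => true,
            Child::Text(_) => false,
        }
    }
}

/// Children of an indented `<ul>`: whitespace text nodes surround each `<li>`.
fn sample_children() -> Vec<Child> {
    vec![
        Child::Text("\n  "),
        Child::Element("li"),
        Child::Text("\n  "),
        Child::Element("li"),
        Child::Text("\n"),
    ]
}

/// Keeps only element children and returns their tag names.
fn element_names(children: &[Child]) -> Vec<&'static str> {
    children
        .iter()
        .filter(|child| child.is_element())
        .filter_map(|child| match *child {
            Child::Element(name) => Some(name),
            Child::Text(_) => None,
        })
        .collect()
}

fn main() {
    let children = sample_children();
    assert_eq!(children.len(), 5);
    assert_eq!(element_names(&children), vec!["li", "li"]);
}
```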

Or ascend it with an iterator:


let soup = Soup::new(r#"<body><ul><li>ONE</li><li>TWO</li><li>THREE</li></ul></body>"#);
let li = soup.tag("li").find().expect("Couldn't find tag 'li'");
let mut parents = li.parents();
assert_eq!(parents.next().map(|tag| tag.name().to_string()), Some("ul".to_string()));
assert_eq!(parents.next().map(|tag| tag.name().to_string()), Some("body".to_string()));
assert_eq!(parents.next().map(|tag| tag.name().to_string()), Some("html".to_string()));
assert_eq!(parents.next().map(|tag| tag.name().to_string()), Some("[document]".to_string()));
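An ancestor iterator like this can be modelled as repeatedly following a parent link until the root is reached. The sketch below is a standard-library-only illustration under that assumption; the arena-style `Node`, `sample_tree`, and this `parents` function are hypothetical and do not reflect how the crate stores its tree.

```rust
use std::iter;

/// Hypothetical arena node: a name plus the index of its parent, if any.
struct Node {
    name: &'static str,
    parent: Option<usize>,
}

/// Arena for `<html><body><ul><li>` under a synthetic document root.
fn sample_tree() -> Vec<Node> {
    vec![
        Node { name: "[document]", parent: None },
        Node { name: "html", parent: Some(0) },
        Node { name: "body", parent: Some(1) },
        Node { name: "ul", parent: Some(2) },
        Node { name: "li", parent: Some(3) },
    ]
}

/// Yields the names of all ancestors of `start`, nearest first.
fn parents<'a>(tree: &'a [Node], start: usize) -> impl Iterator<Item = &'static str> + 'a {
    // `successors` keeps applying the closure until it returns `None`,
    // which happens once the root (a node without a parent) has been yielded.
    iter::successors(tree[start].parent, move |&index| tree[index].parent)
        .map(move |index| tree[index].name)
}

fn main() {
    let tree = sample_tree();
    let names: Vec<&str> = parents(&tree, 4).collect();
    assert_eq!(names, vec!["ul", "body", "html", "[document]"]);
    // The root has no ancestors, so the iterator is empty.
    assert_eq!(parents(&tree, 0).count(), 0);
}
```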

Dependencies

~3-5MB
~95K SLoC