#xml #writer #html #serde #parser

fast-xml

High-performance XML reader and writer

4 releases

0.23.1 May 30, 2022
0.23.0 May 8, 2022
0.23.0-alpha3 May 2, 2022
0.22.0 May 2, 2022

#87 in #writer

Download history (weekly downloads, 2024-03-14 to 2024-06-27)

50,828 downloads per month
Used in 4 crates (2 directly)

MIT license

455KB
9K SLoC

quick-xml has been revived and is actively maintained again. Please use it instead.

fast-xml -- successor of quick-xml


High-performance XML pull reader and writer.

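The reader:

  • is almost zero-copy (uses Cow whenever possible)
  • is easy on memory allocation (the API provides a way to reuse buffers)
  • supports various encodings (with the encoding feature), namespace resolution, and special characters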
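
To make the "almost zero-copy" point concrete, here is a minimal std-only sketch of the Cow pattern. It is not fast-xml's implementation: `unescape_amp` is a hypothetical helper invented for illustration, and it handles only the `&amp;` entity. The idea is that text without escapes can be handed back as a borrowed slice, and an allocation happens only when the content really has to change.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of fast-xml): replaces `&amp;` with `&`.
/// Returns a borrowed slice when there is nothing to replace, so the
/// common case performs no allocation.
fn unescape_amp(input: &str) -> Cow<'_, str> {
    if input.contains("&amp;") {
        // An escape sequence is present: a new String has to be built.
        Cow::Owned(input.replace("&amp;", "&"))
    } else {
        // Nothing to change: hand back a view of the caller's data.
        Cow::Borrowed(input)
    }
}
```

Callers can treat both cases uniformly through `Deref<Target = str>` and call `into_owned()` only when they need to keep the data beyond the lifetime of the input.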

docs.rs

Syntax is inspired by xml-rs.

Migrating from quick-xml

If you are using quick-xml 0.22.0 or 0.23.0-alpha3, you can simply replace quick-xml with fast-xml in your Cargo.toml, then replace every occurrence of the quick_xml crate name in your code base with fast_xml.

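These two fast-xml releases were made specifically for migration: they contain the same code as the original quick-xml, except for updated cargo metadata and updated extern crate names in the tests, benches, and examples.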

Example

Reader

use fast_xml::Reader;
use fast_xml::events::Event;

let xml = r#"<tag1 att1 = "test">
                <tag2><!--Test comment-->Test</tag2>
                <tag2>
                    Test 2
                </tag2>
            </tag1>"#;

let mut reader = Reader::from_str(xml);
reader.trim_text(true);

let mut count = 0;
let mut txt = Vec::new();
let mut buf = Vec::new();

// The `Reader` does not implement `Iterator` because it outputs borrowed data (`Cow`s)
loop {
    // NOTE: this is the generic case when we don't know about the input BufRead.
    // when the input is a &str or a &[u8], we don't actually need to use another
    // buffer, we could directly call `reader.read_event_unbuffered()`
    match reader.read_event(&mut buf) {
        Ok(Event::Start(ref e)) => {
            match e.name() {
                b"tag1" => println!("attributes values: {:?}",
                                    e.attributes().map(|a| a.unwrap().value).collect::<Vec<_>>()),
                b"tag2" => count += 1,
                _ => (),
            }
        },
        Ok(Event::Text(e)) => txt.push(e.unescape_and_decode(&reader).unwrap()),
        Ok(Event::Eof) => break, // exits the loop when reaching end of file
        Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
        _ => (), // There are several other `Event`s we do not consider here
    }

    // if we don't keep a borrow elsewhere, we can clear the buffer to keep memory usage low
    buf.clear();
}
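
The `buf.clear()` call at the end of the loop is what makes buffer reuse cheap: `Vec::clear` resets the length but keeps the allocation. The sketch below shows the same pattern with only the standard library. It is an illustration, not fast-xml code, and `count_chunks` is a hypothetical function that simply splits the input on `>`.

```rust
use std::io::BufRead;

/// Hypothetical helper (not part of fast-xml): counts the `>`-terminated
/// chunks in `input` while reusing one buffer for every read.
fn count_chunks<R: BufRead>(mut input: R) -> std::io::Result<usize> {
    let mut buf = Vec::new();
    let mut chunks = 0;
    loop {
        // `read_until` appends to `buf`, so stale data must be cleared
        // before the next iteration.
        let n = input.read_until(b'>', &mut buf)?;
        if n == 0 {
            break; // end of input
        }
        chunks += 1;
        // Resets the length to zero but keeps the capacity, so later
        // reads reuse the same allocation.
        buf.clear();
    }
    Ok(chunks)
}
```

Because the buffer is owned by the caller rather than the reader, its peak size is bounded by the largest single chunk, not by the size of the whole input.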

Writer

use fast_xml::Writer;
use fast_xml::Reader;
use fast_xml::events::{Event, BytesEnd, BytesStart};
use std::io::Cursor;
use std::iter;

let xml = r#"<this_tag k1="v1" k2="v2"><child>text</child></this_tag>"#;
let mut reader = Reader::from_str(xml);
reader.trim_text(true);
let mut writer = Writer::new(Cursor::new(Vec::new()));
let mut buf = Vec::new();
loop {
    match reader.read_event(&mut buf) {
        Ok(Event::Start(ref e)) if e.name() == b"this_tag" => {

            // creates a new element ... alternatively we could reuse `e` by calling
            // `e.into_owned()`
            let mut elem = BytesStart::owned(b"my_elem".to_vec(), "my_elem".len());

            // collect existing attributes
            elem.extend_attributes(e.attributes().map(|attr| attr.unwrap()));

            // adds a new my-key="some value" attribute after the copied ones
            elem.push_attribute(("my-key", "some value"));

            // writes the event to the writer
            assert!(writer.write_event(Event::Start(elem)).is_ok());
        },
        Ok(Event::End(ref e)) if e.name() == b"this_tag" => {
            assert!(writer.write_event(Event::End(BytesEnd::borrowed(b"my_elem"))).is_ok());
        },
        Ok(Event::Eof) => break,
        // you can use either `e` or `&e` if you don't want to move the event
        Ok(e) => assert!(writer.write_event(&e).is_ok()),
        Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
    }
    buf.clear();
}

let result = writer.into_inner().into_inner();
let expected = r#"<my_elem k1="v1" k2="v2" my-key="some value"><child>text</child></my_elem>"#;
assert_eq!(result, expected.as_bytes());

Serde

When the serialize feature is enabled, fast-xml can be used with serde's Serialize/Deserialize traits.

Here is an example that deserializes the crates.io source:

// Cargo.toml
// [dependencies]
// serde = { version = "1.0", features = [ "derive" ] }
// fast-xml = { version = "0.22", features = [ "serialize" ] }
use serde::Deserialize;
use fast_xml::de::{from_str, DeError};

#[derive(Debug, Deserialize, PartialEq)]
struct Link {
    rel: String,
    href: String,
    sizes: Option<String>,
}

#[derive(Debug, Deserialize, PartialEq)]
#[serde(rename_all = "lowercase")]
enum Lang {
    En,
    Fr,
    De,
}

#[derive(Debug, Deserialize, PartialEq)]
struct Head {
    title: String,
    #[serde(rename = "link", default)]
    links: Vec<Link>,
}

#[derive(Debug, Deserialize, PartialEq)]
struct Script {
    src: String,
    integrity: String,
}

#[derive(Debug, Deserialize, PartialEq)]
struct Body {
    #[serde(rename = "script", default)]
    scripts: Vec<Script>,
}

#[derive(Debug, Deserialize, PartialEq)]
struct Html {
    lang: Option<String>,
    head: Head,
    body: Body,
}

fn crates_io() -> Result<Html, DeError> {
    let xml = "<!DOCTYPE html>
        <html lang=\"en\">
          <head>
            <meta charset=\"utf-8\">
            <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">
            <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">

            <title>crates.io: Rust Package Registry</title>


        <!-- EMBER_CLI_FASTBOOT_TITLE --><!-- EMBER_CLI_FASTBOOT_HEAD -->
        <link rel=\"manifest\" href=\"/manifest.webmanifest\">
        <link rel=\"apple-touch-icon\" href=\"/cargo-835dd6a18132048a52ac569f2615b59d.png\" sizes=\"227x227\">

            <link rel=\"stylesheet\" href=\"/assets/vendor-8d023d47762d5431764f589a6012123e.css\" integrity=\"sha256-EoB7fsYkdS7BZba47+C/9D7yxwPZojsE4pO7RIuUXdE= sha512-/SzGQGR0yj5AG6YPehZB3b6MjpnuNCTOGREQTStETobVRrpYPZKneJwcL/14B8ufcvobJGFDvnTKdcDDxbh6/A==\" >
            <link rel=\"stylesheet\" href=\"/assets/cargo-cedb8082b232ce89dd449d869fb54b98.css\" integrity=\"sha256-S9K9jZr6nSyYicYad3JdiTKrvsstXZrvYqmLUX9i3tc= sha512-CDGjy3xeyiqBgUMa+GelihW394pqAARXwsU+HIiOotlnp1sLBVgO6v2ZszL0arwKU8CpvL9wHyLYBIdfX92YbQ==\" >


            <link rel=\"shortcut icon\" href=\"/favicon.ico\" type=\"image/x-icon\">
            <link rel=\"icon\" href=\"/cargo-835dd6a18132048a52ac569f2615b59d.png\" type=\"image/png\">
            <link rel=\"search\" href=\"/opensearch.xml\" type=\"application/opensearchdescription+xml\" title=\"Cargo\">
          </head>
          <body>
            <!-- EMBER_CLI_FASTBOOT_BODY -->
            <noscript>
                <div id=\"main\">
                    <div class='noscript'>
                        This site requires JavaScript to be enabled.
                    </div>
                </div>
            </noscript>

            <script src=\"/assets/vendor-bfe89101b20262535de5a5ccdc276965.js\" integrity=\"sha256-U12Xuwhz1bhJXWyFW/hRr+Wa8B6FFDheTowik5VLkbw= sha512-J/cUUuUN55TrdG8P6Zk3/slI0nTgzYb8pOQlrXfaLgzr9aEumr9D1EzmFyLy1nrhaDGpRN1T8EQrU21Jl81pJQ==\" ></script>
            <script src=\"/assets/cargo-4023b68501b7b3e17b2bb31f50f5eeea.js\" integrity=\"sha256-9atimKc1KC6HMJF/B07lP3Cjtgr2tmET8Vau0Re5mVI= sha512-XJyBDQU4wtA1aPyPXaFzTE5Wh/mYJwkKHqZ/Fn4p/ezgdKzSCFu6FYn81raBCnCBNsihfhrkb88uF6H5VraHMA==\" ></script>

          </body>
        </html>";
    let html: Html = from_str(xml)?;
    assert_eq!(&html.head.title, "crates.io: Rust Package Registry");
    Ok(html)
}

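Credits

This has largely been inspired by serde-xml-rs. fast-xml follows its conventions for deserialization, including the $value special name.

The original quick-xml was developed by @tafia and went unmaintained around the end of 2021.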

Parsing the "value" of a tag

If you have an input of the form <foo abc="xyz">bar</foo> and you want to get at bar, you can use the special name $value:

// requires `use serde::Deserialize;`
#[derive(Debug, Deserialize)]
struct Foo {
    pub abc: String,
    #[serde(rename = "$value")]
    pub body: String,
}

Unflattening structs into verbose XML

If your XML file looks like <root><first>value</first><second>value</second></root>, you can serialize and deserialize it with the special name prefix $unflatten=:

// requires `use serde::{Deserialize, Serialize};`
#[derive(Debug, Serialize, Deserialize)]
struct Root {
    #[serde(rename = "$unflatten=first")]
    first: String,
    #[serde(rename = "$unflatten=second")]
    other_field: String,
}

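Serializing unit variants as primitives

With the $primitive= prefix, you can serialize enum variants that have no associated values (internally referred to as unit variants) as primitive strings rather than self-closing tags. Consider the following definitions: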

// requires `use serde::Serialize;`
#[derive(Serialize)]
enum Foo {
    #[serde(rename = "$primitive=Bar")]
    Bar,
}

#[derive(Serialize)]
struct Root {
    foo: Foo,
}

Serializing Root { foo: Foo::Bar } then yields <Root foo="Bar"/> instead of <Root><Bar/></Root>.

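Performance

Note that although the serde support is not focused on performance (there are several unnecessary copies), it is still about 10 times faster than serde-xml-rs.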

Features

  • encoding: support for non-UTF-8 XML
  • serialize: support for serde Serialize/Deserialize

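Performance

Benchmarking is hard, and the results depend on your input file and your machine.

Here, on my particular file, fast-xml is around 50 times faster than the xml-rs crate (measurements were taken while this crate was still named quick-xml).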

// quick-xml benches
test bench_quick_xml            ... bench:     198,866 ns/iter (+/- 9,663)
test bench_quick_xml_escaped    ... bench:     282,740 ns/iter (+/- 61,625)
test bench_quick_xml_namespaced ... bench:     389,977 ns/iter (+/- 32,045)

// same bench with xml-rs
test bench_xml_rs               ... bench:  14,468,930 ns/iter (+/- 321,171)

// serde-xml-rs vs serialize feature
test bench_serde_quick_xml      ... bench:   1,181,198 ns/iter (+/- 138,290)
test bench_serde_xml_rs         ... bench:  15,039,564 ns/iter (+/- 783,485)
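
The "around 50 times" and "about 10 times" figures can be checked directly against the timings above. The sketch below only does the arithmetic: the constants are copied from the bench output, and `speedup` is a hypothetical helper name. Dividing the xml-rs time by each quick-xml time gives roughly 73x (plain), 51x (escaped), and 37x (namespaced); the serde comparison comes out at roughly 12.7x. These are single-run numbers with visible variance, so the ratios are indicative only.

```rust
/// Timings copied from the bench output above, in ns/iter.
const QUICK_XML: u64 = 198_866;
const QUICK_XML_ESCAPED: u64 = 282_740;
const QUICK_XML_NAMESPACED: u64 = 389_977;
const XML_RS: u64 = 14_468_930;
const SERDE_QUICK_XML: u64 = 1_181_198;
const SERDE_XML_RS: u64 = 15_039_564;

/// Hypothetical helper: how many times faster the candidate is
/// than the baseline (baseline time divided by candidate time).
fn speedup(baseline_ns: u64, candidate_ns: u64) -> f64 {
    baseline_ns as f64 / candidate_ns as f64
}
```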

To compare features and performance, you can also look at RazrFalcon's parser comparison table.

Contributing

Any PR is welcome!

License

MIT

Dependencies

~0.2–1.3MB
~39K SLoC