3 unstable releases

0.2.1	Mar 7, 2024
0.2.0	Feb 27, 2024
0.1.0	Feb 20, 2024

#1202 in Parser implementations

132 downloads per month

MPL-2.0 license

440KB
7.5K SLoC

Simple XML parser and extractor.

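This crate provides an XmlReader that can automatically determine the character encoding of UTF-8 and UTF-16 (big-endian and little-endian byte order) XML byte streams, and parse the XML into an immutable tree of Elements held within an XmlDocument. A custom byte stream decoder can also be used to read XML in other character encodings.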
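To make the idea of encoding auto-detection concrete, here is a minimal sketch of sniffing the first bytes of a stream, in the spirit of the non-normative autodetection appendix of the XML 1.0 specification. It is not this crate's implementation: the names `Encoding`, `sniff_encoding` and `decode` are hypothetical, and the sketch only looks at byte order marks and at how an initial `<` is laid out.

```rust
/// Encodings this sketch can recognise (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Encoding {
    Utf8,
    Utf16Be,
    Utf16Le,
}

/// Guess the encoding from the first bytes of an XML stream.
/// Returns the encoding and the number of BOM bytes to skip.
fn sniff_encoding(bytes: &[u8]) -> Option<(Encoding, usize)> {
    match bytes {
        // Byte order marks.
        [0xEF, 0xBB, 0xBF, ..] => Some((Encoding::Utf8, 3)),
        [0xFE, 0xFF, ..] => Some((Encoding::Utf16Be, 2)),
        [0xFF, 0xFE, ..] => Some((Encoding::Utf16Le, 2)),
        // No BOM: look at how the first '<' (0x3C) is laid out.
        [0x00, 0x3C, ..] => Some((Encoding::Utf16Be, 0)),
        [0x3C, 0x00, ..] => Some((Encoding::Utf16Le, 0)),
        [0x3C, ..] => Some((Encoding::Utf8, 0)),
        // Anything else is left undecided by this sketch.
        _ => None,
    }
}

/// Decode the whole buffer into a String using the sniffed encoding.
fn decode(bytes: &[u8]) -> Option<String> {
    let (enc, skip) = sniff_encoding(bytes)?;
    let body = &bytes[skip..];
    match enc {
        Encoding::Utf8 => String::from_utf8(body.to_vec()).ok(),
        Encoding::Utf16Be | Encoding::Utf16Le => {
            if body.len() % 2 != 0 {
                return None; // a dangling byte cannot form a code unit
            }
            let units: Vec<u16> = body
                .chunks_exact(2)
                .map(|pair| {
                    let pair = [pair[0], pair[1]];
                    if enc == Encoding::Utf16Be {
                        u16::from_be_bytes(pair)
                    } else {
                        u16::from_le_bytes(pair)
                    }
                })
                .collect();
            String::from_utf16(&units).ok()
        }
    }
}
```

A real reader would decode incrementally rather than buffer the whole input, and would also need a policy for input that starts with neither a byte order mark nor `<`.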

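The aim of this crate is to support, as closely as possible, the W3C specifications Extensible Markup Language (XML) 1.0 and Namespaces in XML 1.0 for well-formed XML. It does not aim to support validation of XML, and so DTDs (document type definitions) are deliberately not supported.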

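Namespace support is always enabled, so a colon may appear in an element or attribute name only as the separator between a prefix and the local part.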
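As an illustration of what always-on namespace support implies for names, the sketch below resolves a qualified name against the in-scope namespace bindings, following the rules of Namespaces in XML 1.0. It is a simplified model and not this crate's code: `resolve`, `ExpandedName` and the `bindings` map are hypothetical, and the pre-bound `xml` prefix is ignored.

```rust
use std::collections::HashMap;

/// An expanded name: (local part, namespace).
/// An empty namespace string means "no namespace".
type ExpandedName = (String, String);

/// Resolve a qualified name such as "pfx:two" or "one".
/// `bindings` maps prefixes to namespaces; the default namespace,
/// if any, is stored under the empty prefix "".
fn resolve(
    qname: &str,
    bindings: &HashMap<&str, &str>,
    is_attribute: bool,
) -> Option<ExpandedName> {
    let mut parts = qname.split(':');
    let first = parts.next()?;
    let second = parts.next();
    if parts.next().is_some() {
        return None; // more than one colon is never allowed
    }
    match second {
        // Prefixed name: the prefix must be bound to a namespace.
        Some(local) => {
            if first.is_empty() || local.is_empty() {
                return None;
            }
            let ns = bindings.get(first)?;
            Some((local.to_string(), ns.to_string()))
        }
        // Unprefixed name.
        None => {
            if first.is_empty() {
                return None;
            }
            // Unprefixed attributes never take the default namespace.
            let ns = if is_attribute {
                ""
            } else {
                bindings.get("").copied().unwrap_or("")
            };
            Some((first.to_string(), ns.to_string()))
        }
    }
}
```

With a default namespace `example.com/DefaultNamespace` and a prefix `pfx` bound to `example.com/OtherNamespace`, the element name `one` resolves into the default namespace, `pfx:two` into the other namespace, and an unprefixed attribute such as `key2` into no namespace at all.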

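XML concepts already supported

Elements
Attributes
Default namespaces xmlns="namespace.com"
Prefixed namespaces xmlns:prefix="namespace.com"
Processing instructions
Comments (skipped, so they cannot be retrieved)
CDATA sections
Element language xml:lang and filtering by language
White space indication xml:space
Automatic detection and decoding of UTF-8 and UTF-16 XML streams.
Custom encodings, where the encoding is known before parsing and the client supplies a custom decoder to convert bytes to characters.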

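Examples

Reading an XML file

Suppose you want to read and extract XML from a file that you know is encoded in either UTF-8 or UTF-16. You can use XmlReader::parse_auto to read and parse the file, which returns either an XmlDocument or a std::io::Error.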

let xml_file = File::open("test_resources/xml_utf8_BOM.xml")?;
let xml_doc = XmlReader::parse_auto(xml_file)?;

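Traversing an `XmlDocument`

Once you have an XmlDocument, you can get an immutable reference to the root Element and then traverse the element tree with the req (required child element) and opt (optional child element) methods, each of which targets the first child element with the specified name. Once the chain points at the target, use element() to try to get the target element, or text() to try to get its text-only content.

For example, let's define a simple XML structure in which the names of required elements begin with "r_" and the names of optional elements begin with "o_".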

<root>
    <r_Widget>
        <r_Name>Helix</r_Name>
        <o_AdditionalInfo>
            <r_ReleaseDate>2021-05-12</r_ReleaseDate>
            <r_CurrentVersion>23.10</r_CurrentVersion>
            <o_TopContributors>
                <r_Name>archseer</r_Name>
                <r_Name>the-mikedavis</r_Name>
                <r_Name>sudormrfbin</r_Name>
                <r_Name>pascalkuthe</r_Name>
                <r_Name>dsseng</r_Name>
                <r_Name>pickfire</r_Name>
            </o_TopContributors>
        </o_AdditionalInfo>
    </r_Widget>
</root>

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    // Let's start by grabbing a reference to the widget element.
    // Because we use req to indicate that it should be considered
    // an error if this required element is missing, the element()
    // method will return a Result<&Element, XmlError>. So we use
    // the `?` operator to throw the XmlError if it occurs.
    let widget = xml_doc.root().req("r_Widget").element()?;

    // The name is required, so we just use req again. We also
    // expect the name to contain only simple text content (not
    // mixed with other elements or processing instructions) so we
    // call text() followed by the `?` operator to throw the
    // XmlError that will be generated if either the name element
    // is not found, or if it contains non-simple content.
    let widget_name = widget.req("r_Name").text()?;

    // The info and top contributor elements are optional (may or
    // may not appear in this type of XML document) so we can use
    // the opt method to indicate that it is not an error if
    // either element is not found. Instead of a
    // Result<&Element, XmlError> this entirely optional chain
    // will cause element() to give us an Option<&Element>
    // instead, so we use `if let` to take action only if the
    // given optional chain elements all exist.
    if let Some(top_contrib_list) = widget
        .opt("o_AdditionalInfo")
        .opt("o_TopContributors")
        .element() {
        println!("Found top {} contributors!",
            top_contrib_list.elements()
                .filter(|e| e.is_named("r_Name")).count());
    }

    // If we want the release date, that's a required element
    // within an optional element. In other words, it's not an
    // error if "o_AdditionalInfo" is missing, but if it *is*
    // found then we consider it an error if it does not contain
    // "r_ReleaseDate". This is a mixed chain, involving both
    // required and optional, which means that element() will
    // return a Result<Option<&Element>, XmlError>, an Option
    // wrapped in a Result. So we use `if let` and the `?`
    // operator together.
    if let Some(release_date) = widget
            .opt("o_AdditionalInfo")
            .req("r_ReleaseDate")
            .element()? {
        println!("Release date: {}", release_date.text()?);
    }

    Ok(())
}

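Note that the return types of the element() and text() methods vary depending on whether the method chain involves req, opt, or both. The following table summarizes the cases.

Chain involves	`element()` returns	`text()` returns
Only `req`	`Result<&Element, XmlError>`	`Result<&str, XmlError>`
Only `opt`	`Option<&Element>`	`Result<Option<&str>, XmlError>`
Both `req` and `opt`	`Result<Option<&Element>, XmlError>`	`Result<Option<&str>, XmlError>`

Similarly, the return types of the att_req and att_opt methods vary depending on the method chain.

Chain involves	`att_req(name)` returns	`att_opt(name)` returns
Only `req`	`Result<&str, XmlError>`	`Result<Option<&str>, XmlError>`
Only `opt`	`Result<Option<&str>, XmlError>`	`Option<&str>`
Both `req` and `opt`	`Result<Option<&str>, XmlError>`	`Result<Option<&str>, XmlError>`

An easier way to remember this: req/att_req generate an error if the element or attribute does not exist, so using them means the return type must involve some kind of Result<_, XmlError>. opt/att_opt may or may not return a value, so using them means the return type must involve some kind of Option<_>. Mixing the two (required and optional) means the return type must involve some kind of Result<Option<_>, XmlError>. And text() generates an error if the target element does not have simple content (no child elements and no processing instructions), so its use also means the return type must involve some kind of Result.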
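One way to see why the return type follows the chain is to model each kind of chain as its own cursor type. The sketch below is a toy model written for this explanation, not this crate's implementation: `Node`, `Missing`, `Req`, `Opt` and `Mixed` are all hypothetical stand-ins, and only element() is modelled.

```rust
/// A toy element tree; a stand-in, not the crate's `Element`.
struct Node {
    name: &'static str,
    children: Vec<Node>,
}

/// Toy error raised when a required child is missing.
#[derive(Debug, PartialEq)]
struct Missing(String);

impl Node {
    fn child(&self, name: &str) -> Option<&Node> {
        self.children.iter().find(|c| c.name == name)
    }
    fn req<'a>(&'a self, name: &str) -> Req<'a> {
        Req(self.child(name).ok_or_else(|| Missing(name.to_string())))
    }
    fn opt<'a>(&'a self, name: &str) -> Opt<'a> {
        Opt(self.child(name))
    }
}

/// Cursor states for only-required, only-optional and mixed chains.
struct Req<'a>(Result<&'a Node, Missing>);
struct Opt<'a>(Option<&'a Node>);
struct Mixed<'a>(Result<Option<&'a Node>, Missing>);

impl<'a> Req<'a> {
    fn req(self, name: &str) -> Req<'a> {
        Req(self
            .0
            .and_then(|n| n.child(name).ok_or_else(|| Missing(name.to_string()))))
    }
    fn opt(self, name: &str) -> Mixed<'a> {
        Mixed(self.0.map(|n| n.child(name)))
    }
    fn element(self) -> Result<&'a Node, Missing> {
        self.0
    }
}

impl<'a> Opt<'a> {
    fn opt(self, name: &str) -> Opt<'a> {
        Opt(self.0.and_then(|n| n.child(name)))
    }
    fn req(self, name: &str) -> Mixed<'a> {
        Mixed(match self.0 {
            // The optional part is absent: not an error.
            None => Ok(None),
            // The optional part exists, so the required child must too.
            Some(n) => n
                .child(name)
                .map(Some)
                .ok_or_else(|| Missing(name.to_string())),
        })
    }
    fn element(self) -> Option<&'a Node> {
        self.0
    }
}

impl<'a> Mixed<'a> {
    fn element(self) -> Result<Option<&'a Node>, Missing> {
        self.0
    }
}
```

In this model an all-required chain can only fail, an all-optional chain can only be absent, and a mixed chain can do either, which mirrors the three rows of the tables above.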

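Using `XmlPath` for more complex traversal

The req and opt methods always target the first child element with the given name. They cannot be used to target sibling elements, such as the second "Widget" in a list of "Widget" elements. To target siblings, or to iterate over multiple elements, you can use XmlPath. (Not to be confused with XPath, which has a similar purpose but a very different implementation.)

For example, if you have XML containing a list of employees and you want to iterate over their task deadlines, you can use XmlPath like this: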

<roster>
    <employee>
        <name>Angelica</name>
        <department>Finance</department>
        <task-list>
            <task>
                <name>Payroll</name>
                <deadline>tomorrow</deadline>
            </task>
            <task>
                <name>Reconciliation</name>
                <deadline>Friday</deadline>
            </task>
        </task-list>
    </employee>
    <employee>
        <name>Byron</name>
        <department>Sales</department>
        <task-list>
            <task>
                <name>Close the big deal</name>
                <deadline>Saturday night</deadline>
            </task>
        </task-list>
    </employee>
    <employee>
        <name>Cat</name>
        <department>Software</department>
        <task-list>
            <task>
                <name>Fix that bug</name>
                <deadline>Maybe later this month</deadline>
            </task>
            <task>
                <name>Add that new feature</name>
                <deadline>Possibly this year</deadline>
            </task>
            <task>
                <name>Make that customer happy</name>
                <deadline>Good luck with that</deadline>
            </task>
        </task-list>
    </employee>
</roster>

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    for deadline in xml_doc.root()
        .all("employee")
        .first("task-list")
        .all("task")
        .first("deadline")
        .iter() {
        println!("Found task deadline: {}", deadline.text()?);
    }
    Ok(())
}

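This creates and iterates an XmlPath that represents "the first deadline element within every task within the first task-list of every employee". Given the example XML above, it prints the text content of all six "deadline" elements.

Note that if we only wanted the first employee, we could use first("employee"). If we only wanted the second employee (zero being the first), we could use nth("employee", 1). If we only wanted the last employee, we could use last("employee"). Likewise, if we only wanted to consider the first task in each employee's list, we could use first("task").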
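The selection semantics of all, first, nth and last can be pictured as applying one step at a time to the set of elements selected so far. The following is a toy model over a hand-built tree, not this crate's XmlPath: `Node`, `Pick`, `step`, `leaf`, `parent` and `deadlines` are hypothetical names introduced for illustration.

```rust
/// A toy element; a stand-in, not the crate's `Element`.
struct Node {
    name: &'static str,
    text: &'static str,
    children: Vec<Node>,
}

/// Which of the matching children a step keeps.
#[derive(Clone, Copy)]
enum Pick {
    All,
    First,
    Nth(usize), // zero-based
    Last,
}

/// Apply one path step to every node selected so far.
fn step<'a>(selected: &[&'a Node], name: &str, pick: Pick) -> Vec<&'a Node> {
    let mut out = Vec::new();
    for &node in selected {
        let named: Vec<&'a Node> = node
            .children
            .iter()
            .filter(|c| c.name == name)
            .collect();
        match pick {
            Pick::All => out.extend(named),
            Pick::First => out.extend(named.first().copied()),
            Pick::Nth(i) => out.extend(named.get(i).copied()),
            Pick::Last => out.extend(named.last().copied()),
        }
    }
    out
}

/// Small builders so that hand-written trees stay readable.
fn leaf(name: &'static str, text: &'static str) -> Node {
    Node { name, text, children: Vec::new() }
}
fn parent(name: &'static str, children: Vec<Node>) -> Node {
    Node { name, text: "", children }
}

/// Mirrors the chain all("employee"), first("task-list"),
/// all("task"), first("deadline") from the example above.
fn deadlines(roster: &Node) -> Vec<&'static str> {
    let employees = step(&[roster], "employee", Pick::All);
    let lists = step(&employees, "task-list", Pick::First);
    let tasks = step(&lists, "task", Pick::All);
    let found = step(&tasks, "deadline", Pick::First);
    let texts: Vec<&'static str> = found.iter().map(|d| d.text).collect();
    texts
}
```

Each step multiplies out over the previous selection, which is why a chain that mixes all and first can still yield several elements. An actual path implementation would more likely be lazy rather than collect a Vec at every step.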

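Filtering elements within an `XmlPath`

An XmlPath lets you specify not only which child element names are of interest, but also which xml:lang patterns are of interest, and which attribute name-value pairs a child element must carry in order to be included by the iterator.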

<inventory>
    <box type='games'>
        <item>
            <name xml:lang='en'>C&amp;C: Tiberian Dawn</name>
            <name xml:lang='en-US'>Command &amp; Conquer</name>
            <name xml:lang='de'>C&amp;C: Teil 1</name>
        </item>
        <item>
            <name xml:lang='en'>Doom</name>
            <name xml:lang='sr'>Zla kob</name>
            <name xml:lang='ja'>ドゥーム</name>
        </item>
        <item>
            <name xml:lang='en'>Half-Life</name>
            <name xml:lang='sr'>Polu-život</name>
        </item>
    </box>
    <box type='movies'>
        <item>
            <name xml:lang='en'>Aliens</name>
            <name xml:lang='sv-SE'>Aliens - Återkomsten</name>
            <name xml:lang='vi'>Quái Vật Không Gian 2</name>
        </item>
        <item>
            <name xml:lang='en'>The Cabin In The Woods</name>
            <name xml:lang='bg'>Хижа в гората</name>
            <name xml:lang='fr'>La cabane dans les bois</name>
        </item>
    </box>
</inventory>

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    let english = ExtendedLanguageRange::new("en")?;

    for game in xml_doc.root()
        .all("box")
        .with_attribute("type", "games")
        .all("item")
        .all("name")
        .filter_lang_range(&english)
        .iter() {
        println!("Found game title in English: {}",
            game.text()?);
    }
    }
    Ok(())
}

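This will print the names of all four English titles across the three games. It skips all of the movies, as well as every name rejected by the "en" language filter. Note that this "en" filter matches both xml:lang="en" and xml:lang="en-US", so you get two matching name elements for the first game.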
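The reason "en" accepts both "en" and "en-US" can be shown with the simpler of the two filtering schemes in RFC 4647, basic filtering, where a range matches a tag that is equal to it or that continues with a hyphen. This sketch is an assumption about the general idea only: it does not reproduce the crate's ExtendedLanguageRange, and `basic_range_matches` and `filter_names` are hypothetical helpers.

```rust
/// RFC 4647 basic filtering: a range matches a tag when the two are
/// equal (ignoring ASCII case) or when the tag starts with the range
/// followed by '-'. The range "*" matches every tag.
fn basic_range_matches(range: &str, tag: &str) -> bool {
    if range == "*" {
        return true;
    }
    let range = range.to_ascii_lowercase();
    let tag = tag.to_ascii_lowercase();
    match tag.strip_prefix(range.as_str()) {
        Some(rest) => rest.is_empty() || rest.starts_with('-'),
        None => false,
    }
}

/// Keep the text of every (language tag, text) pair accepted by `range`.
fn filter_names<'a>(names: &[(&'a str, &'a str)], range: &str) -> Vec<&'a str> {
    names
        .iter()
        .filter(|(lang, _)| basic_range_matches(range, lang))
        .map(|(_, text)| *text)
        .collect()
}
```

Under this rule "en" rejects "eng" and "de", because the tag must either end where the range ends or continue with a hyphen. Extended filtering relaxes the matching of later subtags, which this sketch does not attempt.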

Attribute extraction

Use the methods att_req (generates an error if the attribute is missing) and att_opt (no error if the attribute is missing) to get the value of an attribute. For example, given this simple XML document, we can easily get the attribute values.

<root generationDate='2023-02-09T18:10:00Z'>
    <record id='35517'>
        <temp locationId='23'>40.5</temp>
    </record>
    <record id='35518'>
        <temp locationId='36'>38.9</temp>
    </record>
</root>

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    // Iterate the records using an XmlPath.
    for record in xml_doc.root().all("record").iter() {
        // The record@id attribute is required (we consider it an
        // error if it is missing). So use att_req and then the
        // `?` syntax to throw any XmlError generated.
        let record_id = record.att_req("id")?;
        let temp = record.req("temp").element()?;
        let temp_value = temp.text()?;

        // The temp@locationId attribute is optional (we don't
        // consider it an error if it's not found within this
        // element). So use att_opt and then `if let` to check for
        // it.
        if let Some(loc_id) = temp.att_opt("locationId") {
            println!("Found temperature {} at {}",
                temp_value, loc_id);
        } else {
            println!("Found temperature {} at ??? location.",
                temp_value);
        }
    }
    Ok(())
}

Note: the values of xml:lang and xml:space cannot be read from an Element as attribute values, because these are "special attributes" whose values are inherited by child elements (the language is also inherited by an element's attributes). To get the values of these language and space attributes, see the methods language_tag and white_space_handling.

Namespace handling

All of the examples so far have used XML without any namespace declarations, which means the element and attribute names are not within any namespace (or, put another way, they have a namespace with no value). When the namespace has no value, you can specify the target name of an element or attribute with a string slice &str. But when the target name has a namespace value, you must specify the namespace in order to target the desired element. The direct way to do this is with a tuple (&str, &str) containing the local part and the namespace (not the prefix) of the name. But you can also call the pre_ns (preset or predefined namespace) method to let the cursor or XmlPath know that it should assume the given namespace value whenever a namespace is not stated directly for an element or attribute with a tuple. This is probably easiest to explain with an example.

<root xmlns='example.com/DefaultNamespace' xmlns:pfx='example.com/OtherNamespace'>
    <one>This child element has no prefix, so it inherits the default namespace.</one>
    <pfx:two>This child element has prefix pfx, so inherits the other namespace.</pfx:two>
    <pfx:three pfx:key='value'>Attribute names can be prefixed too.</pfx:three>
    <four key2='value2'>Unprefixed attribute names do *not* inherit namespaces.</four>
    <five xmlns='' key3='value3'>The default namespace can be cleared too.</five>
</root>

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    let root = xml_doc.root();

    // You can use a tuple to specify the local part and namespace
    // of the targeted element.
    let one = root.req(("one", "example.com/DefaultNamespace"))
        .element()?;

    // Or you can call pre_ns before a chain of
    // req/opt/first/all/nth/last method calls.
    let two = root.pre_ns("example.com/OtherNamespace")
        .req("two").element()?;

    // The effect of pre_ns continues until you call element() or
    // text(), so you can keep assuming the same namespace for
    // child elements or attributes.
    let three_key = root.pre_ns("example.com/OtherNamespace")
        .req("three").att_req("key")?;

    // Be careful if the namespace changes (or is cleared) when
    // moving down through child elements and attributes. If that
    // happens, you can call pre_ns again, or you can use a tuple
    // to explicitly state the different namespace.
    let four_key = root
        .pre_ns("example.com/DefaultNamespace")
        .req("four")
        .pre_ns("")
        .att_req("key2")?;

    // When no namespace applies to an element or attribute name,
    // you don't need to specify any namespace to target it, so
    // you don't need to use pre_ns nor a tuple. But you can
    // anyway if you want to make it more explicit that there is
    // no namespace.
    let five_key = root.req(("five", "")).att_req(("key3", ""))?;

    Ok(())
}

It's important to note that the effect of pre_ns ends once you call element(). So if you call element() in the middle of a method chain, remember to call pre_ns again to state the preset namespace from that point on.

<root xmlns='example.com/DefaultNamespace'>
    <topLevel>
        <innerLevel>
            <list>
                <item>something</item>
                <item>whatever</item>
                <item>more</item>
                <item>and so on</item>
            </list>
        </innerLevel>
    </topLevel>
</root>

// Defining a static constant makes it quicker to type namespaces,
// and easier to read the code.
const NS_DEF: &str = "example.com/DefaultNamespace";

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    // Use a chain of req calls to get to the required list, then
    // use an XmlPath to iterate however many items are found
    // within the list and count them.

    // This first attempt will actually give us the wrong number,
    // because once we call element()? we receive an `&Element`
    // reference, and the preset namespace effect is lost. So the
    // XmlPath we chain on straight after that will be searching
    // the empty namespace and won't find any matching elements
    // and will report a count of zero.
    let mistake = xml_doc
        .root()
        .pre_ns(NS_DEF)
        .req("topLevel")
        .req("innerLevel")
        .req("list")
        .element()?
        .all("item")
        .iter()
        .count();

    // You can fix the problem by either using an explicit name
    // tuple `("item", NS_DEF)` or by calling pre_ns again after
    // element() so that the XmlPath knows which namespace should
    // be used when searching for items.
    let correct = xml_doc
        .root()
        .pre_ns(NS_DEF)
        .req("topLevel")
        .req("innerLevel")
        .req("list")
        .element()?
        .pre_ns(NS_DEF)
        .all("item")
        .iter()
        .count();

    // However, to avoid confusion, it's recommended to avoid
    // including `element()` between two different method chains,
    // and to instead assign it to a variable name for clarity.
    let list = xml_doc
        .root()
        .pre_ns(NS_DEF)
        .req("topLevel")
        .req("innerLevel")
        .req("list")
        .element()?;
    let cleanest = list.all(("item", NS_DEF)).iter().count();

    Ok(())
}

Error handling

The examples above use simplified code snippets to save space, but in a real application you need to handle the different error types returned by the different steps of reading/parsing and extracting XML. Here is a compact example showing the error handling needed for each step.

fn main() {
    // Decide what to do if either step returns an error.
    // For simplicity, we'll simply panic in this example, but in
    // a real application you may want to remap the error to the
    // type used by your application, or trigger some recovery
    // logic instead.
    let xml_doc = match read_xml() {
        Ok(d) => d,
        Err(e) => panic!("XML reading or parsing failed: {}", e),
    };

    match extract_xml(xml_doc) {
        Ok(()) => println!("Finished without errors!"),
        Err(_) => panic!("XML extraction failed!"),
    }
}

// The XML parsing methods might throw an std::io::Error, so they
// go into their own method.
fn read_xml() -> Result<XmlDocument, std::io::Error> {
    let xml = "<root><child/></root>";
    XmlReader::parse_auto(xml.as_bytes())
}

// The extraction methods might throw an XmlError, so they go into
// their own method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    let _child = xml_doc.root().req("child").element()?;
    Ok(())
}
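The comments in the example above mention remapping errors to a type used by your application. One common Rust pattern for that is an application error enum with From conversions, so that the `?` operator can be used across both steps. The sketch below is self-contained and uses stand-ins throughout: `ExtractError` is a hypothetical placeholder for the crate's extraction error, and `AppError`, `read_step`, `extract_step` and `run` are invented for illustration.

```rust
use std::fmt;
use std::io;

/// Hypothetical stand-in for the crate's extraction error type.
#[derive(Debug)]
struct ExtractError(String);

/// Hypothetical application error wrapping both failure kinds.
#[derive(Debug)]
enum AppError {
    Read(io::Error),
    Extract(ExtractError),
}

impl From<io::Error> for AppError {
    fn from(e: io::Error) -> Self {
        AppError::Read(e)
    }
}

impl From<ExtractError> for AppError {
    fn from(e: ExtractError) -> Self {
        AppError::Extract(e)
    }
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            AppError::Read(e) => write!(f, "XML reading or parsing failed: {}", e),
            AppError::Extract(e) => write!(f, "XML extraction failed: {}", e.0),
        }
    }
}

/// Stand-in for the read/parse step, which may fail with an io::Error.
fn read_step(fail: bool) -> Result<&'static str, io::Error> {
    if fail {
        Err(io::Error::new(io::ErrorKind::InvalidData, "malformed XML"))
    } else {
        Ok("<root><child/></root>")
    }
}

/// Stand-in for the extraction step, which has its own error type.
fn extract_step(xml: &str) -> Result<usize, ExtractError> {
    if xml.contains("<child") {
        Ok(1)
    } else {
        Err(ExtractError("missing child".to_string()))
    }
}

/// Both steps share one error type, so `?` converts automatically.
fn run(fail_read: bool) -> Result<usize, AppError> {
    let xml = read_step(fail_read)?;
    let count = extract_step(xml)?;
    Ok(count)
}
```

With this shape the caller matches on a single error type and decides in one place whether to report, retry or abort.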
