TLDExtract: Rust library // Lib.rs

1 unstable release

0.1.0	Dec 14, 2023

#8 in #top-level

GPL-3.0-only

260KB
14K SLoC

Summary

tldextract-rs is a high-performance effective top-level domain (eTLD) extraction module that extracts subcomponents from domain names.

Hostname

Cargo.toml

tld_extract = { git = "https://github.com/emo-cat/tldextract-rs" }

Example code

use tld_extract::TLDExtract;

// Note: this example also needs `serde_json` as a dependency to print the result.
fn main() {
    let source = tld_extract::Source::Hardcode;
    let suffix = tld_extract::SuffixList::new(source, false, None);
    let mut extract = TLDExtract::new(suffix, true).unwrap();
    let e = extract.extract("  mirrors.tuna.tsinghua.edu.cn").unwrap();
    // Serialize the extraction result as pretty-printed JSON.
    let s = serde_json::to_string_pretty(&e).unwrap();
    println!("{}", s);
}

ExtractResult

{
  "subdomain": "mirrors.tuna",
  "domain": "tsinghua",
  "suffix": "edu.cn",
  "registered_domain": "tsinghua.edu.cn"
}
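
The relationship between the four fields can be read off the sample output. The following is a minimal sketch of that relationship only, assuming the suffix is already known; `split_host` is a hypothetical helper written for this illustration and is not part of the tld_extract API.

```rust
/// Split `host` into (subdomain, domain, registered_domain) for a known `suffix`.
/// Hypothetical helper for illustration; returns None when `host` does not end
/// with ".<suffix>" or when no domain label precedes the suffix.
fn split_host<'a>(host: &'a str, suffix: &str) -> Option<(&'a str, &'a str, &'a str)> {
    // Strip the suffix and the dot that precedes it.
    let rest = host.strip_suffix(suffix)?.strip_suffix('.')?;
    // The label immediately to the left of the suffix is the domain;
    // everything further left is the subdomain.
    let (subdomain, domain) = match rest.rfind('.') {
        Some(dot) => (&rest[..dot], &rest[dot + 1..]),
        None => ("", rest),
    };
    if domain.is_empty() {
        return None;
    }
    // registered_domain is "<domain>.<suffix>", taken as a slice of the host.
    let start = host.len() - domain.len() - 1 - suffix.len();
    Some((subdomain, domain, &host[start..]))
}
```

Calling `split_host("mirrors.tuna.tsinghua.edu.cn", "edu.cn")` yields the same subdomain, domain, and registered domain as the JSON above.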

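Implementation details

Why not split on "." and take the last element instead?

Splitting on "." and taking the last element only works for simple eTLDs such as com, but not for more complex eTLDs such as oseto.nagasaki.jp.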
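
To make the failure concrete, here is a small sketch of the naive approach. `naive_suffix` is a hypothetical function written for this illustration, not something provided by tldextract-rs.

```rust
/// Naive suffix extraction: return the text after the last '.' in `host`.
/// Hypothetical helper for illustration; not part of tldextract-rs.
fn naive_suffix(host: &str) -> &str {
    // `rsplit` yields the labels from right to left, so the first item
    // is the rightmost label.
    host.rsplit('.').next().unwrap_or(host)
}
```

For `example.com` this returns `com`, which is correct. For `example.oseto.nagasaki.jp` it returns `jp`, even though the effective TLD is `oseto.nagasaki.jp`.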

eTLD tries

tldextract-rs stores eTLDs in a compressed trie.

Valid eTLDs from the Mozilla Public Suffix List are appended to the compressed trie in reverse order.

Given the following eTLDs
au
nsw.edu.au
com.ac
edu.ac
gov.ac

and the example URL host `example.nsw.edu.au`

The compressed trie will be structured as follows:

START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩

=== Symbol meanings ===
🚩 : path to this node is a valid eTLD
✅ : path to this node found in example URL host `example.nsw.edu.au`

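The subcomponents of the URL host are parsed from right to left until no more matching nodes can be found. In this example, the path of matching nodes is au -> edu -> nsw. Reversing the nodes gives the extracted eTLD nsw.edu.au.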
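
The lookup described above can be sketched with a plain label trie. This is only an illustration under simplifying assumptions: it uses an uncompressed `HashMap`-based trie rather than the compressed trie the crate uses, it ignores the wildcard and exception rules of the Public Suffix List, and the names `Node`, `insert`, `longest_suffix`, and `example_trie` are hypothetical.

```rust
use std::collections::HashMap;

/// One trie node per domain label. Hypothetical type for illustration only.
#[derive(Default)]
struct Node {
    children: HashMap<String, Node>,
    /// True when the path from the root to this node spells a valid eTLD.
    is_suffix: bool,
}

/// Insert an eTLD such as "nsw.edu.au" label by label, rightmost label first.
fn insert(root: &mut Node, etld: &str) {
    let mut node = root;
    for label in etld.rsplit('.') {
        node = node.children.entry(label.to_string()).or_default();
    }
    node.is_suffix = true;
}

/// Walk the host's labels from right to left and return the longest eTLD matched.
fn longest_suffix(root: &Node, host: &str) -> Option<String> {
    let labels: Vec<&str> = host.split('.').collect();
    let mut node = root;
    // Index of the leftmost label of the longest valid eTLD seen so far.
    let mut best = None;
    for (i, label) in labels.iter().enumerate().rev() {
        match node.children.get(*label) {
            Some(child) => {
                node = child;
                if node.is_suffix {
                    best = Some(i);
                }
            }
            // No matching node: stop descending.
            None => break,
        }
    }
    best.map(|i| labels[i..].join("."))
}

/// Build the trie for the five example eTLDs listed above.
fn example_trie() -> Node {
    let mut root = Node::default();
    for etld in ["au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"] {
        insert(&mut root, etld);
    }
    root
}
```

With the trie from `example_trie()`, `longest_suffix(&trie, "example.nsw.edu.au")` follows au -> edu -> nsw and returns `nsw.edu.au`, while `example.edu.au` only matches the flagged `au` node, because `edu.au` is not in the example list.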

Acknowledgements

Dependencies

~1–18MB
~248K SLoC