tldextract-rs: Rust library // Lib.rs

2 unstable releases

0.1.1	Dec 19, 2023
0.0.0	Dec 14, 2023

#7 in #tld

GPL-3.0-only

260KB
14K SLoC

Summary

tldextract-rs is a high-performance effective top-level domain (eTLD) extraction module that extracts subcomponents from a domain name.

Hostname

Cargo.toml

tldextract-rs = { git = "https://github.com/emo-cat/tldextract-rs" }

Example code

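use tldextract_rs::TLDExtract;

fn main() {
    // Build the suffix list from the hardcoded source.
    let source = tldextract_rs::Source::Hardcode;
    let suffix = tldextract_rs::SuffixList::new(source, false, None);
    // Create the extractor and split the host into its subcomponents.
    let mut extract = TLDExtract::new(suffix, true).unwrap();
    let e = extract.extract("  mirrors.tuna.tsinghua.edu.cn").unwrap();
    // Pretty-print the result as JSON (needs serde_json as a dependency).
    let s = serde_json::to_string_pretty(&e).unwrap();
    println!("{}", s);
}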

ExtractResult

{
  "subdomain": "mirrors.tuna",
  "domain": "tsinghua",
  "suffix": "edu.cn",
  "registered_domain": "tsinghua.edu.cn"
}

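Implementation details

Why not split on "." and take the last element?

Splitting on "." and taking the last element only works for simple eTLDs such as com. It fails for multi-label eTLDs such as oseto.nagasaki.jp.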
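As a small illustration of that failure, the sketch below implements the naive approach with the standard library only. The function name naive_suffix is hypothetical and is not part of tldextract-rs.

```rust
/// Naive approach: treat everything after the last "." as the suffix.
/// `naive_suffix` is a hypothetical helper, not part of tldextract-rs.
fn naive_suffix(host: &str) -> &str {
    host.rsplit('.').next().unwrap_or(host)
}

fn main() {
    // Correct for a single-label eTLD.
    assert_eq!(naive_suffix("example.com"), "com");
    // Wrong for a multi-label eTLD: the suffix should be "oseto.nagasaki.jp".
    assert_eq!(naive_suffix("example.oseto.nagasaki.jp"), "jp");
}
```

The second call returns only the last label, so the registered domain derived from it would also be wrong.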

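eTLD tries

tldextract-rs stores eTLDs in a compressed trie.

Valid eTLDs from the Mozilla Public Suffix List are appended to the compressed trie in reverse order, label by label.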

Given the following eTLDs
au
nsw.edu.au
com.ac
edu.ac
gov.ac

and the example URL host `example.nsw.edu.au`

The compressed trie will be structured as follows:

START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩

=== Symbol meanings ===
🚩 : path to this node is a valid eTLD
✅ : path to this node found in example URL host `example.nsw.edu.au`

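The subcomponents of the URL host are parsed from right to left until no more matching nodes can be found. In this example, the path of matching nodes is au -> edu -> nsw. Reversing the nodes gives the extracted eTLD nsw.edu.au.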
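The insertion and lookup steps can be sketched as follows. This is a minimal illustration and not the crate's actual implementation: it uses a plain (uncompressed) trie with one node per label, the names Node, SuffixTrie, insert and longest_suffix are hypothetical, and it ignores the wildcard and exception rules of the Public Suffix List.

```rust
use std::collections::HashMap;

/// One node per label. Simplified: a plain trie, not a compressed one.
#[derive(Default)]
struct Node {
    children: HashMap<String, Node>,
    is_suffix: bool, // corresponds to the flag symbol in the diagram
}

/// Hypothetical container type used only for this sketch.
#[derive(Default)]
struct SuffixTrie {
    root: Node,
}

impl SuffixTrie {
    /// Insert an eTLD with its labels reversed ("nsw.edu.au" -> au, edu, nsw).
    fn insert(&mut self, etld: &str) {
        let mut node = &mut self.root;
        for label in etld.rsplit('.') {
            node = node.children.entry(label.to_string()).or_default();
        }
        node.is_suffix = true;
    }

    /// Walk the host's labels right to left and return the longest eTLD found.
    fn longest_suffix(&self, host: &str) -> Option<String> {
        let mut node = &self.root;
        let mut path: Vec<&str> = Vec::new();
        let mut best_len = 0;
        for label in host.rsplit('.') {
            match node.children.get(label) {
                Some(child) => {
                    path.push(label);
                    node = child;
                    if node.is_suffix {
                        best_len = path.len();
                    }
                }
                None => break, // no more matching nodes
            }
        }
        if best_len == 0 {
            return None;
        }
        // Reverse the matched labels to restore left-to-right order.
        let labels: Vec<&str> = path[..best_len].iter().rev().copied().collect();
        Some(labels.join("."))
    }
}

fn main() {
    let mut trie = SuffixTrie::default();
    let etlds = ["au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"];
    for etld in etlds.iter() {
        trie.insert(etld);
    }
    let found = trie.longest_suffix("example.nsw.edu.au");
    assert_eq!(found.as_deref(), Some("nsw.edu.au"));
}
```

Tracking the length of the last flagged match matters because an intermediate node such as edu under au is on the path but is not itself an eTLD. For a host like example.edu.au, the walk reaches edu and then stops, and the sketch falls back to au.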

Acknowledgements

Dependencies

~1–18MB
~245K SLoC