7 versions

0.1.6	May 28, 2024
0.1.5	May 21, 2024
0.1.4	Mar 1, 2023
0.1.1	Feb 28, 2023

#743 in Web programming

Used in 3 crates (via progscrape-scrapers)

Apache-2.0 OR MIT

685KB
385 lines

urlnorm

A URL normalization library, used primarily to normalize URLs for https://progscrape.com.

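The normalization algorithm uses the following heuristics:

The scheme of the URL is dropped, so http://example.com and https://example.com are considered equivalent.
The host is normalized by dropping common domain prefixes such as www. and m.
The path is normalized by removing duplicate slashes and empty path segments, so http://example.com//foo/ and http://example.com/foo are considered equivalent.
Query string parameters are sorted, and any analytics query parameters are removed (for example utm_XYZ and similar).
Fragments are dropped, except for certain fragment patterns that are recognized as significant (such as /#/ and #!).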
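The heuristics above can be approximated with the standard library alone. The following is a minimal sketch, not urlnorm's implementation: the function name `sketch_normalize`, the hard-coded prefix list, and the `utm_` filter are assumptions made for illustration. It also simplifies in two places: it drops every fragment (no exception for /#/ or #!) and it leaves file extensions such as .html in place.

```rust
/// Hypothetical std-only sketch of the listed heuristics.
/// NOT the urlnorm implementation; names and rules here are illustrative.
fn sketch_normalize(url: &str) -> String {
    // Drop the scheme, so http:// and https:// compare equal.
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    // Drop the fragment (simplification: significant fragments are not kept).
    let rest = rest.split_once('#').map_or(rest, |(r, _)| r);
    // Separate the query string, then split host from path.
    let (host_path, query) = rest.split_once('?').unwrap_or((rest, ""));
    let (mut host, path) = host_path.split_once('/').unwrap_or((host_path, ""));

    // Strip common host prefixes repeatedly ("www.m.example.com" -> "example.com").
    loop {
        if let Some(s) = host.strip_prefix("www.") {
            host = s;
        } else if let Some(s) = host.strip_prefix("m.") {
            host = s;
        } else {
            break;
        }
    }

    // Host first, then non-empty path segments (this removes duplicate slashes).
    let mut tokens: Vec<&str> = vec![host];
    tokens.extend(path.split('/').filter(|s| !s.is_empty()));

    // Sort query parameters and drop analytics-style ones (assumed: "utm_" prefix).
    let mut params: Vec<(&str, &str)> = query
        .split('&')
        .filter(|p| !p.is_empty())
        .map(|p| p.split_once('=').unwrap_or((p, "")))
        .filter(|(k, _)| !k.starts_with("utm_"))
        .collect();
    params.sort();
    for (k, v) in params {
        tokens.push(k);
        if !v.is_empty() {
            tokens.push(v);
        }
    }

    // Terminate every token with ':' to form a single comparable string.
    tokens.iter().map(|t| format!("{t}:")).collect()
}
```

With this sketch, `sketch_normalize("http://example.com//foo/")` and `sketch_normalize("https://example.com/foo")` return the same string, which is the equivalence the path and scheme rules describe. For real use, prefer the library's own normalizer, which handles the cases this sketch skips.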

Usage

For long-term storage and clustering of URLs, it is recommended to use UrlNormalizer::compute_normalization_string to compute a representation of the URL that can be compared with the standard string comparison operators.

Normalization strings are not a perfect content-clustering algorithm, but they tend to cluster URLs that point to the same data. For a more precise clustering algorithm, this library can be combined with more advanced DUST-aware processing (for example, see DustBuster from "Do Not Crawl in the DUST: Different URLs with Similar Text").

```rust
use url::Url;
use urlnorm::UrlNormalizer;

let norm = UrlNormalizer::default();
let url = Url::parse("http://www.google.com").unwrap();
assert_eq!(norm.compute_normalization_string(&url), "google.com:");
```

For more advanced use cases, the Options type allows end users to supply custom regular expressions for normalization.

Examples

The normalization strings below give an idea of which parts of a URL are considered significant. Each URL is followed by its normalization string.

http://efekarakus.github.io/twitch-analytics/#/revenue
efekarakus.github.io:twitch-analytics:revenue:

http://fusion.net/story/121315/maybe-crickets-arent-the-food-of-the-future-after-all/?utm_source=facebook&utm_medium=social&utm_campaign=quartz
fusion.net:story:121315:maybe-crickets-arent-the-food-of-the-future-after-all:

http://www.capradio.org/news/npr/story?storyid=382276026
capradio.org:news:npr:story:storyid:382276026:

http://www.charlotteobserver.com/2015/02/23/5534630/charlotte-city-council-approves.html#.VOxrajTF91E
charlotteobserver.com:2015:02:23:5534630:charlotte-city-council-approves:

http://www.m.webmd.com/melanoma-skin-cancer/news/20150409/fewer-us-children-getting-melanoma-study?src=RSS_PUBLIC
webmd.com:melanoma-skin-cancer:news:20150409:fewer-us-children-getting-melanoma-study:src:RSS_PUBLIC:
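Because a normalization string is an ordinary string, clustering reduces to grouping URLs by that key. The sketch below shows one way this might look with the standard library only. `cluster_by_key` and `toy_key` are hypothetical names introduced here; `toy_key` is a deliberately crude stand-in so the example has no external dependencies, and in practice the key function would call UrlNormalizer::compute_normalization_string.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: groups URLs under whatever key `key` computes for them.
fn cluster_by_key<'a, F>(urls: &[&'a str], key: F) -> BTreeMap<String, Vec<&'a str>>
where
    F: Fn(&str) -> String,
{
    let mut clusters: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for &url in urls {
        // URLs with equal keys land in the same bucket, in input order.
        clusters.entry(key(url)).or_default().push(url);
    }
    clusters
}

/// Toy stand-in for a real normalizer (assumption, not urlnorm's logic):
/// drops the scheme, a leading "www.", and trailing slashes.
fn toy_key(url: &str) -> String {
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    format!("{}:", rest.trim_start_matches("www.").trim_end_matches('/'))
}
```

Calling `cluster_by_key(&urls, toy_key)` on a slice of URL strings returns a map from key to the URLs sharing it. A BTreeMap keeps the keys sorted, which is convenient for stable output; a HashMap would work equally well when ordering does not matter.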

Dependencies

~3–4.5MB
~101K SLoC

regex
url
dev criterion 0.5
dev rstest