6 stable releases

1.0.6	Sep 28, 2023
1.0.5	Sep 22, 2023

#267 in Internationalization (i18n)

82 downloads per month

Custom license and maybe LGPL-3.0

180KB
4K SLoC

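Charset Normalizer

A library that helps you read text from an unknown charset encoding.
Motivated by the original Python version of charset-normalizer, I'm trying to resolve the issue by taking a new approach. All IANA character set names for which the Rust encoding library provides codecs are supported.

This project is a port of the original Python version of Charset Normalizer. The biggest difference between the Python and Rust versions is the number of supported encodings, since each language has its own encoding/decoding library. The Rust version supports only encodings from the WhatWG standard. The Python version supports more encodings, but many of them are old and almost unused.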

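⚡ Performance

This package performs better than the Python version (3 times faster than the MYPYC version of charset-normalizer, 6 times faster than the regular Python version). However, it is slower than the chardet and chardetng packages but more accurate (I guess because it processes the whole file chunk by chunk). Here are some numbers.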

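Package	Accuracy	Mean per file (ms)	Files per sec (est)
chardet	79 %	2.2 ms	450 files/sec
chardetng	78 %	1.6 ms	625 files/sec
charset-normalizer-rs	96.8 %	2.7 ms	370 files/sec
charset-normalizer (Python + MYPYC version)	98 %	8 ms	125 files/sec

Package	99th percentile	95th percentile	50th percentile
chardet	8 ms	2 ms	0.2 ms
chardetng	14 ms	5 ms	0.5 ms
charset-normalizer-rs	19 ms	7 ms	1.2 ms
charset-normalizer (Python + MYPYC version)	94 ms	37 ms	3 ms

Stats are generated from 400+ files using default parameters. These results might change at any time. The dataset can be updated to include more files. The actual latency depends heavily on your CPU capabilities. The relative factors should remain the same. The dataset for the Rust version has been reduced because it supports fewer encodings than the Python version.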
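The files-per-second column is an estimate that follows directly from the mean time per file. A minimal sketch of that arithmetic, where the helper name files_per_second is hypothetical and the latencies are the ones reported in the first table:

```rust
/// Convert a mean per-file latency in milliseconds into files per second.
/// (Hypothetical helper, used only to illustrate the arithmetic.)
fn files_per_second(mean_ms: f64) -> f64 {
    1000.0 / mean_ms
}

fn main() {
    // Mean latencies copied from the first table.
    let rows = [
        ("chardet", 2.2),
        ("chardetng", 1.6),
        ("charset-normalizer-rs", 2.7),
        ("charset-normalizer (Python + MYPYC)", 8.0),
    ];
    for (name, mean_ms) in rows.iter() {
        println!("{}: {:.0} files/sec", name, files_per_second(*mean_ms));
    }
    // Ratio between the MYPYC build and the Rust port.
    println!("speed-up over MYPYC: {:.1}x", 8.0 / 2.7);
}
```

The chardet row works out to about 455 files/sec, which the table rounds to 450. The ratio 8 / 2.7 is a little under 3, consistent with the "3 times faster" claim.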

There is still room to speed up the library, so I'll appreciate any contributions.

✨ Installation

Library installation

cargo add charset-normalizer-rs

Binary CLI tool installation

cargo install charset-normalizer-rs

🚀 Basic Usage

CLI

This package comes with a CLI, which is supposed to be compatible with the Python version's CLI tool.

normalizer -h
Usage: normalizer [OPTIONS] <FILES>...

Arguments:
  <FILES>...  File(s) to be analysed

Options:
  -v, --verbose                Display complementary information about file if any. Stdout will contain logs about the detection process
  -a, --with-alternative       Output complementary possibilities if any. Top-level JSON WILL be a list
  -n, --normalize              Permit to normalize input file. If not set, program does not write anything
  -m, --minimal                Only output the charset detected to STDOUT. Disabling JSON output
  -r, --replace                Replace file when trying to normalize it instead of creating a new one
  -f, --force                  Replace file without asking if you are sure, use this flag with caution
  -t, --threshold <THRESHOLD>  Define a custom maximum amount of chaos allowed in decoded content. 0. <= chaos <= 1 [default: 0.2]
  -h, --help                   Print help
  -V, --version                Print version

normalizer ./data/sample.1.fr.srt

🎉 The CLI produces easily usable stdout results in JSON format (should be the same as in the Python version).

{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}

Rust

The library offers two main methods. The first is from_bytes, which processes text using bytes as the input parameter.

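use charset_normalizer_rs::from_bytes;

fn test_from_bytes() {
    // Run detection on a raw byte payload; the second argument is left as None.
    let result = from_bytes(&vec![0x84, 0x31, 0x95, 0x33], None);
    // get_best() yields the most probable match, if there is one.
    let best_guess = result.get_best();
    assert_eq!(
        best_guess.unwrap().encoding(),
        "gb18030",
    );
}

fn main() {
    test_from_bytes();
}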

from_path processes text using a filename as the input parameter.

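use std::path::PathBuf;
use charset_normalizer_rs::from_path;

fn test_from_path() {
    // Run detection on a file; unwrap() panics if the call returns an error.
    let result = from_path(&PathBuf::from("src/tests/data/samples/sample-chinese.txt"), None).unwrap();
    // get_best() yields the most probable match, if there is one.
    let best_guess = result.get_best();
    assert_eq!(
        best_guess.unwrap().encoding(),
        "big5",
    );
}

fn main() {
    test_from_path();
}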

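😇 Why

When I started using Chardet (Python version), I noticed that it did not meet my expectations, and I wanted to propose a reliable alternative using a completely different method. Also! I never back down from a good challenge!

I don't care about the originating charset encoding, because two different tables can produce two identical rendered strings. What I want is to get readable text, the best I can.

In a way, I'm brute-forcing text decoding. How cool is that? 😎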

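🍰 How

Discard all charset encoding tables that could not fit the binary content.
Measure the noise, or chaos, once the content is opened (by chunks) with a corresponding charset encoding.
Extract the matches with the lowest chaos detected.
Additionally, we measure coherence / probe for a language.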
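The first three steps can be sketched with the standard library alone. This is a toy illustration under stated assumptions, not the crate's implementation: the three candidate decoders, the chunk size, and the chaos measure (share of non-whitespace control characters per chunk) are all hypothetical stand-ins. Only the 0.2 limit comes from the CLI's default threshold.

```rust
/// Maximum chaos a candidate may show (the CLI's default threshold).
const MAX_CHAOS: f64 = 0.2;
/// Characters per chunk; an arbitrary value chosen for illustration.
const CHUNK_SIZE: usize = 8;

/// A hypothetical candidate decoder: None means the bytes cannot fit the table.
type Decoder = fn(&[u8]) -> Option<String>;

fn decode_ascii(bytes: &[u8]) -> Option<String> {
    if bytes.is_ascii() {
        Some(bytes.iter().map(|&b| b as char).collect())
    } else {
        None
    }
}

fn decode_utf8(bytes: &[u8]) -> Option<String> {
    std::str::from_utf8(bytes).ok().map(str::to_owned)
}

fn decode_latin1(bytes: &[u8]) -> Option<String> {
    // Each byte maps to the code point with the same value, so this always fits.
    Some(bytes.iter().map(|&b| b as char).collect())
}

/// Toy chaos measure: mean, over fixed-size chunks, of the share of
/// non-whitespace control characters in each chunk.
fn chaos_ratio(text: &str) -> f64 {
    let chars: Vec<char> = text.chars().collect();
    if chars.is_empty() {
        return 0.0;
    }
    let per_chunk: Vec<f64> = chars
        .chunks(CHUNK_SIZE)
        .map(|chunk| {
            let noisy = chunk
                .iter()
                .filter(|c| c.is_control() && !c.is_whitespace())
                .count();
            noisy as f64 / chunk.len() as f64
        })
        .collect();
    per_chunk.iter().sum::<f64>() / per_chunk.len() as f64
}

/// Returns the name and chaos of the least noisy candidate, if any survives.
fn best_guess(bytes: &[u8]) -> Option<(&'static str, f64)> {
    let candidates: [(&'static str, Decoder); 3] = [
        ("ascii", decode_ascii as Decoder),
        ("utf-8", decode_utf8 as Decoder),
        ("latin-1", decode_latin1 as Decoder),
    ];
    let mut best: Option<(&'static str, f64)> = None;
    for (name, decode) in candidates {
        // Step 1: discard tables that cannot fit the binary content.
        let text = match decode(bytes) {
            Some(text) => text,
            None => continue,
        };
        // Step 2: measure chaos chunk by chunk and drop candidates over the limit.
        let chaos = chaos_ratio(&text);
        if chaos > MAX_CHAOS {
            continue;
        }
        // Step 3: keep the lowest chaos; the earlier candidate wins a tie.
        if best.map_or(true, |(_, lowest)| chaos < lowest) {
            best = Some((name, chaos));
        }
    }
    best
}

fn main() {
    // UTF-8 bytes of a Cyrillic word.
    let payload = "Привет".as_bytes();
    println!("{:?}", best_guess(payload));
}
```

For this payload the ASCII candidate is discarded outright. The Latin-1 candidate decodes successfully but three of its twelve characters are C1 control characters, giving a chaos of 0.25, so it is rejected by the 0.2 limit and the UTF-8 candidate is kept. A measure this crude cannot separate many real-world cases, which is why the actual rules are far more elaborate.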

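Wait a minute, what is noise/chaos and coherence according to YOU?

Noise: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed them, then established some ground rules about what is "obvious" when text looks like a mess. I know that my interpretation of what noise is may be incomplete, so feel free to contribute in order to improve or rewrite it.

Coherence: For each language there is on earth, we have computed a ranking of letter appearance frequencies (the best we can). I thought that information was worth something here, so I compare those records against the decoded text to check whether I can detect intelligent design.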
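The coherence idea can be illustrated with a deliberately simple sketch. It assumes a single reference ranking (the classic "etaoin shrdlu" order of frequent English letters) and a hypothetical score: the share of the text's most frequent letters that also appear in the reference list. The function names and the scoring rule are illustrative assumptions, not the crate's actual method.

```rust
use std::collections::HashMap;

/// Twelve frequent English letters in the classic "etaoin shrdlu" order.
const ENGLISH_TOP: &str = "etaoinshrdlu";

/// Letters of `text` ordered from most to least frequent
/// (ties are broken alphabetically so the result is deterministic).
fn letter_ranking(text: &str) -> Vec<char> {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in text.chars().filter(|c| c.is_alphabetic()) {
        for lower in c.to_lowercase() {
            *counts.entry(lower).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(char, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked.into_iter().map(|(c, _)| c).collect()
}

/// Toy coherence score in [0, 1]: share of the text's most frequent letters
/// that also belong to the language's list of frequent letters.
fn coherence(text: &str, language_top: &str) -> f64 {
    let expected: Vec<char> = language_top.chars().collect();
    let observed: Vec<char> = letter_ranking(text)
        .into_iter()
        .take(expected.len())
        .collect();
    if observed.is_empty() {
        return 0.0;
    }
    let hits = observed.iter().filter(|c| expected.contains(*c)).count();
    hits as f64 / observed.len() as f64
}

fn main() {
    let readable = "the rain in the north is also rather loud and hard on the shed";
    let garbled = "zqxj kvwz qqxz jjkv wzqx";
    println!("readable: {:.2}", coherence(readable, ENGLISH_TOP));
    println!("garbled:  {:.2}", coherence(garbled, ENGLISH_TOP));
}
```

The readable sentence uses only letters from the reference list, so it scores 1.0, while the garbled string shares none of them and scores 0.0. A decoding that yields text whose letter distribution matches some language's ranking is more likely to be the right one.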

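⚡ Known limitations

Language detection is unreliable when the text contains two or more languages that share identical letters (e.g. HTML with English tags plus Turkish content, both using Latin characters).
Every charset detector depends heavily on having sufficient content. In common cases, do not bother running detection on very tiny content.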

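👤 Contributing

Contributions, issues and feature requests are very much welcome.
Feel free to check the issues page if you want to contribute.

📝 License

Dependencies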

~10–21MB
~261K SLoC