131 versions

0.7.26	Jul 11, 2024
0.7.24	Mar 16, 2024
0.7.20	Dec 18, 2023
0.7.15	Nov 19, 2023
0.1.33	Nov 29, 2021

#179 in Rust patterns

5,994 downloads per month
Used in 18 crates (5 directly)

MIT/Apache

590KB
3K SLoC

rustrict

rustrict is a profanity filter for Rust.

Disclaimer: Multiple source files (.txt, .csv, .rs test cases) contain profanity. Viewer discretion is advised.

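Features

Multiple types (profane, offensive, sexual, mean, spam)
Multiple levels (mild, moderate, severe)
Resistant to evasion
- Alternative spellings (like "fck")
- Repeated characters (like "craaaap")
- Confusable characters (like 'ᑭ', '𝕡', and '🅿')
- Spacing (like "c r_a-p")
- Accents (like "pÓöp")
- Bidirectional Unicode (related reading)
- Self-censoring (like "f*ck")
- Safe phrase list for known bad actors
- Blocks invalid Unicode characters
- Battle-tested in Mk48.io
Resistant to false positives
- One word (like "assassin")
- Two words (like "push it")
Flexible
- Censor and/or analyze
- Input &str or Iterator<Item = char>
- Can track per-user session state with the context feature
- Can add words with the customize feature
- Accurately reports the width of Unicode via the width feature
- Plenty of options
Performant
- O(n) analysis and censoring
- No regex (uses a custom trie)
- 3 MB/s in release mode
- 100 KB/s in debug mode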
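
To illustrate the "no regex, custom trie" point above, here is a minimal sketch of trie-based matching in general. It is not rustrict's actual implementation: `Trie`, `find_matches`, and `censor` are hypothetical names, and this toy version handles only ASCII case folding, with none of the evasion or false-positive resistance listed above. For a fixed word list, the scan does work proportional to the input length times the longest stored word.

```rust
use std::collections::HashMap;

/// One trie node: children keyed by character, plus an end-of-word flag.
#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    is_word: bool,
}

/// A toy trie of words (hypothetical; not rustrict's internal type).
#[derive(Default)]
pub struct Trie {
    root: Node,
}

impl Trie {
    /// Inserts a word, ASCII-lowercased, one character per trie level.
    pub fn insert(&mut self, word: &str) {
        let mut node = &mut self.root;
        for c in word.chars().map(|c| c.to_ascii_lowercase()) {
            node = node.children.entry(c).or_default();
        }
        node.is_word = true;
    }

    /// Returns (start, end) character-index ranges of every stored word in `text`.
    pub fn find_matches(&self, text: &str) -> Vec<(usize, usize)> {
        let chars: Vec<char> = text.chars().map(|c| c.to_ascii_lowercase()).collect();
        let mut matches = Vec::new();
        for start in 0..chars.len() {
            // Walk the trie from this start position until no child matches.
            let mut node = &self.root;
            for (offset, c) in chars[start..].iter().enumerate() {
                match node.children.get(c) {
                    Some(next) => {
                        node = next;
                        if node.is_word {
                            matches.push((start, start + offset + 1));
                        }
                    }
                    None => break,
                }
            }
        }
        matches
    }
}

/// Replaces every matched word, except its first character, with '*'.
pub fn censor(trie: &Trie, text: &str) -> String {
    let mut out: Vec<char> = text.chars().collect();
    for (start, end) in trie.find_matches(text) {
        for slot in &mut out[start + 1..end] {
            *slot = '*';
        }
    }
    out.into_iter().collect()
}
```

Because this sketch matches substrings anywhere, a stored word would also match inside a longer innocent word, which is exactly the kind of false positive the crate is designed to resist.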

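Limitations

Mostly English/emoji only
Censoring removes most diacritics (accents)
Does not detect right-to-left profanity while analyzing, so...
Censoring forces Unicode to be left-to-right
Doesn't understand context
Not resistant to false positives affecting profanities added at runtime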

Usage

Strings (`&str`)

use rustrict::CensorStr;

let censored: String = "hello crap".censor();
let inappropriate: bool = "f u c k".is_inappropriate();

assert_eq!(censored, "hello c***");
assert!(inappropriate);

Iterators (`Iterator<Item = char>`)

use rustrict::CensorIter;

let censored: String = "hello crap".chars().censor().collect();

assert_eq!(censored, "hello c***");

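Advanced

By constructing a Censor, you can avoid scanning the text multiple times to get a censored String and/or answer multiple is queries. This also opens up more customization options (the defaults are shown below).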

use rustrict::{Censor, Type};

let (censored, analysis) = Censor::from_str("123 Crap")
    .with_censor_threshold(Type::INAPPROPRIATE)
    .with_censor_first_character_threshold(Type::OFFENSIVE & Type::SEVERE)
    .with_ignore_false_positives(false)
    .with_ignore_self_censoring(false)
    .with_censor_replacement('*')
    .censor_and_analyze();

assert_eq!(censored, "123 C***");
assert!(analysis.is(Type::INAPPROPRIATE));
assert!(analysis.isnt(Type::PROFANE & Type::SEVERE | Type::SEXUAL));
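
The threshold expressions above combine a category with a severity. The following sketch shows how such flags can compose, under the assumption that each category owns one bit per severity level and each severity constant spans all categories. The constants and the `is`/`isnt` helpers here are hypothetical stand-ins on plain `u32` values, not rustrict's `Type`. It also shows that in Rust `&` binds tighter than `|`, so `PROFANE & SEVERE | SEXUAL` reads as `(PROFANE & SEVERE) | SEXUAL`.

```rust
// Hypothetical bit layout: three severity bits (mild, moderate, severe) per category.
pub const PROFANE: u32 = 0b000_111;
pub const SEXUAL: u32 = 0b111_000;
pub const MILD: u32 = 0b001_001;
pub const MODERATE: u32 = 0b010_010;
pub const SEVERE: u32 = 0b100_100;

/// True if the detected flags overlap the threshold in at least one bit.
pub fn is(detected: u32, threshold: u32) -> bool {
    (detected & threshold) != 0
}

/// True if the detected flags do not overlap the threshold at all.
pub fn isnt(detected: u32, threshold: u32) -> bool {
    !is(detected, threshold)
}
```

Under this assumed layout, a mildly profane detection (`PROFANE & MILD`) satisfies `is(.., PROFANE)` but not `is(.., PROFANE & SEVERE | SEXUAL)`, which mirrors the assertions in the example above.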

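If you cannot afford to let anything slip through, or have reason to believe a particular user is trying to evade the filter, you can check whether their input matches a short list of safe strings: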

use rustrict::{CensorStr, Type};

// Figure out if a user is trying to evade the filter.
assert!("pron".is(Type::EVASIVE));
assert!("porn".isnt(Type::EVASIVE));

// Only let safe messages through.
assert!("Hello there!".is(Type::SAFE));
assert!("nice work.".is(Type::SAFE));
assert!("yes".is(Type::SAFE));
assert!("NVM".is(Type::SAFE));
assert!("gtg".is(Type::SAFE));
assert!("not a common phrase".isnt(Type::SAFE));
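
As a rough sketch of the allowlist idea behind a safe-phrase check, the code below normalizes a message and looks it up in a fixed set. `SafeList`, `normalize`, and the normalization rules (trim whitespace and edge punctuation, then lowercase) are assumptions made for illustration. They are not rustrict's actual list or matching logic.

```rust
use std::collections::HashSet;

/// Normalizes a message for lookup: trims whitespace and leading/trailing
/// ASCII punctuation, then lowercases.
fn normalize(message: &str) -> String {
    message
        .trim()
        .trim_matches(|c: char| c.is_ascii_punctuation())
        .to_lowercase()
}

/// A hypothetical allowlist of known-safe phrases.
pub struct SafeList {
    phrases: HashSet<String>,
}

impl SafeList {
    /// Builds the list, storing each phrase in normalized form.
    pub fn new(phrases: &[&str]) -> Self {
        Self {
            phrases: phrases.iter().map(|p| normalize(p)).collect(),
        }
    }

    /// Only messages that normalize to a listed phrase count as safe.
    pub fn is_safe(&self, message: &str) -> bool {
        self.phrases.contains(&normalize(message))
    }
}
```

An allowlist rejects everything it does not recognize, so it trades expressiveness for certainty: a user restricted to it cannot evade the filter, but also cannot say anything novel.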

If you want to add custom profanities or safe words, enable the customize feature.

#[cfg(feature = "customize")]
{
    use rustrict::{add_word, CensorStr, Type};

    // You must take care not to call these when the crate is being
    // used in any other way (to avoid concurrent mutation).
    unsafe {
        add_word("reallyreallybadword", (Type::PROFANE & Type::SEVERE) | Type::MEAN);
        add_word("mybrandname", Type::SAFE);
    }
    
    assert!("Reallllllyreallllllybaaaadword".is(Type::PROFANE));
    assert!("MyBrandName".is(Type::SAFE));
}

If your use case is chat moderation, and you store data on a per-user basis, you can use rustrict::Context as a reference implementation:

#[cfg(feature = "context")]
{
    use rustrict::{BlockReason, Context};
    use std::time::Duration;
    
    pub struct User {
        context: Context,
    }
    
    let mut bob = User {
        context: Context::default()
    };
    
    // Ok messages go right through.
    assert_eq!(bob.context.process(String::from("hello")), Ok(String::from("hello")));
    
    // Bad words are censored.
    assert_eq!(bob.context.process(String::from("crap")), Ok(String::from("c***")));

    // Can take user reports (After many reports or inappropriate messages,
    // will only let known safe messages through.)
    for _ in 0..5 {
        bob.context.report();
    }
   
    // If many bad words are used or reports are made, the first letter of
    // future bad words starts getting censored too.
    assert_eq!(bob.context.process(String::from("crap")), Ok(String::from("****")));
    
    // Can manually mute.
    bob.context.mute_for(Duration::from_secs(2));
    assert!(matches!(bob.context.process(String::from("anything")), Err(BlockReason::Muted(_))));
}
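
For readers who want to see the shape of per-user state without the crate, here is a minimal sketch of the bookkeeping that the example above exercises: a report counter that escalates censoring and a timed mute. `UserState`, `visible_prefix`, and the threshold of 5 reports are hypothetical choices for illustration and do not reflect how rustrict's `Context` is implemented. The current time is passed in explicitly so the behavior is deterministic.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-user moderation state (not rustrict's `Context`).
#[derive(Default)]
pub struct UserState {
    reports: u32,
    muted_until: Option<Instant>,
}

impl UserState {
    /// Records one report from another user.
    pub fn report(&mut self) {
        self.reports += 1;
    }

    /// Mutes the user for `duration`, measured from `now`.
    pub fn mute_for(&mut self, now: Instant, duration: Duration) {
        self.muted_until = Some(now + duration);
    }

    /// Returns the remaining mute time as an error, or otherwise how many
    /// leading characters of a bad word stay visible: 1 normally, 0 once the
    /// user has at least 5 reports (an arbitrary threshold for this sketch).
    pub fn visible_prefix(&self, now: Instant) -> Result<usize, Duration> {
        match self.muted_until {
            Some(until) if until > now => Err(until - now),
            _ => Ok(if self.reports >= 5 { 0 } else { 1 }),
        }
    }
}
```

A caller would consult `visible_prefix` before censoring each message, rejecting the message while the user is muted and otherwise masking everything past the returned prefix length.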

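Comparison

To compare filters, the first 100,000 items of this list are used as a dataset. Positive accuracy is the percentage of profanity detected as profanity. Negative accuracy is the percentage of clean text detected as clean.

Crate	Accuracy	Positive Accuracy	Negative Accuracy	Time
rustrict	79.74%	94.00%	76.19%	9s
censor	76.16%	72.76%	77.01%	23s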
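
The three accuracy figures follow directly from the definitions above. As a small sketch, they can be computed from the four outcome counts of a labeled run. The `Confusion` struct and its field names are introduced here for illustration, and no counts from the actual benchmark are implied.

```rust
/// Outcome counts from comparing a filter's verdicts against labeled data.
pub struct Confusion {
    pub true_positives: u64,  // profanity detected as profanity
    pub false_negatives: u64, // profanity detected as clean
    pub true_negatives: u64,  // clean text detected as clean
    pub false_positives: u64, // clean text detected as profanity
}

impl Confusion {
    /// Fraction of profane items that were detected as profane.
    pub fn positive_accuracy(&self) -> f64 {
        self.true_positives as f64 / (self.true_positives + self.false_negatives) as f64
    }

    /// Fraction of clean items that were detected as clean.
    pub fn negative_accuracy(&self) -> f64 {
        self.true_negatives as f64 / (self.true_negatives + self.false_positives) as f64
    }

    /// Fraction of all items that were classified correctly.
    pub fn accuracy(&self) -> f64 {
        let correct = self.true_positives + self.true_negatives;
        let total = correct + self.false_negatives + self.false_positives;
        correct as f64 / total as f64
    }
}
```

Overall accuracy is weighted by how many clean and profane items the dataset contains, so it need not sit midway between the positive and negative figures. Each ratio is NaN if its denominator is zero.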

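Development

If you make an adjustment that would affect false positives, such as adding profanity, you will need to run false_positive_finder:

Run make downloads to download the required word lists and dictionaries
Run make false_positives to automatically find false positives

If you modify replacements_extra.csv, run make replacements to rebuild replacements.csv.

Finally, run make test for a full test or make test_debug for a fast test.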

License

Licensed under either of the following, at your option:

Apache License, Version 2.0 (LICENSE-APACHE or https://apache.ac.cn/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or https://open-source.org.cn/licenses/MIT)

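Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.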

Dependencies

~1-11MB
~176K SLoC
