#phonetic #algorithm #Soundex #Apache #port #Metaphone #Caverphone

rphonetic

Rust port of the Apache commons-codec phonetic algorithms

14 stable releases

new 2.2.1 Aug 11, 2024
2.2.0 Apr 5, 2024
2.1.5 Feb 28, 2024
2.1.2 Nov 26, 2023
1.2.0 Jul 31, 2022

#70 in Text processing


426 downloads per month
Used in tantivy-analysis-contrib

Apache-2.0

450KB
10K SLoC


Rust phonetic

This is a Rust port of the phonetic algorithms from Apache commons-codec v1.15.

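Algorithms

Currently, the following algorithms are available: Beider-Morse, Caverphone 1 & 2, Cologne, Daitch-Mokotoff, Match Rating Approach, Metaphone, Double Metaphone, Phonex, Nysiis, Soundex, and Refined Soundex.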

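Note that most of these algorithms were designed for the Latin alphabet, and usually for a specific use case (e.g. English names or English dictionary words).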
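
To make the Latin-alphabet caveat concrete, here is a minimal sketch of a simplified American Soundex written with the standard library only. It is not rphonetic's implementation; the names `soundex_digit` and `simple_soundex` are hypothetical and exist only for this illustration. Because the digit table covers only ASCII letters, input in another script produces no usable code.

```rust
/// Map an uppercase ASCII consonant to its Soundex digit.
/// Vowels, H, W and Y have no digit and return `None`.
fn soundex_digit(c: char) -> Option<char> {
    match c {
        'B' | 'F' | 'P' | 'V' => Some('1'),
        'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => Some('2'),
        'D' | 'T' => Some('3'),
        'L' => Some('4'),
        'M' | 'N' => Some('5'),
        'R' => Some('6'),
        _ => None,
    }
}

/// Simplified American Soundex (illustrative sketch, not rphonetic's code).
fn simple_soundex(input: &str) -> String {
    // Keep ASCII letters only: digits, accented letters and non-Latin
    // scripts are silently dropped.
    let letters: Vec<char> = input
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_uppercase())
        .collect();

    let first = match letters.first() {
        Some(&c) => c,
        None => return String::new(), // nothing encodable was left
    };

    let mut code = String::from(first);
    let mut last = soundex_digit(first);

    for &c in &letters[1..] {
        if code.len() == 4 {
            break;
        }
        // H and W do not separate two consonants that share a digit.
        if c == 'H' || c == 'W' {
            continue;
        }
        let digit = soundex_digit(c);
        if let Some(d) = digit {
            if digit != last {
                code.push(d);
            }
        }
        // A vowel resets `last`, so a repeated digit after it is kept.
        last = digit;
    }

    // Pad with zeros up to one letter plus three digits.
    while code.len() < 4 {
        code.push('0');
    }
    code
}

fn main() {
    assert_eq!(simple_soundex("jumped"), "J513");
    // Similar-sounding English names collapse to the same code.
    assert_eq!(simple_soundex("Robert"), simple_soundex("Rupert"));
    // A Cyrillic spelling contains no ASCII letters, so the code is empty.
    assert_eq!(simple_soundex("Мюллер"), "");
}
```

The empty result for the Cyrillic input is the practical consequence of the note above: such text needs transliteration, or an algorithm built for that language, before it can be encoded.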

Examples

Beider-Morse

fn main() -> Result<(), rphonetic::PhoneticError> {
    use std::path::PathBuf;
    use rphonetic::{BeiderMorseBuilder, ConfigFiles, Encoder};

    let config_files = ConfigFiles::new(&PathBuf::from("./test_assets/cc-rules/"))?;
    let builder = BeiderMorseBuilder::new(&config_files);
    let beider_morse = builder.build();

    assert_eq!(beider_morse.encode("Van Helsing"),"(Ylznk|ilzn|ilznk|xilzn|xilznk)-(banilznk|bonilznk|fYnYlznk|fYnilznk|fanYlznk|fanilznk|fonYlznk|fonilznk|vYnYlznk|vYnilznk|vanYlznk|vaniilznk|vanilzn|vanilznk|vonYlznk|voniilznk|vonilzn|vonilznk)");
    Ok(())
}

Caverphone 1 & 2

fn main() {
    use rphonetic::{Caverphone1, Encoder};

    let caverphone = Caverphone1;
    assert_eq!(caverphone.encode("Thompson"), "TMPSN1");
}

fn main() {
    use rphonetic::{Caverphone2, Encoder};

    let caverphone = Caverphone2;
    assert_eq!(caverphone.encode("Thompson"), "TMPSN11111");
}

Cologne

fn main() {
    use rphonetic::{Cologne, Encoder};

    let cologne = Cologne;
    assert_eq!(cologne.encode("m\u{00FC}ller"), "657");
}

Daitch-Mokotoff

fn main() -> Result<(), rphonetic::PhoneticError> {
    use rphonetic::{DaitchMokotoffSoundex, DaitchMokotoffSoundexBuilder, Encoder};

    const COMMONS_CODEC_RULES: &str = include_str!("./rules/dmrules.txt");

    let encoder = DaitchMokotoffSoundexBuilder::with_rules(COMMONS_CODEC_RULES).build()?;
    assert_eq!(encoder.soundex("Rosochowaciec"), "944744|944745|944754|944755|945744|945745|945754|945755");
    Ok(())
}
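
The result above is a single string holding several alternative codes separated by `|`. One possible way to compare two such results is to treat them as matching when they share at least one code. The sketch below uses only the standard library; `shares_code` is a hypothetical helper, not part of rphonetic, and `100000` is a made-up code used purely as a non-matching value.

```rust
use std::collections::HashSet;

/// Return true when two `|`-separated code lists have a code in common.
/// Hypothetical helper for illustration; not an rphonetic API.
fn shares_code(left: &str, right: &str) -> bool {
    // Collect the non-empty codes of the left-hand result.
    let left_codes: HashSet<&str> = left
        .split('|')
        .filter(|code| !code.is_empty())
        .collect();

    // Look for any right-hand code that is also on the left.
    right.split('|').any(|code| left_codes.contains(code))
}

fn main() {
    // Encoding of "Rosochowaciec" taken from the example above.
    let rosochowaciec = "944744|944745|944754|944755|945744|945745|945754|945755";

    // One shared alternative is enough to count as a match.
    assert!(shares_code(rosochowaciec, "945744|100000"));
    // No shared alternative, so no match.
    assert!(!shares_code(rosochowaciec, "100000"));
    // An empty result never matches.
    assert!(!shares_code(rosochowaciec, ""));
}
```

Matching on any shared alternative favors recall over precision; a stricter rule, such as requiring the first code of each list to agree, may suit some data sets better.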

Match Rating Approach

fn main() {
    use rphonetic::{Encoder, MatchRatingApproach};
    
    let match_rating = MatchRatingApproach;
    assert_eq!(match_rating.encode("Smith"), "SMTH");
}

Metaphone

fn main() {
    use rphonetic::{Encoder, Metaphone};
    
    let metaphone = Metaphone::default();
    assert_eq!(metaphone.encode("Joanne"), "JN");
}

Metaphone (Double)

fn main() {
    use rphonetic::{DoubleMetaphone, Encoder};

    let double_metaphone = DoubleMetaphone::default();
    assert_eq!(double_metaphone.encode("jumped"), "JMPT");
    assert_eq!(double_metaphone.encode_alternate("jumped"), "AMPT");
}

Phonex

fn main() {
    use rphonetic::{Phonex, Encoder};

    // Default configuration
    let phonex = Phonex::default();
    assert_eq!(phonex.encode("William"),"W450");
}

Nysiis

fn main() {
    use rphonetic::{Nysiis, Encoder};

    // Strict
    let nysiis = Nysiis::default();
    assert_eq!(nysiis.encode("WESTERLUND"),"WASTAR");

    // Not strict
    let nysiis = Nysiis::new(false);
    assert_eq!(nysiis.encode("WESTERLUND"),"WASTARLAD");
}

Soundex

fn main() {
    use rphonetic::{Encoder, Soundex};

    let soundex = Soundex::default();
    assert_eq!(soundex.encode("jumped"), "J513");
}

Soundex (Refined)

fn main() {
    use rphonetic::{Encoder, RefinedSoundex};
    
    let refined_soundex = RefinedSoundex::default();
    assert_eq!(refined_soundex.encode("jumped"), "J408106");
}

Benchmarking

Benchmarks use criterion.

They were run on a machine with an Intel® Core™ i7-4720HQ and 16GB of RAM.

To run the benchmarks against the main baseline:

cargo bench --bench benchmark -- --baseline main

To replace the main baseline:

cargo bench --bench benchmark -- --save-baseline main

Do not run Criterion benchmarks in CI.

Dependencies

~3.5–5.5MB
~99K SLoC