#unicode #unic #grapheme #unicode-text #word #boundary #character

unic-segment

UNIC: Unicode Text Segmentation Algorithms

3 breaking releases

0.9.0 Mar 3, 2019
0.8.0 Jan 2, 2019
0.7.0 Feb 7, 2018

#1376 in Text processing


387,416 downloads per month
Used in 745 crates (8 directly)

MIT/Apache

110KB
1.5K SLoC

UNIC: Unicode Text Segmentation Algorithms


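This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting boundaries of text elements such as user-perceived characters (a.k.a. Grapheme Clusters), words, and sentences (the last one is not implemented yet).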
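
To illustrate why grapheme cluster boundaries matter, here is a minimal sketch that uses only the Rust standard library, not this crate. It shows that a single user-perceived character can consist of several code points and several bytes, so neither `str::chars` nor byte length matches what a reader would count as one character. The helper name `code_point_count` is made up for this illustration.

```rust
/// Counts Unicode scalar values (code points), which is what `str::chars` iterates over.
/// Hypothetical helper for illustration only; it is not part of unic-segment.
fn code_point_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
    let e_acute = "e\u{301}";
    // Two regional indicator symbols (U+1F1FA, U+1F1F3): displayed as one flag.
    let flag = "\u{1F1FA}\u{1F1F3}";

    for s in [e_acute, flag] {
        println!(
            "{:?}: {} code points, {} bytes",
            s,
            code_point_count(s),
            s.len() // `len` counts UTF-8 bytes
        );
    }
}
```

Each of the two strings above is what UAX #29 treats as one grapheme cluster, even though both contain two code points.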

Notes

Initial code for this component is based on unicode-segmentation.


lib.rs:

UNIC: Unicode Text Segmentation Algorithms

A component of unic: Unicode and Internationalization Crates for Rust.

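This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting boundaries of text elements such as user-perceived characters (a.k.a. Grapheme Clusters), words, and sentences (the last one is not implemented yet).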

Examples

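// Iterator types used in the examples below.
use unic_segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};

assert_eq!(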
    Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
    &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
);

assert_eq!(
    Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
    &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
);

assert_eq!(
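    // Escapes are used so the combining marks survive copy and paste; indices are byte offsets.
    GraphemeIndices::new("a\u{310}e\u{301}o\u{308}\u{332}\r\n").collect::<Vec<(usize, &str)>>(),
    &[(0, "a\u{310}"), (3, "e\u{301}"), (6, "o\u{308}\u{332}"), (11, "\r\n")]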
);

fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}

assert_eq!(
    Words::new(
        "The quick (\"brown\") fox can't jump 32.3 feet, right?",
        has_alphanumeric,
    ).collect::<Vec<&str>>(),
    &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
);

assert_eq!(
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);

assert_eq!(
    WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
    &[
        (0, "Brr"),
        (3, ","),
        (4, " "),
        (5, "it's"),
        (9, " "),
        (10, "29.3"),
        (14, "°"),
        (16, "F"),
        (17, "!")
    ]
);
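
The indices in the examples above are UTF-8 byte offsets, not character counts, which is why "°" (two bytes) moves "F" to offset 16. The following standard-library-only sketch reproduces those offsets from a list of segments. It also mimics the relationship between the `WordBounds` and `Words` outputs by filtering segments with the same `has_alphanumeric` predicate. The helper `with_byte_offsets` is hypothetical, and treating `Words` as a filter over word-boundary segments is an assumption drawn from the examples, not a statement about the crate's internals.

```rust
/// Pairs each segment with its starting byte offset, assuming the segments
/// are consecutive slices that together make up the whole original string.
/// Hypothetical helper for illustration; not part of unic-segment.
fn with_byte_offsets<'a>(segments: &[&'a str]) -> Vec<(usize, &'a str)> {
    let mut offset = 0;
    let mut out = Vec::with_capacity(segments.len());
    for &seg in segments {
        out.push((offset, seg));
        offset += seg.len(); // `len` is in bytes, not characters
    }
    out
}

/// Same predicate as in the `Words` example.
fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}

fn main() {
    // Grapheme clusters taken from the `GraphemeIndices` example.
    let graphemes = ["a\u{310}", "e\u{301}", "o\u{308}\u{332}", "\r\n"];
    println!("{:?}", with_byte_offsets(&graphemes));

    // Segments taken from the `WordBounds` example; keeping only those that
    // contain an alphanumeric character leaves the word-like pieces.
    let bounds = ["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"];
    let words: Vec<&str> = bounds.iter().copied().filter(has_alphanumeric).collect();
    println!("{:?}", words);
}
```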

Dependencies