kl-hyphenate: Rust text processing library // Lib.rs

3 releases

Uses old Rust 2015

0.7.3	May 20, 2020
0.7.2	Aug 9, 2019
0.7.1	Aug 9, 2019

#964 in Text processing

31 downloads per month

Apache-2.0/MIT

64KB
982 lines (excluding comments)

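Introduction

Two hyphenation strategies are available:

Standard Knuth–Liang hyphenation, with dictionaries built from the TeX UTF-8 patterns.
Extended ("non-standard") hyphenation, based on László Németh's "Automatic non-standard hyphenation in OpenOffice.org", with dictionaries built from Libre/OpenOffice patterns.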
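
To make the standard strategy concrete, here is a minimal, std-only sketch of Knuth–Liang pattern matching. It is not the crate's implementation or dictionary format: the names `parse_pattern`, `breaks`, and `TOY_PATTERNS` are hypothetical, the pattern list is a small illustrative set rather than a full TeX pattern file, and the sketch assumes ASCII input.

```rust
use std::collections::HashMap;

/// A handful of illustrative English patterns (an assumption for this
/// sketch; a real dictionary holds thousands of patterns plus exceptions).
const TOY_PATTERNS: [&str; 9] = [
    "hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n",
];

/// Split a pattern such as "hy3ph" into its letters ("hyph") and the score
/// of every gap around them (always one more gap than there are letters).
fn parse_pattern(pattern: &str) -> (String, Vec<u8>) {
    let mut letters = String::new();
    let mut scores = vec![0u8];
    for c in pattern.chars() {
        if let Some(d) = c.to_digit(10) {
            // A digit scores the gap just before the next letter.
            *scores.last_mut().unwrap() = d as u8;
        } else {
            letters.push(c);
            scores.push(0);
        }
    }
    (letters, scores)
}

/// Return the byte indices of permitted breaks in an ASCII `word`.
/// `left_min` and `right_min` are the minimum number of letters that must
/// remain before the first break and after the last one.
fn breaks(word: &str, patterns: &[&str], left_min: usize, right_min: usize) -> Vec<usize> {
    assert!(word.is_ascii(), "this sketch only handles ASCII words");
    let table: HashMap<String, Vec<u8>> =
        patterns.iter().map(|p| parse_pattern(p)).collect();

    // Dots mark the word boundaries so patterns can anchor to them.
    let dotted = format!(".{}.", word.to_lowercase());
    // scores[g] is the score of the gap just before byte g of `dotted`.
    let mut scores = vec![0u8; dotted.len() + 1];

    // Try every substring against the table, keeping the highest score per gap.
    for start in 0..dotted.len() {
        for end in start + 1..=dotted.len() {
            if let Some(pattern_scores) = table.get(&dotted[start..end]) {
                for (offset, &s) in pattern_scores.iter().enumerate() {
                    let slot = &mut scores[start + offset];
                    *slot = (*slot).max(s);
                }
            }
        }
    }

    // Odd scores permit a break; even scores forbid one.
    // Word index i corresponds to the gap before byte i + 1 of `dotted`.
    (left_min..=word.len().saturating_sub(right_min))
        .filter(|&i| scores[i + 1] % 2 == 1)
        .collect()
}
```

With this toy set, `breaks("hyphenation", &TOY_PATTERNS, 2, 3)` yields the byte indices 2 and 6 ("hy-phen-ation"). A full dictionary can permit additional breaks, so results from the crate may differ.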

Usage

Quickstart

The dictionaries can be built with:

cargo build -vv --features build_dictionaries

The generated dictionaries are saved in the dictionaries directory.

You can then load and use a dictionary:

use kl_hyphenate::{Standard, Hyphenator, Language, Load};

let path_to_dict = "dictionaries/en-us.standard.bincode";
let en_us = Standard::from_path(Language::EnglishUS, path_to_dict)?;

// Identify valid breaks in the given word.
let hyphenated = en_us.hyphenate("hyphenation");

// Word breaks are represented as byte indices into the string.
let break_indices = &hyphenated.breaks;
assert_eq!(break_indices, &[2, 6, 7]);

// The segments of a hyphenated word can be iterated over.
let segments = hyphenated.into_iter().segments();
let collected : Vec<_> = segments.collect();
assert_eq!(collected, vec!["hy", "phen", "a", "tion"]);

// `hyphenate()` is case-insensitive.
let uppercase : Vec<_> = en_us.hyphenate("CAPITAL").into_iter().collect();
assert_eq!(uppercase, vec!["CAP-", "I-", "TAL"]);
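
Since breaks are plain byte indices, the relationship between `breaks` and the segments can be illustrated without the crate. The following is a small sketch, not the crate's iterator: `split_at_breaks` is a hypothetical helper, and it assumes the indices are ascending and fall on character boundaries.

```rust
/// Slice `word` at each byte index in `breaks` (assumed ascending and on
/// character boundaries) and return the resulting segments.
fn split_at_breaks<'a>(word: &'a str, breaks: &[usize]) -> Vec<&'a str> {
    let mut segments = Vec::with_capacity(breaks.len() + 1);
    let mut start = 0;
    for &b in breaks {
        segments.push(&word[start..b]);
        start = b;
    }
    // Whatever follows the last break is the final segment.
    segments.push(&word[start..]);
    segments
}
```

For the word above, `split_at_breaks("hyphenation", &[2, 6, 7])` gives "hy", "phen", "a", "tion"; joining the segments with "-" or with a soft hyphen (U+00AD) produces a displayable form.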

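Segmentation

Dictionaries can be combined with text segmentation to hyphenate words within a text run. The following short example uses the unicode-segmentation crate for untailored Unicode segmentation.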

use unicode_segmentation::UnicodeSegmentation;

let hyphenate_text = |text : &str| -> String {
    // Split the text on word boundaries—
    text.split_word_bounds()
        // —and hyphenate each word individually.
        .flat_map(|word| en_us.hyphenate(word).into_iter())
        .collect()
};

let excerpt = "I know noble accents / And lucid, inescapable rhythms; […]";
assert_eq!("I know no-ble ac-cents / And lu-cid, in-escapable rhythms; […]"
          , hyphenate_text(excerpt));
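
If pulling in a segmentation crate is not an option, a cruder std-only variant can split the text into alphabetic and non-alphabetic runs. The sketch below is an assumption-laden alternative, not part of kl-hyphenate: this standalone `hyphenate_text` and its `find_breaks` callback are hypothetical names, and the splitting is much simpler than Unicode word-boundary rules.

```rust
/// Insert `mark` at every break that `find_breaks` reports for each
/// alphabetic run of `text`; everything else is copied through unchanged.
/// `find_breaks` must return ascending byte indices on character boundaries.
fn hyphenate_text<F>(text: &str, mark: &str, find_breaks: F) -> String
where
    F: Fn(&str) -> Vec<usize>,
{
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while !rest.is_empty() {
        // A run ends where the "is alphabetic" property flips.
        let first_is_alpha = rest.chars().next().map_or(false, char::is_alphabetic);
        let end = rest
            .find(|c: char| c.is_alphabetic() != first_is_alpha)
            .unwrap_or(rest.len());
        let (run, tail) = rest.split_at(end);
        if first_is_alpha {
            let mut start = 0;
            for b in find_breaks(run) {
                out.push_str(&run[start..b]);
                out.push_str(mark);
                start = b;
            }
            out.push_str(&run[start..]);
        } else {
            out.push_str(run);
        }
        rest = tail;
    }
    out
}
```

In practice the callback would wrap a loaded dictionary, for example by returning the `breaks` of the hyphenated word; a stub closure is enough to exercise the splitting logic on its own.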

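Normalization

Hyphenation patterns for languages affected by normalization occasionally cover multiple forms, at the discretion of their authors, but most often they do not. If you require kl-hyphenate to operate strictly on strings in a known normalization form, as described by Unicode Standard Annex #15 and provided by the unicode-normalization crate, you can specify it in your Cargo manifest, like so: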

[dependencies.kl-hyphenate]
version = "…"
features = ["nfc"]

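The features field may contain exactly one of the following normalization options:

"nfc", for canonical composition;
"nfd", for canonical decomposition;
"nfkc", for compatibility composition;
"nfkd", for compatibility decomposition.

If normalization is enabled, it is recommended to build kl-hyphenate in release mode, since the bundled hyphenation patterns need to be reprocessed into dictionaries.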
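
To see why the normalization form matters, note that the same visible word can be encoded in more than one way, and patterns written against one encoding will not match the other. The std-only sketch below only illustrates the difference; the names `profile`, `COMPOSED`, and `DECOMPOSED` are hypothetical, and actual normalization would be done by the unicode-normalization crate.

```rust
/// "café" with a precomposed é (U+00E9), as NFC would produce.
const COMPOSED: &str = "caf\u{e9}";
/// "café" as "e" followed by a combining acute accent (U+0301), as NFD would produce.
const DECOMPOSED: &str = "cafe\u{301}";

/// Return (number of Unicode scalar values, number of UTF-8 bytes).
fn profile(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}
```

Here `profile(COMPOSED)` is `(4, 5)` while `profile(DECOMPOSED)` is `(5, 6)`, and the two strings compare unequal. Byte indices of breaks therefore differ between forms, which is why the dictionary and the input text should agree on one form.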

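License

Dual-licensed under the terms of either:

Apache License, Version 2.0
MIT license

patterns/hyph-hu.ext.txt (extended Hungarian hyphenation patterns) is licensed under:

MPL 1.1 (see patterns/hyph-hu.ext.lic.txt)

patterns/hyph-ca.ext.txt (extended Catalan hyphenation patterns) is licensed under:

LGPL v.3.0 or later (see patterns/hyph-ca.ext.lic.txt)

Dependencies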

~0.8–1.5MB
~34K SLoC