4 个版本 (2 个破坏性更改)

0.3.2	2024 年 8 月 12 日
0.2.0	2024 年 7 月 20 日
0.1.1	2024 年 7 月 20 日
0.1.0	2024 年 7 月 13 日

在文本处理中排名 #615

每月下载 243 次
在 hebrew_unicode_utils 中使用

MIT/Apache

39KB
354 行

希伯来语_Unicode_脚本

描述

此包 (hebrew_unicode_script) 是一个 Rust 库，旨在简化与希伯来脚本及其相关区块相关的 Unicode 字符的识别和验证。
此库提供了一组函数，允许开发者轻松确定特定字符是否属于 希伯来 unicode 脚本、是否位于 希伯来 unicode 区块 或与 字母呈现形式 Unicode 区块匹配。
对于每个适用的 unicode 区块，还有一些额外的函数，允许对每个区块内的 字符类型 进行更复杂的检查。例如，元音、重音、标记等。
此库中的每个函数都返回一个布尔值，使得将这些检查集成到现有或新应用程序中变得简单。

For unicode script (top level):
  - is_hbr_script(c: char) -> bool
  - is_hbr_script_point(c: char) -> bool
  - is_hbr_script_consonant(c: char) -> bool
  - is_hbr_script_ligature_yiddisch(c: char) -> bool

For unicode block: 'Hebrew': 
  - is_hbr_accent(c: char) -> bool
  - is_hbr_mark(c: char) -> bool
  - is_hbr_point(c: char) -> bool
    - is_hbr_point_vowel(c) -> bool
    - is_hbr_point_semi_vowel(c) -> bool
    - is_hbr_point_reading_sign(c) -> bool
  - is_hbr_punctuation(c: char) -> bool
  - is_hbr_consonant(c: char) -> bool
    - is_hbr_consonant_normal(c: char) -> bool
    - is_hbr_consonant_final(c: char) -> bool
  - is_hbr_yod_triangle(c: char) -> bool
  - is_hbr_ligature_yiddish(c: char) -> bool

For unicode block: 'Alphabetic Presentation Form':
  - is_apf_block(c: char) -> bool
  - is_apf_point_reading_sign(c: char) -> bool
  - is_apf_consonant(c: char) -> bool
    - is_apf_consonant_alternative(c: char) -> bool
    - is_apf_consonant_wide(c: char) -> bool
    - is_apf_consonant_with_vowel(c: char) -> bool
  - is_apf_ligature_yiddisch(c: char) -> bool
  - is_apf_ligature(c: char) -> bool

备注

希伯来点可以细分为
- 元音（代码点：U+05B4 .. U+05BB, U+05C7）
- 半元音（代码点：U+05B0 .. U+05B3）
- 阅读符号（代码点：U+05BC .. U+05C2） ^judeo-spanish [^judeo-spanish]: 我不知道 HEBREW POINT JUDEO-SPANISH VARIKA 是否是阅读符号！
希伯来字母可以细分为
- 正常辅音（代码点：U+05D0 .. U+05D9, U+05DB, U+05DC, U+05DE, U+05E0 .. U+05E2, U+05E4, U+05E6 .. U+05EA）
- 结尾辅音（代码点：U+05DA, U+05DD, U+05DF, U+05E3 和 U+05E5）
- 宽辅音（代码点：U+FB21 .. U+FB28）
- 带元音的辅音（代码点：U+FB2A .. U+FB36, U+FB38 .. U+FB3C, U+FB3E, U+FB40, U+FB41, U+FB43, U+FB44, U+FB46 .. U+FB4E）
- 替代辅音（代码点：U+FB20, U+FB29）

^ TOC

安装

在您的项目目录中运行以下 Cargo 命令

cargoadd hebrew_unicode_script

或者在 dependencies 下添加以下行到您的 Cargo.toml 文件中

hebrew_unicode_script= "0.2.0

查看 crates.io 获取最新版本！

示例

use hebrew_unicode_script::is_hbr_block;
if is_hbr_block('מ') {
	println!("The character you entered is part of the 'unicode code block Hebrew'");
}

use hebrew_unicode_script::{is_hbr_consonant_final, is_hbr_consonant};

let test_str = "ךםןףץ";
for c in test_str.chars() {
    assert!(is_hbr_consonant_final(c));
    assert!(is_hbr_consonant(c));
}

use hebrew_unicode_script::{is_hbr_accent,is_hbr_mark, is_hbr_point, is_hbr_punctuation};
use hebrew_unicode_script::{is_hbr_consonant_final,is_hbr_yod_triangle,is_hbr_ligature_yiddish};

fn main() {
   // define a strings of characters
   let string_of_chars = "יָ֭דַעְתָּ שִׁבְתִּ֣י abcdefg וְקוּמִ֑י";
   // get a structures that indicates if a type is present or not (bool)
   let chartypes = get_character_types(string_of_chars);
   // print the results
   println!("The following letter types are found in: {}", string_of_chars);
   println!("{:?}",chartypes);
}

#[derive(Debug, Default)]
pub struct HebrewCharacterTypes {
    accent: bool,
    mark: bool,
    point: bool,
    punctuation: bool,
    letter: bool,
    letter_normal: bool,
    letter_final: bool,
    yod_triangle: bool,
    ligature_yiddish: bool,
    whitespace: bool,
    non_hebrew: bool,
}

impl HebrewCharacterTypes {
    fn new() -> Self {
        Default::default()
    }
}

pub fn get_character_types(s: &str) -> HebrewCharacterTypes {
    let mut found_character_types = HebrewCharacterTypes::new();
    for c in s.chars() {
        match c {
            c if is_hbr_accent(c) => found_character_types.accent = true,
            c if is_hbr_mark(c) => found_character_types.mark = true,
            c if is_hbr_point(c) => found_character_types.point = true,
            c if is_hbr_punctuation(c) => found_character_types.punctuation = true,
            c if is_hbr_consonant_final(c) => found_character_types.letter_final = true,
            c if is_hbr_yod_triangle(c) => found_character_types.yod_triangle = true,
            c if is_hbr_ligature_yiddish(c) => found_character_types.ligature_yiddish = true,
            c if c.is_whitespace() => found_character_types.whitespace = true,
            _ => found_character_types.non_hebrew = true,
        }
    }
    found_character_types.letter =
        found_character_types.letter_normal | found_character_types.letter_final;
    found_character_types
}

输出结果

The following character types were found:
HebrewCharacterTypes {
    accent: true,
    mark: false,
    point: true,
    punctuation: false,
    letter: true,
    letter_normal: true,
    letter_final: false,
    yod_triangle: false,
    ligature_yiddish: false,
    whitespace: true,
    non_hebrew: true,
}

查看 crate 模块获取更多示例。

^ TOC

参考

Unicode 脚本

希伯来语的 Unicode 字符脚本

Unicode 区块名称

希伯来语
- 查看 https://www.unicode.org/charts/PDF/U0590.pdf
  - 注意： 仅以下代码点范围适用： U+0590 .. U+05FF
  - 也请参阅： https://graphemica.com/blocks/hebrew/
字母呈现形式
- 查看 https://www.unicode.org/charts/PDF/UFB00.pdf
  - 注意： 仅以下代码点范围适用： U+FB1D .. U+FB4F
  - 也请参阅： https://graphemica.com/blocks/alphabetic-presentation-forms

了解更多关于 Unicode、Unicode 字符脚本和 Unicode 代码点块的信息

^ TOC

安全性

所有函数都是用安全的 Rust 编写的。

^ TOC

恐慌

据我所知，没有。

^ TOC

错误

所有函数返回 true 或 false。

^ TOC

代码覆盖率

当前代码覆盖率是 100% [^code coverage] [^code coverage]：crates.io 中显示的代码覆盖率数字（非常）不同！我不知道为什么 :-)（进行中）

Code Coverage

我使用了本地运行代码覆盖率

操作

安装扩展 Coverage Gutters
执行： cargo clean && cargo build && cargo test && mkdir -p target/coverage/html
执行： CARGO_INCREMENTAL=0 RUSTFLAGS='-Cinstrument-coverage' LLVM_PROFILE_FILE='cargo-test-%p-%m.profraw' cargo test
- 结果 -> （新文件）cargo-test-67187-8558864636421498001_0.profraw（在我的系统上）

选项 1：使用 Coverage Gutters

执行： grcov . --binary-path ./target/debug/deps/ -s . -t lcov --branch --ignore-not-existing --ignore '../*' --ignore "/*" -o target/coverage/tests.lcov
结果 -> （新文件）tests.lcov

点击查看按钮（由扩展添加到 VSCodium）结果 -> 每行代码将显示红色/绿色的指示

选项 2：创建网页执行： grcov . --binary-path ./target/debug/deps/ -s . -t html --branch --ignore-not-existing --ignore '../*' --ignore "/*" -o target/coverage/html 结果 -> 新建一个名为 html 的目录在浏览器中打开 html 文件夹中的 index.html 文件，您将获得一份完整的报告。 ^ TOC 许可证根据您的选择，许可协议为 Apache 许可协议第 2 版或 MIT 许可协议。 ^ TOC 贡献除非您明确表示，否则根据 Apache-2.0 许可协议定义，您提交的任何有意包含在此软件包中的贡献，将以上述方式双许可，不附加任何额外条款或条件。 ^ TOC

无运行时依赖