16 releases

0.6.3	Apr 1, 2023
0.6.2	Mar 7, 2023
0.6.1	Oct 27, 2022
0.5.1	Jun 20, 2022
0.2.0	Nov 1, 2021

#258 in Text processing

3,334 downloads per month
Used in 2 crates

MIT/Apache

270KB
6.5K SLoC

Vaporetto

Vaporetto is a fast and lightweight tokenizer based on pointwise prediction.
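
To illustrate the general idea of pointwise prediction, here is a minimal sketch in which every boundary between two adjacent characters is classified independently by a linear score. The `segment` function, the character-pair features, and the weights are all hypothetical and hand-made for this illustration; they are not Vaporetto's model or API, and a real model would use richer features.

```rust
use std::collections::HashMap;

/// Toy pointwise segmenter: each boundary between adjacent characters is
/// scored on its own, without looking at other boundary decisions.
/// The weights are hand-made for illustration; this is NOT Vaporetto's model.
fn segment(text: &str, weights: &HashMap<(char, char), i32>, bias: i32) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut tokens = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        current.push(c);
        // Decide the boundary between chars[i] and chars[i + 1], if there is one.
        if let Some(&next) = chars.get(i + 1) {
            let score = bias + weights.get(&(c, next)).copied().unwrap_or(0);
            if score > 0 {
                // Positive score: end the current token here.
                tokens.push(std::mem::take(&mut current));
            }
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

fn main() {
    // Hypothetical weights: positive means "a boundary is likely here".
    let mut weights = HashMap::new();
    weights.insert(('長', 'は'), 3);
    weights.insert(('は', '火'), 3);
    weights.insert(('社', '長'), -2);
    let tokens = segment("社長は火星", &weights, -1);
    println!("{}", tokens.join(" "));
}
```

Because each boundary is decided independently, this kind of tokenizer needs neither a dictionary lattice nor a search over whole-sentence segmentations.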

Example

use std::fs::File;

use vaporetto::{Model, Predictor, Sentence};

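// Load a model from a file.
let f = File::open("../resources/model.bin")?;
let model = Model::read(f)?;
// The second argument enables tag prediction.
let predictor = Predictor::new(model, true)?;

// Output buffer, reused across sentences.
let mut buf = String::new();

// Sentence object, also reused across inputs.
let mut s = Sentence::default();

s.update_raw("まぁ社長は火星猫だ")?;
// Predict token boundaries, then fill in the tags for each token.
predictor.predict(&mut s);
s.fill_tags();
s.write_tokenized_text(&mut buf);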
assert_eq!(
    "まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ",
    buf,
);

s.update_raw("まぁ良いだろう")?;
predictor.predict(&mut s);
s.fill_tags();
s.write_tokenized_text(&mut buf);
assert_eq!(
    "まぁ/副詞/マー 良い/形容詞/ヨイ だろう/助動詞/ダロー",
    buf,
);
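
The tokenized text in the assertions above has the shape `surface/tag/tag`, with tokens separated by spaces. As a minimal sketch, such a line could be split back into structured tokens as follows. `ParsedToken` and `parse_tokenized` are hypothetical helpers, not part of the vaporetto crate, and the sketch assumes that surfaces and tags contain no literal spaces or slashes; any escaping in the real output is not handled.

```rust
/// One token from a line of the form `surface/tag1/tag2 surface/tag1/tag2 ...`.
#[derive(Debug, PartialEq)]
struct ParsedToken {
    surface: String,
    tags: Vec<String>,
}

/// Splits on spaces between tokens and on `/` inside a token.
/// Assumption: surfaces and tags contain no literal space or slash.
fn parse_tokenized(line: &str) -> Vec<ParsedToken> {
    line.split(' ')
        .filter(|t| !t.is_empty())
        .map(|t| {
            let mut parts = t.split('/');
            // The first field is the surface; the remaining fields are tags.
            let surface = parts.next().unwrap_or("").to_string();
            let tags = parts.map(str::to_string).collect();
            ParsedToken { surface, tags }
        })
        .collect()
}

fn main() {
    let line = "まぁ/副詞/マー 良い/形容詞/ヨイ だろう/助動詞/ダロー";
    for token in parse_tokenized(line) {
        println!("{} {:?}", token.surface, token.tags);
    }
}
```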

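Feature flags

The following features are disabled by default:

kytea - Enables the reader for models generated by KyTea.
train - Enables the trainer.
portable-simd - Uses the portable SIMD API instead of our SIMD-conscious data layout. (Requires nightly Rust.)

The following features are enabled by default:

std - Uses the standard library. If disabled, the core library is used instead.
cache-type-score - Caches type scores for faster processing. If disabled, type scores are computed in a straightforward manner.
fix-weight-length - Stores scores in fixed-size arrays to facilitate optimization. If disabled, vectors are used instead.
tag-prediction - Enables tag prediction.
charwise-pma - Uses Charwise Daachorse instead of the standard version for faster prediction, although this can make loading a model file slower.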

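Notes on distributed models

The distributed models are compressed in zstd format. To load these compressed models, you must decompress them outside the API.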

// Requires zstd crate or ruzstd crate
let reader = zstd::Decoder::new(File::open("path/to/model.bin.zst")?)?;
let model = Model::read(reader)?;

You can also decompress the file with the unzstd command, which is bundled with modern Linux distributions.
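
Since decompression happens outside the API, it can be useful to check whether a file is still compressed before handing it to the model reader. The following is a minimal sketch that inspects the first four bytes for the Zstandard frame magic number (0xFD2FB528, stored little-endian). `looks_like_zstd` is a hypothetical helper, not part of vaporetto or the zstd crate, and it only recognizes standard frames.

```rust
use std::io::{self, Read};

/// Zstandard frames start with the magic number 0xFD2FB528, stored little-endian.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Returns true if the reader's first four bytes are the Zstandard magic number.
/// Inputs shorter than four bytes are reported as not compressed.
fn looks_like_zstd<R: Read>(mut reader: R) -> io::Result<bool> {
    let mut head = [0u8; 4];
    match reader.read_exact(&mut head) {
        Ok(()) => Ok(head == ZSTD_MAGIC),
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    // In-memory stand-ins for the leading bytes of two files.
    let compressed: &[u8] = &[0x28, 0xB5, 0x2F, 0xFD, 0x00];
    let plain: &[u8] = b"not compressed";
    println!("compressed: {}", looks_like_zstd(compressed)?);
    println!("plain: {}", looks_like_zstd(plain)?);
    Ok(())
}
```

The check consumes the bytes it reads, so with a real file you would reopen it or seek back to the start before decompressing or reading the model.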

License

Licensed under either of the following, at your option:

Apache License, Version 2.0 (LICENSE-APACHE or https://apache.ac.cn/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or https://open-source.org.cn/licenses/MIT)

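Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.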

Dependencies

~3MB
~44K SLoC