1 unstable release

0.1.0	Nov 30, 2023

#295 in Machine learning

536 downloads per month

MIT license

1.5MB
327 lines


Instant CLIP Tokenizer: a fast tokenizer for the CLIP neural network

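Instant CLIP Tokenizer is a fast, pure-Rust text tokenizer for OpenAI's CLIP model. It is intended as a replacement for the original Python-based tokenizer included in the CLIP repository and aims for 100% compatibility with that implementation. It can also be used with OpenCLIP and other implementations that use the same tokenizer.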

Besides being usable as a Rust crate, it includes Python bindings built with PyO3, so it can also be used as a native Python module.

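In the microbenchmarks included in this repository, Instant CLIP Tokenizer is about 70x faster than the Python implementation (with preprocessing and caching disabled to ensure a fair comparison).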

Using the library

Rust

[dependencies]
instant-clip-tokenizer = "0.1.0"
# To enable additional functionality that depends on the `ndarray` crate:
# instant-clip-tokenizer = { version = "0.1.0", features = ["ndarray"] }

Python (≥ 3.9)

pip install instant-clip-tokenizer

Using the library requires numpy >= 1.16.0 to be installed in your Python environment (e.g., via pip install numpy).

Examples

use instant_clip_tokenizer::{Token, Tokenizer};

let tokenizer = Tokenizer::new();

let mut tokens = Vec::new();
tokenizer.encode("A person riding a motorcycle", &mut tokens);
let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
println!("{:?}", tokens);

// -> [320, 2533, 6765, 320, 10297]

import instant_clip_tokenizer

tokenizer = instant_clip_tokenizer.Tokenizer()

tokens = tokenizer.encode("A person riding a motorcycle")
print(tokens)

# -> [320, 2533, 6765, 320, 10297]

batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], context_length=5)
print(batch)

# -> [[49406   320  2533  6765 49407]
#     [49406  1883   997 49407     0]]
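
The row layout in the batch output above can be reproduced by hand, which may help when checking results. The following is a minimal sketch in plain Python, not the library's implementation: the helper `pad_to_context` and the constant names are hypothetical, and the start marker (49406), end marker (49407), padding value (0) and the truncation to `context_length - 2` tokens are all read off the printed output rather than taken from the library's documentation.

```python
# Values inferred from the batch output shown above (assumptions, not library constants).
START_OF_TEXT = 49406  # first id in each row
END_OF_TEXT = 49407    # id that closes each row
PAD = 0                # filler after the end marker


def pad_to_context(token_ids, context_length):
    """Hypothetical helper: lay out one row the way the batch output above looks."""
    if context_length < 2:
        raise ValueError("context_length must leave room for the start and end markers")
    # Keep at most context_length - 2 tokens so both markers still fit.
    body = list(token_ids)[: context_length - 2]
    row = [START_OF_TEXT, *body, END_OF_TEXT]
    # Fill the remainder of the row with padding.
    row += [PAD] * (context_length - len(row))
    return row


rows = [
    pad_to_context([320, 2533, 6765, 320, 10297], context_length=5),
    pad_to_context([1883, 997], context_length=5),
]
print(rows)
```

With the token ids from the examples above as input, the two rows match the printed batch: the longer sentence loses its last two tokens to make room for the markers, and the shorter one is padded with a trailing 0.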

Testing

To run the tests, run the following:

cargo test --all-features

You can also test the Python bindings with:

make test-python

Acknowledgements

Dependencies

~2.8–4.5MB
~72K SLoC