rust-tokenizers
Rust-tokenizer is a drop-in replacement for the tokenization methods in the Transformers library.
Setup
Rust-tokenizer requires a Rust nightly build in order to use the Python API. Building from source involves the following steps:
- Install Rust and use the nightly toolchain
- Run python setup.py install in the repository. This compiles the Rust library and installs the Python API
- Example usage can be found in the /tests folder, including benchmarks and integration tests
The library is fully unit tested at the Rust level.
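To verify that the build and install succeeded, a quick sanity check from Python might look like the sketch below. This is illustrative only: the vocabulary file path is a placeholder that must point to a real BERT WordPiece vocabulary file, and the encode call simply mirrors the usage example further down.

from rust_transformers import PyBertTokenizer

# Placeholder path: point this at an actual BERT WordPiece vocabulary file,
# e.g. the bert-base-uncased vocabulary used in the usage example below.
tokenizer = PyBertTokenizer('bert-base-uncased-vocab.txt')

# Encode a short sentence; the signature mirrors the usage example below.
features = tokenizer.encode('Rust tokenizer sanity check',
                            max_len=16, truncation_strategy='only_first', stride=0)
print([f.token_ids for f in features])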
Usage example
import torch
from rust_transformers import PyBertTokenizer
from transformers.modeling_bert import BertForSequenceClassification

# Load the Rust tokenizer from a local vocabulary file and a pretrained BERT classifier on GPU
rust_tokenizer = PyBertTokenizer('bert-base-uncased-vocab.txt')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=False).cuda()
model = model.eval()

sentence = '''For instance, on the planet Earth, man had always assumed that he was more intelligent than dolphins because
he had achieved so much—the wheel, New York, wars and so on—whilst all the dolphins had ever done was muck
about in the water having a good time. But conversely, the dolphins had always believed that they were far
more intelligent than man—for precisely the same reasons.'''

# Tokenize with the Rust tokenizer, build the input tensor, and run a forward pass without gradients
features = rust_tokenizer.encode(sentence, max_len=128, truncation_strategy='only_first', stride=0)
input_ids = torch.tensor([f.token_ids for f in features], dtype=torch.long).cuda()
with torch.no_grad():
    output = model(input_ids)[0].cpu().numpy()
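The output array above holds the raw classification logits returned by BertForSequenceClassification. A small follow-up sketch for turning them into a predicted label index is shown below; only the numpy import is added, and what the index means depends on how the classification head was fine-tuned.

import numpy as np

# output has shape (batch_size, num_labels); argmax over the label axis
# gives the predicted class index for each input sequence.
predicted_class = int(np.argmax(output, axis=-1)[0])
print(predicted_class)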