
rust-tokenizers

Rust-tokenizer is a drop-in replacement for the tokenization methods of the Transformers library.

Setup

Rust-tokenizer requires a Rust nightly toolchain in order to use the Python API. Building from source involves the following steps:

  1. Install Rust and use the nightly toolchain
  2. Run python setup.py install in the repository; this compiles the Rust library and installs the Python API (a command sketch follows this list)
  3. Example usage can be found in the /tests folder, including benchmarks and integration tests
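For steps 1 and 2 above, the following commands are a minimal sketch, assuming rustup is used to manage toolchains and that they are run from the repository root:

    rustup toolchain install nightly
    rustup override set nightly      # use the nightly toolchain for this repository only (assumption: rustup-managed install)
    python setup.py install          # compiles the Rust library and installs the Python API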

The library is fully unit tested at the Rust level.

Usage example

import torch

from rust_transformers import PyBertTokenizer
from transformers.modeling_bert import BertForSequenceClassification

rust_tokenizer = PyBertTokenizer('bert-base-uncased-vocab.txt')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=False).cuda()
model = model.eval()

sentence = '''For instance, on the planet Earth, man had always assumed that he was more intelligent than dolphins because 
              he had achieved so much—the wheel, New York, wars and so on—whilst all the dolphins had ever done was muck 
              about in the water having a good time. But conversely, the dolphins had always believed that they were far 
              more intelligent than man—for precisely the same reasons.'''

# Tokenize with the Rust tokenizer and build the input tensor for the model
features = rust_tokenizer.encode(sentence, max_len=128, truncation_strategy='only_first', stride=0)
input_ids = torch.tensor([f.token_ids for f in features], dtype=torch.long).cuda()

with torch.no_grad():
    output = model(input_ids)[0].cpu().numpy()
