46 releases
0.22.0 | Jan 20, 2024 |
---|---|
0.21.0 | Jun 3, 2023 |
0.20.0 | Jan 22, 2023 |
0.19.0 | Jul 25, 2022 |
0.5.3 | Mar 27, 2020 |
#9 in Machine learning
2,791 downloads per month
Used in 17 crates (15 directly)
2.5MB
45K SLoC
rust-bert
Rust-native state-of-the-art Natural Language Processing models and pipelines. Port of Hugging Face's Transformers library, using tch-rs or onnxruntime bindings and pre-processing from rust-tokenizers. Supports multi-threaded tokenization and GPU inference. This repository exposes the model base architecture, task-specific heads (see below) and ready-to-use pipelines. Benchmarks are available at the end of this document.
Get started with tasks including question answering, named entity recognition, translation, summarization, text generation, conversational agents and more in just a few lines of code:
let qa_model = QuestionAnsweringModel::new(Default::default())?;
let question = String::from("Where does Amy live ?");
let context = String::from("Amy lives in Amsterdam");
let answers = qa_model.predict(&[QaInput { question, context }], 1, 32);
Output:
[Answer { score: 0.9976, start: 13, end: 21, answer: "Amsterdam" }]
The tasks currently supported include:
- Translation
- Summarization
- Multi-turn dialogue
- Zero-shot classification
- Sentiment analysis
- Named Entity Recognition
- Part of Speech tagging
- Question Answering
- Language Generation
- Masked Language Model
- Sentence Embeddings
- Keywords extraction
Expand to display the supported models/tasks matrix
 | Sequence classification | Token classification | Question answering | Text Generation | Summarization | Translation | Masked LM | Sentence Embeddings |
---|---|---|---|---|---|---|---|---|
DistilBERT | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
MobileBERT | ✅ | ✅ | ✅ | | | | ✅ | |
DeBERTa | ✅ | ✅ | ✅ | | | | ✅ | |
DeBERTa (v2) | ✅ | ✅ | ✅ | | | | ✅ | |
FNet | ✅ | ✅ | ✅ | | | | ✅ | |
BERT | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
RoBERTa | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
GPT | | | | ✅ | | | | |
GPT2 | | | | ✅ | | | | |
GPT-Neo | | | | ✅ | | | | |
GPT-J | | | | ✅ | | | | |
BART | ✅ | | | ✅ | ✅ | | | |
Marian | | | | | | ✅ | | |
MBart | ✅ | | | ✅ | | | | |
M2M100 | | | | ✅ | | | | |
NLLB | | | | ✅ | | | | |
Electra | | ✅ | | | | | ✅ | |
ALBERT | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
T5 | | | | ✅ | ✅ | ✅ | | ✅ |
LongT5 | | | | ✅ | ✅ | | | |
XLNet | ✅ | ✅ | ✅ | ✅ | | | ✅ | |
Reformer | ✅ | | ✅ | ✅ | | | ✅ | |
ProphetNet | | | | ✅ | ✅ | | | |
Longformer | ✅ | ✅ | ✅ | | | | ✅ | |
Pegasus | | | | | ✅ | | | |
Getting started
This library relies on the tch crate for bindings to the C++ Libtorch API. The Libtorch library is required and can be downloaded either automatically or manually. The following provides a reference on how to set up your environment to use these bindings; please refer to tch for detailed information or support.
Furthermore, this library relies on a cache folder for downloading pre-trained models. This cache location defaults to ~/.cache/.rustbert, but can be changed by setting the RUSTBERT_CACHE environment variable. Note that the language models used by this library range from a few hundred MBs to a few GBs in size.
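As a sketch, relocating the model cache could look as follows (the target path is illustrative; RUSTBERT_CACHE is the environment variable named above):

```shell
# Point rust-bert's model cache at a custom directory (illustrative path).
export RUSTBERT_CACHE="$HOME/.cache/rustbert-models"
mkdir -p "$RUSTBERT_CACHE"
```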
Manual installation (recommended)
- Download libtorch from https://pytorch.ac.cn/get-started/locally/. This package requires v2.1: if this version is no longer available on the "get started" page, the file should be accessible by modifying the target link, for example https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.1.1%2Bcu118.zip for a Linux version with CUDA 11. NOTE: when using rust-bert as a dependency from crates.io, please check the required LIBTORCH version on the published package readme, as it may differ from the version documented here (applying to the current repository version).
- Extract the library to a location of your choice
- Set the following environment variables
- 设置以下环境变量
Linux
export LIBTORCH=/path/to/libtorch
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
Windows
$Env:LIBTORCH = "X:\path\to\libtorch"
$Env:Path += ";X:\path\to\libtorch\lib"
macOS + Homebrew
brew install pytorch jq
export LIBTORCH=$(brew --cellar pytorch)/$(brew info --json pytorch | jq -r '.[0].installed[0].version')
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
Automated installation
Alternatively, you can let the build script automatically download the libtorch library for you. The download-libtorch feature flag needs to be enabled. The CPU version of libtorch will be downloaded by default. To download a CUDA version, please set the environment variable TORCH_CUDA_VERSION to cu118. Note that the libtorch library is large (several GBs for the CUDA-enabled version), so the first build may take several minutes to complete.
Verifying installation
Verify your installation (and linking with libtorch) by adding the rust-bert dependency to your Cargo.toml, or by cloning the rust-bert source and running an example:
git clone git@github.com:guillaume-be/rust-bert.git
cd rust-bert
cargo run --example sentence_embeddings
ONNX Support (Optional)
ONNX support can be enabled via the optional onnx feature. This crate then leverages the ort crate with bindings to the onnxruntime C++ library. We recommend that users refer to the ort project page for further installation instructions/support.
- Enable the optional onnx feature. The rust-bert crate does not include any optional dependencies for ort, and the end user should select the set of features that is adequate for pulling the required onnxruntime C++ library.
- The current recommended installation is to use dynamic linking by pointing to an existing library location: use the load-dynamic cargo feature for ort.
- Set ORT_DYLIB_PATH to point to the location of the downloaded onnxruntime library (onnxruntime.dll/libonnxruntime.so/libonnxruntime.dylib depending on the operating system). These can be downloaded from the release page of the onnxruntime project.
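The steps above could translate to a Cargo.toml along these lines (a sketch: the exact ort version must match what your rust-bert release expects, so the versions here are illustrative):

```toml
[dependencies]
rust-bert = { version = "0.22.0", features = ["onnx"] }
# The end user selects the ort features; dynamic linking is the recommended route.
ort = { version = "1.16", features = ["load-dynamic"] }
```

With load-dynamic enabled, remember to set ORT_DYLIB_PATH as described in the last step above.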
Most architectures (including encoders, decoders and encoder-decoders) are supported. The library aims at keeping compatibility with models exported using the optimum library. A detailed guide on how to export a Transformer model to ONNX using optimum is available at https://hugging-face.cn/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model. The resources used to create ONNX models are similar to those based on Pytorch, replacing the Pytorch model with the ONNX one. Since ONNX models are less flexible than their Pytorch counterparts in the handling of optional arguments, exporting a decoder or encoder-decoder model to ONNX usually results in multiple files. These files are expected (but not all are necessary) for use in this library as per the table below:
Architecture | Encoder file | Decoder without past file | Decoder with past file |
---|---|---|---|
Encoder (e.g. BERT) | required | not used | not used |
Decoder (e.g. GPT2) | not used | required | optional |
Encoder-decoder (e.g. BART) | required | required | optional |
Note that computational efficiency drops when the "decoder with past" file is optional but not provided, since the model will not use cached past keys and values for the attention mechanism, leading to a high number of redundant computations. The optimum library offers export options to ensure such a "decoder with past" model file is created. The base encoder and decoder model architectures are available (and exposed for convenience) in the encoder and decoder modules, respectively.
Generation models (pure decoder or encoder/decoder architectures) are available in the models module. Most pipelines are available for ONNX model checkpoints, including sequence classification, zero-shot classification, token classification (including named entity recognition and part-of-speech tagging), question answering, text generation, summarization and translation. These models use the same configuration and tokenizer files as their Pytorch counterparts when used in a pipeline. Examples leveraging ONNX models are given in the ./examples directory.
Ready-to-use pipelines
Based on Hugging Face's pipelines, ready-to-use end-to-end NLP pipelines are available as part of this crate. The following capabilities are currently available:
Disclaimer: The contributors of this repository are not responsible for any generation resulting from third-party use of the pretrained systems proposed herein.
1. Question Answering
Extractive question answering from a given question and context. DistilBERT model fine-tuned on SQuAD (Stanford Question Answering Dataset).
let qa_model = QuestionAnsweringModel::new(Default::default())?;
let question = String::from("Where does Amy live ?");
let context = String::from("Amy lives in Amsterdam");
let answers = qa_model.predict(&[QaInput { question, context }], 1, 32);
Output:
[Answer { score: 0.9976, start: 13, end: 21, answer: "Amsterdam" }]
2. Translation
Translation pipeline supporting a broad range of source and target languages. It leverages two main architectures for translation tasks:
- Marian-based models, for specific source/target combinations
- M2M100 models, allowing direct translation between 100 languages (at a higher computational cost and lower translation quality for some selected language pairs)
Marian-based pretrained models for the following language pairs are readily available in the library, but the user can import any Pytorch-based model for predictions:
- English <-> French
- English <-> Spanish
- English <-> Portuguese
- English <-> Italian
- English <-> Catalan
- English <-> German
- English <-> Russian
- English <-> Chinese
- English <-> Dutch
- English <-> Swedish
- English <-> Arabic
- English <-> Hebrew
- English <-> Hindi
- French <-> German
For languages not covered by a pretrained Marian model, the user can rely on M2M100 models, which support direct translation between 100 languages (without intermediate translation through English). The full list of supported languages is available in the crate documentation.
use rust_bert::pipelines::translation::{Language, TranslationModelBuilder};
fn main() -> anyhow::Result<()> {
let model = TranslationModelBuilder::new()
.with_source_languages(vec![Language::English])
.with_target_languages(vec![Language::Spanish, Language::French, Language::Italian])
.create_model()?;
let input_text = "This is a sentence to be translated";
let output = model.translate(&[input_text], None, Language::Spanish)?;
for sentence in output {
println!("{}", sentence);
}
Ok(())
}
Output:
Il s'agit d'une phrase à traduire
3. Summarization
Abstractive summarization using a pretrained BART model.
let summarization_model = SummarizationModel::new(Default::default())?;
let input = ["In findings published Tuesday in Cornell University's arXiv by a team of scientists \
from the University of Montreal and a separate report published Wednesday in Nature Astronomy by a team \
from University College London (UCL), the presence of water vapour was confirmed in the atmosphere of K2-18b, \
a planet circling a star in the constellation Leo. This is the first such discovery in a planet in its star's \
habitable zone — not too hot and not too cold for liquid water to exist. The Montreal team, led by Björn Benneke, \
used data from the NASA's Hubble telescope to assess changes in the light coming from K2-18b's star as the planet \
passed between it and Earth. They found that certain wavelengths of light, which are usually absorbed by water, \
weakened when the planet was in the way, indicating not only does K2-18b have an atmosphere, but the atmosphere \
contains water in vapour form. The team from UCL then analyzed the Montreal team's data using their own software \
and confirmed their conclusion. This was not the first time scientists have found signs of water on an exoplanet, \
but previous discoveries were made on planets with high temperatures or other pronounced differences from Earth. \
\"This is the first potentially habitable planet where the temperature is right and where we now know there is water,\" \
said UCL astronomer Angelos Tsiaras. \"It's the best candidate for habitability right now.\" \"It's a good sign\", \
said Ryan Cloutier of the Harvard–Smithsonian Center for Astrophysics, who was not one of either study's authors. \
\"Overall,\" he continued, \"the presence of water in its atmosphere certainly improves the prospect of K2-18b being \
a potentially habitable planet, but further observations will be required to say for sure. \"
K2-18b was first identified in 2015 by the Kepler space telescope. It is about 110 light-years from Earth and larger \
but less dense. Its star, a red dwarf, is cooler than the Sun, but the planet's orbit is much closer, such that a year \
on K2-18b lasts 33 Earth days. According to The Guardian, astronomers were optimistic that NASA's James Webb space \
telescope — scheduled for launch in 2021 — and the European Space Agency's 2028 ARIEL program, could reveal more \
about exoplanets like K2-18b."];
let output = summarization_model.summarize(&input);
(Example from: WikiNews)
Output:
"Scientists have found water vapour on K2-18b, a planet 110 light-years from Earth.
This is the first such discovery in a planet in its star's habitable zone.
The planet is not too hot and not too cold for liquid water to exist."
4. Dialogue Model
Conversation model based on Microsoft's DialoGPT. This pipeline allows the generation of single- or multi-turn conversations between a human and a model. The DialoGPT page states that:
The human evaluation results indicate that the response generated from DialoGPT is comparable to human response quality under a single-turn conversation Turing test. (DialoGPT repository)
The model uses a ConversationManager to keep track of active conversations and generate responses to them.
use rust_bert::pipelines::conversation::{ConversationModel, ConversationManager};
let conversation_model = ConversationModel::new(Default::default())?;
let mut conversation_manager = ConversationManager::new();
let conversation_id = conversation_manager.create("Going to the movies tonight - any suggestions?");
let output = conversation_model.generate_responses(&mut conversation_manager);
Example output:
"The Big Lebowski."
5. Natural Language Generation
Generate language based on a prompt. GPT2 and GPT are available as base models. Includes techniques such as beam search, top-k and nucleus sampling, temperature setting and repetition penalty. Supports batch generation of sentences from several prompts. Sequences will be left-padded with the model's padding token if present, with the unknown token otherwise. This may impact the results; it is recommended to submit prompts of similar length for best results.
let model = GPT2Generator::new(Default::default())?;
let input_context_1 = "The dog";
let input_context_2 = "The cat was";
let generate_options = GenerateOptions {
max_length: 30,
..Default::default()
};
let output = model.generate(Some(&[input_context_1, input_context_2]), generate_options);
Example output:
[
"The dog's owners, however, did not want to be named. According to the lawsuit, the animal's owner, a 29-year"
"The dog has always been part of the family. \"He was always going to be my dog and he was always looking out for me"
"The dog has been able to stay in the home for more than three months now. \"It's a very good dog. She's"
"The cat was discovered earlier this month in the home of a relative of the deceased. The cat\'s owner, who wished to remain anonymous,"
"The cat was pulled from the street by two-year-old Jazmine.\"I didn't know what to do,\" she said"
"The cat was attacked by two stray dogs and was taken to a hospital. Two other cats were also injured in the attack and are being treated."
]
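The sampling techniques mentioned above (temperature and top-k) can be sketched independently of the library. This is a toy illustration of the general idea, not the rust-bert implementation:

```rust
/// Toy illustration of temperature scaling followed by top-k filtering:
/// scale the logits, keep the k largest, and renormalize with a softmax.
fn top_k_with_temperature(logits: &[f64], k: usize, temperature: f64) -> Vec<(usize, f64)> {
    let mut indexed: Vec<(usize, f64)> = logits
        .iter()
        .map(|l| l / temperature)
        .enumerate()
        .collect();
    // Sort descending by scaled logit and keep the k largest entries.
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    indexed.truncate(k);
    // Numerically stable softmax over the surviving logits.
    let max = indexed[0].1;
    let weights: Vec<f64> = indexed.iter().map(|(_, l)| (l - max).exp()).collect();
    let total: f64 = weights.iter().sum();
    indexed
        .iter()
        .zip(&weights)
        .map(|(&(token_id, _), w)| (token_id, w / total))
        .collect()
}

fn main() {
    // Pretend logits over a 4-token vocabulary; token 1 is the most likely.
    let logits = [1.0, 3.0, 0.5, 2.0];
    for (token_id, p) in top_k_with_temperature(&logits, 2, 0.7) {
        println!("token {token_id}: p = {p:.3}");
    }
}
```

A generation step would then draw the next token from this truncated distribution instead of taking the argmax.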
6. Zero-shot classification
Performs zero-shot classification of input sentences using a model fine-tuned for Natural Language Inference.
let sequence_classification_model = ZeroShotClassificationModel::new(Default::default())?;
let input_sentence = "Who are you voting for in 2020?";
let input_sequence_2 = "The prime minister has announced a stimulus package which was widely criticized by the opposition.";
let candidate_labels = &["politics", "public health", "economics", "sports"];
let output = sequence_classification_model.predict_multilabel(
&[input_sentence, input_sequence_2],
candidate_labels,
None,
128,
);
Output:
[
[ Label { "politics", score: 0.972 }, Label { "public health", score: 0.032 }, Label {"economics", score: 0.006 }, Label {"sports", score: 0.004 } ],
[ Label { "politics", score: 0.975 }, Label { "public health", score: 0.0818 }, Label {"economics", score: 0.852 }, Label {"sports", score: 0.001 } ],
]
7. Sentiment analysis
Predicts the binary sentiment of a sentence. DistilBERT model fine-tuned on SST-2.
let sentiment_classifier = SentimentModel::new(Default::default())?;
let input = [
"Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring.",
"This film tried to be too many things all at once: stinging political satire, Hollywood blockbuster, sappy romantic comedy, family values promo...",
"If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.",
];
let output = sentiment_classifier.predict(&input);
(Example courtesy of IMDb)
Output:
[
Sentiment { polarity: Positive, score: 0.9981985493795946 },
Sentiment { polarity: Negative, score: 0.9927982091903687 },
Sentiment { polarity: Positive, score: 0.9997248985164333 }
]
8. Named Entity Recognition
Extracts entities (Person, Location, Organization, Miscellaneous) from text. BERT cased large model fine-tuned on CoNLL03, contributed by the MDZ Digital Library team at the Bavarian State Library. Models are currently available for English, German, Spanish and Dutch.
let ner_model = NERModel::new(Default::default())?;
let input = [
"My name is Amy. I live in Paris.",
"Paris is a city in France."
];
let output = ner_model.predict(&input);
Output:
[
[
Entity { word: "Amy", score: 0.9986, label: "I-PER" }
Entity { word: "Paris", score: 0.9985, label: "I-LOC" }
],
[
Entity { word: "Paris", score: 0.9988, label: "I-LOC" }
Entity { word: "France", score: 0.9993, label: "I-LOC" }
]
]
9. Keywords/Keyphrases extraction
Extract keywords and keyphrases from input documents.
fn main() -> anyhow::Result<()> {
let keyword_extraction_model = KeywordExtractionModel::new(Default::default())?;
let input = "Rust is a multi-paradigm, general-purpose programming language. \
Rust emphasizes performance, type safety, and concurrency. Rust enforces memory safety—that is, \
that all references point to valid memory—without requiring the use of a garbage collector or \
reference counting present in other memory-safe languages. To simultaneously enforce \
memory safety and prevent concurrent data races, Rust's borrow checker tracks the object lifetime \
and variable scope of all references in a program during compilation. Rust is popular for \
systems programming but also offers high-level features including functional programming constructs.";
let output = keyword_extraction_model.predict(&[input])?;
Ok(())
}
Output:
"rust" - 0.50910604
"programming" - 0.35731024
"concurrency" - 0.33825397
"concurrent" - 0.31229728
"program" - 0.29115444
10. Part of Speech tagging
Extracts Part of Speech tags (Noun, Verb, Adjective...) from text.
let pos_model = POSModel::new(Default::default())?;
let input = ["My name is Bob"];
let output = pos_model.predict(&input);
Output:
[
Entity { word: "My", score: 0.1560, label: "PRP" }
Entity { word: "name", score: 0.6565, label: "NN" }
Entity { word: "is", score: 0.3697, label: "VBZ" }
Entity { word: "Bob", score: 0.7460, label: "NNP" }
]
11. Sentence embeddings
Generate sentence embeddings (vector representations). These can be used for applications including dense information retrieval.
let model = SentenceEmbeddingsBuilder::remote(
SentenceEmbeddingsModelType::AllMiniLmL12V2
).create_model()?;
let sentences = [
"this is an example sentence",
"each sentence is converted"
];
let output = model.encode(&sentences)?;
Output:
[
[-0.000202666, 0.08148022, 0.03136178, 0.002920636 ...],
[0.064757116, 0.048519745, -0.01786038, -0.0479775 ...]
]
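As a sketch of the dense-retrieval use case mentioned above, embeddings are typically compared with cosine similarity; the vectors below are toy values, not real model outputs:

```rust
/// Cosine similarity between two equal-length embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy "embeddings": the query should rank closer to the first document.
    let query = [0.1, 0.8, 0.2];
    let documents = [[0.1, 0.7, 0.3], [0.9, -0.1, 0.0]];
    for (i, doc) in documents.iter().enumerate() {
        println!("document {i}: similarity = {:.3}", cosine_similarity(&query, doc));
    }
}
```

In a real retrieval setting, the vectors would come from model.encode and documents would be ranked by descending similarity to the query.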
12. Masked Language Model
Predict masked words in input sentences.
let model = MaskedLanguageModel::new(Default::default())?;
let sentences = [
"Hello I am a <mask> student",
"Paris is the <mask> of France. It is <mask> in Europe.",
];
let output = model.predict(&sentences)?;
Output:
[
[MaskedToken { text: "college", id: 2267, score: 8.091}],
[
MaskedToken { text: "capital", id: 3007, score: 16.7249},
MaskedToken { text: "located", id: 2284, score: 9.0452}
]
]
Benchmarks
For simple pipelines (sequence classification, token classification, question answering), the performance between Python and Rust is expected to be comparable. This is because the most expensive part of these pipelines is the language model itself, which shares a common implementation in the Torch backend. The End-to-end NLP Pipelines in Rust article provides a benchmarks section covering all pipelines.
For text generation tasks (summarization, translation, conversation, free text generation), significant benefits can be expected (2 to 4 times faster processing, depending on the input and application). The article Accelerating text generation with Rust focuses on these text generation applications and provides more details on the performance comparison to Python.
Loading pretrained and custom model weights
The base model and task-specific heads are also available for users looking to expose their own transformer-based models. Examples on how to prepare the data using the native tokenizers Rust library are available in ./examples for BERT, DistilBERT, RoBERTa, GPT, GPT2 and BART. Note that when importing models from Pytorch, the convention for parameter naming needs to be aligned with the Rust schema. Loading the pretrained weights will fail if any of the model parameter weights cannot be found in the weight files. If this quality check is to be skipped, the alternative method load_partial can be invoked from the variable store.
Pretrained models are available on Hugging Face's model hub and can be loaded using the RemoteResources defined in this library. A conversion utility script is included in the ./utils directory to convert Pytorch weights to a set of weights compatible with this library. This script requires Python and torch to be set up, and can be used as follows: python ./utils/convert_model.py path/to/pytorch_model.bin where path/to/pytorch_model.bin is the location of the original Pytorch weights.
Citation
If you use rust-bert for your work, please cite End-to-end NLP Pipelines in Rust:
@inproceedings{becquin-2020-end,
title = "End-to-end {NLP} Pipelines in Rust",
author = "Becquin, Guillaume",
booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlposs-1.4",
pages = "20--25",
}
Acknowledgements
Thank you to Hugging Face for hosting a set of weights compatible with this Rust library. The list of ready-to-use pretrained models is available at https://hugging-face.cn/models?filter=rust.
Dependencies
~20–39MB
~687K SLoC