3 releases

0.1.2	Sep 4, 2023
0.1.1	Sep 3, 2023
0.1.0	Sep 3, 2023

#725 in Encoding

MIT license

5MB
1K SLoC

`tiktoken-rs`

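Rust library for tokenizing text for OpenAI models using tiktoken

This library provides a set of ready-made tokenizers for working with GPT, tiktoken, and related OpenAI models. Use cases include tokenizing text inputs and counting their tokens.

It is built on top of the tiktoken library and adds some extra features and enhancements that make it easier to use from Rust code.

Examples

For full working examples of all supported features, see the examples directory in the repository.

Usage

Install this tool locally with cargo: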

cargo add another-tiktoken-rs

Then call the API from your Rust code.

Counting token length

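use another_tiktoken_rs::p50k_base;

// Load the p50k_base encoding.
let bpe = p50k_base().unwrap();
// Encode the text; the number of tokens is the length of the result.
let tokens = bpe.encode_with_special_tokens(
  "This is a sentence   with spaces"
);
println!("Token count: {}", tokens.len());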

Counting the max_tokens parameter for a chat completion request

use another_tiktoken_rs::{get_chat_completion_max_tokens, ChatCompletionRequestMessage};

let messages = vec![
    ChatCompletionRequestMessage {
        content: Some("You are a helpful assistant that only speaks French.".to_string()),
        role: "system".to_string(),
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Hello, how are you?".to_string()),
        role: "user".to_string(),
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Parlez-vous francais?".to_string()),
        role: "system".to_string(),
        name: None,
        function_call: None,
    },
];
let max_tokens = get_chat_completion_max_tokens("gpt-4", &messages).unwrap();
println!("max_tokens: {}", max_tokens);
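
The `max_tokens` value above can be read as a budget: the model's context window minus the tokens the prompt already uses. The sketch below illustrates only that arithmetic and is not this crate's implementation. The names `approx_token_count`, `approx_prompt_tokens`, and `remaining_tokens` are hypothetical, the whitespace word count is a stand-in for a real BPE encoder, and the per-message overhead and reply priming are assumed constants passed in by the caller.

```rust
/// Hypothetical stand-in for a real BPE tokenizer:
/// counts whitespace-separated words instead of tokens.
fn approx_token_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Estimates the prompt size for a list of (role, content) pairs.
/// `per_message_overhead` and `reply_priming` are assumed constants
/// chosen by the caller, not values taken from this crate.
fn approx_prompt_tokens(
    messages: &[(&str, &str)],
    per_message_overhead: usize,
    reply_priming: usize,
) -> usize {
    let body: usize = messages
        .iter()
        .map(|(role, content)| {
            per_message_overhead + approx_token_count(role) + approx_token_count(content)
        })
        .sum();
    body + reply_priming
}

/// Tokens left for the completion. Saturates at zero when the prompt
/// is already larger than the context window.
fn remaining_tokens(context_size: usize, prompt_tokens: usize) -> usize {
    context_size.saturating_sub(prompt_tokens)
}
```

For example, with an assumed 8192-token context window, `remaining_tokens(8192, approx_prompt_tokens(&messages, 3, 3))` yields whatever the prompt has not consumed. Real counts should come from the encoder, since word counts and token counts generally differ.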

Counting the max_tokens parameter for a chat completion request with async-openai

You need to enable the `async-openai` feature in your Cargo.toml file.

use another_tiktoken_rs::async_openai::get_chat_completion_max_tokens;
use async_openai::types::{ChatCompletionRequestMessage, Role};

let messages = vec![
    ChatCompletionRequestMessage {
        content: Some("You are a helpful assistant that only speaks French.".to_string()),
        role: Role::System,
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Hello, how are you?".to_string()),
        role: Role::User,
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Parlez-vous francais?".to_string()),
        role: Role::System,
        name: None,
        function_call: None,
    },
];
let max_tokens = get_chat_completion_max_tokens("gpt-4", &messages).unwrap();
println!("max_tokens: {}", max_tokens);

tiktoken supports these encodings used by OpenAI models:

Encoding name	OpenAI models
`cl100k_base`	ChatGPT models, `text-embedding-ada-002`
`p50k_base`	Code models, `text-davinci-002`, `text-davinci-003`
`p50k_edit`	Edit models such as `text-davinci-edit-001`, `code-davinci-edit-001`
`r50k_base` (or `gpt2`)	GPT-3 models such as `davinci`
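
The table can be expressed as a small lookup. This is a minimal sketch of the mapping only. `encoding_for_model` is a hypothetical helper, not a function of this crate, and treating names that start with `gpt-4` or `gpt-3.5-turbo` as "ChatGPT models" is an assumption made for illustration.

```rust
/// Hypothetical helper: returns the encoding name the table lists
/// for a given model name, or `None` if the table does not cover it.
fn encoding_for_model(model: &str) -> Option<&'static str> {
    match model {
        "text-embedding-ada-002" => Some("cl100k_base"),
        "text-davinci-002" | "text-davinci-003" => Some("p50k_base"),
        "text-davinci-edit-001" | "code-davinci-edit-001" => Some("p50k_edit"),
        "davinci" => Some("r50k_base"),
        // Assumption: "ChatGPT models" means names with these prefixes.
        m if m.starts_with("gpt-4") || m.starts_with("gpt-3.5-turbo") => Some("cl100k_base"),
        _ => None,
    }
}
```

Returning `Option` keeps unknown model names explicit, so a caller can decide whether to fall back to a default encoding or report an error.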

See the examples in the repository for use cases. For more information on the different tokenizers, see the OpenAI Cookbook.

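Encountered any bugs?

If you encounter any bugs or have suggestions for improvements, please open an issue on the repository.

Acknowledgements

Thanks to @spolu for the original code and the .tiktoken files.

License

This project is licensed under the MIT License.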

Dependencies

~5-20MB
~280K SLoC