llm_utils: no chains, just tools

  • Tokenizers for the major open-source models.
  • A SotA balanced text chunker with a fast, parallelized implementation.
  • Presets for loading models locally; calculates the optimal quantization for your GPU.
  • Advanced prompting tools: chat templates for open-source models and for OpenAI/Anthropic models, accurate prompt token counting, and building grammars and logit biases.
  • Parsing and cleaning of HTML and text.

Installation

[dependencies]
llm_utils = "*"

Tokenizers 🧮

  • Hugging Face's Tokenizers library for local models, and tiktoken-rs for OpenAI and Anthropic (Anthropic does not have a publicly available tokenizer).

  • A simple, abstracted API for encoding and decoding that allows LLMs to be consumed abstractly across multiple architectures.

  • Safely set the max_token parameter for LLMs to ensure requests don't fail from exceeding the token limit!

    // Get a Tiktoken tokenizer
    //
    let tokenizer: LlmTokenizer = LlmTokenizer::new_tiktoken("gpt-4o");

    // Get a Hugging Face tokenizer from local path
    //
    let tokenizer: LlmTokenizer = LlmTokenizer::new_from_tokenizer_json("path/to/tokenizer.json");
    
    // Or load from repo
    //
    let tokenizer: LlmTokenizer = LlmTokenizer::new_from_hf_repo(hf_token, "meta-llama/Meta-Llama-3-8B-Instruct");

    // Tokenize, count tokens, and detokenize
    //
    let token_ids: Vec<u32> = tokenizer.tokenize("Hello there");
    let count: u32 = tokenizer.count_tokens("Hello there");
    let word_probably: String = tokenizer.detokenize_one(token_ids[0])?; 
    let words_probably: String = tokenizer.detokenize_many(token_ids)?; 

    // These functions are used for generating logit bias
    let token_id: u32 = tokenizer.try_into_single_token("hello");
    let word_probably: String = tokenizer.try_from_single_token_id(1234);

Text chunking 🪓

Balanced text chunking means that all chunks are roughly the same size.

    let text = "one, two, three, four, five, six, seven, eight, nine";

    // Given a max token count of four, other text chunkers would split this into three chunks.
    assert_eq!(["one, two, three, four", "five, six, seven, eight", "nine"], // "nine" is orphaned!
        OtherChunkers::new()
        .max_chunk_token_size(4)
        .chunk(text));

    // A balanced text chunker, however, would also split the text into three chunks, but of even sizes.
    assert_eq!(["one, two, three", "four, five, six", "seven, eight, nine"], 
        TextChunker::new()
        .max_chunk_token_size(4)
        .run(&text)?);
       

Whenever the total token length of the incoming text is not evenly divisible by the max token count, the final chunk will be smaller than the others. In some cases it can be so small that it gets "orphaned" and becomes useless. If you asked a RAG implementation to answer "What did seven eat?", the final chunk needed to answer the question could not be retrieved.

TextChunker first attempts semantic splits in the following order: paragraphs, newlines, sentences. If those fail, it builds the chunks linearly using the largest splits available, splitting wherever needed.
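
The fallback order can be sketched roughly as shown below. This is not the crate's actual implementation (the real chunker is parallelized and balances chunk sizes); it is only a minimal illustration, with assumed separator strings, of trying coarser semantic splits before finer ones, and a token-counting closure standing in for a tokenizer.

    // Illustrative sketch only: try paragraph, then newline, then sentence splits,
    // and keep the first separator whose pieces all fit within the token budget.
    fn semantic_split(
        text: &str,
        max_tokens: u32,
        count_tokens: impl Fn(&str) -> u32,
    ) -> Option<Vec<String>> {
        // Coarsest to finest: paragraphs, newlines, (approximate) sentences.
        let separators = ["\n\n", "\n", ". "];
        for sep in separators {
            let pieces: Vec<String> = text.split(sep).map(|p| p.to_string()).collect();
            if pieces.len() > 1 && pieces.iter().all(|p| count_tokens(p) <= max_tokens) {
                return Some(pieces);
            }
        }
        // No semantic split fits: the real TextChunker then falls back to building
        // balanced chunks linearly from the largest splits available.
        None
    }

With a real tokenizer, the closure can simply wrap the count_tokens method shown in the Tokenizers section.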

Model presets 🛤️

  • Presets for open-source LLMs from Hugging Face, and for API models such as OpenAI and Anthropic.

  • Load and/or download a model with its metadata, tokenizer, and local path (for local LLMs such as llama.cpp, vllm, mistral.rs).

  • Automatically selects the largest quantized GGUF that will fit in your VRAM!

Supported open-source models

⚪ Llama 3

⚪ Mistral and Mixtral

⚪ Phi 3

    // Load the largest quantized Mistral-7B-Instruct model that will fit in your vram
    //
    let model: OsLlm = PresetModelBuilder::new()
        .mistral_7b_instruct()
        .vram(48)
        .ctx_size(9001) // ctx_size impacts vram usage!
        .load()
        .await?;

    not_a_real_assert_eq!(model, OsLlm {
        pub model_id: String,
        pub model_url: String,
        pub local_model_path: String, // Use this to load the llama.cpp server
        pub model_config_json: OsLlmConfigJson,
        pub chat_template: OsLlmChatTemplate,
        pub tokenizer: Option<LlmTokenizer>,
    })

    // Or OpenAI
    //
    let model: OpenAiLlm = OpenAiLlm::gpt_4_o();

    not_a_real_assert_eq!(model, OpenAiLlm {
        model_id: "gpt-4o".to_string(),
        context_length: 128000,
        cost_per_m_in_tokens: 5.00,
        max_tokens_output: 4096,
        cost_per_m_out_tokens: 15.00,
        tokens_per_message: 3,
        tokens_per_name: 1,
        tokenizer: Option<LlmTokenizer>,
    })

    // Or Anthropic
    //
    let model: AnthropicLlm = AnthropicLlm::claude_3_opus();
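
As an aside, the metadata fields above make it straightforward to estimate what a request will cost. Here is a minimal sketch; the helper below is hypothetical (not part of llm_utils), and the prices are the gpt-4o values from the OpenAiLlm preset shown earlier:

    // Hypothetical helper: estimate the dollar cost of a request from the
    // per-million-token prices carried by a model preset.
    fn estimate_cost(prompt_tokens: u32, output_tokens: u32, cost_per_m_in: f64, cost_per_m_out: f64) -> f64 {
        (prompt_tokens as f64 / 1_000_000.0) * cost_per_m_in
            + (output_tokens as f64 / 1_000_000.0) * cost_per_m_out
    }

    // e.g. a 2,000-token prompt with a 500-token response on gpt-4o:
    // 0.002 * 5.00 + 0.0005 * 15.00 = $0.0175
    let request_cost = estimate_cost(2_000, 500, 5.00, 15.00);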

GGUF models from Hugging Face or a local path 🚤

    // From HF
    //
    let model_url = "https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q6_K.gguf";
    let model: OsLlm = GGUFModelBuilder::new()
            .hf_quant_file_url(model_url)
            .load()
            .await?;

    // Note: because we can't instantiate a tokenizer from a GGUF file, the returned model will not have a tokenizer!
    // However, if we provide the base model's repo, the tokenizer is loaded from there.
    let repo_id = "meta-llama/Meta-Llama-3-8B-Instruct";
    let model: OsLlm = GGUFModelBuilder::new()
        .hf_quant_file_url(model_url)
        .hf_config_repo_id(repo_id)
        .load()
        .await?;

    // From Local
    //
    let local_path = "/root/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3-8B-Instruct-GGUF/blobs/c2ca99d853de276fb25a13e369a0db2fd3782eff8d28973404ffa5ffca0b9267";
    let model: OsLlm = GGUFModelBuilder::new()
            .local_quant_file_path(local_path)
            .load()
            .await?;

    // Again, we require a tokenizer.json. This can also be loaded from a local path.
    let local_config_path = "/llm_utils/src/models/open_source/llama/llama_3_8b_instruct";
    let model: OsLlm = GGUFModelBuilder::new()
        .local_quant_file_path(local_path)
        .local_config_path(local_config_path)
        .load()
        .await?;

Prompting 🎶

  • Generate correctly formatted prompts for GGUF models, OpenAI, and Anthropic.

  • Uses the GGUF model's chat template and Jinja templates to format the prompt to the model's spec.

  • Create prompts from a combination of dynamic inputs and/or static inputs from files.

    // Default formatted prompt (OpenAI and Anthropic format)
    //
    let default_formatted_prompt: HashMap<String, HashMap<String, String>> = prompting::default_formatted_prompt(
        "You are a nice robot.",
        "path/to/a/file/no_birds_and_bees_yap.yaml",
        "Where do robots come from?"
    )?;

    // Get total tokens in prompt
    //
    let total_prompt_tokens: u32 = model.openai_token_count_of_prompt(&tokenizer, &default_formatted_prompt);


    // Then convert it to be used for a GGUF model
    //
    let gguf_formatted_prompt: String = prompting::convert_default_prompt_to_model_format(
        &default_formatted_prompt,
        &model.chat_template,
    )?;

    // Since the GGUF formatted prompt is just a string, we can just use the generic count_tokens function
    //
    let total_prompt_tokens: u32 = tokenizer.count_tokens(&gguf_formatted_prompt);

    // Validate requested max_tokens for a generation. If it exceeds the models limits, reduce max_tokens to a safe value.
    //
    let safe_max_tokens = get_and_check_max_tokens_for_response(
            model.context_length,
            model.max_tokens_output, // If using a GGUF model use either model.context_length or the ctx_size of the server.
            total_prompt_tokens,
            10,
            None,
            requested_max_tokens,
        )?;

Grammar 🤓

  • Grammars are the most capable method for structuring the output of an LLM. This was designed for use with LlamaCpp, but support for other models is planned.

  • Create lists of N items, or restrict the character types in the output.

  • More to be added (JSON, classification, restricting characters, words, phrases).

    // Return a list of between 1 and 4 items
    //
    let grammar = llm_utils::grammar::create_list_grammar(1, 4);

    // The list will be formatted: `- <list text>\n`
    //
    let response: String = text_generation_request(&req_config, Some(&grammar)).await?;

    // So you can easily split like:
    //
    let response_items: Vec<String> = response
        .lines()
        .map(|line| line[1..].trim().to_string())
        .collect();

    // Exclude numbers from text generation
    //
    let grammar = llm_utils::grammar::create_text_structured_grammar(vec![RestrictedCharacterSet::PunctuationExtended]);
    let response: String = text_generation_request(&req_config, Some(&grammar)).await?;
    assert!(!response.contains('0'));
    assert!(!response.contains("1234"));

    // Exclude a list of common, and commonly unwanted characters from text generation
    //
    let grammar = llm_utils::grammar::create_text_structured_grammar(vec![RestrictedCharacterSet::PunctuationExtended]);
    let response: String = text_generation_request(&req_config, Some(&grammar)).await?;
    assert!(!response.contains('@'));
    assert!(!response.contains('['));
    assert!(!response.contains('*'));

Logit bias #️⃣

  • Create properly formatted logit bias requests for LlamaCpp and OpenAI.

  • Functions for adding logit biases from a variety of sources, plus validation.

    // Exclude some tokens from text generation
    //
    let mut words = HashMap::new();
    words.entry("delve").or_insert(-100.0);
    words.entry("as an ai model").or_insert(-100.0);

    // Build and validate
    //
    let logit_bias = logit_bias::logit_bias_from_words(&tokenizer, &words);
    let validated_logit_bias = logit_bias::validate_logit_bias_values(&logit_bias)?;

    // Convert
    //
    let openai_logit_bias = logit_bias::convert_logit_bias_to_openai_format(&validated_logit_bias)?;
    let llama_logit_bias = logit_bias::convert_logit_bias_to_llama_format(&validated_logit_bias)?;

Text splitting 🔪

Split text by paragraphs, sentences, words, and characters.



    let paragraph_splits: Vec<String> =  TextSplitter::new()
        .on_two_plus_newline()
        .split_text(&text)?;

    let newline_splits: Vec<String> =  TextSplitter::new()
        .on_single_newline()
        .split_text(&text)?;

    // There is no good implementation of sentence splitting in Rust!
    // This implementation is better than the unicode-segmentation crate or any other crate I tested,
    // but it's still not as good as a model-based approach like spaCy or other NLP libraries.
    //
    let sentence_splits: Vec<String> =  TextSplitter::new()
        .on_sentences_rule_based()
        .split_text(&text)?;

    // Unicode

    let sentence_splits: Vec<String> =  TextSplitter::new()
        .on_sentences_unicode()
        .split_text(&text)?;

    let word_splits: Vec<String> =  TextSplitter::new()
        .on_words_unicode()
        .split_text(&text)?;

    
    let graphemes_splits: Vec<String> =  TextSplitter::new()
        .on_graphemes_unicode()
        .split_text(&text)?;

    // If the split separator produces fewer than two splits,
    // this mode tries the next separator.
    // It does this until it produces more than one split.
    //
    let paragraph_splits: Vec<String> =  TextSplitter::new()
        .on_two_plus_newline()
        .recursive(true)
        .split_text(&text)?;

       

Text cleaning 📝

    // Normalizes all whitespace characters.
    // Reduces the number of newlines to singles or doubles (paragraphs), or converts them to " ".
    // Optionally, removes all characters besides alphabetic characters, numbers, and punctuation.
    //
    let mut text_cleaner = llm_utils::text_utils::clean_text::TextCleaner::new();
    let cleaned_text: String = text_cleaner
        .reduce_newlines_to_single_space()
        .remove_non_basic_ascii()
        .run(some_dirty_text);

    // Convert HTML to cleaned text.
    // Uses an implementation of Mozilla's readability mode and HTML2Text.
    //
    let cleaned_text: String = llm_utils::text_utils::clean_html::clean_html(raw_html);
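
These utilities compose; here is a brief sketch of a typical pipeline (imports omitted as elsewhere in this README, and assuming run returns the chunks as Vec<String>, as the earlier chunking example suggests):

    // Scrape -> clean -> chunk -> count tokens before prompting.
    let cleaned_text: String = llm_utils::text_utils::clean_html::clean_html(raw_html);

    let chunks: Vec<String> = TextChunker::new()
        .max_chunk_token_size(512)
        .run(&cleaned_text)?;

    // Token counts per chunk, using the tokenizer from the Tokenizers section.
    let chunk_token_counts: Vec<u32> = chunks
        .iter()
        .map(|chunk| tokenizer.count_tokens(chunk))
        .collect();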

License

This project is licensed under the MIT License.

Contributing

My motivation for publishing this project is the hope that someone will point out if I'm doing something wrong!
