6 releases

0.3.2	Jul 9, 2024
0.3.1	Jun 13, 2024
0.3.0	May 19, 2024
0.2.0	May 18, 2024
0.1.1	May 17, 2024

#141 in Database implementations

Apache-2.0

37KB
523 lines

VecEmbedStore

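This is a lightweight wrapper around LanceDb (a vector database). It aims to provide a way to create, store, and query embeddings in LanceDb without requiring in-depth knowledge of the underlying Arrow/columnar database technology.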

Usage example

Add VecEmbedStore to your dependencies

cargo add vec_embed_store
# If you want to select an embedding engine other than the default, you currently need to add fastembed as well.
# There is an open issue to remove this requirement: https://github.com/samkeen/vec-embed-store/issues/9
# [optional] 
cargo add fastembed

use std::path::PathBuf;
use vec_embed_store::{EmbeddingsDb, EmbeddingEngineOptions, TextChunk, SimilaritySearch};
use fastembed::EmbeddingModel::BGESmallENV15;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Set up the embedding engine options
    let embedding_engine_options = EmbeddingEngineOptions {
        model_name: BGESmallENV15, // see https://docs.rs/fastembed/latest/fastembed/enum.EmbeddingModel.html
        cache_dir: PathBuf::from("path/to/cache"),
        show_download_progress: true,
        ..Default::default()
    };

    // Create a new instance of EmbeddingsDb
    let embed_db = EmbeddingsDb::new("path/to/db", embedding_engine_options).await?;

    // Define the texts to be added to the database. Chunk the texts in any way you see fit, as long as
    //   the chunking suits the chosen embedding engine.
    let texts = vec![
        TextChunk {
            id: "1".to_string(),
            text: "Once upon a midnight dreary, while I pondered, weak and weary,".to_string(),
        },
        TextChunk {
            id: "2".to_string(),
            text: "Over many a quaint and curious volume of forgotten lore—".to_string(),
        },
        TextChunk {
            id: "3".to_string(),
            text: "While I nodded, nearly napping, suddenly there came a tapping,".to_string(),
        },
    ];

    // Upsert the texts into the embeddings database (there is no separate add/update)
    // TextChunks MUST be unique on `TextChunk.id`
    embed_db.upsert_texts(&texts).await?;

    // Retrieve a text by its ID
    let retrieved_text = embed_db.get_text_by_id("1").await?;
    println!("Retrieved text: {:?}", retrieved_text);

    // Define a text for similarity search
    let search_text = "suddenly there came a tapping";

    // Perform a similarity search
    let search_results = embed_db
        .get_similar_to(search_text)
        .limit(2)
        .threshold(0.8)
        .execute()
        .await?;

    println!("Similarity search results:");
    for result in search_results {
        println!("ID: {}, Text: {}, Distance: {}", result.id, result.text, result.distance);
    }

    // Get all text chunks from the database
    let all_texts = embed_db.get_all_texts().await?;
    println!("All texts: {:?}", all_texts);

    // Delete texts by their IDs
    let ids_to_delete = vec!["2".to_string(), "3".to_string()];
    embed_db.delete_texts(&ids_to_delete).await?;

    // Get the count of items in the database
    let count = embed_db.items_count().await?;
    println!("Number of items in the database: {}", count);

    // Clear all data from the database
    embed_db.empty_db().await?;

    Ok(())
}
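
The upsert behaviour described in the example (no separate add/update, rows unique on `TextChunk.id`) can be illustrated with a minimal in-memory sketch. Everything here (`Chunk`, `ToyStore`, `chunk`) is a hypothetical stand-in written for illustration; it is not the crate's implementation and does not touch LanceDb.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a text chunk; not the crate's `TextChunk` type.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    id: String,
    text: String,
}

/// Toy in-memory store keyed on `id`, used only to illustrate upsert semantics.
#[derive(Default)]
struct ToyStore {
    rows: HashMap<String, Chunk>,
}

impl ToyStore {
    /// Insert each chunk, replacing any existing row that has the same id.
    fn upsert(&mut self, chunks: &[Chunk]) {
        for chunk in chunks {
            self.rows.insert(chunk.id.clone(), chunk.clone());
        }
    }

    /// Look up a single row by id.
    fn get(&self, id: &str) -> Option<&Chunk> {
        self.rows.get(id)
    }

    /// Number of rows currently stored.
    fn count(&self) -> usize {
        self.rows.len()
    }
}

/// Small helper to build a chunk from string slices.
fn chunk(id: &str, text: &str) -> Chunk {
    Chunk { id: id.to_string(), text: text.to_string() }
}

fn main() {
    let mut store = ToyStore::default();
    store.upsert(&[chunk("1", "first draft"), chunk("2", "another line")]);
    // Upserting an existing id updates that row instead of adding a new one.
    store.upsert(&[chunk("1", "revised text")]);
    println!("count = {}", store.count());
    println!("row 1 = {:?}", store.get("1"));
}
```

Keying on the id is what makes a single write path sufficient: a caller never has to check whether a chunk already exists before writing it.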

Architecture

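VecEmbedStore wraps an embedding engine and a vector database, providing a simple interface for storing and querying text chunks. Currently, FastEmbed-rs is used for the embeddings and LanceDb for the vector database.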


    +----------------------------------------------------------+
    |                      VecEmbedStore                       |
    |                                                          |
    |  +-------------------+           +--------------+        |
    |  | EmbeddingEngine   |           |  VectorDB    |        |
    |  +-------------------+           +--------------+        |
    |                                                          |
    +----------------------------------------------------------+
               ^                              |
               | store                        | similarity search
               |                              v
    +--------------+                   +-----------------------+
    |  TextBlock   |                   | ComparedTextBlock     |
    +--------------+                   +-----------------------+
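
The flow in the diagram (embed on store, embed the query on search, rank by distance) can be sketched in a few lines. This is a toy model under stated assumptions, not the crate's code: the `EmbeddingEngine` trait, `LetterCountEngine`, `ToyEmbedStore`, and `Scored` are hypothetical names, the "embedding" is a simple letter count standing in for FastEmbed-rs, a `Vec` stands in for LanceDb, and cosine distance is assumed as the metric.

```rust
/// Hypothetical embedding-engine abstraction (the crate itself uses FastEmbed-rs).
trait EmbeddingEngine {
    fn embed(&self, text: &str) -> Vec<f32>;
}

/// Toy engine: counts of the letters a-z, a stand-in for a real model.
struct LetterCountEngine;

impl EmbeddingEngine for LetterCountEngine {
    fn embed(&self, text: &str) -> Vec<f32> {
        let mut v = vec![0.0f32; 26];
        for b in text.bytes().filter(u8::is_ascii_alphabetic) {
            v[(b.to_ascii_lowercase() - b'a') as usize] += 1.0;
        }
        v
    }
}

/// Cosine distance: 0.0 for the same direction, 1.0 for orthogonal vectors.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 1.0; // treat empty vectors as maximally distant
    }
    1.0 - dot / (norm_a * norm_b)
}

/// One stored row: the text plus its embedding.
struct Row {
    id: String,
    text: String,
    vector: Vec<f32>,
}

/// A search hit, carrying the distance to the query.
#[derive(Debug)]
struct Scored {
    id: String,
    text: String,
    distance: f32,
}

/// Wrapper that owns both the engine and the (toy) vector storage.
struct ToyEmbedStore<E: EmbeddingEngine> {
    engine: E,
    rows: Vec<Row>, // stand-in for the vector database
}

impl<E: EmbeddingEngine> ToyEmbedStore<E> {
    fn new(engine: E) -> Self {
        Self { engine, rows: Vec::new() }
    }

    /// Embed the text and store it, replacing any row with the same id.
    fn store(&mut self, id: &str, text: &str) {
        let vector = self.engine.embed(text);
        self.rows.retain(|r| r.id != id);
        self.rows.push(Row { id: id.to_string(), text: text.to_string(), vector });
    }

    /// Embed the query and return the `limit` closest rows, nearest first.
    fn similar_to(&self, query: &str, limit: usize) -> Vec<Scored> {
        let q = self.engine.embed(query);
        let mut scored: Vec<Scored> = self
            .rows
            .iter()
            .map(|r| Scored {
                id: r.id.clone(),
                text: r.text.clone(),
                distance: cosine_distance(&q, &r.vector),
            })
            .collect();
        scored.sort_by(|a, b| a.distance.total_cmp(&b.distance));
        scored.truncate(limit);
        scored
    }
}

fn main() {
    let mut store = ToyEmbedStore::new(LetterCountEngine);
    store.store("1", "suddenly there came a tapping");
    store.store("2", "forgotten lore");
    for hit in store.similar_to("a tapping", 1) {
        println!("ID: {}, Text: {}, Distance: {:.3}", hit.id, hit.text, hit.distance);
    }
}
```

Putting the engine behind a trait is one way to keep the storage side independent of the embedding model; whether the crate is structured this way internally is not stated in the text above.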

Dependencies

~89MB
~1.5M SLoC