6 versions

Version | Date |
---|---|
0.3.2 | Jul 9, 2024 |
0.3.1 | Jun 13, 2024 |
0.3.0 | May 19, 2024 |
0.2.0 | May 18, 2024 |
0.1.1 | May 17, 2024 |
#141 in Database implementations
37KB
523 lines
VecEmbedStore
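A lightweight wrapper around LanceDb (a vector database) that lets you create, store, and query embeddings in LanceDb without needing deep knowledge of the underlying Arrow/columnar database technology.
Usage example
Add VecEmbedStore to your dependencies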
cargo add vec_embed_store
# To select an embedding model other than the default, you currently also need to add fastembed.
# There is an open issue to remove this requirement: https://github.com/samkeen/vec-embed-store/issues/9
# [optional]
cargo add fastembed
use std::path::PathBuf;
use vec_embed_store::{EmbeddingsDb, EmbeddingEngineOptions, TextChunk, SimilaritySearch};
use fastembed::EmbeddingModel::BGESmallENV15;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Set up the embedding engine options
    let embedding_engine_options = EmbeddingEngineOptions {
        model_name: BGESmallENV15, // see https://docs.rs/fastembed/latest/fastembed/enum.EmbeddingModel.html
        cache_dir: PathBuf::from("path/to/cache"),
        show_download_progress: true,
        ..Default::default()
    };

    // Create a new instance of EmbeddingsDb
    let embed_db = EmbeddingsDb::new("path/to/db", embedding_engine_options).await?;

    // Define the texts to add to the database. Chunk the text however you see fit,
    // as long as the chunks suit the chosen embedding model.
    let texts = vec![
        TextChunk {
            id: "1".to_string(),
            text: "Once upon a midnight dreary, while I pondered, weak and weary,".to_string(),
        },
        TextChunk {
            id: "2".to_string(),
            text: "Over many a quaint and curious volume of forgotten lore—".to_string(),
        },
        TextChunk {
            id: "3".to_string(),
            text: "While I nodded, nearly napping, suddenly there came a tapping,".to_string(),
        },
    ];

    // Upsert the texts into the embeddings database (there is no separate add/update).
    // TextChunks MUST be unique on `TextChunk.id`.
    embed_db.upsert_texts(&texts).await?;

    // Retrieve a text by its ID
    let retrieved_text = embed_db.get_text_by_id("1").await?;
    println!("Retrieved text: {:?}", retrieved_text);

    // Define a text for similarity search
    let search_text = "suddenly there came a tapping";

    // Perform a similarity search
    let search_results = embed_db
        .get_similar_to(search_text)
        .limit(2)
        .threshold(0.8)
        .execute()
        .await?;
    println!("Similarity search results:");
    for result in search_results {
        println!("ID: {}, Text: {}, Distance: {}", result.id, result.text, result.distance);
    }

    // Get all text chunks from the database
    let all_texts = embed_db.get_all_texts().await?;
    println!("All texts: {:?}", all_texts);

    // Delete texts by their IDs
    let ids_to_delete = vec!["2".to_string(), "3".to_string()];
    embed_db.delete_texts(&ids_to_delete).await?;

    // Get the count of items in the database
    let count = embed_db.items_count().await?;
    println!("Number of items in the database: {}", count);

    // Clear all data from the database
    embed_db.empty_db().await?;

    Ok(())
}
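To make the upsert and similarity-search semantics in the example more concrete, here is a minimal, dependency-free sketch of the same ideas. It is not the crate's implementation: `ToyStore`, `toy_embed`, and `cosine_distance` are hypothetical names, the "embedding" is a simple letter-count vector rather than a FastEmbed model, and the threshold is assumed, purely for illustration, to be an upper bound on cosine distance (the crate's actual threshold semantics may differ).

```rust
use std::collections::HashMap;

/// Toy stand-in for an embedding: counts of each ASCII letter (case-insensitive).
/// A real engine such as FastEmbed would produce a learned dense vector instead.
fn toy_embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; 26];
    for b in text.bytes().filter(|b| b.is_ascii_alphabetic()) {
        v[(b.to_ascii_lowercase() - b'a') as usize] += 1.0;
    }
    v
}

/// Cosine distance: close to 0.0 for vectors pointing the same way, 1.0 for orthogonal ones.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 1.0; // treat an empty vector as maximally dissimilar
    }
    1.0 - dot / (norm_a * norm_b)
}

/// Hypothetical in-memory store keyed by chunk id.
#[derive(Default)]
struct ToyStore {
    rows: HashMap<String, (String, Vec<f32>)>,
}

impl ToyStore {
    /// Upsert: insert a new row, or overwrite the row that already has this id.
    fn upsert(&mut self, id: &str, text: &str) {
        self.rows
            .insert(id.to_string(), (text.to_string(), toy_embed(text)));
    }

    /// Return up to `limit` (id, distance) pairs whose distance is at most
    /// `max_distance`, nearest first.
    fn similar_to(&self, query: &str, limit: usize, max_distance: f32) -> Vec<(String, f32)> {
        let q = toy_embed(query);
        let mut hits: Vec<(String, f32)> = self
            .rows
            .iter()
            .map(|(id, (_, v))| (id.clone(), cosine_distance(&q, v)))
            .filter(|(_, d)| *d <= max_distance)
            .collect();
        hits.sort_by(|a, b| a.1.total_cmp(&b.1));
        hits.truncate(limit);
        hits
    }
}

fn main() {
    let mut store = ToyStore::default();
    store.upsert("1", "Once upon a midnight dreary");
    store.upsert("1", "suddenly there came a tapping"); // same id: replaces the first row
    store.upsert("2", "Over many a quaint and curious volume");
    println!("rows: {}", store.rows.len());

    for (id, distance) in store.similar_to("suddenly there came a tapping", 2, 0.8) {
        println!("id={id} distance={distance:.3}");
    }
}
```

Because rows are keyed by id, writing the same id twice leaves a single row, which is why ids must be unique per chunk. The limit and the threshold are independent filters: the threshold drops weak matches, and the limit then caps how many of the remaining matches are returned.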
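Architecture
VecEmbedStore wraps an embedding engine and a vector database behind a simple interface for storing and querying text chunks. It currently uses FastEmbed-rs for embeddings and LanceDb as the vector database.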
+----------------------------------------------------------+
|                      VecEmbedStore                       |
|                                                          |
|   +-------------------+            +--------------+      |
|   |  EmbeddingEngine  |            |   VectorDB   |      |
|   +-------------------+            +--------------+      |
|                                                          |
+----------------------------------------------------------+
        ^                                   |
        | store                             | similarity search
        |                                   v
+--------------+                 +-----------------------+
|  TextBlock   |                 |   ComparedTextBlock   |
+--------------+                 +-----------------------+
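The composition in the diagram can be sketched in plain Rust as a wrapper that owns one embedding engine and one vector database. This is only an illustration of the shape: the traits `EmbeddingEngine` and `VectorDb`, the wrapper `Store`, and the toy implementations `LenEngine` and `MemDb` are all hypothetical, and the type names `TextBlock` and `ComparedTextBlock` are borrowed from the diagram rather than from the crate's public API.

```rust
/// Input to `store` (named after the diagram's TextBlock).
#[derive(Debug, Clone)]
struct TextBlock {
    id: String,
    text: String,
}

/// Result of a similarity search: a stored block plus its distance to the query.
#[derive(Debug, Clone)]
struct ComparedTextBlock {
    id: String,
    text: String,
    distance: f32,
}

/// Hypothetical trait: turns text into a vector.
trait EmbeddingEngine {
    fn embed(&self, text: &str) -> Vec<f32>;
}

/// Hypothetical trait: stores vectors and finds the nearest ones.
trait VectorDb {
    fn put(&mut self, block: TextBlock, vector: Vec<f32>);
    fn nearest(&self, query: &[f32], limit: usize) -> Vec<ComparedTextBlock>;
}

/// The wrapper: callers deal only in text, never in vectors.
struct Store<E, D> {
    engine: E,
    db: D,
}

impl<E: EmbeddingEngine, D: VectorDb> Store<E, D> {
    fn store(&mut self, block: TextBlock) {
        let vector = self.engine.embed(&block.text);
        self.db.put(block, vector);
    }

    fn similarity_search(&self, query: &str, limit: usize) -> Vec<ComparedTextBlock> {
        self.db.nearest(&self.engine.embed(query), limit)
    }
}

/// Toy engine: embeds text as [byte length, vowel count].
struct LenEngine;

impl EmbeddingEngine for LenEngine {
    fn embed(&self, text: &str) -> Vec<f32> {
        let vowels = text.chars().filter(|c| "aeiouAEIOU".contains(*c)).count();
        vec![text.len() as f32, vowels as f32]
    }
}

/// Toy database: a Vec scanned linearly using Euclidean distance.
struct MemDb {
    rows: Vec<(TextBlock, Vec<f32>)>,
}

impl VectorDb for MemDb {
    fn put(&mut self, block: TextBlock, vector: Vec<f32>) {
        self.rows.retain(|(b, _)| b.id != block.id); // replace any row with the same id
        self.rows.push((block, vector));
    }

    fn nearest(&self, query: &[f32], limit: usize) -> Vec<ComparedTextBlock> {
        let mut out: Vec<ComparedTextBlock> = self
            .rows
            .iter()
            .map(|(b, v)| {
                let sq: f32 = v.iter().zip(query).map(|(x, y)| (x - y) * (x - y)).sum();
                ComparedTextBlock { id: b.id.clone(), text: b.text.clone(), distance: sq.sqrt() }
            })
            .collect();
        out.sort_by(|a, b| a.distance.total_cmp(&b.distance));
        out.truncate(limit);
        out
    }
}

fn main() {
    let mut s = Store { engine: LenEngine, db: MemDb { rows: Vec::new() } };
    s.store(TextBlock { id: "a".into(), text: "abc".into() });
    s.store(TextBlock { id: "b".into(), text: "abcdefghij".into() });
    for hit in s.similarity_search("abc", 1) {
        println!("{:?}", hit);
    }
}
```

Keeping the engine and the database behind separate traits is one way to let either side be swapped without touching callers, which matches the README's note that FastEmbed-rs and LanceDb are the current choices.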
Dependencies
~89MB
~1.5M SLoC