Hypernonsense

13个版本 (稳定)

2.2.0	2023年2月15日
2.0.0	2023年1月31日
1.1.0	2022年1月6日
1.0.2	2020年3月27日
0.4.0	2019年2月13日

#865 在算法中

每月47次 下载

MIT 许可证

23KB
360 行

基于随机投影的局部敏感哈希的实现。

局部敏感哈希允许你在高维空间中查询一个非常大的点集，以找到查询点的最近点。

这对于处理词向量非常有用，其中单词之间的余弦距离是这些单词的相似度。您可以使用hypernonsense在词空间中找到与给定点最相似的单词。

使用方法

Hypernonsense包含两种索引类型。基本的hyperindex生成结果非常快，但准确性相对较低。包含多个hyperindex的multiindex通过汇总结果来提高准确性。

hyperindex

// We're working in 300 dimensional space
const dimension : usize = 300;

// How many planes should the space be split by. More planes increases speed but decreases accuracy
// A good starting value is: 2 ^ planes * 10 == number of points
const planes : u8 = 10;

// Create the index
let mut index = HyperIndex::new(dimension, planes, &mut thread_rng());

// Populate the index with random data
let mut rng = thread_rng();
for key in 0..10000usize {

    // Generate a random vector
    let v = random_unit_vector(dimension, &mut rng);
    
    // The `key` can be any type - When you query the hyperindex you will get back a set of keys. In this case we'll just use the index.    
    index.add((key, v));
}

// Find approximately the nearest vectors to a random query vector. The key we used was `usize` so we get back a `Vec<usize>`
let result : Option<&Vec<usize>> = index.group(random_unit_vector(dimension, &mut rng));

hyperindex的结果检索非常快（它是O(planes)）。然而，结果的质量非常差，点被预先排列成组，任何位于超平面附近但未检索到的点都不会检索到所有最近点。从hyperindex查询时，它将返回索引内部结构的引用，这是一个近似最近键的（任意顺序）Vec。

multiindex // We're working in 300 dimensional space const dimension : usize = 300; // How many sub-indices should there be. More indices increases accuracy, but decreases speed and increases memory consumption. const indices : u8 = 10 // How many planes should the space be split by. More planes increases speed but decreases accuracy // A good starting value is: 2 ^ planes * 10 == number of points const planes : u8 = 10; // Create the index let mut index = MultiIndex::new(dimension, 10, 10, &mut thread_rng()); // Populate index // ...exactly the same as the hyperindex example // Find the approximately nearest vectors to a random query vector. The key we used was `usize` so we get back a `Vec<DistanceNode<usize>>` // This requires us to supply the number of points we want and a distance metric to choose them by const nearest_count : usize = 100; let result : Vec<DistanceNode<i32>> = index.nearest(&random_unit_vector(dimension, &mut rng), nearest_count, |point, key| { return distance(point, get_vector_by_key(key)); }); multiindex通过同时查询多个hyperindex实例并将它们的结果汇总来解决单个hyperindex结果质量差的问题。这允许您通过增加indices计数来直接权衡速度和准确性。从multiindex查询时，您可以指定要检索的项目数量（本例中为100）以及按距离度量的顺序。调整参数当使用这个功能时，您必须意识到它是一个概率数据结构 - 返回的结果是近似正确的。您应该尝试调整两个参数，直到达到您满意的性能和精度水平。平面 hyperindex在它生成的随机超平面之间的空间被分成区域，当您查询索引时，您将简单地得到同一区域内的点列表。如果您没有很多超平面，您将得到非常大的结果集，并且索引实际上并没有帮助您。这个极限是零超平面，在这种情况下，每个查询都将返回整个集合！如果您有太多的超平面，每个组中的点数都非常少，结果的质量将受到影响，因为正确答案被分类到不同组的可能性越来越高。一旦 2 ^ 平面 >= 数据点，很可能每个点都将单独在一个组中，并且索引几乎无用。子索引计数 multiindex包含多个hyperindex实例，每个实例使用相同数量的平面进行分割，但具有不同的（随机化）平面方向。当您查询multiindex时，它会查询所有子索引并将结果合并在一起。这种方法通过从每个子查询中选取最佳结果（根据距离度量）来提高查询的准确性。增加子索引计数会增加内存消耗和查询时间。如果您没有很多子索引，您基本上只是在查询一个hyperindex，距离度量将无用。

依赖关系 ~3MB ~61K SLoC bit-vec rand 0.8.4 rand_distr rayon time 0.3.5