3 个不稳定版本

0.2.1	2024年6月21日
0.2.0	2024年6月20日
0.1.0	2024年6月15日

#668 在算法中

每月 133 次下载

MIT 许可证

30KB
480 行

catclustering

crate 链接: crates.io/crates/catclustering

此 crate 实现了针对分类特征和（近似）完全链接优化的层次聚类。

初始化步骤

为了避免计算邻接矩阵中所有成对距离的效率低下，算法通过检查排序数据集的相邻行来识别邻居。

由于列的顺序会影响结果，数据集会被多次排序，每次使用不同的随机列顺序。这种方法大致相当于通过将随机投影到一维线上来识别邻居。可以使用 'init_iterations' 参数控制迭代次数。

此初始步骤使此 crate 特别适合于分类数据。

距离函数

为了确保算法正常工作，距离函数必须满足以下性质

d(x1 U x2, y1 U y2) >= d(x1, y1)

对于任何聚类，包括空聚类，x1、x2、y1、y2。

此性质由原始的完全链接距离满足。

然而，由于完全链接计算成本高，我们建议以下近似，它也表现出所需性质

|clusterset1| + |clusterset2| - |intersection(clusterset1, clusterset2)|

|clusterset1| 是计算聚类与其自身之间的距离后剩下的值（d(X, X) = |X|+ |X| - |intersection(X, X)| = |X|）。

这类似于原始的完全链接，其中 d(X, X) 等于聚类至少包含两个不同行时的 X 的簇内距离。从这个角度来看，|clusterset1| - 1 可以解释为簇内距离的一个近似。

请注意，如|union(clusterset1, clusterset2)| - |intersection(clusterset1, clusterset2)|之类的函数不满足d(x1 U x2, y1 U y2) >= d(x1, y1)，并且在此上下文中无法正常工作。


距离实现
通过使用位掩码来实现集合操作是最快的方式，前提是唯一类别的数量可以适应掩码。您可以根据需要组合位掩码和哈希集或roaring位图。



操作/数据结构
位掩码
哈希集
roaring位图




交集的基数
(x&y).count_ones()
x.intersection(&y).count()
x.intersection_len(&y)


并集的基数
(x|y).count_ones()
x.union(&y).count()
x.union_len(&y)


对称差的基数
(x^y).count_ones()
x.symmetric_difference(&y).count()
x.symmetric_difference_len(&y)


合并
x |= y
x.extend(&y)
x.extend(&y)



示例
以下是一个示例数据集，它包含4个低基数列和1个高基数列。
use std::vec::Vec;
use std::collections::HashSet;
use std::any::Any;
use rand::Rng;

struct SimpleMatrix {
    col1to4: u32,
    col5: HashSet<u16>
}

struct MyData {
    vecs: Vec<Vec<i32>>
}

impl catclustering::ClusterSummary for SimpleMatrix {
    fn summary_size(&self) -> usize {
        self.col1to4.count_ones() as usize + self.col5.len() 
    }
    fn distance(&self, other: &dyn catclustering::ClusterSummary) -> f32 {
        let o = other.as_any().downcast_ref::<SimpleMatrix>().unwrap();
        let intersection = (self.col1to4 & o.col1to4).count_ones() as usize + self.col5.intersection(&o.col5).count();

        (self.summary_size() + other.summary_size()) as f32 - intersection as f32
    }
    
    fn extend(&mut self, other: &dyn catclustering::ClusterSummary) {
        let o = other.as_any().downcast_ref::<SimpleMatrix>().unwrap();

        self.col1to4 |= o.col1to4;
        self.col5.extend(&o.col5);
    }
    fn clear(&mut self) {
        self.col5.clear();
        self.col5.shrink_to_fit();
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}   

impl catclustering::IndexableData for MyData {
    fn get_value(&self, row_index: usize, column_index: usize) -> f32 {
        self.vecs[row_index][column_index] as f32
    }

    fn get_num_columns(&self) -> usize {
        self.vecs[0].len()
    }

    fn get_num_rows(&self) -> usize {
        self.vecs.len()
    }

    fn create_cluster_summary(&self, row_index: usize) -> Box<dyn catclustering::ClusterSummary> {
        let row = &self.vecs[row_index];
        Box::new(SimpleMatrix {
            col1to4: (1 << row[0])
                | (1 << (row[1] + 8))
                | (1 << (row[2] + 16))
                | (1 << (row[3] + 24)),
            col5: HashSet::from_iter(vec![row[4] as u16]),
        })
    }
}

/// Just a utility function to create a random data set
fn create_random_matrix(rows: usize, cardinality: [i32; 5]) -> Vec<Vec<i32>> {
    let mut rng = rand::thread_rng();
    let mut matrix = Vec::with_capacity(rows);

    for _ in 0..rows {
        let row: Vec<i32> = (0..cardinality.len())
            .map(|k| rng.gen_range(0..1000) % cardinality[k])
            .collect();
        matrix.push(row);
    }

    matrix
}

fn main() {
    let matrix = MyData{vecs: create_random_matrix(10_000, [8, 8, 8, 8, 2000])};

    // main algorithm
    let mut rng = rand::thread_rng(); 
    let dendro = catclustering::create_dendrogram(&matrix, None, &mut rng);
   
    // more interpretable results
    let clusters = catclustering::find_clusters(&dendro, 100);
    
    // or, if you just want the assignments:
    let mut assignments = Vec::new(); // feel free to reuse this vector 
    let num_clusters = catclustering::assign_rows_to_clusters(&dendro, &mut assignments, 100);
}

基准测试
基准测试是从上面的示例中运行的。



行
时间




10,000
2.5秒


100,000
29秒


1,000,000
4分钟


10,000,000
46分钟



规格

i7-6700 @ 3.40GHz
Linux pop-os 6.5.4
rustc 1.76.0

操作/数据结构	位掩码	哈希集	roaring位图
交集的基数	(x&y).count_ones()	x.intersection(&y).count()	x.intersection_len(&y)
并集的基数	(x\|y).count_ones()	x.union(&y).count()	x.union_len(&y)
对称差的基数	(x^y).count_ones()	x.symmetric_difference(&y).count()	x.symmetric_difference_len(&y)
合并	x \|= y	x.extend(&y)	x.extend(&y)

行	时间
10,000	2.5秒
100,000	29秒
1,000,000	4分钟
10,000,000	46分钟



依赖关系

~315KB






rand




 dev  criterion 0.5