MiniBoosts: ML/AI/statistics in Rust // Lib.rs

10 releases

0.3.4	Aug 12, 2024
0.3.3	Dec 23, 2023
0.3.0	Nov 4, 2023
0.2.3	Aug 5, 2023
0.1.0	Feb 11, 2023

#30 in Machine learning

171 downloads per month

MIT license

405KB
9K SLoC


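MiniBoosts is a library for developing boosting algorithms.
Boosting is a repeated game between a Booster and a Weak Learner.

In each round of the game,

- the Booster chooses a distribution over the training examples,
- then the Weak Learner chooses a hypothesis (function) whose accuracy with respect to that distribution is slightly better than random guessing.

After sufficiently many rounds, the Booster outputs a hypothesis that performs significantly better on the training examples.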
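The game described above can be sketched in plain Rust without the crate. The following is a minimal AdaBoost-style loop over one-dimensional decision stumps. `Stump`, `weak_learn`, `adaboost`, `predict`, and `training_errors` are names made up for this illustration; they are not part of the MiniBoosts API, and the toy data set is an assumption.

```rust
/// Toy weak hypothesis (hypothetical, not from MiniBoosts):
/// predicts `sign` if x > threshold, and `-sign` otherwise.
#[derive(Clone, Copy, Debug)]
struct Stump {
    threshold: f64,
    sign: f64,
}

impl Stump {
    fn predict(&self, x: f64) -> f64 {
        if x > self.threshold { self.sign } else { -self.sign }
    }
}

/// Weak learner: returns the stump with the smallest weighted error under `dist`.
fn weak_learn(xs: &[f64], ys: &[f64], dist: &[f64]) -> Stump {
    let mut best = Stump { threshold: f64::NEG_INFINITY, sign: 1.0 };
    let mut best_err = f64::INFINITY;
    let candidates = std::iter::once(f64::NEG_INFINITY).chain(xs.iter().copied());
    for threshold in candidates {
        for sign in [1.0, -1.0] {
            let h = Stump { threshold, sign };
            let err: f64 = (0..xs.len())
                .filter(|&i| h.predict(xs[i]) != ys[i])
                .map(|i| dist[i])
                .sum();
            if err < best_err {
                best_err = err;
                best = h;
            }
        }
    }
    best
}

/// Booster: runs at most `rounds` rounds and returns (weight, hypothesis) pairs.
fn adaboost(xs: &[f64], ys: &[f64], rounds: usize) -> Vec<(f64, Stump)> {
    let n = xs.len();
    let mut dist = vec![1.0 / n as f64; n]; // start from the uniform distribution
    let mut combined = Vec::new();
    for _ in 0..rounds {
        let h = weak_learn(xs, ys, &dist);
        // Edge of `h` w.r.t. `dist`; a positive edge means "better than random".
        let edge: f64 = (0..n).map(|i| dist[i] * ys[i] * h.predict(xs[i])).sum();
        if edge >= 1.0 - 1e-12 {
            combined.push((1.0, h)); // `h` alone is already perfect
            break;
        }
        if edge <= 0.0 {
            break; // the weak learner failed to beat random guessing
        }
        let alpha = 0.5 * ((1.0 + edge) / (1.0 - edge)).ln();
        // Put more weight on the examples that `h` misclassifies, then normalize.
        for i in 0..n {
            dist[i] *= (-alpha * ys[i] * h.predict(xs[i])).exp();
        }
        let z: f64 = dist.iter().sum();
        dist.iter_mut().for_each(|d| *d /= z);
        combined.push((alpha, h));
    }
    combined
}

/// Combined hypothesis: sign of the weighted vote.
fn predict(combined: &[(f64, Stump)], x: f64) -> f64 {
    let score: f64 = combined.iter().map(|(a, h)| a * h.predict(x)).sum();
    if score >= 0.0 { 1.0 } else { -1.0 }
}

/// Number of misclassified training examples.
fn training_errors(combined: &[(f64, Stump)], xs: &[f64], ys: &[f64]) -> usize {
    (0..xs.len()).filter(|&i| predict(combined, xs[i]) != ys[i]).count()
}

fn main() {
    // Labels form a + + - - + + pattern, which no single stump can fit.
    let xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let ys = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0];
    for rounds in [1, 3] {
        let f = adaboost(&xs, &ys, rounds);
        println!("rounds = {rounds}, training errors = {}", training_errors(&f, &xs, &ys));
    }
}
```

Each stump is only slightly better than random guessing on the reweighted data, yet the weighted vote of a few of them fits a pattern that a single stump cannot.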

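Some boosters require enabling the `extended` feature flag in Cargo.toml, like this:

    miniboosts = { version = "0.3.3", features = ["extended"] }

These boosting algorithms use Gurobi to compute the distribution over the training examples. Thanks to Gurobi, students can use the `extended` feature for free.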

| Booster | Feature flag |
| --- | --- |
| AdaBoost by Freund and Schapire, 1997 | |
| MadaBoost by Domingo and Watanabe, 2000 | |
| GBM (Gradient Boosting Machine) by Jerome H. Friedman, 2001 | |
| LPBoost by Demiriz, Bennett, and Shawe-Taylor, 2002 | `extended` |
| SmoothBoost by Servedio, 2003 | |
| AdaBoostV by Rätsch and Warmuth, 2005 | |
| TotalBoost by Warmuth, Liao, and Rätsch, 2006 | `extended` |
| SoftBoost by Warmuth, Glocer, and Rätsch, 2007 | `extended` |
| ERLPBoost by Warmuth, Glocer, and Vishwanathan, 2008 | `extended` |
| CERLPBoost (Corrective ERLPBoost) by Shalev-Shwartz and Singer, 2010 | `extended` |
| MLPBoost by Mitsuboshi, Hatano, and Takimoto, 2022 | `extended` |
| GraphSepBoost (Graph Separation Boosting) by Alon, Gonen, Hazan, and Moran, 2023 | |

If you invent a new boosting algorithm, you can introduce it by implementing the `Booster` trait. See `cargo doc -F extended --open` for details.
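To give a feel for the "implement one trait to plug in a new algorithm" design, here is a self-contained toy version of the idea. It does not reproduce the crate's `Booster` trait, whose exact methods are documented by `cargo doc`. `ToyBooster`, `ToyAdaBoost`, `run`, and `best_in_pool` are hypothetical names, and the weak learner is reduced to a closure that returns the values y_i * h(x_i) of its chosen hypothesis.

```rust
use std::ops::ControlFlow;

/// Hypothetical trait for illustration only; NOT the `Booster` trait of MiniBoosts.
trait ToyBooster {
    /// Current distribution over the training examples.
    fn distribution(&self) -> &[f64];
    /// Receives y_i * h(x_i) for the newest hypothesis and updates the state.
    /// Returns `Break` when the booster wants to stop.
    fn update(&mut self, yh: &[f64]) -> ControlFlow<()>;
    /// Weights assigned to the hypotheses accepted so far.
    fn weights(&self) -> &[f64];
}

/// AdaBoost-style update rule written against the toy trait.
struct ToyAdaBoost {
    dist: Vec<f64>,
    weights: Vec<f64>,
}

impl ToyAdaBoost {
    fn new(n_examples: usize) -> Self {
        Self { dist: vec![1.0 / n_examples as f64; n_examples], weights: Vec::new() }
    }
}

impl ToyBooster for ToyAdaBoost {
    fn distribution(&self) -> &[f64] {
        &self.dist
    }

    fn update(&mut self, yh: &[f64]) -> ControlFlow<()> {
        let edge = self.dist.iter().zip(yh).map(|(d, v)| d * v).sum::<f64>();
        if edge <= 0.0 || edge >= 1.0 {
            return ControlFlow::Break(()); // useless or already perfect hypothesis
        }
        let alpha = 0.5 * ((1.0 + edge) / (1.0 - edge)).ln();
        for (d, v) in self.dist.iter_mut().zip(yh) {
            *d *= (-alpha * v).exp();
        }
        let z = self.dist.iter().sum::<f64>();
        for d in self.dist.iter_mut() {
            *d /= z;
        }
        self.weights.push(alpha);
        ControlFlow::Continue(())
    }

    fn weights(&self) -> &[f64] {
        &self.weights
    }
}

/// Generic driver: works with any `ToyBooster`. Returns the number of completed rounds.
fn run<B: ToyBooster>(
    booster: &mut B,
    oracle: impl Fn(&[f64]) -> Vec<f64>,
    max_rounds: usize,
) -> usize {
    for round in 0..max_rounds {
        let yh = oracle(booster.distribution());
        if booster.update(&yh).is_break() {
            return round;
        }
    }
    max_rounds
}

/// Stand-in weak learner: picks the hypothesis with the largest edge from a fixed pool.
fn best_in_pool(pool: &[Vec<f64>], dist: &[f64]) -> Vec<f64> {
    let edge = |yh: &Vec<f64>| dist.iter().zip(yh).map(|(d, v)| d * v).sum::<f64>();
    let mut best = &pool[0];
    for yh in &pool[1..] {
        if edge(yh) > edge(best) {
            best = yh;
        }
    }
    best.clone()
}

fn main() {
    // Three hypotheses over six examples; each one errs on a different pair of examples.
    let pool = vec![
        vec![1.0, 1.0, -1.0, -1.0, 1.0, 1.0],
        vec![1.0, 1.0, 1.0, 1.0, -1.0, -1.0],
        vec![-1.0, -1.0, 1.0, 1.0, 1.0, 1.0],
    ];
    let mut booster = ToyAdaBoost::new(6);
    let rounds = run(&mut booster, |d| best_in_pool(&pool, d), 3);
    println!("rounds = {rounds}, weights = {:?}", booster.weights());
}
```

The driver `run` never looks inside the booster, so a new algorithm only has to supply its own `update` rule. That separation is the design idea this sketch is meant to convey.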

Currently, no weak learner uses Gurobi, so you can use all weak learners without enabling the `extended` flag.

Weak Learners

- Decision Tree
- Regression Tree
- A worst-case weak learner for LPBoost
- Gaussian Naive Bayes
- Neural Network (experimental)

Why MiniBoosts?

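If you write a paper about boosting algorithms, you need to compare your algorithm against others. At that point, several issues arise.

- Some boosting algorithms, such as LightGBM or XGBoost, are implemented and freely available. They are very easy to use from Python3 but hard to compare against other algorithms because they are implemented in C++ internally. Implementing your algorithm in Python3 makes the running-time comparison unfair (Python3 is significantly slower than C++). However, implementing it in C++ is extremely hard (based on my experience).
- Most boosting algorithms are designed for a decision-tree weak learner, even though the boosting protocol does not require one.
- Margin-optimizing boosting algorithms have no available implementations. In binary classification, margin optimization is a better goal than empirical risk minimization.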
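To make the last point concrete, the sketch below compares two weighted votes over the same three hypotheses. Both classify every training example correctly, so empirical risk cannot tell them apart, while the minimum margin can. The functions `margins`, `min_margin`, and `training_error` and the toy numbers are assumptions made for this illustration, not part of the crate.

```rust
/// Margins of a weighted vote.
/// `yh[j][i]` holds y_i * h_j(x_i), i.e. +1.0 if hypothesis j is right on example i
/// and -1.0 otherwise. Weights are normalized to sum to one.
fn margins(yh: &[Vec<f64>], weights: &[f64]) -> Vec<f64> {
    let total: f64 = weights.iter().sum();
    let n_examples = yh[0].len();
    (0..n_examples)
        .map(|i| yh.iter().zip(weights).map(|(h, w)| w * h[i]).sum::<f64>() / total)
        .collect()
}

/// Smallest margin over the training examples (the hard-margin objective).
fn min_margin(margins: &[f64]) -> f64 {
    margins.iter().copied().fold(f64::INFINITY, f64::min)
}

/// Fraction of examples with non-positive margin (the zero-one training error).
fn training_error(margins: &[f64]) -> f64 {
    let wrong = margins.iter().filter(|&&m| m <= 0.0).count();
    wrong as f64 / margins.len() as f64
}

fn main() {
    // Three hypotheses, three examples; hypothesis j is wrong only on example j.
    let yh = vec![
        vec![-1.0, 1.0, 1.0],
        vec![1.0, -1.0, 1.0],
        vec![1.0, 1.0, -1.0],
    ];
    let uniform = margins(&yh, &[1.0, 1.0, 1.0]);
    let skewed = margins(&yh, &[0.4, 0.35, 0.25]);
    println!(
        "uniform: error = {}, min margin = {:.3}",
        training_error(&uniform),
        min_margin(&uniform)
    );
    println!(
        "skewed:  error = {}, min margin = {:.3}",
        training_error(&skewed),
        min_margin(&skewed)
    );
}
```

A margin-optimizing booster would prefer the uniform vote here, since it separates every example by a wider gap even though the training errors are identical.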

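MiniBoosts is a crate that addresses the issues above.
This crate provides the following.

- Two main traits, named `Booster` and `WeakLearner`.
  - If you invent a new boosting algorithm, you only need to implement `Booster`.
- Some famous boosting algorithms, including AdaBoost, LPBoost, ERLPBoost, etc.
- Some weak learners, including Decision Tree, Regression Tree, etc.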

MiniBoosts for research

Sometimes, one wants to log each step of the boosting procedure. You can use the `Logger` struct to write the log to a .csv file while printing the status to the terminal. See the Research feature section for details.

How to use

Write the following to Cargo.toml.

    miniboosts = { version = "0.3.3" }

If you want to use the `extended` feature, enable the flag:

    miniboosts = { version = "0.3.3", features = ["extended"] }

Here is a sample code:

    use miniboosts::prelude::*;

    fn main() {
        // Set file name
        let file = "/path/to/input/data.csv";

        // Read the CSV file
        // The column named `class` corresponds to the labels (targets).
        let sample = SampleReader::new()
            .file(file)
            .has_header(true)
            .target_feature("class")
            .read()
            .unwrap();

        // Set tolerance parameter as `0.01`.
        let tol: f64 = 0.01;

        // Initialize Booster
        let mut booster = AdaBoost::init(&sample)
            .tolerance(tol); // Set the tolerance parameter.

        // Construct `DecisionTree` Weak Learner from `DecisionTreeBuilder`.
        let weak_learner = DecisionTreeBuilder::new(&sample)
            .max_depth(3) // Specify the max depth (default is 2)
            .criterion(Criterion::Twoing) // Choose the split rule.
            .build(); // Build `DecisionTree`.

        // Run the boosting algorithm
        // Each booster returns a combined hypothesis.
        let f = booster.run(&weak_learner);

        // Get the batch prediction for all examples in `sample`.
        let predictions = f.predict_all(&sample);

        // You can predict the `i`th instance.
        let i = 0_usize;
        let prediction = f.predict(&sample, i);

        // You can serialize the hypothesis `f` to a JSON `String`.
        let s = serde_json::to_string(&f);
    }

If you use boosting for soft margin optimization, initialize the booster like this:

    let n_sample = sample.shape().0; // Get the number of training examples
    let nu = n_sample as f64 * 0.2; // Set the upper-bound of the number of outliers.
    let lpboost = LPBoost::init(&sample)
        .tolerance(tol)
        .nu(nu); // Set a capping parameter.

Note that the capping parameter must satisfy `1 <= nu && nu <= n_sample`.

Research feature

This crate can output a CSV file that records values such as the objective value, training loss, test loss, and running time at each step. Here is an example:

    use miniboosts::prelude::*;
    use miniboosts::{
        Logger,
        LoggerBuilder,
        SoftMarginObjective,
    };

    // Define a loss function
    fn zero_one_loss<H>(sample: &Sample, f: &H) -> f64
        where H: Classifier
    {
        let n_sample = sample.shape().0 as f64;

        let target = sample.target();

        f.predict_all(sample)
            .into_iter()
            .zip(target.into_iter())
            .map(|(fx, &y)| if fx != y as i64 { 1.0 } else { 0.0 })
            .sum::<f64>()
            / n_sample
    }

    fn main() {
        // Read the training data
        let path = "/path/to/train/data.csv";
        let train = SampleReader::new()
            .file(path)
            .has_header(true)
            .target_feature("class")
            .read()
            .unwrap();

        // Set some parameters used later.
        let n_sample = train.shape().0 as f64;
        let nu = 0.01 * n_sample;

        // Read the test data
        let path = "/path/to/test/data.csv";
        let test = SampleReader::new()
            .file(path)
            .has_header(true)
            .target_feature("class")
            .read()
            .unwrap();

        let booster = LPBoost::init(&train)
            .tolerance(0.01)
            .nu(nu);

        let tree = DecisionTreeBuilder::new(&train)
            .max_depth(2)
            .criterion(Criterion::Entropy)
            .build();

        // Set the objective function.
        // You can use your own function by implementing the `ObjectiveFunction` trait.
        let objective = SoftMarginObjective::new(nu);

        let mut logger = LoggerBuilder::new()
            .booster(booster)
            .weak_learner(tree)
            .train_sample(&train)
            .test_sample(&test)
            .objective_function(objective)
            .loss_function(zero_one_loss)
            .time_limit_as_secs(120) // Terminate after 120 seconds
            .print_every(10) // Print log every 10 rounds.
            .build();

        // Each line of `logfile.csv` contains the following four values:
        // Objective value, Train loss, Test loss, Time per iteration
        // The returned value `f` is the combined hypothesis.
        let f = logger.run("logfile.csv")
            .expect("Failed to write the log");
    }

Others

- Currently, this crate mainly supports boosting algorithms for binary classification.
- Some boosting algorithms use the Gurobi optimizer, so you must acquire a license to use them. If you have a license, you can use these boosting algorithms (boosters) by specifying `features = ["extended"]` in Cargo.toml. If you try to use the `extended` feature without a Gurobi license, compilation fails.
- You can log your own algorithm by implementing the `Research` trait.
- Run `cargo doc -F extended --open` to see more information.
- GraphSepBoost only supports the aggregation rule shown in Lemma 4.2 of its paper.

Future work

- Boosters: AnyBoost, SparsiBoost, LogitBoost, AdaBoost.L, Branching Program
- Weak Learners: Bag of words, TF-IDF, RBF-Net
- Others: Parallelization, LP/QP solver (this work would allow you to use the `extended` features without a license)
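As a supplement to the capping parameter `nu` used in the LPBoost examples, the sketch below shows one standard way to evaluate a soft margin objective for a fixed combined hypothesis: minimize the weighted sum of margins over distributions whose entries are capped at `1 / nu`. This follows the usual textbook definition; whether the crate's `SoftMarginObjective` computes exactly this value is an assumption, and `soft_margin` is a hypothetical helper, not a MiniBoosts function.

```rust
/// Soft margin value of a fixed hypothesis with per-example margins `margins`:
/// the minimum of sum_i d_i * margins[i] over distributions d with d_i <= 1 / nu.
/// The minimum is attained greedily by putting as much mass as allowed
/// on the smallest margins first.
fn soft_margin(margins: &[f64], nu: f64) -> f64 {
    let n_sample = margins.len();
    // Same constraint as in the text: 1 <= nu <= n_sample.
    assert!(
        n_sample > 0 && 1.0 <= nu && nu <= n_sample as f64,
        "nu must satisfy 1 <= nu <= n_sample"
    );

    let mut sorted = margins.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("margins must not be NaN"));

    let cap = 1.0 / nu; // largest weight a single example may receive
    let mut remaining = 1.0; // probability mass still to be assigned
    let mut value = 0.0;
    for m in sorted {
        let mass = cap.min(remaining);
        value += mass * m;
        remaining -= mass;
        if remaining <= 0.0 {
            break;
        }
    }
    value
}

fn main() {
    let margins = [0.5, -0.1, 0.3, 0.2];
    // nu = 1 gives the hard margin (smallest margin);
    // nu = n_sample gives the average margin.
    for nu in [1.0, 2.0, 4.0] {
        println!("nu = {nu}: soft margin = {:.3}", soft_margin(&margins, nu));
    }
}
```

For an integer `nu`, the value is the average of the `nu` smallest margins, which is why `nu` can be read as an upper bound on the number of examples treated as outliers.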

Dependencies

~17-31MB
~451K SLoC

colored, fixedbitset 0.5.7, extended? grb 2.0, plotters, polars 0.41.3, rand 0.8.5, rand_distr, rayon, serde+rc+derive, serde_json+alloc