1 个不稳定版本

0.1.0	2024年2月4日

#993 在解析实现中

MIT/Apache

8KB
71 行

kmerHLL

本软件包提供了一种通过HyperLogLog(HLL)类似算法计数序列文件中唯一kmers的概率算法的实现。

它为日常序列处理而开发。我没有找到用于在序列文件中计数kmers的独立Rust程序。它也非常适用于MinHash算法，如FracMinHash，以估计Jaccard指数，其中必须知道每个基因组文件的大小。对于现实世界中的基因组/宏基因组文件，精确算法并不实用。支持原始fasta和压缩fasta文件格式。

在知道两个基因组及其并集的大小后，它还通过包含-排除规则估计Jaccard指数。

安装（必须安装夜间Rust）

git clone https://github.com/jianshu93/kmerHLL
cd kmerHLL
cargo build --release
./target/release/kmerHLL -h

Counting unique k-mers in sequence files via HLL

Usage: kmerHLL <fasta_file1> <fasta_file2> <kmer_length>

Arguments:
  <fasta_file1>  The first FASTA file to read
  <fasta_file2>  The second FASTA file to read
  <kmer_length>  The length of the k-mers

Options:
  -h, --help     Print help
  -V, --version  Print version

实现细节

我们使用了流算法包，它利用SIMD指令（所有平台的夜间Rust）来加速某些操作。

由于HyperLogLog是可合并的草图，因此并行化非常容易。合并来自每个线程的草图需要全局草图的锁。对于像宏基因组这样的大文件，序列文件将被分成块，以便每个线程可以处理序列文件的一部分。

性能和精度

使用24个线程在Intel(R) Xeon(R) Gold 6226 CPU@2.70GHz上处理15G宏基因组需要大约1.5分钟。现在是IO的限制。

您可以通过初始化HLL来调整精度，该HLL具有给定的错误率，它控制计数器的数量（寄存器）

$$MSE=\frac{3*log2 -1}{m}\approx \frac{1.07944}{m}$$

然而，无论计数器数量有多少，精度永远不会超过以下限制

$$MSE=\frac{log2/(\frac{\pi ^{2}}{6}-1)}{m}\approx \frac{1.07475}{m}$$

这是此类计数算法的Cramér Rao下限。

参考文献

Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." Discrete mathematics & theoretical computer science Proceedings (2007).

Heule, Stefan, Marc Nunkesser, and Alexander Hall. "Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683-692. 2013.

依赖项

~6.5MB
~123K SLoC