11 releases (7 breaking)

0.8.0	Jun 28, 2024
0.7.3	Jun 17, 2024
0.6.0	Dec 13, 2023
0.5.1	Aug 31, 2023
0.1.1	May 4, 2023

#8 in Biology

500 downloads per month
Used in 5 crates

MIT license

1.5MB
4K SLoC

🎼🧬 `lightmotif`

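A lightweight, platform-accelerated library for biological motif scanning using position weight matrices.

🗺️ Overview

Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. They can be used to identify transcription factor binding sites in DNA, or protease cleavage sites in polypeptides. Position weight matrices are often visualized as sequence logos.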
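
To make the scoring concrete, here is a minimal, library-independent sketch of what scanning with a position weight matrix involves: counts are turned into log-odds scores, and every window of the target sequence is scored by summing one matrix cell per position. The function names (`index`, `build_pssm`, `scan`), the base-2 logarithm, the A/C/G/T column order and the uniform 0.25 background are assumptions of this sketch, not the lightmotif API.

```rust
/// Map a DNA symbol to a column index (A, C, G, T); order chosen for this sketch.
fn index(base: u8) -> Option<usize> {
    match base {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Build a log-odds matrix from aligned motif instances of equal length,
/// using a pseudocount and a uniform background of 0.25 per base (assumed).
fn build_pssm(instances: &[&str], pseudocount: f32) -> Vec<[f32; 4]> {
    let width = instances.first().map_or(0, |s| s.len());
    let mut counts = vec![[0.0f32; 4]; width];
    for inst in instances {
        for (pos, base) in inst.bytes().enumerate() {
            if let Some(i) = index(base) {
                counts[pos][i] += 1.0;
            }
        }
    }
    counts
        .iter()
        .map(|row| {
            let total = row.iter().sum::<f32>() + 4.0 * pseudocount;
            let mut scores = [0.0f32; 4];
            for (score, &count) in scores.iter_mut().zip(row.iter()) {
                let freq = (count + pseudocount) / total;
                *score = (freq / 0.25).log2();
            }
            scores
        })
        .collect()
}

/// Score every window of `seq` in which the motif fits.
/// Symbols outside A/C/G/T make the window score negative infinity.
fn scan(pssm: &[[f32; 4]], seq: &str) -> Vec<f32> {
    let bytes = seq.as_bytes();
    if bytes.len() < pssm.len() {
        return Vec::new();
    }
    (0..=bytes.len() - pssm.len())
        .map(|start| {
            pssm.iter()
                .enumerate()
                .map(|(k, row)| index(bytes[start + k]).map_or(f32::NEG_INFINITY, |i| row[i]))
                .sum::<f32>()
        })
        .collect()
}

fn main() {
    let pssm = build_pssm(&["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"], 0.1);
    let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
    let scores = scan(&pssm, seq);
    let best = scores
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i);
    println!("{} windows, first score {}, best at {:?}", scores.len(), scores[0], best);
}
```

With these two motif instances and a pseudocount of 0.1, the first window should score close to -23.07. This scalar loop is the part that a vectorized library would be expected to accelerate.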

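The lightmotif library provides a Rust crate to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:

- Compile-time definition of alphabets and matrix dimensions.
- Sequence symbol encoding for fast table look-ups, as implemented in HMMER[1] or MEME[2].
- Striped sequence matrices to process several positions in parallel, inspired by Michael Farrar[3].
- Vectorized matrix row look-up using the permute instructions of AVX2.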
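
Two of the techniques above, symbol encoding and the striped layout, can be sketched in plain Rust. This is a simplified illustration in the style of Farrar's striping, not the memory layout lightmotif actually uses: the function names, the symbol codes (A=0, C=1, T=2, G=3, other=4) and the lane count of 4 are all assumptions made for the example.

```rust
/// Encode DNA symbols as small integers through a 256-entry lookup table,
/// so that later steps can index matrix rows directly.
fn encode(seq: &str) -> Vec<u8> {
    // Every byte maps to the "unknown" code 4 unless listed below (assumed codes).
    let mut table = [4u8; 256];
    table[b'A' as usize] = 0;
    table[b'C' as usize] = 1;
    table[b'T' as usize] = 2;
    table[b'G' as usize] = 3;
    seq.bytes().map(|b| table[b as usize]).collect()
}

/// Arrange an encoded sequence into `lanes` columns so that row `r` holds
/// positions r, r + rows, r + 2 * rows, and so on. Unused cells get `pad`.
fn stripe(encoded: &[u8], lanes: usize, pad: u8) -> Vec<Vec<u8>> {
    assert!(lanes > 0, "at least one lane is required");
    let rows = (encoded.len() + lanes - 1) / lanes;
    let mut matrix = vec![vec![pad; lanes]; rows];
    for (i, &sym) in encoded.iter().enumerate() {
        matrix[i % rows][i / rows] = sym;
    }
    matrix
}

/// Recover the linear order of the first `len` symbols from a striped matrix.
fn unstripe(matrix: &[Vec<u8>], len: usize) -> Vec<u8> {
    let rows = matrix.len();
    (0..len).map(|i| matrix[i % rows][i / rows]).collect()
}

fn main() {
    let encoded = encode("ATGTCCCAAC");
    let striped = stripe(&encoded, 4, 4);
    for row in &striped {
        println!("{:?}", row);
    }
    // Round trip: the striped matrix still contains the original sequence.
    assert_eq!(unstripe(&striped, encoded.len()), encoded);
}
```

Each row of the striped matrix holds positions that are far apart in the sequence, so a SIMD register loaded from one row can score several independent windows at once.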

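Other crates from the ecosystem provide additional features if needed:

- lightmotif-io is a crate with parser implementations for various count, frequency and position-specific scoring matrix formats, such as TRANSFAC or JASPAR.
- lightmotif-tfmpvalue is an exact reimplementation of the TFM-PVALUE[4] algorithm for converting between a score and a p-value for a given scoring matrix.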
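
To illustrate what converting a score into a p-value means, the sketch below computes it by brute force: it enumerates every word of the motif length and reports the fraction that scores at or above a threshold, assuming a uniform background. This is deliberately not the TFM-PVALUE algorithm and not the lightmotif-tfmpvalue API; the `pvalue` function and the toy matrix are hypothetical, and the approach is only practical for short motifs since it visits 4^width words.

```rust
/// P(score >= threshold) for a random word under a uniform background,
/// computed by enumerating all 4^width words (naive, for illustration only).
fn pvalue(pssm: &[[f32; 4]], threshold: f32) -> f64 {
    // Count the words whose total score reaches the threshold.
    fn count(pssm: &[[f32; 4]], pos: usize, score: f32, threshold: f32) -> u64 {
        if pos == pssm.len() {
            return (score >= threshold) as u64;
        }
        pssm[pos]
            .iter()
            .map(|&s| count(pssm, pos + 1, score + s, threshold))
            .sum::<u64>()
    }
    let total = 4u64.pow(pssm.len() as u32);
    count(pssm, 0, 0.0, threshold) as f64 / total as f64
}

fn main() {
    // Toy matrix of width 2: the first symbol scores 1, every other symbol scores 0.
    let toy: [[f32; 4]; 2] = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]];
    println!("P(score >= 2) = {}", pvalue(&toy, 2.0));
    println!("P(score >= 1) = {}", pvalue(&toy, 1.0));
}
```

For this toy matrix only one of the 16 possible words reaches a score of 2, so the p-value at that threshold is 1/16. A dedicated algorithm is needed for realistic motif widths, where full enumeration is infeasible.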

This is the Rust version; a Python package is also available.

💡 Example

use lightmotif::*;
use lightmotif::abc::Nucleotide;

// Create a count matrix from an iterable of motif sequences
let counts = CountMatrix::<Dna>::from_sequences(
    ["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"]
        .into_iter()
        .map(|s| EncodedSequence::encode(s).unwrap()),
)
.unwrap();

// Create a PSSM with 0.1 pseudocounts and uniform background frequencies.
let pssm = counts.to_freq(0.1).to_scoring(None);

// Encode the target sequence, then convert it into a striped matrix.
let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
let encoded = EncodedSequence::encode(seq).unwrap();
let mut striped = encoded.to_striped();

// Organize layout of striped matrix to allow scoring with PSSM.
striped.configure(&pssm);

// Compute scores for every position of the matrix.
let scores = pssm.score(&striped);

// Scores can be extracted into a Vec<f32>, or indexed directly.
let v = scores.unstripe();
assert_eq!(scores[0], -23.07094);
assert_eq!(v[0], -23.07094);

// Find the highest scoring position.
let best = scores.argmax().unwrap();
assert_eq!(best, 18);

// Find the positions above an absolute score threshold.
let indices = scores.threshold(10.0);
assert_eq!(indices, []);

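This example uses a dynamic dispatch pipeline, which selects the best available backend (AVX2, SSE2, NEON, or a generic implementation) depending on the local platform.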
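
As an illustration of how runtime backend selection can work in Rust in general, the sketch below probes CPU features with the standard library's detection macros. The `Backend` enum and `detect_backend` function are hypothetical names invented for this example; how lightmotif performs its own dispatch internally is not shown here.

```rust
/// Backends a scanning routine could be specialized for (hypothetical enum).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx2,
    Sse2,
    Neon,
    Generic,
}

/// Pick the most capable backend supported by the CPU the program runs on.
fn detect_backend() -> Backend {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
        if std::arch::is_x86_feature_detected!("sse2") {
            return Backend::Sse2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Backend::Neon;
        }
    }
    // Portable fallback when no SIMD extension is detected.
    Backend::Generic
}

fn main() {
    println!("selected backend: {:?}", detect_backend());
}
```

Detecting features at runtime lets a single binary run on older CPUs while still using wider SIMD registers where they exist, at the cost of one check per dispatch.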

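⏱️ Benchmarks

Both benchmarks use the MX000001 motif from PRODORIC [5] and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on an i7-10710U CPU running at 1.10GHz, compiled with --target-cpu=native.

Score every position of the genome with the motif weight matrix: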

test bench_avx2    ... bench:   4,510,794 ns/iter (+/-     9,570) = 1029 MB/s
test bench_sse2    ... bench:  26,773,537 ns/iter (+/-    57,891) =  173 MB/s
test bench_generic ... bench: 317,731,004 ns/iter (+/- 2,567,370) =   14 MB/s

Find the highest-scoring position for a motif in a 10kb sequence (compared to the PSSM algorithm implemented in bio::pattern_matching::pssm):

test bench_avx2    ... bench:      12,797 ns/iter (+/-   380) = 781 MB/s
test bench_sse2    ... bench:      62,597 ns/iter (+/-    43) = 159 MB/s
test bench_generic ... bench:     671,900 ns/iter (+/- 1,150) =  14 MB/s
test bench_bio     ... bench:   1,193,911 ns/iter (+/- 2,519) =   8 MB/s
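
The MB/s column can be reproduced from the ns/iter column. The sketch below does so for the second benchmark, assuming that "10kb" means exactly 10,000 bytes and that the figure is truncated to an integer; both are assumptions of this example, and the helper names are made up for illustration.

```rust
/// Throughput in MB/s (10^6 bytes per second) for `bytes` processed in
/// `nanos` nanoseconds, truncated to an integer.
fn throughput_mb_per_s(bytes: u64, nanos: u64) -> u64 {
    // bytes / ns is GB/s, so multiply by 1000 to get MB/s.
    bytes * 1_000 / nanos
}

/// How many times faster `candidate` is than `baseline`, given ns/iter for each.
fn speedup(baseline_nanos: u64, candidate_nanos: u64) -> f64 {
    baseline_nanos as f64 / candidate_nanos as f64
}

fn main() {
    // ns/iter values copied from the 10kb benchmark above.
    let results = [
        ("avx2", 12_797u64),
        ("sse2", 62_597),
        ("generic", 671_900),
        ("bio", 1_193_911),
    ];
    for (name, nanos) in results {
        println!("{}: {} MB/s", name, throughput_mb_per_s(10_000, nanos));
    }
    println!("avx2 vs bio: {:.1}x", speedup(1_193_911, 12_797));
}
```

Under these assumptions the computed throughputs agree with the reported 781, 159, 14 and 8 MB/s, and the AVX2 backend comes out roughly 93 times faster than the bio implementation on this input.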

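💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the open-source MIT license.

This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory, in the Zeller team.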

📚 References

[1] Eddy, Sean R. ‘Accelerated Profile HMM Searches’. PLOS Computational Biology 7, no. 10 (20 October 2011): e1002195. doi:10.1371/journal.pcbi.1002195.
[2] Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. ‘FIMO: Scanning for Occurrences of a given Motif’. Bioinformatics 27, no. 7 (1 April 2011): 1017–18. doi:10.1093/bioinformatics/btr064.
[3] Farrar, Michael. ‘Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations’. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
[4] Touzet, Hélène, and Jean-Stéphane Varré. ‘Efficient and Accurate P-Value Computation for Position Weight Matrices’. Algorithms for Molecular Biology 2, no. 1 (2007): 1–12. doi:10.1186/1748-7188-2-15.
[5] Dudek, Christian-Alexander, and Dieter Jahn. ‘PRODORIC: State-of-the-Art Database of Prokaryotic Gene Regulation’. Nucleic Acids Research 50, no. D1 (7 January 2022): D295–302. doi:10.1093/nar/gkab1110.