2 releases

0.2.1	Nov 25, 2019
0.2.0	Nov 25, 2019

#225 in Biology

MIT license

63KB
1.5K SLoC

This is beta software, use at your own risk.

rumi

rumi is a Rust tool for UMI-based PCR deduplication. It uses the directional adjacency method, like UMI-tools, but with a constant-time Hamming distance implementation.
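To illustrate what a constant-time Hamming distance can look like, here is a minimal sketch that packs each UMI into a `u64` at two bits per base and compares two UMIs with an XOR and a population count. This is an assumption about the general technique, not a description of rumi's internals; the names `encode_umi` and `hamming` are hypothetical, and the sketch only handles UMIs of up to 32 bases made of A, C, G and T.

```rust
/// Pack an ACGT-only UMI (at most 32 bases) into a u64, 2 bits per base.
/// Returns None for any other character (e.g. N) or for longer UMIs.
fn encode_umi(umi: &str) -> Option<u64> {
    if umi.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for base in umi.bytes() {
        let code = match base {
            b'A' => 0u64,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        packed = (packed << 2) | code;
    }
    Some(packed)
}

/// Hamming distance between two packed UMIs of equal length.
/// XOR leaves a nonzero 2-bit pair wherever the bases differ. Folding each
/// pair onto its low bit and counting set bits gives the mismatch count,
/// with a cost that does not depend on the UMI length.
fn hamming(a: u64, b: u64) -> u32 {
    const LOW_BITS: u64 = 0x5555_5555_5555_5555;
    let diff = a ^ b;
    ((diff | (diff >> 1)) & LOW_BITS).count_ones()
}
```

For instance, packing AACGGTTGG and ATTGGTTCG and passing both to `hamming` yields 3, since the two UMIs differ at three positions.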

Install

Currently this depends on the Rust toolchain. There is excellent documentation on how to set it up.

cargo install rumi

Usage

$ rumi --help
rumi-dedup 0.1.0
Seth Stadick <sstadick@gmail.com>
Deduplicate reads based on umis

USAGE:
    rumi [FLAGS] [OPTIONS] <INBAM> --output <OUTBAM> --umi_tag <umi_tag>

FLAGS:
        --group_only           Don't deduplicate reads, just group them given them agroup id, and print them. Rules
                               for filtering out unpaired reads, etc, will still be applied.
    -h, --help                 Prints help information
        --ignore_splice_pos    If two reads have the same start pos, and contain a splice site, they will be
                               grouped together, instead of further splitting them based on the
                               splice site
        --is_paired            Input is paired end. Read pairs with unmapped read1 will be ignored.
        --umi_in_read_id       The UMI is located in the read id after the last '_'. Otherwise use the RX tag.
    -V, --version              Prints version information

OPTIONS:
    -o, --output <OUTBAM>                                  Output bam file. Use - if stdout [default: -]
    -c, --allowed_count_factor <allowed_count_factor>
            The factor to multiply the count of a umi by when determining whether or not to group it with other umis
            within allowed_read_dist. include umi_b as adjacent to umi_a if: umi_a.counts >= allowed_count_factor *
            umi_b.counts [default: 2]
    -n, --allowed_network_depth <allowed_network_depth>
            The number of nodes deep to go when creating a group. If allowed_read_dist 1, then allowed_network_depth of
            2 will enable getting all umis with hamming distance of 2 from current umi. [default: 2]
    -d, --allowed_read_dist <allowed_read_dist>
            The distance between umis that will allow them to be counted as adjacent. [default: 1]

    -u, --umi_tag <umi_tag>                                The tag holding the umi information. [default: RX]

ARGS:
    <INBAM>    Input bam file. Use - if stdin [default: -]
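The three grouping options interact, so a small sketch may help. It applies the rule quoted in the help text (umi_b joins umi_a's group when they are within the allowed distance and `umi_a.counts >= allowed_count_factor * umi_b.counts`) and stops expanding after the given network depth. The function `group_umis`, its parameter names, and the choice to seed groups from the most frequent UMI are assumptions made for illustration, not rumi's actual code.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Byte-wise Hamming distance; assumes both UMIs have the same length.
fn hamming(a: &str, b: &str) -> usize {
    a.bytes().zip(b.bytes()).filter(|(x, y)| x != y).count()
}

/// Group the UMIs seen at one position. `dist`, `factor` and `depth` stand in
/// for --allowed_read_dist, --allowed_count_factor and --allowed_network_depth.
fn group_umis(
    counts: &HashMap<String, u32>,
    dist: usize,
    factor: u32,
    depth: usize,
) -> Vec<Vec<String>> {
    // Visit the most frequent UMIs first; break ties by name for determinism.
    let mut umis: Vec<&String> = counts.keys().collect();
    umis.sort_by(|a, b| counts[*b].cmp(&counts[*a]).then_with(|| a.cmp(b)));

    let mut seen: HashSet<&String> = HashSet::new();
    let mut groups = Vec::new();
    for &seed in &umis {
        if !seen.insert(seed) {
            continue; // already absorbed into an earlier group
        }
        let mut group = vec![seed.clone()];
        let mut queue = VecDeque::from([(seed, 0usize)]);
        while let Some((a, level)) = queue.pop_front() {
            if level == depth {
                continue; // do not expand beyond the allowed network depth
            }
            for &b in &umis {
                if seen.contains(b) {
                    continue;
                }
                // Directional rule: only absorb a UMI that is sufficiently rarer.
                if hamming(a, b) <= dist && counts[a] >= factor.saturating_mul(counts[b]) {
                    seen.insert(b);
                    group.push(b.clone());
                    queue.push_back((b, level + 1));
                }
            }
        }
        groups.push(group);
    }
    groups
}
```

With counts AAAA=10, AAAT=4, AATT=1, GGGG=3 and the defaults (distance 1, factor 2, depth 2), AAAA absorbs AAAT, which in turn absorbs AATT, while GGGG stays in its own group. With a depth of 1, AATT would no longer be reached through AAAT and would form a separate group.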

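Performance

I have not done any serious benchmarking yet. Anecdotally, it is at least 4x faster than umi_tools on small datasets, and there is still a lot of low-hanging fruit left to optimize.

I fully expect this implementation to reach at least a 10x speedup once it has been polished. A huge advantage it has over umi_tools is that it can use multiple cores. umi_tools has already moved a large share of its work into C code, so simply being written in a compiled language is not a huge advantage.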

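Known differences from umi_tools

The criteria for choosing which read to keep are different, and I think more comprehensive. For example, if two reads have the same mapq, edit distance, etc., the choice comes down to read length, and the longer read is kept. In a complete tie, the incumbent read wins. This can lead to differences in the results, especially among reverse reads.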
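A tie-breaking chain of this kind can be sketched as a comparison function. The struct `ReadStats`, the functions `rank` and `choose`, and the exact order of the first two criteria are assumptions for illustration; the description above only states that length is consulted after mapq and edit distance, and that the incumbent wins a complete tie.

```rust
use std::cmp::Ordering;

/// Hypothetical summary of the fields used to rank a read.
#[derive(Debug, Clone, PartialEq)]
struct ReadStats {
    name: String,
    mapq: u8,
    edit_distance: u32,
    length: usize,
}

/// Ordering::Greater means `a` is the better read.
fn rank(a: &ReadStats, b: &ReadStats) -> Ordering {
    a.mapq
        .cmp(&b.mapq) // higher mapping quality wins
        .then_with(|| b.edit_distance.cmp(&a.edit_distance)) // fewer edits wins
        .then_with(|| a.length.cmp(&b.length)) // longer read wins
}

/// Keep the incumbent unless the challenger is strictly better.
fn choose(incumbent: ReadStats, challenger: ReadStats) -> ReadStats {
    if rank(&challenger, &incumbent) == Ordering::Greater {
        challenger
    } else {
        incumbent
    }
}
```

Because the challenger must be strictly better to replace the incumbent, the order in which reads arrive decides complete ties, which is one way results can diverge from another tool's.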

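TODO

Clean up the lib and split it into multiple modules
Add better error handling
Figure out how to reduce allocations / string clones
Clarify in the workflow / docs what is actually happening
Allow selecting the number of cores to use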

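Prior Art

Notes

First pass: collect all reads into a dict keyed on position. While building this dict, track metrics such as umi frequency and the extracted umis. Then iterate over the dict and deduplicate at each position.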
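The first pass described above can be sketched with a `HashMap` keyed on position. The `Read` and `Bundle` types, the function names, and the choice of (reference name, start, strand) as the key are hypothetical simplifications, not rumi's actual data structures. The UMI is taken from the read id after the last '_', as the --umi_in_read_id flag describes.

```rust
use std::collections::HashMap;

/// Hypothetical minimal read record; a real BAM record carries far more.
#[derive(Debug, Clone)]
struct Read {
    name: String,
    chrom: String,
    pos: u64,
    reverse: bool,
}

/// Assumed position key: reference name, start coordinate, strand.
type PosKey = (String, u64, bool);

/// Reads at one position plus the UMI frequencies tracked while collecting.
#[derive(Debug, Default)]
struct Bundle {
    reads: Vec<Read>,
    umi_counts: HashMap<String, u32>,
}

/// UMI stored in the read id after the last '_'.
fn umi_from_read_id(read_id: &str) -> Option<&str> {
    read_id.rsplit_once('_').map(|(_, umi)| umi)
}

/// First pass: bucket reads by position and count UMIs per bucket.
fn bundle_by_position(reads: Vec<Read>) -> HashMap<PosKey, Bundle> {
    let mut bundles: HashMap<PosKey, Bundle> = HashMap::new();
    for read in reads {
        let umi = match umi_from_read_id(&read.name) {
            Some(umi) => umi.to_owned(),
            None => continue, // no UMI in the id: skip the read
        };
        let key = (read.chrom.clone(), read.pos, read.reverse);
        let bundle = bundles.entry(key).or_default();
        *bundle.umi_counts.entry(umi).or_insert(0) += 1;
        bundle.reads.push(read);
    }
    bundles
}
```

A second pass would then walk the returned map and deduplicate each `Bundle` independently, which is also what makes the per-position work easy to spread across cores.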

Differences in example.bam (from umi_tools)

SRR2057595.3354975_CGGGTTGGT: rumi correctly uses the umi starting with C, because two reads have that umi. umi_tools uses a umi with a frequency of only 1.
SRR2057595.4915638_TTGGTTAAA: rumi correctly chooses, as the best read, a read carrying the umi it decided to use.
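SRR2057595.5405752_AACGGTTGG: rumi correctly keeps it as its own group. umi_tools corrects it to ATTGGTTCG, which is at a distance of 3. I expect this accounts for the 30 extra reads in rumi's output. How does umi_tools arrive at this?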

Dependencies

~9–13MB
~262K SLoC