1 unstable release

0.1.0	Jun 20, 2023

#300 in Biology

MIT license

36KB
772 lines

skc

skc is a simple tool for finding shared k-mer content between two genomes.

Install

Prebuilt binary

curl -sSL skc.mbh.sh | sh
# or with wget
wget -nv -O - skc.mbh.sh | sh

You can also pass options to the script like so

$ curl -sSL skc.mbh.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of skc, if skc is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: x86_64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/skc/releases]

        -h, --help
                Display this help message

Cargo

cargo install skc

Conda

conda install skc

Local

cargo build --release
./target/release/skc --help

Usage

Check for shared 16-mers between the HIV-1 genome and the Mycobacterium tuberculosis genome.

$ skc -k 16 NC_001802.1.fa NC_000962.3.fa
[2023-06-20T01:46:36Z INFO ] 9079 unique k-mers in target
[2023-06-20T01:46:38Z INFO ] 2 shared k-mers between target and query
>4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106
TGCAGAACATCCAGGG
>4237062597 tcount=1 qcount=1 tpos=NC_001802.1:8415 qpos=NC_000962.3:629482
CCAGCAGCAGATAGGG

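So we can see that there are two shared 16-mers between the genomes. By default, the shared k-mers are written to stdout; use the -o option to write them to a file.

Fasta description

Example: >4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106

The ID (4233642782) is the 64-bit integer representation of the k-mer's value in bit-space (see Daniel Liu's brilliant cute-nucleotides repository for more information). tcount and qcount are the number of times the k-mer occurs in the target and query genomes, respectively. tpos and qpos are the starting positions of the k-mer in the target and query genomes; if the k-mer occurs more than once, the positions are comma-separated.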
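For downstream scripting it can help to turn this FASTA output into records. The following is a minimal Python sketch, not part of skc: the function name `parse_skc_output` and the record layout are illustrative assumptions, and the only format knowledge it relies on is the header layout shown in the example output above. Positions are kept as raw `contig:position` strings, split on commas, so nothing is assumed about how multiple occurrences are laid out beyond the comma separator.

```python
from typing import Dict, Iterable, Iterator


def parse_skc_output(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one record per shared k-mer from skc-style FASTA lines.

    Hypothetical helper: assumes each record is a '>' header line
    followed by a single sequence line, as in the example output.
    """
    header = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            header = line[1:]
        elif line and header is not None:
            # Header layout: "<id> tcount=.. qcount=.. tpos=.. qpos=.."
            kmer_id, *fields = header.split()
            attrs = dict(field.split("=", 1) for field in fields)
            yield {
                "id": int(kmer_id),
                "kmer": line,
                "tcount": int(attrs["tcount"]),
                "qcount": int(attrs["qcount"]),
                # Multiple occurrences are comma-separated
                "tpos": attrs["tpos"].split(","),
                "qpos": attrs["qpos"].split(","),
            }
            header = None


# The two records from the example run above
SAMPLE = """\
>4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106
TGCAGAACATCCAGGG
>4237062597 tcount=1 qcount=1 tpos=NC_001802.1:8415 qpos=NC_000962.3:629482
CCAGCAGCAGATAGGG
"""

for record in parse_skc_output(SAMPLE.splitlines()):
    print(record["kmer"], record["tpos"], record["qpos"])
```

To read a file written with -o, pass an open, uncompressed file handle instead of `SAMPLE.splitlines()`.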
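To make the ID concrete, here is a small Python sketch of a 2-bit packing that reproduces both IDs in the example output. It is an illustration, not skc's code: the bit mapping (A=0, C=1, T=2, G=3, taken from bits 1 and 2 of the ASCII code) and the bit order (first base in the lowest bits) are assumptions chosen because they match the two headers shown above, and `encode_kmer` and `decode_kmer` are hypothetical names.

```python
def encode_kmer(kmer: str) -> int:
    """Pack a k-mer (k <= 32) into an integer that fits in 64 bits.

    Assumed scheme: 2 bits per base, A=0, C=1, T=2, G=3, with the
    first base stored in the least-significant bits.
    """
    if len(kmer) > 32:
        raise ValueError("k-mers longer than 32 bases do not fit in 64 bits")
    value = 0
    for i, base in enumerate(kmer.upper()):
        if base not in "ACGT":
            raise ValueError(f"unsupported base: {base!r}")
        code = (ord(base) >> 1) & 0b11  # A=0, C=1, T=2, G=3
        value |= code << (2 * i)
    return value


def decode_kmer(value: int, k: int) -> str:
    """Inverse of encode_kmer; k must be supplied separately."""
    bases = "ACTG"  # index matches the 2-bit code
    return "".join(bases[(value >> (2 * i)) & 0b11] for i in range(k))


print(encode_kmer("TGCAGAACATCCAGGG"))  # 4233642782
print(encode_kmer("CCAGCAGCAGATAGGG"))  # 4237062597
print(decode_kmer(4233642782, 16))      # TGCAGAACATCCAGGG
```

Note that the integer alone does not record k, so decoding needs the k-mer size that was passed to -k.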

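Usage help

$ skc --help
Shared k-mer content between two genomes

Usage: skc [OPTIONS] <TARGET> <QUERY>

Arguments:
  <TARGET>
          Target sequence

          Can be compressed with gzip, bzip2, xz, or zstd

  <QUERY>
          Query sequence

          Can be compressed with gzip, bzip2, xz, or zstd

Options:
  -k, --kmer <KMER>
          Size of k-mers (max. 32)

          [default: 21]

  -o, --output <OUTPUT>
          Output filepath(s); stdout if not present

  -O, --output-type <u|b|g|l|z>
          u: uncompressed; b: Bzip2; g: Gzip; l: Lzma; z: Zstd

          Output compression format is automatically guessed from the filename extension. This option is used to override that

          [default: u]

  -l, --compress-level <INT>
          Compression level to use if compressing output

          [default: 6]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Caveats

Make the first genome passed (<TARGET>) the smaller of the two. This reduces memory usage, as all unique k-mers (and their u64 values) for the target are held in memory.

We do not use canonical k-mers.

32 is the largest k-mer size that can be used. This is basically a (lazy) implementation decision, but it also helps keep the memory footprint as low as possible. If you want to use larger k-mer values, I suggest checking out some of the similar tools listed under Alternatives.

Alternatives

skc does not claim to be the fastest or most memory-efficient tool for finding shared k-mer content. I mostly wrote it because I either struggled to install some of the alternatives, found them clunky or verbose, or found it laborious to extract the shared k-mers from their results (e.g. only being able to search one k-mer at a time, or having to run many different subcommands). Here is a (non-exhaustive) list of other tools that can be used to get shared k-mer content:

Jellyfish
REINDEER
kmer-db
GGCAT
KAT

Acknowledgements

Daniel Liu's awesome cute-nucleotides repository is used to (quickly) convert k-mers to 64-bit integers.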
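The caveats about memory use and canonical k-mers can be illustrated with a rough Python sketch of the general approach: index every k-mer of the target in memory, then stream over the query and look each k-mer up. This is not skc's implementation. The names `index_kmers` and `shared_kmers` are hypothetical, positions here are 0-based offsets into plain strings, and skipping k-mers that contain non-ACGT characters is an assumption of the sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def index_kmers(seq: str, k: int) -> Dict[str, List[int]]:
    """Map each k-mer in seq to its 0-based start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # assumption: skip ambiguous bases
            index[kmer].append(i)
    return index


def shared_kmers(target: str, query: str, k: int = 21) -> Dict[str, Tuple[List[int], List[int]]]:
    """Return {k-mer: (target positions, query positions)}.

    Only the target index is held in memory, which is why passing the
    smaller genome as the target keeps memory usage down. K-mers are
    compared as-is: no reverse complement, i.e. no canonical k-mers.
    """
    if not 1 <= k <= 32:
        raise ValueError("k must be between 1 and 32")
    target_index = index_kmers(target.upper(), k)
    query = query.upper()
    shared: Dict[str, Tuple[List[int], List[int]]] = {}
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        if kmer in target_index:
            entry = shared.setdefault(kmer, (target_index[kmer], []))
            entry[1].append(i)
    return shared


result = shared_kmers("ACGTACGTAC", "TTACGTAA", k=4)
for kmer, (tpos, qpos) in sorted(result.items()):
    print(kmer, f"tcount={len(tpos)}", f"qcount={len(qpos)}", tpos, qpos)
```

Because k-mers are not canonicalised, `shared_kmers("AAAA", "TTTT", k=4)` finds nothing even though the two sequences are reverse complements of each other.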

Dependencies

~11MB
~188K SLoC

anyhow
clap 4.3+derive
env_logger 0.10
itertools 0.10.5
log
niffler
noodles 0.40+fasta
thiserror