3 versions

0.2.41	Nov 4, 2023
0.2.31	~~Feb 19, 2023~~
0.2.5	Jun 1, 2024
0.2.4	Sep 27, 2023
0.2.1	~~Jun 23, 2022~~

#80 in Science

MIT license

99KB
2K SLoC

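A Telomere Identification toolKit (`tidk`)

tidk is a toolkit to identify and visualise telomeric repeats in the Darwin Tree of Life genomes. tidk works especially well on chromosome-level genomes, but it can also be used on PacBio HiFi reads (see the many examples in the telomeric repeat database). The toolkit contains a few modules that may be useful to anyone investigating telomeric repeat sequences in a genome.

explore - tries to find the telomeric repeat unit in the genome.
find and search are essentially the same. They identify a repeat sequence in windows across the genome. find uses a built-in table of telomeric repeats, whereas in search you supply your own.
plot does what it says on the tin, and plots the TSV output of find or search as an SVG.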

Install

The easiest way to install is through conda:

conda install -c bioconda tidk

Otherwise...

As with other Rust projects, you will have to compile it yourself. Download Rust, clone this repo, cd into it, and then run

cargo install --path=.

to install tidk into your $PATH.

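Usage

Below is some usage guidance. From version 0.2.3 onwards, the command-line interface has changed significantly. The changes are noted below and in the release changelog.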

Explore

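tidk explore will attempt to find the simple telomeric repeat unit in the genome provided. It reports this repeat unit in its canonical form (e.g. TTAGG -> AACCT). Unlike previous versions, it only prints a simple TSV to STDOUT. Use the distance parameter to search only part of each chromosome arm. The default is 1% of the chromosome length at each end, but you are free to change this. For raw reads in particular (PacBio), I recommend setting the distance flag to 0.5 (--distance 0.5 or --distance=0.5) so that the full length of each read is processed.

For example: tidk explore --minimum 5 --maximum 12 fastas/iyBomHort1_1.20210303.curated_primary.fa > out.tsv searches the Bombus hortorum genome for repeats of length 5 to 12, one length after another.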
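The canonical form mentioned above can be illustrated with a short Python sketch. It assumes that "canonical" means the alphabetically smallest rotation of the repeat unit across both strands, which reproduces the TTAGG -> AACCT example; the function names are hypothetical and tidk's own implementation may differ.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def rotations(seq: str) -> list[str]:
    """Return every cyclic rotation of seq."""
    return [seq[i:] + seq[:i] for i in range(len(seq))]


def canonical_repeat(unit: str) -> str:
    """Assumed definition: smallest rotation, alphabetically, over both strands."""
    unit = unit.upper()
    return min(rotations(unit) + rotations(reverse_complement(unit)))


print(canonical_repeat("TTAGG"))  # AACCT
```

Under this definition, a repeat reported from either strand, or starting at any offset within the unit, collapses to the same string, which makes results comparable between genomes.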

Use a range of kmer sizes to find potential telomeric repeats.
One of either length, or minimum and maximum must be specified.

Usage: tidk explore [OPTIONS] <FASTA>

Arguments:
  <FASTA>  The input fasta file

Options:
  -l, --length [<LENGTH>]        Length of substring
  -m, --minimum [<MINIMUM>]      Minimum length of substring [default: 5]
  -x, --maximum [<MAXIMUM>]      Maximum length of substring [default: 12]
  -t, --threshold [<THRESHOLD>]  Positions of repeats are only reported if they occur sequentially in a greater number than the threshold [default: 100]
      --distance [<DISTANCE>]    The distance from the end of the chromosome as a proportion of chromosome length. Must range from 0-0.5. [default: 0.01]
  -v, --verbose                  Print verbose output.
      --log                      Output a log file.
  -h, --help                     Print help
  -V, --version                  Print version
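To make the --distance option more concrete, the sketch below computes which parts of a chromosome would be searched for a given proportion. It is only an illustration in Python: the function name is hypothetical, and the rounding and coordinate conventions are assumptions rather than tidk's actual behaviour.

```python
def end_regions(chrom_length: int, distance: float = 0.01):
    """Return the (start, end) intervals at each chromosome end, half-open.

    distance is a proportion of chromosome length and must lie in 0-0.5,
    matching the range stated in the help text.
    """
    if not 0.0 <= distance <= 0.5:
        raise ValueError("distance must be between 0 and 0.5")
    span = int(chrom_length * distance)  # assumed rounding: truncate
    return (0, span), (chrom_length - span, chrom_length)


# Default 1% of a 1 Mb chromosome: 10 kb at each end.
print(end_regions(1_000_000))    # ((0, 10000), (990000, 1000000))
# A 20 kb read with distance 0.5: the two regions cover the whole read.
print(end_regions(20_000, 0.5))  # ((0, 10000), (10000, 20000))
```

The second call shows why 0.5 is suggested for raw reads: the two end regions meet in the middle, so nothing is skipped.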

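Find

tidk find takes an input clade, looks up the known or presumed telomeric repeat (or repeats) for that clade, and searches the genome for it. It now uses a custom telomeric repeat database. The dictionary of sequences used will grow as more telomeric repeats are discovered and added.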

Supply the name of a clade your organsim belongs to, and this submodule will find all telomeric repeat matches for that clade.

Usage: tidk find [OPTIONS] [FASTA]

Arguments:
  [FASTA]  The input fasta file

Options:
  -w, --window [<WINDOW>]  Window size to calculate telomeric repeat counts in [default: 10000]
  -c, --clade <CLADE>      The clade of organism to identify telomeres in [possible values: Accipitriformes, Actiniaria, Anura, Apiales, Aplousobranchia, Asterales, Buxales, Caprimulgiformes, Carangiformes, Carcharhiniformes, Cardiida, Carnivora, Caryophyllales, Cheilostomatida, Chiroptera, Chlamydomonadales, Coleoptera, Crassiclitellata, Cypriniformes, Eucoccidiorida, Fabales, Fagales, Forcipulatida, Hemiptera, Heteronemertea, Hirudinida, Hymenoptera, Hypnales, Labriformes, Lamiales, Lepidoptera, Malpighiales, Myrtales, Odonata, Orthoptera, Pectinida, Perciformes, Phlebobranchia, Phyllodocida, Plecoptera, Pleuronectiformes, Poales, Rodentia, Rosales, Salmoniformes, Sapindales, Solanales, Symphypleona, Syngnathiformes, Trichoptera, Trochida, Venerida]
  -o, --output <OUTPUT>    Output filename for the TSVs (without extension)
  -d, --dir <DIR>          Output directory to write files to
  -p, --print              Print a table of clades, along with their telomeric sequences
      --log                Output a log file
  -h, --help               Print help
  -V, --version            Print version

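Search

tidk search searches the genome for an input string. If you know the telomeric repeat of your sequenced organism, this will find it and return the number of occurrences of the repeat in windows across the genome.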

Search the input genome with a specific telomeric repeat search string.

Usage: tidk search [OPTIONS] --string <STRING> --output <OUTPUT> --dir <DIR> <FASTA>

Arguments:
  <FASTA>  The input fasta file

Options:
  -s, --string <STRING>          The DNA string to query the genome with
  -w, --window [<WINDOW>]        Window size to calculate telomeric repeat counts in [default: 10000]
  -o, --output <OUTPUT>          Output filename for the TSVs (without extension)
  -d, --dir <DIR>                Output directory to write files to
  -e, --extension [<EXTENSION>]  The extension, defining the output type of the file [default: tsv] [possible values: tsv, bedgraph]
      --log                      Output a log file
  -h, --help                     Print help
  -V, --version                  Print version
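The idea of counting a repeat in windows across the genome can be sketched in Python as follows. This is a simplified illustration, not tidk's code: it assumes non-overlapping windows, counts the query on both strands, ignores repeats that straddle a window boundary, and uses a hypothetical function name and row layout.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def count_in_windows(sequence: str, repeat: str, window: int = 10_000):
    """Return (window_end, forward_count, reverse_count) for each window."""
    sequence = sequence.upper()
    forward = repeat.upper()
    reverse = reverse_complement(forward)
    rows = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        window_end = min(start + window, len(sequence))
        # str.count counts non-overlapping occurrences within the chunk.
        rows.append((window_end, chunk.count(forward), chunk.count(reverse)))
    return rows


# Toy chromosome: CCTAA repeats at one end, TTAGG repeats at the other.
toy = "CCTAA" * 4 + "ACGTACGTACGT" + "TTAGG" * 4
print(count_in_windows(toy, "TTAGG", window=26))
# [(26, 0, 4), (52, 4, 0)]
```

In the toy output the reverse-strand count is high in the first window and the forward-strand count is high in the last, which is the kind of pattern expected at the two ends of a chromosome.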

Plot

tidk plot plots the TSV output of tidk find or tidk search.

SVG plot of TSV generated from tidk search.

Usage: tidk plot [OPTIONS] --tsv <TSV>

Options:
  -t, --tsv <TSV>          The input TSV file
      --height [<HEIGHT>]  The height of subplots (px). [default: 200]
  -w, --width [<WIDTH>]    The width of plot (px) [default: 1000]
  -o, --output [<OUTPUT>]  Output filename for the SVG (without extension) [default: tidk-plot]
  -h, --help               Print help
  -V, --version            Print version

As an example, on the Square Spot Rustic, Xestia xanthographa:

tidk find -c Lepidoptera -o Xes fastas/ilXesXant1_1.20201023.curated_primary.fa

tidk plot -t finder/Xes_telomeric_repeat_windows.tsv -o ilXes --height 120 -w 800

References

Kurbessoian, Tania, et al. "In host evolution of Exophiala dermatitidis in cystic fibrosis lung micro-environment." BioRxiv (2022): 2022-09.
Yin, Denghua, et al. "Gapless genome assembly of East Asian finless porpoise." Scientific Data 9.1 (2022): 765.
Leonard, Guy, et al. "A genome sequence assembly of the phototactic and optogenetic model fungus Blastocladiella emersonii reveals a diversified nucleotide-cyclase repertoire." Genome Biology and Evolution 14.12 (2022): evac157.
Edwards, Richard J., et al. "A phased chromosome-level genome and full mitochondrial sequence for the dikaryotic myrtle rust pathogen, Austropuccinia psidii." BioRxiv (2022): 2022-04.

Omissions

tidk trim and tidk min have been removed from the latest version.

Dependencies

~28–39MB
~647K SLoC