bin+lib intspan

Command-line tools for IntSpan-related bioinformatics operations

30 versions

0.7.7 July 19, 2024
0.7.5 July 28, 2023
0.7.1 June 14, 2022
0.6.1 February 21, 2022
0.4.1 September 9, 2019

Used in nwr

MIT license

intspan

Install

Current release: 0.7.7

cargo install intspan

# or install from a local checkout of the repository
cargo install --force --path .

# or
brew install intspan

# build under WSL 2
export CARGO_TARGET_DIR=/tmp
cargo build
cargo run --bin fasr help

# local docs
cargo doc --open

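Concepts

Ranges

For an example, see S288c.rg. The information this format carries is very similar to that of BED and related formats.

I chose this format because it is compact, human-readable, and can be embedded in other tab-separated files.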

I:1-100
I(+):90-150
S288c.I(-):190-200
II:21294-22075
II:23537-24097

The following is a schematic diagram of a Range object.

[Figure: ranges]

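Simple rules

  • chromosome and start are required
  • species, strand and end are optional
  • . separates species and chromosome
  • strand is one of + and -, and is surrounded by round brackets
  • : separates names and digits
  • - separates start and end
  • About species
    • species should be alphanumeric and contain no spaces; the one exception character is /
    • species is an identifier; you can also think of it as a strain name, an assembly, or something else.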
species.chromosome(strand):start-end
--------^^^^^^^^^^--------^^^^^^----

In this toolset, rgr operates on ranges in .rg and .tsv files.
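The rules above can be sketched as a small parser. This is only an illustration in Python, not the code used by rgr; the regular expression, the Range tuple, and the parse_range helper are all hypothetical names. It assumes that a missing end means a single position (end equals start) and that chromosome names contain no dot, so anything before the first dot is read as the species.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern derived from the rules above:
# optional "species." prefix, required chromosome, optional "(+)" or "(-)",
# required start, optional "-end".
RANGE_RE = re.compile(
    r"^(?:(?P<species>[\w/]+)\.)?"
    r"(?P<chrom>[\w-]+)"
    r"(?:\((?P<strand>[+-])\))?"
    r":(?P<start>\d+)"
    r"(?:-(?P<end>\d+))?$"
)


class Range(NamedTuple):
    species: Optional[str]
    chromosome: str
    strand: Optional[str]
    start: int
    end: int


def parse_range(text: str) -> Range:
    """Parse 'species.chromosome(strand):start-end' into its parts."""
    m = RANGE_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a valid range: {text!r}")
    start = int(m["start"])
    # Assumption: a range without an end covers a single position.
    end = int(m["end"]) if m["end"] else start
    return Range(m["species"], m["chrom"], m["strand"], start, end)


for example in ["I:1-100", "I(+):90-150", "S288c.I(-):190-200"]:
    print(parse_range(example))
```

Keeping the optional parts as None rather than filling in defaults leaves it to the caller to decide what an unspecified strand or species should mean.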

IntSpans

An IntSpan represents a set of integers as a number of inclusive ranges, for example 1-10,19,45-48.

The following figure shows the schema of an IntSpan object. Jump lines are above the baseline; loop lines are below it.

[Figure: intspans]

In addition, AlignDB::IntSpan and jintspan are implementations of the IntSpan object in Perl and Java, respectively.
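To make the idea of "a set of integers as inclusive ranges" concrete, here is a minimal Python sketch. It is not the crate's implementation; parse_runlist, cardinality, and to_runlist are hypothetical helpers. It assumes non-negative integers only and treats "-" as the empty set, as in the JSON examples of this document.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end)


def parse_runlist(runlist: str) -> List[Span]:
    """Turn a runlist such as '1-10,19,45-48' into sorted, merged spans."""
    if runlist.strip() in ("", "-"):
        return []
    spans = []
    for part in runlist.split(","):
        lo, sep, hi = part.strip().partition("-")
        spans.append((int(lo), int(hi) if sep else int(lo)))
    spans.sort()
    merged: List[Span] = []
    for lo, hi in spans:
        # Overlapping or directly adjacent spans collapse into one.
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def cardinality(spans: List[Span]) -> int:
    """Number of integers contained in the set."""
    return sum(hi - lo + 1 for lo, hi in spans)


def to_runlist(spans: List[Span]) -> str:
    """Write spans back in runlist notation."""
    if not spans:
        return "-"
    return ",".join(str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in spans)


spans = parse_runlist("1-10,19,45-48")
print(spans, cardinality(spans), to_runlist(spans))
```

Storing only the span boundaries keeps memory proportional to the number of ranges rather than to the number of integers, which matters for chromosome-sized coordinates.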

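Runlists - IntSpans on chromosomes, stored in JSON

We often need to deal with many genomic intervals that share the same property, for example all exons of a gene, all promoters of a gene family, or all repeats in a genome.

Existing formats such as bedGraph can partially handle this situation, but they often run into problems of intuitiveness and performance. The number of tools that can process such specialized formats is also very limited.

Saving IntSpans to JSON files is this toolset's solution, and spanr handles that work.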

{
    "I": "-",
    "II": "327069-327703",
    "III": "-",
    "IV": "512988-513590,757572-759779,802895-805654,981142-987119,1017673-1018183,1175134-1175738,1307621-1308556,1504223-1504728",
    "IX": "-",
    "V": "354135-354917",
    "VI": "-",
    "VII": "778784-779515,878539-879235",
    "VIII": "116405-117059,133581-134226",
    "X": "366757-367499,712641-713226",
    "XI": "162831-163399",
    "XII": "64067-65208,91960-92481,451418-455181,455933-457732,460517-464318,465070-466869,489753-490545,817840-818474",
    "XIII": "609100-609861",
    "XIV": "-",
    "XV": "437522-438484",
    "XVI": "560481-561065"
}

{
    "AT1G01010.1": {
        "1": "3631-3913,3996-4276,4486-4605,4706-5095,5174-5326,5439-5899"
    },
    "AT1G01020.1": {
        "1": "5928-6263,6437-7069,7157-7232,7384-7450,7564-7649,7762-7835,7942-7987,8236-8325,8417-8464,8571-8737"
    },
    "AT1G01020.2": {
        "1": "6790-7069,7157-7450,7564-7649,7762-7835,7942-7987,8236-8325,8417-8464,8571-8737"
    },
    "AT2G01008.1": {
        "2": "1025-1272,1458-1510,1873-2810,3706-5513,5782-5945"
    },
    "AT2G01021.1": {
        "2": "6571-6672"
    }
}
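The two JSON layouts above differ only in nesting: the first maps chromosome names directly to runlists, the second maps a name (here a transcript ID) to such a map. The following Python sketch reads either layout and reports the covered length per chromosome. It is an illustration under stated assumptions, not how spanr stat is implemented; span_length and covered_lengths are hypothetical helpers, and the runlists are assumed to be already normalised (no overlapping spans).

```python
import json


def span_length(runlist: str) -> int:
    """Count the integers in a runlist; '-' denotes an empty set."""
    if runlist.strip() in ("", "-"):
        return 0
    total = 0
    for part in runlist.split(","):
        lo, sep, hi = part.strip().partition("-")
        total += (int(hi) if sep else int(lo)) - int(lo) + 1
    return total


def covered_lengths(doc: dict) -> dict:
    """Replace every runlist in a single-set or multi-set document by its length."""
    out = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            # Multi-set layout: name -> {chromosome: runlist}
            out[key] = {chrom: span_length(rl) for chrom, rl in value.items()}
        else:
            # Single-set layout: chromosome -> runlist
            out[key] = span_length(value)
    return out


single = json.loads(
    '{"I": "-", "II": "327069-327703", "VII": "778784-779515,878539-879235"}'
)
multi = json.loads('{"AT2G01021.1": {"2": "6571-6672"}}')

print(covered_lengths(single))  # {'I': 0, 'II': 635, 'VII': 1429}
print(covered_lengths(multi))   # {'AT2G01021.1': {'2': 102}}
```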

Link types

  • Bilateral links

      I(+):13063-17220    I(-):215091-219225
      I(+):139501-141431  XII(+):95564-97485
    
  • Bilateral links with hit strand

      I(+):13327-17227    I(+):215084-218967  -
      I(+):139501-141431  XII(+):95564-97485  +
    
  • Multilateral links

      II(+):186984-190356 IX(+):12652-16010   X(+):12635-15993
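The three link types can be told apart from the fields of a single line. The Python sketch below assumes that link files are tab-separated (they carry a .tsv extension in this document) and that a trailing field consisting only of + or - is the hit strand. parse_link and classify are hypothetical helpers, not part of linkr.

```python
from typing import List, Optional, Tuple


def parse_link(line: str) -> Tuple[List[str], Optional[str]]:
    """Split one link line into its ranges and an optional hit strand."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t") if f.strip()]
    hit_strand = None
    # Assumption: a final field that is just '+' or '-' is the hit strand.
    if fields and fields[-1] in ("+", "-"):
        hit_strand = fields.pop()
    return fields, hit_strand


def classify(line: str) -> str:
    """Name the link type according to the number of ranges it holds."""
    ranges, hit_strand = parse_link(line)
    if len(ranges) == 2:
        return "bilateral with hit strand" if hit_strand else "bilateral"
    if len(ranges) > 2:
        return "multilateral"
    raise ValueError(f"a link needs at least two ranges: {line!r}")


lines = [
    "I(+):13063-17220\tI(-):215091-219225",
    "I(+):13327-17227\tI(+):215084-218967\t-",
    "II(+):186984-190356\tIX(+):12652-16010\tX(+):12635-15993",
]
for line in lines:
    print(classify(line), parse_link(line))
```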
    

Synopsis

rgr help

`rgr` operates ranges in .rg and .tsv files

Usage: rgr [COMMAND]

Commands:
  count    Count each range overlapping with other range files
  field    Create/append ranges from fields
  merge    Merge overlapped ranges via overlapping graph
  prop     Proportion of the ranges intersecting a runlist file
  replace  Replace fields in .tsv file
  runlist  Filter .rg and .tsv files by comparison with a runlist file
  sort     Sort .rg and .tsv files by a range field
  help     Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

* Field numbers in the TSV file start at 1

spanr help

`spanr` operates chromosome IntSpan files

Usage: spanr [COMMAND]

Commands:
  genome    Convert chr.size to runlists
  some      Extract some records from a runlist json file
  merge     Merge runlist json files
  split     Split a runlist json file
  stat      Coverage on chromosomes for runlists
  statop    Coverage on chromosomes for one JSON crossed another
  combine   Combine multiple sets of runlists in a json file
  compare   Compare one JSON file against others
  span      Operate spans in a JSON file
  cover     Output covers on chromosomes
  coverage  Output minimum or detailed depth of coverage on chromosomes
  gff       Convert gff3 to covers on chromosomes
  convert   Convert runlist file to ranges file
  help      Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

fasr help

`fasr` operates block fasta files

Usage: fasr [COMMAND]

Commands:
  axt2fas    Convert axt to block fasta
  check      Check genome locations in block fasta headers
  concat     Concatenate sequence pieces of the same species
  consensus  Generate consensus sequences by POA
  cover      Output covers on chromosomes
  create     Create block fasta files from links of ranges
  filter     Filter blocks, and can also be used as a formatter
  join       Join multiple block fasta files by a common target
  link       Output bi/multi-lateral range links
  maf2fas    Convert maf to block fasta
  name       Output all species names
  pl-p2m     Pipeline - pairwise alignments to multiple alignments
  refine     Realign files with external programs and trim unwanted regions
  replace    Concatenate sequence pieces of the same species
  separate   Separate block fasta files by species
  slice      Extract alignment slices
  split      Split block fasta files to per-alignment/chromosome fasta files
  stat       Extract a subset of species
  subset     Extract a subset of species
  variation  List variations (substitutions/indels)
  help       Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

linkr help

`linkr` operates ranges on chromosomes and links of ranges

Usage: linkr [COMMAND]

Commands:
  circos   Convert links to circos links or highlights
  sort     Sort links and ranges within links
  filter   Filter links by numbers of ranges or length differences
  clean    Replace ranges within links, incorporate hit strands and remove nested links
  connect  Connect bilateral links into multilateral ones
  help     Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Examples

spanr

spanr genome tests/spanr/S288c.chr.sizes

spanr genome tests/spanr/S288c.chr.sizes |
    spanr stat tests/spanr/S288c.chr.sizes stdin --all

spanr some tests/spanr/Atha.json tests/spanr/Atha.list

spanr merge tests/spanr/I.json tests/spanr/II.json
spanr merge tests/spanr/I.json tests/spanr/II.other.json --all

spanr cover tests/spanr/S288c.rg
spanr cover tests/spanr/dazzname.rg

spanr coverage tests/spanr/S288c.rg -m 2

spanr coverage tests/spanr/S288c.rg -d

spanr gff tests/spanr/NC_007942.gff --tag tRNA

spanr span --op cover tests/spanr/brca2.json

spanr combine tests/spanr/Atha.json

spanr compare \
    --op intersect \
    tests/spanr/intergenic.json \
    tests/spanr/repeat.json

spanr compare \
    --op intersect \
    tests/spanr/I.II.json \
    tests/spanr/I.json \
    tests/spanr/II.json

spanr split tests/spanr/I.II.json

spanr stat tests/spanr/S288c.chr.sizes tests/spanr/intergenic.json

spanr stat tests/spanr/S288c.chr.sizes tests/spanr/I.II.json

spanr stat tests/spanr/Atha.chr.sizes tests/spanr/Atha.json

spanr statop \
    --op intersect \
    tests/spanr/S288c.chr.sizes \
    tests/spanr/intergenic.json \
    tests/spanr/repeat.json

spanr statop \
    --op intersect --all \
    tests/spanr/Atha.chr.sizes \
    tests/spanr/Atha.json \
    tests/spanr/paralog.json

spanr convert tests/spanr/repeat.json tests/spanr/intergenic.json |
    spanr cover stdin |
    spanr stat tests/spanr/S288c.chr.sizes stdin --all

spanr merge tests/spanr/repeat.json tests/spanr/intergenic.json |
    spanr combine stdin |
    spanr stat tests/spanr/S288c.chr.sizes stdin --all

rgr

rgr field tests/Atha/chr.sizes --chr 1 --start 2 -a -s
rgr field tests/spanr/NC_007942.gff -H --chr 1 --start 4 --end 5 --strand 7 --eq 3:tRNA --ne '7:+'
rgr field tests/rgr/ctg.tsv --chr 2 --start 3 --end 4 -H -f 6,1 > tests/rgr/ctg.range.tsv

rgr sort tests/rgr/S288c.rg
rgr sort tests/rgr/ctg.range.tsv -H -f 3
# ctg:I:1 is treated as a range
rgr sort tests/rgr/S288c.rg tests/rgr/ctg.range.tsv

rgr count tests/rgr/S288c.rg tests/rgr/S288c.rg
rgr count tests/rgr/ctg.range.tsv tests/rgr/S288c.rg -H -f 3

rgr runlist tests/rgr/intergenic.json tests/rgr/S288c.rg --op overlap
rgr runlist tests/rgr/intergenic.json tests/rgr/ctg.range.tsv --op non-overlap -H -f 3

rgr prop tests/rgr/intergenic.json tests/rgr/S288c.rg
rgr prop tests/rgr/intergenic.json tests/rgr/ctg.range.tsv -H -f 3 --prefix --full

rgr merge tests/rgr/II.links.tsv -c 0.95

rgr replace tests/rgr/1_4.ovlp.tsv tests/rgr/1_4.replace.tsv
rgr replace tests/rgr/1_4.ovlp.tsv tests/rgr/1_4.replace.tsv -r

# ctg_2_1_.gc.tsv isn't sorted, so sort it first
cat tests/rgr/ctg_2_1_.gc.tsv | rgr sort stdin | cargo run --bin rgr pl-2rmp stdin > /dev/null
cat tests/rgr/II.links.tsv | cargo run --bin rgr pl-2rmp stdin

linkr

linkr sort tests/linkr/II.links.tsv -o tests/linkr/II.sort.tsv

rgr merge tests/linkr/II.links.tsv -v

linkr clean tests/linkr/II.sort.tsv
linkr clean tests/linkr/II.sort.tsv --bundle 500
linkr clean tests/linkr/II.sort.tsv -r tests/linkr/II.merge.tsv

linkr connect tests/linkr/II.clean.tsv -v

linkr filter tests/linkr/II.connect.tsv -n 2
linkr filter tests/linkr/II.connect.tsv -n 3 -r 0.99

linkr circos tests/linkr/II.connect.tsv
linkr circos --highlight tests/linkr/II.connect.tsv

Steps

    sort
      |
      v
    clean -> merge
      |     /
      |  /
      v
    clean
      |
      v
    connect
      |
      v
    filter

S288c

linkr sort tests/S288c/links.lastz.tsv tests/S288c/links.blast.tsv \
    -o tests/S288c/sort.tsv

linkr clean tests/S288c/sort.tsv \
    -o tests/S288c/sort.clean.tsv

rgr merge tests/S288c/sort.clean.tsv -c 0.95 \
    -o tests/S288c/merge.tsv

linkr clean tests/S288c/sort.clean.tsv -r tests/S288c/merge.tsv --bundle 500 \
    -o tests/S288c/clean.tsv

linkr connect tests/S288c/clean.tsv -r 0.8 \
    -o tests/S288c/connect.tsv

linkr filter tests/S288c/connect.tsv -r 0.8 \
    -o tests/S288c/filter.tsv

wc -l tests/S288c/*.tsv
#     229 tests/S288c/clean.tsv
#     148 tests/S288c/connect.tsv
#     148 tests/S288c/filter.tsv
#     566 tests/S288c/links.blast.tsv
#     346 tests/S288c/links.lastz.tsv
#      74 tests/S288c/merge.tsv
#     282 tests/S288c/sort.clean.tsv
#     626 tests/S288c/sort.tsv

cat tests/S288c/filter.tsv |
    perl -nla -F"\t" -e 'print for @F' |
    spanr cover stdin -o tests/S288c/cover.json

spanr stat tests/S288c/chr.sizes tests/S288c/cover.json -o stdout

Atha

gzip -dcf tests/Atha/links.lastz.tsv.gz tests/Atha/links.blast.tsv.gz |
    linkr sort stdin -o tests/Atha/sort.tsv

linkr clean tests/Atha/sort.tsv -o tests/Atha/sort.clean.tsv

rgr merge tests/Atha/sort.clean.tsv -c 0.95 -o tests/Atha/merge.tsv

linkr clean tests/Atha/sort.clean.tsv -r tests/Atha/merge.tsv --bundle 500 -o tests/Atha/clean.tsv

linkr connect tests/Atha/clean.tsv -o tests/Atha/connect.tsv

linkr filter tests/Atha/connect.tsv -r 0.8 -o tests/Atha/filter.tsv

wc -l tests/Atha/*.tsv
#    4500 tests/Atha/clean.tsv
#    3832 tests/Atha/connect.tsv
#    3832 tests/Atha/filter.tsv
#     785 tests/Atha/merge.tsv
#    5416 tests/Atha/sort.clean.tsv
#    7754 tests/Atha/sort.tsv

cat tests/Atha/filter.tsv |
    perl -nla -F"\t" -e 'print for @F' |
    spanr cover stdin -o tests/Atha/cover.json

spanr stat tests/Atha/chr.sizes tests/Atha/cover.json -o stdout

fasr

fasr maf2fas tests/fasr/example.maf

fasr axt2fas tests/fasr/RM11_1a.chr.sizes tests/fasr/example.axt --qname RM11_1a

cargo run --bin fasr filter tests/fasr/example.fas --ge 10

fasr name tests/fasr/example.fas --count

fasr cover tests/fasr/example.fas

fasr cover tests/fasr/example.fas --name S288c --trim 10

fasr concat tests/fasr/name.lst tests/fasr/example.fas

fasr subset tests/fasr/name.lst tests/fasr/example.fas
cargo run --bin fasr subset tests/fasr/name.lst tests/fasr/refine.fas --required

fasr link tests/fasr/example.fas --pair
fasr link tests/fasr/example.fas --best

cargo run --bin fasr replace tests/fasr/replace.tsv tests/fasr/example.fas
cargo run --bin fasr replace tests/fasr/replace.fail.tsv tests/fasr/example.fas

samtools faidx tests/fasr/NC_000932.fa NC_000932:1-10

fasr check tests/fasr/NC_000932.fa tests/fasr/A_tha.pair.fas

fasr create tests/fasr/genome.fa tests/fasr/I.connect.tsv --name S288c

# Create a fasta file containing multiple genomes
cat tests/fasr/genome.fa | sed 's/^>/>S288c./' > tests/fasr/genomes.fa
samtools faidx tests/fasr/genomes.fa S288c.I:1-100

cargo run --bin fasr create tests/fasr/genomes.fa tests/fasr/I.name.tsv --multi

fasr separate tests/fasr/example.fas -o . --suffix .tmp

spoa tests/fasr/refine.fasta -r 1

cargo run --bin fasr consensus tests/fasr/example.fas
cargo run --bin fasr consensus tests/fasr/refine.fas
cargo run --bin fasr consensus tests/fasr/refine.fas --outgroup -p 2

cargo run --bin fasr refine tests/fasr/example.fas
cargo run --bin fasr refine tests/fasr/example.fas --msa none --chop 10
cargo run --bin fasr refine tests/fasr/refine2.fas --msa clustalw --outgroup
cargo run --bin fasr refine tests/fasr/example.fas --quick

cargo run --bin fasr split tests/fasr/example.fas --simple
cargo run --bin fasr split tests/fasr/example.fas -o . --chr --suffix .tmp

cargo run --bin fasr slice tests/fasr/slice.json tests/fasr/slice.fas --name S288c

cargo run --bin fasr join tests/fasr/S288cvsYJM789.slice.fas --name YJM789
cargo run --bin fasr join \
    tests/fasr/S288cvsRM11_1a.slice.fas \
    tests/fasr/S288cvsYJM789.slice.fas \
    tests/fasr/S288cvsSpar.slice.fas

cargo run --bin fasr stat tests/fasr/example.fas --outgroup

cargo run --bin fasr variation tests/fasr/example.fas
cargo run --bin fasr variation tests/fasr/example.fas --outgroup

cargo run --bin fasr xlsx tests/fasr/example.fas
cargo run --bin fasr xlsx tests/fasr/example.fas --outgroup

cargo run --bin fasr pl-p2m tests/fasr/S288cvsRM11_1a.slice.fas tests/fasr/S288cvsSpar.slice.fas

License

MIT
