6 versions

0.1.5	April 13, 2024
0.1.4	February 21, 2024
0.1.3	January 5, 2024
0.1.2	November 27, 2023
0.1.1	October 18, 2023

#146 in Science

MIT license

40KB
876 lines

bed2gff

A Rust BED-to-GFF3 parallel translator.

Converts

chr7 56766360 56805692 ENST00000581852.25 1000 + 56766360 56805692 0,0,200 3 3,135,81, 0,496,39251,

to

chr7 bed2gff gene 56399404 56805892 . + . ID=ENSG00000166960;gene_id=ENSG00000166960

chr7 bed2gff transcript 56766361 56805692 . + . ID=ENST00000581852.25;Parent=ENSG00000166960;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25

chr7 bed2gff exon 56766361 56766363 . + . ID=exon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1

chr7 bed2gff CDS 56766361 56766363 . + 0 ID=CDS:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1

...

chr7 bed2gff start_codon 56766361 56766363 . + 0 ID=start_codon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1

chr7 bed2gff stop_codon 56805690 56805692 . + 0 ID=stop_codon:ENST00000581852.25.3;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=3

...

in seconds.
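The example above shows the BED start 56766360 becoming 56766361 in the GFF3 records, and the first 3-base block becoming the exon 56766361-56766363. A minimal sketch of that coordinate arithmetic follows, assuming standard BED12 semantics (0-based, half-open intervals; blockStarts relative to chromStart) and GFF3's 1-based, inclusive coordinates. The names `Exon` and `bed_blocks_to_exons` are hypothetical and are not taken from bed2gff's source.

```rust
/// One exon in GFF3 coordinates (1-based, inclusive on both ends).
#[derive(Debug, PartialEq)]
struct Exon {
    start: u64,
    end: u64,
}

/// Convert BED12 block fields into GFF3 exon coordinates.
/// `chrom_start` is BED column 2; `block_sizes` and `block_starts` are
/// BED columns 11 and 12 (comma-separated, a trailing comma is allowed).
/// Returns `None` if a value is not a number or the two lists differ in length.
fn bed_blocks_to_exons(chrom_start: u64, block_sizes: &str, block_starts: &str) -> Option<Vec<Exon>> {
    let parse = |s: &str| -> Option<Vec<u64>> {
        s.split(',')
            .filter(|x| !x.is_empty())
            .map(|x| x.parse::<u64>().ok())
            .collect()
    };
    let sizes = parse(block_sizes)?;
    let starts = parse(block_starts)?;
    if sizes.len() != starts.len() {
        return None;
    }
    Some(
        starts
            .iter()
            .zip(&sizes)
            .map(|(offset, size)| Exon {
                // 0-based start -> 1-based start
                start: chrom_start + offset + 1,
                // half-open end equals the 1-based inclusive end
                end: chrom_start + offset + size,
            })
            .collect(),
    )
}

fn main() {
    // Values taken from the BED line shown above.
    let exons = bed_blocks_to_exons(56766360, "3,135,81,", "0,496,39251,")
        .expect("malformed block fields");
    for (i, e) in exons.iter().enumerate() {
        println!("exon {}: {}-{}", i + 1, e.start, e.end);
    }
}
```

With the BED line above, the first and last exons come out as 56766361-56766363 and 56805612-56805692, which agrees with the exon and stop_codon records shown.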

Converts

Homo sapiens GRCh38 GENCODE 44 (252,835 transcripts) in 4.16 seconds.
Mus musculus GRCm39 GENCODE 44 (149,547 transcripts) in 2.15 seconds.
Canis lupus familiaris ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 1.30 seconds.
Gallus gallus bGalGal1 Ensembl 110 (72,689 transcripts) in 1.51 seconds.

What's new in v0.1.5

Adds the --no-gene flag to perform the conversion only, without isoforms!

Makes -i required unless --no-gene mode is used.

Refactors BedRecord.

Usage

Usage: 
    a) bed2gff[EXE] --bed <BED> --isoforms <ISOFORMS> --output <OUTPUT>
    b) bed2gff[EXE] --bed <BED> --output <OUTPUT> --no-gene

Arguments:
    -b, --bed <BED>: a .bed file
    -i, --isoforms <ISOFORMS>: a tab-delimited file
    -o, --output <OUTPUT>: path to output file
    -n, --no-gene <FLAG>: Flag to disable gene_id feature [default: false]

Options:
    --help: print help
    --version: print version
    --threads/-t: number of threads (default: max cpus)
    --gz: compress output .gff

[!WARNING]

All the transcripts in the .bed file must appear in the isoforms file.

crate: https://crates.io/crates/bed2gff

Click to see detailed formats

bed2gff just needs two files:

A .bed file

A tab-delimited file with 3 required and 9 optional fields:

chrom   chromStart  chromEnd      name    ...
  |         |           |           |
chr20   50222035    50222038    ENST00000595977    ...

See the BED format documentation for more information.
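As an illustration of the layout above, the following sketch reads the three required BED fields plus the optional name from one line. It assumes whitespace-separated columns and treats every field after the third as optional; `BedLine` and `parse_bed_line` are hypothetical names for this example and do not describe bed2gff's internal types.

```rust
/// The three required BED fields plus the optional name column.
#[derive(Debug, PartialEq)]
struct BedLine {
    chrom: String,
    start: u64, // chromStart, 0-based
    end: u64,   // chromEnd, exclusive
    name: Option<String>,
}

/// Parse one BED line, returning a readable error for malformed input.
fn parse_bed_line(line: &str) -> Result<BedLine, String> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    if fields.len() < 3 {
        return Err(format!("expected at least 3 fields, found {}", fields.len()));
    }
    let start = fields[1]
        .parse::<u64>()
        .map_err(|e| format!("bad chromStart: {}", e))?;
    let end = fields[2]
        .parse::<u64>()
        .map_err(|e| format!("bad chromEnd: {}", e))?;
    if start > end {
        return Err(format!("chromStart {} is greater than chromEnd {}", start, end));
    }
    Ok(BedLine {
        chrom: fields[0].to_string(),
        start,
        end,
        name: fields.get(3).map(|s| s.to_string()),
    })
}

fn main() {
    // The sample row shown above.
    match parse_bed_line("chr20\t50222035\t50222038\tENST00000595977") {
        Ok(rec) => println!("{:?}", rec),
        Err(e) => eprintln!("error: {}", e),
    }
}
```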

A tab-delimited .txt/.tsv/.csv/... file with genes/isoforms (all the transcripts in the .bed file must appear in the isoforms file):
```
> cat isoforms.txt

ENSG00000198888 ENST00000361390
ENSG00000198763 ENST00000361453
ENSG00000198804 ENST00000361624
ENSG00000188868 ENST00000595977
```
You can build a custom file for your preferred species using Ensembl BioMart.
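Because every transcript in the .bed file has to appear in the isoforms file, it can help to check the two inputs against each other before converting. The sketch below builds a transcript-to-gene lookup from the two-column layout shown above (gene first, transcript second) and lists any BED names without a gene. It assumes columns are separated by tabs or spaces; `parse_isoforms` and `missing_transcripts` are hypothetical helpers, not part of bed2gff.

```rust
use std::collections::HashMap;

/// Build a transcript -> gene lookup from the text of an isoforms file.
/// Lines with fewer than two columns are skipped.
fn parse_isoforms(text: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for line in text.lines() {
        let mut cols = line.split_whitespace();
        if let (Some(gene), Some(transcript)) = (cols.next(), cols.next()) {
            map.insert(transcript.to_string(), gene.to_string());
        }
    }
    map
}

/// Return the BED transcript names that have no entry in the isoforms map.
fn missing_transcripts<'a>(bed_names: &[&'a str], isoforms: &HashMap<String, String>) -> Vec<&'a str> {
    bed_names
        .iter()
        .copied()
        .filter(|name| !isoforms.contains_key(*name))
        .collect()
}

fn main() {
    let isoforms = parse_isoforms(
        "ENSG00000198888\tENST00000361390\nENSG00000188868\tENST00000595977\n",
    );
    // The second name is a made-up ID used only to trigger the check.
    let missing = missing_transcripts(&["ENST00000595977", "ENST00000000001"], &isoforms);
    if missing.is_empty() {
        println!("all transcripts have a gene");
    } else {
        println!("transcripts without a gene: {:?}", missing);
    }
}
```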

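Installation

To install bed2gff on your system, follow these steps:

Get Rust: on Unix, run curl https://sh.rustup.rs -sSf | sh; for other options, see the official Rust installation page
Run cargo install bed2gff (make sure ~/.cargo/bin is in your $PATH before running it)
Use bed2gff with the required arguments
Enjoy!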

Build

To build bed2gff from this repo, follow these steps:

Get Rust (as described above)
Run git clone https://github.com/alejandrogzi/bed2gff.git && cd bed2gff
Run cargo run --release -- -b <BED> -i <ISOFORMS> -o <OUTPUT>

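Container image

To build the development container image:

Run git clone https://github.com/alejandrogzi/bed2gff.git && cd bed2gff
Initialize Docker with start docker or systemctl start docker
Build the image with docker image build --tag bed2gff .
Run docker run --rm -v "[dir_where_your_gtf_is]:/dir" bed2gff -b /dir/<BED> -i /dir/<ISOFORMS> -o /dir/<OUTPUT>

Conda

To use bed2gff through Conda, just run conda install bed2gff -c bioconda or conda create -n bed2gff -

Output

If specified, bed2gff sends the output directly to the same path as the .bed file:

bed2gff annotation.bed isoforms.txt output.gff

.
├── ...
├── isoforms.txt
├── annotation.bed
└── output.gff3

where output.gff3 is the result.

FAQ

Why?

Converting between formats is a daily practice in bioinformatics. It is even more common when working with gene annotations, because the input/output layouts of tools differ. GTF/GFF/BED are the most widely used structures for storing gene-related annotations, yet existing software does not cover the conversion needs well. A considerable share of genomic tools narrow the software space by accepting only GTF/GFF3 files, which pushes BED users to convert their files to a different format. While some of these issues have been solved (e.g., bed2gtf), the GFF3 layout lacks stable conversion tools (1, 2). bed2gff is proposed as a straightforward option for converting BED files into ready-to-use GFF3 files, filling this gap.

How?

bed2gff builds on the base code of bed2gtf, which is essentially a reimplementation of UCSC's C binaries merged into a single step (bedToGenePred + genePredToGtf). The tool evaluates the positions of exons and other features (CDS, stop/start, UTR), preserving reading frames and adjusting index counts. The main approach is now a parallel algorithm, which significantly reduces computation time. Following bed2gtf, bed2gff uses an isoforms file (the counterpart of the refTable in the C binaries) to map each transcript to its gene, which makes it possible to produce a ready-to-use gff3 file.

References

1. https://bioinformatics.stackexchange.com/questions/2242/how-to-convert-bed-to-gff3
2. https://www.biostars.org/p/2/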
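The "How?" answer mentions two ideas that a short example can make concrete: preserving reading frames and processing records in parallel. The sketch below is an illustration under stated assumptions, not bed2gff's implementation. It computes the GFF3 phase of each CDS segment following the GFF3 convention (the number of bases to skip to reach the next codon start), and spreads independent transcripts across threads. The crate lists rayon as a dependency; this example uses `std::thread::scope` instead so that it stays self-contained. `cds_phases` and `phases_parallel` are hypothetical names.

```rust
use std::thread;

/// GFF3 phase for each CDS segment, given segment lengths in transcription
/// order (5' to 3'; for minus-strand records, reverse the BED block order first).
fn cds_phases(cds_lengths: &[u64]) -> Vec<u8> {
    let mut phases = Vec::with_capacity(cds_lengths.len());
    let mut phase: u64 = 0; // the first CDS segment starts on a codon boundary
    for &len in cds_lengths {
        phases.push(phase as u8);
        phase = if len >= phase {
            // Bases left over after the last complete codon decide how many
            // bases the next segment must skip.
            (3 - (len - phase) % 3) % 3
        } else {
            // Segment shorter than the bases still owed to the current codon.
            phase - len
        };
    }
    phases
}

/// Apply `cds_phases` to every transcript, splitting the input into at most
/// `n_threads` contiguous chunks. Output order matches input order.
fn phases_parallel(transcripts: &[Vec<u64>], n_threads: usize) -> Vec<Vec<u8>> {
    let n = n_threads.max(1);
    let chunk_size = ((transcripts.len() + n - 1) / n).max(1);
    thread::scope(|s| {
        // Spawn all workers first so they run concurrently, then join in order.
        let handles: Vec<_> = transcripts
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(|t| cds_phases(t)).collect::<Vec<_>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    // First entry: block sizes from the README's BED example (fully coding).
    // Second entry: made-up lengths that produce a non-zero phase.
    let transcripts = vec![vec![3, 135, 81], vec![4, 5, 3]];
    println!("{:?}", phases_parallel(&transcripts, 2));
}
```

Each transcript is handled independently, which is why the work splits cleanly into chunks; joining the workers in spawn order keeps the output in the same order as the input.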

Dependencies (~4–15MB, ~128K SLoC): chrono, clap 4.0 (+derive), colored 1.0, flate2, indoc 1.0, libc, log, natord, num_cpus, rayon, simple_logger 4.0, thiserror