8 releases

0.0.12	Oct 29, 2022
0.0.11	Oct 26, 2022
0.0.8	Sep 26, 2022
0.0.7	Aug 9, 2022
0.0.4	~~Mar 20, 2022~~

#77 in Biology

38 downloads per month

MIT/Apache and GPL-2.0 licenses

340KB
SLoC

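This package has been renamed and is now called gsearch. This package has been withdrawn.

A Rust classifier for microbial genomes based on probminhash and HNSW

ARCHAEA stands for: A Rust Classifier based on Hierarchical Navigable SW graphs, et al.

This package (currently in development) computes probminhash signatures of bacterial and archaeal (or viral and fungal) genomes, and stores each genome's ID and probminhash signature in an Hnsw structure so that new query genomes can be searched against it.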
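To make the idea concrete, here is a minimal sketch of signature-based genome search. It deliberately swaps in simpler techniques: plain MinHash over k-mers instead of probminhash, and a brute-force scan instead of an HNSW graph. All function names are hypothetical and none of this is the crate's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a k-mer together with a seed, so each seed behaves like a separate hash function.
fn seeded_hash(kmer: &[u8], seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    kmer.hash(&mut h);
    h.finish()
}

/// Plain MinHash signature (a stand-in for probminhash):
/// for each of `size` seeds, keep the smallest hash over all k-mers of `seq`.
pub fn minhash_signature(seq: &[u8], k: usize, size: usize) -> Vec<u64> {
    (0..size as u64)
        .map(|seed| {
            seq.windows(k)
                .map(|kmer| seeded_hash(kmer, seed))
                .min()
                .unwrap_or(u64::MAX) // sequence shorter than k
        })
        .collect()
}

/// Distance = 1 - fraction of signature slots that agree (an estimate of 1 - Jaccard).
pub fn signature_distance(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "signatures must have the same size");
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    1.0 - same as f64 / a.len() as f64
}

/// Brute-force stand-in for the HNSW search: the `n` closest database entries,
/// sorted by increasing distance.
pub fn nearest<'a>(
    db: &'a [(String, Vec<u64>)],
    query: &[u64],
    n: usize,
) -> Vec<(&'a str, f64)> {
    let mut hits: Vec<(&str, f64)> = db
        .iter()
        .map(|(id, sig)| (id.as_str(), signature_distance(sig, query)))
        .collect();
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
    hits.truncate(n);
    hits
}
```

A database is then just a list of (genome ID, signature) pairs, and a query is answered by calling `nearest(&db, &minhash_signature(query_seq, k, size), n)`. The brute-force scan is linear in the database size, which is the cost an HNSW graph is meant to avoid.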

The software part was developed by Jean-Pierre Both (https://github.com/jean-pierreBoth) and the genomics part by Jianshu Zhao (https://github.com/jianshu93).

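Sketching of genomes/tohnsw

The sketching of reference genomes is done by the tohnsw module.

Sketching the reference genomes can take some time (1-2 hours for the roughly 65,000 bacterial genomes of NCBI, with parameters that give a correct sketching quality).

The result is stored in two structures:
An Hnsw structure storing the processed data and the corresponding sketches.
A dictionary associating each rank with a fasta ID and a fasta file name.

The Hnsw structure is dumped to hnswdump.hnsw.graph and hnswdump.hnsw.data, and the dictionary to the JSON file seqdict.json.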
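Since a request needs all three dumped files, a small pre-flight check can save a failed run. The following is only a sketch: the file names come from the text above, while `missing_db_files` is a hypothetical helper and not something the crate provides.

```rust
use std::path::Path;

/// File names of the dumped database, as listed in the text above.
const DB_FILES: [&str; 3] = ["hnswdump.hnsw.graph", "hnswdump.hnsw.data", "seqdict.json"];

/// Returns the names of the expected database files that are absent from `dir`.
/// (Hypothetical helper for illustration only.)
pub fn missing_db_files(dir: &Path) -> Vec<&'static str> {
    DB_FILES
        .iter()
        .copied()
        .filter(|name| !dir.join(name).is_file())
        .collect()
}
```

An empty result means the directory looks complete; otherwise the returned names tell you which part of the dump is missing.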

Requests

For requests, the request module is used. It reloads the dumped files (hnsw and seqdict) and, for each fasta file in the request, returns the N nearest genomes found.

Usage

### Build a database from a directory of genome files: fna.gz is expected for nt, and .faa or .faa.gz for --aa. The limit for k is 32 (15 does not work due to compression), for s is 65535 (u16) and for n is 255 (u8)
tohnsw -d db_dir_nt -s 12000 -k 16 --ef 1600 -n 128
tohnsw -d db_dir_aa -s 12000 -k 7 --ef 1600 -n 128 --aa
### Request neighbours for each genome (fna, fasta, faa etc. are supported) in query_dir_nt or query_dir_aa using a pre-built database:
wget http://enve-omics.ce.gatech.edu/data/public_gsearch/GTDB_r207_hnsw_graph.tar.gz
tar xzvf ./GTDB_r207_hnsw_graph.tar.gz
cd ./GTDB_r207_hnsw_graph/nucl
### request neighbors for nt genomes
request -b ./ -d query_dir_nt -n 50
### request neighbors for aa genomes (predicted by Prodigal or FragGeneScanRs)
cd ../prot
request -b ./ -d query_dir_aa -n 50 --aa
### request neighbors for aa universal gene (extracted by hmmer according to hmm files provided)
cd ../universal
request -b ./ -d query_dir_universal_aa -n 50 --aa
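The limits quoted in the build comment above follow from the integer types named there (u16 for s, u8 for n). As a rough sketch, a wrapper could check values before calling tohnsw. Reading k as the k-mer size, s as the sketch size and n as the number of neighbours is an assumption based on the flag names; `SketchParams` and `check_params` are hypothetical, as is the lower bound of 1 for k.

```rust
use std::convert::TryFrom;

/// Hypothetical container for validated values; not a type from the crate.
#[derive(Debug, PartialEq)]
pub struct SketchParams {
    pub k: u8,
    pub s: u16,
    pub n: u8,
}

/// Checks the limits stated in the usage comment: k <= 32 and k != 15,
/// s must fit in a u16 (<= 65535), n must fit in a u8 (<= 255).
pub fn check_params(k: u32, s: u32, n: u32) -> Result<SketchParams, String> {
    if k == 0 || k > 32 {
        return Err(format!("k = {} is outside 1..=32", k));
    }
    if k == 15 {
        return Err("k = 15 is reported not to work due to compression".to_string());
    }
    // try_from fails exactly when the value does not fit the target type.
    let s = u16::try_from(s).map_err(|_| format!("s = {} exceeds 65535 (u16)", s))?;
    let n = u8::try_from(n).map_err(|_| format!("n = {} exceeds 255 (u8)", n))?;
    Ok(SketchParams { k: k as u8, s, n })
}
```

With the values from the nt example, `check_params(16, 12000, 128)` succeeds, while `check_params(16, 70000, 128)` is rejected because 70000 does not fit in a u16.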

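Output explanation

The default output file, written to the current directory, is Archaea.answer.
For each genome in the query directory, the N nearest genomes are requested and sorted by distance (smallest to largest).
If a query genome does not appear in the output file, the database contains no nearest genome for it at that level (nt or aa), or the genome is too distant from its best match in the database. In that case you can move on to the amino acid level or the universal gene level.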
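The fallback described above (nt, then aa, then universal genes) amounts to finding which queries got no answer at the current level and retrying only those. The sketch below assumes you have already extracted the names of answered queries from the output file; the file's format is not described here, so parsing is left out, and `unanswered` is a hypothetical helper.

```rust
use std::collections::HashSet;

/// Returns the queries that have no entry among `answered`, preserving input order.
/// (Hypothetical helper; extracting `answered` from the answer file is not shown.)
pub fn unanswered<'a>(queries: &[&'a str], answered: &[&str]) -> Vec<&'a str> {
    let seen: HashSet<&str> = answered.iter().copied().collect();
    queries
        .iter()
        .copied()
        .filter(|q| !seen.contains(q))
        .collect()
}
```

The returned list is what you would pass to the next, coarser level of search.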

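Dependencies, features and installation

Features

hnsw_rs relies on the crate simdeez to accelerate distance computation. On Intel, hnsw_rs can be built with the simdeez_f feature.
annembed relies on openblas, so you must choose one of the three features "annembed_openblas-static", "annembed_openblas-system" or "annembed_intel-mkl". You may need to install gcc, gfortran and make. The feature can be selected either with the --features option, as explained below, or by modifying the features section of Cargo.toml. In the latter case, just fill in the default you want.
kmerutils provides a feature called "withzmq". It can be used to store compressed qualities on a server and run requests. This feature is not needed in this package.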

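Simple case for installation

Pre-built binaries will be available on the release page (https://github.com/jean-pierreBoth/archaea/releases/tag/v0.0.10) for major platforms.

Otherwise, you can install/compile it yourself

First install the Rust tools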

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Install archaea from crates.io and compile (recommended)

Simple installation; if annembed is enabled, this uses intel-mkl

cargo install archaea --features="annembed_intel-mkl"

or with a system-installed openblas

cargo build --release --features="annembed_openblas-system"

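On macOS, dynamic library linking is required: you must first install openblas and then xz (the provided macOS/Darwin binaries require this too). Note that the openblas library install path is different on M1 Macs.
So you need to run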

brew install openblas xz
echo 'export LDFLAGS="-L/usr/local/opt/openblas/lib"' >> ~/.bash_profile
echo 'export CPPFLAGS="-I/usr/local/opt/openblas/include"' >> ~/.bash_profile
echo 'export PKG_CONFIG_PATH="/usr/local/opt/openblas/lib/pkgconfig"' >> ~/.bash_profile
cargo install archaea --features="annembed_openblas-system"

Intel
You can enable SIMD instructions with the hnsw_rs/simdeez_f feature.
To use openblas instead of intel-mkl, run

cargo build --release --features="annembed_openblas-system" --features="hnsw_rs/simdeez_f"

Install the latest version of archaea from GitHub

Install directly from GitHub

cargo install archaea --features="annembed_intel-mkl" --git https://github.com/jean-pierreBoth/archaea

Download and compile

git clone https://github.com/jean-pierreBoth/archaea
cd archaea
## build
cargo build --release --features="annembed_openblas-static" 
### on macOS, which requires dynamic library linking:
cargo build --release --features="annembed_openblas-system"

Then install FragGeneScanRs

cargo install --git https://gitlab.com/Jianshu_Zhao/fraggenescanrs

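Hints for problems (including installing/compiling on ARM64 CPUs) are provided here

Pre-built databases

We provide pre-built genome/proteome database graph files for bacteria/archaea, viruses and fungi. The proteome databases are built from the genes of each genome, predicted by FragGeneScanRs (https://gitlab.com/Jianshu_Zhao/fraggenescanrs) for bacteria/archaea/viruses and by GeneMark-ES version 2 (http://exon.gatech.edu/GeneMark/license_download.cgi) for fungi.

The bacteria/archaea genomes are from the newest version of the GTDB database (https://gtdb.ecogenomic.org), which defines bacterial species at 95% ANI. Note that GSearch can also run on species databases of higher resolution, such as 99% ANI.
The virus database is based on the newest version of the JGI IMG/VR database (https://genome.jgi.doe.gov/portal/IMG_VR/IMG_VR.home.html), which also defines virus OTUs (vOTUs) at 95% ANI.
The fungal database is based on RefSeq fungal genomes (retrieved via the MycoCosm website); we dereplicated them and defined fungal species at 99.5% ANI.
All three pre-built databases are available here: http://enve-omics.ce.gatech.edu/data/gsearch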

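References

Zhao, J. et al. GSearch: Ultra-fast and scalable microbial genome search by combining kmer hashing with hierarchical navigable small world graphs. bioRxiv 2022:2022.2010.2021.513218. biorxiv.

Dependencies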

~18–40MB
~597K SLoC