3 unstable releases

0.2.0	Apr 10, 2024
0.1.1	Oct 8, 2023
0.1.0	Oct 7, 2023

#731 in Parser implementations

32 downloads per month
Used in nafcodec-py

MIT license

69KB
1.5K SLoC

📦🧬 `nafcodec`

Rust coder/decoder for Nucleotide Archive Format (NAF) files.

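🗺️ Overview

Nucleotide Archive Format is a file format proposed by Kryukov et al. in 2019 for storing compressed nucleotide or protein sequences, combining 4-bit encoding with Zstandard compression. NAF files can be compressed and decompressed using the original C implementation.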
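To make the idea of 4-bit encoding concrete, here is a minimal sketch that packs two nucleotides into each byte. The code table, the nibble order, and the function names `encode_base`, `decode_base`, `pack`, and `unpack` are assumptions made for illustration; they are not taken from the NAF specification or from this crate.

```rust
/// Map a base to a hypothetical 4-bit code (one bit per canonical base).
/// This table is illustrative only and may differ from the one NAF uses.
fn encode_base(base: u8) -> Option<u8> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0b1000),
        b'C' => Some(0b0100),
        b'G' => Some(0b0010),
        b'T' => Some(0b0001),
        b'N' => Some(0b1111),
        _ => None,
    }
}

/// Inverse of `encode_base` for the same hypothetical table.
fn decode_base(code: u8) -> Option<u8> {
    match code {
        0b1000 => Some(b'A'),
        0b0100 => Some(b'C'),
        0b0010 => Some(b'G'),
        0b0001 => Some(b'T'),
        0b1111 => Some(b'N'),
        _ => None,
    }
}

/// Pack two bases per byte: first base in the low nibble, second in the high one.
/// Returns `None` if the sequence contains a symbol outside the table.
fn pack(seq: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity((seq.len() + 1) / 2);
    for pair in seq.chunks(2) {
        let lo = encode_base(pair[0])?;
        let hi = match pair.get(1) {
            Some(&b) => encode_base(b)?,
            None => 0, // odd length: pad the last high nibble with zero
        };
        out.push(lo | (hi << 4));
    }
    Some(out)
}

/// Unpack `len` bases; the length is needed to ignore the padding nibble.
fn unpack(packed: &[u8], len: usize) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(len);
    for i in 0..len {
        let byte = *packed.get(i / 2)?;
        let code = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
        out.push(decode_base(code)?);
    }
    Some(out)
}
```

Packing halves the size of the sequence before any general-purpose compressor such as Zstandard sees it, which is the combination the format description refers to.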

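This crate provides a Rust implementation of a NAF decoder, written from scratch, using nom to parse the binary format and zstd to handle Zstandard decompression. It provides a complete API for iterating over the contents of a NAF file.

This is the Rust version; a Python package is available as well.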

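📋 Features

Streaming decoder: the decoder is implemented with several readers, each accessing one region of the compressed file, so records can be streamed without decoding entire blocks.
Optional decoding: the decoder can skip decoding certain fields, for instance ignoring quality strings when they are not needed.
Flexible encoder: the encoder uses an abstract storage interface for temporary data, so sequences can be kept either in memory or in a temporary folder.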
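The abstract storage idea can be sketched with a small trait that hands out seekable scratch buffers, backed either by memory or by files in a directory. The names `ScratchStorage`, `InMemory`, `OnDisk`, and `roundtrip` are hypothetical and use only the standard library; this is not the crate's actual storage API.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Cursor, Read, Seek, SeekFrom, Write};
use std::path::PathBuf;

/// Hypothetical interface: a source of temporary, seekable buffers.
trait ScratchStorage {
    type Buffer: Read + Write + Seek;
    fn create_buffer(&self, name: &str) -> io::Result<Self::Buffer>;
}

/// Keeps temporary data in memory.
struct InMemory;

impl ScratchStorage for InMemory {
    type Buffer = Cursor<Vec<u8>>;
    fn create_buffer(&self, _name: &str) -> io::Result<Self::Buffer> {
        Ok(Cursor::new(Vec::new()))
    }
}

/// Keeps temporary data in files under `dir`.
struct OnDisk {
    dir: PathBuf,
}

impl ScratchStorage for OnDisk {
    type Buffer = File;
    fn create_buffer(&self, name: &str) -> io::Result<File> {
        OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .truncate(true)
            .open(self.dir.join(name))
    }
}

/// Code written against the trait does not care where the bytes live:
/// write the data, rewind, and read it back.
fn roundtrip<S: ScratchStorage>(storage: &S, data: &[u8]) -> io::Result<Vec<u8>> {
    let mut buffer = storage.create_buffer("nafcodec-sketch-sequence.tmp")?;
    buffer.write_all(data)?;
    buffer.seek(SeekFrom::Start(0))?;
    let mut out = Vec::new();
    buffer.read_to_end(&mut out)?;
    Ok(out)
}
```

With this shape, choosing between memory and disk becomes a matter of which storage value is passed in, for example `roundtrip(&InMemory, b"ACGT")` versus `roundtrip(&OnDisk { dir: std::env::temp_dir() }, b"ACGT")`.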

🔌 Usage

Use a Decoder to iterate over the contents of a Nucleotide Archive Format file, reading from any BufRead + Seek implementor:

let mut decoder = nafcodec::Decoder::from_path("../data/LuxC.naf")
    .expect("failed to open nucleotide archive");

for result in decoder {
    let record = result.unwrap();
    // .. do something with the record .. //
}

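All fields of the obtained Record are optional; which ones are present depends on the kind of data that was compressed. The decoder can be configured through a DecoderBuilder to ignore some fields, even if they are present in the source archive, to make decompression faster: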

let mut decoder = nafcodec::DecoderBuilder::new()
    .quality(false)
    .with_path("../data/phix.naf")
    .expect("failed to open nucleotide archive");

// the archive contains quality strings...
assert!(decoder.header().flags().test(nafcodec::Flag::Quality));

// ... but we configured the decoder to ignore them
for result in decoder {
    let record = result.unwrap();
    assert!(record.quality.is_none())
}
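Because any field may be absent, downstream code has to decide what to do with a missing value. The sketch below uses a hypothetical stand-in type, `SketchRecord`, with owned optional fields; it is not the crate's `Record` type, and the `to_text` helper and the "unnamed" fallback are assumptions made for illustration.

```rust
/// Hypothetical stand-in for a decoded record with optional fields.
struct SketchRecord {
    id: Option<String>,
    sequence: Option<Vec<u8>>,
    quality: Option<String>,
}

/// Render a record as FASTQ when a quality string is present,
/// and as FASTA otherwise (for example when quality decoding was disabled).
fn to_text(record: &SketchRecord) -> String {
    // Fall back to a placeholder name when the archive stored no identifiers.
    let id = record.id.as_deref().unwrap_or("unnamed");
    // A missing sequence becomes an empty string.
    let seq = record
        .sequence
        .as_deref()
        .map(|s| String::from_utf8_lossy(s).into_owned())
        .unwrap_or_default();
    match record.quality.as_deref() {
        Some(q) => format!("@{}\n{}\n+\n{}\n", id, seq, q),
        None => format!(">{}\n{}\n", id, seq),
    }
}
```

Matching on the optional quality field keeps the same code path usable whether or not quality strings were decoded.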

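💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.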

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

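⚖️ License

This library is provided under the open-source MIT license. The NAF specification is in the public domain.

This project is in no way affiliated with, sponsored by, or otherwise endorsed by the original NAF authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory, in the Zeller team.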

📚 References

Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi. "Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences". Bioinformatics, Volume 35, Issue 19, October 2019, Pages 3826–3828. doi:10.1093/bioinformatics/btz144

Dependencies

~5–15MB
~184K SLoC