6 versions

new 0.3.1	Aug 19, 2024
0.3.0	Jul 28, 2023
0.2.2	Jun 13, 2023
0.2.1	Feb 22, 2023
0.1.0	Jan 5, 2023

MIT license

fqtk

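A toolkit for working with FASTQ files, written in Rust.

Currently fqtk contains a single tool, demux, for demultiplexing FASTQ files based on sample barcodes. fqtk demux can demultiplex one or more FASTQ files (e.g. a set of R1, R2 and I1 FASTQ files) with sample barcodes at fixed locations within the reads. It is highly efficient and multi-threaded.

Usage for fqtk demux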

Performs sample demultiplexing on FASTQs.

The sample barcode for each sample in the metadata TSV will be compared against
the sample barcode bases extracted from the FASTQs, to assign each read to a
sample.  Reads that do not match any sample within the given error tolerance
will be placed in the file(s) named with the `--unmatched-prefix` prefix.

FASTQs and associated read structures for each sub-read should be given:

- a single fragment read (with inline index) should have one FASTQ and one read
  structure 
- paired end reads should have two FASTQs and two read structures 
- a dual-index sample with paired end reads should have four FASTQs and four read
  structures given: two for the two index reads, and two for the template reads.

If multiple FASTQs are present for each sub-read, then the FASTQs for each
sub-read should be concatenated together prior to running this tool (e.g. 
`zcat s_R1_L001.fq.gz s_R1_L002.fq.gz | bgzip -c > s_R1.fq.gz`).

Read structures are made up of `<number><operator>` pairs much like the `CIGAR`
string in BAM files. Four kinds of operators are recognized:

1. `T` identifies a template read
2. `B` identifies a sample barcode read
3. `M` identifies a unique molecular index read
4. `S` identifies a set of bases that should be skipped or ignored

The last `<number><operator>` pair may be specified using a `+` sign instead of
a number to denote "all remaining bases". This is useful if, e.g., FASTQs have
been trimmed and contain reads of varying length. Both reads must have template
bases.  Any molecular identifiers will be concatenated using the `-` delimiter
and placed in the given SAM record tag (`RX` by default).  Similarly, the sample
barcode bases from the given read will be placed in the `BC` tag.
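
The read structure syntax described above can be illustrated with a small parser. This is a minimal sketch, not fqtk's own implementation: the names `SegmentKind`, `Segment` and `parse_read_structure` are hypothetical, and the validation rules (non-zero lengths, `+` only on the last pair) are assumptions drawn from the description above.

```rust
/// Kind of segment, one per operator in a read structure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SegmentKind {
    Template,         // T
    SampleBarcode,    // B
    MolecularBarcode, // M
    Skip,             // S
}

/// One `<number><operator>` pair; `len == None` stands for `+`
/// ("all remaining bases").
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Segment {
    len: Option<usize>,
    kind: SegmentKind,
}

/// Parses a read structure such as `8B92T` or `+T` (hypothetical helper).
fn parse_read_structure(s: &str) -> Result<Vec<Segment>, String> {
    let mut segments: Vec<Segment> = Vec::new();
    let mut digits = String::new();
    let mut saw_plus = false;

    for c in s.chars() {
        if c.is_ascii_digit() {
            if saw_plus {
                return Err("a length cannot follow `+`".to_string());
            }
            digits.push(c);
        } else if c == '+' {
            if saw_plus || !digits.is_empty() {
                return Err("`+` must stand alone in place of a number".to_string());
            }
            saw_plus = true;
        } else {
            let kind = match c {
                'T' => SegmentKind::Template,
                'B' => SegmentKind::SampleBarcode,
                'M' => SegmentKind::MolecularBarcode,
                'S' => SegmentKind::Skip,
                other => return Err(format!("unknown operator `{}`", other)),
            };
            // Assumption: only the last pair may use `+`.
            if segments.last().map_or(false, |seg| seg.len.is_none()) {
                return Err("`+` is only allowed on the last pair".to_string());
            }
            let len = if saw_plus {
                None
            } else {
                let n: usize = digits
                    .parse()
                    .map_err(|_| format!("missing or invalid length before `{}`", c))?;
                if n == 0 {
                    return Err("segment length must be greater than zero".to_string());
                }
                Some(n)
            };
            segments.push(Segment { len, kind });
            digits.clear();
            saw_plus = false;
        }
    }

    if saw_plus || !digits.is_empty() {
        return Err("length given without an operator".to_string());
    }
    if segments.is_empty() {
        return Err("empty read structure".to_string());
    }
    Ok(segments)
}
```

Calling `parse_read_structure("8B92T")` yields an 8-base sample barcode segment followed by a 92-base template segment, while `+T` yields a single template segment of open-ended length.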

Metadata about the samples should be given as a headered metadata TSV file with
two columns:

1. `sample_id` - the id of the sample or library.
2. `barcode` - the expected barcode sequence associated with the `sample_id`.
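
To make the metadata layout concrete, here is a minimal sketch of reading such a TSV from a string. The `SampleRow` type and `parse_sample_metadata` function are hypothetical names for illustration. Locating the columns by header name and skipping blank lines are assumptions of this sketch, not documented fqtk behavior.

```rust
/// One data row of the sample metadata TSV (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
struct SampleRow {
    sample_id: String,
    barcode: String,
}

/// Parses headered, tab-separated metadata with `sample_id` and `barcode`
/// columns. Blank lines are skipped (an assumption of this sketch).
fn parse_sample_metadata(tsv: &str) -> Result<Vec<SampleRow>, String> {
    let mut lines = tsv.lines().filter(|l| !l.trim().is_empty());

    // The first non-blank line is the header.
    let header = lines
        .next()
        .ok_or_else(|| "metadata is empty".to_string())?;
    let cols: Vec<&str> = header.split('\t').map(|c| c.trim()).collect();
    let id_idx = cols
        .iter()
        .position(|c| *c == "sample_id")
        .ok_or_else(|| "missing `sample_id` column".to_string())?;
    let bc_idx = cols
        .iter()
        .position(|c| *c == "barcode")
        .ok_or_else(|| "missing `barcode` column".to_string())?;

    let mut rows = Vec::new();
    for line in lines {
        let fields: Vec<&str> = line.split('\t').collect();
        // Fetch a field by index, reporting short rows as errors.
        let get = |i: usize| -> Result<String, String> {
            fields
                .get(i)
                .map(|s| s.trim().to_string())
                .ok_or_else(|| format!("row has too few columns: {}", line))
        };
        rows.push(SampleRow {
            sample_id: get(id_idx)?,
            barcode: get(bc_idx)?,
        });
    }
    Ok(rows)
}
```

A file containing the header line `sample_id<TAB>barcode` followed by one line per sample parses into one `SampleRow` per sample.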

The read structures will be used to extract the observed sample barcode, template
bases, and molecular identifiers from each read.  The observed sample barcode
will be matched to the sample barcodes extracted from the bases in the sample
metadata and associated read structures.
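
The extraction step can be sketched as slicing a read's bases according to its read structure. This is an illustrative sketch only: `extract_segments` is a hypothetical helper, the read structure is passed as plain `(length, operator)` pairs with `None` standing for `+`, and bases are assumed to be ASCII. Treating a `+` segment with no bases left as an error follows the `too-few-bases` description given under `--skip-reasons`.

```rust
/// Splits `bases` into segments according to `(length, operator)` pairs.
/// `None` as a length means "all remaining bases" (the `+` form).
/// Skipped (`S`) segments are consumed but not returned.
fn extract_segments<'a>(
    bases: &'a str,
    structure: &[(Option<usize>, char)],
) -> Result<Vec<(char, &'a str)>, String> {
    let mut offset = 0usize;
    let mut out: Vec<(char, &'a str)> = Vec::new();

    for &(len, op) in structure {
        let end = match len {
            Some(n) => offset + n,
            None => bases.len(),
        };
        // Too few bases: a fixed segment runs past the end of the read,
        // or a `+` segment has nothing left to take.
        if end > bases.len() || (len.is_none() && offset >= bases.len()) {
            return Err(format!(
                "too few bases: read has {} but segment `{}` starts at {}",
                bases.len(),
                op,
                offset
            ));
        }
        if op != 'S' {
            out.push((op, &bases[offset..end]));
        }
        offset = end;
    }
    Ok(out)
}
```

For a 12-base read `ACGTACGTTTTT` and the structure `8B+T`, this returns the barcode `ACGTACGT` and the template `TTTT`. An 8-base read with the structure `10B` is reported as an error.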

An observed barcode matches an expected barcode if all the following are true:

1. The number of mismatches (edits/substitutions) is less than or equal to the
   maximum mismatches (see `--max-mismatches`).
2. The difference between number of mismatches in the best and second best
   barcodes is greater than or equal to the minimum mismatch delta
   (`--min-mismatch-delta`).

The expected barcode sequence may contain Ns, which are not counted as
mismatches regardless of the observed base (e.g. the expected barcode `AAN`
will have zero mismatches relative to both the observed barcodes `AAA` and
`AAN`).
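
The two matching rules and the handling of `N` can be sketched as follows. This is a simplified illustration under stated assumptions, not fqtk's actual matcher: `count_mismatches` and `assign_sample` are hypothetical names, expected barcodes whose length differs from the observed barcode are skipped, and when only one candidate exists the delta rule is treated as satisfied.

```rust
/// Counts mismatches between an expected and an observed barcode.
/// An `N` in the expected barcode never counts as a mismatch.
fn count_mismatches(expected: &[u8], observed: &[u8]) -> usize {
    expected
        .iter()
        .zip(observed.iter())
        .filter(|(e, o)| **e != b'N' && e != o)
        .count()
}

/// Returns the index of the matching expected barcode, or `None` if the
/// read should be treated as unmatched.
fn assign_sample(
    expected: &[&str],
    observed: &str,
    max_mismatches: usize,
    min_mismatch_delta: usize,
) -> Option<usize> {
    let mut best: Option<(usize, usize)> = None; // (index, mismatches)
    let mut second_best = usize::MAX; // mismatches of the runner-up

    for (i, barcode) in expected.iter().enumerate() {
        // Assumption: barcodes of a different length are not candidates.
        if barcode.len() != observed.len() {
            continue;
        }
        let mm = count_mismatches(barcode.as_bytes(), observed.as_bytes());
        let best_mm = best.map(|(_, m)| m);
        match best_mm {
            // Not better than the current best: may still be the runner-up.
            Some(b) if mm >= b => second_best = second_best.min(mm),
            // New best: the previous best becomes the runner-up.
            Some(b) => {
                second_best = b;
                best = Some((i, mm));
            }
            None => best = Some((i, mm)),
        }
    }

    let (index, mm) = best?;
    let delta = second_best.saturating_sub(mm);
    if mm <= max_mismatches && delta >= min_mismatch_delta {
        Some(index)
    } else {
        None
    }
}
```

With the defaults shown in the options (`--max-mismatches 1`, `--min-mismatch-delta 2`), an observed barcode one substitution away from exactly one expected barcode is assigned to it, while an observed barcode that is nearly equidistant from two expected barcodes is left unmatched.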

## Outputs

All outputs are generated in the provided `--output` directory.  For each sample
plus the unmatched reads, FASTQ files are written for each read segment
(specified in the read structures) of one of the types supplied to
`--output-types`.

FASTQ files have names of the format:

{sample_id}.{segment_type}{read_num}.fq.gz

where `segment_type` is one of `R`, `I`, and `U` (for template, barcode/index
and molecular barcode/UMI reads respectively) and `read_num` is a number starting
at 1 for each segment type.
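
The naming scheme can be sketched as a function that walks the segment operators in input order and numbers each segment type from 1. The helper names `segment_type_letter` and `output_file_names` are hypothetical. Whether fqtk numbers segments in exactly this order is an assumption of the sketch.

```rust
/// Maps a read structure operator to the letter used in file names:
/// template -> R, sample barcode/index -> I, molecular barcode/UMI -> U.
fn segment_type_letter(op: char) -> Option<char> {
    match op {
        'T' => Some('R'),
        'B' => Some('I'),
        'M' => Some('U'),
        _ => None, // skipped bases produce no output
    }
}

/// Builds `{sample_id}.{segment_type}{read_num}.fq.gz` names for every
/// segment whose operator is listed in `output_types`.
fn output_file_names(sample_id: &str, operators: &[char], output_types: &[char]) -> Vec<String> {
    let mut counts = [0usize; 3]; // running read_num for R, I, U
    let mut names = Vec::new();

    for &op in operators {
        if let Some(letter) = segment_type_letter(op) {
            let slot = match letter {
                'R' => 0,
                'I' => 1,
                _ => 2,
            };
            counts[slot] += 1;
            if output_types.contains(&op) {
                names.push(format!("{}.{}{}.fq.gz", sample_id, letter, counts[slot]));
            }
        }
    }
    names
}
```

For read structures `8B92T 8B 8B 100T` the operators in order are `B T B B T`. With the default output type `T`, a sample named `s1` would get `s1.R1.fq.gz` and `s1.R2.fq.gz` under this scheme.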

In addition, a tab-delimited `demux-metrics.txt` file is written with counts of
how many reads were assigned to each sample, along with derived metrics.

## Example Command Line

As an example, if the sequencing run was 2x100bp (paired end) with two 8bp index
reads both reading a sample barcode, as well as an in-line 8bp sample barcode in
read one, the command line would be:

fqtk demux \
  --inputs r1.fq.gz i1.fq.gz i2.fq.gz r2.fq.gz \
  --read-structures 8B92T 8B 8B 100T \
  --sample-metadata metadata.tsv \
  --output output_folder

Usage: fqtk demux [OPTIONS] --inputs <INPUTS>... --read-structures <READ_STRUCTURES>... --sample-metadata <SAMPLE_METADATA> --output <OUTPUT>

Options:
  -i, --inputs <INPUTS>...
          One or more input FASTQ files, each corresponding to a sequencing read (e.g. R1, I1)

  -r, --read-structures <READ_STRUCTURES>...
          The read structures, one per input FASTQ in the same order

  -b, --output-types <OUTPUT_TYPES>...
          The read structure types to write to their own files (Must be one of T, B,
          or M for template reads, sample barcode reads, and molecular barcode reads)

          Multiple output types may be specified as a space-delimited list.

          [default: T]

  -s, --sample-metadata <SAMPLE_METADATA>
          A file containing the metadata about the samples

  -o, --output <OUTPUT>
          The output directory into which to write per-sample FASTQs

  -u, --unmatched-prefix <UNMATCHED_PREFIX>
          Output prefix for FASTQ file(s) for reads that cannot be matched to a sample

          [default: unmatched]

      --max-mismatches <MAX_MISMATCHES>
          Maximum mismatches for a barcode to be considered a match

          [default: 1]

  -d, --min-mismatch-delta <MIN_MISMATCH_DELTA>
          Minimum difference between number of mismatches in the best and second best barcodes
          for a barcode to be considered a match

          [default: 2]

  -t, --threads <THREADS>
          The number of threads to use. Cannot be less than 3

          [default: 8]

  -c, --compression-level <COMPRESSION_LEVEL>
          The level of compression to use to compress outputs

          [default: 5]

  -S, --skip-reasons <SKIP_REASONS>
          Skip demultiplexing reads for any of the following reasons, otherwise panic.

          1. `too-few-bases`: there are too few bases or qualities to extract given the
             read structures.  For example, if a read is 8bp long but the read structure
             is `10B`, or if a read is empty and the read structure is `+T`.

  -h, --help
          Print help information (use `-h` for a summary)

  -V, --version
          Print version information

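Installing

Installing with `conda`

To install with conda you must first install conda. Then, in your command line (and with the environment you wish to install fqtk into active), run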

conda install -c bioconda fqtk

Installing with `cargo`

To install with cargo you must first install Rust. On Mac OS and Linux, this can be done with:

curl https://sh.rustup.rs -sSf | sh

Then, to install fqtk, run

cargo install fqtk

Building From Source

First, clone the git repository

git clone https://github.com/fulcrumgenomics/fqtk.git

Second, if you have not already installed the Rust development tools, install them via rustup

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Then build the toolkit in release mode

cd fqtk
cargo build --release
./target/release/fqtk --help

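Developing

fqtk is developed in Rust and follows the conventions of using rustfmt and clippy to ensure both code quality and standardized formatting. When working on fqtk, first run ./ci/check.sh and resolve any issues it reports.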

Releasing a New Version

Pre-requisites

Install cargo-release

cargo install cargo-release

Prior to Any Release

Create a release that will not try to push to crates.io and verify the command

cargo release [major,minor,patch,release,rc...] --no-publish

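Note: "dry-run" is the default for cargo release.

See the cargo-release reference documentation for more information

Semantic Versioning

This tool follows Semantic Versioning. In brief, increment the

MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards compatible manner,
PATCH version when you make backwards compatible bug fixes.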

Major Release

To create a major release

cargo release major --execute

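This will remove any pre-release extension, create a new tag and push it to GitHub, and push the release to crates.io.

Upon success, move the version to the next candidate release.

Finally, make sure to create a new release on GitHub.

Minor and Patch Release

To create a minor (patch) release, follow the Major Release instructions, substituting major with minor (patch)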

cargo release minor --execute

Release Candidate

To move to the next release candidate

cargo release rc --no-tag --no-publish --execute

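This will create or bump the pre-release version and push the changes to the main branch on GitHub. This will not tag and publish the release candidate. If you would like to tag the release candidate on GitHub, remove --no-tag to create a new tag and push it to GitHub.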
