DAF-QC Pipeline

DAF-QC-SMK is a Snakemake pipeline for quality control and initial processing of DAF-seq reads. It supports both PacBio HiFi and Oxford Nanopore platforms. This page covers installation, usage, and key outputs. For the wet lab steps that precede this pipeline, see the DAF-seq Protocol.

Getting started

The pipeline uses pixi for environment management. Clone the repository and install:

git clone https://github.com/StergachisLab/DAF-QC-SMK.git
cd DAF-QC-SMK
pixi install

Verify the installation

A test dataset (human chr8, hg38) is bundled with the repository. Run it to confirm everything is working before processing your own data:

pixi run test

If you encounter errors on your own data, run the test case before contacting the developers; knowing whether it passes helps with troubleshooting.

Usage

Run the pipeline with pixi:

pixi run snakemake --configfile config/config.yaml

For SLURM clusters, specify a profile:

pixi run snakemake --configfile config/config.yaml --profile profiles/slurm-executor

You can also run the pipeline from a different directory by pointing pixi at the repository's pixi.toml with --manifest-path:

pixi run --manifest-path /path/to/DAF-QC-SMK/pixi.toml snakemake --configfile config/config.yaml

Inputs

The pipeline requires two configuration files:

Sample table (`config.tbl`)

A tab-separated table with sample name, BAM/FASTQ path, and targeted regions:

sample	file	regs
test	test.bam	chr8:144415767-144417958

For PacBio BAM inputs, files should contain either unaligned reads or primary reads only (for compatibility with pbmarkdup during consensus generation). See config/config.tbl in the repository for a template.
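Because the sample table must be tab-separated, a row typed with spaces is an easy mistake to make. The following is a minimal sketch of a pre-flight check, not part of the pipeline itself. The function name `check_manifest` is hypothetical, and the region pattern is an assumption inferred from the single example row above; how several regions are written in one `regs` field is not shown here, so the sketch only checks the single-region form.

```python
import csv
import io
import re

# Assumed region syntax, inferred from the example row: contig:start-end
REGION_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")
EXPECTED_HEADER = ["sample", "file", "regs"]


def check_manifest(text):
    """Return a list of problems found in a sample table passed as a string."""
    problems = []
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append("header must be: sample<TAB>file<TAB>regs")
        return problems
    for lineno, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != 3:
            problems.append(
                f"line {lineno}: expected 3 tab-separated fields, got {len(row)}"
            )
            continue
        match = REGION_RE.match(row[2])
        if match is None:
            problems.append(f"line {lineno}: unrecognized region {row[2]!r}")
        elif int(match.group(2)) >= int(match.group(3)):
            problems.append(f"line {lineno}: region start is not below end")
    return problems
```

Calling `check_manifest(open("config/config.tbl").read())` returns an empty list when the table passes these checks; a row separated by spaces instead of tabs is reported as having one field.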

Configuration file (`config.yaml`)

Specifies paths to the sample table and reference genome, sequencing platform, and optional parameters:

ref: /path/to/genome.fa
manifest: config/config.tbl
platform: pacbio  # 'pacbio' or 'ont'

# Optional (both platforms)
chimera_cutoff: 0.9
min_deamination_count: 50
end_tolerance: 30
decorated_samplesize: 5000

# PacBio-specific
consensus: True
consensus_min_reads: 3

# ONT-specific
is_fastq: False

See config/config.yaml in the repository for the full list of options with descriptions.
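The platform-specific grouping above can be checked before launching a run. This is a rough sketch under stated assumptions rather than the pipeline's own validation: `check_config` is a hypothetical helper, it takes the configuration as an already-loaded dictionary, it assumes `ref`, `manifest`, and `platform` are the required keys, and it only warns about a platform-specific option set for the other platform, since how the pipeline treats that case is not stated here.

```python
VALID_PLATFORMS = {"pacbio", "ont"}

# Assumption: these three keys are required; the rest are optional.
REQUIRED_KEYS = ("ref", "manifest", "platform")

# Options the example config groups under a single platform.
PLATFORM_ONLY = {
    "consensus": "pacbio",
    "consensus_min_reads": "pacbio",
    "is_fastq": "ont",
}


def check_config(cfg):
    """Return (errors, warnings) for a config given as a plain dict."""
    errors, warnings = [], []
    for key in REQUIRED_KEYS:
        if key not in cfg:
            errors.append(f"missing required key: {key}")
    platform = cfg.get("platform")
    if platform is not None and platform not in VALID_PLATFORMS:
        errors.append(f"platform must be 'pacbio' or 'ont', got {platform!r}")
    for key, owner in PLATFORM_ONLY.items():
        # Only compare when the platform itself is valid.
        if key in cfg and platform in VALID_PLATFORMS and platform != owner:
            warnings.append(
                f"{key} is listed as {owner}-specific but platform is {platform}"
            )
    return errors, warnings
```

The dictionary can come from any YAML loader; the check itself does not depend on one.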

Key outputs

Aligned BAMs: Primary, supplementary, and unaligned reads with PCR duplicates marked (du and ds tags).
Decorated BAMs: Full-length reads with top/bottom strand designation (C-to-T as top strand, G-to-A as bottom strand). The strand is stored in the st tag.
Consensus BAMs (PacBio only): MSA consensus of full-length, strand-designated reads. The dc tag indicates the number of reads used to construct each consensus.
QC metrics: Targeting efficiency, deamination rates (overall and by 2-bp sequence context), strand calling, enzyme bias, and mutation rates.
HTML dashboard: results/{sample_name}/qc/{sample_name}.dashboard.html with all QC plots. The dashboard is self-contained (plots are embedded), so you can copy a single file for sharing or local viewing.
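The tags listed above (du, ds, st, dc) are ordinary SAM optional fields, so they can be summarized outside the dashboard. The sketch below tallies the values of one tag from SAM-format text lines, for instance the output of `samtools view`. `tally_tag` is a hypothetical helper, and the sketch makes no assumption about which values the pipeline writes into each tag; it simply counts whatever it finds, with `None` counting records that lack the tag.

```python
from collections import Counter


def tally_tag(sam_lines, tag):
    """Count values of one optional tag across SAM text records."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        value = None
        # SAM has 11 mandatory columns; optional TAG:TYPE:VALUE fields follow.
        for field in fields[11:]:
            parts = field.split(":", 2)
            if len(parts) == 3 and parts[0] == tag:
                value = parts[2]
                break
        counts[value] += 1
    return counts
```

Given the lines of a decorated BAM converted to text, `tally_tag(lines, "st")` yields the number of reads per strand designation, and `tally_tag(lines, "dc")` on a consensus BAM gives the distribution of reads per consensus.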

Downstream analysis

After QC, DAF-seq data can be processed with FiberHMM for nucleosome, MSP, and transcription factor footprint calling. FiberHMM is a Hidden Markov Model toolkit that operates natively on deaminase data (DddA and DddB) and emits fibertools-compatible BAMs plus Molecular-annotation spec tags. See the FiberHMM page for installation and usage.

Alternatively, the Fiber-seq nucleosome caller ft add-nucleosomes can be applied to DAF-seq data after first converting the deamination marks to m6A-equivalent format with ft ddda-to-m6a. This routes DAF-seq data through the Fiber-seq analysis stack; see the fibertools documentation for details.
