fibertools-rs dark logo

This is the book for the computational tools of Fiber-seq which describes the two major software tools:

fibertools (ft) which is a CLI tool for creating and interacting with Fiber-seq BAM files.
- the samtools of Fiber-seq
Fiber-seq Inferred Regulatory Elements which is a method for identifying regulatory elements on individual fibers and peak calling.
- the MACS2 of Fiber-seq

Some Key features include:

Predicting m6A sites from PacBio Fiber-seq data
FIRE (Fiber-seq Inferred Regulatory Elements)
Extracting Fiber-seq results into plain text files.
Centering Fiber-seq results around a given position.
pyft: Python bindings for fibertools

A quick start guide can be found here.

You can help improve this book! And it is a easy as clicking the edit button in the top right.

The quick start guide for analyzing Fiber-seq data

Fiber-seq starting with PacBio

HiFi kinetics are required for predicting m6A with fibertools. Check with your sequencing provider prior to sequencing to ensure that the output file will have kinetics. Additionally, many of fibertools commands are compatible with CpG methylation which can be completed on instrument (if requested) or later with Jasmine, e.g. jasmine --keep-kinetics input.ccs.bam output.ccs.bam. This command should be run prior to fibertools if CpG methylation information is desired as Jasmine will overwrite the m6A predictions in the MM and ML tags.

Predict m6A and infer nucleosomes

To create useable Fiber-seq data you must first call m6A base-mods on the PacBio CCS bam using fibertools. First install fibertools and then process your bam file using the prediction command.

ft predict-m6a -t 16 input.ccs.bam output.fiberseq.bam

This will both make m6A calls and identify nucleosomes on each fiber.

Note, the input CCS bam must have average kinetics to be able to call m6A.

Alignment and phasing

We recommend aligning with pbmm2 and phasing with HiPhase. Please see their respective documentation for more information.

Alternatively, we have written a snakemake pipeline to align and phase Fiber-seq data; however, this pipeline is not officially supported outside of our lab at this time.

After this point, you will have a Fiber-seq BAM file that is compatible with all the extraction commands in fibertools.

Fiber-seq peaks and UCSC browser tracks

Once you have a phased bam file, you can identify Fiber-seq inferred regulatory elements (FIREs) to call Fiber-seq peaks and make a UCSC trackHub. Please see the FIRE repository for more details.

Fiber-seq starting with Oxford Nanopore Technologies (ONT)

Predict m6A

ft predict-m6a does not include a model for ONT data; however, you can use software, such as Dorado, to add CpG and m6A to your ONT BAM file.

If you do use Dorado you must then filter the m6A calls with modkit using a tenth percentile cutoff for each flow-cell independently. This is the only way to get good m6A calls in our experience, and using any hard ML threshold will not hold between flow-cells. Here is an example command:

modkit call-mods -t 8 -p 0.1 input.dorado.bam filtered.dorado.bam

We also show how to apply this filter and call nucleosomes in one line in the next section.

Infer nucleosomes and MSPs

Once you have CpG and m6A information in your filtered ONT BAM file, you can use ft add-nucleosomes to infer nucleosomes and MSPs. With Dorado, we find the best results when restricting to the 90% of calls that dorado is most confident in as determined by modkit.

modkit call-mods -p 0.1 input.bam - | ft add-nucleosomes - output.bam

Alignment and phasing

You can either use Dorado to align your ONT data or use a tool like minimap2 to align your data. If you do use minimap2 be sure to include the flag -y to preserve the CpG and m6A information in the output BAM file.

We recommend using WhatsHap for phasing ONT data. Please see their documentation for more information.

After this point, you will have a Fiber-seq BAM file that is compatible with all the extraction commands in fibertools.

Fiber-seq peaks and UCSC browser tracks

Some users report reasonable success in applying the FIRE pipeline to ONT data. However, please note that FIRE models were not trained or validated for ONT data. With that said, we have developed a heuristic for FIRE that appears to work very well with ONT data. To enable this add ont: true to your config.yaml file when setting up your FIRE run.

fibertools-rs dark logo

Think of fibertools as the samtools of Fiber-seq data. A command line tool for creating, filtering, interacting, and extracting from Fiber-seq BAM files.

This chapter covers:

A complete list of subcommands and their help pages can be found here.

Creating a Fiber-seq bam file

This chapter coverers creating a Fiber-seq bam file from a raw PacBio bam file. Including m6A prediction, nucleosome positioning, and FIRE identification.

`ft predict-m6a`

This command predicts m6A sites from PacBio Fiber-seq data. The input is a PacBio HiFi BAM file with kinetics and the output is a BAM file with the predicted m6A sites encoded in the ML and MM tags.

Details on the algorithm for m6A prediction are described in this publication: https://doi.org/10.1101/gr.279095.124.

The help page

`ft add-nucleosomes`

ft add-nucleosomes adds nucleosome positions and MSP positions to the Fiber-seq bam. The input is a Fiber-seq bam file with m6A calls and the output is a Fiber-seq bam file with the nucleosome positions encoded in the ns and nl tags and the MSP positions encoded in the as and al tags.

It is usually unnecessary to run this command manually, as it is called by ft predict-m6a automatically. However, you can run it manually if you want to adjust the parameters of the nucleosome calling algorithm.

The help page

Method

Nucleosome calling is performed by identifying stretches of DNA that are protected from Hia5 (i.e. do not have m6A signal). We have found the rate of false positive m6A calls in nucleosomes is very low when using fibertools (Jha et al.) allowing for a heuristic to perform as well as or better than our previous HMM caller (Dubocanin et al.).

There are three parameters in our heuristic nucleosome calling that can be adjusted: the minimum nucleosome length (n, default 75), the minimum combined nucleosome length (c, default 100), and the minimum extension to nucleosome length (e, default 25). These three parameters impact the three phases in nucleosome calling.

Call all regions that have no m6A events for at least n bases a candidate nucleosome.
Call all regions of size c or more that have only one internal m6A (putative false positive) a candidate nucleosome.
Extend the length of nucleosomes identified in phases one and two if by spanning one additional m6A e bases of unmodified sequence would be added to the nucleosome length.

alt text

`ft fire`

This command identifies Fiber-seq Inferred Regulatory Elements (FIREs) from a Fiber-seq BAM. The input is a Fiber-seq BAM file with m6A and nucleosome calls and the output is a Fiber-seq bam file with the FIREs encoded in the aq tags.

This command can be run in isolation; however, it is usually preferable to run the FIRE pipeline, which runs ft fire and performs many additional analyses and visualizations.

The help page.

I just need the FIRE elements

If you only need FIRE elements in the BAM file and don't need the peak-calls or trackHubs that are part of the FIRE pipeline you can do a simplified run FIRE using only ft. e.g.:

# add FIREs to the BAM file
ft fire input.bam fire.bam
# Extract the FIREs from the BAM file into bed format
ft fire --extract fire.bam fire.bed.gz
# if you want the NUC calls and non-FIRE MSPs as well
ft fire --extract --all fire.bam all.bed.gz

Extracting from a `fibertools` bam

This chapter describes various subcommands that extract information from a fibertools bam file into plain text.

`ft extract`

Extract Fiber-seq data into plain text files

Inputs and options

See the help message for details.

Output description

All outputs to ft extract can be (and should be) compressed by simply adding the .gz extension. For example, ft extract input.bam --m6a m6a.bed.gz will output a compressed bed12 file. Use - to output to stdout, e.g. ft extract input.bam --m6a -.

Shared Output columns:

Column	Description
ct	Chromosome or contig
st	Start position of the read on the chromosome
en	End position of the read on the chromosome
fiber	The fiber/read name
score	The number of ccs passes for the read (rounded)
strand	The strand of the read alignment

Columns specific to the `--m6a`, `--cpg`, `--nuc`, and `--msp` outputs

All of these files are written in standard bed12 format. The first and last block in each the bed12 record do not reflect real data, and exist only to mark the start and end positions of the read. If you would like to convert these beds into bigBeds be sure to include -allow1bpOverlap in your command.

Column	Description
thick start	Same as the start (`st`)
thick end	Same as the end (`en`)
itemRgb	Color specifc to the datatype, e.g. m6a marks get a purple RGB
blockCount	The number of blocks in the bed12 record
blockSizes	A comma separated list of the lengths of each feature in the bed12 record
blockStarts	A comma separated list of the relative start positions of each block in the bed12 record

Columns specific to the `--all` output

Column	Description
sam_flag	The sam flag of the read alignment
HP	The haplotype tag for the read
RG	The read group tag for the read
fiber_length	The length of the read in bp
fiber_sequence	The sequence of the read
ec	The number of ccs passes for the read (no rounding)
rq	The estimated accuracy of the read
total_AT_bp	The total number of AT bp in the read
total_m6a_bp	The total number of m6a bp in the read
total_nuc_bp	The total number of nucleosome bp in the read
total_msp_bp	The total number of MSP bp in the read
total_5mC_bp	The total number of 5mC bp in the read
nuc_starts	The start positions of the nucleosomes in molecular coordinates (comma separated list)
nuc_lengths	The lengths of the nucleosomes in molecular coordinates (comma separated list)
ref_nuc_starts	The start positions of the nucleosomes in reference coordinates (comma separated list)
ref_nuc_lengths	The lengths of the nucleosomes in reference coordinates (comma separated list)
msp_starts	The start positions of the MSPs in molecular coordinates (comma separated list)
msp_lengths	The lengths of the MSPs in molecular coordinates (comma separated list)
fire	The quality score of the MSP as a FIRE element (if FIRE as been applied). Scores over 230 are FIRE elements (comma separated list, range 0-255)
ref_msp_starts	The start positions of the MSPs in reference coordinates (comma separated list)
ref_msp_lengths	The lengths of the MSPs in reference coordinates (comma separated list)
m6a	The start positions of the m6a in molecular coordinates (comma separated list)
ref_m6a	The start positions of the m6a in reference coordinates (comma separated list)
m6a_qual	The quality of the m6a positions (ML value, comma separated list)
5mC	The start positions of the 5mC in molecular coordinates (comma separated list)
ref_5mC	The start positions of the 5mC in reference coordinates (comma separated list)
5mC_qual	The quality of the 5mC positions (ML value, comma separated list)

Note positions in columns starting with ref_ maybe contain -1 (NA) values if the reference sequence has an insertion or deletion relative to the read sequence at that position.

`ft center`

This command centers Fiber-seq data around given reference positions. This is useful for making aggregate m6A and CpG observations, as well as visualization of SVs

Inputs and options

See the help message for details.

Output description

This command writes Fiber-seq data in a tab-delimited format to stdout that has been centered relative to positions specified in the input bed file.

Column	Description
chrom	Chromosome
centering_position	The position on the chromosome about which data is being centered.
strand	The strand of the centering position.
subset_sequence	The sequence of the read around the centering position.
reference_start	The start position of the read in the reference.
reference_end	The end position of the read in the reference.
query_name	The name of the sequencing read
RG	The read group the read belongs to
centered_query_start	The start position of the read relative to the centering position
centered_query_end	The end position of the read relative to the centering position
query_length	The length of the read

Additional columns specific to long (default) format

Column	Description
centered_position_type	The type of position being centered. One of: `m6A`, `5mC`, `nuc`, `msp`.
centered_start	The start position of the "feature" relative to the centering position
centered_end	The end position of the "feature" relative to the centering position
centered_quality	The quality of the "feature" relative to the centering position (ML value)

Additional columns specific to the wide format

Column	Description
centered_m6a_positions	A comma separated list of m6a positions
m6a_qual	The quality of the m6a positions (ML value)
centered_5mC_positions	A comma separated list of 5mC positions
5mC_qual	The quality of the 5mC positions (ML value)
centered_nuc_starts	A comma separated list of nuc starts
centered_nuc_ends	A comma separated list of nuc ends
centered_msp_starts	A comma separated list of msp starts
centered_msp_ends	A comma separated list of msp ends
query_sequence	The sequence of the read

Note if the --wide flag is used with the --reference flag some positions in the comma separated lists can be NA when the reference sequence has an insertion or deletion relative to the read sequence at that position.

`ft footprint`

Usage

ft footprint [OPTIONS] <BAM> --bed <BED> --yaml <YAML>

The BAM file is an indexed fiber-seq bam file.

The BED file is a bed file with the motifs you'd like to test for footprints. This should include the strand the motif is on.

The YAML file is a file that describes the modules within the motif that can be footprinted. e.g. a CTCF yaml with its multiple binding sites might look like:

modules:
  - [0, 8]
  - [8, 16]
  - [16, 23]
  - [23, 29]
  - [29, 35]

Modules must start at zero, end at the length of the motif, be sorted, and be contiguous with one another. At most 15 modules are allowed, and the intervals are 0-based, half-open (like BED).

See the help message for details.

Description of output columns

The footprinting output table is a tab-separated file with the same number of entries as the input BED file and the following columns:

Column	Description
chrom	Chromosome
start	The start position of the motif
end	The end position of the motif
strand	The strand of the motif.
n_spanning_fibers	The number of fibers that span the motif.
n_spanning_msps	The number of msp that span the motif.
n_overlapping_nucs	The number of fibers that have an intersecting nucleosome.
module_X	The number of fibers that are footprinted in module X. The number of module columns is determined by the footprinting yaml.
footprint_codes	Comma separated list of footprint codes for each fiber. See details below.
fire_quals	Comma separated list of fire qualities for each fiber. -1 if the MSP is not spanning or present. Note all fire_quals will be 0 or -1 if FIRE has not been applied to the bam.
fiber_names	Comma separated list of fiber names that span the motif. Names share the same index as the previous column, so they can be matched with footprint codes.

Footprint codes

The footprint codes are an encoded bit flag similar to how filtering is done with samtools. If the first bit is set (1) then there is an MSP that spans the footprint. For each following bit, the bit is set if that module is footprinted by that fiber.

Here are some examples in python for how you could test a footprint code in a few ways:

fp_code = 0b1001 # this is a value of 9, but in binary it is 1001

# test if the first bit is set, there is a spanning MSP, true in this example
(fp_code & 1) > 0

# test if the first module is footprinted, false in this example
(fp_code & (1 << 1)) > 0 

# test if the third module is footprinted, true in this example
(fp_code & (1 << 3)) > 0

`ft-pileup`

The ft-pileup command is used to generate a per-base pileup of Fiber-seq data. This command is useful for visualizing the distribution of various features across the genome.

Inputs and options

See the help message for details.

`ft-qc`

The ft-qc command is used to generate quality control metrics for a Fiber-seq BAM file.

Inputs and options

See the help message for details.

Help pages for fibertools subcommands

`ft`

Fiber-seq toolkit in rust

Usage: ft [OPTIONS] <COMMAND>

Commands:
  predict-m6a       Predict m6A positions using HiFi kinetics data and encode the results in the MM and
                        ML bam tags. Also adds nucleosome (nl, ns) and MTase sensitive patches (al, as)
                        [aliases: m6A, m6a]
  add-nucleosomes   Add nucleosomes to a bam file with m6a predictions
  fire              Add FIREs (Fiber-seq Inferred Regulatory Elements) to a bam file with m6a
                        predictions
  extract           Extract fiberseq data into plain text files [aliases: ex, e]
  center            This command centers fiberseq data around given reference positions. This is useful
                        for making aggregate m6A and CpG observations, as well as visualization of SVs
                        [aliases: c, ct]
  footprint         Infer footprints from fiberseq data
  qc                Collect QC metrics from a fiberseq bam file
  track-decorators  Make decorated bed files for fiberseq data
  pileup            Make a pileup track of Fiber-seq features from a FIRE bam
  clear-kinetics    Remove HiFi kinetics tags from the input bam file
  strip-basemods    Strip out select base modifications
  ddda-to-m6a       Convert a DddA BAM file to pseudo m6A BAM file
  help              Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

`ft predict-m6a`

Predict m6A positions using HiFi kinetics data and encode the results in the MM and ML bam tags. Also adds
nucleosome (nl, ns) and MTase sensitive patches (al, as)

Usage: ft predict-m6a [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output bam file with m6A calls in new/extended MM and ML bam tags [default: -]

Options:
  -n, --nucleosome-length <NUCLEOSOME_LENGTH>
          Minium nucleosome length [default: 75]
  -c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH>
          Minium nucleosome length when combining over a single m6A [default: 100]
      --min-distance-added <MIN_DISTANCE_ADDED>
          Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
  -d, --distance-from-end <DISTANCE_FROM_END>
          Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
  -k, --keep
          Keep hifi kinetics data
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

Developer-Options:
      --force-min-ml-score <FORCE_MIN_ML_SCORE>  Force a different minimum ML score
      --all-calls                                Keep all m6A calls regardless of how low the ML value is
  -b, --batch-size <BATCH_SIZE>                  Number of reads to include in batch prediction [default: 1]

`ft fire`

Add FIREs (Fiber-seq Inferred Regulatory Elements) to a bam file with m6a predictions

Usage: ft fire [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output file (BAM by default, table of MSP features if `--feats-to-text` is used, and bed9 + if
         `--extract`` is used) [default: -]

Options:
      --ont
          Use a ONT heuristic adjustment for FIRE calling. This adjusts the observed number of m6A counts by
          adding pseudo counts to account for the single stranded nature of ONT data [env: ONT=]
  -e, --extract
          Output just FIRE elements in bed9 format
      --all
          When extracting bed9 format include all MSPs and nucleosomes
  -f, --feats-to-text
          Output FIREs features for training in a table format
  -s, --skip-no-m6a
          Don't write reads with no m6A calls to the output bam
      --min-msp <MIN_MSP>
          Skip reads without at least `N` MSP calls [env: MIN_MSP=] [default: 0]
      --min-ave-msp-size <MIN_AVE_MSP_SIZE>
          Skip reads without an average MSP size greater than `N` [env: MIN_AVE_MSP_SIZE=] [default: 0]
  -w, --width-bin <WIDTH_BIN>
          Width of bin for feature collection [env: WIDTH_BIN=] [default: 40]
  -b, --bin-num <BIN_NUM>
          Number of bins to collect [env: BIN_NUM=] [default: 9]
      --best-window-size <BEST_WINDOW_SIZE>
          Calculate stats for the highest X bp window within each MSP Should be a fair amount higher than
          the expected linker length [env: BEST_WINDOW_SIZE=] [default: 100]
      --min-msp-length-for-positive-fire-call <MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL>
          Minium length of msp to call a FIRE [env: MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL=] [default: 85]
      --model <MODEL>
          Optional path to a model json file. If not provided ft will use the default model (recommended)
          [env: FIRE_MODEL=]
      --fdr-table <FDR_TABLE>
          Optional path to a FDR table [env: FDR_TABLE=]
  -h, --help
          Print help
  -V, --version
          Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

`ft extract`

Extract fiberseq data into plain text files

Usage: ft extract [OPTIONS] [BAM]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]

Options:
  -r, --reference  Report in reference sequence coordinates
      --molecular  Report positions in the molecular sequence coordinates
      --m6a <M6A>  Output path for m6a bed12
  -c, --cpg <CPG>  Output path for 5mC (CpG, primrose) bed12
      --msp <MSP>  Output path for methylation sensitive patch (msp) bed12
  -n, --nuc <NUC>  Output path for nucleosome bed12
  -a, --all <ALL>  Output path for a tabular format including "all" fiberseq information in the bam
  -h, --help       Print help (see more with '--help')
  -V, --version    Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

All-Format-Options:
  -q, --quality   Include per base quality scores in "fiber_qual"
  -s, --simplify  Simplify output by removing fiber sequence

`ft center`

This command centers fiberseq data around given reference positions. This is useful for making aggregate m6A
and CpG observations, as well as visualization of SVs

Usage: ft center [OPTIONS] --bed <BED> [BAM]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]

Options:
  -b, --bed <BED>    Bed file on which to center fiberseq reads. Data is adjusted to the start position of
                     the bed file and corrected for strand if the strand is indicated in the 6th column of
                     the bed file. The 4th column will also be checked for the strand but only after the 6th
                     is. If you include strand information in the 4th (or 6th) column it will orient data
                     accordingly and use the end position of bed record instead of the start if on the minus
                     strand. This means that profiles of motifs in both the forward and minus orientation
                     will align to the same central position
  -d, --dist <DIST>  Set a maximum distance from the start of the motif to keep a feature
  -w, --wide         Provide data in wide format, one row per read
  -r, --reference    Return relative reference position instead of relative molecular position
  -s, --simplify     Replace the sequence output column with just "N"
  -h, --help         Print help (see more with '--help')
  -V, --version      Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

`ft add-nucleosomes`

Add nucleosomes to a bam file with m6a predictions

Usage: ft add-nucleosomes [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output bam file with nucleosome calls [default: -]

Options:
  -n, --nucleosome-length <NUCLEOSOME_LENGTH>
          Minium nucleosome length [default: 75]
  -c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH>
          Minium nucleosome length when combining over a single m6A [default: 100]
      --min-distance-added <MIN_DISTANCE_ADDED>
          Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
  -d, --distance-from-end <DISTANCE_FROM_END>
          Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
  -h, --help
          Print help
  -V, --version
          Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

`ft footprint`

Infer footprints from fiberseq data

Usage: ft footprint [OPTIONS] --bed <BED> --yaml <YAML> [BAM]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]

Options:
  -b, --bed <BED>    BED file with the regions to footprint. Should all contain the same motif with proper
                     strand information, and ideally be ChIP-seq peaks
  -y, --yaml <YAML>  yaml describing the modules of the footprint
  -o, --out <OUT>    Output bam [default: -]
  -h, --help         Print help
  -V, --version      Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

`ft clear-kinetics`

Remove HiFi kinetics tags from the input bam file

Usage: ft clear-kinetics [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output bam file without hifi kinetics [default: -]

Options:
  -h, --help     Print help
  -V, --version  Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

`ft strip-basemods`

Strip out select base modifications

Usage: ft strip-basemods [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output bam file [default: -]

Options:
  -b, --basemod <BASEMOD>  base modification to strip out of the bam file [possible values: m6A, 6mA, 5mC,
                           CpG]
      --ml-m6a <ML_M6A>    filter out m6A modifications with less than this ML value [default: 0]
      --ml-5mc <ML_5MC>    filter out 5mC modifications with less than this ML value [default: 0]
      --drop-forward       Drop forward strand of base modifications
      --drop-reverse       Drop reverse strand of base modifications
  -h, --help               Print help
  -V, --version            Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

`ft track-decorators`

Make decorated bed files for fiberseq data

Usage: ft track-decorators [OPTIONS] --bed12 <BED12> [BAM]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]

Options:
  -b, --bed12 <BED12>          Output path for bed12 file to be decorated
  -d, --decorator <DECORATOR>  Output path for decorator bed file [default: -]
  -h, --help                   Print help
  -V, --version                Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

`ft pileup`

Make a pileup track of Fiber-seq features from a FIRE bam

Usage: ft pileup [OPTIONS] [BAM] [RGN]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [RGN]  Region string to make a pileup of. e.g. chr1:1-1000 or chr1:1-1,000 If not provided will make a
         pileup of the whole genome

Options:
  -o, --out <OUT>                  Output file [default: -]
  -m, --m6a                        include m6A calls
  -c, --cpg                        include 5mC calls
      --haps                       For each column add two new columns with the hap1 and hap2 specific data
  -k, --keep-zeros                 Keep zero coverage regions
  -p, --per-base                   Write output one base at a time even if the values do not change
      --fiber-coverage             Calculate coverage starting from the first MSP/NUC to the last MSP/NUC
                                   position instead of the complete span of the read alignment
      --shuffle <SHUFFLE>          Shuffle the fiber-seq data according to a bed file of the shuffled
                                   positions of the fiber-seq data
      --rolling-max <ROLLING_MAX>  Output a rolling max of the score column over X bases
      --no-msp                     No MSP columns
      --no-nuc                     No NUC columns
  -h, --help                       Print help (see more with '--help')
  -V, --version                    Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

`ft qc`

Collect QC metrics from a fiberseq bam file

Usage: ft qc [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output text file with QC metrics. The format is a tab-separated file with the following columns:
         "statistic\tvalue\tcount" where "statistic" is the name of the metric, "value" is the value of the
         metric, and "count" is the number of times the metric was observed [default: -]

Options:
      --acf                                Calculate the auto-correlation function of the m6A marks in the
                                           fiber-seq data
      --acf-max-lag <ACF_MAX_LAG>          maximum lag for the ACF calculation [default: 250]
      --acf-min-m6a <ACF_MIN_M6A>          Minimum number of m6A marks to use a read in the ACF calculation
                                           [default: 100]
      --acf-max-reads <ACF_MAX_READS>      maximum number of reads to use in the ACF calculation [default:
                                           10000]
      --acf-sample-rate <ACF_SAMPLE_RATE>  After sampling the first "acf-max-reads" randomly sample one of
                                           every "acf-sample-rate" reads and replace one of the previous
                                           reads at random [default: 100]
  -m, --m6a-per-msp                        In the output include a measure of the number of m6A events per
                                           MSPs of a given size. The output format is:
                                           "m6a_per_msp_size\t{m6A count},{MSP size},{is a FIRE}\t{count}"
                                           e.g. "m6a_per_msp_size\t35,100,false\t100"
      --n-reads <N_READS>                  Only process the first "n" reads in the input bam file
  -h, --help                               Print help
  -V, --version                            Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

pyft: Python bindings for `fibertools`

See the documentation for pyft on readthedocs for more information.

Installation

This chapter covers the various ways to install fibertools.

It is easiest to install fibertools using conda (instructions). However, for the fastest m6A predictions you need to install with libtorch enabled (instructions) if you would like to install from source, you can use the following instructions.

From `bioconda`

fibertools-rs is avalible through bioconda and can be installed with the following command:

mamba install -c conda-forge -c bioconda fibertools-rs

However, due to size constraints in bioconda this version does not support contain the pytorch libraries or GPU acceleration for m6A predictions. m6A predictions will still work in the bioconda version but may be much slower. If you would like to use m6A prediction and GPU acceleration, you will need to install using the directions here.

From `crates.io`

Installation from crates.io requires the rust package manager cargo. You can find how to install cargo here.. Furthermore, a recent version of gcc and cmake is required. I have tested and recommend gcc v10.2.0 and cmake v3.21.1, though other versions may work.

cargo install fibertools-rs

From GitHub (active development)

Using cargo from source:

cargo install --git https://github.com/fiberseq/fibertools-rs

or using git form source:

git clone https://github.com/fiberseq/fibertools-rs
cd fibertools-rs
cargo build --release
./target/release/ft --help

With libtorch

Get libtorch v2.2.0 from the PyTorch website and extract the content of the zip file.

On Linux/Unix system you can download with:
- wget https://download.pytorch.org/libtorch/cu118/libtorch-shared-with-deps-2.2.0%2Bcu118.zip
On macOS you can download with:
- wget https://download.pytorch.org/libtorch/cpu/libtorch-macos-2.2.0.zip
Windows is not supported and will not be.

Then add the following to your .bashrc or equivalent, where /path/to/libtorch is the path to the directory that was created when unzipping the file:

export LIBTORCH_CXX11_ABI=0
export LIBTORCH=/path/to/libtorch # e.g. export LIBTORCH=/Users/mrvollger/lib/libtorch
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
export DYLD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH

Finally install using cargo:

cargo install --all-features fibertools-rs

FIRE: descriptions, methods, and outputs

The code for running the FIRE pipeline can be found on GitHub, and if you do use FIRE in your work please cite our publication:

Vollger, M. R., Swanson, E. G., Neph, S. J., Ranchalis, J., Munson, K. M., Ho, C.-H., Sedeño-Cortés, A. E., Fondrie, W. E., Bohaczuk, S. C., Mao, Y., Parmalee, N. L., Mallory, B. J., Harvey, W. T., Kwon, Y., Garcia, G. H., Hoekzema, K., Meyer, J. G., Cicek, M., Eichler, E. E., … Stergachis, A. B. (2024). A haplotype-resolved view of human gene regulation. bioRxiv. https://doi.org/10.1101/2024.06.14.599122

Summary of Fiber-seq inferred regulatory elements (FIREs)

For those who would prefer video I have recorded a lab meeting discussing FIRE and the methods used here. If you prefer to read please continue on.

FIREs are MTase sensitive patches (MSPs) that are inferred to be regulatory elements on single chromatin fibers. To do this we used semi-supervised machine learning to identify MSPs that are likely to be regulatory elements using the Mokapot framework and XGBoost. Every individual FIRE element is associated with an estimated precision value, which indicates the probability that the FIRE element is a true regulatory element. The estimated precision of FIREs elements are created using Mokapot and validation data not used in training. We train our model targeting FIRE elements with at least 95% precision, MSPs with less than 90% precision are considered to have average level of accessibility expected between two nucleosomes, and are referred to as linker regions.

Semi-superivized machine learning with Mokapot requires a mixed-positive training set and a clean negative training set. To create mixed positive training data we selected MSPs that overlapped DNase hypersensitive sites (DHSs) and CTCF ChIP-seq peaks. And to create a clean negative training set we selected MSPs that did not overlap DHSs or CTCF ChIP-seq peaks.

For details on running the FIRE workflow see:

Running and installing

For details on the outputs of the FIRE workflow see:

Outputs

Running the FIRE pipeline

The FIRE pipeline is a Snakemake workflow for calling Fiber-seq Inferred Regulatory Elements (FIREs) on single molecules and peak calling with Fiber-seq.

⚠️ WARNING

This pipeline generates peak-calls and trackHubs for whole genome Fiber-seq datasets. If instead you only want to add FIRE elements to a BAM file see: ft fire. You will also want to use ft fire if your dataset is not whole genome, as the FIRE pipeline requires a whole genome dataset for the FDR calibration used in peak-calling.

Install

Please start by installing pixi which handles the environment of the FIRE workflow.

Then install FIRE using git and pixi:

git clone https://github.com/fiberseq/FIRE.git
cd FIRE
pixi install

We then recommend quickly testing your installation by running the test suite:

pixi run test

We recommend setting a Snakemake conda prefix in your bashrc, e.g. in the Stergachis lab add:

export SNAKEMAKE_CONDA_PREFIX=/mmfs1/gscratch/stergachislab/snakemake-conda-envs

Then Snakemake installs all the additional requirements as conda envs in that directory.

If you wish to distribute jobs across a cluster you may need to install the appropriate snakemake executor plugin. The SLURM executor is included in the environment (snakemake-executor-slurm)

Configuring

See the configuration README, the example configuration file, and the example manifest file for configuration options.

⚠️ ONT users

If you are using ONT please follow the preprocessing steps outlined here, and then be sure to include ont: true in your FIRE config file (e.g. config.yaml). This will ensure that the correct parameters are used for the ft fire command.

Run

The FIRE workflow can be executed using the pixi run fire command. Under the hood this runs a snakemake workflow and any extra parameters are passed directly to snakemake. For example:

pixi run fire --configfile config/config.yaml

If you want to do a dry run:

pixi run fire --configfile config/config.yaml -n

If you want to execute across a cluster (modify profiles/slurm-executor as needed for distributed execution):

pixi run fire --configfile config/config.yaml --profile profiles/slurm-executor

And if you want to run the FIRE workflow from another directory you can do so with:

pixi run --manifest-path /path/to/snakemake/pixi.toml fire ...

where you update /path/to/snakemake/pixi.toml to the path of the pixi.toml in the cloned FIRE repository.

You can also run snakemake directly, e.g.:

pixi shell
snakemake \
  --configfile config/config.yaml \
  --profile profiles/slurm-executor \
  --local-cores 8 -k

Outputs of the FIRE pipeline

The FIRE pipeline returns outputs for a sample in the directory results/{sample}/ and files are labeled with the sample ({sample}) and the version of the FIRE pipeline ({v}). The following files and directories are generated:

Output	Description
{sample}-fire-{v}-filtered.cram	CRAM file containing the all the data used in the FIRE pipeline.
{sample}-fire-{v}-peaks.bed.gz	BED file containing the FIRE peaks calls
{sample}-fire-{v}-hap-differences.bed.gz	BED file containing the results of searching for haplotype-selective peaks.
{sample}-fire-{v}-pileup.bed.gz	BED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more.
{sample}-fire-{v}-qc.tbl.gz	Table containing quality control metrics for the FIRE CRAM.
trackHub-{v}/	Directory containing a UCSC trackHub for visualizing all the results of the FIRE pipeline.
additional-outputs-{v}/	Directory containing additional outputs from the FIRE pipeline.

More details on the individual outputs

The `{sample}-fire-{v}-filtered.cram` file

The CRAM file contains all the data used in the FIRE pipeline. It is a CRAM file that can be viewed with IGV or other genome browsers. Sequencing quality scores are removed from the CRAM file to reduce the file size since per base quality scores are not used in the FIRE pipeline, as well as reads with insufficient m6A signal. The CRAM file is sorted and indexed.

The `{sample}-fire-{v}-peaks.bed.gz` file

This is the peak file for the FIRE method. Peaks are called by identifying FIRE score (methods) local-maxima that have FDR values below a threshold. By default, the pipeline reports peaks at a 5% FDR threshold. Once a local-maxima is identified, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculate and report wide peaks in the additional-outputs/ by taking the union of the FIRE peaks and all regions below the FDR threshold and then merging resulting regions that are within one nucleosome (147 bp) of one another.

The FIRE peaks file has the following columns:

Column	Description
#chrom	Chromosome of the peak
peak*start	Start of the peak
peak_end	End of the peak
start	Start of the maximum of the peak
end	End of the maximum of the peak
coverage	Coverage of the peak
fire_coverage	Coverage of the FIREs in the peak
score	FIRE score of the peak (see methods)
nuc_coverage	Coverage of the nucleosomes in the peak
msp_coverage	Coverage of the MSPs in the peak
.**{H1,H2}	Repeats of previous columns but specific for the two haplotypes
FDR	False discovery rate of the peak
log_FDR	-10*log10 of the FDR
FIRE_size_mean	Mean size of the FIREs in the peak
FIRE_size_ssd	Standard deviation of the size of the FIREs in the peak
FIRE_start_ssd	Standard deviation of the start of the FIREs in the peak
FIRE_end_ssd	Standard deviation of the end of the FIREs in the peak
pass_coverage	Whether the peak passes coverage filters

The `{sample}-fire-{v}-hap-differences.bed.gz` file

This file primarily contains the same columns as the FIRE peaks file but additionally has a p_value column with the results of a Fisher's exact test for the difference in coverage between the two haplotypes, and a p_adjust column with the Benjamini-Hochberg adjusted p-value. See the methods for more details.

The `{sample}-fire-{v}-pileup.bed.gz` file

This is a BED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more. The columns are calculated using ft-pileup and more details can be found in the ft-pileup documentation.

The `{sample}-fire-{v}-qc.tbl.gz` file

This file contains quality control metrics for the FIRE CRAM. The results are directly created by ft-qc and more details can be found in the ft-qc documentation.

The `trackHub-{v}/` directory

The trackHub-{v}/ directory contains a UCSC trackHub for visualizing all the results of the FIRE pipeline. A description of the trackHub can be found in trackHub/fire-description.html. The trackHub can be loaded into UCSC by uploading the trackHub directory to a public facing website and then loading the hub.txt's URL into the UCSC trackHub browser.

A copy of the trackHub description can be found here.

The `additional-outputs-{v}/` directory

The additional-outputs-{v}/ directory contains the following files: TODO

FIRE methods

This chapter describes the methods and training used in the FIRE pipeline. There are two major components, identifying per-molecule FIRE elements, and aggregating these elements into peaks.

Training of Fire-seq inferred regulatory elements.

Training data

For training data, we generated 21 different GM12878 Fiber-seq experiments with a range of under- and over-methylated experimental conditions to ensure we captured a broad range of percent m6A (5.8-13.3%) to ensure our model could generalize to new samples with varying levels of m6A. We merged these sequencing results and randomly selected 10% of the Fiber-seq reads from 100,000 randomly selected DNase I and CTCF ChIP-seq peaks for mixed-positive labels and 100,000 equally sized regions that did not overlap DNase or CTCF Chip-seq for negative labels. We then generated features for each of these MSPs, including length, log fold enrichment of m6A, the A/T content, and windowed measures of m6A around the MSP (Supplemental Table 3) and held out 20% of the Fiber-seq reads to be used as test data.

Semi-supervised training

To carry out semi-supervised training, we used an established method, Mokapot (Fondrie & Noble, 2021; Käll et al., 2007), which we summarize below. In the first round of semi-supervised training, Mokapot identifies the feature that best discriminates between our mixed-positive and negative labels and then selects a threshold for that feature such that the mixed-positive labels can be discriminated from the negative labels with 95% estimated precision (defined below). The subset of mixed-positive labels above this threshold is then used as an initial set of positive labels in training an XGBoost model with five-fold cross-validation (Chen & Guestrin, 2016). Then this process is iteratively repeated, using the learned prediction from the previous iteration’s model to create positive labels at 95% estimated precision, until the number of positive identifications at 95% precision in the validation set ceases to increase (15 iterations, Fig. S1) (Fondrie & Noble, 2021; Käll et al., 2007).

Estimated precision of individual FIRE elements

We cannot compute the precision associated with a particular XGBoost score because we do not have access to a set of clean-positive labels. Instead, we define a notion of “estimated precision” using a balanced held-out test set of mixed-positive and negative labels (20% of the data). We defined the estimated precision (EP) of a FIRE element to be

$$ EP = 1 - \frac{FP+1}{TMP} $$

where TMP is the number of “true” identifications from the mixed positive labels with at least that element’s XGBoost score, and FP is the number of false positive identifications from negative labels with at least that element’s XGBoost score. We add a pseudo count of one to the numerator of false positive identifications so as to prevent liberal estimates for smaller collections of identifications (Fondrie & Noble, 2021).

Features of FIRE elements

The following features were used as features with Mokapot in FIRE element classification:

Feature	Description
msp_len	Length of the MSP
msp_len_times_m6a_fc	Length of the MSP times the fold-change of m6A in the MSP
ccs_passes	Number of CCS passes in the MSP
fiber_m6a_count	Number of m6A sites in the Fiber-seq read
fiber_AT_count	Number of AT sites in the Fiber-seq read
fiber_m6a_frac	Fraction of m6A sites in the Fiber-seq read
msp_m6a	Number of m6A sites in the MSP
msp_AT	Number of AT sites in the MSP
msp_m6a_frac	Fraction of m6A sites in the MSP
msp_fc	Fold-change of m6A fraction in the MSP relative to the whole reads m6A fraction
bin_X_m6a_count	Number of m6A sites in the Xth 40 bp window
bin_X_AT_count	Number of AT sites in the Xth 40 bp window
bin_X_m6a_frac	Fraction of m6A sites in the Xth 40 bp window
bin_X_m6a_fc	Fold-change of m6A fraction in the Xth 40 bp window relative to the whole reads m6A fraction
{best, worst}_m6a_count_X	Number of m6A sites in the 100 bp window with the most (best) or least (worst) m6A sites
{best, worst}_AT_count_X	Number of AT sites in the 100 bp window with the most (best) or least (worst) AT sites
{best, worst}_m6a_frac_X	Fraction of m6A sites in the 100 bp window with the most (best) or least (worst) m6A sites
{best, worst}_m6a_fc_X	Fold-change of m6A fraction in the 100 bp window with the most (best) or least (worst) m6A sites relative to the whole reads m6A fraction

There are nine 40 bp windows for each MSP (X = 1, 2, ..., 9), with the 5th window centered on the MSP.

Aggregation of FIRE elements for peak calling

Aggregate FIRE score calculation

The FIRE score ($S_g$) for a position in the genome ($g$) is calculated using the following formula:

$$ S_g=-\frac{50}{R_g} \sum_{i=1}^{C_g} \log_{10}(1-min(EP_i , 0.99)) $$

where $C_g$ is the number of FIRE elements at the $g$th position, $R_g$ is the number of Fiber-seq reads at the $g$th position, and $EP_i$ is the estimated precision of the $i$th FIRE element at the $g$th position. The estimated precision of each FIRE element is thresholded at 0.99, such that the FIRE score takes on values between 0 and 100. Regions covered by less than four FIRE elements (i.e., if $C_g < 4$) are not scored and are given a value of negative one.

Regions of unreliable coverage

Regions with unreliable coverage are defined as regions with Fiber-seq coverage that deviate from the median coverage by five standard deviations, where a standard deviation is defined by the Poisson distribution (i.e., the square root of the mean coverage). This is a parameter that can be adjusted by the user.

FIRE score FDR calculation

We shuffle the location of an entire Fiber-seq read by selecting a random start position within the chromosome and relocating the entire read to that start position. Fiber-seq reads originating from regions with unreliable coverage are not shuffled, and reads from regions with reliable coverage are not shuffled into regions with unreliable coverage (bedtools shuffle -chrom -excl, v2.31.0). We then compute FIRE scores associated with the shuffled genome. Recalling that the FIRE score for the $g$th position in the genome is denoted $S_g$, we divide the number of bases that have shuffled FIRE scores above $S_g$ by the number of bases that have un-shuffled FIRE scores above $S_g$. This provides an estimate of the FDR associated with the FIRE score $S_g$.

Peak calling

Peaks are called by identifying FIRE score local-maxima that have FDR values below a 5% threshold and that exceed an optional percent actuation threshold $C_g / R_g$. Adjacent local maxima that share 50% of the underlying FIRE elements or have 90% reciprocal overlap are merged into a single peak, using the higher of the two local maxima. Then, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculated and reported wide peaks by taking the union of the FIRE peaks and all regions below the FDR threshold and then merged the resulting regions that were within one nucleosome (147 bp) of one another.

Haplotype-selective peaks

For peaks with at least 10 Fiber-seq reads on both haplotypes, we test the difference in percent actuation in each haplotype (fraction of reads with FIRE elements) using a two-sided Fisher’s exact test. Specifically, the inputs for the test are the number of FIRE elements in haplotype one, the number of Fiber-seq reads without FIRE elements in haplotype one, the number of FIRE elements in haplotype two, and the number of Fiber-seq reads without FIRE elements in haplotype two. We then apply an FDR correction (Benjamini-Hochberg) to correct for multiple testing.

Glossary of Fiber-seq terms

Fiber-seq protocol

A treatment of DNA chromatin fibers with the Hia5 enzyme which preferentially methylates accessible adenosines. The fibers are then sequenced using a long-read platform that can detect the m6A modifications.

Fiber-seq read or fiber

A DNA fiber that has been sequenced using the Fiber-seq protocol and has m6A modifications called.

m6A

N6-methyladenosine: A DNA modification made during the Fiber-seq protocol on accessible adenosines. Native m6A sites are exceedingly rare or non-existent in eukaryotic DNA.

Inferred nucleosome

A nucleosome inferred from the Fiber-seq data. The nucleosome position along the DNA is inferred by the ft add-nucleosomes algorithm using the m6A modifications (not a direct observation of a nucleosome).

MSP

Methyltransferase sensitive patch: A stretch of a Fiber-seq read that has a high density of m6A sites. Specifically it is defined as a region that is inferred to not be occluded by a nucleosome.

FIRE

Fiber-seq Inferred Regulatory Element: A MSP that is inferred to be a regulatory element based on features of the MSP including m6A density.

Internucleosomal linker region

Any MSPs that is not a FIRE element.

The Fiber-seq BAM format

The Fiber-seq BAM format adds to m6A BAM data from a PacBio (ft predict-m6a) or ONT (dorado) sequencing run with m6A.

The following tags are added to the BAM file:

ns:B:I,: Nucleosome start sites (0-based) on the forward strand of the sequencing read (u32).
nl:B:I,: Equal length array to ns of nucleosome lengths (u32).
as:B:I,: MSP start sites (0-based) on the forward strand of the sequencing read (u32).
al:B:I,: Equal length array to as of MSP lengths (u32).
aq:B:C,: Quality scores for the MSP [0, 255] (u8). This tag is optional and added by ft fire.

The ns/as/nl/al tags are with respect to the unaligned sequencing read (forward strand) and are not affected by alignment or the orientation of the read after alignment. This also means that the starts and lengths encoded in these tags may change when ft lifts these molecular coordinates to reference coordinates.

The ns or as tag do not need to begin or end at the start or end of the read; however, once begun, they must be contiguous. i.e. the ns and as tags must combine to form a contiguous set of alternating nucleosome and MSP sites. The ns, nl, as, and al tags are added to the BAM file automatically using ft predict-m6a or later using the ft add-nucleosomes command.

The aq tag is added using the ft fire command and represents the estimated precision of the MSP being a FIRE. Specifically, the estimated precision of a FIRE is the value of the aq tag divided by 255.

Data Access

Broadly Consented Data Downloads

We provide access to broadly consented Fiber-seq data that can be used by researchers for various genomic studies.

Available Datasets

Our broadly consented datasets include the following samples:

CHM1: Complete hydatidiform mole cell line
CHM13: Complete hydatidiform mole cell line (GRCh38 alignment)
CHM13_DSA: CHM13 data aligned against the DSA (T2Tv2.0)
GM12878: Lymphoblastoid cell line from Coriell Cell Repository
HG002: Human genome reference sample (Ashkenazi Jewish trio son)
K562: Chronic myelogenous leukemia cell line

All samples are processed with FIRE (Fiber-seq Inferred Regulatory Elements).

Data Formats

All datasets are provided in standard formats:

Fiber-seq CRAM files: Primary data format containing m6A calls and nucleosome positions
Processed FIRE outputs: Summary statistics, peaks, and aggregated results

Download Instructions

Browse Available Data

You can explore all available samples using the AWS CLI:

aws s3 ls --no-sign-request --endpoint-url https://s3.kopah.uw.edu s3://stergachis/public/FIRE/broadly-consented/

Download Data

To download specific samples or datasets, use the AWS CLI with the following pattern:

aws s3 sync --no-sign-request --endpoint-url https://s3.kopah.uw.edu s3://stergachis/public/FIRE/broadly-consented/[SAMPLE_NAME]/ ./[LOCAL_DIRECTORY]/

For example, to download CHM13 data:

aws s3 sync --no-sign-request --endpoint-url https://s3.kopah.uw.edu s3://stergachis/public/FIRE/broadly-consented/CHM13/ ./CHM13_data/

Data Use Guidelines

When using our broadly consented data, please:

Cite the original studies and data contributors

Cite

If you use fibertools in your research, please cite the following paper:

Jha, A., Bohaczuk, S. C., Mao, Y., Ranchalis, J., Mallory, B. J., Min, A. T., Hamm, M. O., Swanson, E., Dubocanin, D., Finkbeiner, C., Li, T., Whittington, D., Noble, W. S., Stergachis, A. B., & Vollger, M. R. (2024). DNA-m6A calling and integrated long-read epigenetic and genetic analysis with fibertools. Genome Research. https://doi.org/10.1101/gr.279095.124

If you use FIRE in your research, please cite the following paper:

Vollger, M. R., Swanson, E. G., Neph, S. J., Ranchalis, J., Munson, K. M., Ho, C.-H., Sedeño-Cortés, A. E., Fondrie, W. E., Bohaczuk, S. C., Mao, Y., Parmalee, N. L., Mallory, B. J., Harvey, W. T., Kwon, Y., Garcia, G. H., Hoekzema, K., Meyer, J. G., Cicek, M., Eichler, E. E., … Stergachis, A. B. (2024). A haplotype-resolved view of human gene regulation. bioRxiv. https://doi.org/10.1101/2024.06.14.599122

Analyses of Fiber-seq data

This notebook contains some of the prepublication analyses of Fiber-seq data we have done:

ONT Fiber-seq Analysis

Keyboard shortcuts

The computational guide to Fiber-seq