fibertools-rs dark logo

This is the book for the computational tools of Fiber-seq which describes the two major software tools:

  1. fibertools (ft) which is a CLI tool for creating and interacting with Fiber-seq BAM files.
  2. Fiber-seq Inferred Regulatory Elements which is a method for identifying regulatory elements on individual fibers and peak calling.

Some Key features include:

  • Predicting m6A sites from PacBio Fiber-seq data
  • FIRE (Fiber-seq Inferred Regulatory Elements)
  • Extracting Fiber-seq results into plain text files.
  • Centering Fiber-seq results around a given position.
  • pyft: Python bindings for fibertools

A quick start guide can be found here.

You can help improve this book! And it is a easy as clicking the edit button in the top right.

The quick start guide for analyzing Fiber-seq data

Fiber-seq starting with PacBio

HiFi kinetics are required for predicting m6A with fibertools. Check with your sequencing provider prior to sequencing to ensure that the output file will have kinetics. Additionally, many of fibertools commands are compatible with CpG methylation which can be completed on instrument (if requested) or later with Jasmine, e.g. jasmine --keep-kinetics input.ccs.bam output.ccs.bam. This command should be run prior to fibertools if CpG methylation information is desired as Jasmine will overwrite the m6A predictions in the MM and ML tags.

Predict m6A and infer nucleosomes

To create useable Fiber-seq data you must first call m6A base-mods on the PacBio CCS bam using fibertools. First install fibertools and then process your bam file using the prediction command.

ft predict-m6a -t 16 input.ccs.bam output.fiberseq.bam 

This will both make m6A calls and identify nucleosomes on each fiber.

Note, the input CCS bam must have average kinetics to be able to call m6A.

Alignment and phasing

We recommend aligning with pbmm2 and phasing with HiPhase. Please see their respective documentation for more information.

Alternatively, we have written a snakemake pipeline to align and phase Fiber-seq data; however, this pipeline is not officially supported outside of our lab at this time.

After this point, you will have a Fiber-seq BAM file that is compatible with all the extraction commands in fibertools.

Fiber-seq peaks and UCSC browser tracks

Once you have a phased bam file, you can identify Fiber-seq inferred regulatory elements (FIREs) to call Fiber-seq peaks and make a UCSC trackHub. Please see the FIRE repository for more details.

Fiber-seq starting with Oxford Nanopore Technologies (ONT)

Predict m6A

ft predict-m6a does not include a model for ONT data; however, you can use software, such as Dorado, to add CpG and m6A to your ONT BAM file.

If you do use Dorado you must then filter the m6A calls with modkit using a tenth percentile cutoff for each flow-cell independently. This is the only way to get good m6A calls in our experience, and using any hard ML threshold will not hold between flow-cells. Here is an example command:

modkit call-mods -t 8 -p 0.1 input.dorado.bam filtered.dorado.bam

We also show how to apply this filter and call nucleosomes in one line in the next section.

Infer nucleosomes and MSPs

Once you have CpG and m6A information in your filtered ONT BAM file, you can use ft add-nucleosomes to infer nucleosomes and MSPs. With Dorado, we find the best results when restricting to the 90% of calls that dorado is most confident in as determined by modkit.

modkit call-mods -p 0.1 input.bam - | ft add-nucleosomes - output.bam

Alignment and phasing

You can either use Dorado to align your ONT data or use a tool like minimap2 to align your data. If you do use minimap2 be sure to include the flag -y to preserve the CpG and m6A information in the output BAM file.

We recommend using WhatsHap for phasing ONT data. Please see their documentation for more information.

After this point, you will have a Fiber-seq BAM file that is compatible with all the extraction commands in fibertools.

Fiber-seq peaks and UCSC browser tracks

Some users report reasonable success in applying the FIRE pipeline to ONT data. However, please note that FIRE models were not trained or validated for ONT data. With that said, we have developed a heuristic for FIRE that appears to work very well with ONT data. To enable this add ont: true to your config.yaml file when setting up your FIRE run.

fibertools-rs dark logo

Actions Status Conda (channel only) Downloads crates.io version crates.io downloads DOI

Think of fibertools as the samtools of Fiber-seq data. A command line tool for creating, filtering, interacting, and extracting from Fiber-seq BAM files.

This chapter covers:

A complete list of subcommands and their help pages can be found here.

Creating a Fiber-seq bam file

This chapter coverers creating a Fiber-seq bam file from a raw PacBio bam file. Including m6A prediction, nucleosome positioning, and FIRE identification.

ft predict-m6a

This command predicts m6A sites from PacBio Fiber-seq data. The input is a PacBio HiFi BAM file with kinetics and the output is a BAM file with the predicted m6A sites encoded in the ML and MM tags.

Details on the algorithm for m6A prediction are described in this publication: https://doi.org/10.1101/gr.279095.124.

The help page

ft add-nucleosomes

ft add-nucleosomes adds nucleosome positions and MSP positions to the Fiber-seq bam. The input is a Fiber-seq bam file with m6A calls and the output is a Fiber-seq bam file with the nucleosome positions encoded in the ns and nl tags and the MSP positions encoded in the as and al tags.

It is usually unnecessary to run this command manually, as it is called by ft predict-m6a automatically. However, you can run it manually if you want to adjust the parameters of the nucleosome calling algorithm.

The help page

Method

Nucleosome calling is performed by identifying stretches of DNA that are protected from Hia5 (i.e. do not have m6A signal). We have found the rate of false positive m6A calls in nucleosomes is very low when using fibertools (Jha et al.) allowing for a heuristic to perform as well as or better than our previous HMM caller (Dubocanin et al.).

There are three parameters in our heuristic nucleosome calling that can be adjusted: the minimum nucleosome length (n, default 75), the minimum combined nucleosome length (c, default 100), and the minimum extension to nucleosome length (e, default 25). These three parameters impact the three phases in nucleosome calling.

  • Call all regions that have no m6A events for at least n bases a candidate nucleosome.
  • Call all regions of size c or more that have only one internal m6A (putative false positive) a candidate nucleosome.
  • Extend the length of nucleosomes identified in phases one and two if by spanning one additional m6A e bases of unmodified sequence would be added to the nucleosome length.

alt text

ft fire

This command identifies Fiber-seq Inferred Regulatory Elements (FIREs) from a Fiber-seq BAM. The input is a Fiber-seq BAM file with m6A and nucleosome calls and the output is a Fiber-seq bam file with the FIREs encoded in the aq tags.

This command can be run in isolation; however, it is usually preferable to run the FIRE pipeline, which runs ft fire and performs many additional analyses and visualizations.

The help page.

Extracting from a FIRE BAM

ft fire can also be used as an extraction tool to extract Fiber-seq data from an already processed FIRE BAM file.

ft fire --extract fire.bam > all.bed

This produces a file in BED format that contains all the MSPs, FIREs, and nucleosomes in the FIRE BAM file. This command produces output analogous to the now removed model.results.bed.gz result from older versions of FIRE pipeline.

Extracting from a fibertools bam

This chapter describes various subcommands that extract information from a fibertools bam file into plain text.

ft extract

Extract Fiber-seq data into plain text files

Inputs and options

See the help message for details.

Output description

All outputs to ft extract can be (and should be) compressed by simply adding the .gz extension. For example, ft extract input.bam --m6a m6a.bed.gz will output a compressed bed12 file. Use - to output to stdout, e.g. ft extract input.bam --m6a -.

Shared Output columns:

ColumnDescription
ctChromosome or contig
stStart position of the read on the chromosome
enEnd position of the read on the chromosome
fiberThe fiber/read name
scoreThe number of ccs passes for the read (rounded)

Columns specific to the --m6a, --cpg, --nuc, and --msp outputs

All of these files are written in standard bed12 format. The first and last block in each the bed12 record do not reflect real data, and exist only to mark the start and end positions of the read. If you would like to convert these beds into bigBeds be sure to include -allow1bpOverlap in your command.

ColumnDescription
thick startSame as the start (st)
thick endSame as the end (en)
itemRgbColor specifc to the datatype, e.g. m6a marks get a purple RGB
blockCountThe number of blocks in the bed12 record
blockSizesA comma separated list of the lengths of each feature in the bed12 record
blockStartsA comma separated list of the relative start positions of each block in the bed12 record

Columns specific to the --all output

ColumnDescription
sam_flagThe sam flag of the read alignment
HPThe haplotype tag for the read
RGThe read group tag for the read
fiber_lengthThe length of the read in bp
fiber_sequenceThe sequence of the read
ecThe number of ccs passes for the read (no rounding)
rqThe estimated accuracy of the read
total_AT_bpThe total number of AT bp in the read
total_m6a_bpThe total number of m6a bp in the read
total_nuc_bpThe total number of nucleosome bp in the read
total_msp_bpThe total number of MSP bp in the read
total_5mC_bpThe total number of 5mC bp in the read
nuc_startsThe start positions of the nucleosomes in molecular coordinates
nuc_lengthsThe lengths of the nucleosomes in molecular coordinates
ref_nuc_startsThe start positions of the nucleosomes in reference coordinates
ref_nuc_lengthsThe lengths of the nucleosomes in reference coordinates
msp_startsThe start positions of the MSPs in molecular coordinates
msp_lengthsThe lengths of the MSPs in molecular coordinates
ref_msp_startsThe start positions of the MSPs in reference coordinates
ref_msp_lengthsThe lengths of the MSPs in reference coordinates
m6aThe start positions of the m6a in molecular coordinates
ref_m6aThe start positions of the m6a in reference coordinates
m6a_qualThe quality of the m6a positions (ML value)
5mCThe start positions of the 5mC in molecular coordinates
ref_5mCThe start positions of the 5mC in reference coordinates
5mC_qualThe quality of the 5mC positions (ML value)

Note positions in columns starting with ref_ maybe contain -1 (NA) values if the reference sequence has an insertion or deletion relative to the read sequence at that position.

ft center

This command centers Fiber-seq data around given reference positions. This is useful for making aggregate m6A and CpG observations, as well as visualization of SVs

Inputs and options

See the help message for details.

Output description

This command writes Fiber-seq data in a tab-delimited format to stdout that has been centered relative to positions specified in the input bed file.

ColumnDescription
chromChromosome
centering_positionThe position on the chromosome about which data is being centered.
strandThe strand of the centering position.
subset_sequenceThe sequence of the read around the centering position.
reference_startThe start position of the read in the reference.
reference_endThe end position of the read in the reference.
query_nameThe name of the sequencing read
RGThe read group the read belongs to
centered_query_startThe start position of the read relative to the centering position
centered_query_endThe end position of the read relative to the centering position
query_lengthThe length of the read

Additional columns specific to long (default) format

ColumnDescription
centered_position_typeThe type of position being centered. One of: m6A, 5mC, nuc, msp.
centered_startThe start position of the "feature" relative to the centering position
centered_endThe end position of the "feature" relative to the centering position
centered_qualityThe quality of the "feature" relative to the centering position (ML value)

Additional columns specific to the wide format

ColumnDescription
centered_m6a_positionsA comma separated list of m6a positions
m6a_qualThe quality of the m6a positions (ML value)
centered_5mC_positionsA comma separated list of 5mC positions
5mC_qualThe quality of the 5mC positions (ML value)
centered_nuc_startsA comma separated list of nuc starts
centered_nuc_endsA comma separated list of nuc ends
centered_msp_startsA comma separated list of msp starts
centered_msp_endsA comma separated list of msp ends
query_sequenceThe sequence of the read

Note if the --wide flag is used with the --reference flag some positions in the comma separated lists can be NA when the reference sequence has an insertion or deletion relative to the read sequence at that position.

ft footprint

Usage

ft footprint [OPTIONS] <BAM> --bed <BED> --yaml <YAML>

The BAM file is an indexed fiber-seq bam file.

The BED file is a bed file with the motifs you'd like to test for footprints. This should include the strand the motif is on.

The YAML file is a file that describes the modules within the motif that can be footprinted. e.g. a CTCF yaml with its multiple binding sites might look like:

modules:
  - [0, 8]
  - [8, 16]
  - [16, 23]
  - [23, 29]
  - [29, 35]

Modules must start at zero, end at the length of the motif, be sorted, and be contiguous with one another. At most 15 modules are allowed, and the intervals are 0-based, half-open (like BED).

See the help message for details.

Description of output columns

The footprinting output table is a tab-separated file with the same number of entries as the input BED file and the following columns:

ColumnDescription
chromChromosome
startThe start position of the motif
endThe end position of the motif
strandThe strand of the motif.
n_spanning_fibersThe number of fibers that span the motif.
n_spanning_mspsThe number of msp that span the motif.
n_overlapping_nucsThe number of fibers that have an intersecting nucleosome.
module_XThe number of fibers that are footprinted in module X. The number of module columns is determined by the footprinting yaml.
footprint_codesComma separated list of footprint codes for each fiber. See details below.
fire_qualsComma separated list of fire qualities for each fiber. -1 if the MSP is not spanning or present. Note all fire_quals will be 0 or -1 if FIRE has not been applied to the bam.
fiber_namesComma separated list of fiber names that span the motif. Names share the same index as the previous column, so they can be matched with footprint codes.

Footprint codes

The footprint codes are an encoded bit flag similar to how filtering is done with samtools. If the first bit is set (1) then there is an MSP that spans the footprint. For each following bit, the bit is set if that module is footprinted by that fiber.

Here are some examples in python for how you could test a footprint code in a few ways:

fp_code = 0b1001 # this is a value of 9, but in binary it is 1001

# test if the first bit is set, there is a spanning MSP, true in this example
(fp_code & 1) > 0

# test if the first module is footprinted, false in this example
(fp_code & (1 << 1)) > 0 

# test if the third module is footprinted, true in this example
(fp_code & (1 << 3)) > 0

ft-pileup

The ft-pileup command is used to generate a per-base pileup of Fiber-seq data. This command is useful for visualizing the distribution of various features across the genome.

Inputs and options

See the help message for details.

ft-qc

The ft-qc command is used to generate quality control metrics for a Fiber-seq BAM file.

Inputs and options

See the help message for details.

Help pages for fibertools subcommands

ft

Fiber-seq toolkit in rust

Usage: ft [OPTIONS] <COMMAND>

Commands:
  predict-m6a       Predict m6A positions using HiFi kinetics data and encode the results in the MM and
                        ML bam tags. Also adds nucleosome (nl, ns) and MTase sensitive patches (al, as)
                        [aliases: m6A, m6a]
  add-nucleosomes   Add nucleosomes to a bam file with m6a predictions
  fire              Add FIREs (Fiber-seq Inferred Regulatory Elements) to a bam file with m6a
                        predictions
  extract           Extract fiberseq data into plain text files [aliases: ex, e]
  center            This command centers fiberseq data around given reference positions. This is useful
                        for making aggregate m6A and CpG observations, as well as visualization of SVs
                        [aliases: c, ct]
  footprint         Infer footprints from fiberseq data
  qc                Collect QC metrics from a fiberseq bam file
  track-decorators  Make decorated bed files for fiberseq data
  pileup            Make a pileup track of Fiber-seq features from a FIRE bam
  clear-kinetics    Remove HiFi kinetics tags from the input bam file
  strip-basemods    Strip out select base modifications
  ddda-to-m6a       Convert a DddA BAM file to pseudo m6A BAM file
  help              Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

ft predict-m6a

Predict m6A positions using HiFi kinetics data and encode the results in the MM and ML bam tags. Also adds
nucleosome (nl, ns) and MTase sensitive patches (al, as)

Usage: ft predict-m6a [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output bam file with m6A calls in new/extended MM and ML bam tags [default: -]

Options:
  -n, --nucleosome-length <NUCLEOSOME_LENGTH>
          Minium nucleosome length [default: 75]
  -c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH>
          Minium nucleosome length when combining over a single m6A [default: 100]
      --min-distance-added <MIN_DISTANCE_ADDED>
          Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
  -d, --distance-from-end <DISTANCE_FROM_END>
          Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
  -k, --keep
          Keep hifi kinetics data
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

Developer-Options:
      --force-min-ml-score <FORCE_MIN_ML_SCORE>  Force a different minimum ML score
      --all-calls                                Keep all m6A calls regardless of how low the ML value is
  -b, --batch-size <BATCH_SIZE>                  Number of reads to include in batch prediction [default: 1]

ft fire

Add FIREs (Fiber-seq Inferred Regulatory Elements) to a bam file with m6a predictions

Usage: ft fire [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output file (BAM by default, table of MSP features if `--feats-to-text` is used, and bed9 + if
         `--extract`` is used) [default: -]

Options:
      --ont
          Use a ONT heuristic adjustment for FIRE calling. This adjusts the observed number of m6A counts by
          adding pseudo counts to account for the single stranded nature of ONT data [env: ONT=]
  -e, --extract
          Output just FIRE elements in bed9 format
      --all
          When extracting bed9 format include all MSPs and nucleosomes
  -f, --feats-to-text
          Output FIREs features for training in a table format
  -s, --skip-no-m6a
          Don't write reads with no m6A calls to the output bam
      --min-msp <MIN_MSP>
          Skip reads without at least `N` MSP calls [env: MIN_MSP=] [default: 0]
      --min-ave-msp-size <MIN_AVE_MSP_SIZE>
          Skip reads without an average MSP size greater than `N` [env: MIN_AVE_MSP_SIZE=] [default: 0]
  -w, --width-bin <WIDTH_BIN>
          Width of bin for feature collection [env: WIDTH_BIN=] [default: 40]
  -b, --bin-num <BIN_NUM>
          Number of bins to collect [env: BIN_NUM=] [default: 9]
      --best-window-size <BEST_WINDOW_SIZE>
          Calculate stats for the highest X bp window within each MSP Should be a fair amount higher than
          the expected linker length [env: BEST_WINDOW_SIZE=] [default: 100]
      --min-msp-length-for-positive-fire-call <MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL>
          Minium length of msp to call a FIRE [env: MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL=] [default: 85]
      --model <MODEL>
          Optional path to a model json file. If not provided ft will use the default model (recommended)
          [env: FIRE_MODEL=]
      --fdr-table <FDR_TABLE>
          Optional path to a FDR table [env: FDR_TABLE=]
  -h, --help
          Print help
  -V, --version
          Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

ft extract

Extract fiberseq data into plain text files

Usage: ft extract [OPTIONS] [BAM]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]

Options:
  -r, --reference  Report in reference sequence coordinates
      --molecular  Report positions in the molecular sequence coordinates
      --m6a <M6A>  Output path for m6a bed12
  -c, --cpg <CPG>  Output path for 5mC (CpG, primrose) bed12
      --msp <MSP>  Output path for methylation sensitive patch (msp) bed12
  -n, --nuc <NUC>  Output path for nucleosome bed12
  -a, --all <ALL>  Output path for a tabular format including "all" fiberseq information in the bam
  -h, --help       Print help (see more with '--help')
  -V, --version    Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

All-Format-Options:
  -q, --quality   Include per base quality scores in "fiber_qual"
  -s, --simplify  Simplify output by removing fiber sequence

ft center

This command centers fiberseq data around given reference positions. This is useful for making aggregate m6A
and CpG observations, as well as visualization of SVs

Usage: ft center [OPTIONS] --bed <BED> [BAM]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]

Options:
  -b, --bed <BED>    Bed file on which to center fiberseq reads. Data is adjusted to the start position of
                     the bed file and corrected for strand if the strand is indicated in the 6th column of
                     the bed file. The 4th column will also be checked for the strand but only after the 6th
                     is. If you include strand information in the 4th (or 6th) column it will orient data
                     accordingly and use the end position of bed record instead of the start if on the minus
                     strand. This means that profiles of motifs in both the forward and minus orientation
                     will align to the same central position
  -d, --dist <DIST>  Set a maximum distance from the start of the motif to keep a feature
  -w, --wide         Provide data in wide format, one row per read
  -r, --reference    Return relative reference position instead of relative molecular position
  -s, --simplify     Replace the sequence output column with just "N"
  -h, --help         Print help (see more with '--help')
  -V, --version      Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

ft add-nucleosomes

Add nucleosomes to a bam file with m6a predictions

Usage: ft add-nucleosomes [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output bam file with nucleosome calls [default: -]

Options:
  -n, --nucleosome-length <NUCLEOSOME_LENGTH>
          Minium nucleosome length [default: 75]
  -c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH>
          Minium nucleosome length when combining over a single m6A [default: 100]
      --min-distance-added <MIN_DISTANCE_ADDED>
          Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
  -d, --distance-from-end <DISTANCE_FROM_END>
          Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
  -h, --help
          Print help
  -V, --version
          Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

ft footprint

Infer footprints from fiberseq data

Usage: ft footprint [OPTIONS] --bed <BED> --yaml <YAML> [BAM]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]

Options:
  -b, --bed <BED>    BED file with the regions to footprint. Should all contain the same motif with proper
                     strand information, and ideally be ChIP-seq peaks
  -y, --yaml <YAML>  yaml describing the modules of the footprint
  -o, --out <OUT>    Output bam [default: -]
  -h, --help         Print help
  -V, --version      Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

ft clear-kinetics

Remove HiFi kinetics tags from the input bam file

Usage: ft clear-kinetics [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output bam file without hifi kinetics [default: -]

Options:
  -h, --help     Print help
  -V, --version  Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

ft strip-basemods

Strip out select base modifications

Usage: ft strip-basemods [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output bam file [default: -]

Options:
  -b, --basemod <BASEMOD>  base modification to strip out of the bam file [possible values: m6A, 6mA, 5mC,
                           CpG]
      --ml-m6a <ML_M6A>    filter out m6A modifications with less than this ML value [default: 0]
      --ml-5mc <ML_5MC>    filter out 5mC modifications with less than this ML value [default: 0]
      --drop-forward       Drop forward strand of base modifications
      --drop-reverse       Drop reverse strand of base modifications
  -h, --help               Print help
  -V, --version            Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

ft track-decorators

Make decorated bed files for fiberseq data

Usage: ft track-decorators [OPTIONS] --bed12 <BED12> [BAM]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]

Options:
  -b, --bed12 <BED12>          Output path for bed12 file to be decorated
  -d, --decorator <DECORATOR>  Output path for decorator bed file [default: -]
  -h, --help                   Print help
  -V, --version                Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

ft pileup

Make a pileup track of Fiber-seq features from a FIRE bam

Usage: ft pileup [OPTIONS] [BAM] [RGN]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [RGN]  Region string to make a pileup of. e.g. chr1:1-1000 or chr1:1-1,000 If not provided will make a
         pileup of the whole genome

Options:
  -o, --out <OUT>                  Output file [default: -]
  -m, --m6a                        include m6A calls
  -c, --cpg                        include 5mC calls
      --haps                       For each column add two new columns with the hap1 and hap2 specific data
  -k, --keep-zeros                 Keep zero coverage regions
  -p, --per-base                   Write output one base at a time even if the values do not change
      --fiber-coverage             Calculate coverage starting from the first MSP/NUC to the last MSP/NUC
                                   position instead of the complete span of the read alignment
      --shuffle <SHUFFLE>          Shuffle the fiber-seq data according to a bed file of the shuffled
                                   positions of the fiber-seq data
      --rolling-max <ROLLING_MAX>  Output a rolling max of the score column over X bases
      --no-msp                     No MSP columns
      --no-nuc                     No NUC columns
  -h, --help                       Print help (see more with '--help')
  -V, --version                    Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

ft qc

Collect QC metrics from a fiberseq bam file

Usage: ft qc [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
         file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output text file with QC metrics. The format is a tab-separated file with the following columns:
         "statistic\tvalue\tcount" where "statistic" is the name of the metric, "value" is the value of the
         metric, and "count" is the number of times the metric was observed [default: -]

Options:
      --acf                                Calculate the auto-correlation function of the m6A marks in the
                                           fiber-seq data
      --acf-max-lag <ACF_MAX_LAG>          maximum lag for the ACF calculation [default: 250]
      --acf-min-m6a <ACF_MIN_M6A>          Minimum number of m6A marks to use a read in the ACF calculation
                                           [default: 100]
      --acf-max-reads <ACF_MAX_READS>      maximum number of reads to use in the ACF calculation [default:
                                           10000]
      --acf-sample-rate <ACF_SAMPLE_RATE>  After sampling the first "acf-max-reads" randomly sample one of
                                           every "acf-sample-rate" reads and replace one of the previous
                                           reads at random [default: 100]
  -m, --m6a-per-msp                        In the output include a measure of the number of m6A events per
                                           MSPs of a given size. The output format is:
                                           "m6a_per_msp_size\t{m6A count},{MSP size},{is a FIRE}\t{count}"
                                           e.g. "m6a_per_msp_size\t35,100,false\t100"
      --n-reads <N_READS>                  Only process the first "n" reads in the input bam file
  -h, --help                               Print help
  -V, --version                            Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
                                 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to
                                 nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
                                 filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
                                 Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
                                 Filtering expressions support len() and qual() functions over msp, nuc,
                                 m6a, cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env:
                                 FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging

pyft: Python bindings for fibertools

Documentation Status PyPI version

See the documentation for pyft on readthedocs for more information.

Installation

This chapter covers the various ways to install fibertools.

It is easiest to install fibertools using conda (instructions). However, for the fastest m6A predictions you need to install with libtorch enabled (instructions) if you would like to install from source, you can use the following instructions.

From bioconda Conda

fibertools-rs is avalible through bioconda and can be installed with the following command:

mamba install -c conda-forge -c bioconda fibertools-rs

However, due to size constraints in bioconda this version does not support contain the pytorch libraries or GPU acceleration for m6A predictions. m6A predictions will still work in the bioconda version but may be much slower. If you would like to use m6A prediction and GPU acceleration, you will need to install using the directions here.

From crates.io crates.io

Installation from crates.io requires the rust package manager cargo. You can find how to install cargo here.. Furthermore, a recent version of gcc and cmake is required. I have tested and recommend gcc v10.2.0 and cmake v3.21.1, though other versions may work.

cargo install fibertools-rs

From GitHub (active development)

Using cargo from source:

cargo install --git https://github.com/fiberseq/fibertools-rs

or using git form source:

git clone https://github.com/fiberseq/fibertools-rs
cd fibertools-rs
cargo build --release
./target/release/ft --help

With libtorch

Get libtorch v2.2.0 from the PyTorch website and extract the content of the zip file.

  • On Linux/Unix system you can download with:
    • wget https://download.pytorch.org/libtorch/cu118/libtorch-shared-with-deps-2.2.0%2Bcu118.zip
  • On macOS you can download with:
    • wget https://download.pytorch.org/libtorch/cpu/libtorch-macos-2.2.0.zip
  • Windows is not supported and will not be.

Then add the following to your .bashrc or equivalent, where /path/to/libtorch is the path to the directory that was created when unzipping the file:

export LIBTORCH_CXX11_ABI=0
export LIBTORCH=/path/to/libtorch # e.g. export LIBTORCH=/Users/mrvollger/lib/libtorch
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
export DYLD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH

Finally install using cargo:

cargo install --all-features fibertools-rs

FIRE: descriptions, methods, and outputs

The code for running the FIRE pipeline can be found on GitHub, and if you do use FIRE in your work please cite our publication:

Vollger, M. R., Swanson, E. G., Neph, S. J., Ranchalis, J., Munson, K. M., Ho, C.-H., Sedeño-Cortés, A. E., Fondrie, W. E., Bohaczuk, S. C., Mao, Y., Parmalee, N. L., Mallory, B. J., Harvey, W. T., Kwon, Y., Garcia, G. H., Hoekzema, K., Meyer, J. G., Cicek, M., Eichler, E. E., … Stergachis, A. B. (2024). A haplotype-resolved view of human gene regulation. bioRxiv. https://doi.org/10.1101/2024.06.14.599122

Summary of Fiber-seq inferred regulatory elements (FIREs)

For those who would prefer video I have recorded a lab meeting discussing FIRE and the methods used here. If you prefer to read please continue on.

FIREs are MTase sensitive patches (MSPs) that are inferred to be regulatory elements on single chromatin fibers. To do this we used semi-supervised machine learning to identify MSPs that are likely to be regulatory elements using the Mokapot framework and XGBoost. Every individual FIRE element is associated with an estimated precision value, which indicates the probability that the FIRE element is a true regulatory element. The estimated precision of FIREs elements are created using Mokapot and validation data not used in training. We train our model targeting FIRE elements with at least 95% precision, MSPs with less than 90% precision are considered to have average level of accessibility expected between two nucleosomes, and are referred to as linker regions.

Semi-superivized machine learning with Mokapot requires a mixed-positive training set and a clean negative training set. To create mixed positive training data we selected MSPs that overlapped DNase hypersensitive sites (DHSs) and CTCF ChIP-seq peaks. And to create a clean negative training set we selected MSPs that did not overlap DHSs or CTCF ChIP-seq peaks.

For details on running the FIRE workflow see:

For details on the outputs of the FIRE workflow see:

Running the FIRE pipeline

The FIRE pipeline is a Snakemake workflow for calling Fiber-seq Inferred Regulatory Elements (FIREs) on single molecules and peak calling with Fiber-seq.

Install

Please start by installing pixi which handles the environment of the FIRE workflow.

Then install FIRE using git and pixi:

git clone https://github.com/fiberseq/FIRE.git
cd FIRE
pixi install

We then recommend quickly testing your installation by running the test suite:

pixi run test

We recommend setting a Snakemake conda prefix and the Apptainer cache directory in your bashrc, e.g. in the Stergachis lab add:

export SNAKEMAKE_CONDA_PREFIX=/mmfs1/gscratch/stergachislab/snakemake-conda-envs
export APPTAINER_CACHEDIR=/mmfs1/gscratch/stergachislab/snakemake-conda-envs/apptainer-cache

Then Snakemake installs all the additional requirements as conda envs in that directory.

If you wish to distribute jobs across a cluster you may need to install the appropriate snakemake executor plugin. The SLURM executor is included in the environment (snakemake-executor-slurm)

Configuring

See the configuration README, the example configuration file, and the example manifest file for configuration options.

Run

The FIRE workflow can be executed using the pixi run fire command. Under the hood this runs a snakemake workflow and any extra parameters are passed directly to snakemake. For example:

pixi run fire --configfile config/config.yaml

If you want to do a dry run:

pixi run fire --configfile config/config.yaml -n

If you want to execute across a cluster (modify profiles/slurm-executor as needed for distributed execution):

pixi run fire --configfile config/config.yaml --profile profiles/slurm-executor

And if you want to run the FIRE workflow from another directory you can do so with:

pixi run --manifest-path /path/to/snakemake/pixi.toml fire ...

where you update /path/to/snakemake/pixi.toml to the path of the pixi.toml in the cloned FIRE repository.

You can also run snakemake directly, e.g.:

pixi shell
snakemake \
  --configfile config/config.yaml \
  --profile profiles/slurm-executor \
  --local-cores 8 -k

Outputs of the FIRE pipeline

The FIRE pipeline returns outputs for a sample in the directory results/{sample}/ and files are labeled with the sample ({sample}) and the version of the FIRE pipeline ({v}). The following files and directories are generated:

OutputDescription
{sample}-fire-{v}-filtered.cramCRAM file containing the all the data used in the FIRE pipeline.
{sample}-fire-{v}-peaks.bed.gzBED file containing the FIRE peaks calls
{sample}-fire-{v}-hap-differences.bed.gzBED file containing the results of searching for haplotype-selective peaks.
{sample}-fire-{v}-pileup.bed.gzBED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more.
{sample}-fire-{v}-qc.tbl.gzTable containing quality control metrics for the FIRE CRAM.
trackHub-{v}/Directory containing a UCSC trackHub for visualizing all the results of the FIRE pipeline.
additional-outputs-{v}/Directory containing additional outputs from the FIRE pipeline.

More details on the individual outputs

The {sample}-fire-{v}-filtered.cram file

The CRAM file contains all the data used in the FIRE pipeline. It is a CRAM file that can be viewed with IGV or other genome browsers. Sequencing quality scores are removed from the CRAM file to reduce the file size since per base quality scores are not used in the FIRE pipeline, as well as reads with insufficient m6A signal. The CRAM file is sorted and indexed.

The {sample}-fire-{v}-peaks.bed.gz file

This is the peak file for the FIRE method. Peaks are called by identifying FIRE score (methods) local-maxima that have FDR values below a threshold. By default, the pipeline reports peaks at a 5% FDR threshold. Once a local-maxima is identified, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculate and report wide peaks in the additional-outputs/ by taking the union of the FIRE peaks and all regions below the FDR threshold and then merging resulting regions that are within one nucleosome (147 bp) of one another.

The FIRE peaks file has the following columns:

ColumnDescription
#chromChromosome of the peak
peak*startStart of the peak
peak_endEnd of the peak
startStart of the maximum of the peak
endEnd of the maximum of the peak
coverageCoverage of the peak
fire_coverageCoverage of the FIREs in the peak
scoreFIRE score of the peak (see methods)
nuc_coverageCoverage of the nucleosomes in the peak
msp_coverageCoverage of the MSPs in the peak
.**{H1,H2}Repeats of previous columns but specific for the two haplotypes
FDRFalse discovery rate of the peak
log_FDR-10*log10 of the FDR
FIRE_size_meanMean size of the FIREs in the peak
FIRE_size_ssdStandard deviation of the size of the FIREs in the peak
FIRE_start_ssdStandard deviation of the start of the FIREs in the peak
FIRE_end_ssdStandard deviation of the end of the FIREs in the peak
pass_coverageWhether the peak passes coverage filters

The {sample}-fire-{v}-hap-differences.bed.gz file

This file primarily contains the same columns as the FIRE peaks file but additionally has a p_value column with the results of a Fisher's exact test for the difference in coverage between the two haplotypes, and a p_adjust column with the Benjamini-Hochberg adjusted p-value. See the methods for more details.

The {sample}-fire-{v}-pileup.bed.gz file

This is a BED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more. The columns are calculated using ft-pileup and more details can be found in the ft-pileup documentation.

The {sample}-fire-{v}-qc.tbl.gz file

This file contains quality control metrics for the FIRE CRAM. The results are directly created by ft-qc and more details can be found in the ft-qc documentation.

The trackHub-{v}/ directory

The trackHub-{v}/ directory contains a UCSC trackHub for visualizing all the results of the FIRE pipeline. A description of the trackHub can be found in trackHub/fire-description.html. The trackHub can be loaded into UCSC by uploading the trackHub directory to a public facing website and then loading the hub.txt's URL into the UCSC trackHub browser.

A copy of the trackHub description can be found here.

The additional-outputs-{v}/ directory

The additional-outputs-{v}/ directory contains the following files: TODO

FIRE methods

This chapter describes the methods and training used in the FIRE pipeline. There are two major components, identifying per-molecule FIRE elements, and aggregating these elements into peaks.

Training of Fire-seq inferred regulatory elements.

Training data

For training data, we generated 21 different GM12878 Fiber-seq experiments with a range of under- and over-methylated experimental conditions to ensure we captured a broad range of percent m6A (5.8-13.3%) to ensure our model could generalize to new samples with varying levels of m6A. We merged these sequencing results and randomly selected 10% of the Fiber-seq reads from 100,000 randomly selected DNase I and CTCF ChIP-seq peaks for mixed-positive labels and 100,000 equally sized regions that did not overlap DNase or CTCF Chip-seq for negative labels. We then generated features for each of these MSPs, including length, log fold enrichment of m6A, the A/T content, and windowed measures of m6A around the MSP (Supplemental Table 3) and held out 20% of the Fiber-seq reads to be used as test data.

Semi-supervised training

To carry out semi-supervised training, we used an established method, Mokapot (Fondrie & Noble, 2021; Käll et al., 2007), which we summarize below. In the first round of semi-supervised training, Mokapot identifies the feature that best discriminates between our mixed-positive and negative labels and then selects a threshold for that feature such that the mixed-positive labels can be discriminated from the negative labels with 95% estimated precision (defined below). The subset of mixed-positive labels above this threshold is then used as an initial set of positive labels in training an XGBoost model with five-fold cross-validation (Chen & Guestrin, 2016). Then this process is iteratively repeated, using the learned prediction from the previous iteration’s model to create positive labels at 95% estimated precision, until the number of positive identifications at 95% precision in the validation set ceases to increase (15 iterations, Fig. S1) (Fondrie & Noble, 2021; Käll et al., 2007).

Estimated precision of individual FIRE elements

We cannot compute the precision associated with a particular XGBoost score because we do not have access to a set of clean-positive labels. Instead, we define a notion of “estimated precision” using a balanced held-out test set of mixed-positive and negative labels (20% of the data). We defined the estimated precision (EP) of a FIRE element to be

$$ EP = 1 - \frac{FP+1}{TMP} $$

where TMP is the number of “true” identifications from the mixed positive labels with at least that element’s XGBoost score, and FP is the number of false positive identifications from negative labels with at least that element’s XGBoost score. We add a pseudo count of one to the numerator of false positive identifications so as to prevent liberal estimates for smaller collections of identifications (Fondrie & Noble, 2021).

Features of FIRE elements

The following features were used as features with Mokapot in FIRE element classification:

FeatureDescription
msp_lenLength of the MSP
msp_len_times_m6a_fcLength of the MSP times the fold-change of m6A in the MSP
ccs_passesNumber of CCS passes in the MSP
fiber_m6a_countNumber of m6A sites in the Fiber-seq read
fiber_AT_countNumber of AT sites in the Fiber-seq read
fiber_m6a_fracFraction of m6A sites in the Fiber-seq read
msp_m6aNumber of m6A sites in the MSP
msp_ATNumber of AT sites in the MSP
msp_m6a_fracFraction of m6A sites in the MSP
msp_fcFold-change of m6A fraction in the MSP relative to the whole reads m6A fraction
bin_X_m6a_countNumber of m6A sites in the Xth 40 bp window
bin_X_AT_countNumber of AT sites in the Xth 40 bp window
bin_X_m6a_fracFraction of m6A sites in the Xth 40 bp window
bin_X_m6a_fcFold-change of m6A fraction in the Xth 40 bp window relative to the whole reads m6A fraction
{best, worst}_m6a_count_XNumber of m6A sites in the 100 bp window with the most (best) or least (worst) m6A sites
{best, worst}_AT_count_XNumber of AT sites in the 100 bp window with the most (best) or least (worst) AT sites
{best, worst}_m6a_frac_XFraction of m6A sites in the 100 bp window with the most (best) or least (worst) m6A sites
{best, worst}_m6a_fc_XFold-change of m6A fraction in the 100 bp window with the most (best) or least (worst) m6A sites relative to the whole reads m6A fraction

There are nine 40 bp windows for each MSP (X = 1, 2, ..., 9), with the 5th window centered on the MSP.

Aggregation of FIRE elements for peak calling

Aggregate FIRE score calculation

The FIRE score (\(S_g\)) for a position in the genome (\(g\)) is calculated using the following formula:

$$ S_g=-\frac{50}{R_g} \sum_{i=1}^{C_g} \log_{10}(1-min(EP_i , 0.99)) $$

where \(C_g\) is the number of FIRE elements at the \(g\)th position, \(R_g\) is the number of Fiber-seq reads at the \(g\)th position, and \(EP_i\) is the estimated precision of the \(i\)th FIRE element at the \(g\)th position. The estimated precision of each FIRE element is thresholded at 0.99, such that the FIRE score takes on values between 0 and 100. Regions covered by less than four FIRE elements (i.e., if \(C_g < 4\)) are not scored and are given a value of negative one.

Regions of unreliable coverage

Regions with unreliable coverage are defined as regions with Fiber-seq coverage that deviate from the median coverage by five standard deviations, where a standard deviation is defined by the Poisson distribution (i.e., the square root of the mean coverage). This is a parameter that can be adjusted by the user.

FIRE score FDR calculation

We shuffle the location of an entire Fiber-seq read by selecting a random start position within the chromosome and relocating the entire read to that start position. Fiber-seq reads originating from regions with unreliable coverage are not shuffled, and reads from regions with reliable coverage are not shuffled into regions with unreliable coverage (bedtools shuffle -chrom -excl, v2.31.0). We then compute FIRE scores associated with the shuffled genome. Recalling that the FIRE score for the \(g\)th position in the genome is denoted \(S_g\), we divide the number of bases that have shuffled FIRE scores above \(S_g\) by the number of bases that have un-shuffled FIRE scores above \(S_g\). This provides an estimate of the FDR associated with the FIRE score \(S_g\).

Peak calling

Peaks are called by identifying FIRE score local-maxima that have FDR values below a 5% threshold and that exceed an optional percent actuation threshold \(C_g / R_g\). Adjacent local maxima that share 50% of the underlying FIRE elements or have 90% reciprocal overlap are merged into a single peak, using the higher of the two local maxima. Then, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculated and reported wide peaks by taking the union of the FIRE peaks and all regions below the FDR threshold and then merged the resulting regions that were within one nucleosome (147 bp) of one another.

Haplotype-selective peaks

Haplotype-selective peaks

For peaks with at least 10 Fiber-seq reads on both haplotypes, we test the difference in percent actuation in each haplotype (fraction of reads with FIRE elements) using a two-sided Fisher’s exact test. Specifically, the inputs for the test are the number of FIRE elements in haplotype one, the number of Fiber-seq reads without FIRE elements in haplotype one, the number of FIRE elements in haplotype two, and the number of Fiber-seq reads without FIRE elements in haplotype two. We then apply an FDR correction (Benjamini-Hochberg) to correct for multiple testing.

Glossary of Fiber-seq terms

Fiber-seq protocol

A treatment of DNA chromatin fibers with the Hia5 enzyme which preferentially methylates accessible adenosines. The fibers are then sequenced using a long-read platform that can detect the m6A modifications.

Fiber-seq read or fiber

A DNA fiber that has been sequenced using the Fiber-seq protocol and has m6A modifications called.

m6A

N6-methyladenosine: A DNA modification made during the Fiber-seq protocol on accessible adenosines. Native m6A sites are exceedingly rare or non-existent in eukaryotic DNA.

Inferred nucleosome

A nucleosome inferred from the Fiber-seq data. The nucleosome position along the DNA is inferred by the ft add-nucleosomes algorithm using the m6A modifications (not a direct observation of a nucleosome).

MSP

Methyltransferase sensitive patch: A stretch of a Fiber-seq read that has a high density of m6A sites. Specifically it is defined as a region that is inferred to not be occluded by a nucleosome.

FIRE

Fiber-seq Inferred Regulatory Element: A MSP that is inferred to be a regulatory element based on features of the MSP including m6A density.

Internucleosomal linker region

Any MSPs that is not a FIRE element.

The Fiber-seq BAM format

The Fiber-seq BAM format adds to m6A BAM data from a PacBio (ft predict-m6a) or ONT (dorado) sequencing run with m6A.

The following tags are added to the BAM file:

  • ns:B:I,: Nucleosome start sites (0-based) on the forward strand of the sequencing read (u32).
  • nl:B:I,: Equal length array to ns of nucleosome lengths (u32).
  • as:B:I,: MSP start sites (0-based) on the forward strand of the sequencing read (u32).
  • al:B:I,: Equal length array to as of MSP lengths (u32).
  • aq:B:C,: Quality scores for the MSP [0, 255] (u8). This tag is optional and added by ft fire.

The ns/as/nl/al tags are with respect to the unaligned sequencing read (forward strand) and are not affected by alignment or the orientation of the read after alignment. This also means that the starts and lengths encoded in these tags may change when ft lifts these molecular coordinates to reference coordinates.

The ns or as tag do not need to begin or end at the start or end of the read; however, once begun, they must be contiguous. i.e. the ns and as tags must combine to form a contiguous set of alternating nucleosome and MSP sites. The ns, nl, as, and al tags are added to the BAM file automatically using ft predict-m6a or later using the ft add-nucleosomes command.

The aq tag is added using the ft fire command and represents the estimated precision of the MSP being a FIRE. Specifically, the estimated precision of a FIRE is the value of the aq tag divided by 255.

Cite

If you use fibertools in your research, please cite the following paper:

Jha, A., Bohaczuk, S. C., Mao, Y., Ranchalis, J., Mallory, B. J., Min, A. T., Hamm, M. O., Swanson, E., Dubocanin, D., Finkbeiner, C., Li, T., Whittington, D., Noble, W. S., Stergachis, A. B., & Vollger, M. R. (2024). DNA-m6A calling and integrated long-read epigenetic and genetic analysis with fibertools. Genome Research. https://doi.org/10.1101/gr.279095.124

If you use FIRE in your research, please cite the following paper:

Vollger, M. R., Swanson, E. G., Neph, S. J., Ranchalis, J., Munson, K. M., Ho, C.-H., Sedeño-Cortés, A. E., Fondrie, W. E., Bohaczuk, S. C., Mao, Y., Parmalee, N. L., Mallory, B. J., Harvey, W. T., Kwon, Y., Garcia, G. H., Hoekzema, K., Meyer, J. G., Cicek, M., Eichler, E. E., … Stergachis, A. B. (2024). A haplotype-resolved view of human gene regulation. bioRxiv. https://doi.org/10.1101/2024.06.14.599122

Analyses of Fiber-seq data

This notebook contains some of the prepublication analyses of Fiber-seq data we have done:

ONT Fiber-seq Analysis

Matched ONT and PacBio preparations

Read lengths unchanged by Fiber-seq

Impact of Fiber-seq on ONT read quality

A comparison of ONT and PacBio Fiber-seq

Genome-wide comparison of ONT and PacBio Fiber-seq

Transcription factor foot printing has reduced resolution with ONT

Stand does not significantly impact ONT Fiber-seq FIRE results

ONT FIREs have ~half as many m6A sites as PacBio (as expected)

Auto-correlation of m6A sites in ONT nearly matches a single strand of PacBio

Overall conclusions