This is the book for the computational tools of Fiber-seq which describes the two major software tools:
fibertools
(ft
) which is a CLI tool for creating and interacting with Fiber-seq BAM files.- the
samtools
of Fiber-seq
- the
- Fiber-seq Inferred Regulatory Elements which is a method for identifying regulatory elements on individual fibers and peak calling.
- the
MACS2
of Fiber-seq
- the
Some Key features include:
- Predicting m6A sites from PacBio Fiber-seq data
- FIRE (Fiber-seq Inferred Regulatory Elements)
- Extracting Fiber-seq results into plain text files.
- Centering Fiber-seq results around a given position.
- pyft: Python bindings for
fibertools
A quick start guide can be found here.
You can help improve this book! And it is a easy as clicking the edit button in the top right.
The quick start guide for analyzing Fiber-seq data
Fiber-seq starting with PacBio
HiFi kinetics are required for predicting m6A with fibertools
. Check with your sequencing provider prior to sequencing to ensure that the output file will have kinetics. Additionally, many of fibertools
commands are compatible with CpG methylation which can be completed on instrument (if requested) or later with Jasmine, e.g. jasmine --keep-kinetics input.ccs.bam output.ccs.bam
. This command should be run prior to fibertools
if CpG methylation information is desired as Jasmine will overwrite the m6A predictions in the MM and ML tags.
Predict m6A and infer nucleosomes
To create useable Fiber-seq data you must first call m6A base-mods on the PacBio CCS bam using fibertools
. First install fibertools and then process your bam file using the prediction command.
ft predict-m6a -t 16 input.ccs.bam output.fiberseq.bam
This will both make m6A calls and identify nucleosomes on each fiber.
Note, the input CCS bam must have average kinetics to be able to call m6A.
Alignment and phasing
We recommend aligning with pbmm2 and phasing with HiPhase. Please see their respective documentation for more information.
Alternatively, we have written a snakemake pipeline to align and phase Fiber-seq data; however, this pipeline is not officially supported outside of our lab at this time.
After this point, you will have a Fiber-seq BAM file that is compatible with all the extraction commands in fibertools
.
Fiber-seq peaks and UCSC browser tracks
Once you have a phased bam file, you can identify Fiber-seq inferred regulatory elements (FIREs) to call Fiber-seq peaks and make a UCSC trackHub. Please see the FIRE repository for more details.
Fiber-seq starting with Oxford Nanopore Technologies (ONT)
Predict m6A
ft predict-m6a does not include a model for ONT data; however, you can use software, such as Dorado, to add CpG and m6A to your ONT BAM file.
If you do use Dorado you must then filter the m6A calls with modkit using a tenth percentile cutoff for each flow-cell independently. This is the only way to get good m6A calls in our experience, and using any hard ML threshold will not hold between flow-cells. Here is an example command:
modkit call-mods -t 8 -p 0.1 input.dorado.bam filtered.dorado.bam
We also show how to apply this filter and call nucleosomes in one line in the next section.
Infer nucleosomes and MSPs
Once you have CpG and m6A information in your filtered ONT BAM file, you can use ft add-nucleosomes
to infer nucleosomes and MSPs. With Dorado, we find the best results when restricting to the 90% of calls that dorado
is most confident in as determined by modkit.
modkit call-mods -p 0.1 input.bam - | ft add-nucleosomes - output.bam
Alignment and phasing
You can either use Dorado to align your ONT data or use a tool like minimap2 to align your data. If you do use minimap2
be sure to include the flag -y
to preserve the CpG and m6A information in the output BAM file.
We recommend using WhatsHap for phasing ONT data. Please see their documentation for more information.
After this point, you will have a Fiber-seq BAM file that is compatible with all the extraction commands in fibertools
.
Fiber-seq peaks and UCSC browser tracks
Some users report reasonable success in applying the FIRE pipeline to ONT data. However, please note that FIRE models were not trained or validated for ONT data. With that said, we have developed a heuristic for FIRE that appears to work very well with ONT data. To enable this add ont: true
to your config.yaml
file when setting up your FIRE run.
Think of fibertools
as the samtools
of Fiber-seq data. A command line tool for creating, filtering, interacting, and extracting from Fiber-seq BAM files.
This chapter covers:
- Creating a Fiber-seq BAM
- Extracting from a Fiber-seq BAM
- installing
fibertools
- pyft: Python bindings
A complete list of subcommands and their help pages can be found here.
Creating a Fiber-seq bam file
This chapter coverers creating a Fiber-seq bam file from a raw PacBio bam file. Including m6A prediction, nucleosome positioning, and FIRE identification.
ft predict-m6a
This command predicts m6A sites from PacBio Fiber-seq data. The input is a PacBio HiFi BAM file with kinetics and the output is a BAM file with the predicted m6A sites encoded in the ML and MM tags.
Details on the algorithm for m6A prediction are described in this publication: https://doi.org/10.1101/gr.279095.124.
ft add-nucleosomes
ft add-nucleosomes
adds nucleosome positions and MSP positions to the Fiber-seq bam. The input is a Fiber-seq bam file with m6A calls and the output is a Fiber-seq bam file with the nucleosome positions encoded in the ns
and nl
tags and the MSP positions encoded in the as
and al
tags.
It is usually unnecessary to run this command manually, as it is called by ft predict-m6a
automatically. However, you can run it manually if you want to adjust the parameters of the nucleosome calling algorithm.
Method
Nucleosome calling is performed by identifying stretches of DNA that are protected from Hia5 (i.e. do not have m6A signal). We have found the rate of false positive m6A calls in nucleosomes is very low when using fibertools
(Jha et al.) allowing for a heuristic to perform as well as or better than our previous HMM caller (Dubocanin et al.).
There are three parameters in our heuristic nucleosome calling that can be adjusted: the minimum nucleosome length (n
, default 75), the minimum combined nucleosome length (c
, default 100), and the minimum extension to nucleosome length (e
, default 25). These three parameters impact the three phases in nucleosome calling.
- Call all regions that have no m6A events for at least
n
bases a candidate nucleosome. - Call all regions of size
c
or more that have only one internal m6A (putative false positive) a candidate nucleosome. - Extend the length of nucleosomes identified in phases one and two if by spanning one additional m6A
e
bases of unmodified sequence would be added to the nucleosome length.
ft fire
This command identifies Fiber-seq Inferred Regulatory Elements (FIREs) from a Fiber-seq BAM. The input is a Fiber-seq BAM file with m6A and nucleosome calls and the output is a Fiber-seq bam file with the FIREs encoded in the aq
tags.
This command can be run in isolation; however, it is usually preferable to run the FIRE pipeline, which runs ft fire
and performs many additional analyses and visualizations.
Extracting from a FIRE BAM
ft fire
can also be used as an extraction tool to extract Fiber-seq data from an already processed FIRE BAM file.
ft fire --extract fire.bam > all.bed
This produces a file in BED format that contains all the MSPs, FIREs, and nucleosomes in the FIRE BAM file. This command produces output analogous to the now removed model.results.bed.gz
result from older versions of FIRE pipeline.
Extracting from a fibertools
bam
This chapter describes various subcommands that extract information from a fibertools
bam file into plain text.
ft extract
Extract Fiber-seq data into plain text files
Inputs and options
See the help message for details.
Output description
All outputs to ft extract
can be (and should be) compressed by simply adding the .gz
extension.
For example, ft extract input.bam --m6a m6a.bed.gz
will output a compressed bed12 file. Use -
to output to stdout
, e.g. ft extract input.bam --m6a -
.
Shared Output columns:
Column | Description |
---|---|
ct | Chromosome or contig |
st | Start position of the read on the chromosome |
en | End position of the read on the chromosome |
fiber | The fiber/read name |
score | The number of ccs passes for the read (rounded) |
Columns specific to the --m6a
, --cpg
, --nuc
, and --msp
outputs
All of these files are written in standard bed12 format. The first and last block in each the bed12 record do not reflect real data, and exist only to mark the start and end positions of the read. If you would like to convert these beds into bigBeds be sure to include -allow1bpOverlap
in your command.
Column | Description |
---|---|
thick start | Same as the start (st ) |
thick end | Same as the end (en ) |
itemRgb | Color specifc to the datatype, e.g. m6a marks get a purple RGB |
blockCount | The number of blocks in the bed12 record |
blockSizes | A comma separated list of the lengths of each feature in the bed12 record |
blockStarts | A comma separated list of the relative start positions of each block in the bed12 record |
Columns specific to the --all
output
Column | Description |
---|---|
sam_flag | The sam flag of the read alignment |
HP | The haplotype tag for the read |
RG | The read group tag for the read |
fiber_length | The length of the read in bp |
fiber_sequence | The sequence of the read |
ec | The number of ccs passes for the read (no rounding) |
rq | The estimated accuracy of the read |
total_AT_bp | The total number of AT bp in the read |
total_m6a_bp | The total number of m6a bp in the read |
total_nuc_bp | The total number of nucleosome bp in the read |
total_msp_bp | The total number of MSP bp in the read |
total_5mC_bp | The total number of 5mC bp in the read |
nuc_starts | The start positions of the nucleosomes in molecular coordinates |
nuc_lengths | The lengths of the nucleosomes in molecular coordinates |
ref_nuc_starts | The start positions of the nucleosomes in reference coordinates |
ref_nuc_lengths | The lengths of the nucleosomes in reference coordinates |
msp_starts | The start positions of the MSPs in molecular coordinates |
msp_lengths | The lengths of the MSPs in molecular coordinates |
ref_msp_starts | The start positions of the MSPs in reference coordinates |
ref_msp_lengths | The lengths of the MSPs in reference coordinates |
m6a | The start positions of the m6a in molecular coordinates |
ref_m6a | The start positions of the m6a in reference coordinates |
m6a_qual | The quality of the m6a positions (ML value) |
5mC | The start positions of the 5mC in molecular coordinates |
ref_5mC | The start positions of the 5mC in reference coordinates |
5mC_qual | The quality of the 5mC positions (ML value) |
Note positions in columns starting with ref_
maybe contain -1
(NA) values if the reference sequence has an insertion or deletion relative to the read sequence at that position.
ft center
This command centers Fiber-seq data around given reference positions. This is useful for making aggregate m6A and CpG observations, as well as visualization of SVs
Inputs and options
See the help message for details.
Output description
This command writes Fiber-seq data in a tab-delimited format to stdout
that has been centered relative to positions specified in the input bed file.
Column | Description |
---|---|
chrom | Chromosome |
centering_position | The position on the chromosome about which data is being centered. |
strand | The strand of the centering position. |
subset_sequence | The sequence of the read around the centering position. |
reference_start | The start position of the read in the reference. |
reference_end | The end position of the read in the reference. |
query_name | The name of the sequencing read |
RG | The read group the read belongs to |
centered_query_start | The start position of the read relative to the centering position |
centered_query_end | The end position of the read relative to the centering position |
query_length | The length of the read |
Additional columns specific to long (default) format
Column | Description |
---|---|
centered_position_type | The type of position being centered. One of: m6A , 5mC , nuc , msp . |
centered_start | The start position of the "feature" relative to the centering position |
centered_end | The end position of the "feature" relative to the centering position |
centered_quality | The quality of the "feature" relative to the centering position (ML value) |
Additional columns specific to the wide format
Column | Description |
---|---|
centered_m6a_positions | A comma separated list of m6a positions |
m6a_qual | The quality of the m6a positions (ML value) |
centered_5mC_positions | A comma separated list of 5mC positions |
5mC_qual | The quality of the 5mC positions (ML value) |
centered_nuc_starts | A comma separated list of nuc starts |
centered_nuc_ends | A comma separated list of nuc ends |
centered_msp_starts | A comma separated list of msp starts |
centered_msp_ends | A comma separated list of msp ends |
query_sequence | The sequence of the read |
Note if the --wide
flag is used with the --reference
flag some positions in the comma separated lists can be NA
when the reference sequence has an insertion or deletion relative to the read sequence at that position.
ft footprint
Usage
ft footprint [OPTIONS] <BAM> --bed <BED> --yaml <YAML>
The BAM
file is an indexed fiber-seq bam file.
The BED
file is a bed file with the motifs you'd like to test for footprints. This should include the strand the motif is on.
The YAML
file is a file that describes the modules within the motif that can be footprinted. e.g. a CTCF yaml
with its multiple binding sites might look like:
modules:
- [0, 8]
- [8, 16]
- [16, 23]
- [23, 29]
- [29, 35]
Modules must start at zero, end at the length of the motif, be sorted, and be contiguous with one another. At most 15 modules are allowed, and the intervals are 0-based, half-open (like BED
).
See the help message for details.
Description of output columns
The footprinting output table is a tab-separated file with the same number of entries as the input BED file and the following columns:
Column | Description |
---|---|
chrom | Chromosome |
start | The start position of the motif |
end | The end position of the motif |
strand | The strand of the motif. |
n_spanning_fibers | The number of fibers that span the motif. |
n_spanning_msps | The number of msp that span the motif. |
n_overlapping_nucs | The number of fibers that have an intersecting nucleosome. |
module_X | The number of fibers that are footprinted in module X. The number of module columns is determined by the footprinting yaml. |
footprint_codes | Comma separated list of footprint codes for each fiber. See details below. |
fire_quals | Comma separated list of fire qualities for each fiber. -1 if the MSP is not spanning or present. Note all fire_quals will be 0 or -1 if FIRE has not been applied to the bam. |
fiber_names | Comma separated list of fiber names that span the motif. Names share the same index as the previous column, so they can be matched with footprint codes. |
Footprint codes
The footprint codes are an encoded bit flag similar to how filtering is done with samtools
. If the first bit is set (1) then there is an MSP that spans the footprint. For each following bit, the bit is set if that module is footprinted by that fiber.
Here are some examples in python for how you could test a footprint code in a few ways:
fp_code = 0b1001 # this is a value of 9, but in binary it is 1001
# test if the first bit is set, there is a spanning MSP, true in this example
(fp_code & 1) > 0
# test if the first module is footprinted, false in this example
(fp_code & (1 << 1)) > 0
# test if the third module is footprinted, true in this example
(fp_code & (1 << 3)) > 0
ft-pileup
The ft-pileup
command is used to generate a per-base pileup of Fiber-seq data. This command is useful for visualizing the distribution of various features across the genome.
Inputs and options
See the help message for details.
ft-qc
The ft-qc
command is used to generate quality control metrics for a Fiber-seq BAM file.
Inputs and options
See the help message for details.
Help pages for fibertools subcommands
ft
ft predict-m6a
ft fire
ft extract
ft center
ft add-nucleosomes
ft footprint
ft clear-kinetics
ft strip-basemods
ft track-decorators
ft pileup
ft qc
ft
Fiber-seq toolkit in rust
Usage: ft [OPTIONS] <COMMAND>
Commands:
predict-m6a Predict m6A positions using HiFi kinetics data and encode the results in the MM and
ML bam tags. Also adds nucleosome (nl, ns) and MTase sensitive patches (al, as)
[aliases: m6A, m6a]
add-nucleosomes Add nucleosomes to a bam file with m6a predictions
fire Add FIREs (Fiber-seq Inferred Regulatory Elements) to a bam file with m6a
predictions
extract Extract fiberseq data into plain text files [aliases: ex, e]
center This command centers fiberseq data around given reference positions. This is useful
for making aggregate m6A and CpG observations, as well as visualization of SVs
[aliases: c, ct]
footprint Infer footprints from fiberseq data
qc Collect QC metrics from a fiberseq bam file
track-decorators Make decorated bed files for fiberseq data
pileup Make a pileup track of Fiber-seq features from a FIRE bam
clear-kinetics Remove HiFi kinetics tags from the input bam file
strip-basemods Strip out select base modifications
ddda-to-m6a Convert a DddA BAM file to pseudo m6A BAM file
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft predict-m6a
Predict m6A positions using HiFi kinetics data and encode the results in the MM and ML bam tags. Also adds
nucleosome (nl, ns) and MTase sensitive patches (al, as)
Usage: ft predict-m6a [OPTIONS] [BAM] [OUT]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[OUT] Output bam file with m6A calls in new/extended MM and ML bam tags [default: -]
Options:
-n, --nucleosome-length <NUCLEOSOME_LENGTH>
Minium nucleosome length [default: 75]
-c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH>
Minium nucleosome length when combining over a single m6A [default: 100]
--min-distance-added <MIN_DISTANCE_ADDED>
Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
-d, --distance-from-end <DISTANCE_FROM_END>
Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
-k, --keep
Keep hifi kinetics data
-h, --help
Print help (see more with '--help')
-V, --version
Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
0]
-x, --ftx <FILTER_EXPRESSION> Filtering expression to use for filtering records Example: filter to
nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
Filtering expressions support len() and qual() functions over msp, nuc,
m6a, cpg
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env:
FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
Developer-Options:
--force-min-ml-score <FORCE_MIN_ML_SCORE> Force a different minimum ML score
--all-calls Keep all m6A calls regardless of how low the ML value is
-b, --batch-size <BATCH_SIZE> Number of reads to include in batch prediction [default: 1]
ft fire
Add FIREs (Fiber-seq Inferred Regulatory Elements) to a bam file with m6a predictions
Usage: ft fire [OPTIONS] [BAM] [OUT]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[OUT] Output file (BAM by default, table of MSP features if `--feats-to-text` is used, and bed9 + if
`--extract`` is used) [default: -]
Options:
--ont
Use a ONT heuristic adjustment for FIRE calling. This adjusts the observed number of m6A counts by
adding pseudo counts to account for the single stranded nature of ONT data [env: ONT=]
-e, --extract
Output just FIRE elements in bed9 format
--all
When extracting bed9 format include all MSPs and nucleosomes
-f, --feats-to-text
Output FIREs features for training in a table format
-s, --skip-no-m6a
Don't write reads with no m6A calls to the output bam
--min-msp <MIN_MSP>
Skip reads without at least `N` MSP calls [env: MIN_MSP=] [default: 0]
--min-ave-msp-size <MIN_AVE_MSP_SIZE>
Skip reads without an average MSP size greater than `N` [env: MIN_AVE_MSP_SIZE=] [default: 0]
-w, --width-bin <WIDTH_BIN>
Width of bin for feature collection [env: WIDTH_BIN=] [default: 40]
-b, --bin-num <BIN_NUM>
Number of bins to collect [env: BIN_NUM=] [default: 9]
--best-window-size <BEST_WINDOW_SIZE>
Calculate stats for the highest X bp window within each MSP Should be a fair amount higher than
the expected linker length [env: BEST_WINDOW_SIZE=] [default: 100]
--min-msp-length-for-positive-fire-call <MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL>
Minium length of msp to call a FIRE [env: MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL=] [default: 85]
--model <MODEL>
Optional path to a model json file. If not provided ft will use the default model (recommended)
[env: FIRE_MODEL=]
--fdr-table <FDR_TABLE>
Optional path to a FDR table [env: FDR_TABLE=]
-h, --help
Print help
-V, --version
Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
0]
-x, --ftx <FILTER_EXPRESSION> Filtering expression to use for filtering records Example: filter to
nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
Filtering expressions support len() and qual() functions over msp, nuc,
m6a, cpg
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env:
FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft extract
Extract fiberseq data into plain text files
Usage: ft extract [OPTIONS] [BAM]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
Options:
-r, --reference Report in reference sequence coordinates
--molecular Report positions in the molecular sequence coordinates
--m6a <M6A> Output path for m6a bed12
-c, --cpg <CPG> Output path for 5mC (CpG, primrose) bed12
--msp <MSP> Output path for methylation sensitive patch (msp) bed12
-n, --nuc <NUC> Output path for nucleosome bed12
-a, --all <ALL> Output path for a tabular format including "all" fiberseq information in the bam
-h, --help Print help (see more with '--help')
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
0]
-x, --ftx <FILTER_EXPRESSION> Filtering expression to use for filtering records Example: filter to
nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
Filtering expressions support len() and qual() functions over msp, nuc,
m6a, cpg
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env:
FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
All-Format-Options:
-q, --quality Include per base quality scores in "fiber_qual"
-s, --simplify Simplify output by removing fiber sequence
ft center
This command centers fiberseq data around given reference positions. This is useful for making aggregate m6A
and CpG observations, as well as visualization of SVs
Usage: ft center [OPTIONS] --bed <BED> [BAM]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
Options:
-b, --bed <BED> Bed file on which to center fiberseq reads. Data is adjusted to the start position of
the bed file and corrected for strand if the strand is indicated in the 6th column of
the bed file. The 4th column will also be checked for the strand but only after the 6th
is. If you include strand information in the 4th (or 6th) column it will orient data
accordingly and use the end position of bed record instead of the start if on the minus
strand. This means that profiles of motifs in both the forward and minus orientation
will align to the same central position
-d, --dist <DIST> Set a maximum distance from the start of the motif to keep a feature
-w, --wide Provide data in wide format, one row per read
-r, --reference Return relative reference position instead of relative molecular position
-s, --simplify Replace the sequence output column with just "N"
-h, --help Print help (see more with '--help')
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
0]
-x, --ftx <FILTER_EXPRESSION> Filtering expression to use for filtering records Example: filter to
nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
Filtering expressions support len() and qual() functions over msp, nuc,
m6a, cpg
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env:
FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft add-nucleosomes
Add nucleosomes to a bam file with m6a predictions
Usage: ft add-nucleosomes [OPTIONS] [BAM] [OUT]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[OUT] Output bam file with nucleosome calls [default: -]
Options:
-n, --nucleosome-length <NUCLEOSOME_LENGTH>
Minium nucleosome length [default: 75]
-c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH>
Minium nucleosome length when combining over a single m6A [default: 100]
--min-distance-added <MIN_DISTANCE_ADDED>
Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
-d, --distance-from-end <DISTANCE_FROM_END>
Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
-h, --help
Print help
-V, --version
Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
0]
-x, --ftx <FILTER_EXPRESSION> Filtering expression to use for filtering records Example: filter to
nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
Filtering expressions support len() and qual() functions over msp, nuc,
m6a, cpg
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env:
FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft footprint
Infer footprints from fiberseq data
Usage: ft footprint [OPTIONS] --bed <BED> --yaml <YAML> [BAM]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
Options:
-b, --bed <BED> BED file with the regions to footprint. Should all contain the same motif with proper
strand information, and ideally be ChIP-seq peaks
-y, --yaml <YAML> yaml describing the modules of the footprint
-o, --out <OUT> Output bam [default: -]
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
0]
-x, --ftx <FILTER_EXPRESSION> Filtering expression to use for filtering records Example: filter to
nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
Filtering expressions support len() and qual() functions over msp, nuc,
m6a, cpg
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env:
FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft clear-kinetics
Remove HiFi kinetics tags from the input bam file
Usage: ft clear-kinetics [OPTIONS] [BAM] [OUT]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[OUT] Output bam file without hifi kinetics [default: -]
Options:
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
0]
-x, --ftx <FILTER_EXPRESSION> Filtering expression to use for filtering records Example: filter to
nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
Filtering expressions support len() and qual() functions over msp, nuc,
m6a, cpg
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env:
FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft strip-basemods
Strip out select base modifications
Usage: ft strip-basemods [OPTIONS] [BAM] [OUT]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[OUT] Output bam file [default: -]
Options:
-b, --basemod <BASEMOD> base modification to strip out of the bam file [possible values: m6A, 6mA, 5mC,
CpG]
--ml-m6a <ML_M6A> filter out m6A modifications with less than this ML value [default: 0]
--ml-5mc <ML_5MC> filter out 5mC modifications with less than this ML value [default: 0]
--drop-forward Drop forward strand of base modifications
--drop-reverse Drop reverse strand of base modifications
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
0]
-x, --ftx <FILTER_EXPRESSION> Filtering expression to use for filtering records Example: filter to
nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
Filtering expressions support len() and qual() functions over msp, nuc,
m6a, cpg
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env:
FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft track-decorators
Make decorated bed files for fiberseq data
Usage: ft track-decorators [OPTIONS] --bed12 <BED12> [BAM]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
Options:
-b, --bed12 <BED12> Output path for bed12 file to be decorated
-d, --decorator <DECORATOR> Output path for decorator bed file [default: -]
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
0]
-x, --ftx <FILTER_EXPRESSION> Filtering expression to use for filtering records Example: filter to
nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
Filtering expressions support len() and qual() functions over msp, nuc,
m6a, cpg
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env:
FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft pileup
Make a pileup track of Fiber-seq features from a FIRE bam
Usage: ft pileup [OPTIONS] [BAM] [RGN]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[RGN] Region string to make a pileup of. e.g. chr1:1-1000 or chr1:1-1,000 If not provided will make a
pileup of the whole genome
Options:
-o, --out <OUT> Output file [default: -]
-m, --m6a include m6A calls
-c, --cpg include 5mC calls
--haps For each column add two new columns with the hap1 and hap2 specific data
-k, --keep-zeros Keep zero coverage regions
-p, --per-base Write output one base at a time even if the values do not change
--fiber-coverage Calculate coverage starting from the first MSP/NUC to the last MSP/NUC
position instead of the complete span of the read alignment
--shuffle <SHUFFLE> Shuffle the fiber-seq data according to a bed file of the shuffled
positions of the fiber-seq data
--rolling-max <ROLLING_MAX> Output a rolling max of the score column over X bases
--no-msp No MSP columns
--no-nuc No NUC columns
-h, --help Print help (see more with '--help')
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
0]
-x, --ftx <FILTER_EXPRESSION> Filtering expression to use for filtering records Example: filter to
nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
Filtering expressions support len() and qual() functions over msp, nuc,
m6a, cpg
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env:
FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft qc
Collect QC metrics from a fiberseq bam file
Usage: ft qc [OPTIONS] [BAM] [OUT]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam
file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[OUT] Output text file with QC metrics. The format is a tab-separated file with the following columns:
"statistic\tvalue\tcount" where "statistic" is the name of the metric, "value" is the value of the
metric, and "count" is the number of times the metric was observed [default: -]
Options:
--acf Calculate the auto-correlation function of the m6A marks in the
fiber-seq data
--acf-max-lag <ACF_MAX_LAG> maximum lag for the ACF calculation [default: 250]
--acf-min-m6a <ACF_MIN_M6A> Minimum number of m6A marks to use a read in the ACF calculation
[default: 100]
--acf-max-reads <ACF_MAX_READS> maximum number of reads to use in the ACF calculation [default:
10000]
--acf-sample-rate <ACF_SAMPLE_RATE> After sampling the first "acf-max-reads" randomly sample one of
every "acf-sample-rate" reads and replace one of the previous
reads at random [default: 100]
-m, --m6a-per-msp In the output include a measure of the number of m6A events per
MSPs of a given size. The output format is:
"m6a_per_msp_size\t{m6A count},{MSP size},{is a FIRE}\t{count}"
e.g. "m6a_per_msp_size\t35,100,false\t100"
--n-reads <N_READS> Only process the first "n" reads in the input bam file
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default:
0]
-x, --ftx <FILTER_EXPRESSION> Filtering expression to use for filtering records Example: filter to
nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example:
filter to msps with lengths between 30 and 49 bp -x "len(msp)=30:50"
Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50"
Filtering expressions support len() and qual() functions over msp, nuc,
m6a, cpg
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env:
FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
pyft: Python bindings for fibertools
See the documentation for pyft
on readthedocs for more information.
Installation
This chapter covers the various ways to install fibertools
.
It is easiest to install fibertools
using conda (instructions). However, for the fastest m6A predictions you need to install with libtorch
enabled (instructions) if you would like to install from source, you can use the following instructions.
From bioconda
fibertools-rs
is avalible through bioconda
and can be installed with the following command:
mamba install -c conda-forge -c bioconda fibertools-rs
However, due to size constraints in bioconda
this version does not support contain the pytorch libraries or GPU acceleration for m6A predictions. m6A predictions will still work in the bioconda version but may be much slower. If you would like to use m6A prediction and GPU acceleration, you will need to install using the directions here.
From crates.io
Installation from crates.io
requires the rust package manager cargo
. You can find how to install cargo
here.. Furthermore, a recent version of gcc
and cmake
is required. I have tested and recommend gcc v10.2.0
and cmake v3.21.1
, though other versions may work.
cargo install fibertools-rs
From GitHub (active development)
Using cargo from source:
cargo install --git https://github.com/fiberseq/fibertools-rs
or using git form source:
git clone https://github.com/fiberseq/fibertools-rs
cd fibertools-rs
cargo build --release
./target/release/ft --help
With libtorch
Get libtorch
v2.2.0 from the PyTorch website and extract the content of the zip file.
- On Linux/Unix system you can download with:
wget https://download.pytorch.org/libtorch/cu118/libtorch-shared-with-deps-2.2.0%2Bcu118.zip
- On macOS you can download with:
wget https://download.pytorch.org/libtorch/cpu/libtorch-macos-2.2.0.zip
- Windows is not supported and will not be.
Then add the following to your .bashrc
or equivalent, where /path/to/libtorch
is the path to the directory that was created when unzipping the file:
export LIBTORCH_CXX11_ABI=0
export LIBTORCH=/path/to/libtorch # e.g. export LIBTORCH=/Users/mrvollger/lib/libtorch
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
export DYLD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
Finally install using cargo
:
cargo install --all-features fibertools-rs
FIRE: descriptions, methods, and outputs
The code for running the FIRE pipeline can be found on GitHub, and if you do use FIRE in your work please cite our publication:
Vollger, M. R., Swanson, E. G., Neph, S. J., Ranchalis, J., Munson, K. M., Ho, C.-H., Sedeño-Cortés, A. E., Fondrie, W. E., Bohaczuk, S. C., Mao, Y., Parmalee, N. L., Mallory, B. J., Harvey, W. T., Kwon, Y., Garcia, G. H., Hoekzema, K., Meyer, J. G., Cicek, M., Eichler, E. E., … Stergachis, A. B. (2024). A haplotype-resolved view of human gene regulation. bioRxiv. https://doi.org/10.1101/2024.06.14.599122
Summary of Fiber-seq inferred regulatory elements (FIREs)
For those who would prefer video I have recorded a lab meeting discussing FIRE and the methods used here. If you prefer to read please continue on.
FIREs are MTase sensitive patches (MSPs) that are inferred to be regulatory elements on single chromatin fibers. To do this we used semi-supervised machine learning to identify MSPs that are likely to be regulatory elements using the Mokapot
framework and XGBoost
. Every individual FIRE element is associated with an estimated precision value, which indicates the probability that the FIRE element is a true regulatory element. The estimated precision of FIREs elements are created using Mokapot
and validation data not used in training. We train our model targeting FIRE elements with at least 95% precision, MSPs with less than 90% precision are considered to have average level of accessibility expected between two nucleosomes, and are referred to as linker regions.
Semi-superivized machine learning with Mokapot
requires a mixed-positive training set and a clean negative training set. To create mixed positive training data we selected MSPs that overlapped DNase hypersensitive sites (DHSs) and CTCF ChIP-seq peaks. And to create a clean negative training set we selected MSPs that did not overlap DHSs or CTCF ChIP-seq peaks.
For details on running the FIRE workflow see:
For details on the outputs of the FIRE workflow see:
Running the FIRE pipeline
The FIRE pipeline is a Snakemake workflow for calling Fiber-seq Inferred Regulatory Elements (FIREs) on single molecules and peak calling with Fiber-seq.
Install
Please start by installing pixi which handles the environment of the FIRE workflow.
Then install FIRE using git
and pixi
:
git clone https://github.com/fiberseq/FIRE.git
cd FIRE
pixi install
We then recommend quickly testing your installation by running the test suite:
pixi run test
We recommend setting a Snakemake conda prefix and the Apptainer cache directory in your bashrc
, e.g. in the Stergachis lab add:
export SNAKEMAKE_CONDA_PREFIX=/mmfs1/gscratch/stergachislab/snakemake-conda-envs
export APPTAINER_CACHEDIR=/mmfs1/gscratch/stergachislab/snakemake-conda-envs/apptainer-cache
Then Snakemake installs all the additional requirements as conda envs in that directory.
If you wish to distribute jobs across a cluster you may need to install the appropriate snakemake executor plugin. The SLURM executor is included in the environment (snakemake-executor-slurm
)
Configuring
See the configuration README, the example configuration file, and the example manifest file for configuration options.
Run
The FIRE
workflow can be executed using the pixi run fire
command. Under the hood this runs a snakemake
workflow and any extra parameters are passed directly to snakemake. For example:
pixi run fire --configfile config/config.yaml
If you want to do a dry run:
pixi run fire --configfile config/config.yaml -n
If you want to execute across a cluster (modify profiles/slurm-executor
as needed for distributed execution):
pixi run fire --configfile config/config.yaml --profile profiles/slurm-executor
And if you want to run the FIRE workflow from another directory you can do so with:
pixi run --manifest-path /path/to/snakemake/pixi.toml fire ...
where you update /path/to/snakemake/pixi.toml
to the path of the pixi.toml
in the cloned FIRE repository.
You can also run snakemake directly, e.g.:
pixi shell
snakemake \
--configfile config/config.yaml \
--profile profiles/slurm-executor \
--local-cores 8 -k
Outputs of the FIRE pipeline
The FIRE pipeline returns outputs for a sample in the directory results/{sample}/
and files are labeled with the sample ({sample}
) and the version of the FIRE pipeline ({v}
). The following files and directories are generated:
Output | Description |
---|---|
{sample}-fire-{v}-filtered.cram | CRAM file containing the all the data used in the FIRE pipeline. |
{sample}-fire-{v}-peaks.bed.gz | BED file containing the FIRE peaks calls |
{sample}-fire-{v}-hap-differences.bed.gz | BED file containing the results of searching for haplotype-selective peaks. |
{sample}-fire-{v}-pileup.bed.gz | BED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more. |
{sample}-fire-{v}-qc.tbl.gz | Table containing quality control metrics for the FIRE CRAM. |
trackHub-{v}/ | Directory containing a UCSC trackHub for visualizing all the results of the FIRE pipeline. |
additional-outputs-{v}/ | Directory containing additional outputs from the FIRE pipeline. |
More details on the individual outputs
The {sample}-fire-{v}-filtered.cram
file
The CRAM file contains all the data used in the FIRE pipeline. It is a CRAM file that can be viewed with IGV or other genome browsers. Sequencing quality scores are removed from the CRAM file to reduce the file size since per base quality scores are not used in the FIRE pipeline, as well as reads with insufficient m6A signal. The CRAM file is sorted and indexed.
The {sample}-fire-{v}-peaks.bed.gz
file
This is the peak file for the FIRE method. Peaks are called by identifying FIRE score (methods) local-maxima that have FDR values below a threshold. By default, the pipeline reports peaks at a 5% FDR threshold. Once a local-maxima is identified, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculate and report wide peaks in the additional-outputs/
by taking the union of the FIRE peaks and all regions below the FDR threshold and then merging resulting regions that are within one nucleosome (147 bp) of one another.
The FIRE peaks file has the following columns:
Column | Description |
---|---|
#chrom | Chromosome of the peak |
peak*start | Start of the peak |
peak_end | End of the peak |
start | Start of the maximum of the peak |
end | End of the maximum of the peak |
coverage | Coverage of the peak |
fire_coverage | Coverage of the FIREs in the peak |
score | FIRE score of the peak (see methods) |
nuc_coverage | Coverage of the nucleosomes in the peak |
msp_coverage | Coverage of the MSPs in the peak |
.**{H1,H2} | Repeats of previous columns but specific for the two haplotypes |
FDR | False discovery rate of the peak |
log_FDR | -10*log10 of the FDR |
FIRE_size_mean | Mean size of the FIREs in the peak |
FIRE_size_ssd | Standard deviation of the size of the FIREs in the peak |
FIRE_start_ssd | Standard deviation of the start of the FIREs in the peak |
FIRE_end_ssd | Standard deviation of the end of the FIREs in the peak |
pass_coverage | Whether the peak passes coverage filters |
The {sample}-fire-{v}-hap-differences.bed.gz
file
This file primarily contains the same columns as the FIRE peaks file but additionally has a p_value
column with the results of a Fisher's exact test for the difference in coverage between the two haplotypes, and a p_adjust
column with the Benjamini-Hochberg adjusted p-value. See the methods for more details.
The {sample}-fire-{v}-pileup.bed.gz
file
This is a BED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more. The columns are calculated using ft-pileup
and more details can be found in the ft-pileup
documentation.
The {sample}-fire-{v}-qc.tbl.gz
file
This file contains quality control metrics for the FIRE CRAM. The results are directly created by ft-qc
and more details can be found in the ft-qc
documentation.
The trackHub-{v}/
directory
The trackHub-{v}/
directory contains a UCSC trackHub for visualizing all the results of the FIRE pipeline. A description of the trackHub can be found in trackHub/fire-description.html
. The trackHub can be loaded into UCSC by uploading the trackHub directory to a public facing website and then loading the hub.txt
's URL into the UCSC trackHub browser.
A copy of the trackHub description can be found here.
The additional-outputs-{v}/
directory
The additional-outputs-{v}/
directory contains the following files:
TODO
FIRE methods
This chapter describes the methods and training used in the FIRE pipeline. There are two major components, identifying per-molecule FIRE elements, and aggregating these elements into peaks.
Training of Fire-seq inferred regulatory elements.
Training data
For training data, we generated 21 different GM12878 Fiber-seq experiments with a range of under- and over-methylated experimental conditions to ensure we captured a broad range of percent m6A (5.8-13.3%) to ensure our model could generalize to new samples with varying levels of m6A. We merged these sequencing results and randomly selected 10% of the Fiber-seq reads from 100,000 randomly selected DNase I and CTCF ChIP-seq peaks for mixed-positive labels and 100,000 equally sized regions that did not overlap DNase or CTCF Chip-seq for negative labels. We then generated features for each of these MSPs, including length, log fold enrichment of m6A, the A/T content, and windowed measures of m6A around the MSP (Supplemental Table 3) and held out 20% of the Fiber-seq reads to be used as test data.
Semi-supervised training
To carry out semi-supervised training, we used an established method, Mokapot (Fondrie & Noble, 2021; Käll et al., 2007), which we summarize below. In the first round of semi-supervised training, Mokapot identifies the feature that best discriminates between our mixed-positive and negative labels and then selects a threshold for that feature such that the mixed-positive labels can be discriminated from the negative labels with 95% estimated precision (defined below). The subset of mixed-positive labels above this threshold is then used as an initial set of positive labels in training an XGBoost model with five-fold cross-validation (Chen & Guestrin, 2016). Then this process is iteratively repeated, using the learned prediction from the previous iteration’s model to create positive labels at 95% estimated precision, until the number of positive identifications at 95% precision in the validation set ceases to increase (15 iterations, Fig. S1) (Fondrie & Noble, 2021; Käll et al., 2007).
Estimated precision of individual FIRE elements
We cannot compute the precision associated with a particular XGBoost score because we do not have access to a set of clean-positive labels. Instead, we define a notion of “estimated precision” using a balanced held-out test set of mixed-positive and negative labels (20% of the data). We defined the estimated precision (EP) of a FIRE element to be
$$ EP = 1 - \frac{FP+1}{TMP} $$
where TMP is the number of “true” identifications from the mixed positive labels with at least that element’s XGBoost score, and FP is the number of false positive identifications from negative labels with at least that element’s XGBoost score. We add a pseudo count of one to the numerator of false positive identifications so as to prevent liberal estimates for smaller collections of identifications (Fondrie & Noble, 2021).
Features of FIRE elements
The following features were used as features with Mokapot
in FIRE element classification:
Feature | Description |
---|---|
msp_len | Length of the MSP |
msp_len_times_m6a_fc | Length of the MSP times the fold-change of m6A in the MSP |
ccs_passes | Number of CCS passes in the MSP |
fiber_m6a_count | Number of m6A sites in the Fiber-seq read |
fiber_AT_count | Number of AT sites in the Fiber-seq read |
fiber_m6a_frac | Fraction of m6A sites in the Fiber-seq read |
msp_m6a | Number of m6A sites in the MSP |
msp_AT | Number of AT sites in the MSP |
msp_m6a_frac | Fraction of m6A sites in the MSP |
msp_fc | Fold-change of m6A fraction in the MSP relative to the whole reads m6A fraction |
bin_X_m6a_count | Number of m6A sites in the Xth 40 bp window |
bin_X_AT_count | Number of AT sites in the Xth 40 bp window |
bin_X_m6a_frac | Fraction of m6A sites in the Xth 40 bp window |
bin_X_m6a_fc | Fold-change of m6A fraction in the Xth 40 bp window relative to the whole reads m6A fraction |
{best, worst}_m6a_count_X | Number of m6A sites in the 100 bp window with the most (best) or least (worst) m6A sites |
{best, worst}_AT_count_X | Number of AT sites in the 100 bp window with the most (best) or least (worst) AT sites |
{best, worst}_m6a_frac_X | Fraction of m6A sites in the 100 bp window with the most (best) or least (worst) m6A sites |
{best, worst}_m6a_fc_X | Fold-change of m6A fraction in the 100 bp window with the most (best) or least (worst) m6A sites relative to the whole reads m6A fraction |
There are nine 40 bp windows for each MSP (X = 1, 2, ..., 9), with the 5th window centered on the MSP.
Aggregation of FIRE elements for peak calling
Aggregate FIRE score calculation
The FIRE score (\(S_g\)) for a position in the genome (\(g\)) is calculated using the following formula:
$$ S_g=-\frac{50}{R_g} \sum_{i=1}^{C_g} \log_{10}(1-min(EP_i , 0.99)) $$
where \(C_g\) is the number of FIRE elements at the \(g\)th position, \(R_g\) is the number of Fiber-seq reads at the \(g\)th position, and \(EP_i\) is the estimated precision of the \(i\)th FIRE element at the \(g\)th position. The estimated precision of each FIRE element is thresholded at 0.99, such that the FIRE score takes on values between 0 and 100. Regions covered by less than four FIRE elements (i.e., if \(C_g < 4\)) are not scored and are given a value of negative one.
Regions of unreliable coverage
Regions with unreliable coverage are defined as regions with Fiber-seq coverage that deviate from the median coverage by five standard deviations, where a standard deviation is defined by the Poisson distribution (i.e., the square root of the mean coverage). This is a parameter that can be adjusted by the user.
FIRE score FDR calculation
We shuffle the location of an entire Fiber-seq read by selecting a random start position within the chromosome and relocating the entire read to that start position. Fiber-seq reads originating from regions with unreliable coverage are not shuffled, and reads from regions with reliable coverage are not shuffled into regions with unreliable coverage (bedtools shuffle -chrom -excl
, v2.31.0). We then compute FIRE scores associated with the shuffled genome. Recalling that the FIRE score for the \(g\)th position in the genome is denoted \(S_g\), we divide the number of bases that have shuffled FIRE scores above \(S_g\) by the number of bases that have un-shuffled FIRE scores above \(S_g\). This provides an estimate of the FDR associated with the FIRE score \(S_g\).
Peak calling
Peaks are called by identifying FIRE score local-maxima that have FDR values below a 5% threshold and that exceed an optional percent actuation threshold \(C_g / R_g\). Adjacent local maxima that share 50% of the underlying FIRE elements or have 90% reciprocal overlap are merged into a single peak, using the higher of the two local maxima. Then, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculated and reported wide peaks by taking the union of the FIRE peaks and all regions below the FDR threshold and then merged the resulting regions that were within one nucleosome (147 bp) of one another.
Haplotype-selective peaks
Haplotype-selective peaks
For peaks with at least 10 Fiber-seq reads on both haplotypes, we test the difference in percent actuation in each haplotype (fraction of reads with FIRE elements) using a two-sided Fisher’s exact test. Specifically, the inputs for the test are the number of FIRE elements in haplotype one, the number of Fiber-seq reads without FIRE elements in haplotype one, the number of FIRE elements in haplotype two, and the number of Fiber-seq reads without FIRE elements in haplotype two. We then apply an FDR correction (Benjamini-Hochberg) to correct for multiple testing.
Glossary of Fiber-seq terms
Fiber-seq protocol
A treatment of DNA chromatin fibers with the Hia5 enzyme which preferentially methylates accessible adenosines. The fibers are then sequenced using a long-read platform that can detect the m6A modifications.
Fiber-seq read or fiber
A DNA fiber that has been sequenced using the Fiber-seq protocol and has m6A modifications called.
m6A
N6-methyladenosine: A DNA modification made during the Fiber-seq protocol on accessible adenosines. Native m6A sites are exceedingly rare or non-existent in eukaryotic DNA.
Inferred nucleosome
A nucleosome inferred from the Fiber-seq data. The nucleosome position along the DNA is inferred by the ft add-nucleosomes
algorithm using the m6A modifications (not a direct observation of a nucleosome).
MSP
Methyltransferase sensitive patch: A stretch of a Fiber-seq read that has a high density of m6A sites. Specifically it is defined as a region that is inferred to not be occluded by a nucleosome.
FIRE
Fiber-seq Inferred Regulatory Element: A MSP that is inferred to be a regulatory element based on features of the MSP including m6A density.
Internucleosomal linker region
Any MSPs that is not a FIRE element.
The Fiber-seq BAM format
The Fiber-seq BAM format adds to m6A BAM data from a PacBio (ft predict-m6a
) or ONT (dorado
) sequencing run with m6A.
The following tags are added to the BAM file:
ns:B:I,
: Nucleosome start sites (0-based) on the forward strand of the sequencing read (u32).nl:B:I,
: Equal length array tons
of nucleosome lengths (u32).as:B:I,
: MSP start sites (0-based) on the forward strand of the sequencing read (u32).al:B:I,
: Equal length array toas
of MSP lengths (u32).aq:B:C,
: Quality scores for the MSP [0, 255] (u8). This tag is optional and added byft fire
.
The ns
/as
/nl
/al
tags are with respect to the unaligned sequencing read (forward strand) and are not affected by alignment or the orientation of the read after alignment. This also means that the starts and lengths encoded in these tags may change when ft
lifts these molecular coordinates to reference coordinates.
The ns
or as
tag do not need to begin or end at the start or end of the read; however, once begun, they must be contiguous. i.e. the ns
and as
tags must combine to form a contiguous set of alternating nucleosome and MSP sites. The ns
, nl
, as
, and al
tags are added to the BAM file automatically using ft predict-m6a
or later using the ft add-nucleosomes
command.
The aq
tag is added using the ft fire
command and represents the estimated precision of the MSP being a FIRE. Specifically, the estimated precision of a FIRE is the value of the aq
tag divided by 255.
Cite
If you use fibertools
in your research, please cite the following paper:
Jha, A., Bohaczuk, S. C., Mao, Y., Ranchalis, J., Mallory, B. J., Min, A. T., Hamm, M. O., Swanson, E., Dubocanin, D., Finkbeiner, C., Li, T., Whittington, D., Noble, W. S., Stergachis, A. B., & Vollger, M. R. (2024). DNA-m6A calling and integrated long-read epigenetic and genetic analysis with fibertools. Genome Research. https://doi.org/10.1101/gr.279095.124
If you use FIRE
in your research, please cite the following paper:
Vollger, M. R., Swanson, E. G., Neph, S. J., Ranchalis, J., Munson, K. M., Ho, C.-H., Sedeño-Cortés, A. E., Fondrie, W. E., Bohaczuk, S. C., Mao, Y., Parmalee, N. L., Mallory, B. J., Harvey, W. T., Kwon, Y., Garcia, G. H., Hoekzema, K., Meyer, J. G., Cicek, M., Eichler, E. E., … Stergachis, A. B. (2024). A haplotype-resolved view of human gene regulation. bioRxiv. https://doi.org/10.1101/2024.06.14.599122
Analyses of Fiber-seq data
This notebook contains some of the prepublication analyses of Fiber-seq data we have done:
ONT Fiber-seq Analysis
Matched ONT and PacBio preparations
Read lengths unchanged by Fiber-seq
Impact of Fiber-seq on ONT read quality
A comparison of ONT and PacBio Fiber-seq