This is the book for the computational tools of Fiber-seq which describes the two major software tools:
fibertools
(ft
) which is a CLI tool for creating and interacting with Fiber-seq BAM files.- the
samtools
of Fiber-seq
- the
- Fiber-seq Inferred Regulatory Elements which is a method for identifying regulatory elements on individual fibers and peak calling.
- the
MACS2
of Fiber-seq
- the
Some Key features include:
- Predicting m6A sites from PacBio Fiber-seq data
- FIRE (Fiber-seq Inferred Regulatory Elements)
- Extracting Fiber-seq results into plain text files.
- Centering Fiber-seq results around a given position.
- pyft: Python bindings for
fibertools
A quick start guide can be found here.
You can help improve this book! And it is a easy as clicking the edit button in the top right.
The quick start guide for analyzing Fiber-seq data
Fiber-seq starting with PacBio
HiFi kinetics are required for predicting m6A with fibertools
. Check with your sequencing provider prior to sequencing to ensure that the output file will have kinetics. Additionally, many of fibertools
commands are compatible with CpG methylation which can be completed on instrument (if requested) or later with Jasmine, e.g. jasmine --keep-kinetics input.ccs.bam output.ccs.bam
. This command should be run prior to fibertools
if CpG methylation information is desired as Jasmine will overwrite the m6A predictions in the MM and ML tags.
Predict m6A and infer nucleosomes
To create useable Fiber-seq data you must first call m6A base-mods on the PacBio CCS bam using fibertools
. First install fibertools and then process your bam file using the prediction command.
ft predict-m6a -t 16 input.ccs.bam output.fiberseq.bam
This will both make m6A calls and identify nucleosomes on each fiber.
Note, the input CCS bam must have average kinetics to be able to call m6A.
Alignment and phasing
We recommend aligning with pbmm2 and phasing with HiPhase. Please see their respective documentation for more information.
Alternatively, we have written a snakemake pipeline to align and phase Fiber-seq data; however, this pipeline is not officially supported outside of our lab at this time.
After this point, you will have a Fiber-seq BAM file that is compatible with all the extraction commands in fibertools
.
Fiber-seq peaks and UCSC browser tracks
Once you have a phased bam file, you can identify Fiber-seq inferred regulatory elements (FIREs) to call Fiber-seq peaks and make a UCSC trackHub. Please see the FIRE repository for more details.
Fiber-seq starting with Oxford Nanopore Technologies (ONT)
Predict m6A
ft predict-m6a does not include a model for ONT data; however, you can use software, such as Dorado, to add CpG and m6A to your ONT BAM file.
If you do use Dorado you must then filter the m6A calls with modkit using a tenth percentile cutoff for each flow-cell independently. This is the only way to get good m6A calls in our experience, and using any hard ML threshold will not hold between flow-cells. Here is an example command:
modkit call-mods -t 8 -p 0.1 input.dorado.bam filtered.dorado.bam
We also show how to apply this filter and call nucleosomes in one line in the next section.
Infer nucleosomes and MSPs
Once you have CpG and m6A information in your filtered ONT BAM file, you can use ft add-nucleosomes
to infer nucleosomes and MSPs. With Dorado, we find the best results when restricting to the 90% of calls that dorado
is most confident in as determined by modkit.
modkit call-mods -p 0.1 input.bam - | ft add-nucleosomes - output.bam
Alignment and phasing
You can either use Dorado to align your ONT data or use a tool like minimap2 to align your data. If you do use minimap2
be sure to include the flag -y
to preserve the CpG and m6A information in the output BAM file.
We recommend using WhatsHap for phasing ONT data. Please see their documentation for more information.
After this point, you will have a Fiber-seq BAM file that is compatible with all the extraction commands in fibertools
.
Fiber-seq peaks and UCSC browser tracks
Some users report reasonable success in applying the FIRE pipeline to ONT data. However, please note that FIRE models were not trained or validated for ONT data. With that said, we have developed a heuristic for FIRE that appears to work very well with ONT data. To enable this add ont: true
to your config.yaml
file when setting up your FIRE run.
Think of fibertools
as the samtools
of Fiber-seq data. A command line tool for creating, filtering, interacting, and extracting from Fiber-seq BAM files.
This chapter covers:
- Creating a Fiber-seq BAM
- Extracting from a Fiber-seq BAM
- installing
fibertools
- pyft: Python bindings
A complete list of subcommands and their help pages can be found here.
Creating a Fiber-seq bam file
This chapter coverers creating a Fiber-seq bam file from a raw PacBio bam file. Including m6A prediction, nucleosome positioning, and FIRE identification.
ft predict-m6a
This command predicts m6A sites from PacBio Fiber-seq data. The input is a PacBio HiFi BAM file with kinetics and the output is a BAM file with the predicted m6A sites encoded in the ML and MM tags.
Details on the algorithm for m6A prediction are described in this publication: https://doi.org/10.1101/gr.279095.124.
ft add-nucleosomes
ft add-nucleosomes
adds nucleosome positions and MSP positions to the Fiber-seq bam. The input is a Fiber-seq bam file with m6A calls and the output is a Fiber-seq bam file with the nucleosome positions encoded in the ns
and nl
tags and the MSP positions encoded in the as
and al
tags.
It is usually unnecessary to run this command manually, as it is called by ft predict-m6a
automatically. However, you can run it manually if you want to adjust the parameters of the nucleosome calling algorithm.
ft fire
This command identifies Fiber-seq Inferred Regulatory Elements (FIREs) from a Fiber-seq BAM. The input is a Fiber-seq BAM file with m6A and nucleosome calls and the output is a Fiber-seq bam file with the FIREs encoded in the aq
tags.
This command can be run in isolation; however, it is usually preferable to run the FIRE pipeline, which runs ft fire
and performs many additional analyses and visualizations.
Extracting from a FIRE BAM
ft fire
can also be used as an extraction tool to extract Fiber-seq data from an already processed FIRE BAM file.
ft fire --extract fire.bam > all.bed
This produces a file in BED format that contains all the MSPs, FIREs, and nucleosomes in the FIRE BAM file. This command produces output analogous to the now removed model.results.bed.gz
result from older versions of FIRE pipeline.
Extracting from a fibertools
bam
This chapter describes various subcommands that extract information from a fibertools
bam file into plain text.
ft extract
Extract Fiber-seq data into plain text files
Inputs and options
See the help message for details.
Output description
All outputs to ft extract
can be (and should be) compressed by simply adding the .gz
extension.
For example, ft extract input.bam --m6a m6a.bed.gz
will output a compressed bed12 file. Use -
to output to stdout
, e.g. ft extract input.bam --m6a -
.
Shared Output columns:
Column | Description |
---|---|
ct | Chromosome or contig |
st | Start position of the read on the chromosome |
en | End position of the read on the chromosome |
fiber | The fiber/read name |
score | The number of ccs passes for the read (rounded) |
Columns specific to the --m6a
, --cpg
, --nuc
, and --msp
outputs
All of these files are written in standard bed12 format. The first and last block in each the bed12 record do not reflect real data, and exist only to mark the start and end positions of the read. If you would like to convert these beds into bigBeds be sure to include -allow1bpOverlap
in your command.
Column | Description |
---|---|
thick start | Same as the start (st ) |
thick end | Same as the end (en ) |
itemRgb | Color specifc to the datatype, e.g. m6a marks get a purple RGB |
blockCount | The number of blocks in the bed12 record |
blockSizes | A comma separated list of the lengths of each feature in the bed12 record |
blockStarts | A comma separated list of the relative start positions of each block in the bed12 record |
Columns specific to the --all
output
Column | Description |
---|---|
sam_flag | The sam flag of the read alignment |
HP | The haplotype tag for the read |
RG | The read group tag for the read |
fiber_length | The length of the read in bp |
fiber_sequence | The sequence of the read |
ec | The number of ccs passes for the read (no rounding) |
rq | The estimated accuracy of the read |
total_AT_bp | The total number of AT bp in the read |
total_m6a_bp | The total number of m6a bp in the read |
total_nuc_bp | The total number of nucleosome bp in the read |
total_msp_bp | The total number of MSP bp in the read |
total_5mC_bp | The total number of 5mC bp in the read |
nuc_starts | The start positions of the nucleosomes in molecular coordinates |
nuc_lengths | The lengths of the nucleosomes in molecular coordinates |
ref_nuc_starts | The start positions of the nucleosomes in reference coordinates |
ref_nuc_lengths | The lengths of the nucleosomes in reference coordinates |
msp_starts | The start positions of the MSPs in molecular coordinates |
msp_lengths | The lengths of the MSPs in molecular coordinates |
ref_msp_starts | The start positions of the MSPs in reference coordinates |
ref_msp_lengths | The lengths of the MSPs in reference coordinates |
m6a | The start positions of the m6a in molecular coordinates |
ref_m6a | The start positions of the m6a in reference coordinates |
m6a_qual | The quality of the m6a positions (ML value) |
5mC | The start positions of the 5mC in molecular coordinates |
ref_5mC | The start positions of the 5mC in reference coordinates |
5mC_qual | The quality of the 5mC positions (ML value) |
Note positions in columns starting with ref_
maybe contain -1
(NA) values if the reference sequence has an insertion or deletion relative to the read sequence at that position.
ft center
This command centers Fiber-seq data around given reference positions. This is useful for making aggregate m6A and CpG observations, as well as visualization of SVs
Inputs and options
See the help message for details.
Output description
This command writes Fiber-seq data in a tab-delimited format to stdout
that has been centered relative to positions specified in the input bed file.
Column | Description |
---|---|
chrom | Chromosome |
centering_position | The position on the chromosome about which data is being centered. |
strand | The strand of the centering position. |
subset_sequence | The sequence of the read around the centering position. |
reference_start | The start position of the read in the reference. |
reference_end | The end position of the read in the reference. |
query_name | The name of the sequencing read |
RG | The read group the read belongs to |
centered_query_start | The start position of the read relative to the centering position |
centered_query_end | The end position of the read relative to the centering position |
query_length | The length of the read |
Additional columns specific to long (default) format
Column | Description |
---|---|
centered_position_type | The type of position being centered. One of: m6A , 5mC , nuc , msp . |
centered_start | The start position of the "feature" relative to the centering position |
centered_end | The end position of the "feature" relative to the centering position |
centered_quality | The quality of the "feature" relative to the centering position (ML value) |
Additional columns specific to the wide format
Column | Description |
---|---|
centered_m6a_positions | A comma separated list of m6a positions |
m6a_qual | The quality of the m6a positions (ML value) |
centered_5mC_positions | A comma separated list of 5mC positions |
5mC_qual | The quality of the 5mC positions (ML value) |
centered_nuc_starts | A comma separated list of nuc starts |
centered_nuc_ends | A comma separated list of nuc ends |
centered_msp_starts | A comma separated list of msp starts |
centered_msp_ends | A comma separated list of msp ends |
query_sequence | The sequence of the read |
Note if the --wide
flag is used with the --reference
flag some positions in the comma separated lists can be NA
when the reference sequence has an insertion or deletion relative to the read sequence at that position.
ft footprint
Usage
ft footprint [OPTIONS] <BAM> --bed <BED> --yaml <YAML>
The BAM
file is an indexed fiber-seq bam file.
The BED
file is a bed file with the motifs you'd like to test for footprints. This should include the strand the motif is on.
The YAML
file is a file that describes the modules within the motif that can be footprinted. e.g. a CTCF yaml
with its multiple binding sites might look like:
modules:
- [0, 8]
- [8, 16]
- [16, 23]
- [23, 29]
- [29, 35]
Modules must start at zero, end at the length of the motif, be sorted, and be contiguous with one another. At most 15 modules are allowed, and the intervals are 0-based, half-open (like BED
).
See the help message for details.
Description of output columns
The footprinting output table is a tab-separated file with the same number of entries as the input BED file and the following columns:
Column | Description |
---|---|
chrom | Chromosome |
start | The start position of the motif |
end | The end position of the motif |
strand | The strand of the motif. |
n_spanning_fibers | The number of fibers that span the motif. |
n_spanning_msps | The number of msp that span the motif. |
n_overlapping_nucs | The number of fibers that have an intersecting nucleosome. |
module_X | The number of fibers that are footprinted in module X. The number of module columns is determined by the footprinting yaml. |
footprint_codes | Comma separated list of footprint codes for each fiber. See details below. |
fire_quals | Comma separated list of fire qualities for each fiber. -1 if the MSP is not spanning or present. Note all fire_quals will be 0 or -1 if FIRE has not been applied to the bam. |
fiber_names | Comma separated list of fiber names that span the motif. Names share the same index as the previous column, so they can be matched with footprint codes. |
Footprint codes
The footprint codes are an encoded bit flag similar to how filtering is done with samtools
. If the first bit is set (1) then there is an MSP that spans the footprint. For each following bit, the bit is set if that module is footprinted by that fiber.
Here are some examples in python for how you could test a footprint code in a few ways:
fp_code = 0b1001 # this is a value of 9, but in binary it is 1001
# test if the first bit is set, there is a spanning MSP, true in this example
(fp_code & 1) > 0
# test if the first module is footprinted, false in this example
(fp_code & (1 << 1)) > 0
# test if the third module is footprinted, true in this example
(fp_code & (1 << 3)) > 0
Help pages for fibertools subcommands
ft
ft predict-m6a
ft fire
ft extract
ft center
ft add-nucleosomes
ft footprint
ft clear-kinetics
ft strip-basemods
ft track-decorators
ft pileup
ft
Fiber-seq toolkit in rust
Usage: ft [OPTIONS] <COMMAND>
Commands:
predict-m6a Predict m6A positions using HiFi kinetics data and encode the results in the MM and ML bam tags. Also adds nucleosome (nl, ns) and MTase sensitive patches (al, as) [aliases: m6A, m6a]
add-nucleosomes Add nucleosomes to a bam file with m6a predictions
fire Add FIREs (Fiber-seq Inferred Regulatory Elements) to a bam file with m6a predictions
extract Extract fiberseq data into plain text files [aliases: ex, e]
center This command centers fiberseq data around given reference positions. This is useful for making aggregate m6A and CpG observations, as well as visualization of SVs [aliases: c, ct]
footprint Infer footprints from fiberseq data
track-decorators Make decorated bed files for fiberseq data
pileup Make a pileup track of Fiber-seq features from a FIRE bam
clear-kinetics Remove HiFi kinetics tags from the input bam file
strip-basemods Strip out select base modifications
ddda-to-m6a Convert a DddA BAM file to pseudo m6A BAM file
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft predict-m6a
Predict m6A positions using HiFi kinetics data and encode the results in the MM and ML bam tags. Also adds nucleosome (nl, ns) and MTase sensitive patches (al, as)
Usage: ft predict-m6a [OPTIONS] [BAM] [OUT]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[OUT] Output bam file with m6A calls in new/extended MM and ML bam tags [default: -]
Options:
-n, --nucleosome-length <NUCLEOSOME_LENGTH> Minium nucleosome length [default: 75]
-c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH> Minium nucleosome length when combining over a single m6A [default: 100]
--min-distance-added <MIN_DISTANCE_ADDED> Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
-d, --distance-from-end <DISTANCE_FROM_END> Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
-k, --keep Keep hifi kinetics data
-h, --help Print help (see more with '--help')
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default: 0]
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env: FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
Developer-Options:
--force-min-ml-score <FORCE_MIN_ML_SCORE> Force a different minimum ML score
--all-calls Keep all m6A calls regardless of how low the ML value is
-b, --batch-size <BATCH_SIZE> Number of reads to include in batch prediction [default: 1]
ft fire
Add FIREs (Fiber-seq Inferred Regulatory Elements) to a bam file with m6a predictions
Usage: ft fire [OPTIONS] [BAM] [OUT]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[OUT] Output file (BAM by default, table of MSP features if `--feats-to-text` is used, and bed9 + if `--extract`` is used) [default: -]
Options:
--ont Use a ONT heuristic adjustment for FIRE calling. This adjusts the observed number of m6A counts by adding pseudo counts to account for the
single stranded nature of ONT data [env: ONT=]
-e, --extract Output just FIRE elements in bed9 format
--all When extracting bed9 format include all MSPs and nucleosomes
-f, --feats-to-text Output FIREs features for training in a table format
-s, --skip-no-m6a Don't write reads with no m6A calls to the output bam
--min-msp <MIN_MSP> Skip reads without at least `N` MSP calls [env: MIN_MSP=] [default: 0]
--min-ave-msp-size <MIN_AVE_MSP_SIZE> Skip reads without an average MSP size greater than `N` [env: MIN_AVE_MSP_SIZE=] [default: 0]
-w, --width-bin <WIDTH_BIN> Width of bin for feature collection [env: WIDTH_BIN=] [default: 40]
-b, --bin-num <BIN_NUM> Number of bins to collect [env: BIN_NUM=] [default: 9]
--best-window-size <BEST_WINDOW_SIZE> Calculate stats for the highest X bp window within each MSP Should be a fair amount higher than the expected linker length [env:
BEST_WINDOW_SIZE=] [default: 100]
--min-msp-length-for-positive-fire-call <MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL> Minium length of msp to call a FIRE [env: MIN_MSP_LENGTH_FOR_POSITIVE_FIRE_CALL=] [default: 85]
--model <MODEL> Optional path to a model json file. If not provided ft will use the default model (recommended) [env: FIRE_MODEL=]
--fdr-table <FDR_TABLE> Optional path to a FDR table [env: FDR_TABLE=]
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default: 0]
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env: FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft extract
Extract fiberseq data into plain text files
Usage: ft extract [OPTIONS] [BAM]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
Options:
-r, --reference Report in reference sequence coordinates
--molecular Report positions in the molecular sequence coordinates
--m6a <M6A> Output path for m6a bed12
-c, --cpg <CPG> Output path for 5mC (CpG, primrose) bed12
--msp <MSP> Output path for methylation sensitive patch (msp) bed12
-n, --nuc <NUC> Output path for nucleosome bed12
-a, --all <ALL> Output path for a tabular format including "all" fiberseq information in the bam
-h, --help Print help (see more with '--help')
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default: 0]
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env: FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
All-Format-Options:
-q, --quality Include per base quality scores in "fiber_qual"
-s, --simplify Simplify output by remove fiber sequence
ft center
This command centers fiberseq data around given reference positions. This is useful for making aggregate m6A and CpG observations, as well as visualization of SVs
Usage: ft center [OPTIONS] --bed <BED> [BAM]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
Options:
-b, --bed <BED> Bed file on which to center fiberseq reads. Data is adjusted to the start position of the bed file and corrected for strand if the strand is indicated in the 6th column of the bed file. The 4th column will
also be checked for the strand but only after the 6th is. If you include strand information in the 4th (or 6th) column it will orient data accordingly and use the end position of bed record instead of the
start if on the minus strand. This means that profiles of motifs in both the forward and minus orientation will align to the same central position
-d, --dist <DIST> Set a maximum distance from the start of the motif to keep a feature
-w, --wide Provide data in wide format, one row per read
-r, --reference Return relative reference position instead of relative molecular position
-s, --simplify Replace the sequence output column with just "N"
-h, --help Print help (see more with '--help')
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default: 0]
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env: FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft add-nucleosomes
Add nucleosomes to a bam file with m6a predictions
Usage: ft add-nucleosomes [OPTIONS] [BAM] [OUT]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[OUT] Output bam file with nucleosome calls [default: -]
Options:
-n, --nucleosome-length <NUCLEOSOME_LENGTH> Minium nucleosome length [default: 75]
-c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH> Minium nucleosome length when combining over a single m6A [default: 100]
--min-distance-added <MIN_DISTANCE_ADDED> Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
-d, --distance-from-end <DISTANCE_FROM_END> Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default: 0]
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env: FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft footprint
Infer footprints from fiberseq data
Usage: ft footprint [OPTIONS] --bed <BED> --yaml <YAML> [BAM]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
Options:
-b, --bed <BED> BED file with the regions to footprint. Should all contain the same motif with proper strand information, and ideally be ChIP-seq peaks
-y, --yaml <YAML> yaml describing the modules of the footprint
-o, --out <OUT> Output bam [default: -]
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default: 0]
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env: FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft clear-kinetics
Remove HiFi kinetics tags from the input bam file
Usage: ft clear-kinetics [OPTIONS] [BAM] [OUT]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[OUT] Output bam file without hifi kinetics [default: -]
Options:
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default: 0]
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env: FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft strip-basemods
Strip out select base modifications
Usage: ft strip-basemods [OPTIONS] [BAM] [OUT]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[OUT] Output bam file [default: -]
Options:
-b, --basemod <BASEMOD> base modification to strip out of the bam file [default: m6A] [possible values: m6A, 6mA, 5mC, CpG]
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default: 0]
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env: FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft track-decorators
Make decorated bed files for fiberseq data
Usage: ft track-decorators [OPTIONS] --bed12 <BED12> [BAM]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
Options:
-b, --bed12 <BED12> Output path for bed12 file to be decorated
-d, --decorator <DECORATOR> Output path for decorator bed file [default: -]
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default: 0]
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env: FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
ft pileup
Make a pileup track of Fiber-seq features from a FIRE bam
Usage: ft pileup [OPTIONS] [BAM] [RGN]
Arguments:
[BAM] Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
[RGN] Region string to make a pileup of. e.g. chr1:1-1000 or chr1:1-1,000 If not provided will make a pileup of the whole genome
Options:
-o, --out <OUT> Output file [default: -]
-m, --m6a include m6A calls
-c, --cpg include 5mC calls
--haps For each column add two new columns with the hap1 and hap2 specific data
-k, --keep-zeros Keep zero coverage regions
-p, --per-base Write output one base at a time even if the values do not change
-h, --help Print help
-V, --version Print version
BAM-Options:
-F, --filter <BIT_FLAG> BAM bit flags to filter on, equivalent to `-F` in samtools view [default: 0]
--ml <MIN_ML_SCORE> Minium score in the ML tag to use or include in the output [env: FT_MIN_ML_SCORE=] [default: 125]
Global-Options:
-t, --threads <THREADS> Threads [default: 8]
Debug-Options:
-v, --verbose... Logging level [-v: Info, -vv: Debug, -vvv: Trace]
--quiet Turn off all logging
pyft: Python bindings for fibertools
See the documentation for pyft
on readthedocs for more information.
Installation
This chapter covers the various ways to install fibertools
.
It is easiest to install fibertools
using conda (instructions). However, for the fastest m6A predictions you need to install with libtorch
enabled (instructions) if you would like to install from source, you can use the following instructions.
From bioconda
fibertools-rs
is avalible through bioconda
and can be installed with the following command:
mamba install -c conda-forge -c bioconda fibertools-rs
However, due to size constraints in bioconda
this version does not support contain the pytorch libraries or GPU acceleration for m6A predictions. m6A predictions will still work in the bioconda version but may be much slower. If you would like to use m6A prediction and GPU acceleration, you will need to install using the directions here.
From crates.io
Installation from crates.io
requires the rust package manager cargo
. You can find how to install cargo
here.. Furthermore, a recent version of gcc
and cmake
is required. I have tested and recommend gcc v10.2.0
and cmake v3.21.1
, though other versions may work.
cargo install fibertools-rs
From GitHub (active development)
Using cargo from source:
cargo install --git https://github.com/fiberseq/fibertools-rs
or using git form source:
git clone https://github.com/fiberseq/fibertools-rs
cd fibertools-rs
cargo build --release
./target/release/ft --help
With libtorch
Get libtorch
v2.2.0 from the PyTorch website and extract the content of the zip file.
- On Linux/Unix system you can download with:
wget https://download.pytorch.org/libtorch/cu118/libtorch-shared-with-deps-2.2.0%2Bcu118.zip
- On macOS you can download with:
wget https://download.pytorch.org/libtorch/cpu/libtorch-macos-2.2.0.zip
- Windows is not supported and will not be.
Then add the following to your .bashrc
or equivalent, where /path/to/libtorch
is the path to the directory that was created when unzipping the file:
export LIBTORCH_CXX11_ABI=0
export LIBTORCH=/path/to/libtorch # e.g. export LIBTORCH=/Users/mrvollger/lib/libtorch
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
export DYLD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
Finally install using cargo
:
cargo install --all-features fibertools-rs
FIRE: descriptions, methods, and outputs
For details on running the FIRE pipeline see the README.md.
Summary of Fiber-seq inferred regulatory elements (FIREs)
For those who would prefer video I have recorded a lab meeting discussing FIRE and the methods used here. If you prefer to read please continue on.
FIREs are MTase sensitive patches (MSPs) that are inferred to be regulatory elements on single chromatin fibers. To do this we used semi-supervised machine learning to identify MSPs that are likely to be regulatory elements using the Mokapot
framework and XGBoost
. Every individual FIRE element is associated with an estimated precision value (defined below), which indicates the probability that the FIRE element is a true regulatory element. The estimated precision of FIREs elements are created using Mokapot
and validation data not used in training. We train our model targeting FIRE elements with at least 95% precision, MSPs with less than 90% precision are considered to have average level of accessibility expected between two nucleosomes, and are referred to as linker regions.
Semi-superivized machine learning with Mokapot
requires a mixed-positive training set and a clean negative training set. To create mixed positive training data we selected MSPs that overlapped DNase hypersensitive sites (DHSs) and CTCF ChIP-seq peaks. And to create a clean negative training set we selected MSPs that did not overlap DHSs or CTCF ChIP-seq peaks.
Training of Fire-seq inferred regulatory elements.
Training data
For training data, we generated 21 different GM12878 Fiber-seq experiments with a range of under- and over-methylated experimental conditions to ensure we captured a broad range of percent m6A (5.8-13.3%) to ensure our model could generalize to new samples with varying levels of m6A. We merged these sequencing results and randomly selected 10% of the Fiber-seq reads from 100,000 randomly selected DNase I and CTCF ChIP-seq peaks for mixed-positive labels and 100,000 equally sized regions that did not overlap DNase or CTCF Chip-seq for negative labels. We then generated features for each of these MSPs, including length, log fold enrichment of m6A, the A/T content, and windowed measures of m6A around the MSP (Supplemental Table 3) and held out 20% of the Fiber-seq reads to be used as test data.
Semi-supervised training
To carry out semi-supervised training, we used an established method, Mokapot (Fondrie & Noble, 2021; Käll et al., 2007), which we summarize below. In the first round of semi-supervised training, Mokapot identifies the feature that best discriminates between our mixed-positive and negative labels and then selects a threshold for that feature such that the mixed-positive labels can be discriminated from the negative labels with 95% estimated precision (defined below). The subset of mixed-positive labels above this threshold is then used as an initial set of positive labels in training an XGBoost model with five-fold cross-validation (Chen & Guestrin, 2016). Then this process is iteratively repeated, using the learned prediction from the previous iteration’s model to create positive labels at 95% estimated precision, until the number of positive identifications at 95% precision in the validation set ceases to increase (15 iterations, Fig. S1) (Fondrie & Noble, 2021; Käll et al., 2007).
Estimated precision of individual FIRE elements
We cannot compute the precision associated with a particular XGBoost score because we do not have access to a set of clean-positive labels. Instead, we define a notion of “estimated precision” using a balanced held-out test set of mixed-positive and negative labels (20% of the data). We defined the estimated precision (EP) of a FIRE element to be
$$ EP = 1 - \frac{FP+1}{TMP} $$
where TMP is the number of “true” identifications from the mixed positive labels with at least that element’s XGBoost score, and FP is the number of false positive identifications from negative labels with at least that element’s XGBoost score. We add a pseudo count of one to the numerator of false positive identifications so as to prevent liberal estimates for smaller collections of identifications (Fondrie & Noble, 2021).
Features of FIRE elements
The following features were used as features with Mokapot
in FIRE element classification:
Feature | Description |
---|---|
msp_len | Length of the MSP |
msp_len_times_m6a_fc | Length of the MSP times the fold-change of m6A in the MSP |
ccs_passes | Number of CCS passes in the MSP |
fiber_m6a_count | Number of m6A sites in the Fiber-seq read |
fiber_AT_count | Number of AT sites in the Fiber-seq read |
fiber_m6a_frac | Fraction of m6A sites in the Fiber-seq read |
msp_m6a | Number of m6A sites in the MSP |
msp_AT | Number of AT sites in the MSP |
msp_m6a_frac | Fraction of m6A sites in the MSP |
msp_fc | Fold-change of m6A fraction in the MSP relative to the whole reads m6A fraction |
m6a_count_X | Number of m6A sites in the Xth 40 bp window |
AT_count_X | Number of AT sites in the Xth 40 bp window |
m6a_frac_X | Fraction of m6A sites in the Xth 40 bp window |
m6a_fc_X | Fold-change of m6A fraction in the Xth 40 bp window relative to the whole reads m6A fraction |
There are nine 40 bp windows for each MSP (X = 1, 2, ..., 9), with the 5th window centered on the MSP.
Aggregation of FIRE elements for peak calling
Aggregate FIRE score calculation
The FIRE score (\(S_g\)) for a position in the genome (\(g\)) is calculated using the following formula:
$$ S_g=-\frac{50}{R_g} \sum_{i=1}^{C_g} \log_{10}(1-min(EP_i , 0.99)) $$
where \(C_g\) is the number of FIRE elements at the \(g\)th position, \(R_g\) is the number of Fiber-seq reads at the \(g\)th position, and \(EP_i\) is the estimated precision of the \(i\)th FIRE element at the \(g\)th position. The estimated precision of each FIRE element is thresholded at 0.99, such that the FIRE score takes on values between 0 and 100. Regions covered by less than four FIRE elements (i.e., if \(C_g < 4\)) are not scored and are given a value of negative one.
Regions of unreliable coverage
Regions with unreliable coverage are defined as regions with Fiber-seq coverage that deviate from the median coverage by five standard deviations, where a standard deviation is defined by the Poisson distribution (i.e., the square root of the mean coverage). This is a parameter that can be adjusted by the user.
FIRE score FDR calculation
We shuffle the location of an entire Fiber-seq read by selecting a random start position within the chromosome and relocating the entire read to that start position. Fiber-seq reads originating from regions with unreliable coverage (defined above) are not shuffled, and reads from regions with reliable coverage are not shuffled into regions with unreliable coverage (bedtools shuffle -chrom -excl
, v2.31.0) (Quinlan & Hall, 2010). We then compute FIRE scores associated with the shuffled genome. Recalling that the FIRE score for the \(g\)th position in the genome is denoted \(S_g\), we divide the number of bases that have shuffled FIRE scores above \(S_g\) by the number of bases that have un-shuffled FIRE scores above \(S_g\). This provides an estimate of the FDR associated with the FIRE score \(S_g\).
Peak calling
Peaks are called by identifying FIRE score local-maxima that have FDR values below a 5% threshold and at least 10% actuation (Cg / Rg). Adjacent local maxima that share 50% of the underlying FIRE elements or have 90% reciprocal overlap are merged into a single peak, using the higher of the two local maxima. Then, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculated and reported wide peaks by taking the union of the FIRE peaks and all regions below the FDR threshold and then merged the resulting regions that were within one nucleosome (147 bp) of one another.
Haplotype-selective peaks
Haplotype-selective peaks
For peaks with at least 10 Fiber-seq reads on both haplotypes, we test the difference in percent actuation in each haplotype (fraction of reads with FIRE elements) using a two-sided Fisher’s exact test. Specifically, the inputs for the test are the number of FIRE elements in haplotype one, the number of Fiber-seq reads without FIRE elements in haplotype one, the number of FIRE elements in haplotype two, and the number of Fiber-seq reads without FIRE elements in haplotype two. We then apply an FDR correction (Benjamini-Hochberg) to correct for multiple testing.
Glossary of Fiber-seq terms
Fiber-seq protocol
A treatment of DNA chromatin fibers with the Hia5 enzyme which preferentially methylates accessible adenosines. The fibers are then sequenced using a long-read platform that can detect the m6A modifications.
Fiber-seq read or fiber
A DNA fiber that has been sequenced using the Fiber-seq protocol and has m6A modifications called.
m6A
N6-methyladenosine: A DNA modification made during the Fiber-seq protocol on accessible adenosines. Native m6A sites are exceedingly rare or non-existent in eukaryotic DNA.
Inferred nucleosome
A nucleosome inferred from the Fiber-seq data. The nucleosome position along the DNA is inferred by the ft add-nucleosomes
algorithm using the m6A modifications (not a direct observation of a nucleosome).
MSP
Methyltransferase sensitive patch: A stretch of a Fiber-seq read that has a high density of m6A sites. Specifically it is defined as a region that is inferred to not be occluded by a nucleosome.
FIRE
Fiber-seq Inferred Regulatory Element: A MSP that is inferred to be a regulatory element based on features of the MSP including m6A density.
Internucleosomal linker region
Any MSPs that is not a FIRE element.
The Fiber-seq BAM format
The Fiber-seq BAM format adds to m6A BAM data from a PacBio (ft predict-m6a
) or ONT (dorado
) sequencing run with m6A.
The following tags are added to the BAM file:
ns:B:I,
: Nucleosome start sites (0-based) on the forward strand of the sequencing read (u32).nl:B:I,
: Equal length array tons
of nucleosome lengths (u32).as:B:I,
: MSP start sites (0-based) on the forward strand of the sequencing read (u32).al:B:I,
: Equal length array toas
of MSP lengths (u32).aq:B:C,
: Quality scores for the MSP [0, 255] (u8). This tag is optional and added byft fire
.
The ns
or as
tag do not need to begin or end at the start or end of the read; however, once begun, they must be contiguous. i.e. the ns
and as
tags must combine to form a contiguous set of alternating nucleosome and MSP sites. The ns
, nl
, as
, and al
tags are added to the BAM file automatically using ft predict-m6a
or later using the ft add-nucleosomes
command.
The aq
tag is added using the ft fire
command and represents the estimated precision of the MSP being a FIRE. Specifically, the estimated precision of a FIRE is the value of the aq
tag divided by 255.
Cite
If you use fibertools
in your research, please cite the following paper:
Jha, A., Bohaczuk, S. C., Mao, Y., Ranchalis, J., Mallory, B. J., Min, A. T., Hamm, M. O., Swanson, E., Dubocanin, D., Finkbeiner, C., Li, T., Whittington, D., Noble, W. S., Stergachis, A. B., & Vollger, M. R. (2024). DNA-m6A calling and integrated long-read epigenetic and genetic analysis with fibertools. Genome Research. https://doi.org/10.1101/gr.279095.124
If you use FIRE
in your research, please cite the following paper:
Vollger, M. R., Swanson, E. G., Neph, S. J., Ranchalis, J., Munson, K. M., Ho, C.-H., Sedeño-Cortés, A. E., Fondrie, W. E., Bohaczuk, S. C., Mao, Y., Parmalee, N. L., Mallory, B. J., Harvey, W. T., Kwon, Y., Garcia, G. H., Hoekzema, K., Meyer, J. G., Cicek, M., Eichler, E. E., … Stergachis, A. B. (2024). A haplotype-resolved view of human gene regulation. bioRxiv. https://doi.org/10.1101/2024.06.14.599122
Analyses of Fiber-seq data
This notebook contains some of the prepublication analyses of Fiber-seq data we have done:
ONT Fiber-seq Analysis
Matched ONT and PacBio preparations
Read lengths unchanged by Fiber-seq
Impact of Fiber-seq on ONT read quality
A comparison of ONT and PacBio Fiber-seq