Genomic Datasets¶
One of the central features of Janggu are the genomic datasets Cover and Bioseq. On the one hand, they allow quick and flexible access to genomics data, including the FASTA, BAM, BIGWIG, BED and GFF file formats, which bridges the gap between data residing in raw file formats and the numpy inputs required for python-based deep learning models (e.g. keras). On the other hand, predictions from a deep learning library are again in numpy format. Janggu facilitates the conversion between numpy arrays and Cover objects in order to associate the predictions with the respective genomic coordinates. Finally, coverage information may be exported to BIGWIG or inspected directly via genome browser-like plots.
General principle of the genomic datasets¶
Internally, the genomic datasets maintain coverage- or sequence-type information along with the associated genomic intervals. Externally, the datasets behave similarly to a numpy array. This makes it possible to directly consume the datasets using keras, for instance.
In this section we briefly describe the internal organisation of these datasets.
The classes Cover and Bioseq each maintain a GenomicArray and a GenomicIndexer object. GenomicArray is a general data structure that holds numeric information about the genome, for instance read count coverage. It can be accessed via genomic coordinates (e.g. chromosome name, start and end) and returns the respective data as a numpy array. The GenomicIndexer maintains the list of genomic coordinates that should be traversed during training/evaluation and can be accessed by an integer-valued index, which returns the associated genomic coordinates. When querying the i-th region from Cover or Bioseq, the index is passed to the GenomicIndexer, which yields the genomic coordinates that are in turn passed on to the GenomicArray. The result is returned in numpy format. Similarly, the dataset objects also support slicing and indexing via a list of indices, which is usually relevant for mini-batch learning.
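The lookup described above can be sketched in pure Python. The ToyGenomicIndexer and ToyGenomicArray classes below are hypothetical, simplified stand-ins for janggu's internal classes, intended only to illustrate how an integer index is translated into genomic coordinates and then into a numpy array:

```python
import numpy as np

class ToyGenomicIndexer:
    """Hypothetical stand-in: maps integer indices to genomic coordinates."""
    def __init__(self, regions):
        # regions: list of (chrom, start, end) tuples to traverse
        self.regions = regions

    def __getitem__(self, idx):
        return self.regions[idx]

class ToyGenomicArray:
    """Hypothetical stand-in: holds numeric data accessible by coordinates."""
    def __init__(self, signal):
        # signal: dict mapping chromosome name -> 1D numpy array
        self.signal = signal

    def fetch(self, chrom, start, end):
        return self.signal[chrom][start:end]

indexer = ToyGenomicIndexer([('chr1', 0, 4), ('chr1', 4, 8)])
garray = ToyGenomicArray({'chr1': np.arange(10)})

# Querying the i-th region: index -> coordinates -> numpy array
chrom, start, end = indexer[1]
region_signal = garray.fetch(chrom, start, end)  # values 4..7
```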
Normalization¶
Upon creation of a Cover object, normalization of the raw data might be required, for instance to make coverage tracks comparable across replicates or experiments. To this end, create_from_bam, create_from_bigwig and create_from_bed expose a normalizer option.
Janggu already implements various normalization methods which can be invoked by name, for instance TPM (transcripts per million) normalization:
Cover.create_from_bam('track', bamfiles=samplefile, roi=roi, normalizer='tpm')
Other preprocessing and normalization options are: zscore, zscorelog, binsizenorm and perctrim. The latter two apply read-depth normalization and trim the signal intensities at the 99th percentile, respectively.
Normalizers can also be applied via callables and/or in combination with other transformations. For instance, suppose we want to trim the outliers at the 95th percentile instead and subsequently apply the z-score transformation. Then we could use
from janggu.data import PercentileTrimming
from janggu.data import ZScore

Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      normalizer=[PercentileTrimming(95), ZScore()])
It might be necessary to evaluate the normalization parameters on one dataset and apply the same transformation to other datasets. For instance, in the case of the ZScore, we might want to keep the mean and standard deviation that was obtained from, say, the training set, and reuse them to normalize the test set. This is possible by creating a single zscore object that is used multiple times. At the first invocation, the mean and standard deviation are determined and the transformation is applied. Subsequently, the z-score is computed using the previously determined mean and standard deviation. For example:
from janggu.data import ZScore
zscore = ZScore()
# First, mean and std will be determined.
# Then zscore transformation is applied.
Cover.create_from_bam('track_train', bamfiles=samplefile, roi=roitrain,
                      normalizer=[zscore])
# Subsequently, zscore transformation is applied with
# the same mean and std determined from the training set.
Cover.create_from_bam('track_test', bamfiles=samplefile, roi=roitest,
                      normalizer=[zscore])
In case a different normalization procedure is required that is not contained in janggu, it is possible to define a custom normalizer as follows:
def custom_normalizer(genomicarray):
    # perform the normalization on the genomicarray
    return genomicarray
The currently implemented normalizers may be a good starting point for this purpose.
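The fit-once, reuse-later behaviour described above can also be sketched with a minimal stateful callable. ToyZScore below is a hypothetical illustration of the principle, not janggu's actual ZScore implementation:

```python
import numpy as np

class ToyZScore:
    """Hypothetical sketch: fit mean/std on first call, reuse afterwards."""
    def __init__(self):
        self.mean = None
        self.std = None

    def __call__(self, array):
        if self.mean is None:
            # First invocation: determine mean and std from this dataset.
            self.mean = array.mean()
            self.std = array.std()
        # Every invocation transforms with the stored parameters.
        return (array - self.mean) / self.std

zscore = ToyZScore()
train = np.array([0., 2., 4., 6.])
test = np.array([1., 3.])
train_normed = zscore(train)  # fits on the training set, then transforms
test_normed = zscore(test)    # reuses the training-set mean and std
```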
Granularity of the coverage¶
Depending on the application, different granularities of the coverage data might be required. For instance, one might be interested in reading out nucleotide-resolution coverage for one purpose, or 50 base-pair resolution bins for another. Furthermore, in some cases the signal of variable-size regions might be of interest, for example the read counts across gene bodies in order to measure gene expression levels.
These adjustments can be made when invoking create_from_bam, create_from_bigwig and create_from_bed using an appropriate region of interest (ROI) file in conjunction with the resolution and collapser parameters.
First, the resolution parameter controls the coverage granularity. For example, base-pair and 50-base-pair resolution can be obtained using
Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      resolution=1)
Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      resolution=50)
In case the signal intensity should be summarized across the entire interval, specify resolution=None. For instance, if the region of interest contains a set of variable-length gene bodies, the total read count per gene can be obtained using
Cover.create_from_bam('genes',
                      bamfiles=samplefile,
                      roi=geneannot,
                      resolution=None)
It is also possible to use resolution=None in conjunction with e.g. binsize=200, which would have the same effect as choosing binsize=resolution=200.
Whenever we deal with resolution>1, an aggregation operation needs to be performed to summarize the signal intensity across each region. For instance, for create_from_bam the reads are summed within each interval. For create_from_bigwig and create_from_bed, it is possible to adjust the collapser. For example, 'mean' or 'sum' aggregation can be applied by name or by handing over a callable according to
import numpy as np

Cover.create_from_bigwig('bwtrack',
                         bigwigfiles=samplefile,
                         roi=roi,
                         resolution=50,
                         collapser='mean')
Cover.create_from_bigwig('bwtrack',
                         bigwigfiles=samplefile,
                         roi=roi,
                         resolution=50,
                         collapser=np.sum)
Moreover, more specialized aggregations may require a custom collapser function. In that case, it is important to note that the function receives a 3D numpy array and the aggregation should be performed across the second dimension. For example:
def custom_collapser(numpyarray):
    # Initially, the dimensions of numpyarray correspond to
    # (intervallength // resolution, resolution, strand)
    numpyarray = numpyarray.sum(axis=1)
    # Subsequently, we return the array of shape
    # (intervallength // resolution, strand)
    return numpyarray
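To see the shape convention in action, such a collapser can be applied to a toy numpy array. This is a standalone sketch, independent of janggu, using two bins at resolution 3 with a single strand dimension:

```python
import numpy as np

def custom_collapser(numpyarray):
    # Sum over the resolution dimension (axis 1), as described above.
    return numpyarray.sum(axis=1)

# Toy input with layout (intervallength // resolution, resolution, strand):
# 2 bins, resolution 3, 1 strand dimension.
signal = np.arange(6).reshape(2, 3, 1)
collapsed = custom_collapser(signal)
# collapsed has shape (2, 1); the bin sums are 0+1+2 and 3+4+5.
```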
Caching¶
The construction of a genomic dataset, including loading and preprocessing, might require a significant amount of time. In order to avoid having to create the coverage profiles each time you want to use them, they can be cached and quickly reloaded later.
Caching can be activated via the option cache=True.
When caching is activated, janggu checks for changes in the file content, the file composition and various dataset-specific arguments (e.g. binsize, resolution) by constructing a SHA256 hash. The dataset is reloaded from the cache if the determined hash already exists; otherwise it is built from scratch.
Example:
# load hg19 if the cache file does not exist yet, otherwise
# reload it.
Bioseq.create_from_refgenome('dna', refgenome, order=1, cache=True)
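The idea behind the cache check can be sketched as follows. The cache_key function below is purely illustrative and does not reproduce janggu's exact hashing scheme; it merely shows how hashing the file contents together with dataset-specific arguments means that any change invalidates the cache:

```python
import hashlib

def cache_key(file_contents, binsize, resolution):
    """Illustrative sketch: derive a SHA256 key from file contents
    and dataset-specific arguments."""
    h = hashlib.sha256()
    for content in file_contents:
        h.update(content)
    # Fold the dataset-specific arguments into the hash as well.
    h.update(repr((binsize, resolution)).encode())
    return h.hexdigest()

key1 = cache_key([b'ACGT'], binsize=200, resolution=1)
key2 = cache_key([b'ACGT'], binsize=200, resolution=50)
# Same files, different resolution: the keys differ, so the
# dataset would be rebuilt rather than loaded from the cache.
```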
Dataset storage¶
Storage option¶
Depending on the structure of the dataset, the memory required to store the data and the available memory on your machine, different storage options are available for the genomic datasets: as numpy array, as sparse array or as hdf5 dataset.
To this end, create_from_bam, create_from_bigwig, create_from_bed, create_from_seq and create_from_refgenome expose the storage option, which may be ‘ndarray’, ‘sparse’ or ‘hdf5’.
‘ndarray’ offers perhaps the fastest access time, but is also the most memory-demanding option for storing the data. It might be useful for dense datasets and for relatively small datasets that conveniently fit into memory.
If the data is sparse, the option ‘sparse’ yields a good compromise between access time and memory consumption. In that case, the data is stored in its compressed sparse form and converted to a dense representation when querying mini-batches. This option may be used to store e.g. genome-wide ChIP-seq peak profiles, where peaks occur relatively rarely.
Finally, if the data is too large to be kept in memory, the option ‘hdf5’ allows the data to be consumed directly from disk. While the access time for processing data from hdf5 files may be higher, it allows huge datasets to be processed with only a small amount of RAM in your machine.
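The trade-off behind the sparse option can be illustrated with a minimal sketch: only non-zero positions and values are kept, and a dense slice is reconstructed on demand, as when a mini-batch is queried. ToySparseTrack is a hypothetical illustration; janggu's internal representation may differ:

```python
import numpy as np

class ToySparseTrack:
    """Hypothetical sketch: compressed sparse storage, densified on demand."""
    def __init__(self, dense):
        # Keep only the non-zero positions and their values.
        self.idx = np.flatnonzero(dense)
        self.val = dense[self.idx]

    def dense(self, start, end):
        # Reconstruct a dense slice, e.g. for a mini-batch query.
        out = np.zeros(end - start)
        mask = (self.idx >= start) & (self.idx < end)
        out[self.idx[mask] - start] = self.val[mask]
        return out

peaks = np.zeros(1000)
peaks[[10, 500]] = 1.0        # a sparse peak profile
track = ToySparseTrack(peaks)  # stores only 2 of 1000 values
batch = track.dense(0, 20)     # densified slice for a mini-batch
```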
Whole and partial genome storage¶
Cover and Bioseq further allow coverage and sequence information to be maintained for the entire genome or only for the part that is actively consumed during training. This option can be configured via store_whole_genome=True/False.
In most situations, the user may find it convenient to set store_whole_genome=False. In that case, when loading Cover and Bioseq, only information overlapping the region of interest will be gathered. The advantage is that no overhead of information needs to be stored when only a small part of the genome is of interest.
On the other hand, store_whole_genome=True might be advantageous for the following purposes:
- If a large part of the genome is consumed for training/evaluation
- If in addition the stepsize for traversing the genome is smaller than binsize, in which case mutually overlapping intervals do not have to be stored redundantly.
- It simplifies sharing of the same genomic array for different tasks. For example, during training and testing different parts of the same genomic array may be consumed.
Converting Numpy to Cover¶
When performing predictions, e.g. with a keras model,
the output corresponds to an ordinary numpy array.
In order to re-establish the association of the predicted values with the genomic coordinates, Cover exposes the constructor create_from_array. Upon invocation, a new Cover object is composed that holds the predicted values. These predictions may subsequently be illustrated via plotGenomeTrack or exported to a BIGWIG file.
Evaluation features¶
Cover objects may be exported as BIGWIG files. Accordingly, one file will be created for each condition in the Cover object.
It is also possible to illustrate predictions in terms of a genome browser-like plot using plotGenomeTrack, which allows to interactively explore prediction scores (perhaps in comparison with the true labels) or feature activities of the internal layers of a neural net. plotGenomeTrack returns a matplotlib figure that can be stored in a file using native matplotlib functionality.
Rearranging channel dimensions¶
Depending on the deep learning library that is used, the dimensions of the tensors need to be set up in a specific order. For example, tensorflow expects the channel to be represented by the last dimension, while theano or pytorch expect the channel at the first dimension. With the option channel_last=True/False it is possible to configure the output dimensionality of Cover and Bioseq.
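For illustration, the rearrangement corresponds to moving the channel axis with numpy. The shapes below assume a one-hot encoded DNA batch of the form (samples, length, strand, channel); the exact dimensions of your data may differ:

```python
import numpy as np

# Channel-last layout: (samples, length, strand, channel),
# e.g. a batch of one-hot encoded 200-bp DNA sequences.
batch = np.zeros((32, 200, 1, 4))

# Channel-first layout, as expected by e.g. theano or pytorch:
# move the last axis to position 1, right after the sample axis.
channel_first = np.moveaxis(batch, -1, 1)
# channel_first now has shape (32, 4, 200, 1)
```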
Wrapper Datasets¶
A Cover object represents a 4D array. However, sometimes one or more dimensions of a Cover might be singleton (i.e. contain only one element). These dimensions can be dropped using ReduceDim, for example ReduceDim(cover).
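Conceptually, this corresponds to squeezing singleton axes of a numpy array. The sketch below illustrates the effect only; the actual ReduceDim wrapper additionally keeps the association with the genomic coordinates intact:

```python
import numpy as np

# A 4D coverage-like array with singleton length and strand dimensions,
# e.g. (regions, length, strand, conditions) with resolution=None.
coverage = np.zeros((8, 1, 1, 3))

# Dropping the singleton axes yields a plain 2D array per region/condition.
reduced = np.squeeze(coverage, axis=(1, 2))
# reduced now has shape (8, 3)
```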
Different views on the datasets¶
Suppose you have already loaded the DNA sequence from a reference genome and you want to use different parts of it for training and for validating the model performance. This is achieved by the view mechanism, which allows the same dataset to be reused by instantiating views that read out different subsets.
For example, a view constituting the training and test set, respectively.
from pkg_resources import resource_filename
from janggu.data import Bioseq, view

# union ROI for training and test set.
ROI_FILE = resource_filename('janggu', 'resources/roi.bed')
ROI_TRAIN_FILE = resource_filename('janggu', 'resources/roi_train.bed')
ROI_TEST_FILE = resource_filename('janggu', 'resources/roi_test.bed')

DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                   roi=ROI_FILE,
                                   binsize=200,
                                   store_whole_genome=True)

DNA_TRAIN = view(DNA, ROI_TRAIN_FILE)
DNA_TEST = view(DNA, ROI_TEST_FILE)
Since underneath the actual dataset is just referenced rather than copied, the memory footprint won't increase; the views merely read out different parts of the genome.
An example is illustrated in the jupyter notebook.
Randomized dataset¶
In order to achieve good predictive performances, it is recommended to randomize the mini-batches during model fitting. This is usually achieved by specifying shuffle=True in the fit method.
However, when using an HDF5 dataset, this approach may be prohibitively slow due to the limitation that data from HDF5 files needs to be accessed in chunks rather than in random-access fashion.
In order to overcome this issue, it is possible to randomize the dataset already at loading time, such that the data can later be consumed with shuffle=False by reading coherent chunks.
For example, randomization is induced by specifying an integer-valued random_state as in the example below
DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                   roi=ROI_TRAIN_FILE,
                                   binsize=200,
                                   storage='hdf5',
                                   cache=True,
                                   store_whole_genome=False,
                                   random_state=43)
For this option to be effective and correct, all datasets consumed during e.g. training need to be provided with the same random_state value. Furthermore, the HDF5 file needs to be stored with store_whole_genome=False, since data storage is not affected by the random_state when the entire genome is stored.
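Why the same random_state must be shared across datasets can be sketched with numpy: the same seed produces the same permutation of region indices, so shuffled inputs and targets remain aligned. This is an illustration of the principle, not janggu's internal shuffling code:

```python
import numpy as np

n_regions = 10

# Two datasets (e.g. inputs and targets) shuffled with the same seed
# receive identical permutations of their region indices.
perm_inputs = np.random.RandomState(43).permutation(n_regions)
perm_targets = np.random.RandomState(43).permutation(n_regions)

# A different seed would break the alignment between the datasets.
perm_other = np.random.RandomState(44).permutation(n_regions)
```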
An example is illustrated in the jupyter notebook.
Output directory configuration¶
Optionally, janggu produces various kinds of output files, including cache files for the datasets, log files for monitoring the training / evaluation procedure, stored model parameters or summary output files about the evaluation performance.
The root directory specifying the janggu output location can be configured by setting the environment variable JANGGU_OUTPUT.
This might be done in the following ways:
Setting the directory globally:
export JANGGU_OUTPUT='/output/dir'
on startup of the script:
JANGGU_OUTPUT='/output/dir' python classify.py
or inside your model script using
import os
os.environ['JANGGU_OUTPUT']='/output/dir'
If JANGGU_OUTPUT is not set, the root directory will be set to /home/user/janggu_results.