Genomic Datasets

Central features of Janggu are the genomic datasets Cover and Bioseq. On the one hand, they allow quick and flexible access to genomics data, including the FASTA, BAM, BIGWIG, BED and GFF file formats, and thereby bridge the gap between data in raw file formats and the numpy inputs required for python-based deep learning models (e.g. keras). On the other hand, predictions from a deep learning library are again numpy arrays. Janggu facilitates the conversion between numpy arrays and Cover objects in order to associate predictions with the respective genomic coordinates. Finally, coverage information may be exported to BIGWIG or inspected directly via genome browser-like plots.

General principle of the Genomic Datasets

Internally, the genomic datasets maintain coverage- or sequence-type information along with the associated genomic intervals. Externally, the datasets behave similarly to a numpy array. This makes it possible to directly consume the datasets with keras, for instance.

In this section we briefly describe the internal organisation of these datasets. The classes Cover and Bioseq maintain a GenomicArray and a GenomicIndexer object. GenomicArray is a general data structure that holds numeric information about the genome, for instance read count coverage. It can be accessed via genomic coordinates (e.g. chromosome name, start and end) and returns the respective data as a numpy array. The GenomicIndexer maintains the list of genomic coordinates that should be traversed during training/evaluation. It can be accessed by an integer-valued index and returns the associated genomic coordinates.

When querying the i-th region from Cover or Bioseq, the index is passed to the GenomicIndexer, which yields the genomic coordinates that are in turn passed on to the GenomicArray. The result is returned in numpy format. Similarly, the dataset objects also support slicing and indexing via a list of indices, which is usually relevant for mini-batch learning.
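For illustration, the external access pattern looks roughly as follows (a minimal sketch, reusing the samplefile and roi placeholders used in the examples below):

from janggu.data import Cover

cover = Cover.create_from_bam('track', bamfiles=samplefile, roi=roi)

# numpy array associated with the 10th region of interest
region = cover[10]

# mini-batch assembled from a list of indices
batch = cover[[0, 1, 2, 3]]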

Normalization

Upon creation of a Cover object, normalization of the raw data might be required, for instance to make coverage tracks comparable across replicates or experiments. To this end, create_from_bam, create_from_bigwig and create_from_bed expose a normalizer option. Janggu already implements various normalization methods which can be selected by name, e.g. TPM (transcripts per million) normalization. For instance, using

Cover.create_from_bam('track', bamfiles=samplefile, roi=roi, normalizer='tpm')

Other preprocessing and normalization options are: zscore, zscorelog, binsizenorm and perctrim. The latter two normalize for read depth and trim the signal intensities at the 99th percentile, respectively.

Normalizers can also be applied via callables and/or in combination with other transformations. For instance, suppose we want to trim the outliers at the 95th percentile instead and subsequently apply the z-score transformation; then we could use

from janggu.data import PercentileTrimming
from janggu.data import ZScore

Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      normalizer=[PercentileTrimming(95), ZScore()])

It might be necessary to estimate the normalization parameters on one dataset and apply the same transformation to other datasets. For instance, in the case of the ZScore, we might want to keep the mean and standard deviation obtained from, say, the training set and reuse them to normalize the test set. This is possible by creating a single ZScore object that is used multiple times. At the first invocation, the mean and standard deviation are determined and the transformation is applied. Subsequent invocations apply the z-score transformation using the previously determined mean and standard deviation. For example:

from janggu.data import ZScore

zscore = ZScore()

# First, mean and std will be determined.
# Then zscore transformation is applied.
Cover.create_from_bam('track_train', bamfiles=samplefile, roi=roitrain,
                      normalizer=[zscore])

# Subsequently, zscore transformation is applied with
# the same mean and std determined from the training set.
Cover.create_from_bam('track_test', bamfiles=samplefile, roi=roitest,
                      normalizer=[zscore])

In case a different normalization procedure is required that is not contained in janggu, it is possible to define a custom normalizer as follows:

def custom_normalizer(genomicarray):

   # perform the desired normalization on the values
   # stored in genomicarray

   return genomicarray

The currently implemented normalizers may be a good starting point for this purpose.

Granularity of the coverage

Depending on the application, a different granularity of the coverage data might be required. For instance, one might be interested in reading out nucleotide-resolution coverage for one purpose or 50-base-pair resolution bins for another. Furthermore, in some cases the signal of variable-size regions might be of interest, for example the read counts across gene bodies in order to measure gene expression levels.

These adjustments can be made when invoking create_from_bam, create_from_bigwig and create_from_bed by using an appropriate region of interest (ROI) file in conjunction with the resolution and collapser parameters.

First, the resolution parameter controls the coverage granularity. For example, base-pair and 50-base-pair resolution can be obtained using

Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      resolution=1)

Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      resolution=50)

In case the signal intensity should be summarized across the entire interval, specify resolution=None. For instance, if the region of interest contains a set of variable-length gene bodies, the total read count per gene can be obtained using

Cover.create_from_bam('genes',
                      bamfiles=samplefile,
                      roi=geneannot,
                      resolution=None)

It is also possible to use resolution=None in conjunction with e.g. binsize=200, which would have the same effect as choosing binsize=resolution=200.
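For instance, the following two calls are expected to yield equivalent results (a sketch using the placeholder variables from above):

Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      binsize=200, resolution=None)

Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      binsize=200, resolution=200)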

Whenever we deal with resolution>1, an aggregation operation needs to be performed to summarize the signal intensity across the region. For instance, for create_from_bam the reads are summed within each interval.

For create_from_bigwig and create_from_bed, it is possible to adjust the collapser. For example, ‘mean’ or ‘sum’ aggregation can be applied by name or by handing over a callable according to

import numpy as np

Cover.create_from_bigwig('bwtrack',
                         bigwigfiles=samplefile,
                         roi=roi,
                         resolution=50,
                         collapser='mean')

Cover.create_from_bigwig('bwtrack',
                         bigwigfiles=samplefile,
                         roi=roi,
                         resolution=50,
                         collapser=np.sum)

Moreover, more specialized aggregations may require a custom collapser function. In that case, it is important to note that the function receives a 3D numpy array and the aggregation should be performed across the second dimension. For example

def custom_collapser(numpyarray):

   # Initially, the dimensions of numpyarray correspond to
   # (intervallength // resolution, resolution, strand)

   numpyarray = numpyarray.sum(axis=1)

   # Subsequently, we return the array of shape
   # (intervallength // resolution, strand)

   return numpyarray
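The custom collapser can then be handed over in place of a named collapser (a sketch, analogous to the calls above):

Cover.create_from_bigwig('bwtrack',
                         bigwigfiles=samplefile,
                         roi=roi,
                         resolution=50,
                         collapser=custom_collapser)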

Caching

The construction of a genomic dataset, including loading and preprocessing, might require a significant amount of time. In order to avoid having to create the coverage profiles from scratch each time you want to use them, they can be cached and quickly reloaded later. Caching is activated via the option cache=True. When caching is used, janggu checks for changes in the file content, file composition and various dataset-specific arguments (e.g. binsize, resolution) by constructing a SHA256 hash. The dataset is rebuilt from scratch only if the determined hash does not yet exist; otherwise, the cached version is reloaded.

Example:

# load hg19 if the cache file does not exist yet, otherwise
# reload it.
Bioseq.create_from_refgenome('dna', refgenome, order=1, cache=True)

Dataset storage

Storage options

Depending on the structure of the dataset, the memory required to store the data and the memory available on your machine, different storage options are available for the genomic datasets: as a numpy array, as a sparse array or as an hdf5 dataset. To this end, create_from_bam, create_from_bigwig, create_from_bed, create_from_seq and create_from_refgenome expose the storage option, which may be ‘ndarray’, ‘sparse’ or ‘hdf5’.

‘ndarray’ offers perhaps the fastest access time, but is also the most memory-demanding option for storing the data. It is useful for dense datasets and for relatively small datasets that conveniently fit into memory.

If the data is sparse, the option ‘sparse’ yields a good compromise between access time and memory consumption. In that case, the data is stored in its compressed sparse form and converted to a dense representation when querying mini-batches. This option may be used to store e.g. genome-wide ChIP-seq peak profiles, since peaks occur relatively rarely.

Finally, if the data is too large to be kept in memory, the option ‘hdf5’ allows the data to be consumed directly from disk. While the access time for data stored in hdf5 files may be higher, it allows huge datasets to be processed with a small amount of RAM.
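As a sketch, reusing the placeholder variables from above (peakfile denotes a hypothetical BED file with peak calls), the storage backend is selected as follows:

# sparse storage, e.g. for rarely occurring peaks
Cover.create_from_bed('peaks', bedfiles=peakfile, roi=roi,
                      storage='sparse')

# hdf5 storage for data that does not fit into memory
Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      storage='hdf5', cache=True)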

Whole and partial genome storage

Cover and Bioseq further allow coverage and sequence information to be maintained either for the entire genome or only for the part that is actively consumed during training. This behaviour is configured via store_whole_genome=True/False.

In most situations, the user may find it convenient to set store_whole_genome=False. In that case, only information overlapping the region of interest is gathered when loading Cover and Bioseq. The advantage is that no unnecessary overhead is stored when only a small part of the genome is of interest.

On the other hand, store_whole_genome=True might be advantageous for the following purposes (see the sketch after the list):

  1. If a large part of the genome is consumed for training/evaluation
  2. If, in addition, the stepsize for traversing the genome is smaller than the binsize, since mutually overlapping intervals then do not have to be stored redundantly.
  3. It simplifies sharing of the same genomic array for different tasks. For example, during training and testing different parts of the same genomic array may be consumed.
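A minimal sketch contrasting the two modes, using the placeholders from above:

# store only the regions overlapping the ROI
Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      store_whole_genome=False)

# store the entire genome and read out the ROI from it
Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      store_whole_genome=True)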

Converting Numpy to Cover

When performing predictions, e.g. with a keras model, the output corresponds to an ordinary numpy array. In order to re-establish the association of the predicted values with their genomic coordinates, Cover exposes the constructor create_from_array. Upon invocation, a new Cover object is composed that holds the predicted values. These predictions may subsequently be illustrated via plotGenomeTrack or exported to a BIGWIG file.
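A minimal sketch of this workflow might look as follows; note that the keras model and the use of the dataset's gindexer attribute to supply the genomic coordinates are assumptions made for illustration:

# predictions of a keras model are plain numpy arrays
pred = model.predict(DNA_TEST)

# re-associate the predictions with genomic coordinates
# (assumes the gindexer of the test dataset describes the predicted regions)
pred_cover = Cover.create_from_array('predictions', pred,
                                     DNA_TEST.gindexer)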

Evaluation features

Cover objects may be exported as BIGWIG files. Accordingly, one file will be created for each condition in the Cover object.

It is also possible to illustrate predictions in terms of a genome browser-like plot using plotGenomeTrack, which allows interactive exploration of prediction scores (perhaps in comparison with the true labels) or of the feature activities of the internal layers of a neural net. plotGenomeTrack returns a matplotlib figure that can be stored to a file using native matplotlib functionality.
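A hedged sketch of both options; the export_to_bigwig method, the plotGenomeTrack import location and the genomic coordinates are assumptions for illustration and may differ between versions:

from janggu import plotGenomeTrack

# one BIGWIG file per condition is written to the output directory
pred_cover.export_to_bigwig(output_dir='/output/dir')

# genome browser-like view of a region, returned as a matplotlib figure
fig = plotGenomeTrack([pred_cover], 'chr1', 16000, 18000)
fig.savefig('predictions.png')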

Rearranging channel dimensions

Depending on the deep learning library that is used, the dimensions of the tensors need to be arranged in a specific order. For example, tensorflow expects the channel to be represented by the last dimension, while theano or pytorch expect the channel in the first dimension. With the option channel_last=True/False it is possible to configure the output dimensionality of Cover and Bioseq.
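As a sketch, using the placeholders from above:

# channel in the last dimension (e.g. for tensorflow/keras)
Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      channel_last=True)

# channel in the first dimension (e.g. for theano- or pytorch-style models)
Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                      channel_last=False)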

Wrapper Datasets

A Cover object represents a 4D array. However, sometimes one or more dimensions of a Cover might contain only a single element. These dimensions can be dropped using ReduceDim, for example ReduceDim(cover).
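For instance, a sketch that wraps a per-gene coverage track (where e.g. the length dimension contains only a single element):

from janggu.data import ReduceDim

cover = Cover.create_from_bam('genes',
                              bamfiles=samplefile,
                              roi=geneannot,
                              resolution=None)

# drops the single-element dimensions of the 4D Cover
labels = ReduceDim(cover)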

Different views on a dataset

Suppose you have already loaded the DNA sequence from a reference genome and you want to use different parts of it for training and for validating model performance. This is achieved by the view mechanism, which allows the same dataset to be reused by instantiating views that read out different subsets.

For example, views constituting the training and test set, respectively, can be created as follows:

from pkg_resources import resource_filename

from janggu.data import Bioseq
from janggu.data import view

# union ROI for training and test set.
ROI_FILE = resource_filename('janggu', 'resources/roi.bed')
ROI_TRAIN_FILE = resource_filename('janggu', 'resources/roi_train.bed')
ROI_TEST_FILE = resource_filename('janggu', 'resources/roi_test.bed')

DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                   roi=ROI_FILE,
                                   binsize=200,
                                   store_whole_genome=True)

DNA_TRAIN = view(DNA, ROI_TRAIN_FILE)
DNA_TEST = view(DNA, ROI_TEST_FILE)

Since the actual dataset is only referenced rather than copied underneath, the memory footprint does not increase. The views merely read out different parts of the genome.

An example is illustrated in the jupyter notebook.

Randomized dataset

In order to achieve good predictive performance, it is recommended to randomize the mini-batches during model fitting. This is usually achieved by specifying shuffle=True in the fit method.

However, when using an HDF5 dataset, this approach may be prohibitively slow due to the limitation that data from HDF5 files needs to be accessed in chunks rather than in a random-access fashion.

In order to overcome this issue, it is possible to randomize the dataset already at loading time, such that the data can later be consumed in coherent chunks with shuffle=False.

For example, randomization is induced by specifying an integer-valued random_state as in the example below:

DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                   roi=ROI_TRAIN_FILE,
                                   binsize=200,
                                   storage='hdf5',
                                   cache=True,
                                   store_whole_genome=False,
                                   random_state=43)

For this option to be effective and correct, all datasets consumed during e.g. training need to be provided with the same random_state value. Furthermore, the HDF5 file needs to be stored with store_whole_genome=False, since data storage is not affected by the random_state when the entire genome is stored. An example is illustrated in the jupyter notebook.
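For instance, a label dataset loaded alongside the DNA dataset above would need to use the same random_state (a sketch; LABELS_FILE is a hypothetical BED file with labels):

LABELS = Cover.create_from_bed('labels', bedfiles=LABELS_FILE,
                               roi=ROI_TRAIN_FILE,
                               binsize=200,
                               storage='hdf5',
                               cache=True,
                               store_whole_genome=False,
                               random_state=43)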

Output directory configuration

Optionally, janggu produces various kinds of output files, including cache files for the datasets, log files for monitoring the training / evaluation procedure, stored model parameters or summary output files about the evaluation performance.

The root directory specifying the janggu output location can be configured by setting the environment variable JANGGU_OUTPUT. This might be done in the following ways:

Setting the directory globally:

export JANGGU_OUTPUT='/output/dir'

on startup of the script:

JANGGU_OUTPUT='/output/dir' python classify.py

or inside your model script using

import os
os.environ['JANGGU_OUTPUT']='/output/dir'

If JANGGU_OUTPUT is not set, the root directory defaults to /home/user/janggu_results.