janggu.data - Genomics datasets for deep learning

Bioseq.create_from_seq(name, fastafile[, …]) Create a Bioseq class from biological sequences.
Bioseq.create_from_refgenome(name, refgenome) Create a Bioseq class from a reference genome.
Cover.create_from_bam(name, bamfiles[, roi, …]) Create a Cover class from a bam-file (or files).
Cover.create_from_bigwig(name, bigwigfiles) Create a Cover class from a bigwig-file (or files).
Cover.create_from_bed(name, bedfiles[, roi, …]) Create a Cover class from a bed-file (or files).
Cover.create_from_array(name, array, gindexer) Create a Cover class from a numpy.array.
plotGenomeTrack(tracks, chrom, start, end[, …]) plotGenomeTrack shows plots of a specific interval from cover objects data.
Track(data, height) General track
HeatTrack(data[, height]) Heatmap Track
LineTrack(data[, height, linestyle, marker, …]) Line track
SeqTrack(data[, height]) Sequence Track

Main Dataset classes

Dataset(name) Dataset interface.
Cover(name, garray, gindexer) Cover class.
Bioseq(name, garray, gindexer, alphabet) Bioseq class.
Array(name, array[, conditions]) Array class.
GenomicIndexer(binsize, stepsize[, flank, …]) GenomicIndexer maps a set of integer indices to respective genomic intervals.
class janggu.data.Dataset(name)[source]

Dataset interface.

All dataset classes in janggu inherit from the Dataset class which mimics a numpy array and can be used directly with keras.

Parameters:

name (str) – Name of the dataset

Variables:
  • name (str) – Name of the dataset
  • shape (tuple) – numpy-style shape of the dataset
name

Dataset name

shape

Shape of the dataset
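As a rough illustration of what "mimics a numpy array" can mean, the sketch below defines a hypothetical `ListDataset` class that exposes `name`, `shape`, `len()` and indexing. It uses only the standard library and is not part of janggu; the real Dataset subclasses may expose more than this.

```python
class ListDataset:
    """Hypothetical stand-in for a Dataset: a named, indexable 2D table."""

    def __init__(self, name, rows):
        self.name = name
        self._rows = [list(row) for row in rows]

    @property
    def shape(self):
        # numpy-style shape: (number of rows, number of columns)
        n_cols = len(self._rows[0]) if self._rows else 0
        return (len(self._rows), n_cols)

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        # supports integer and slice indexing, like a list
        return self._rows[idx]


dataset = ListDataset("toy", [[1, 2, 3], [4, 5, 6]])
print(dataset.name, dataset.shape, dataset[1])
```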

class janggu.data.Cover(name, garray, gindexer)[source]

Cover class.

This data structure holds coverage information across the genome. The coverage can be conveniently fetched from a list of bam-files, bigwig-files, bed-files or gff-files.

Parameters:
  • name (str) – Name of the dataset
  • garray (GenomicArray) – A genomic array that holds the coverage data
  • gindexer (GenomicIndexer or None) – A genomic indexer translates an integer index to a corresponding genomic coordinate. It can be None if the genomic indexer is supplied later.
classmethod create_from_array(name, array, gindexer, genomesize=None, conditions=None, resolution=None, storage='ndarray', overwrite=False, cache=False, datatags=None, padding_value=0.0, store_whole_genome=False, verbose=False)[source]

Create a Cover class from a numpy.array.

The purpose of this function is to convert output predictions from keras, which are in numpy.array format, into a Cover object.

Parameters:
  • name (str) – Name of the dataset
  • array (numpy.array) – A 4D numpy array that will be re-interpreted as genomic array.
  • gindexer (GenomicIndexer) – Genomic indices associated with the values contained in array.
  • genomesize (dict or None) – Dictionary containing the genome size to fetch the coverage from. If genomesize=None, the genome size is automatically determined from the GenomicIndexer. If store_whole_genome=False this option does not have an effect.
  • conditions (list(str) or None) – List of conditions. If conditions=None, the conditions are obtained from the filenames (without the directories and file-ending).
  • storage (str) – Storage mode for storing the coverage data can be ‘ndarray’, ‘hdf5’ or ‘sparse’. Default: ‘ndarray’.
  • overwrite (boolean) – Overwrite cachefiles. Default: False.
  • datatags (list(str) or None) – List of datatags. Together with the dataset name, the datatags are used to construct a cache file. If cache=False, this option does not have an effect. Default: None.
  • store_whole_genome (boolean) – Indicates whether the whole genome or only ROI should be loaded. Default: False.
  • padding_value (float) – Padding value. Default: 0.
  • verbose (boolean) – Verbosity. Default: False
classmethod create_from_bam(name, bamfiles, roi=None, genomesize=None, conditions=None, min_mapq=None, binsize=None, stepsize=None, flank=0, resolution=1, storage='ndarray', dtype='float32', stranded=True, overwrite=False, pairedend='5prime', template_extension=0, datatags=None, cache=False, normalizer=None, zero_padding=True, random_state=None, store_whole_genome=False, verbose=False)[source]

Create a Cover class from a bam-file (or files).

This constructor can be used to obtain coverage from BAM files. For single-end reads the read will be counted at the 5 prime end. Paired-end reads can be counted relative to the 5 prime ends of the read (default) or with respect to the midpoint.

Parameters:
  • name (str) – Name of the dataset
  • bamfiles (str or list) – bam-file or list of bam files.
  • roi (str, list(Interval), BedTool, pandas.DataFrame or None) – Region of interest over which to iterate. If set to None, the coverage will be fetched from the entire genome and a genomic indexer must be attached later.
  • genomesize (dict or None) – Dictionary containing the genome size. If genomesize=None, the genome size is determined from the bam header. If store_whole_genome=False, this option does not have an effect.
  • conditions (list(str) or None) – List of conditions. If conditions=None, the conditions are obtained from the filenames (without the directories and file-ending).
  • min_mapq (int) – Minimal mapping quality. Reads with lower mapping quality are filtered out. If None, all reads are used.
  • binsize (int or None) – Binsize in basepairs. For binsize=None, the binsize will be determined from the bed-file. If resolution is of type integer, this requires that all intervals in the bed-file are of equal length. If resolution is None, the intervals in the bed-file may be of variable size. Default: None.
  • stepsize (int or None) – stepsize in basepairs for traversing the genome. If stepsize is None, it will be set equal to binsize. Default: None.
  • flank (int) – Flanking size increases the interval size at both ends by flank base pairs. Default: 0
  • resolution (int or None) – If resolution is an integer, it determines the base-pair resolution by which an interval should be divided. This requires equally sized bins or zero padding and effectively reduces the storage for coverage data. If resolution=None, the intervals will be represented by a collapsed summary score. For example, gene expression may be expressed by TPM in that manner. In the latter case, variable size intervals are permitted and zero padding does not have an effect. Default: 1.
  • storage (str) – Storage mode for storing the coverage data can be ‘ndarray’, ‘hdf5’ or ‘sparse’. Default: ‘ndarray’.
  • dtype (str) – Typecode to be used for storing the data. Default: ‘float32’.
  • stranded (boolean) – Indicates whether to extract stranded or unstranded coverage. For unstranded coverage, reads aligning to both strands will be aggregated.
  • overwrite (boolean) – Overwrite cachefiles. Default: False.
  • datatags (list(str) or None) – List of datatags. Together with the dataset name, the datatags are used to construct a cache file. If cache=False, this option does not have an effect. Default: None.
  • pairedend (str) – Indicates whether to count reads at the ‘5prime’ end or at the ‘midpoint’ for paired-end reads. Default: ‘5prime’.
  • template_extension (int) – Elongates intervals by template_extension which allows to properly count template mid-points whose reads lie outside of the interval. This option is only relevant for paired-end reads counted at the ‘midpoint’ and if the coverage is not obtained from the whole genome, e.g. roi is not None.
  • cache (boolean) – Indicates whether to cache the dataset. Default: False.
  • zero_padding (boolean) – Indicates if variable size intervals should be zero padded. Zero padding is only supported with a specified binsize. If zero padding is false, intervals shorter than binsize will be skipped. Default: True.
  • normalizer (None, str or callable) – This option specifies the normalization that can be applied. If None, no normalization is applied. If ‘zscore’, ‘zscorelog’, ‘rpkm’ then zscore transformation, zscore transformation on log transformed data and rpkm normalization are performed, respectively. If callable, a function with signature norm(garray) should be provided that performs the normalization on the genomic array. Normalization is ignored when using storage=’sparse’. Default: None.
  • random_state (None or int) – random_state used to internally randomize the dataset. This option is best used when consuming data for training from an HDF5 file. Since random data access from HDF5 may be prohibitively slow, this option allows the dataset to be randomized during loading. In case an integer-valued random_state seed is supplied, make sure that all training datasets (e.g. input and output datasets) use the same random_state value so that the datasets are synchronized. Default: None means that no randomization is used.
  • store_whole_genome (boolean) – Indicates whether the whole genome or only ROI should be loaded. If False, a bed-file with regions of interest must be specified. Default: False
  • verbose (boolean) – Verbosity. Default: False
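The difference between ‘5prime’ and ‘midpoint’ counting can be sketched with plain Python. The helper names below are hypothetical and the coordinate convention (half-open intervals, the 5 prime end of a reverse-strand read being its rightmost base) is an assumption for illustration, not a description of janggu's internals.

```python
def five_prime_position(start, end, strand):
    """Return the 5' end of a read aligned to the half-open interval [start, end)."""
    return start if strand == "+" else end - 1


def template_midpoint(template_start, template_end):
    """Return the midpoint of a paired-end template spanning [template_start, template_end)."""
    return (template_start + template_end) // 2


def count_events(positions, interval_start, interval_end):
    """Count events per base pair inside [interval_start, interval_end)."""
    counts = [0] * (interval_end - interval_start)
    for pos in positions:
        if interval_start <= pos < interval_end:
            counts[pos - interval_start] += 1
    return counts


# three toy reads given as (start, end, strand)
reads = [(100, 150, "+"), (100, 150, "-"), (120, 170, "+")]
positions = [five_prime_position(*read) for read in reads]
coverage = count_events(positions, 100, 150)
print(positions, sum(coverage))
```

A template whose midpoint lies inside the interval can have both reads outside of it, which is the situation template_extension is documented to address.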
classmethod create_from_bed(name, bedfiles, roi=None, genomesize=None, conditions=None, binsize=None, stepsize=None, resolution=1, flank=0, storage='ndarray', dtype='float32', mode='binary', store_whole_genome=False, overwrite=False, zero_padding=True, normalizer=None, collapser=None, minoverlap=None, random_state=None, datatags=None, cache=False, verbose=False)[source]

Create a Cover class from a bed-file (or files).

Parameters:
  • name (str) – Name of the dataset
  • bedfiles (str or list) – bed-file or list of bed files.
  • roi (str, list(Interval), BedTool, pandas.DataFrame or None) – Region of interest over which to iterate. If set to None a genomesize must be supplied and a genomic indexer must be attached later.
  • genomesize (dict or None) – Dictionary containing the genome size to fetch the coverage from. If genomesize=None, the genome size is fetched from the region of interest.
  • conditions (list(str) or None) – List of conditions. If conditions=None, the conditions are obtained from the filenames (without the directories and file-ending).
  • binsize (int or None) – Binsize in basepairs. For binsize=None, the binsize will be determined from the bed-file. If resolution is of type integer, this requires that all intervals in the bed-file are of equal length. If resolution is None, the intervals in the bed-file may be of variable size. Default: None.
  • stepsize (int or None) – stepsize in basepairs for traversing the genome. If stepsize is None, it will be set equal to binsize. Default: None.
  • resolution (int or None) – If resolution is an integer, it determines the base-pair resolution by which an interval should be divided. This requires equally sized bins or zero padding and effectively reduces the storage for coverage data. If resolution=None, the intervals will be represented by a collapsed summary score. For example, gene expression may be expressed by TPM in that manner. In the latter case, variable size intervals are permitted and zero padding does not have an effect. Default: 1.
  • flank (int) – Flanking size increases the interval size at both ends by flank bins. Note that the binsize is defined by the resolution parameter. Default: 0.
  • storage (str) – Storage mode for storing the coverage data can be ‘ndarray’, ‘hdf5’ or ‘sparse’. Default: ‘ndarray’.
  • dtype (str) – Typecode to define the datatype to be used for storage. Default: ‘float32’.
  • mode (str) – Determines how the BED-like file should be interpreted, e.g. as class labels or scores. ‘binary’ is used for presence/absence representation of features for a binary classification setting. Regions in the bedfiles that intersect the ROI are considered positive examples, while the remaining ROI intervals are negative examples. ‘score’ allows to use the score-value associated with the intervals (e.g. for regression). ‘score_category’ (formerly ‘categorical’) allows to interpret the integer-valued score as class-label for categorical labels. The labels will be one-hot encoded. ‘name_category’ allows to interpret the name field as class-label for categorical labels. The labels will be one-hot encoded. ‘bedgraph’ indicates that the input file is in bedgraph format and reads out the associated score for each interval. Mode of the dataset may be ‘binary’, ‘score’, ‘score_category’ (or ‘categorical’), ‘name_category’ or ‘bedgraph’. Default: ‘binary’.
  • overwrite (boolean) – Overwrite cachefiles. Default: False.
  • datatags (list(str) or None) – List of datatags. Together with the dataset name, the datatags are used to construct a cache file. If cache=False, this option does not have an effect. Default: None.
  • store_whole_genome (boolean) – Indicates whether the whole genome or only ROI should be loaded. If False, a bed-file with regions of interest must be specified. Default: False.
  • zero_padding (boolean) – Indicates if variable size intervals should be zero padded. Zero padding is only supported with a specified binsize. If zero padding is false, intervals shorter than binsize will be skipped. Default: True.
  • normalizer (None, str or callable) – This option specifies the normalization that can be applied. If None, no normalization is applied. If ‘zscore’, ‘zscorelog’, ‘tpm’ then zscore transformation, zscore transformation on log transformed data and rpkm normalization are performed, respectively. If callable, a function with signature norm(garray) should be provided that performs the normalization on the genomic array. Normalization is ignored when using storage=’sparse’. Default: None.
  • collapser (None, str or callable) – This option defines how the genomic signal should be summarized when resolution is None or greater than one. It is possible to choose a number of options by name, including ‘sum’, ‘mean’, ‘max’. In addition, a function may be supplied that defines a custom aggregation method. If collapser is None, ‘max’ aggregation will be used. Default: None.
  • minoverlap (float or None) – Minimum fraction of overlap of a given feature with a ROI bin. If None, any overlap (e.g. a single base-pair overlap) is considered as overlap. Default: None
  • cache (boolean) – Indicates whether to cache the dataset. Default: False.
  • random_state (None or int) – random_state used to internally randomize the dataset. This option is best used when consuming data for training from an HDF5 file. Since random data access from HDF5 may be prohibitively slow, this option allows the dataset to be randomized during loading. In case an integer-valued random_state seed is supplied, make sure that all training datasets (e.g. input and output datasets) use the same random_state value so that the datasets are synchronized. Default: None means that no randomization is used.
  • verbose (boolean) – Verbosity. Default: False
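The interplay of mode=‘binary’ and minoverlap can be illustrated with a small standard-library sketch. The functions `overlap_fraction` and `binary_labels` are hypothetical, and the sketch assumes that the overlap fraction is measured relative to the ROI bin length; janggu may define or compute it differently.

```python
def overlap_fraction(bin_start, bin_end, feat_start, feat_end):
    """Fraction of the bin [bin_start, bin_end) covered by the feature."""
    overlap = min(bin_end, feat_end) - max(bin_start, feat_start)
    return max(0, overlap) / (bin_end - bin_start)


def binary_labels(bins, features, minoverlap=None):
    """Label each bin 1 if some feature overlaps it sufficiently, else 0."""
    labels = []
    for b_start, b_end in bins:
        best = max(
            (overlap_fraction(b_start, b_end, f_start, f_end)
             for f_start, f_end in features),
            default=0.0,
        )
        if minoverlap is None:
            # any overlap, even a single base pair, counts as positive
            labels.append(1 if best > 0 else 0)
        else:
            labels.append(1 if best >= minoverlap else 0)
    return labels


bins = [(0, 100), (100, 200), (200, 300)]   # ROI bins
features = [(90, 160)]                      # one feature from a bed file
print(binary_labels(bins, features))
print(binary_labels(bins, features, minoverlap=0.5))
```

With minoverlap=None the first bin is positive because of a 10 bp overlap; requiring a fraction of 0.5 leaves only the second bin positive.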
classmethod create_from_bigwig(name, bigwigfiles, roi=None, genomesize=None, conditions=None, binsize=None, stepsize=None, resolution=1, flank=0, storage='ndarray', dtype='float32', overwrite=False, datatags=None, cache=False, store_whole_genome=False, zero_padding=True, normalizer=None, collapser=None, random_state=None, nan_to_num=True, verbose=False)[source]

Create a Cover class from a bigwig-file (or files).

Parameters:
  • name (str) – Name of the dataset
  • bigwigfiles (str or list) – bigwig-file or list of bigwig files.
  • roi (str, list(Interval), BedTool, pandas.DataFrame or None) – Region of interest over which to iterate. If set to None, the coverage will be fetched from the entire genome and a genomic indexer must be attached later. Otherwise, the coverage is only determined for the region of interest.
  • genomesize (dict or None) – Dictionary containing the genome size. If genomesize=None, the genome size is determined from the bigwig file. If store_whole_genome=False, this option does not have an effect.
  • conditions (list(str) or None) – List of conditions. If conditions=None, the conditions are obtained from the filenames (without the directories and file-ending).
  • binsize (int or None) – Binsize in basepairs. For binsize=None, the binsize will be determined from the bed-file. If resolution is of type integer, this requires that all intervals in the bed-file are of equal length. If resolution is None, the intervals in the bed-file may be of variable size. Default: None.
  • stepsize (int or None) – stepsize in basepairs for traversing the genome. If stepsize is None, it will be set equal to binsize. Default: None.
  • resolution (int or None) – If resolution is an integer, it determines the base-pair resolution by which an interval should be divided. This requires equally sized bins or zero padding and effectively reduces the storage for coverage data. If resolution=None, the intervals will be represented by a collapsed summary score. For example, gene expression may be expressed by TPM in that manner. In the latter case, variable size intervals are permitted and zero padding does not have an effect. Default: 1.
  • flank (int) – Flanking size increases the interval size at both ends by flank bins. Note that the binsize is defined by the resolution parameter. Default: 0.
  • storage (str) – Storage mode for storing the coverage data can be ‘ndarray’, ‘hdf5’ or ‘sparse’. Default: ‘ndarray’.
  • dtype (str) – Typecode to define the datatype to be used for storage. Default: ‘float32’.
  • overwrite (boolean) – Overwrite cachefiles. Default: False.
  • datatags (list(str) or None) – List of datatags. Together with the dataset name, the datatags are used to construct a cache file. If cache=False, this option does not have an effect. Default: None.
  • cache (boolean) – Indicates whether to cache the dataset. Default: False.
  • store_whole_genome (boolean) – Indicates whether the whole genome or only ROI should be loaded. If False, a bed-file with regions of interest must be specified. Default: False.
  • zero_padding (boolean) – Indicates if variable size intervals should be zero padded. Zero padding is only supported with a specified binsize. If zero padding is false, intervals shorter than binsize will be skipped. Default: True.
  • normalizer (None, str or callable) – This option specifies the normalization that can be applied. If None, no normalization is applied. If ‘zscore’, ‘zscorelog’, ‘rpkm’ then zscore transformation, zscore transformation on log transformed data and rpkm normalization are performed, respectively. If callable, a function with signature norm(garray) should be provided that performs the normalization on the genomic array. Normalization is ignored when using storage=’sparse’. Default: None.
  • collapser (None, str or callable) – This option defines how the genomic signal should be summarized when resolution is None or greater than one. It is possible to choose a number of options by name, including ‘sum’, ‘mean’, ‘max’. In addition, a function may be supplied that defines a custom aggregation method. If collapser is None, ‘mean’ aggregation will be used. Default: None.
  • nan_to_num (boolean) – Indicates whether NaN values contained in the bigwig files should be interpreted as zeros. Default: True
  • random_state (None or int) – random_state used to internally randomize the dataset. This option is best used when consuming data for training from an HDF5 file. Since random data access from HDF5 may be prohibitively slow, this option allows the dataset to be randomized during loading. In case an integer-valued random_state seed is supplied, make sure that all training datasets (e.g. input and output datasets) use the same random_state value so that the datasets are synchronized. Default: None means that no randomization is used.
  • verbose (boolean) – Verbosity. Default: False
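How resolution and collapser relate can be sketched as follows. The function `collapse` is hypothetical and operates on a plain list holding a per-base signal; it only illustrates the idea of summarizing consecutive windows and is not janggu's implementation.

```python
def collapse(signal, resolution, collapser="mean"):
    """Summarize a per-base signal in windows of `resolution` base pairs.

    With resolution=None the whole interval becomes one summary value.
    """
    aggregators = {
        "sum": sum,
        "max": max,
        "mean": lambda values: sum(values) / len(values),
    }
    aggregate = aggregators[collapser]
    if resolution is None:
        return [aggregate(signal)]
    return [
        aggregate(signal[i:i + resolution])
        for i in range(0, len(signal), resolution)
    ]


signal = [0, 2, 4, 6, 8, 10]
print(collapse(signal, 2))            # mean over windows of 2 bp
print(collapse(signal, 3, "max"))     # max over windows of 3 bp
print(collapse(signal, None, "sum"))  # one value for the whole interval
```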
class janggu.data.Bioseq(name, garray, gindexer, alphabet)[source]

Bioseq class.

This class maintains a set of biological sequences, e.g. nucleotide or amino acid sequences, and determines its one-hot encoding.

Parameters:
  • name (str) – Name of the dataset
  • garray (GenomicArray) – A genomic array that holds the sequence data.
  • gindexer (GenomicIndexer or None) – A genomic index mapper that translates an integer index to a genomic coordinate. Can be None, if the Dataset is only loaded.
  • alphabet (str) – String of sequence alphabet. For example, ‘ACGT’.
classmethod create_from_refgenome(name, refgenome, roi=None, binsize=None, stepsize=None, flank=0, order=1, storage='ndarray', datatags=None, cache=False, overwrite=False, random_state=None, store_whole_genome=False, verbose=False)[source]

Create a Bioseq class from a reference genome.

This constructor loads nucleotide sequences from a reference genome. If a region of interest (ROI) is supplied, only the respective sequences are loaded; otherwise the entire genome is fetched.

Parameters:
  • name (str) – Name of the dataset
  • refgenome (str or Bio.SeqRecord.SeqRecord) – Reference genome location pointing to a fasta file or a SeqRecord object from Biopython that contains the sequences.
  • roi (str, list(Interval), BedTool, pandas.DataFrame or None) – Region of interest over which to iterate. If set to None, the sequence will be fetched from the entire genome and a genomic indexer must be attached later. Otherwise, the sequence is only fetched for the region of interest.
  • binsize (int or None) – Binsize in basepairs. For binsize=None, the binsize will be determined from the bed-file directly which requires that all intervals in the bed-file are of equal length. Otherwise, the intervals in the bed-file will be split to subintervals of length binsize in conjunction with stepsize. Default: None.
  • stepsize (int or None) – stepsize in basepairs for traversing the genome. If stepsize is None, it will be set equal to binsize. Default: None.
  • flank (int) – Flanking region in basepairs to be extended up and downstream of each interval. Default: 0.
  • order (int) – Order for the one-hot representation. Default: 1.
  • storage (str) – Storage mode for storing the sequence may be ‘ndarray’ or ‘hdf5’. Default: ‘ndarray’.
  • datatags (list(str) or None) – List of datatags. Together with the dataset name, the datatags are used to construct a cache file. If cache=False, this option does not have an effect. Default: None.
  • cache (boolean) – Indicates whether to cache the dataset. Default: False.
  • overwrite (boolean) – Overwrite the cachefiles. Default: False.
  • store_whole_genome (boolean) – Indicates whether the whole genome or only ROI should be loaded. If False, a bed-file with regions of interest must be specified. Default: False.
  • random_state (None or int) – random_state used to internally randomize the dataset. This option is best used when consuming data for training from an HDF5 file. Since random data access from HDF5 may be prohibitively slow, this option allows the dataset to be randomized during loading. In case an integer-valued random_state seed is supplied, make sure that all training datasets (e.g. input and output datasets) use the same random_state value so that the datasets are synchronized. Default: None means that no randomization is used.
  • verbose (boolean) – Verbosity. Default: False
classmethod create_from_seq(name, fastafile, storage='ndarray', seqtype='dna', order=1, fixedlen=None, datatags=None, cache=False, overwrite=False, verbose=False)[source]

Create a Bioseq class from biological sequences.

This constructor loads a set of nucleotide or amino acid sequences. By default, the sequences are assumed to be of equal length. Alternatively, sequences can be truncated and padded to a fixed length.

Parameters:
  • name (str) – Name of the dataset
  • fastafile (str or list(str) or list(Bio.SeqRecord)) – Fasta file or list of fasta files from which the sequences are loaded or a list of Bio.SeqRecord.SeqRecord.
  • seqtype (str) – Indicates whether a nucleotide or peptide sequence is loaded using ‘dna’ or ‘protein’ respectively. Default: ‘dna’.
  • order (int) – Order for the one-hot representation. Default: 1.
  • fixedlen (int or None) – Forces the sequences to be of equal length by truncation or zero-padding. If set to None, it will be assumed that the sequences are already of equal length. An exception is raised if this is not the case. Default: None.
  • storage (str) – Storage mode for storing the sequence may be ‘ndarray’ or ‘hdf5’. Default: ‘ndarray’.
  • datatags (list(str) or None) – List of datatags. Together with the dataset name, the datatags are used to construct a cache file. If cache=False, this option does not have an effect. Default: None.
  • cache (boolean) – Indicates whether to cache the dataset. Default: False.
  • overwrite (boolean) – Overwrite the cachefiles. Default: False.
  • verbose (boolean) – Verbosity. Default: False
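The effect of fixedlen together with a first-order one-hot encoding can be sketched with the standard library alone. The helpers `fix_length` and `one_hot` are hypothetical; right-padding with the placeholder letter "N" is an assumption chosen so that padded positions become all-zero rows, and janggu may pad differently.

```python
def fix_length(seq, fixedlen, pad="N"):
    """Truncate or right-pad `seq` to exactly `fixedlen` characters."""
    return seq[:fixedlen].ljust(fixedlen, pad)


def one_hot(seq, alphabet="ACGT"):
    """One row per position; letters outside the alphabet give an all-zero row."""
    return [
        [1 if base == letter else 0 for letter in alphabet]
        for base in seq.upper()
    ]


sequences = ["ACGTAC", "GT"]
encoded = [one_hot(fix_length(seq, 4)) for seq in sequences]
for rows in encoded:
    print(rows)
```

The first sequence is truncated to "ACGT"; the second is padded to "GTNN", so its last two rows contain only zeros.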
class janggu.data.Array(name, array, conditions=None)[source]

Array class.

This datastructure wraps arbitrary numpy.arrays for a deep learning application with Janggu. The main difference to an ordinary numpy.array is that Array has a name attribute.

Parameters:
  • name (str) – Name of the dataset
  • array (numpy.array) – Numpy array.
  • conditions (list(str) or None) – Conditions or label names of the dataset.
class janggu.data.GenomicIndexer(binsize, stepsize, flank=0, zero_padding=True, collapse=False, random_state=None)[source]

GenomicIndexer maps a set of integer indices to respective genomic intervals.

The genomic intervals can be directly used to obtain data from a genomic array.

classmethod create_from_file(regions, binsize, stepsize, flank=0, zero_padding=True, collapse=False, random_state=None)[source]

Creates a GenomicIndexer object.

This method constructs a GenomicIndexer from a given BED or GFF file.

Parameters:
  • regions (str or list(Interval)) – Path to a BED or GFF file.
  • binsize (int or None) – Binsize in base pairs. If None, the binsize is obtained from the interval lengths in the bed file, which requires intervals to be of equal length.
  • stepsize (int or None) – Stepsize in base pairs. If stepsize is None, it is set equal to binsize.
  • flank (int) – flank size in bp to be attached to both ends of a region. Default: 0.
  • zero_padding (boolean) – zero_padding indicates whether variable sequence lengths are used in conjunction with zero-padding. If zero_padding is True, a binsize must be specified. Default: True.
  • collapse (boolean) – collapse indicates that the genomic interval will be represented by a scalar summary value. For example, the gene expression value in TPM. In this case, zero_padding does not have an effect. Intervals may be of fixed or variable lengths. Default: False.
  • random_state (None or int) – random_state for shuffling intervals. Default: None
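The mapping from integer indices to genomic intervals can be illustrated by tiling a single region. The function `tile_region` is hypothetical; it drops a trailing partial bin, whereas janggu's handling of such bins depends on zero_padding, so treat the boundary behaviour as an assumption.

```python
def tile_region(chrom, start, end, binsize, stepsize=None, flank=0):
    """Tile [start, end) into bins of `binsize`, advancing by `stepsize`."""
    if stepsize is None:
        stepsize = binsize  # mirrors the documented default
    intervals = []
    pos = start
    while pos + binsize <= end:
        # flank extends each bin by `flank` bp on both sides
        intervals.append((chrom, pos - flank, pos + binsize + flank))
        pos += stepsize
    return intervals


index = tile_region("chr1", 1000, 2000, binsize=200, stepsize=100, flank=50)
print(len(index))   # number of valid integer indices
print(index[0])     # interval for integer index 0
print(index[3])     # interval for integer index 3
```

Because stepsize is smaller than binsize here, consecutive intervals overlap by 100 bp before flanking is added.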

Dataset wrappers

Utilities for reshaping, data augmentation, NaN removal.

ReduceDim(array[, aggregator, axis]) ReduceDim class.
SqueezeDim(array[, axis]) SqueezeDim class.
Transpose(array, axis) Transpose class.
NanToNumConverter(array) NanToNumConverter class.
RandomOrientation(array) RandomOrientation class.
RandomSignalScale(array, deviance) RandomSignalScale class.
class janggu.data.ReduceDim(array, aggregator=None, axis=None)[source]

ReduceDim class.

This class wraps a 4D coverage object and reduces the middle two dimensions by applying the aggregate function. It thereby transforms the 4D object into a table-like 2D representation.

Example

# given some dataset, e.g. a Cover object
# originally, the cover object is a 4D-object.
cover.shape
cover = ReduceDim(cover, aggregator='mean')
cover.shape
# Afterwards, the cover object is 2D, where the second and
# third dimension have been averaged out.
Parameters:
  • array (Dataset) – Dataset
  • aggregator (str or callable) – Aggregator used for reducing the intermediate dimensions. Available aggregators are ‘sum’, ‘mean’, ‘max’ for performing summation, averaging or obtaining the maximum value. It is also possible to supply a callable directly that performs the operation. Default: ‘sum’
  • axis (None or tuple(ints)) – Dimensions over which to perform aggregation. Default: None aggregates with axis=(1, 2)
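The reduction over axis=(1, 2) can be sketched on a nested list standing in for a 4D array. The function `reduce_middle` is hypothetical, and the interpretation of the four axes as (regions, positions, strands, conditions) is an assumption made for the example.

```python
def reduce_middle(data, aggregator=sum):
    """Collapse axes 1 and 2 of a nested 4D list shaped (N, L, S, C) into (N, C)."""
    reduced = []
    for sample in data:                        # axis 0
        n_channels = len(sample[0][0])
        row = []
        for channel in range(n_channels):      # axis 3
            values = [
                strand[channel]
                for position in sample         # axis 1
                for strand in position         # axis 2
            ]
            row.append(aggregator(values))
        reduced.append(row)
    return reduced


# nested list with shape (1, 3, 2, 2)
data = [[[[1, 10], [2, 20]], [[3, 30], [4, 40]], [[5, 50], [6, 60]]]]
print(reduce_middle(data))        # summation
print(reduce_middle(data, max))   # maximum
```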
class janggu.data.SqueezeDim(array, axis=None)[source]

SqueezeDim class.

This class wraps a 4D coverage object and reduces the middle two dimensions by applying the aggregate function. It thereby transforms the 4D object into a table-like 2D representation.

Parameters:
  • array (Dataset) – Dataset
  • axis (None or tuple(ints)) – Dimensions over which to perform aggregation. Default: None aggregates with axis=(1, 2)
class janggu.data.Transpose(array, axis)[source]

Transpose class.

This class can be used to shuffle the dimensions. For example, if the channel is expected to be at a specific location.

Parameters:
  • array (Dataset) – Dataset
  • axis (tuple(ints)) – Order to the dimensions.
class janggu.data.NanToNumConverter(array)[source]

NanToNumConverter class.

This wrapper dataset converts NaNs in the dataset to zeros.

Example

# given some dataset, e.g. a Cover object
cover
cover = NanToNumConverter(cover)

# now all remaining NaNs will be converted to zeros.
Parameters: array (Dataset) – Dataset
class janggu.data.RandomOrientation(array)[source]

RandomOrientation class.

This wrapper randomly inverts the directionality of the signal tracks. For example, a signal track is randomly presented in 5’ to 3’ or 3’ to 5’ orientation. Furthermore, if the dataset is stranded, the strand is switched as well.

Parameters: array (Dataset) – Dataset object must be 4D.
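Inverting the orientation of one region amounts to reversing the position axis and, for stranded data, swapping the two strands. The sketch below uses hypothetical helpers on a nested list shaped (positions, strands, conditions); the flip probability of 0.5 is an assumption.

```python
import random


def flip_orientation(region):
    """Reverse the position axis and swap strands of a (L, S, C) nested list."""
    return [list(reversed(position)) for position in reversed(region)]


def random_orientation(region, rng):
    """Flip the region with probability 0.5 (assumed), else return a copy."""
    if rng.random() < 0.5:
        return flip_orientation(region)
    return [list(position) for position in region]


# 3 positions, 2 strands, 1 condition
region = [[[1], [2]], [[3], [4]], [[5], [6]]]
print(flip_orientation(region))
print(random_orientation(region, random.Random(0)))
```

Applying `flip_orientation` twice restores the original region, which is a quick sanity check for the operation.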
class janggu.data.RandomSignalScale(array, deviance)[source]

RandomSignalScale class.

This wrapper performs random uniform scaling of the original input. For example, this can be used to randomly change the peak or signal heights during training.

Parameters:
  • array (Dataset) – Dataset object
  • deviance (float) – The signal is rescaled using (1 + uniform(-deviance, deviance)) x original signal.
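The rescaling rule (1 + uniform(-deviance, deviance)) x original signal can be written out directly. The function `random_signal_scale` is hypothetical; drawing a single factor for the whole signal, rather than one per value, is an assumption.

```python
import random


def random_signal_scale(signal, deviance, rng):
    """Multiply the whole signal by one factor drawn from 1 + U(-deviance, deviance)."""
    factor = 1.0 + rng.uniform(-deviance, deviance)
    return [factor * value for value in signal]


rng = random.Random(0)           # seeded for reproducibility
signal = [0.0, 2.0, 4.0]
scaled = random_signal_scale(signal, deviance=0.1, rng=rng)
print(scaled)
```

With deviance=0.1 every value stays within 10 percent of its original height, and zeros remain zeros.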
class janggu.data.RandomShift(array, shift, batchwise=False)[source]

RandomShift class.

This wrapper randomly shifts the input sequence by a random number of up to ‘shift’ bases in either direction. Meant for use with Bioseq.

This form of data-augmentation has been shown to reduce overfitting in a number of settings.

The sequence is zero-padded in order to remain the same length.

When ‘batchwise’ is set to True it will shift all the sequences retrieved by a single call to __getitem__ by the same amount (useful for computationally efficient batching).

Parameters: array (Dataset) – Dataset object must be 4D.
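Shifting with zero-padding, including the batchwise variant, can be sketched on plain lists. The helpers `shift_with_padding` and `random_shift` are hypothetical and operate on 1D lists instead of 4D one-hot arrays, so they only illustrate the idea.

```python
import random


def shift_with_padding(seq, offset, pad=0):
    """Shift `seq` by `offset` positions and pad so the length is unchanged."""
    n = len(seq)
    if offset > 0:    # shift to the right, pad at the front
        return [pad] * min(offset, n) + seq[:max(n - offset, 0)]
    if offset < 0:    # shift to the left, pad at the end
        k = min(-offset, n)
        return seq[k:] + [pad] * k
    return list(seq)


def random_shift(batch, shift, rng, batchwise=False):
    """Shift every sequence by up to `shift` positions in either direction."""
    if batchwise:
        offset = rng.randint(-shift, shift)   # one offset for the whole batch
        return [shift_with_padding(seq, offset) for seq in batch]
    return [shift_with_padding(seq, rng.randint(-shift, shift)) for seq in batch]


batch = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
print(shift_with_padding(batch[0], 2))
print(shift_with_padding(batch[0], -2))
print(random_shift(batch, shift=2, rng=random.Random(0), batchwise=True))
```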

Normalization and transformation

LogTransform() Log transformation of input signal.
PercentileTrimming(percentile) Percentile trimming normalization.
RegionLengthNormalization([regionmask]) Normalization for variable-region length.
ZScore([mean, std]) ZScore normalization.
ZScoreLog([mean, std]) ZScore normalization after log transformation.
normalize_garray_tpm(garray) This function performs TPM normalization for a given GenomicArray.
class janggu.data.LogTransform[source]

Log transformation of input signal.

This class performs log-transformation of a GenomicArray using log(x + 1.) to avoid NaNs from zeros.
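The transformation log(x + 1.) maps zero counts to zero instead of an undefined value. A minimal standard-library sketch, with the hypothetical function `log_transform` acting on a plain list rather than a GenomicArray:

```python
import math


def log_transform(values):
    """Apply log(x + 1) element-wise."""
    return [math.log(x + 1.0) for x in values]


signal = [0.0, 1.0, 9.0]
print(log_transform(signal))
```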

class janggu.data.PercentileTrimming(percentile)[source]

Percentile trimming normalization.

This class performs percentile trimming of a GenomicArray to alleviate the effect of outliers. All values that exceed the value associated with the given percentile are set equal to that value.

Parameters: percentile (float) – Percentile at which to perform chromosome-level trimming.
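Percentile trimming caps outliers at the value found at the chosen percentile. The sketch below uses a simple nearest-rank percentile on a plain list; the function name `percentile_trim` and the nearest-rank definition are assumptions, and janggu may interpolate percentiles differently.

```python
import math


def percentile_trim(values, percentile):
    """Cap all values at the nearest-rank `percentile` of the data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    cap = ordered[rank - 1]
    return [min(value, cap) for value in values]


signal = [1, 2, 3, 4, 100]
print(percentile_trim(signal, 80))
```

Here the 80th percentile is 4, so the outlier 100 is reduced to 4 while all other values are unchanged.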
class janggu.data.RegionLengthNormalization(regionmask=None)[source]

Normalization for variable-region length.

This class performs region length normalization of a GenomicArray. This is relevant when genomic features are of variable size, e.g. enhancer regions of different width or when using variable length genes.

Parameters: regionmask (str or GenomicIndexer, None) – A bed file or a genomic indexer that contains the masking region that is considered for the signal. For instance, when normalizing gene expression to TPM, the mask contains exons. Otherwise, the TPM would normalize for the full length gene annotation. If None, no mask is included.
class janggu.data.ZScore(mean=None, std=None)[source]

ZScore normalization.

This class performs ZScore normalization of a GenomicArray. It automatically adjusts for variable interval lengths.

Parameters:
  • mean (float or None) – Provided mean will be applied for zero-centering. If None, the mean will be determined from the GenomicArray and then applied. Default: None.
  • std (float or None) – Provided standard deviation will be applied for scaling. If None, the standard deviation will be determined from the GenomicArray and then applied. Default: None.
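The role of the two optional parameters can be sketched as follows. The helper zscore is hypothetical and works on plain NumPy arrays; it ignores the adjustment for variable interval lengths mentioned above. It shows statistics being estimated from one dataset when nothing is supplied and then reused on a second dataset.

```python
import numpy as np


def zscore(signal, mean=None, std=None):
    """Zero-center and scale; estimate mean/std from the data if not given."""
    signal = np.asarray(signal, dtype=float)
    if mean is None:
        mean = signal.mean()
    if std is None:
        std = signal.std()
    return (signal - mean) / std, mean, std


train = np.array([2.0, 4.0, 6.0, 8.0])
test = np.array([5.0, 11.0])

# Statistics are estimated on the training signal ...
train_z, mu, sigma = zscore(train)
# ... and reused unchanged for the test signal.
test_z, _, _ = zscore(test, mean=mu, std=sigma)
```

Reusing the training statistics keeps both datasets on the same scale, which is the usual reason for passing the values explicitly.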
class janggu.data.ZScoreLog(mean=None, std=None)[source]

ZScore normalization after log transformation.

This class performs ZScore normalization after log-transformation of a GenomicArray using log(x + 1.) to avoid non-finite values at zeros. It automatically adjusts for variable interval lengths.

Parameters:
  • mean (float or None) – Provided mean will be applied for zero-centering. If None, the mean will be determined from the GenomicArray and then applied. Default: None.
  • std (float or None) – Provided standard deviation will be applied for scaling. If None, the standard deviation will be determined from the GenomicArray and then applied. Default: None.
janggu.data.normalize_garray_tpm(garray)[source]

This function performs TPM normalization for a given GenomicArray.
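For orientation, the standard TPM (transcripts per million) calculation is sketched below on plain arrays. The helper tpm and its inputs (per-region counts and region lengths) are assumptions for the example; how normalize_garray_tpm obtains them from a GenomicArray is not shown here.

```python
import numpy as np


def tpm(counts, lengths):
    """Standard TPM: length-normalized rates rescaled to sum to one million."""
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    rate = counts / lengths           # reads per base for each region
    return rate / rate.sum() * 1e6    # rescale so that all values sum to 1e6


counts = np.array([100.0, 200.0, 300.0])
lengths = np.array([1000.0, 2000.0, 1000.0])
values = tpm(counts, lengths)
```

The first two regions have the same read density (0.1 reads per base) and therefore receive the same TPM value, even though their raw counts differ.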

Visualization utilities

janggu.data.plotGenomeTrack(tracks, chrom, start, end, figsize=(10, 5), plottypes=None)[source]

plotGenomeTrack shows plots of a specific interval from cover objects data.

It takes one or more cover objects as well as a genomic interval consisting of chromosome name, start and end and creates a genome browser-like plot.

Parameters:
  • tracks (janggu.data.Cover, list(Cover), janggu.data.Track or list(Track)) – One or more track objects.
  • chrom (str) – chromosome name.
  • start (int) – The start of the required interval.
  • end (int) – The end of the required interval.
  • figsize (tuple(int, int)) – Figure size passed on to matplotlib.
  • plottypes (None or list(str)) – Plot type for each coverage track: ‘line’, ‘heatmap’ or ‘seqplot’. By default (plottypes=None), all coverage objects are depicted as line plots. Otherwise, a list must be supplied that states the plot type for each coverage object explicitly, for example [‘line’, ‘heatmap’, ‘seqplot’]. While ‘line’ and ‘heatmap’ can be used for any type of coverage data, ‘seqplot’ is reserved for plotting the influence of the sequence on the output. It is intended to be used in conjunction with the ‘input_attribution’ method, which determines the importance of particular sequence letters for the output prediction.
Returns:

matplotlib Figure – A matplotlib figure illustrating the genome browser-view of the coverage objects for the given interval. To display or save the figure, the native matplotlib functions show() and savefig() can be used.
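The rule for the plot-type argument (one entry per track, line plots by default) can be made concrete with a small sketch. The helper resolve_plottypes is hypothetical and only mirrors the behaviour described in the parameter list; it is not part of janggu.

```python
def resolve_plottypes(n_tracks, plottypes=None):
    """Return one plot type per track, defaulting to 'line' for all tracks."""
    allowed = {'line', 'heatmap', 'seqplot'}
    if plottypes is None:
        # Default: every coverage object is drawn as a line plot.
        return ['line'] * n_tracks
    if len(plottypes) != n_tracks:
        raise ValueError('one plot type is required per track')
    unknown = set(plottypes) - allowed
    if unknown:
        raise ValueError('unknown plot types: {}'.format(sorted(unknown)))
    return list(plottypes)
```

For three tracks, resolve_plottypes(3) yields three line plots, while resolve_plottypes(3, ['line', 'heatmap', 'seqplot']) assigns a distinct type to each track.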

class janggu.data.Track(data, height)[source]

General track

Parameters:
  • data (Cover object) – Coverage object
  • height (int) – Track height.
class janggu.data.HeatTrack(data, height=3)[source]

Heatmap Track

Visualizes genomic data as heatmap.

Parameters:
  • data (Cover object) – Coverage object
  • height (int) – Track height. Default=3
class janggu.data.LineTrack(data, height=3, linestyle='-', marker='o', color='b', linewidth=2)[source]

Line track

Visualizes genomic data as line plot.

Parameters:
  • data (Cover object) – Coverage object
  • height (int) – Track height. Default=3
  • linestyle (str) – Linestyle for plot
  • marker (str) – Marker code for plot
  • color (str) – Color code for plot
  • linewidth (float) – Line width.
class janggu.data.SeqTrack(data, height=3)[source]

Sequence Track

Visualizes sequence importance.

Parameters:
  • data (Cover object) – Coverage object
  • height (int) – Track height. Default=3