janggu - Utilities for creating, fitting and evaluating models

This section describes the interface and utilities to build and evaluate deep learning applications with janggu.

Janggu model and utilities for deep learning in genomics.

Janggu(inputs, outputs[, name]) Janggu class
Janggu.create(template[, modelparams, …]) Janggu constructor method.
Janggu.create_by_name(name[, custom_objects]) Creates a Janggu object by name.
Janggu.fit([inputs, outputs, batch_size, …]) Model fitting.
Janggu.predict(inputs[, batch_size, …]) Performs a prediction.
Janggu.evaluate([inputs, outputs, …]) Evaluates the performance.
input_attribution(model, inputs[, chrom, …]) Evaluates the integrated gradients method on the input coverage tracks.

Janggu Model

class janggu.Janggu(inputs, outputs, name=None)[source]

Janggu class

The class Janggu maintains a keras.models.Model object, that is an instance of a neural network. Furthermore, to the outside, Janggu behaves similarly to keras.models.Model which allows you to create, fit, and evaluate the model.

Parameters:
  • inputs (Input or list(Input)) – Input layer or list of Inputs as defined by keras. See https://keras.io/.
  • outputs (Layer or list(Layer)) – Output layer or list of outputs. See https://keras.io/.
  • name (str) – Name of the model.

Examples

Define a Janggu object similar to keras.models.Model using Input and Output layers.

from keras.layers import Input
from keras.layers import Dense

from janggu import Janggu

# Define neural network layers using keras
in_ = Input(shape=(10,), name='ip')
layer = Dense(3)(in_)
output = Dense(1, activation='sigmoid', name='out')(layer)

# Instantiate a model.
model = Janggu(inputs=in_, outputs=output, name='test_model')
model.summary()
compile(optimizer, loss, metrics=None, loss_weights=None, sample_weight_mode=None, weighted_metrics=None, target_tensors=None)[source]

Model compilation.

This method delegates the compilation to keras.models.Model.compile. See also https://keras.io/models/model/

Examples

model.compile(optimizer='adadelta', loss='binary_crossentropy')
classmethod create(template, modelparams=None, inputs=None, outputs=None, name=None)[source]

Janggu constructor method.

This method instantiates a Janggu model from a model template and parameters. It can also automatically infer and add the correct input and output layers for the network.

Parameters:
  • template (function) – Python function that defines a model template of a neural network. The function signature must adhere to the signature template(inputs, inputs, outputs, modelparams) and is expected to return (input_tensor, output_tensor) of the neural network.
  • modelparams (list or tuple or None) – Additional model parameters that are passed along to template upon creation of the neural network. For instance, this could contain number of neurons at each layer. Default: None.
  • inputs (Dataset, list(Dataset) or None) – Input datasets from which the input layer shapes should be derived. Use this option together with the inputlayer decorator (see Example below).
  • outputs (Dataset, list(Dataset) or None) – Output datasets from which the output layer shapes should be derived. Use this option together with the outputdense or outputconv decorators (see Example below).
  • name (str or None) – Model name. If None, a unique model name is generated based on the model configuration and network architecture.

Examples

Specify a model using a model template and parameters:

def test_manual_model(inputs, inp, oup, params):
    in_ = Input(shape=(10,), name='ip')
    layer = Dense(params)(in_)
    output = Dense(1, activation='sigmoid', name='out')(layer)
    return in_, output

# Defines the same model by invoking the definition function
# and the create constructor.
model = Janggu.create(template=test_manual_model, modelparams=3)
model.summary()

Specify a model using automatic input and output layer determination. That is, only the model body needs to be specified:

import numpy as np
from janggu import Janggu
from janggu import inputlayer, outputdense
from janggu.data import Array

# Some random data which you would like to use as input for the
# model.
DATA = Array('ip', np.random.random((1000, 10)))
LABELS = Array('out', np.random.randint(2, size=(1000, 1)))

# The decorators inputlayer and outputdense
# extract the layer shapes and append the respective layers
# to the network
# so that only the model body remains to be specified.
# Note that the decorator order matters.
# inputlayer must be specified before outputdense.
@inputlayer
@outputdense(activation='sigmoid')
def test_inferred_model(inputs, inp, oup, params):
    with inputs.use('ip') as in_:
        # the with block allows for easy
        # access of a specific named input.
        output = Dense(params)(in_)
    return in_, output

# create a model.
model = Janggu.create(template=test_inferred_model, modelparams=3,
                      name='test_model',
                      inputs=DATA,
                      outputs=LABELS)
classmethod create_by_name(name, custom_objects=None)[source]

Creates a Janggu object by name.

This option is used to load a pre-trained model.

Parameters:
  • name (str) – Name of the model.
  • custom_objects (dict or None) – This allows loading of custom layers using load_model. All janggu specific layers are automatically included as custom_objects. Default: None

Examples

in_ = Input(shape=(10,), name='ip')
layer = Dense(3)(in_)
output = Dense(1, activation='sigmoid', name='out')(layer)

# Instantiate a model.
model = Janggu(inputs=in_, outputs=output, name='test_model')

# saves the model to <janggu_results>/models
model.save()

# remove the original model
del model

# reload the model
model = Janggu.create_by_name('test_model')
evaluate(inputs=None, outputs=None, batch_size=None, sample_weight=None, steps=None, datatags=None, callbacks=None, use_multiprocessing=False, workers=1)[source]

Evaluates the performance.

This method is used to evaluate a given model. All of the parameters are directly delegated to the evaluate_generator of the keras model. See https://keras.io/models/model/#methods.

Parameters:
  • inputs (Dataset, list(Dataset) or Sequence (keras.utils.Sequence)) – Input Dataset or Sequence to use for evaluating the model.
  • outputs (Dataset, list(Dataset) or None) – Output Dataset containing the training targets. If a Sequence is used for inputs, outputs will have no effect.
  • batch_size (int or None) – Batch size. If set to None a batch size of 32 is used.
  • sample_weight (np.array or None) – Sample weights. See https://keras.io.
  • steps (int or None) – Number of evaluation steps. If None, this value is determined from the dataset size and the batch_size.
  • datatags (list(str) or None) – Tags to annotate the evaluation results. Default: None.
  • callbacks (List(Scorer or str)) – Scorer instances to be applied on the predictions. Furthermore, commonly used scoring metrics can be added by name, including ‘roc’, ‘auroc’, ‘prc’, ‘auprc’ for evaluating binary classification applications and ‘cor’ (for Pearson’s correlation), ‘mae’, ‘mse’ and ‘var_explained’ for regression applications.
  • use_multiprocessing (boolean) – Whether to use multiprocessing for the prediction. Default: False.
  • workers (int) – Number of workers to use. Default: 1.

Examples

model.evaluate(DATA, LABELS)

# binary classification evaluation with callbacks
model.evaluate(DATA, LABELS, callbacks=['auprc', 'auroc'])
fit(inputs=None, outputs=None, batch_size=None, epochs=1, verbose=1, callbacks=None, validation_data=None, shuffle=True, class_weight=None, sample_weight=None, initial_epoch=0, steps_per_epoch=None, use_multiprocessing=False, workers=1)[source]

Model fitting.

This method is used to fit a given model. Most of the parameters are directly delegated to the fit_generator of the keras model.

Parameters:
  • inputs (Dataset, list(Dataset) or Sequence (keras.utils.Sequence)) – Input Dataset or Sequence to use for fitting the model.
  • outputs (Dataset, list(Dataset) or None) – Output Dataset containing the training targets. If a Sequence is used for inputs, outputs will have no effect.
  • batch_size (int or None) – Batch size. If set to None a batch size of 32 is used.
  • epochs (int) – Number of epochs. Default: 1.
  • verbose (int) – Verbosity level. See https://keras.io.
  • callbacks (List(keras.callbacks.Callback)) – Callbacks to be applied during training. See https://keras.io/callbacks
  • validation_data (tuple, Sequence or None) – Validation data can be a tuple (input_dataset, output_dataset) or (input_dataset, output_dataset, sample_weights), a keras.utils.Sequence instance, or a list of validation chromosomes. The latter choice only works when using Cover and Bioseq datasets. It allows you to train on a dedicated set of chromosomes and to validate the performance on the respective held-out chromosomes. If None, validation is not applied.
  • shuffle (boolean) – shuffle batches. Default: True.
  • class_weight (dict) – Class weights. See https://keras.io.
  • sample_weight (np.array or None) – Sample weights. See https://keras.io.
  • initial_epoch (int) – Initial epoch at which to start training.
  • steps_per_epoch (int or None) – Number of steps per epoch. If None, this value is determined from the dataset size and the batch_size.
  • use_multiprocessing (boolean) – Whether to use multiprocessing. See https://keras.io. Default: False.
  • workers (int) – Number of workers to use in multiprocessing mode. Default: 1.

Examples

model.fit(DATA, LABELS)
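
The parameter list states that steps_per_epoch is derived from the dataset size and the batch_size when it is left as None. The following is a minimal sketch of that calculation, not janggu's actual code. It assumes that a final, smaller batch is kept rather than dropped; the helper name steps_per_epoch is only used for illustration.

```python
import math


def steps_per_epoch(num_samples, batch_size=None):
    """Number of batches needed to visit every sample once."""
    if batch_size is None:
        batch_size = 32  # default batch size stated in the parameter list
    # Assumption: the last, possibly smaller batch is kept, hence the ceiling.
    return math.ceil(num_samples / batch_size)


print(steps_per_epoch(1000))        # default batch size
print(steps_per_epoch(1000, 100))   # explicit batch size
```
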
get_config()[source]

Get model config.

name

Model name

predict(inputs, batch_size=None, verbose=0, steps=None, layername=None, datatags=None, callbacks=None, use_multiprocessing=False, workers=1)[source]

Performs a prediction.

This method predicts the targets. All of the parameters are directly delegated to the predict_generator of the keras model. See https://keras.io/models/model/#methods.

Parameters:
  • inputs (Dataset, list(Dataset) or Sequence (keras.utils.Sequence)) – Input Dataset or Sequence to use for the prediction.
  • batch_size (int or None) – Batch size. If set to None a batch size of 32 is used.
  • verbose (int) – Verbosity level. See https://keras.io.
  • steps (int or None) – Number of prediction steps. If None, this value is determined from the dataset size and the batch_size.
  • layername (str or None) – Layername for which the prediction should be performed. If None, the output layer will be used automatically.
  • datatags (list(str) or None) – Tags to annotate the evaluation results. Default: None.
  • callbacks (List(Scorer)) – Scorer instances to be applied on the predictions.
  • use_multiprocessing (boolean) – Whether to use multiprocessing for the prediction. Default: False.
  • workers (int) – Number of workers to use. Default: 1.

Examples

model.predict(DATA)
predict_variant_effect(bioseq, variants, conditions, output_folder, condition_filter=None, batch_size=None, annotation=None, ignore_reference_match=False)[source]

Predicts variant effects.

Parameters:
  • bioseq (Bioseq) – Input sequence containing the reference genome.
  • variants (str) – File name of a VCF file containing the variants under study.
  • conditions (list(str)) – Condition labels for each output prediction.
  • output_folder (str) – The method produces an hdf5 and a bed file as output. The bed-file contains the variant positions while the hdf5 file contains the reference and alternative variant scores for each output feature.
  • condition_filter (str or None) – Regular expression filter on which conditions should be evaluated. If None, all output conditions will be returned.
  • batch_size (int or None) – Batch size. If None, a batch_size of 128 is used.
  • annotation (BedTool object or None) – BedTool holding feature annotation, e.g. gene annotation. The annotation may be used to perform strand-specific variant effect predictions. Each variant is intersected with the annotation in order to derive the correct strandedness. If a variant does not overlap with an annotation feature, or if no annotation is given, the forward strand is used.
  • ignore_reference_match (boolean) – Whether to ignore mismatches between the reference sequence and the reference base in the VCF file. If False, the variant will be skipped over and only matching positions are processed. Otherwise all variants will be processed. Default: False.
Returns:

tuple – Tuple containing the output filenames: an hdf5 and a bed file.

Examples

# Evaluate all variants and all conditions (outputs)
model.predict_variant_effect(DATA, VARIANTS, CONDITIONS,
                             'vcfoutput')

# Evaluate all variants and a subset of conditions (Ctcf output labels)
model.predict_variant_effect(DATA, VARIANTS, CONDITIONS,
                             'vcfoutput_subset',
                             condition_filter='Ctcf')
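
The effect of ignore_reference_match can be illustrated with a small stand-alone sketch. It does not use janggu: the reference is a plain dict that stands in for a Bioseq object, the variants are (chrom, pos, ref, alt) tuples with 1-based positions as in VCF, and filter_variants is a hypothetical helper, not part of the library.

```python
def filter_variants(reference, variants, ignore_reference_match=False):
    """Keep variants whose reference base agrees with the genome.

    reference: dict mapping chromosome name to sequence (stand-in for Bioseq).
    variants: (chrom, pos, ref, alt) tuples with 1-based positions.
    """
    kept = []
    for chrom, pos, ref, alt in variants:
        matches = reference[chrom][pos - 1].upper() == ref.upper()
        # Mismatching variants are skipped unless the check is disabled.
        if matches or ignore_reference_match:
            kept.append((chrom, pos, ref, alt))
    return kept


reference = {'chr1': 'ACGTACGT'}
variants = [('chr1', 1, 'A', 'G'),   # reference base matches the genome
            ('chr1', 2, 'T', 'C')]   # genome has 'C' here, the variant says 'T'

print(filter_variants(reference, variants))
print(filter_variants(reference, variants, ignore_reference_match=True))
```
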
save(filename=None, overwrite=True, show_shapes=True)[source]

Saves the model.

Parameters:
  • filename (str) – Filename of the stored model. Default: None.
  • overwrite (bool) – Overwrite a stored model. Default: True.
summary()[source]

Prints the model definition.

Input feature attribution

janggu.input_attribution(model, inputs, chrom=None, start=None, end=None)[source]

Evaluates the integrated gradients method on the input coverage tracks.

This makes it possible to attribute feature importance values to the prediction scores. Integrated gradients were introduced in Sundararajan, Taly and Yan, Axiomatic Attribution for Deep Networks. PMLR 70, 2017.

The method can be called either by specifying the region of interest directly via chrom, start and end, or by specifying the region index, for example the n-th region of the dataset.

Parameters:
  • model (Janggu) – Janggu model wrapper
  • inputs (Dataset, list(Dataset)) – Input Dataset.
  • chrom (str or None) – Chromosome name.
  • start (int or None) – Region start.
  • end (int or None) – Region end.

Examples

# Suppose DATA is a Bioseq or Cover object
# To query the input feature importance of a specific genomic region
# use
input_attribution(model, DATA, chrom='chr1', start=start, end=end)
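
To make the integrated gradients idea concrete, here is a minimal, self-contained sketch on a toy logistic model with an analytic gradient. It is not how input_attribution is implemented: the all-zero baseline, the midpoint Riemann sum and all function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights):
    """Toy differentiable model: logistic regression without bias."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def gradient(x, weights):
    """Analytic gradient of the toy model with respect to the input."""
    out = model(x, weights)
    return [out * (1.0 - out) * w for w in weights]


def integrated_gradients(x, weights, baseline=None, steps=200):
    """Approximate the path integral of the gradient with a midpoint sum."""
    if baseline is None:
        baseline = [0.0] * len(x)  # assumption: all-zero reference input
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, weights)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


weights = [2.0, -1.0, 0.5]
x = [1.0, 1.0, 0.0]
attributions = integrated_gradients(x, weights)
# Completeness: the attributions sum to model(x) - model(baseline).
difference = model(x, weights) - model([0.0, 0.0, 0.0], weights)
print(attributions, sum(attributions), difference)
```

The last line checks the completeness property from the cited paper: the attributions add up to the difference between the prediction at the input and at the baseline, up to the numerical error of the sum.
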

Performance evaluation

Scorer.score(model, predicted[, outputs, …]) Scoring of the predictions relative to true outputs.
Scorer.export(path, collection_name[, datatags]) Exporting of the results.
class janggu.Scorer(name, score_fct=None, conditions=None, exporter=<janggu.utils.ExportJson object>, immediate_export=True, percondition=True, subdir=None)[source]

Scorer class.

This class implements the callback interface that is used with Janggu.evaluate and Janggu.predict. The scorer maintains a scoring callable and an exporter callable which take care of determining the desired score and writing the result into a desired file, e.g. json, tsv or a figure, respectively.

Parameters:
  • name (str) – Name of the score to be performed.
  • score_fct (None or callable) – Callable that is invoked for scoring. This callable must satisfy the signature fct(y_true, y_pred) if used with Janggu.evaluate and fct(y_pred) if used with Janggu.predict. The returned score should be compatible with the exporter.
  • conditions (list(str) or None) – List of strings describing the conditions dimension of the dataset that is processed. If None, conditions are extracted from the y_true Dataset, if available. Otherwise, the conditions are integers ranging from zero to len(conditions) - 1.
  • exporter (callable) – Exporter function is used to export the scoring results in the desired manner, e.g. as json or tsv file. This function must satisfy the signature fct(output_path, filename_prefix, results).
  • immediate_export (boolean) – If set to True, the exporter function will be invoked immediately after the evaluation of the dataset. If set to False, the results are maintained in memory, which allows the results to be exported as a collection rather than individually.
  • percondition (boolean) – Indicates whether the evaluation should be performed per condition or across all conditions. The former determines a score for each output condition, while the latter first flattens the array and then scores across conditions. Default: percondition=True.
  • subdir (str) – Name of the subdir to store the output in. Default: None means the results are stored in the ‘evaluation’ subdir.
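
The two callable signatures named above, fct(y_true, y_pred) for scoring and fct(output_path, filename_prefix, results) for exporting, can be sketched without janggu as follows. The layout of results (tuple keys mapping to score entries) and the joining of tuple keys with '-' are assumptions for this example; mean_absolute_error and export_json are illustrative names.

```python
import json
import os
import tempfile


def mean_absolute_error(y_true, y_pred):
    """Score callable with the signature fct(y_true, y_pred)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def export_json(output_path, filename_prefix, results):
    """Exporter callable: fct(output_path, filename_prefix, results)."""
    os.makedirs(output_path, exist_ok=True)
    filename = os.path.join(output_path, filename_prefix + '.json')
    # Assumption: results uses tuple keys. They are joined into strings
    # because JSON only supports string keys.
    serializable = {'-'.join(key): value for key, value in results.items()}
    with open(filename, 'w') as handle:
        json.dump(serializable, handle)
    return filename


score = mean_absolute_error([0.0, 1.0, 1.0], [0.5, 1.0, 0.0])
results = {('test_model', 'out', 'cond0'): {'value': score}}
path = export_json(tempfile.mkdtemp(), 'mae', results)
print(score, path)

# With janggu installed, the callables would be handed over as, e.g.,
# Scorer('mae', score_fct=mean_absolute_error, exporter=export_json)
```
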
export(path, collection_name, datatags=None)[source]

Exporting of the results.

When calling export, the results which have been collected in self.results by using the score method are written to disk by invoking the supplied exporter function.

Parameters:
  • path (str) – Output directory.
  • collection_name (str) – Subdirectory in which the results should be stored. E.g. Modelname.
  • datatags (list(str) or None) – Optional tags describing the dataset. E.g. ‘training_set’. Default: None
score(model, predicted, outputs=None, datatags=None)[source]

Scoring of the predictions relative to true outputs.

When calling score, the provided score_fct is applied for each layer and condition separately. The result scores are maintained in a dict that uses (modelname, layername, conditionname) as key and as values another dict of the form: {'date':<currenttime>, 'value': derived_score, 'tags':datatags}.

Parameters:
  • model (Janggu) – a Janggu object representing the current model.
  • predicted (dict{name: np.array}) – Predicted outputs.
  • outputs (dict{name: Dataset} or None) – True output labels. If the Scorer is used with Janggu.evaluate, this argument will be present. With Janggu.predict it is absent.
  • datatags (list(str) or None) – Optional tags describing the dataset, e.g. ‘test_set’.
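
The bookkeeping described for score can be sketched in plain Python: the scoring callable is applied to each condition (column) separately and the outcome is stored under a (modelname, layername, conditionname) key. This is an illustration only; score_per_condition and the toy accuracy function are hypothetical names, and nested lists stand in for the arrays used in practice.

```python
from datetime import datetime


def accuracy(y_true, y_pred):
    """Fraction of predictions on the correct side of 0.5."""
    hits = sum((p >= 0.5) == bool(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def score_per_condition(modelname, layername, conditions, y_true, y_pred,
                        score_fct, datatags=None):
    """Apply score_fct to each condition (column) separately."""
    results = {}
    for idx, condition in enumerate(conditions):
        true_col = [row[idx] for row in y_true]
        pred_col = [row[idx] for row in y_pred]
        results[modelname, layername, condition] = {
            'date': str(datetime.now()),
            'value': score_fct(true_col, pred_col),
            'tags': datatags,
        }
    return results


y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
y_pred = [[0.9, 0.2], [0.1, 0.6], [0.8, 0.7], [0.3, 0.4]]
results = score_per_condition('test_model', 'out', ['Ctcf', 'Jund'],
                              y_true, y_pred, accuracy, ['test_set'])
for key, entry in results.items():
    print(key, entry['value'])
```
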

Performance score utilities

class janggu.ExportJson(filesuffix='json', annot=None, row_names=None)[source]

Method that dumps the results in a json file.

Parameters:
  • filesuffix (str) – Target file ending. Default: ‘json’.
  • annot (None or dict) – Annotation data. If encoded as dict, the key indicates the name, while the values hold a list of annotation labels. Default: None.
  • row_names (None or list) – List of row names. Default: None.
class janggu.ExportTsv(filesuffix='tsv', annot=None, row_names=None)[source]

Method that dumps the results as tsv file.

This class can be used to export general table summaries.

Parameters:
  • filesuffix (str) – File ending. Default: ‘tsv’.
  • annot (None or dict) – Annotation data. If encoded as dict, the key indicates the name, while the values hold a list of annotation labels. For example, this can be used to store the true output labels. Default: None.
  • row_names (None or list) – List of row names. For example, chromosomal loci. Default: None.
class janggu.ExportBed(gindexer, resolution)[source]

Export predictions to bed.

This function exports the predictions to bed format which allows you to inspect the predictions in a genome browser.

Parameters:
  • gindexer (GenomicIndexer) – GenomicIndexer that links the prediction for a certain region to its associated genomic coordinates.
  • resolution (int) – Used to output the results.
class janggu.ExportBigwig(gindexer)[source]

Export predictions to bigwig.

This function exports the predictions to bigwig format which allows you to inspect the predictions in a genome browser. Importantly, gindexer must contain non-overlapping windows!

Parameters:
  • gindexer (GenomicIndexer) – GenomicIndexer that links the prediction for a certain region to its associated genomic coordinates.
class janggu.ExportScorePlot(figsize=None, xlabel=None, ylabel=None, fform=None)[source]

Exporting score plot.

This class can be used for producing an AUC or PRC plot.

Parameters:
  • figsize (tuple(int, int)) – Used to specify the figure size for matplotlib.
  • xlabel (str or None) – xlabel used for the plot.
  • ylabel (str or None) – ylabel used for the plot.
  • fform (str or None) – Output file format. E.g. ‘png’, ‘eps’, etc. Default: ‘png’.

Decorators for network construction

janggu.inputlayer(func)[source]

Input layer decorator

This decorator appends an input layer to the network with the correct shape and name.

janggu.outputdense(activation)[source]

Output layer decorator

This decorator appends an output layer to the network with the correct shape, activation and layer name.

janggu.outputconv(activation)[source]

Output layer decorator

This decorator appends an output convolution layer to the network with the correct shape, activation and layer name.

Genomics-specific keras layers

class janggu.DnaConv2D(layer, merge_mode='max', **kwargs)[source]

DnaConv2D layer.

This layer wraps a normal keras Conv2D layer for scanning DNA sequences on both strands using the same weight matrices.

Parameters:
  • merge_mode (str or None) – Specifies how to merge information from both strands. Options: {“max”, “ave”, “concat”, None}. Default: “max”.

Examples

To scan both DNA strands for motif matches use

xin = Input((200, 1, 4))
dnalayer = DnaConv2D(Conv2D(nfilters, filter_shape))(xin)
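
The idea of scanning both strands with the same weights and merging with “max” can be sketched without keras. The code below is a simplified, one-dimensional illustration and not the layer's implementation: the channel order A, C, G, T and the way reverse-strand scores are aligned to forward-strand positions are assumptions.

```python
ALPHABET = 'ACGT'  # assumed channel order of the one-hot encoding
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}


def scan(sequence, weights):
    """Cross-correlate a (motif_length x 4) weight matrix with a sequence."""
    length = len(weights)
    return [sum(weights[j][ALPHABET.index(sequence[i + j])]
                for j in range(length))
            for i in range(len(sequence) - length + 1)]


def scan_both_strands(sequence, weights):
    """Scan both strands with the same weights and keep the maximum."""
    forward = scan(sequence, weights)
    revcomp = ''.join(COMPLEMENT[base] for base in reversed(sequence))
    # Reverse the scores so that they line up with forward-strand positions.
    reverse = scan(revcomp, weights)[::-1]
    return [max(f, r) for f, r in zip(forward, reverse)]


# Toy weight matrix rewarding the motif 'GAT'
# (rows: motif positions, columns: A, C, G, T).
weights = [[0, 0, 1, 0],
           [1, 0, 0, 0],
           [0, 0, 0, 1]]

print(scan('CCGATCC', weights))               # motif on the forward strand
print(scan('CCATCCC', weights))               # 'ATC' is the reverse complement of 'GAT'
print(scan_both_strands('CCATCCC', weights))  # full match recovered from the reverse strand
```
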
class janggu.Complement(*args, **kwargs)[source]

Complement layer.

This layer can be used with keras to determine the complementary DNA sequence in one-hot encoding from a given DNA sequence. It supports higher-order nucleotide representations, e.g. dinucleotides or trinucleotides. The order of the nucleotide representation is automatically determined from the previous layer. To this end, the input layer is assumed to hold the nucleotide representation in dimension 3. The layer uses a permutation matrix that is multiplied with the original input dataset in order to evaluate the complementary sequence’s one-hot representation.

forwardstrand_dna = Input((200, 1, 4))
reversestrand_dna = Complement()(forwardstrand_dna)
# this also works for higher-order one-hot encoding.
class janggu.Reverse(axis=1, **kwargs)[source]

Reverse layer.

This layer can be used with keras to reverse a tensor for a given axis.

Parameters:
  • axis (int) – Axis which needs to be reversed. Default: 1.
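
What Complement and Reverse compute can be sketched on a plain one-hot matrix. This is an illustration under assumptions, not the layers' code: the channel order is taken to be A, C, G, T, only the first-order (mononucleotide) case is shown, and the singleton dimension of the (200, 1, 4) input shape is dropped for brevity.

```python
ALPHABET = 'ACGT'  # assumed channel order of the one-hot encoding

# Permutation matrix that swaps A<->T and C<->G.
COMPLEMENT = [[0, 0, 0, 1],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]]


def one_hot(sequence):
    return [[int(base == letter) for letter in ALPHABET] for base in sequence]


def decode(encoded):
    return ''.join(ALPHABET[row.index(1)] for row in encoded)


def complement(encoded):
    """Multiply every position (row vector) with the permutation matrix."""
    return [[sum(row[k] * COMPLEMENT[k][j] for k in range(4))
             for j in range(4)] for row in encoded]


def reverse(encoded):
    """Reverse along the sequence axis."""
    return encoded[::-1]


forward = one_hot('AACGT')
print(decode(complement(forward)))           # complementary strand
print(decode(reverse(complement(forward))))  # reverse complement
```
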
class janggu.LocalAveragePooling2D(window_size=1, **kwargs)[source]

LocalAveragePooling2D layer.

This layer performs window averaging along the lead axis of an input tensor using a given window_size. At the moment, it assumes data_format=’channels_last’. This is similar to applying GlobalAveragePooling2D, but where the average is determined in a window of length ‘window_size’, rather than along the entire sequence length.

Parameters:
  • window_size (int) – Averaging window size. Default: 1.
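
The window averaging can be illustrated on a one-dimensional signal. The sketch assumes a sliding window with stride 1 and no padding, which the description above does not state explicitly; local_average is a hypothetical helper.

```python
def local_average(values, window_size=1):
    """Average over every run of window_size consecutive positions."""
    # Assumption: stride 1 and no padding, so the output has
    # len(values) - window_size + 1 entries.
    return [sum(values[i:i + window_size]) / window_size
            for i in range(len(values) - window_size + 1)]


signal = [0.0, 2.0, 4.0, 6.0, 8.0]
print(local_average(signal, window_size=1))  # window of one position
print(local_average(signal, window_size=2))  # pairwise averages
print(local_average(signal, window_size=5))  # window spans the whole signal
```

With window_size equal to the sequence length, the result reduces to a single global average, which mirrors the comparison to GlobalAveragePooling2D made above.
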