janggu - Utilities for creating, fitting and evaluating models
This section describes the interface and utilities to build and evaluate deep learning applications with janggu.
Janggu model and utilities for deep learning in genomics.
Janggu(inputs, outputs[, name]) | Janggu class
Janggu.create(template[, modelparams, …]) | Janggu constructor method.
Janggu.create_by_name(name[, custom_objects]) | Creates a Janggu object by name.
Janggu.fit([inputs, outputs, batch_size, …]) | Model fitting.
Janggu.predict(inputs[, batch_size, …]) | Performs a prediction.
Janggu.evaluate([inputs, outputs, …]) | Evaluates the performance.
input_attribution(model, inputs[, chrom, …]) | Evaluates the integrated gradients method on the input coverage tracks.
Janggu Model
class janggu.Janggu(inputs, outputs, name=None)
Janggu class
The class Janggu maintains a keras.models.Model object that is an instance of a neural network. Furthermore, to the outside, Janggu behaves similarly to keras.models.Model, which allows you to create, fit, and evaluate the model.
Parameters:
- inputs (Input or list(Input)) – Input layer or list of Inputs as defined by keras. See https://keras.io/.
- outputs (Layer or list(Layer)) – Output layer or list of outputs. See https://keras.io/.
- name (str) – Name of the model.
Examples
Define a Janggu object similar to keras.models.Model using Input and Output layers.
    from keras.layers import Input
    from keras.layers import Dense
    from janggu import Janggu

    # Define neural network layers using keras
    in_ = Input(shape=(10,), name='ip')
    layer = Dense(3)(in_)
    output = Dense(1, activation='sigmoid', name='out')(layer)

    # Instantiate a model.
    model = Janggu(inputs=in_, outputs=output, name='test_model')
    model.summary()
compile(*args, **kargs)
Model compilation.
This method delegates the compilation to keras.models.Model.compile. See also https://keras.io/models/model/
Examples
model.compile(optimizer='adadelta', loss='binary_crossentropy')
classmethod create(template, modelparams=None, inputs=None, outputs=None, name=None)
Janggu constructor method.
This method instantiates a Janggu model from a model template and parameters. It can also automatically infer and add the correct input and output layers for the network.
Parameters:
- template (function) – Python function that defines a model template of a neural network. The function signature must adhere to the signature template(inputs, inputs, outputs, modelparams) and is expected to return (input_tensor, output_tensor) of the neural network.
- modelparams (list or tuple or None) – Additional model parameters that are passed along to template upon creation of the neural network. For instance, this could contain the number of neurons at each layer. Default: None.
- inputs (Dataset, list(Dataset) or None) – Input datasets from which the input layer shapes should be derived. Use this option together with the inputlayer decorator (see Example below).
- outputs (Dataset, list(Dataset) or None) – Output datasets from which the output layer shapes should be derived. Use this option together with the outputdense or outputconv decorators (see Example below).
- name (str or None) – Model name. If None, a unique model name is generated based on the model configuration and network architecture.
Examples
Specify a model using a model template and parameters:
    from keras.layers import Input, Dense
    from janggu import Janggu

    def test_manual_model(inputs, inp, oup, params):
        in_ = Input(shape=(10,), name='ip')
        layer = Dense(params)(in_)
        # Feed the hidden layer (not the raw input) into the output layer
        output = Dense(1, activation='sigmoid', name='out')(layer)
        return in_, output

    # Defines the same model by invoking the definition function
    # and the create constructor.
    model = Janggu.create(template=test_manual_model, modelparams=3)
    model.summary()
Specify a model using automatic input and output layer determination. That is, only the model body needs to be specified:
    import numpy as np
    from keras.layers import Dense
    from janggu import Janggu
    from janggu import inputlayer, outputdense
    from janggu.data import Array

    # Some random data which you would like to use as input for the
    # model.
    DATA = Array('ip', np.random.random((1000, 10)))
    LABELS = Array('out', np.random.randint(2, size=(1000, 1)))

    # The decorators inputlayer and outputdense
    # extract the layer shapes and append the respective layers
    # to the network
    # so that only the model body remains to be specified.
    # Note that the decorator order matters.
    # inputlayer must be specified before outputdense.
    @inputlayer
    @outputdense(activation='sigmoid')
    def test_inferred_model(inputs, inp, oup, params):
        with inputs.use('ip') as in_:
            # the with block allows for easy
            # access of a specific named input.
            output = Dense(params)(in_)
        return in_, output

    # create a model.
    model = Janggu.create(template=test_inferred_model, modelparams=3,
                          name='test_model',
                          inputs=DATA,
                          outputs=LABELS)
classmethod create_by_name(name, custom_objects=None)
Creates a Janggu object by name.
This option is used to load a pre-trained model.
Parameters:
- name (str) – Name of the model.
- custom_objects (dict or None) – This allows loading of custom layers using load_model. All janggu specific layers are automatically included as custom_objects. Default: None
Examples
    in_ = Input(shape=(10,), name='ip')
    layer = Dense(3)(in_)
    output = Dense(1, activation='sigmoid', name='out')(layer)

    # Instantiate a model.
    model = Janggu(inputs=in_, outputs=output, name='test_model')

    # saves the model to <janggu_results>/models
    model.save()

    # remove the original model
    del model

    # reload the model
    model = Janggu.create_by_name('test_model')
evaluate(inputs=None, outputs=None, batch_size=None, sample_weight=None, steps=None, datatags=None, callbacks=None, use_multiprocessing=False, workers=1)
Evaluates the performance.
This method is used to evaluate a given model. All of the parameters are delegated directly to the evaluate_generator of the keras model. See https://keras.io/models/model/#methods.
Parameters:
- inputs (Dataset, list(Dataset) or Sequence (keras.utils.Sequence)) – Input Dataset or Sequence to use for evaluating the model.
- outputs (Dataset, list(Dataset) or None) – Output Dataset containing the targets to evaluate against. If a Sequence is used for inputs, outputs will have no effect.
- batch_size (int or None) – Batch size. If set to None a batch size of 32 is used.
- sample_weight (np.array or None) – Sample weights. See https://keras.io.
- steps (int or None) – Number of evaluation steps. If None, this value is determined from the dataset size and the batch_size.
- datatags (list(str) or None) – Tags to annotate the evaluation results. Default: None.
- callbacks (List(Scorer or str)) – Scorer instances to be applied on the predictions. Furthermore, commonly used scoring metrics can be added by name, including ‘roc’, ‘auroc’, ‘prc’, ‘auprc’ for evaluating binary classification applications and ‘cor’ (for Pearson’s correlation), ‘mae’, ‘mse’ and ‘var_explained’ for regression applications.
- use_multiprocessing (boolean) – Whether to use multiprocessing for the prediction. Default: False.
- workers (int) – Number of workers to use. Default: 1.
Examples
    model.evaluate(DATA, LABELS)

    # binary classification evaluation with callbacks
    model.evaluate(DATA, LABELS, callbacks=['auprc', 'auroc'])
fit(inputs=None, outputs=None, batch_size=None, epochs=1, verbose=1, callbacks=None, validation_data=None, shuffle=True, class_weight=None, sample_weight=None, initial_epoch=0, steps_per_epoch=None, use_multiprocessing=False, workers=1)
Model fitting.
This method is used to fit a given model. Most of the parameters are delegated directly to the fit_generator of the keras model.
Parameters:
- inputs (Dataset, list(Dataset) or Sequence (keras.utils.Sequence)) – Input Dataset or Sequence to use for fitting the model.
- outputs (Dataset, list(Dataset) or None) – Output Dataset containing the training targets. If a Sequence is used for inputs, outputs will have no effect.
- batch_size (int or None) – Batch size. If set to None a batch size of 32 is used.
- epochs (int) – Number of epochs. Default: 1.
- verbose (int) – Verbosity level. See https://keras.io.
- callbacks (List(keras.callbacks.Callback)) – Callbacks to be applied during training. See https://keras.io/callbacks
- validation_data (tuple, Sequence or None) – Validation data can be a tuple (input_dataset, output_dataset) or (input_dataset, output_dataset, sample_weights), a keras.utils.Sequence instance, or a list of validation chromosomes. The latter choice only works when using Cover and Bioseq datasets. This allows you to train on a dedicated set of chromosomes and to validate the performance on the respective held-out chromosomes. If None, validation is not applied.
- shuffle (boolean) – Whether to shuffle batches. Default: True.
- class_weight (dict) – Class weights. See https://keras.io.
- sample_weight (np.array or None) – Sample weights. See https://keras.io.
- initial_epoch (int) – Initial epoch at which to start training.
- steps_per_epoch (int or None) – Number of steps per epoch. If None, this value is determined from the dataset size and the batch_size.
- use_multiprocessing (boolean) – Whether to use multiprocessing. See https://keras.io. Default: False.
- workers (int) – Number of workers to use in multiprocessing mode. Default: 1.
Examples
model.fit(DATA, LABELS)
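The chromosome-based option for validation_data amounts to partitioning genomic regions by chromosome name. The following is a minimal pure-Python sketch of that idea only; the helper split_by_chromosome and the (chrom, start, end) tuples are hypothetical and are not part of the janggu API.

```python
def split_by_chromosome(regions, validation_chroms):
    """Partition (chrom, start, end) regions into training and validation sets."""
    held_out = set(validation_chroms)
    train = [region for region in regions if region[0] not in held_out]
    valid = [region for region in regions if region[0] in held_out]
    return train, valid


# Hypothetical regions; in practice these would come from the dataset itself.
regions = [
    ("chr1", 0, 200),
    ("chr1", 200, 400),
    ("chr2", 0, 200),
    ("chr3", 0, 200),
]

# Hold out chr2 for validation and train on everything else.
train_regions, valid_regions = split_by_chromosome(regions, ["chr2"])
print(len(train_regions), len(valid_regions))
```

Holding out whole chromosomes keeps neighbouring, possibly overlapping regions from ending up on both sides of the split.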
name
Model name
predict(inputs, batch_size=None, verbose=0, steps=None, layername=None, datatags=None, callbacks=None, use_multiprocessing=False, workers=1)
Performs a prediction.
This method predicts the targets. All of the parameters are delegated directly to the predict_generator of the keras model. See https://keras.io/models/model/#methods.
Parameters:
- inputs (Dataset, list(Dataset) or Sequence (keras.utils.Sequence)) – Input Dataset or Sequence to use for the prediction.
- batch_size (int or None) – Batch size. If set to None a batch size of 32 is used.
- verbose (int) – Verbosity level. See https://keras.io.
- steps (int or None) – Number of predict steps. If None, this value is determined from the dataset size and the batch_size.
- layername (str or None) – Layername for which the prediction should be performed. If None, the output layer will be used automatically.
- datatags (list(str) or None) – Tags to annotate the evaluation results. Default: None.
- callbacks (List(Scorer)) – Scorer instances to be applied on the predictions.
- use_multiprocessing (boolean) – Whether to use multiprocessing for the prediction. Default: False.
- workers (int) – Number of workers to use. Default: 1.
Examples
model.predict(DATA)
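When steps is None, the number of steps is derived from the dataset size and the batch size. A small sketch of that arithmetic follows. It assumes that a final, partially filled batch still counts as one step (ceiling division); the function number_of_steps is a hypothetical helper, not a janggu function.

```python
import math


def number_of_steps(num_samples, batch_size=None):
    """Number of batches needed to cover num_samples once."""
    if batch_size is None:
        batch_size = 32  # default batch size stated in the parameter list
    # Assumption: an incomplete last batch is still processed as one step.
    return math.ceil(num_samples / batch_size)


steps_default = number_of_steps(1000)
steps_custom = number_of_steps(1000, batch_size=100)
print(steps_default, steps_custom)
```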
predict_variant_effect(bioseq, variants, conditions, output_folder, condition_filter=None, batch_size=None, annotation=None, ignore_reference_match=False, order=1)
Evaluates the performance.
See help(predict_variant_effect) for a description of the parameters.
Examples
    # Evaluate all variants and all conditions (outputs)
    model.predict_variant_effect(DATA, VARIANTS, CONDITIONS, 'vcfoutput')

    # Evaluate all variants and a subset of conditions (Ctcf output labels)
    model.predict_variant_effect(DATA, VARIANTS, CONDITIONS, 'vcfoutput_subset',
                                 condition_filter='Ctcf')
Input feature attribution
janggu.input_attribution(model, inputs, chrom=None, start=None, end=None, idx=None)
Evaluates the integrated gradients method on the input coverage tracks.
This makes it possible to attribute feature importance values to the prediction scores. Integrated gradients were introduced in Sundararajan, Taly and Yan, Axiomatic Attribution for Deep Networks. PMLR 70, 2017.
The method can be called either by specifying the region of interest directly via chrom, start and end, or by specifying the region index, for example the n-th region of the dataset.
Parameters:
- model (Janggu) – Janggu model wrapper
- inputs (Dataset, list(Dataset)) – Input Dataset.
- chrom (str or None) – Chromosome name.
- start (int or None) – Region start.
- end (int or None) – Region end.
- idx (int or None) – The index of the i-th sequence to use for the attribution. If idx is set, chrom, start and end are ignored.
Examples
    # Suppose DATA is a Bioseq or Cover object
    # To query the input feature importance of a specific genomic region
    # use
    input_attribution(model, DATA, chrom='chr1', start=start, end=end)

    # To query the input feature importance using a specific index
    input_attribution(model, DATA, idx=0)
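To illustrate what integrated gradients computes, here is a self-contained sketch for a tiny logistic model written in plain Python. It is not janggu's implementation: the model, the all-zero baseline, the weights and the midpoint Riemann approximation with 200 steps are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights, bias):
    """Toy model: logistic regression on a feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def gradient(x, weights, bias):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Average the gradient along the straight path from baseline to x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = gradient(point, weights, bias)
        totals = [t + g for t, g in zip(totals, grads)]
    # Scale the averaged gradient by the input difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


weights = [2.0, -1.0, 0.5]
bias = 0.1
x = [1.0, 0.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights, bias)
difference = model(x, weights, bias) - model(baseline, weights, bias)
print(attributions, sum(attributions), difference)
```

The attributions sum approximately to the difference between the prediction at the input and at the baseline, and a feature that equals its baseline value receives zero attribution.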
Performance evaluation
Scorer.score(model, predicted[, outputs, …]) | Scoring of the predictions relative to true outputs.
Scorer.export(path, collection_name[, datatags]) | Exporting of the results.
class janggu.Scorer(name, score_fct=None, conditions=None, exporter=<janggu.utils.ExportJson object>, immediate_export=True, percondition=True, subdir=None)
Scorer class.
This class implements the callback interface that is used with Janggu.evaluate and Janggu.predict. The scorer maintains a scoring callable and an exporter callable, which take care of determining the desired score and of writing the result into a desired file (e.g. json, tsv or a figure), respectively.
Parameters:
- name (str) – Name of the score to be performed.
- score_fct (None or callable) – Callable that is invoked for scoring. This callable must satisfy the signature fct(y_true, y_pred) if used with Janggu.evaluate and fct(y_pred) if used with Janggu.predict. The returned score should be compatible with the exporter.
- conditions (list(str) or None) – List of strings describing the conditions dimension of the dataset that is processed. If None, conditions are extracted from the y_true Dataset, if available. Otherwise, the conditions are integers ranging from zero to len(conditions) - 1.
- exporter (callable) – Exporter function used to export the scoring results in the desired manner, e.g. as json or tsv file. This function must satisfy the signature fct(output_path, filename_prefix, results).
- immediate_export (boolean) – If set to True, the exporter function will be invoked immediately after the evaluation of the dataset. If set to False, the results are maintained in memory, which allows the results to be exported as a collection rather than individually.
- percondition (boolean) – Indicates whether the evaluation should be performed per condition or across all conditions. The former determines a score for each output condition, while the latter first flattens the array and then scores across conditions. Default: percondition=True.
- subdir (str) – Name of the subdir to store the output in. Default: None means the results are stored in the ‘evaluation’ subdir.
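The difference between per-condition and flattened scoring can be sketched with a scoring callable that follows the fct(y_true, y_pred) signature. The example below uses plain Python lists instead of janggu datasets; the accuracy function, the two helper functions and the toy data are hypothetical and only illustrate the two modes described for percondition.

```python
def accuracy(y_true, y_pred, threshold=0.5):
    """Scoring callable with the signature fct(y_true, y_pred)."""
    hits = sum((p >= threshold) == bool(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def score_per_condition(score_fct, y_true, y_pred):
    """One score per output column (condition)."""
    n_conditions = len(y_true[0])
    return [
        score_fct([row[c] for row in y_true], [row[c] for row in y_pred])
        for c in range(n_conditions)
    ]


def score_flattened(score_fct, y_true, y_pred):
    """Flatten samples and conditions first, then compute a single score."""
    flat_true = [value for row in y_true for value in row]
    flat_pred = [value for row in y_pred for value in row]
    return score_fct(flat_true, flat_pred)


# Rows are samples, columns are conditions (toy values).
y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
y_pred = [[0.9, 0.2], [0.1, 0.7], [0.8, 0.6], [0.3, 0.4]]

per_condition = score_per_condition(accuracy, y_true, y_pred)
flattened = score_flattened(accuracy, y_true, y_pred)
print(per_condition, flattened)
```

Per-condition scoring exposes that the second condition is predicted much worse than the first, which the single flattened score averages away.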
export(path, collection_name, datatags=None)
Exporting of the results.
When calling export, the results which have been collected in self.results by using the score method are written to disk by invoking the supplied exporter function.
Parameters:
- path (str) – Output directory.
- collection_name (str) – Subdirectory in which the results should be stored. E.g. Modelname.
- datatags (list(str) or None) – Optional tags describing the dataset. E.g. ‘training_set’. Default: None
score(model, predicted, outputs=None, datatags=None)
Scoring of the predictions relative to true outputs.
When calling score, the provided score_fct is applied for each layer and condition separately. The resulting scores are maintained in a dict that uses (modelname, layername, conditionname) as key and, as values, another dict of the form: {'date':<currenttime>, 'value': derived_score, 'tags':datatags}.
Parameters:
- model (Janggu) – a Janggu object representing the current model.
- predicted (dict{name: np.array}) – Predicted outputs.
- outputs (dict{name: Dataset} or None) – True output labels. If the Scorer is used with Janggu.evaluate, this argument will be present. With Janggu.predict, it is absent.
- datatags (list(str) or None) – Optional tags describing the dataset, e.g. ‘test_set’.
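As an illustration of the results structure and of the exporter signature fct(output_path, filename_prefix, results), the following sketch writes such a dict to a JSON file using only the standard library. The function export_json, the joined string keys and the example values are assumptions made for this sketch; janggu's own ExportJson may organise its output differently.

```python
import json
import os
import tempfile
from datetime import datetime


def export_json(output_path, filename_prefix, results):
    """Exporter callable with the signature fct(output_path, filename_prefix, results)."""
    os.makedirs(output_path, exist_ok=True)
    # JSON keys must be strings, so the tuple keys are joined with '-'
    # (a choice made for this sketch only).
    serializable = {"-".join(key): value for key, value in results.items()}
    filename = os.path.join(output_path, filename_prefix + ".json")
    with open(filename, "w") as handle:
        json.dump(serializable, handle, indent=2)
    return filename


# Results keyed by (modelname, layername, conditionname); values are made up.
results = {
    ("test_model", "out", "condition_0"): {
        "date": str(datetime.now()),
        "value": 0.93,
        "tags": ["test_set"],
    }
}

with tempfile.TemporaryDirectory() as tmpdir:
    written = export_json(tmpdir, "auroc", results)
    with open(written) as handle:
        reloaded = json.load(handle)

print(reloaded)
```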
Performance score utilities
class janggu.ExportJson(filesuffix='json', annot=None, row_names=None)
Method that dumps the results in a json file.
Parameters:
- filesuffix (str) – Target file ending. Default: ‘json’.
- annot (None, dict) – Annotation data. If encoded as dict, the key indicates the name, while the value holds a list of annotation labels. Default: None.
- row_names (None or list) – List of row names. Default: None.
class janggu.ExportTsv(filesuffix='tsv', annot=None, row_names=None)
Method that dumps the results as tsv file.
This class can be used to export general table summaries.
Parameters:
- filesuffix (str) – File ending. Default: ‘tsv’.
- annot (None, dict) – Annotation data. If encoded as dict, the key indicates the name, while the value holds a list of annotation labels. For example, this can be used to store the true output labels. Default: None.
- row_names (None, list) – List of row names. For example, chromosomal loci. Default: None.
class janggu.ExportBed(gindexer, resolution)
Export predictions to bed.
This function exports the predictions to bed format which allows you to inspect the predictions in a genome browser.
Parameters:
- gindexer (GenomicIndexer) – GenomicIndexer that links the prediction for a certain region to its associated genomic coordinates.
- resolution (int) – Used to output the results.
class janggu.ExportBigwig(gindexer)
Export predictions to bigwig.
This function exports the predictions to bigwig format which allows you to inspect the predictions in a genome browser. Importantly, gindexer must contain non-overlapping windows!
Parameters: gindexer (GenomicIndexer) – GenomicIndexer that links the prediction for a certain region to its associated genomic coordinates.
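Because the bigwig export requires non-overlapping windows, it can help to check the regions beforehand. The sketch below is a generic interval check in plain Python, assuming half-open (chrom, start, end) intervals; the helper has_overlaps is hypothetical and does not operate on a GenomicIndexer object.

```python
def has_overlaps(regions):
    """Return True if any two (chrom, start, end) half-open intervals overlap."""
    last_end = {}
    # Sorting groups regions by chromosome and orders them by start position.
    for chrom, start, end in sorted(regions):
        if chrom in last_end and start < last_end[chrom]:
            return True
        last_end[chrom] = max(end, last_end.get(chrom, end))
    return False


# Tiled windows touch but do not overlap; sliding windows do overlap.
tiled = [("chr1", 0, 200), ("chr1", 200, 400), ("chr2", 0, 200)]
sliding = [("chr1", 0, 200), ("chr1", 100, 300)]

print(has_overlaps(tiled), has_overlaps(sliding))
```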
class janggu.ExportScorePlot(figsize=None, xlabel=None, ylabel=None, fform=None)
Exporting score plot.
This class can be used for producing an AUC or PRC plot.
Parameters:
- figsize (tuple(int, int)) – Used to specify the figure size for matplotlib.
- xlabel (str or None) – xlabel used for the plot.
- ylabel (str or None) – ylabel used for the plot.
- fform (str or None) – Output file format. E.g. ‘png’, ‘eps’, etc. Default: ‘png’.
Decorators for network construction
janggu.inputlayer(func)
Input layer decorator
This decorator appends an input layer to the network with the correct shape and name.
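The general mechanism of such a decorator can be sketched without keras. In the toy version below, plain dicts stand in for keras layers and (name, shape) pairs stand in for datasets; toy_inputlayer and body are hypothetical names, and the code is a conceptual illustration rather than janggu's implementation.

```python
from functools import wraps


def toy_inputlayer(func):
    """Toy decorator: derive input placeholders from the datasets, then call the body."""
    @wraps(func)
    def wrapper(inputs, inp, oup, params):
        # Each dataset is a (name, shape) pair; the leading sample axis is dropped
        # so that the placeholder describes the shape of a single example.
        placeholders = {
            name: {"name": name, "shape": shape[1:]} for name, shape in inp
        }
        return func(placeholders, inp, oup, params)
    return wrapper


@toy_inputlayer
def body(inputs, inp, oup, params):
    """Model body that only refers to an already prepared, named input."""
    in_ = inputs["ip"]
    output = {"units": params, "applied_to": in_["name"]}
    return in_, output


# The caller passes no input layers; the decorator creates them from the dataset.
in_, output = body(None, [("ip", (1000, 10))], None, 3)
print(in_, output)
```

The body function never states the input shape itself, which is why the same template can be reused with datasets of different shapes.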
Genomics-specific keras layers
class janggu.DnaConv2D(layer, merge_mode='max', **kwargs)
DnaConv2D layer.
This layer wraps a normal keras Conv2D layer for scanning DNA sequences on both strands using the same weight matrices.
Parameters: merge_mode (str or None) – Specifies how to merge information from both strands. Options: {“max”, “ave”, “concat”, None}. Default: “max”.
Examples
To scan both DNA strands for motif matches use
    from keras.layers import Input, Conv2D
    from janggu import DnaConv2D

    # nfilters and filter_shape are placeholders to be set by the user
    xin = Input((200, 1, 4))
    dnalayer = DnaConv2D(Conv2D(nfilters, filter_shape))(xin)
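The idea of scanning both strands with shared weights and merging with "max" can be sketched on plain strings. In this toy version a position-specific score table replaces the Conv2D filter, and all function names and the example motif are assumptions for illustration; it is not how the keras layer is implemented.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def scan(seq, weights):
    """Slide the weight table along seq and sum the per-position scores."""
    width = len(weights)
    return [
        sum(weights[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]


def scan_both_strands(seq, weights):
    """Scan forward and reverse strand with the same weights, merge by max."""
    forward = scan(seq, weights)
    # Reversing the reverse-strand scores aligns them with forward coordinates.
    reverse = scan(reverse_complement(seq), weights)[::-1]
    return [max(f, r) for f, r in zip(forward, reverse)]


# Toy weights that reward the motif "GAT" (one point per matching base).
motif = "GAT"
weights = [{base: (1.0 if base == m else 0.0) for base in "ACGT"} for m in motif]

seq = "CCATCCC"  # contains "ATC", the reverse complement of "GAT"
forward_only = scan(seq, weights)
merged = scan_both_strands(seq, weights)
print(forward_only, merged)
```

The full motif match is only visible in the merged scores, because the motif occurs on the reverse strand.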
class janggu.Complement(*args, **kwargs)
Complement layer.
This layer can be used with keras to determine the complementary DNA sequence in one-hot encoding from a given DNA sequence. It supports higher-order nucleotide representation, e.g. dinucleotides, trinucleotides. The order of the nucleotide representation is automatically determined from the previous layer. To this end, the input layer is assumed to hold the nucleotide representation in dimension 3. The layer uses a permutation matrix that is multiplied with the original input dataset in order to evaluate the complementary sequence’s one-hot representation.
    forwardstrand_dna = Input((200, 1, 4))
    reversestrand_dna = Complement()(forwardstrand_dna)
    # this also works for higher-order one-hot encoding.
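The permutation-matrix approach can be illustrated for first-order one-hot encoding in plain Python. The sketch assumes the column order A, C, G, T, which is an assumption of this example and not necessarily the order used by janggu; the helper names are hypothetical.

```python
# Assumed nucleotide order of the one-hot columns: A, C, G, T
PERMUTATION = [
    [0, 0, 0, 1],  # A -> T
    [0, 0, 1, 0],  # C -> G
    [0, 1, 0, 0],  # G -> C
    [1, 0, 0, 0],  # T -> A
]


def one_hot(seq, alphabet="ACGT"):
    return [[1 if base == letter else 0 for letter in alphabet] for base in seq]


def decode(encoded, alphabet="ACGT"):
    return "".join(alphabet[row.index(1)] for row in encoded)


def complement(encoded):
    """Multiply each one-hot row vector with the permutation matrix."""
    return [
        [sum(row[k] * PERMUTATION[k][j] for k in range(4)) for j in range(4)]
        for row in encoded
    ]


forward = one_hot("ACGTTG")
complemented = complement(forward)
print(decode(complemented))
```

Applying the permutation twice recovers the original encoding, since complementing is its own inverse.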
class janggu.Reverse(axis=1, **kwargs)
Reverse layer.
This layer can be used with keras to reverse a tensor for a given axis.
Parameters: axis (int) – Axis which needs to be reversed. Default: 1.
class janggu.LocalAveragePooling2D(window_size=1, **kwargs)
LocalAveragePooling2D layer.
This layer performs window averaging along the lead axis of an input tensor using a given window_size. At the moment, it assumes data_format=’channels_last’. This is similar to applying GlobalAveragePooling2D, but where the average is determined in a window of length ‘window_size’, rather than along the entire sequence length.
Parameters: window_size (int) – Averaging window size. Default: 1.
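Window averaging along a single axis can be sketched as follows. The example assumes stride-1 sliding windows without padding, so the output is shorter than the input by window_size - 1; this is an assumption of the sketch, and the keras layer may handle borders differently. The function local_average is a hypothetical helper.

```python
def local_average(values, window_size=1):
    """Average over sliding windows of length window_size (stride 1, no padding)."""
    n_windows = len(values) - window_size + 1
    return [
        sum(values[i:i + window_size]) / window_size
        for i in range(n_windows)
    ]


signal = [1.0, 3.0, 5.0, 7.0]

smoothed = local_average(signal, window_size=2)
# A window spanning the whole sequence reduces to the global average.
global_like = local_average(signal, window_size=len(signal))
print(smoothed, global_like)
```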