TF-NN

A biologically interpretable neural network

predicting gene expression from ChIP-seq data

Semester Project - Report

January 2021

Evan Béal

In the Laboratory of Computational and Systems Biology

1 Abstract

Understanding the contribution of the different transcription factors in the process of gene expression

regulation at a genome-wide level is a key challenge in molecular biology. With the progress of the high-

throughput technologies, more and more gene expression data are collected under speciﬁc conditions.

Linking those data to predicted interactions data between transcription factors and genes (ChIP-seq

data) is therefore of great interest to understand the regulatory process leading to the gene expression

observed. This is what is proposed with the TF-NN, a neural network that has been developed to

determine how the transcription factors are controlling gene expression. Given ChIP-seq data, the

TF-NN predicts gene expression data obtained under certain conditions of interest. The network has

been designed to account for some biological properties allowing to extract key information on the

effect of each transcription factor on gene expression by looking at the parameters of the model trained

for the prediction task. Using gene expression data collected in the liver and at speciﬁc times, the

results for the prediction task show that 53.3% of variance of the gene expression is explained with

the ChIP-seq data. The transcription factors being the most involved in the prediction appear to be

either some global transcription factors or transcription factors known to be involved in the liver or

the circadian rhythm process. Hence, it assesses that some mechanisms regulating gene expression in

the cells are reﬂected in the parameters of the model. The analysis of the TF-NN can be extended to

multiple tissues following a speciﬁc process. Eventually, an alternative architecture of the TF-NN can

be used to reduce the dimensionality of the ChIP-seq data provided at the input of the model, going

from more than 10’000 experiments considered to only 10 dimensions.

2 Introduction

Proteins are responsible of nearly every task of cellular life and determine therefore the health,

function or interaction of the cells in which they are involved [

]. A functional protein results from

a complex process that begins with the DNA transcription. Hence, the transcription process plays

a central role in gene expression since it represents a critical ﬁrst step. In addition, it is also a key

process in speciﬁcity as it is the primary target for gene regulation that results in different proteins

being produced in different tissues [2].

Speciﬁc protein factors are responsible of both the basal transcription and its regulation. They are the

transcription factors. Those regulators of gene expression bind to cognate sites in DNA and regulate

the rate of transcription initiation [

]. Their activity is therefore regulated in order to produce speciﬁc

patterns of gene expression [

]. Upon binding to DNA, a transcription factor interacts with other

factors or RNA polymerase to initiate or repress DNA transcription.

The measure of transcription factor activity is of great interest to understand the key process of

transcription factor binding to DNA in gene expression regulation. Different assays can be used to

study transcription factor activation. Besides electrophoretic mobility shift assay or ELISA-based

assay, there is the ChIP-seq method, which stands for Chromatin ImmunoPrecipitation sequencing.

This assay is used to identify protein-DNA binding events in their natural intracellular context. It

has the advantage of working in vivo, allowing to study transcription factor interaction changes over

time. By quantifying the relative abundance of the transcription factor at different locations of the

genome, it allows to determine which genes may be regulated by the transcription factor studied [

Transcription factors playing a central role in transcription which itself is a key process in gene

expression, the results obtained from a ChIP-seq assay under certain conditions are linked somehow

to the gene expression of the cell under the same conditions. Therefore, by determining gene

expression using for instance the RNA-seq technique, allowing to quantify RNA in a biological

sample at a given moment and analyze the continuously changing cellular transcriptome, it becomes

possible to link gene expression with ChIP-seq data of transcription factors.

This approach has notably been used in the MARA (Motif Activity Response Analysis) methodology

[

]. Considering as input a set of genome-wide gene expression across a number of samples, the

MARA model uses motif activity response analysis to identify which are the transcriptions factors

that are inducing gene expression changes across the samples [

]. Several advantages arise from

this approach that allows to obtain concrete predictions on the key regulators under the conditions

studied. However, as the method is based on the DNA motifs on which the transcription factors bind,

it may be subject to some limitations. Indeed, the transcription factor binding sites are small and

ﬂexible, meaning that decoding how gene regulation relates to transcription factor binding motifs

remains challenging [

]. Many transcription factors can bind to similar motifs as they are not strictly

restricted to a speciﬁc sequence motif. This means that when identifying the effect of a motif on

gene expression, it can’t be fully determined which speciﬁc transcription factor is responsible of this

effect. In addition, some transcription factors have to work together to get anything done which is

not something that can be retrieved from the analysis of the enriched sequence motifs. That’s why,

even though the MARA model has been proven to be able to detect key regulatory interactions using

gene expression data and motifs data from ChIP-seq experiments [

], it appears that working directly

with ChIP-seq data obtained for speciﬁc transcription factors could allow to bring an additional

interpretability layer to the relationship between gene expression regulation and the transcription

factors at the genome-wide level.

It’s from this idea that arises the TF-NN (Transcription Factors Neural Network). Given as input

ChIP-seq data representative of the interactions between transcription factors and genes, the TF-NN

predicts gene expression using deep learning. In addition to this predicting feature, this neural

network has also been designed to be representative of some biological properties, allowing the

analysis and interpretation of the model parameters to retrieve information on the gene expression

regulation under the conditions studied.

However, to be able to perform this analysis at the genome-wide level, ChIP-seq data for many

transcription factors and under many different conditions have to be available. Thanks to the ChIP-

Atlas database, it has been possible to obtain a great amount of data on the interactions between

transcription factors and genes allowing this deep learning approach for gene expression prediction.

The ChIP-Atlas is a database where almost all public ChIP-seq data submitted to the Sequence Read

Archives have been collected [

]. This leads to a total of more than 279’000 experiments performed

mostly in Mus Musculus and Homo Sapiens organisms [

]. Among the different features available,

the Target Genes was the one of interest for this project as it reunites all the genes that are predicted

to be directly regulated by a transcription factor. This is determined using the transcription factor

binding proﬁles of all public ChIP-seq data, allowing to obtain all the reported interaction results

for a transcription factor. By considering an important number of transcription factors, it becomes

possible to build the TF-NN using those data to predict gene expression.

3 Data

In this section, the data used for the project and to build the neural network are described. The aim

of the network being to predict gene expression using ChIP-seq data, both a gene expression and a

ChIP-seq dataset are needed for the output and the input of the model respectively.

The ChIP-seq dataset represents a subset from the target genes feature of the ChIP-Atlas. Those

data are descriptive of the observed and reported interactions between transcription factors and

their predicted target genes, measured using chromatin immuno-precipitation sequencing. For

this project, the data for 705 transcription factors are kept. This corresponds to a total of 10’805

experiments that were performed in mice. Those experiments, collected in the ChIP-Atlas database,

have reported ChIP-seq data under many different conditions: different cell types, cells under

speciﬁc external stress or data obtained at speciﬁc times. The 705 transcription factors considered

have a total of 13’080 target genes.

The gene expression dataset corresponds to RNA-seq gene expression data of samples from mouse

liver tissue collected at different period of the day. The dataset is a subset from the one used in

the paper: "Circadian and feeding rhythms differentially affect rhythmic mRNA transcription and

translation in mouse liver"[

]. The samples are taken at 6 speciﬁc time points for 4 biological

replicates. RNA-seq data were collected both in the nucleus before the splicing of the RNA, resulting

in intronic reads, and in the cytoplasm after the splicing, corresponding to exonic reads. Therefore,

those data provide a good proxy for transcription rates (intronic reads) and for mRNA abundance

(exonic reads) [

]. For each replicate, the samples are taken at 6 speciﬁc time points in the two

cases. This leads to a total of 48 samples and data for 13’080 genes for which ChIP-seq data are also

available.

Eventually, additional gene expression data were used to assess the model on different tissues than

the liver and explore how this network can be used with multiple tissues. Those additional data are

RNA-seq data of samples from 3 different tissues (liver, muscle and white adipose tissue). For each

tissue, the samples are collected at 15 speciﬁc time points for 2 biological replicates. Once again,

the RNA-seq has been performed both in nucleus before the splicing and in the cytoplasm after

the splicing, leading to 30 samples per replicate and therefore a total of 60 samples. Data for 14’337

genes were obtained with 12’117 of them that are common to the genes with available ChIP-seq data.

4 Methods

Below, the architecture of the TF-NN is described after that the data preprocessing has been stated.

In addition, it will be speciﬁed how this model has been designed to be representative of some

biological properties. Eventually, a different architecture of the TF-NN, allowing to highly reduced

the dimension of the ChIP-seq data, will be presented.

4.1 Data preprocessing

The preprocessing performed on the ChIP-seq data aimed to get rid (or at least reduce) the differences

observed in the interactions between transcription factors and genes that are due to experimental

design. Indeed, as the ChIP-seq data used for this project come from more than 10’000 experiments,

it’s important to reduce those technical differences to be able to have some real biological insights of

how the transcription factors really regulate gene expression. That’s why two preprocessing steps

were performed on those data. The ﬁrst one consisted to scale the interaction values between 0

and 1 at the experiment level. This means that for each experiment considered, all the interaction

values are divided by the maximum value reported in the experiment. This allows to have all the

experiments scaled to the same range of values meaning that no experiment will be more important

simply because all the values reported were higher than in the other experiments. The second pre-

processing step consisted to normalize the L2 norm to length 1 at the gene level. This allows to give

the same weight in the analysis to the lowly and highly expressed genes. This is necessary because it’s

expected for a highly expressed gene to have high interaction values for many transcription factors.

Therefore, without this preprocessing, this gene would bring more information to the model than a

lowly expressed one which is not a desired feature.

For the gene expression data, the data used were already preprocessed by computing the RPKM

(Reads Per Kilobase Million) and then log2 transforming those results. Therefore, no additional pre-

processing of the gene expression data were performed. Indeed, as the values reported were already

RPKM, this means that the gene expression values were already normalized for sequencing depth

and gene length. Hence, the differences between samples are already considered by normalizing for

sequencing depth as well as the one due to the gene length, allowing to use those gene expression

data without any additional preprocessing.

4.2 Architecture of the TF-NN

The TF-NN aims to predict gene expression data using ChIP-seq data. In order to perform this task it

needs to ﬁrst process the ChIP-seq data before linking the processed results together to predict gene

expression. Therefore, it has been designed in two parts that will be detailed separately.

4.2.1 First part of the network

The 1

part of the network takes as input the ChIP-seq data. Each experiment is represented by

an input neuron. However, those experiments are not fully connected to the next layer. Indeed, in

order to consider the speciﬁcities of the ChIP-seq data and to obtain a model that can be biologically

interpretable, only the experiments that have been collected for a same transcription factor are

connected to a same neuron. This neuron is therefore representative of the speciﬁc transcription

factor on which the ChIP-seq experiments have been performed. Hence, this ﬁrst part of the network

reduces the space of the ChIP-seq data to 705 dimensions, with each dimension representing one of

the transcription factor considered. The different experiments performed for a similar transcription

factor are grouped together into one neuron using a relu activation function. It’s interesting to use

this function in the 1

part as we want to obtain a network representative of what happen in the cells.

The relu activation function assigns to the neurons representing the transcription factors either a

positive or a null value. Therefore, given a certain gene, if a positive value is obtained for the neuron

corresponding to a transcription factor, it is representative of a transcription factor that is involved

in the expression of this gene. Whereas, if a null value is obtained, it suggests that this transcription

factor has no inﬂuence on the expression of the gene studied. Eventually, a bias term is added for

each group of experiments of a transcription factor. It’s useful to learn a certain threshold on the

ChIP-seq data that is descriptive of how important the interaction between the transcription factor

and the DNA should be to have this transcription factor involved in the gene expression prediction.

4.2.2 Second part of the network

Given as input the 705 neurons resulting from the 1

part of the model and representing the different

transcription factors, the 2

part aims to link those neurons to the output neurons. The output

neurons correspond to the different samples present in the gene expression data. Therefore, 48

output neurons are resulting from the TF-NN. The neurons representing the 705 transcription factors

are directly linked (fully connected) to the 48 output neurons predicting gene expression of the

different samples. This direct link allows to observe how the different transcription factors are

involved in the gene expression prediction. A linear activation function is used in this second part of

the network. This function was considered as a linear relationship is expected between how strong

the transcription factors interact with DNA and how expressed a gene is. A bias term is also added

in this part of the network to account for all the transcription factors that may be expressed in the

conditions studied but for which no ChIP-seq data are available. The bias value represents a kind

of basal level of the gene expression dataset as if no ChIP-seq data are available for a new gene, its

expression will be predicted at this bias value.

To avoid overﬁtting of the training data and have a model that learns the general features of the

gene expression distribution and don’t ﬁt the noise of the training data, both dropout layers and L2

penalty terms have been added in the two parts of the network.

4.2.3 Additional constraints to be more biologically representative

The aim of the TF-NN being to be both accurate in the gene expression prediction and interpretable

when analyzing its parameters, some additional constraints were added in this model to render it

more representative of what happen in the cells.

The ﬁrst constraint concerns the weights of the 1

part of the model, meaning the weights that are

learned for the different experiments of the ChIP-seq data. Without any constraint, those weights

could have been positive, negative or null. However, having negative weights don’t really make

sense because it suggests that an experiment for which a negative weight has been learned tend to

not activate the neuron representing the transcription factor. This is counter intuitive with what

represents the ChIP-seq data. Therefore, if some experiments are not relevant for the gene expression

prediction under the conditions studied, the weights of those experiments should be set to 0 and not

negative. This is done by imposing the weights of the 1

part of the model to be either positive or

equal to 0.

The second constraint was imposed on the bias terms of the 1

part of the network. Indeed, to

have a model biologically representative, one wants that if there is no ChIP-seq data for a certain

transcription factor, the neuron representing this transcription factor can’t be activated and involved

in the gene expression prediction. By forcing the bias to be either negative or equal to 0, it allows to

get rid of this potential obstacle in the biological interpretation of the TF-NN.

4.2.4 TF-NN

Reuniting the 1

part and 2

part together, adding the dropout layers and the penalty terms as

well as the additional constraints both on the weights and the bias terms of the 1

part leads to the

architecture of the TF-NN. This network allows to predict gene expression data using information on

the interaction between transcription factors and DNA. In addition, it has been designed to be able to

extract some biological properties on how transcription factors are regulating gene expression under

the conditions studied. The different parameters allowing to extract those biological information

will be presented in the results section. The TF-NN is schematically depicted on Figure 1.

Figure 1: Schema of the TF-NN architecture.

4.3 Alternative architecture

In addition to the TF-NN, the results for another architecture will be discussed in the next section.

This network is really close to the TF-NN, with a 1

part that is exactly the same and a second part

where simply an additional hidden layer is added. This hidden layer is fully connected, via a relu

activation function, with the layer of neurons representing the transcription factors. The layer is

formed by 10 hidden neurons, allowing to reduce even more the dimensionality of the ChIP-seq data.

A schema of this architecture can be observed on Figure 2.

Figure 2: Schema of the alternative architecture of the TF-NN.

5 Results

With its speciﬁc architecture, the TF-NN can be analyzed at different levels in order to extract some

speciﬁc biological features. In this section, it will be explored which information can be obtained

from which parameters of the model and it will be discussed whereas those results appear to be really

biologically relevant. In addition to this analysis at a single tissue level, the results obtained using

multiple tissues will also be described as well as how the model should be used in such conﬁguration.

However, before exploring the interpretability of this network, let’s ﬁrst have a look on how well it

can predict gene expression using ChIP-seq data.

5.1 Predicting gene expression

5.1.1 Training the network

The network was built using Tensor Flow and the Keras API. It allowed to acquire the necessary

modularity to construct the particular 1

part of the TF-NN. To train the network and evaluate

it on two different sets of data, the ChIP-seq and gene expression datasets have been split into a

training set and a testing set. Out of the 13’080 genes, 60% of them have been randomly selected for

the training part, leading to a total of 7’848 genes. The 7’848 genes were again subdivided into two

groups. The ﬁnal training data and the validation data, allowing to assess the performance of the

model throughout the training process and identify notably if the model is overﬁtting or underﬁtting

the training data. 10% of the 7’848 genes were randomly included in the validation dataset, leading to

a ﬁnal training dataset of 7’063 genes. The Adam algorithm was used as an optimizer in the training

process. Adam optimization is a stochastic gradient descent method that is based on adaptive

estimation of ﬁrst-order and second-order moments. It’s a computationally efﬁcient method that

is well suited for problems large in terms of parameters [

]. This optimizer has been used with a

learning rate of 0.001. Eventually, the loss function used to optimize the model was the mean square

error loss. This function computes the mean of squares of errors between the labels (true gene

expression) and predictions (predicted gene expression). The model will therefore seek to minimize

the mean square error loss during the training stage. With the dropout layers and the penalty terms,

the training loss and validation loss are converging and no overﬁtting or underﬁtting are observed.

5.1.2 Testing the network

To test the network, the testing dataset is used in order to evaluate the prediction performance of the

model on new data that were unseen during the training process. This testing set represents 40% of

the genes present in the initial gene expression and ChIP-seq data, which corresponds to a total of

5’232 genes. The model has therefore been evaluated by predicting the expression of those genes and

by comparing the predicted values with the true gene expression values across the 48 samples studied.

To assess statistically the performance of the model and the prediction accuracy, the coefﬁcient of

determination R

has been computed. This statistical result represents the proportion of the variance

in the true gene expression that is explained by the predicted gene expression. Therefore, it provides

a measure of how well the true gene expression data are approximated by the network. Testing the

TF-NN on the 5’232 genes and assessing statistically its performance by computing the average of

the coefﬁcient of determination of the various samples, it can be observed that 53.3% of variance

of the gene expression data is explained by the model. This is a quite impressive value, especially

considering that this network is designed not only to be accurate in its prediction performance but

also to be interpretable. As a mean of comparison, training and testing fully connected networks

(single layer perceptron or multilayer perceptron) under the same conditions leads to less than 47%

of variance of the gene expression data explained. In addition, those networks operate like black

boxes and are not really interpretable, thus enhancing the interest of the TF-NN to predict gene

expression.

5.2 Interpretation of the parameters - Weights analysis

5.2.1 First part of the TF-NN

In the ﬁrst part of the network, the ChIP-seq data are taken as input and the space of the data

is reduced to 705 dimensions, with each dimension representing one of the transcription factor

considered. This is done by grouping together the different experiments performed for a similar

transcription factor as well as a bias term into one neuron using a relu activation function. Therefore,

in this part there is a bias value that is computed for each neuron representing a transcription factor

and weights associated to each experiment. With the constraints that have been imposed on those

parameters, they have become biologically interpretable.

Indeed, the bias term gives an idea on how strong the interaction between the transcription factor

and the gene should be in order to involve the transcription factor in the expression prediction of

this gene. The more negative the bias, the higher the ChIP-seq data must be to activate the neuron

representing the transcription factor and involved this transcription factor in the gene expression

prediction. On the other hand, if a bias term is null for a certain transcription factor, this means that

as soon as some interactions between this transcription factor and a gene are reported, the neuron

representing the transcription factor will be part of the prediction process for this gene.

The weights of the different experiments being only positive or null, they are reﬂecting how impor-

tant each experiment is to describe the role of the transcription factor in the conditions studied. If a

weight for a certain experiment is null, this means that this experiment will never be considered in

the model, suggesting that no interesting information can be obtained from such an experiment

in the context in which the TF-NN is used. In contrary, a really high weight value indicates that the

experiment has been done in conditions that are well reﬂecting the conditions on which the gene

expression data have been obtained.

Using the weights and the bias terms together, it is possible to determine for how many genes the

neuron representing a transcription factor will be activated (positive value after the application of

the relu activation function) and thus be involved in the prediction process of the gene expression.

Performing such analysis for all the transcription factors indicates what are the transcription fac-

tors that are the most often involved in the prediction process, which gives a ﬁrst idea on which

transcription factors are taking part in the regulation of gene expression in the conditions studied.

However, those values are not representative on how important a transcription factor will be in the

gene expression process.

The results of the percentage of activation of each transcription factor obtained with the liver and

time dependent gene expression data used for this study are represented on Figure 3. This visualiza-

tion allows to observe that some neurons representing transcription factors are often activated in

the model meaning that they are involved in the expression prediction of most of the genes of the

dataset. This is the case for instance for the neuron representing Nr3c1 transcription factor. On the

other hand, certain neurons are never or almost never activated in the model meaning that they are

not really useful for this model and suggesting that the transcription factors that are associated to

those neurons are not involved in gene expression in the liver. Those percentages of activation are

directly linked to the bias values associated to the different transcription factors and for which their

absolute value is represented as red dots on Figure 3. Indeed, as mentioned previously, a very small

bias value (high absolute value) means that the ChIP-seq data for this transcription factor should be

really high in order to activate the neuron representing this transcription factor and involved it in the

prediction. Whereas, a bias value equals to 0 means that as soon as an interaction between a gene

and the transcription factor is predicted in the ChIP-seq data, the neuron of this transcription factor

will be activated in the model, involving the transcription factor in the expression prediction process.

5.2.2 Second part of the TF-NN

The 2

part of the network is linking the 705 neurons representing the transcription factors to the

48 output neurons of gene expression prediction for each sample. The link is simply done via a linear

activation function and an additional bias term. Therefore, the analysis of the bias and of the weights

bring some interesting information on how the transcription factors may regulate gene expression

in the liver.

Starting with the bias term, this parameter is representative of the basal gene expression level of the

dataset and allows to get an idea on how much information is missing in the ChIP-seq data. Indeed,

to predict gene expression in the liver, the ChIP-seq data for 705 transcription factors are considered.

However, the number of known transcription factors in mice, as well as in humans, is more than

the double of the transcription factors considered (

∼

1600 annotated transcription factors in mice

[

]). Of course not all of those transcription factors are expressed in the liver, it’s even mentioned in

Figure 3: Barplot showing the percentage of genes (y-axis) that involve the transcription factors (x-axis,

listed in alphabetical order) in their expression prediction. The red dots are the absolute bias values

(y-axis) linked to the different transcription factors (real values are only negative or null, by design).

the literature that only around 600 transcription factors are active in this tissue [

]. Although, the

expression of the 705 transcription factors has been observed with the liver gene expression data

used for the project and it appeared that 503 of them were expressed, meaning that the ChIP-seq

data for around 100 transcription factors is lacking to account for all transcription factors expressed

in the liver, in the gene expression prediction process. Therefore, the bias term of this 2

part is

kind of representative of the effect of those 100 missing transcription factors in the gene expression

regulation.

The analysis of the weights going from the neuron representing the transcription factor to the

output neurons can be done at two levels: looking at the average effect of the transcription factor

on gene expression prediction or looking on how the weights vary between the different samples.

Looking at the average effect, meaning computing the mean of the 48 weights going from a neuron

representing a certain transcription factor to the different output neurons, allows to determine

whether a transcription factor will tend to act as a repressor or an enhancer of gene expression in

the network. It allows also to get an idea on how important a transcription factor will be to the ﬁnal

gene expression prediction as the higher the average weights in absolute value, the higher its effect

may be on the prediction. Hence, this approach gives a general idea of the effect of the transcription

factor. However, it doesn’t allow to observe the ﬂuctuation of the weights across the different samples,

which is a feature of great interest in the way the network has been designed in order to observe the

speciﬁcities (time dependency) of the gene expression data used.

Therefore, to observe how the weights are dependent on the output samples considered, it’s possible

to visualize the weights going from the neuron representing the transcription factor to the output

neuron in function of the output neuron considered. This is what is represented on Figure 4. Very

interestingly, it can be observed that some rhythmic patterns appear in the weights of certain

transcription factors. Those rhythmic patterns observed for the Arntl and Nr3c1 transcription factors

for example have a period of 6 samples which corresponds to a period of 24 hours (the 6 samples

of 1 replicate). Having the weights of some transcription factors showing rhythmic patterns is the

reﬂect of what happen in the cells with the circadian rhythm as we are predicting gene expression at

different period of the day and in the liver. This feature of the network is of particular interest because

not every neuron representing a transcription factor is showing rhythmicity in its weights. Therefore,

by analyzing those weights using a more systematic approach like a Fourier analysis, it can be

observed whether a transcription factor is likely to be involved in the circadian rhythm (oscillations

of the weights in function of the samples) or not (constant weights across the samples). The results

observed for the neuron representing Clock transcription factor are showing less rhythmicity than

for the other two transcription factors visualized. As it is known that Clock transcription factor is

a core component of the circadian rhythm, clear rhythmic patterns were expected. The reason

why less rhythmicity is observed may be explained by the fact that this transcription factor forms

an heterodimer with Arntl before activating the expression of downstream genes. Therefore, the

rhythmicity of this complex may have been learned mostly for the Arntl transcription factor, showing

some limitations of this approach to fully determine what are the transcription factors involved in

the circadian rhythm.

Figure 4: Variation of the weights going from the neuron representing a transcription factor to the

output neurons across the samples (intron samples before the vertical red dashed line, exon samples

after it). Weights for neurons representing Arntl, Nr3c1 and Clock transcription factors are represented.

5.2.3 Full TF-NN

Considering the whole network, it becomes possible to determine the exact effect of each transcrip-

tion factor on the expression prediction of a certain gene. Indeed, by considering the ChIP-seq

data of a gene and using the different parameters of the trained network, it is possible to follow

the prediction process and determine to which extent each transcription factor has been involved.

This allows to give some insights on how a speciﬁc chosen gene is regulated. The disadvantage is

that it is an analysis done at a very ﬁne level of granularity, which doesn’t allow to determine the

global process regulating gene expression. However, by considering all the genes of the dataset and

averaging the effect of the different transcription factors on each of those genes, it allows this time to

determine the importance of each transcription factor in the prediction of gene expression under the

conditions studied. Therefore, doing this analysis on the dataset considered for this project, it allows

to determine what are the transcription factors regulating gene expression in the liver at speciﬁc

times.

This is what is represented on Figure 5. On the x-axis are the 705 transcription factors and on

the y-axis their average contribution to the gene expression prediction. So, looking at how each

transcription factor contributes to the gene expression prediction, it gives an idea of what are the

transcription factors that are the most important and the most involved in the conditions studied. It

can be observed in this case that the three most involved transcription factors in the gene expression

prediction are Nelfa, Clock and Brd4. Those results seem to be quite biologically relevant and rep-

resentative on how gene expression is regulated in the liver through time. Indeed, out of the most

important transcription factors, it can be observed that some of them are involved in general process

of gene expression regulation: For instance, Nelfa participates in the regulation of RNA polymerase

II transcription elongation [

], Brd4 is an epigenetic reader domain thus involved in histone regu-

lation [

] and Nr3c1, which is the glucocorticoid receptor, is expressed on almost all cell types in

the body and regulates the expression of many genes [

]. Whereas the other transcription factors

that participate the most to gene expression prediction are more speciﬁc to the liver or the circadian

rhythm: Clock plays a central role in the regulation of circadian rhythms [

], Hnf4a is essential

for circadian rhythm maintenance and period regulation in the liver and acts as a transcriptional

regulator controlling the expression of several hepatic genes [17] and Onecut1, which expression is

enriched in liver, stimulates transcription of liver-expressed genes [18].

Figure 5: Mean contribution (y-axis, in percent) of all transcription factors (x-axis, listed in alphabeti-

cal order) to the gene expression prediction.

Those key results tend to demonstrate that, with the constraints added on the different parameters

of the network, the transcription factors appearing to be the most important for the model to predict

gene expression are probably reﬂecting (at least partially) how gene expression is regulated in the

liver.

In addition to this approach done for the whole dataset, it can also be interesting to reduce the

number of genes considered in order to look only at group of genes involved in a common biological

process. This allows to determine which transcription factors are the most important for a smaller

set of genes, reducing the averaging effect on the results observed. Therefore, this weight analysis of

the whole TF-NN can be done at three different levels (one gene, a group of gene or all the genes of

the dataset) allowing to analyze gene regulation at any granularity.

5.3 TF-NN with an additional hidden layer

Adding a hidden layer of few hidden neurons between the layer of neurons representing the tran-

scription factors and the output neurons (Figure 2) shows both great advantages and disadvantages.

On the positive side, by designing the neural network like this, it allows to reduce greatly the di-

mensionality of the ChIP-seq data used as input data. Indeed, one of the disadvantage of using

those data is that they are of high dimensions (more than 10’000 conditions considered) and not

all the experiments are necessarily relevant for the conditions considered. Therefore, it appears

essential to reduce the space of those data before predicting gene expression. This is what is already

done with the TF-NN that reduced the space of the ChIP-seq data to 1 dimension per transcription

factor leading to 705 values. However, it appears that adding an additional hidden layer of 10 hidden

neurons (or more, but being interested by the dimensionality reduction a smaller number of hidden

neurons is of greater relevance) allows to obtain interesting prediction results. Using the same

training and testing approach than the one described for the TF-NN, it appears that the trained

model of this architecture allows to explain 54.1% of variance. This is a slightly better result than the

one obtain with the TF-NN (53.3%).

However, on the negative side, some interpretability of the network is loss. Indeed, ﬁrst it may

be thought that such a model would be more representative of what happen in the cells with the

hidden neurons representing group of transcription factors operating together to regulate gene

expression. Though, by analyzing the weights distribution on those hidden neurons, it appears

that the transcription factors are not really clustering into those different hidden neurons. Some

additional constraints or complexity should be added in this model to obtain such results. Hence,

how it stands, this model is less interpretable than the TF-NN because the direct link between the

neurons representing the transcription factors and the output neurons is loss. So, condensing even

more the information of the ChIP-seq data works well for the aim of predicting gene expression but

comes at the cost of a loss of interpretability, resulting in a model that is not yet able to offer a better

compromise between prediction accuracy and interpretability than the one of the TF-NN.

5.4 Multi tissue study

For this study, data for different tissues were needed. The data used were therefore the second set of

gene expression data described in the data section. It corresponds to gene expression data for the

liver, the muscle and the white adipose tissue.

5.4.1 Independent models

Before exploring how the TF-NN can be used on multiple tissues, an independent model has been

trained and tested on gene expression of each tissue. It allowed to evaluate that the results obtained

with the ﬁrst dataset on the liver tissue were not speciﬁc to this dataset. This is indeed what has

been observed with the results obtained for the liver tissue of this new set of data. The coefﬁcient of

determination for this new model assesses that 53% of variance of the gene expression was explained,

which is in the same range as the results obtained with the previous gene expression dataset (53.3%).

However, using the gene expression data of the two other tissues, the performance of the TF-NN

decreases to 47% and 43.3% of variance explained for the gene expression data of the muscle tissue

and the data of the white adipose tissue respectively.

The reason of this decrease in prediction performance may ﬁnd its source directly in the ChIP-seq

data. Indeed, those data are collected from various experiments performed under many different

conditions and cell types or tissues. Therefore, not all the tissues are similarly represented in those

data. It appears that 646 experiments have been performed in the liver tissue (out of the 10’805)

whereas only around 150 experiments have been performed in muscle tissue and even less for the

white adipose tissue. The decrease observed in the model performance is thus probably the reﬂect

of the decrease of those tissue speciﬁc conditions in the ChIP-seq data.

5.4.2 Use of the TF-NN on multiple tissues

Using the same network for multiple tissues is of great interest. With such approach, it would be

possible to train a model with one or multiple tissues before looking at how well this model could

perform on a new tissue, allowing to explore similarities in gene expression regulation between

tissues for instance.

However, using a same model on different tissues is not necessarily straightforward with the TF-NN.

Indeed, a major problem is encountered with the input data. It’s the same input dataset that is used

for every tissue as the ChIP-seq data are not tissue speciﬁc but reunite a large number of experiments

performed under different conditions. The direct consequence of this problem is that it becomes

impossible to train a model on one tissue before testing it on another tissue as nothing will be speciﬁc

to the second tissue so the predicted gene expression values would be the same than the ones of the

ﬁrst tissue.

That’s why another approach should be considered to perform a multi tissue study. The method

consists to train the whole network on one or many tissues before freezing a part of this trained

model, meaning to block the values of the parameters and render them untrainable, and training the

unfrozen part for certain genes of a new tissue to eventually test the whole model for the other genes

of the new tissue. Like this, it allows to have a part of the model that is tissue dependent and another

part that is tissue independent, global for the different tissues.

With the TF-NN two possibilities arise with this approach. Either the 1

part of the TF-NN (Figure 6)

or the 2

part can be frozen after the training on one or some tissues. Freezing the 1

part means

that the link between the ChIP-seq data and the neurons representing the transcription factors will

be unspeciﬁc to the to the tissue for which gene regulation is studied, whereas the link between the

neurons representing the transcription factors and the prediction neurons will be speciﬁc to this

tissue, allowing to ﬁne tune the effect of each transcription factor on gene expression once they are

activated.

Figure 6: TF-NN with its 1st part frozen.

Freezing the 2

part of the network, this time the weights of each experiment of the ChIP-seq data

as well as the threshold set on those data will be speciﬁc to the tissue studied and it will be the effect

of each transcription factors on gene expression that will be unspeciﬁc to the new tissue.

5.4.3 Freezing the ﬁrst part of the network

Performing this approach and using the gene expression data of the liver tissue to train the whole

model, before freezing the 1

part and training/testing this model with gene expression data of

either the muscle or the white adipose tissue, leads to noteworthy results.

First, it can be observed that such models lead to a decrease in the prediction performance. In the

two different cases tested, the coefﬁcient of determination is reduced. For the muscle tissue, 39.6%

of variance of the gene expression data is explained with this approach instead of 47% of variance

explained when using an independent model. Whereas for the white adipose tissue the coefﬁcient

of determination goes from 43.3% to 37.2% of variance explained when freezing the 1

part of the

network. This decrease in prediction performance is pertinent as not all the parameters are speciﬁc

to the tissue for which gene expression is predicted. However, besides the prediction results, the

biological interpretability of the parameters is the central interest of this project. This is where this

approach appears to bring some remarkable outcomes.

When freezing the 1

part of the network, the only parameters that are modiﬁed, after that the

model has been trained on the liver tissue, are the weights going from the neurons representing the

transcription factors to the output (prediction) neurons. Therefore, by looking at how the value of

those weights changes from when the model has been trained on the liver tissue only to when it has

also been trained on the muscle (or white adipose) tissue, it allows to compare how the transcription

factors are differentially involved in gene expression in those two tissues. This is what is represented

on Figure 7. Those results seem to be actually well representative of the transcription factors that are

speciﬁc to the tissues studied. Indeed, on the negative part of the ﬁgure, the transcription factors

that have higher weights in the liver and thus are more important in gene expression prediction

of this tissue can be found. It appears that the transcription factors present in this part are for the

majority of them transcription factors that are truly speciﬁc to liver genes. Besides the transcription

factors already described in this report like Clock, Hnf4a or Onecut1, Hnf1a can be found, which is

a transcription factor required for the expression of several liver-speciﬁc genes [

] or also Foxa1

and Foxa2 that are hepatocyte nuclear factors acting as transcriptional activators for liver-speciﬁc

transcripts [

]. On the other side of the ﬁgure, where the transcription factors that are more speciﬁc

to the muscle can be found, again it seems to be the reﬂect of what happen in the cells with for

instance the Myog transcription factor that promotes transcription of muscle-speciﬁc target genes

[

], the transcription factors of the Mef2 family (Mef2c and Mef2d) that are involved in the control

of muscle differentiation and development [

] or the Myod1 transcription factor that promotes

transcription of muscle speciﬁc target genes [23].

Figure 7: Difference in weights of the 2

part between the muscle and liver tissue after using the

freezing approach for the 1st part of the TF-NN.

Therefore, this approach of training the whole model for a ﬁrst tissue before freezing the 1

part and

training/testing using another tissue reduces, on one hand, the gene expression prediction but, on

the other hand, allows to determine what are the tissue speciﬁc transcription factors between the two

tissues studied. In addition, looking at the differences between two tissues show a supplementary

advantage in comparison to the results obtained using a single tissue is that it allows to get rid (or at

least reduce) the effect of the general transcription factors as those ones will be involved in the two

tissues. Hence, it provides a real idea on what are the transcription factors particularly important

for gene expression regulation of a certain tissue. Indeed, it can be seen that general transcription

factors that were observed as some of the most important ones for gene expression regulation in the

liver like Brd4 or Nr3c1 (Figure 5) are not present anymore when looking at the difference between

two tissues (Figure 7).

5.4.4 Freezing the second part of the network

Doing the opposite than in the previous section, meaning to train the whole model with gene

expression data of the liver before freezing this time the 2

part and training/testing this model with

gene expression data of either the muscle or the white adipose tissue, an almost similar decrease in

the prediction performance can be observed.

For the muscle tissue, 38.8% of variance of the gene expression data is explained with this approach

(39.6% when freezing the 1

part, 47% when using an independent model). Whereas for the white

adipose tissue the coefﬁcient of determination goes from 43.3% to 38.3% of variance explained when

freezing the 2nd part of the network. Again, this decrease seems relevant but is still important.

However, the prediction results may be improved using more than a single tissue to train the model

before freezing the 2

part. To test this hypothesis, two independent model were trained. One with

the liver tissue and one with the muscle tissue. The averages of the weights of the 2

part between

those two independent models have been computed before setting those results as the weights of the

part of a new model, freezing this 2

part of the network and training/testing this new model

on the white adipose tissue. With this approach the coefﬁcient of determination reaches a value of

39.8%. This is still lower than the 43.3% of variance explained obtained when training the whole

model with the gene expression data of the white adipose tissue. However, it is a better prediction

result than when freezing the 2

part on the network directly after having trained the model on the

liver gene expression data. Therefore, this tendency may mean that using multiple tissues for the 2

part of the network before freezing it is beneﬁcial as it allows to determine a kind of average effect of

each transcription factor on gene expression (not tissue speciﬁc anymore) that is thus better suited

for the method consisting of freezing the 2nd part of the network.

6 Discussion

The understanding of how transcription factors regulate gene expression under speciﬁc conditions

and on a genome wide scale is of great interest with the yet well established high-throughput tech-

nologies allowing the measurement of genome-wide mRNA expression. Here is introduced the

TF-NN, a neural network that uses chromatin immunoprecipitation sequencing data to predict gene

expression data. Through this prediction process, the model learns parameters that are speciﬁc to the

condition of interest on which the gene expression data have been collected. Those parameters can

then be extracted to obtain insights on how the transcription factors are regulating gene expression

under those conditions.

Indeed, by adding various constraints on different parameters of the network, the TF-NN has been

designed to be representative of what may happen biologically allowing to both predict gene expres-

sion data and extract biological properties from its parameters. The use of ChIP-seq data that are not

tissue speciﬁc but reunite more than 10’000 experiments performed for different cell types, under

particular external stress or collected at speciﬁc times, allows this model to be employed for gene

expression data obtained under many different conditions.

Using the TF-NN with gene expression data collected in the mouse liver at speciﬁc times, it has

been shown that this network allows notably to determine how important the different transcrip-

tion factors are in the regulation of gene expression under those conditions or to observe which

transcription factors may be involved in the circadian rhythm by looking at the oscillating weights

appearing for some neurons. Those results look quite relevant considering the known effects of

some transcription factors described in the results section. Therefore, the duality (prediction and

interpretation of the weights) of the TF-NN is certainly noteworthy to get some insights on how the

transcription factors are involved in the regulation of gene expression in the liver and at speciﬁc

times.

Another advantage of the TF-NN is that its architecture allows to reduce the space of the ChIP-seq

data to 1 dimension per transcription factor, leading to 705 dimensions. Indeed, the interaction

data considered between the transcription factors and DNA come from the ChIP-Atlas database

that reunite all the public ChIP-seq data submitted to the Sequence Read Archives. Therefore, this

leads to more than 10’000 experiments considered to account for many transcription factors, which

is way too many information to look at to understand how gene expression is regulated by those

transcription factors. However, with this architecture and the dimensionality reduction, it renders

those data more interpretable and somehow more useful on a genome-wide scale. It has been shown

that for the only aim of prediction of gene expression in the liver and at speciﬁc times the space can

even be reduced up to 10 dimensions using the architecture schematically depicted on Figure 2.

Eventually, even though the ChIP-seq data are not tissue speciﬁc and therefore are used as input

of the network for any tissue studied, the TF-NN can be used with multiple tissues to extract some

tissue speciﬁcities. Indeed, by using an approach allowing to have parameters that are learned with

some tissues and others that are speciﬁc to the tissue and the conditions of interest, it becomes

possible to identify transcription factors that are really speciﬁc to the tissue studied and to reduce or

get rid of the effects of the more global transcription factors. It has also been shown that using two

tissues to determine the average effect of each transcription factor on gene expression and then only

render the 1

part of the model tissue speciﬁc allows to get prediction results that are closer to the

ones obtained using an independent model. Therefore, it may be interesting for further studies to

test this feature of transcription factor average effect on gene expression using even more tissues in

order to observe if the more tissues considered the closer the prediction results may be to the ones

of an independent model.

There are of course some limitations to the approach of using the TF-NN to understand how gene

expression is regulated in the conditions studied. For instance, using only linear relationships (relu

activation function or linear activation function) the TF-NN won’t be representative of saturation

effects that are occurring in the cells. This approach also assumes that a transcription factor acts

only either as an activator or as a repressor for all the genes considered in the dataset whereas the

effect of each transcription on gene expression may be positive for some targets and negative for

others. Additionally, one limitation appears with the information that is provided by the ChIP-seq

data. Indeed, those data relate the interaction between the transcription factor and the potential

target genes located at a maximal (speciﬁed) distance. This restricts the analysis to the assumption

that a transcription factor affects the expression of the gene that is the most proximal to the ob-

served binding site. Whereas, in fact this transcription factor may act on another DNA component

that could impact a distal target gene [

]. Eventually, a last limitation concerns the links between

the neurons representing the transcription factors and the prediction neurons. Those links are

direct in the TF-NN with transcription factors that are therefore considered to directly regulate gene

expression. However, this is not exactly what happen in the cells. The transcription factors are

interacting together to regulate gene expression, forming heterodimers or activating/repressing each

other. Implementing those additional biological properties in the TF-NN, by adding some neurons

representing pairs of transcription factors for instance, would lead to a further increase in the model

representativity of the gene expression regulation allowing to extract some complementary infor-

mation on how the transcription factors are interacting together and what are the effects of those

interactions on gene expression.

7 Acknowledgments

I would like to thank the Laboratory of Computational and Systems Biology for this very interesting

project. A special thanks to Lorenzo Talamanca, who monitored the progress of the project and

offered his precious help and advices every week, to Professor Naef, who brought all his expertise and

key advices to help in the design of the TF-NN and the analysis of the results and to Cédric Gobet,

who set up really well the initial working environment and the data provided as well as suggested

some nice features to add to the TF-NN.

References

[1] “Protein function.”

Available at: https://www.nature.com/scitable/topicpage/protein-function-14123348/.

[2]

D. S. Latchman, “Transcription factors: an overview.,” vol. 74, no. 5, pp. 417–422. International

Journal of Experimental Pathology.

[3] P. J. Balwierz, M. Pachkov, P. Arnold, A. J. Gruber, M. Zavolan, and E. van Nimwegen, “ISMARA:

automated modeling of genomic signals as a democracy of regulatory motifs,” vol. 24, no. 5,

pp. 869–884. Genome Research.

[4] “How to study activation of transcription factors | abcam.”

Available at: https://www.abcam.com/kits/methods-to-study-transcription-factor-activation.

[5]

S. A. Lambert, A. Jolma, L. F. Campitelli, P. K. Das, Y. Yin, M. Albu, X. Chen, J. Taipale, T. R.

Hughes, and M. T. Weirauch, “The human transcription factors,” vol. 172, no. 4, pp. 650–665.

Cell.

[6] “ChIP-Atlas.”

Available at: https://chip-atlas.org/.

[7] “ChIP-Atlas | github.”

Available at: https://github.com/inutano/chip-atlas.

[8]

F. Atger, C. Gobet, J. Marquis, E. Martin, J. Wang, B. Weger, G. Lefebvre, P. Descombes, F. Naef, and

F. Gachon, “Circadian and feeding rhythms differentially affect rhythmic mRNA transcription

and translation in mouse liver,” vol. 112, no. 47, pp. E6579–E6588. National Academy of Sciences

Section: PNAS Plus.

[9]

D. Gaidatzis, L. Burger, M. Florescu, and M. B. Stadler, “Analysis of intronic and exonic reads in

RNA-seq data characterizes transcriptional and post-transcriptional regulation,” vol. 33, no. 7,

pp. 722–729. Nature Biotechnology.

[10] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980 [cs].

[11]

Q. Zhou, M. Liu, X. Xia, T. Gong, J. Feng, W. Liu, Y. Liu, B. Zhen, Y. Wang, C. Ding, and J. Qin, “A

mouse tissue transcription factor atlas,” vol. 8, no. 1, p. 15089. Nature Communications.

[12]

K. Sun, H. Wang, and H. Sun, “mTFkb: a knowledgebase for fundamental annotation of mouse

transcription factors,” vol. 7, no. 1, p. 3022. Scientiﬁc Reports.

[13]

“NELFA negative elongation factor complex member a [homo sapiens (human)] - gene - NCBI.”

Available at: https://www.ncbi.nlm.nih.gov/gene/7469.

[14] “Tocris Bioscience - Cell Biology - Bromodomains.”

Available at: https://www.tocris.com/cell-biology/bromodomains.

[15]

L. Quatrini and S. Ugolini, “New insights into the cell- and tissue-speciﬁcity of glucocorticoid

actions,” pp. 1–10. Cellular & Molecular Immunology.

[16] “CLOCK clock circadian regulator [homo sapiens (human)] - gene - NCBI.”

Available at: https://www.ncbi.nlm.nih.gov/gene/9575.

[17]

M. Qu, T. Duffy, T. Hirota, and S. A. Kay, “Nuclear receptor HNF4a transrepresses CLOCK:BMAL1

and modulates tissue-speciﬁc circadian networks,” vol. 115, no. 52, pp. E12305–E12312. Pro-

ceedings of the National Academy of Sciences.

[18] “ONECUT1 one cut homeobox 1 [homo sapiens (human)] - gene - NCBI.”

Available at: https://www.ncbi.nlm.nih.gov/gene/3175.

[19] “HNF1a HNF1 homeobox a [homo sapiens (human)] - gene - NCBI.”

Available at: https://www.ncbi.nlm.nih.gov/gene/6927.

[20] “FOXA1 forkhead box a1 [homo sapiens (human)] - gene - NCBI.”

Available at: https://www.ncbi.nlm.nih.gov/gene/3169.

[21] “MYOG myogenin [homo sapiens (human)] - gene - NCBI.”

Available at: https://www.ncbi.nlm.nih.gov/gene/4656.

[22] “MEF2d myocyte enhancer factor 2d [homo sapiens (human)] - gene - NCBI.”

Available at: https://www.ncbi.nlm.nih.gov/gene/4209.

[23] “MYOD1 myogenic differentiation 1 [homo sapiens (human)] - gene - NCBI.”

Available at: https://www.ncbi.nlm.nih.gov/gene/4654.

[24]

B. Deplancke, D. Alpern, and V. Gardeux, “The genetics of transcription factor DNA binding

variation,” vol. 166, no. 3, pp. 538–554. Cell.