TF-NN
A biologically interpretable neural network
predicting gene expression from ChIP-seq data
Semester Project - Report
January 2021
Evan Béal
In the Laboratory of Computational and Systems Biology
1
1 Abstract
Understanding the contribution of the different transcription factors in the process of gene expression
regulation at a genome-wide level is a key challenge in molecular biology. With the progress of the high-
throughput technologies, more and more gene expression data are collected under specific conditions.
Linking those data to predicted interactions data between transcription factors and genes (ChIP-seq
data) is therefore of great interest to understand the regulatory process leading to the gene expression
observed. This is what is proposed with the TF-NN, a neural network that has been developed to
determine how the transcription factors are controlling gene expression. Given ChIP-seq data, the
TF-NN predicts gene expression data obtained under certain conditions of interest. The network has
been designed to account for some biological properties allowing to extract key information on the
effect of each transcription factor on gene expression by looking at the parameters of the model trained
for the prediction task. Using gene expression data collected in the liver and at specific times, the
results for the prediction task show that 53.3% of variance of the gene expression is explained with
the ChIP-seq data. The transcription factors being the most involved in the prediction appear to be
either some global transcription factors or transcription factors known to be involved in the liver or
the circadian rhythm process. Hence, it assesses that some mechanisms regulating gene expression in
the cells are reflected in the parameters of the model. The analysis of the TF-NN can be extended to
multiple tissues following a specific process. Eventually, an alternative architecture of the TF-NN can
be used to reduce the dimensionality of the ChIP-seq data provided at the input of the model, going
from more than 10’000 experiments considered to only 10 dimensions.
2 Introduction
Proteins are responsible of nearly every task of cellular life and determine therefore the health,
function or interaction of the cells in which they are involved [
1
]. A functional protein results from
a complex process that begins with the DNA transcription. Hence, the transcription process plays
a central role in gene expression since it represents a critical first step. In addition, it is also a key
process in specificity as it is the primary target for gene regulation that results in different proteins
being produced in different tissues [2].
Specific protein factors are responsible of both the basal transcription and its regulation. They are the
transcription factors. Those regulators of gene expression bind to cognate sites in DNA and regulate
the rate of transcription initiation [
3
]. Their activity is therefore regulated in order to produce specific
patterns of gene expression [
2
]. Upon binding to DNA, a transcription factor interacts with other
factors or RNA polymerase to initiate or repress DNA transcription.
The measure of transcription factor activity is of great interest to understand the key process of
transcription factor binding to DNA in gene expression regulation. Different assays can be used to
study transcription factor activation. Besides electrophoretic mobility shift assay or ELISA-based
assay, there is the ChIP-seq method, which stands for Chromatin ImmunoPrecipitation sequencing.
This assay is used to identify protein-DNA binding events in their natural intracellular context. It
has the advantage of working in vivo, allowing to study transcription factor interaction changes over
time. By quantifying the relative abundance of the transcription factor at different locations of the
genome, it allows to determine which genes may be regulated by the transcription factor studied [
4
].
Transcription factors playing a central role in transcription which itself is a key process in gene
2
expression, the results obtained from a ChIP-seq assay under certain conditions are linked somehow
to the gene expression of the cell under the same conditions. Therefore, by determining gene
expression using for instance the RNA-seq technique, allowing to quantify RNA in a biological
sample at a given moment and analyze the continuously changing cellular transcriptome, it becomes
possible to link gene expression with ChIP-seq data of transcription factors.
This approach has notably been used in the MARA (Motif Activity Response Analysis) methodology
[
3
]. Considering as input a set of genome-wide gene expression across a number of samples, the
MARA model uses motif activity response analysis to identify which are the transcriptions factors
that are inducing gene expression changes across the samples [
3
]. Several advantages arise from
this approach that allows to obtain concrete predictions on the key regulators under the conditions
studied. However, as the method is based on the DNA motifs on which the transcription factors bind,
it may be subject to some limitations. Indeed, the transcription factor binding sites are small and
flexible, meaning that decoding how gene regulation relates to transcription factor binding motifs
remains challenging [
5
]. Many transcription factors can bind to similar motifs as they are not strictly
restricted to a specific sequence motif. This means that when identifying the effect of a motif on
gene expression, it cant be fully determined which specific transcription factor is responsible of this
effect. In addition, some transcription factors have to work together to get anything done which is
not something that can be retrieved from the analysis of the enriched sequence motifs. That’s why,
even though the MARA model has been proven to be able to detect key regulatory interactions using
gene expression data and motifs data from ChIP-seq experiments [
3
], it appears that working directly
with ChIP-seq data obtained for specific transcription factors could allow to bring an additional
interpretability layer to the relationship between gene expression regulation and the transcription
factors at the genome-wide level.
Its from this idea that arises the TF-NN (Transcription Factors Neural Network). Given as input
ChIP-seq data representative of the interactions between transcription factors and genes, the TF-NN
predicts gene expression using deep learning. In addition to this predicting feature, this neural
network has also been designed to be representative of some biological properties, allowing the
analysis and interpretation of the model parameters to retrieve information on the gene expression
regulation under the conditions studied.
However, to be able to perform this analysis at the genome-wide level, ChIP-seq data for many
transcription factors and under many different conditions have to be available. Thanks to the ChIP-
Atlas database, it has been possible to obtain a great amount of data on the interactions between
transcription factors and genes allowing this deep learning approach for gene expression prediction.
The ChIP-Atlas is a database where almost all public ChIP-seq data submitted to the Sequence Read
Archives have been collected [
6
]. This leads to a total of more than 279’000 experiments performed
mostly in Mus Musculus and Homo Sapiens organisms [
7
]. Among the different features available,
the Target Genes was the one of interest for this project as it reunites all the genes that are predicted
to be directly regulated by a transcription factor. This is determined using the transcription factor
binding profiles of all public ChIP-seq data, allowing to obtain all the reported interaction results
for a transcription factor. By considering an important number of transcription factors, it becomes
possible to build the TF-NN using those data to predict gene expression.
3
3 Data
In this section, the data used for the project and to build the neural network are described. The aim
of the network being to predict gene expression using ChIP-seq data, both a gene expression and a
ChIP-seq dataset are needed for the output and the input of the model respectively.
The ChIP-seq dataset represents a subset from the target genes feature of the ChIP-Atlas. Those
data are descriptive of the observed and reported interactions between transcription factors and
their predicted target genes, measured using chromatin immuno-precipitation sequencing. For
this project, the data for 705 transcription factors are kept. This corresponds to a total of 10’805
experiments that were performed in mice. Those experiments, collected in the ChIP-Atlas database,
have reported ChIP-seq data under many different conditions: different cell types, cells under
specific external stress or data obtained at specific times. The 705 transcription factors considered
have a total of 13’080 target genes.
The gene expression dataset corresponds to RNA-seq gene expression data of samples from mouse
liver tissue collected at different period of the day. The dataset is a subset from the one used in
the paper: "Circadian and feeding rhythms differentially affect rhythmic mRNA transcription and
translation in mouse liver"[
8
]. The samples are taken at 6 specific time points for 4 biological
replicates. RNA-seq data were collected both in the nucleus before the splicing of the RNA, resulting
in intronic reads, and in the cytoplasm after the splicing, corresponding to exonic reads. Therefore,
those data provide a good proxy for transcription rates (intronic reads) and for mRNA abundance
(exonic reads) [
9
]. For each replicate, the samples are taken at 6 specific time points in the two
cases. This leads to a total of 48 samples and data for 13’080 genes for which ChIP-seq data are also
available.
Eventually, additional gene expression data were used to assess the model on different tissues than
the liver and explore how this network can be used with multiple tissues. Those additional data are
RNA-seq data of samples from 3 different tissues (liver, muscle and white adipose tissue). For each
tissue, the samples are collected at 15 specific time points for 2 biological replicates. Once again,
the RNA-seq has been performed both in nucleus before the splicing and in the cytoplasm after
the splicing, leading to 30 samples per replicate and therefore a total of 60 samples. Data for 14’337
genes were obtained with 12’117 of them that are common to the genes with available ChIP-seq data.
4 Methods
Below, the architecture of the TF-NN is described after that the data preprocessing has been stated.
In addition, it will be specified how this model has been designed to be representative of some
biological properties. Eventually, a different architecture of the TF-NN, allowing to highly reduced
the dimension of the ChIP-seq data, will be presented.
4.1 Data preprocessing
The preprocessing performed on the ChIP-seq data aimed to get rid (or at least reduce) the differences
observed in the interactions between transcription factors and genes that are due to experimental
design. Indeed, as the ChIP-seq data used for this project come from more than 10’000 experiments,
4
it’s important to reduce those technical differences to be able to have some real biological insights of
how the transcription factors really regulate gene expression. Thats why two preprocessing steps
were performed on those data. The first one consisted to scale the interaction values between 0
and 1 at the experiment level. This means that for each experiment considered, all the interaction
values are divided by the maximum value reported in the experiment. This allows to have all the
experiments scaled to the same range of values meaning that no experiment will be more important
simply because all the values reported were higher than in the other experiments. The second pre-
processing step consisted to normalize the L2 norm to length 1 at the gene level. This allows to give
the same weight in the analysis to the lowly and highly expressed genes. This is necessary because its
expected for a highly expressed gene to have high interaction values for many transcription factors.
Therefore, without this preprocessing, this gene would bring more information to the model than a
lowly expressed one which is not a desired feature.
For the gene expression data, the data used were already preprocessed by computing the RPKM
(Reads Per Kilobase Million) and then log2 transforming those results. Therefore, no additional pre-
processing of the gene expression data were performed. Indeed, as the values reported were already
RPKM, this means that the gene expression values were already normalized for sequencing depth
and gene length. Hence, the differences between samples are already considered by normalizing for
sequencing depth as well as the one due to the gene length, allowing to use those gene expression
data without any additional preprocessing.
4.2 Architecture of the TF-NN
The TF-NN aims to predict gene expression data using ChIP-seq data. In order to perform this task it
needs to first process the ChIP-seq data before linking the processed results together to predict gene
expression. Therefore, it has been designed in two parts that will be detailed separately.
4.2.1 First part of the network
The 1
st
part of the network takes as input the ChIP-seq data. Each experiment is represented by
an input neuron. However, those experiments are not fully connected to the next layer. Indeed, in
order to consider the specificities of the ChIP-seq data and to obtain a model that can be biologically
interpretable, only the experiments that have been collected for a same transcription factor are
connected to a same neuron. This neuron is therefore representative of the specific transcription
factor on which the ChIP-seq experiments have been performed. Hence, this first part of the network
reduces the space of the ChIP-seq data to 705 dimensions, with each dimension representing one of
the transcription factor considered. The different experiments performed for a similar transcription
factor are grouped together into one neuron using a relu activation function. It’s interesting to use
this function in the 1
st
part as we want to obtain a network representative of what happen in the cells.
The relu activation function assigns to the neurons representing the transcription factors either a
positive or a null value. Therefore, given a certain gene, if a positive value is obtained for the neuron
corresponding to a transcription factor, it is representative of a transcription factor that is involved
in the expression of this gene. Whereas, if a null value is obtained, it suggests that this transcription
factor has no influence on the expression of the gene studied. Eventually, a bias term is added for
each group of experiments of a transcription factor. It’s useful to learn a certain threshold on the
5
ChIP-seq data that is descriptive of how important the interaction between the transcription factor
and the DNA should be to have this transcription factor involved in the gene expression prediction.
4.2.2 Second part of the network
Given as input the 705 neurons resulting from the 1
st
part of the model and representing the different
transcription factors, the 2
nd
part aims to link those neurons to the output neurons. The output
neurons correspond to the different samples present in the gene expression data. Therefore, 48
output neurons are resulting from the TF-NN. The neurons representing the 705 transcription factors
are directly linked (fully connected) to the 48 output neurons predicting gene expression of the
different samples. This direct link allows to observe how the different transcription factors are
involved in the gene expression prediction. A linear activation function is used in this second part of
the network. This function was considered as a linear relationship is expected between how strong
the transcription factors interact with DNA and how expressed a gene is. A bias term is also added
in this part of the network to account for all the transcription factors that may be expressed in the
conditions studied but for which no ChIP-seq data are available. The bias value represents a kind
of basal level of the gene expression dataset as if no ChIP-seq data are available for a new gene, its
expression will be predicted at this bias value.
To avoid overfitting of the training data and have a model that learns the general features of the
gene expression distribution and dont fit the noise of the training data, both dropout layers and L2
penalty terms have been added in the two parts of the network.
4.2.3 Additional constraints to be more biologically representative
The aim of the TF-NN being to be both accurate in the gene expression prediction and interpretable
when analyzing its parameters, some additional constraints were added in this model to render it
more representative of what happen in the cells.
The first constraint concerns the weights of the 1
st
part of the model, meaning the weights that are
learned for the different experiments of the ChIP-seq data. Without any constraint, those weights
could have been positive, negative or null. However, having negative weights dont really make
sense because it suggests that an experiment for which a negative weight has been learned tend to
not activate the neuron representing the transcription factor. This is counter intuitive with what
represents the ChIP-seq data. Therefore, if some experiments are not relevant for the gene expression
prediction under the conditions studied, the weights of those experiments should be set to 0 and not
negative. This is done by imposing the weights of the 1
st
part of the model to be either positive or
equal to 0.
The second constraint was imposed on the bias terms of the 1
st
part of the network. Indeed, to
have a model biologically representative, one wants that if there is no ChIP-seq data for a certain
transcription factor, the neuron representing this transcription factor cant be activated and involved
in the gene expression prediction. By forcing the bias to be either negative or equal to 0, it allows to
get rid of this potential obstacle in the biological interpretation of the TF-NN.
6
4.2.4 TF-NN
Reuniting the 1
st
part and 2
nd
part together, adding the dropout layers and the penalty terms as
well as the additional constraints both on the weights and the bias terms of the 1
st
part leads to the
architecture of the TF-NN. This network allows to predict gene expression data using information on
the interaction between transcription factors and DNA. In addition, it has been designed to be able to
extract some biological properties on how transcription factors are regulating gene expression under
the conditions studied. The different parameters allowing to extract those biological information
will be presented in the results section. The TF-NN is schematically depicted on Figure 1.
Figure 1: Schema of the TF-NN architecture.
4.3 Alternative architecture
In addition to the TF-NN, the results for another architecture will be discussed in the next section.
This network is really close to the TF-NN, with a 1
st
part that is exactly the same and a second part
where simply an additional hidden layer is added. This hidden layer is fully connected, via a relu
activation function, with the layer of neurons representing the transcription factors. The layer is
formed by 10 hidden neurons, allowing to reduce even more the dimensionality of the ChIP-seq data.
A schema of this architecture can be observed on Figure 2.
7
Figure 2: Schema of the alternative architecture of the TF-NN.
5 Results
With its specific architecture, the TF-NN can be analyzed at different levels in order to extract some
specific biological features. In this section, it will be explored which information can be obtained
from which parameters of the model and it will be discussed whereas those results appear to be really
biologically relevant. In addition to this analysis at a single tissue level, the results obtained using
multiple tissues will also be described as well as how the model should be used in such configuration.
However, before exploring the interpretability of this network, let’s first have a look on how well it
can predict gene expression using ChIP-seq data.
5.1 Predicting gene expression
5.1.1 Training the network
The network was built using Tensor Flow and the Keras API. It allowed to acquire the necessary
modularity to construct the particular 1
st
part of the TF-NN. To train the network and evaluate
it on two different sets of data, the ChIP-seq and gene expression datasets have been split into a
training set and a testing set. Out of the 13’080 genes, 60% of them have been randomly selected for
the training part, leading to a total of 7’848 genes. The 7’848 genes were again subdivided into two
groups. The final training data and the validation data, allowing to assess the performance of the
model throughout the training process and identify notably if the model is overfitting or underfitting
the training data. 10% of the 7’848 genes were randomly included in the validation dataset, leading to
a final training dataset of 7’063 genes. The Adam algorithm was used as an optimizer in the training
process. Adam optimization is a stochastic gradient descent method that is based on adaptive
estimation of first-order and second-order moments. It’s a computationally efficient method that
is well suited for problems large in terms of parameters [
10
]. This optimizer has been used with a
learning rate of 0.001. Eventually, the loss function used to optimize the model was the mean square
8
error loss. This function computes the mean of squares of errors between the labels (true gene
expression) and predictions (predicted gene expression). The model will therefore seek to minimize
the mean square error loss during the training stage. With the dropout layers and the penalty terms,
the training loss and validation loss are converging and no overfitting or underfitting are observed.
5.1.2 Testing the network
To test the network, the testing dataset is used in order to evaluate the prediction performance of the
model on new data that were unseen during the training process. This testing set represents 40% of
the genes present in the initial gene expression and ChIP-seq data, which corresponds to a total of
5’232 genes. The model has therefore been evaluated by predicting the expression of those genes and
by comparing the predicted values with the true gene expression values across the 48 samples studied.
To assess statistically the performance of the model and the prediction accuracy, the coefficient of
determination R
2
has been computed. This statistical result represents the proportion of the variance
in the true gene expression that is explained by the predicted gene expression. Therefore, it provides
a measure of how well the true gene expression data are approximated by the network. Testing the
TF-NN on the 5’232 genes and assessing statistically its performance by computing the average of
the coefficient of determination of the various samples, it can be observed that 53.3% of variance
of the gene expression data is explained by the model. This is a quite impressive value, especially
considering that this network is designed not only to be accurate in its prediction performance but
also to be interpretable. As a mean of comparison, training and testing fully connected networks
(single layer perceptron or multilayer perceptron) under the same conditions leads to less than 47%
of variance of the gene expression data explained. In addition, those networks operate like black
boxes and are not really interpretable, thus enhancing the interest of the TF-NN to predict gene
expression.
5.2 Interpretation of the parameters - Weights analysis
5.2.1 First part of the TF-NN
In the first part of the network, the ChIP-seq data are taken as input and the space of the data
is reduced to 705 dimensions, with each dimension representing one of the transcription factor
considered. This is done by grouping together the different experiments performed for a similar
transcription factor as well as a bias term into one neuron using a relu activation function. Therefore,
in this part there is a bias value that is computed for each neuron representing a transcription factor
and weights associated to each experiment. With the constraints that have been imposed on those
parameters, they have become biologically interpretable.
Indeed, the bias term gives an idea on how strong the interaction between the transcription factor
and the gene should be in order to involve the transcription factor in the expression prediction of
this gene. The more negative the bias, the higher the ChIP-seq data must be to activate the neuron
representing the transcription factor and involved this transcription factor in the gene expression
prediction. On the other hand, if a bias term is null for a certain transcription factor, this means that
as soon as some interactions between this transcription factor and a gene are reported, the neuron
representing the transcription factor will be part of the prediction process for this gene.
9
The weights of the different experiments being only positive or null, they are reflecting how impor-
tant each experiment is to describe the role of the transcription factor in the conditions studied. If a
weight for a certain experiment is null, this means that this experiment will never be considered in
the model, suggesting that no interesting information can be obtained from such an experiment
in the context in which the TF-NN is used. In contrary, a really high weight value indicates that the
experiment has been done in conditions that are well reflecting the conditions on which the gene
expression data have been obtained.
Using the weights and the bias terms together, it is possible to determine for how many genes the
neuron representing a transcription factor will be activated (positive value after the application of
the relu activation function) and thus be involved in the prediction process of the gene expression.
Performing such analysis for all the transcription factors indicates what are the transcription fac-
tors that are the most often involved in the prediction process, which gives a first idea on which
transcription factors are taking part in the regulation of gene expression in the conditions studied.
However, those values are not representative on how important a transcription factor will be in the
gene expression process.
The results of the percentage of activation of each transcription factor obtained with the liver and
time dependent gene expression data used for this study are represented on Figure 3. This visualiza-
tion allows to observe that some neurons representing transcription factors are often activated in
the model meaning that they are involved in the expression prediction of most of the genes of the
dataset. This is the case for instance for the neuron representing Nr3c1 transcription factor. On the
other hand, certain neurons are never or almost never activated in the model meaning that they are
not really useful for this model and suggesting that the transcription factors that are associated to
those neurons are not involved in gene expression in the liver. Those percentages of activation are
directly linked to the bias values associated to the different transcription factors and for which their
absolute value is represented as red dots on Figure 3. Indeed, as mentioned previously, a very small
bias value (high absolute value) means that the ChIP-seq data for this transcription factor should be
really high in order to activate the neuron representing this transcription factor and involved it in the
prediction. Whereas, a bias value equals to 0 means that as soon as an interaction between a gene
and the transcription factor is predicted in the ChIP-seq data, the neuron of this transcription factor
will be activated in the model, involving the transcription factor in the expression prediction process.
5.2.2 Second part of the TF-NN
The 2
nd
part of the network is linking the 705 neurons representing the transcription factors to the
48 output neurons of gene expression prediction for each sample. The link is simply done via a linear
activation function and an additional bias term. Therefore, the analysis of the bias and of the weights
bring some interesting information on how the transcription factors may regulate gene expression
in the liver.
Starting with the bias term, this parameter is representative of the basal gene expression level of the
dataset and allows to get an idea on how much information is missing in the ChIP-seq data. Indeed,
to predict gene expression in the liver, the ChIP-seq data for 705 transcription factors are considered.
However, the number of known transcription factors in mice, as well as in humans, is more than
the double of the transcription factors considered (
1600 annotated transcription factors in mice
[
11
]). Of course not all of those transcription factors are expressed in the liver, it’s even mentioned in
10
Figure 3: Barplot showing the percentage of genes (y-axis) that involve the transcription factors (x-axis,
listed in alphabetical order) in their expression prediction. The red dots are the absolute bias values
(y-axis) linked to the different transcription factors (real values are only negative or null, by design).
the literature that only around 600 transcription factors are active in this tissue [
12
]. Although, the
expression of the 705 transcription factors has been observed with the liver gene expression data
used for the project and it appeared that 503 of them were expressed, meaning that the ChIP-seq
data for around 100 transcription factors is lacking to account for all transcription factors expressed
in the liver, in the gene expression prediction process. Therefore, the bias term of this 2
nd
part is
kind of representative of the effect of those 100 missing transcription factors in the gene expression
regulation.
The analysis of the weights going from the neuron representing the transcription factor to the
output neurons can be done at two levels: looking at the average effect of the transcription factor
on gene expression prediction or looking on how the weights vary between the different samples.
Looking at the average effect, meaning computing the mean of the 48 weights going from a neuron
representing a certain transcription factor to the different output neurons, allows to determine
whether a transcription factor will tend to act as a repressor or an enhancer of gene expression in
the network. It allows also to get an idea on how important a transcription factor will be to the final
gene expression prediction as the higher the average weights in absolute value, the higher its effect
may be on the prediction. Hence, this approach gives a general idea of the effect of the transcription
factor. However, it doesnt allow to observe the fluctuation of the weights across the different samples,
which is a feature of great interest in the way the network has been designed in order to observe the
specificities (time dependency) of the gene expression data used.
Therefore, to observe how the weights are dependent on the output samples considered, it’s possible
to visualize the weights going from the neuron representing the transcription factor to the output
neuron in function of the output neuron considered. This is what is represented on Figure 4. Very
interestingly, it can be observed that some rhythmic patterns appear in the weights of certain
transcription factors. Those rhythmic patterns observed for the Arntl and Nr3c1 transcription factors
for example have a period of 6 samples which corresponds to a period of 24 hours (the 6 samples
11
of 1 replicate). Having the weights of some transcription factors showing rhythmic patterns is the
reflect of what happen in the cells with the circadian rhythm as we are predicting gene expression at
different period of the day and in the liver. This feature of the network is of particular interest because
not every neuron representing a transcription factor is showing rhythmicity in its weights. Therefore,
by analyzing those weights using a more systematic approach like a Fourier analysis, it can be
observed whether a transcription factor is likely to be involved in the circadian rhythm (oscillations
of the weights in function of the samples) or not (constant weights across the samples). The results
observed for the neuron representing Clock transcription factor are showing less rhythmicity than
for the other two transcription factors visualized. As it is known that Clock transcription factor is
a core component of the circadian rhythm, clear rhythmic patterns were expected. The reason
why less rhythmicity is observed may be explained by the fact that this transcription factor forms
an heterodimer with Arntl before activating the expression of downstream genes. Therefore, the
rhythmicity of this complex may have been learned mostly for the Arntl transcription factor, showing
some limitations of this approach to fully determine what are the transcription factors involved in
the circadian rhythm.
Figure 4: Variation of the weights going from the neuron representing a transcription factor to the
output neurons across the samples (intron samples before the vertical red dashed line, exon samples
after it). Weights for neurons representing Arntl, Nr3c1 and Clock transcription factors are represented.
5.2.3 Full TF-NN
Considering the whole network, it becomes possible to determine the exact effect of each transcrip-
tion factor on the expression prediction of a certain gene. Indeed, by considering the ChIP-seq
data of a gene and using the different parameters of the trained network, it is possible to follow
the prediction process and determine to which extent each transcription factor has been involved.
This allows to give some insights on how a specific chosen gene is regulated. The disadvantage is
12
that it is an analysis done at a very fine level of granularity, which doesnt allow to determine the
global process regulating gene expression. However, by considering all the genes of the dataset and
averaging the effect of the different transcription factors on each of those genes, it allows this time to
determine the importance of each transcription factor in the prediction of gene expression under the
conditions studied. Therefore, doing this analysis on the dataset considered for this project, it allows
to determine what are the transcription factors regulating gene expression in the liver at specific
times.
This is what is represented on Figure 5. On the x-axis are the 705 transcription factors and on
the y-axis their average contribution to the gene expression prediction. So, looking at how each
transcription factor contributes to the gene expression prediction, it gives an idea of what are the
transcription factors that are the most important and the most involved in the conditions studied. It
can be observed in this case that the three most involved transcription factors in the gene expression
prediction are Nelfa, Clock and Brd4. Those results seem to be quite biologically relevant and rep-
resentative on how gene expression is regulated in the liver through time. Indeed, out of the most
important transcription factors, it can be observed that some of them are involved in general process
of gene expression regulation: For instance, Nelfa participates in the regulation of RNA polymerase
II transcription elongation [
13
], Brd4 is an epigenetic reader domain thus involved in histone regu-
lation [
14
] and Nr3c1, which is the glucocorticoid receptor, is expressed on almost all cell types in
the body and regulates the expression of many genes [
15
]. Whereas the other transcription factors
that participate the most to gene expression prediction are more specific to the liver or the circadian
rhythm: Clock plays a central role in the regulation of circadian rhythms [
16
], Hnf4a is essential
for circadian rhythm maintenance and period regulation in the liver and acts as a transcriptional
regulator controlling the expression of several hepatic genes [17] and Onecut1, which expression is
enriched in liver, stimulates transcription of liver-expressed genes [18].
Figure 5: Mean contribution (y-axis, in percent) of all transcription factors (x-axis, listed in alphabeti-
cal order) to the gene expression prediction.
Those key results tend to demonstrate that, with the constraints added on the different parameters
of the network, the transcription factors appearing to be the most important for the model to predict
gene expression are probably reflecting (at least partially) how gene expression is regulated in the
13
liver.
In addition to this approach done for the whole dataset, it can also be interesting to reduce the
number of genes considered in order to look only at group of genes involved in a common biological
process. This allows to determine which transcription factors are the most important for a smaller
set of genes, reducing the averaging effect on the results observed. Therefore, this weight analysis of
the whole TF-NN can be done at three different levels (one gene, a group of gene or all the genes of
the dataset) allowing to analyze gene regulation at any granularity.
5.3 TF-NN with an additional hidden layer
Adding a hidden layer of few hidden neurons between the layer of neurons representing the tran-
scription factors and the output neurons (Figure 2) shows both great advantages and disadvantages.
On the positive side, by designing the neural network like this, it allows to reduce greatly the di-
mensionality of the ChIP-seq data used as input data. Indeed, one of the disadvantage of using
those data is that they are of high dimensions (more than 10’000 conditions considered) and not
all the experiments are necessarily relevant for the conditions considered. Therefore, it appears
essential to reduce the space of those data before predicting gene expression. This is what is already
done with the TF-NN that reduced the space of the ChIP-seq data to 1 dimension per transcription
factor leading to 705 values. However, it appears that adding an additional hidden layer of 10 hidden
neurons (or more, but being interested by the dimensionality reduction a smaller number of hidden
neurons is of greater relevance) allows to obtain interesting prediction results. Using the same
training and testing approach than the one described for the TF-NN, it appears that the trained
model of this architecture allows to explain 54.1% of variance. This is a slightly better result than the
one obtain with the TF-NN (53.3%).
However, on the negative side, some interpretability of the network is loss. Indeed, first it may
be thought that such a model would be more representative of what happen in the cells with the
hidden neurons representing group of transcription factors operating together to regulate gene
expression. Though, by analyzing the weights distribution on those hidden neurons, it appears
that the transcription factors are not really clustering into those different hidden neurons. Some
additional constraints or complexity should be added in this model to obtain such results. Hence,
how it stands, this model is less interpretable than the TF-NN because the direct link between the
neurons representing the transcription factors and the output neurons is loss. So, condensing even
more the information of the ChIP-seq data works well for the aim of predicting gene expression but
comes at the cost of a loss of interpretability, resulting in a model that is not yet able to offer a better
compromise between prediction accuracy and interpretability than the one of the TF-NN.
5.4 Multi tissue study
For this study, data for different tissues were needed. The data used were therefore the second set of
gene expression data described in the data section. It corresponds to gene expression data for the
liver, the muscle and the white adipose tissue.
14
5.4.1 Independent models
Before exploring how the TF-NN can be used on multiple tissues, an independent model has been
trained and tested on gene expression of each tissue. It allowed to evaluate that the results obtained
with the first dataset on the liver tissue were not specific to this dataset. This is indeed what has
been observed with the results obtained for the liver tissue of this new set of data. The coefficient of
determination for this new model assesses that 53% of variance of the gene expression was explained,
which is in the same range as the results obtained with the previous gene expression dataset (53.3%).
However, using the gene expression data of the two other tissues, the performance of the TF-NN
decreases to 47% and 43.3% of variance explained for the gene expression data of the muscle tissue
and the data of the white adipose tissue respectively.
The reason of this decrease in prediction performance may find its source directly in the ChIP-seq
data. Indeed, those data are collected from various experiments performed under many different
conditions and cell types or tissues. Therefore, not all the tissues are similarly represented in those
data. It appears that 646 experiments have been performed in the liver tissue (out of the 10’805)
whereas only around 150 experiments have been performed in muscle tissue and even less for the
white adipose tissue. The decrease observed in the model performance is thus probably the reflect
of the decrease of those tissue specific conditions in the ChIP-seq data.
5.4.2 Use of the TF-NN on multiple tissues
Using the same network for multiple tissues is of great interest. With such approach, it would be
possible to train a model with one or multiple tissues before looking at how well this model could
perform on a new tissue, allowing to explore similarities in gene expression regulation between
tissues for instance.
However, using a same model on different tissues is not necessarily straightforward with the TF-NN.
Indeed, a major problem is encountered with the input data. Its the same input dataset that is used
for every tissue as the ChIP-seq data are not tissue specific but reunite a large number of experiments
performed under different conditions. The direct consequence of this problem is that it becomes
impossible to train a model on one tissue before testing it on another tissue as nothing will be specific
to the second tissue so the predicted gene expression values would be the same than the ones of the
first tissue.
That’s why another approach should be considered to perform a multi tissue study. The method
consists to train the whole network on one or many tissues before freezing a part of this trained
model, meaning to block the values of the parameters and render them untrainable, and training the
unfrozen part for certain genes of a new tissue to eventually test the whole model for the other genes
of the new tissue. Like this, it allows to have a part of the model that is tissue dependent and another
part that is tissue independent, global for the different tissues.
With the TF-NN two possibilities arise with this approach. Either the 1
st
part of the TF-NN (Figure 6)
or the 2
nd
part can be frozen after the training on one or some tissues. Freezing the 1
st
part means
that the link between the ChIP-seq data and the neurons representing the transcription factors will
be unspecific to the to the tissue for which gene regulation is studied, whereas the link between the
neurons representing the transcription factors and the prediction neurons will be specific to this
tissue, allowing to fine tune the effect of each transcription factor on gene expression once they are
activated.
15
Figure 6: TF-NN with its 1st part frozen.
Freezing the 2
nd
part of the network, this time the weights of each experiment of the ChIP-seq data
as well as the threshold set on those data will be specific to the tissue studied and it will be the effect
of each transcription factors on gene expression that will be unspecific to the new tissue.
5.4.3 Freezing the first part of the network
Performing this approach and using the gene expression data of the liver tissue to train the whole
model, before freezing the 1
st
part and training/testing this model with gene expression data of
either the muscle or the white adipose tissue, leads to noteworthy results.
First, it can be observed that such models lead to a decrease in the prediction performance. In the
two different cases tested, the coefficient of determination is reduced. For the muscle tissue, 39.6%
of variance of the gene expression data is explained with this approach instead of 47% of variance
explained when using an independent model. Whereas for the white adipose tissue the coefficient
of determination goes from 43.3% to 37.2% of variance explained when freezing the 1
st
part of the
network. This decrease in prediction performance is pertinent as not all the parameters are specific
to the tissue for which gene expression is predicted. However, besides the prediction results, the
biological interpretability of the parameters is the central interest of this project. This is where this
approach appears to bring some remarkable outcomes.
When freezing the 1
st
part of the network, the only parameters that are modified, after that the
model has been trained on the liver tissue, are the weights going from the neurons representing the
16
transcription factors to the output (prediction) neurons. Therefore, by looking at how the value of
those weights changes from when the model has been trained on the liver tissue only to when it has
also been trained on the muscle (or white adipose) tissue, it allows to compare how the transcription
factors are differentially involved in gene expression in those two tissues. This is what is represented
on Figure 7. Those results seem to be actually well representative of the transcription factors that are
specific to the tissues studied. Indeed, on the negative part of the figure, the transcription factors
that have higher weights in the liver and thus are more important in gene expression prediction
of this tissue can be found. It appears that the transcription factors present in this part are for the
majority of them transcription factors that are truly specific to liver genes. Besides the transcription
factors already described in this report like Clock, Hnf4a or Onecut1, Hnf1a can be found, which is
a transcription factor required for the expression of several liver-specific genes [
19
] or also Foxa1
and Foxa2 that are hepatocyte nuclear factors acting as transcriptional activators for liver-specific
transcripts [
20
]. On the other side of the figure, where the transcription factors that are more specific
to the muscle can be found, again it seems to be the reflect of what happen in the cells with for
instance the Myog transcription factor that promotes transcription of muscle-specific target genes
[
21
], the transcription factors of the Mef2 family (Mef2c and Mef2d) that are involved in the control
of muscle differentiation and development [
22
] or the Myod1 transcription factor that promotes
transcription of muscle specific target genes [23].
Figure 7: Difference in weights of the 2
nd
part between the muscle and liver tissue after using the
freezing approach for the 1st part of the TF-NN.
Therefore, this approach of training the whole model for a first tissue before freezing the 1
st
part and
training/testing using another tissue reduces, on one hand, the gene expression prediction but, on
the other hand, allows to determine what are the tissue specific transcription factors between the two
tissues studied. In addition, looking at the differences between two tissues show a supplementary
advantage in comparison to the results obtained using a single tissue is that it allows to get rid (or at
least reduce) the effect of the general transcription factors as those ones will be involved in the two
17
tissues. Hence, it provides a real idea on what are the transcription factors particularly important
for gene expression regulation of a certain tissue. Indeed, it can be seen that general transcription
factors that were observed as some of the most important ones for gene expression regulation in the
liver like Brd4 or Nr3c1 (Figure 5) are not present anymore when looking at the difference between
two tissues (Figure 7).
5.4.4 Freezing the second part of the network
Doing the opposite than in the previous section, meaning to train the whole model with gene
expression data of the liver before freezing this time the 2
nd
part and training/testing this model with
gene expression data of either the muscle or the white adipose tissue, an almost similar decrease in
the prediction performance can be observed.
For the muscle tissue, 38.8% of variance of the gene expression data is explained with this approach
(39.6% when freezing the 1
st
part, 47% when using an independent model). Whereas for the white
adipose tissue the coefficient of determination goes from 43.3% to 38.3% of variance explained when
freezing the 2nd part of the network. Again, this decrease seems relevant but is still important.
However, the prediction results may be improved using more than a single tissue to train the model
before freezing the 2
nd
part. To test this hypothesis, two independent model were trained. One with
the liver tissue and one with the muscle tissue. The averages of the weights of the 2
nd
part between
those two independent models have been computed before setting those results as the weights of the
2
nd
part of a new model, freezing this 2
nd
part of the network and training/testing this new model
on the white adipose tissue. With this approach the coefficient of determination reaches a value of
39.8%. This is still lower than the 43.3% of variance explained obtained when training the whole
model with the gene expression data of the white adipose tissue. However, it is a better prediction
result than when freezing the 2
nd
part on the network directly after having trained the model on the
liver gene expression data. Therefore, this tendency may mean that using multiple tissues for the 2
nd
part of the network before freezing it is beneficial as it allows to determine a kind of average effect of
each transcription factor on gene expression (not tissue specific anymore) that is thus better suited
for the method consisting of freezing the 2nd part of the network.
18
6 Discussion
The understanding of how transcription factors regulate gene expression under specific conditions
and on a genome wide scale is of great interest with the yet well established high-throughput tech-
nologies allowing the measurement of genome-wide mRNA expression. Here is introduced the
TF-NN, a neural network that uses chromatin immunoprecipitation sequencing data to predict gene
expression data. Through this prediction process, the model learns parameters that are specific to the
condition of interest on which the gene expression data have been collected. Those parameters can
then be extracted to obtain insights on how the transcription factors are regulating gene expression
under those conditions.
Indeed, by adding various constraints on different parameters of the network, the TF-NN has been
designed to be representative of what may happen biologically allowing to both predict gene expres-
sion data and extract biological properties from its parameters. The use of ChIP-seq data that are not
tissue specific but reunite more than 10’000 experiments performed for different cell types, under
particular external stress or collected at specific times, allows this model to be employed for gene
expression data obtained under many different conditions.
Using the TF-NN with gene expression data collected in the mouse liver at specific times, it has
been shown that this network allows notably to determine how important the different transcrip-
tion factors are in the regulation of gene expression under those conditions or to observe which
transcription factors may be involved in the circadian rhythm by looking at the oscillating weights
appearing for some neurons. Those results look quite relevant considering the known effects of
some transcription factors described in the results section. Therefore, the duality (prediction and
interpretation of the weights) of the TF-NN is certainly noteworthy to get some insights on how the
transcription factors are involved in the regulation of gene expression in the liver and at specific
times.
Another advantage of the TF-NN is that its architecture allows to reduce the space of the ChIP-seq
data to 1 dimension per transcription factor, leading to 705 dimensions. Indeed, the interaction
data considered between the transcription factors and DNA come from the ChIP-Atlas database
that reunite all the public ChIP-seq data submitted to the Sequence Read Archives. Therefore, this
leads to more than 10’000 experiments considered to account for many transcription factors, which
is way too many information to look at to understand how gene expression is regulated by those
transcription factors. However, with this architecture and the dimensionality reduction, it renders
those data more interpretable and somehow more useful on a genome-wide scale. It has been shown
that for the only aim of prediction of gene expression in the liver and at specific times the space can
even be reduced up to 10 dimensions using the architecture schematically depicted on Figure 2.
Eventually, even though the ChIP-seq data are not tissue specific and therefore are used as input
of the network for any tissue studied, the TF-NN can be used with multiple tissues to extract some
tissue specificities. Indeed, by using an approach allowing to have parameters that are learned with
some tissues and others that are specific to the tissue and the conditions of interest, it becomes
possible to identify transcription factors that are really specific to the tissue studied and to reduce or
get rid of the effects of the more global transcription factors. It has also been shown that using two
tissues to determine the average effect of each transcription factor on gene expression and then only
render the 1
st
part of the model tissue specific allows to get prediction results that are closer to the
ones obtained using an independent model. Therefore, it may be interesting for further studies to
test this feature of transcription factor average effect on gene expression using even more tissues in
order to observe if the more tissues considered the closer the prediction results may be to the ones
19
of an independent model.
There are of course some limitations to the approach of using the TF-NN to understand how gene
expression is regulated in the conditions studied. For instance, using only linear relationships (relu
activation function or linear activation function) the TF-NN wont be representative of saturation
effects that are occurring in the cells. This approach also assumes that a transcription factor acts
only either as an activator or as a repressor for all the genes considered in the dataset whereas the
effect of each transcription on gene expression may be positive for some targets and negative for
others. Additionally, one limitation appears with the information that is provided by the ChIP-seq
data. Indeed, those data relate the interaction between the transcription factor and the potential
target genes located at a maximal (specified) distance. This restricts the analysis to the assumption
that a transcription factor affects the expression of the gene that is the most proximal to the ob-
served binding site. Whereas, in fact this transcription factor may act on another DNA component
that could impact a distal target gene [
24
]. Eventually, a last limitation concerns the links between
the neurons representing the transcription factors and the prediction neurons. Those links are
direct in the TF-NN with transcription factors that are therefore considered to directly regulate gene
expression. However, this is not exactly what happen in the cells. The transcription factors are
interacting together to regulate gene expression, forming heterodimers or activating/repressing each
other. Implementing those additional biological properties in the TF-NN, by adding some neurons
representing pairs of transcription factors for instance, would lead to a further increase in the model
representativity of the gene expression regulation allowing to extract some complementary infor-
mation on how the transcription factors are interacting together and what are the effects of those
interactions on gene expression.
7 Acknowledgments
I would like to thank the Laboratory of Computational and Systems Biology for this very interesting
project. A special thanks to Lorenzo Talamanca, who monitored the progress of the project and
offered his precious help and advices every week, to Professor Naef, who brought all his expertise and
key advices to help in the design of the TF-NN and the analysis of the results and to Cédric Gobet,
who set up really well the initial working environment and the data provided as well as suggested
some nice features to add to the TF-NN.
20
References
[1] Protein function.
Available at: https://www.nature.com/scitable/topicpage/protein-function-14123348/.
[2]
D. S. Latchman, “Transcription factors: an overview., vol. 74, no. 5, pp. 417–422. International
Journal of Experimental Pathology.
[3] P. J. Balwierz, M. Pachkov, P. Arnold, A. J. Gruber, M. Zavolan, and E. van Nimwegen, “ISMARA:
automated modeling of genomic signals as a democracy of regulatory motifs, vol. 24, no. 5,
pp. 869–884. Genome Research.
[4] How to study activation of transcription factors | abcam.
Available at: https://www.abcam.com/kits/methods-to-study-transcription-factor-activation.
[5]
S. A. Lambert, A. Jolma, L. F. Campitelli, P. K. Das, Y. Yin, M. Albu, X. Chen, J. Taipale, T. R.
Hughes, and M. T. Weirauch, “The human transcription factors, vol. 172, no. 4, pp. 650–665.
Cell.
[6] “ChIP-Atlas.
Available at: https://chip-atlas.org/.
[7] “ChIP-Atlas | github.
Available at: https://github.com/inutano/chip-atlas.
[8]
F. Atger, C. Gobet, J. Marquis, E. Martin, J. Wang, B. Weger, G. Lefebvre, P. Descombes, F. Naef, and
F. Gachon, Circadian and feeding rhythms differentially affect rhythmic mRNA transcription
and translation in mouse liver, vol. 112, no. 47, pp. E6579–E6588. National Academy of Sciences
Section: PNAS Plus.
[9]
D. Gaidatzis, L. Burger, M. Florescu, and M. B. Stadler, Analysis of intronic and exonic reads in
RNA-seq data characterizes transcriptional and post-transcriptional regulation, vol. 33, no. 7,
pp. 722–729. Nature Biotechnology.
[10] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv:1412.6980 [cs].
[11]
Q. Zhou, M. Liu, X. Xia, T. Gong, J. Feng, W. Liu, Y. Liu, B. Zhen, Y. Wang, C. Ding, and J. Qin, A
mouse tissue transcription factor atlas, vol. 8, no. 1, p. 15089. Nature Communications.
[12]
K. Sun, H. Wang, and H. Sun, “mTFkb: a knowledgebase for fundamental annotation of mouse
transcription factors, vol. 7, no. 1, p. 3022. Scientific Reports.
[13]
“NELFA negative elongation factor complex member a [homo sapiens (human)] - gene - NCBI.
Available at: https://www.ncbi.nlm.nih.gov/gene/7469.
[14] “Tocris Bioscience - Cell Biology - Bromodomains.
Available at: https://www.tocris.com/cell-biology/bromodomains.
[15]
L. Quatrini and S. Ugolini, “New insights into the cell- and tissue-specificity of glucocorticoid
actions, pp. 1–10. Cellular & Molecular Immunology.
21
[16] “CLOCK clock circadian regulator [homo sapiens (human)] - gene - NCBI.
Available at: https://www.ncbi.nlm.nih.gov/gene/9575.
[17]
M. Qu, T. Duffy, T. Hirota, and S. A. Kay, “Nuclear receptor HNF4a transrepresses CLOCK:BMAL1
and modulates tissue-specific circadian networks, vol. 115, no. 52, pp. E12305–E12312. Pro-
ceedings of the National Academy of Sciences.
[18] “ONECUT1 one cut homeobox 1 [homo sapiens (human)] - gene - NCBI.
Available at: https://www.ncbi.nlm.nih.gov/gene/3175.
[19] “HNF1a HNF1 homeobox a [homo sapiens (human)] - gene - NCBI.
Available at: https://www.ncbi.nlm.nih.gov/gene/6927.
[20] “FOXA1 forkhead box a1 [homo sapiens (human)] - gene - NCBI.
Available at: https://www.ncbi.nlm.nih.gov/gene/3169.
[21] “MYOG myogenin [homo sapiens (human)] - gene - NCBI.
Available at: https://www.ncbi.nlm.nih.gov/gene/4656.
[22] “MEF2d myocyte enhancer factor 2d [homo sapiens (human)] - gene - NCBI.
Available at: https://www.ncbi.nlm.nih.gov/gene/4209.
[23] “MYOD1 myogenic differentiation 1 [homo sapiens (human)] - gene - NCBI.
Available at: https://www.ncbi.nlm.nih.gov/gene/4654.
[24]
B. Deplancke, D. Alpern, and V. Gardeux, “The genetics of transcription factor DNA binding
variation, vol. 166, no. 3, pp. 538–554. Cell.