3
3.1.1 Differential gene expression analysis
Differential expression analysis has been performed using the package EdgeR (https://www.rdocumentation.
org/packages/edgeR/versions/3.14.0). The process starts with a preprocessing step that aims to
remove the genes that were almost not expressed across the samples. To be kept, the genes had to
have a transcripts per million value greater than 10 in at least three samples out of the 120. Therefore,
the data set consisted of 120 samples representing 22 nuclei and 16’538 genes. Then, the counts
were fitted to a negative binomial generalized linear model, where each factor level represented one
of the 22 different thalamic nuclei. Fitting the data with a negative binomial distribution is useful
for over-dispersed count data making it a commonly used distribution to fit RNA-seq data. The
reasons why this distribution is often the first one considered is that since reads are count based, they
can’t be normally distributed without processing such as log-transformation (we can’t have negative
values or have floating-point numbers). Therefore, two distributions for count based data can be
taken into account, which are Poisson (that presumes the variance and mean of gene expression
in our case to be equal) or negative binomial (which does not make this assumption) [
6
]. Negative
binomial distribution will especially be useful for discrete data over an unbounded positive range
whose sample variance exceeds the sample mean. Therefore, the observations are overdispersed
with respect to a Poisson distribution, for which the mean is equal to the variance. Hence a Poisson
distribution is not an appropriate model. Since the negative binomial distribution has one more
parameter than the Poisson, the second parameter can be used to adjust the variance independently
of the mean. That’s why the negative binomial distribution is commonly used to model data in the
form of discrete sequence read counts from high-throughput RNA and DNA sequencing experiments
[7].
After that the counts have been fitted to a negative binomial generalized linear model, the 500 most
differentially expressed genes between any thalamic nuclei have been determined using a one-way
analysis of deviance (ANODEV) for each gene. This allowed to test for differences between any of
the 22 nuclei and to select the genes with the lowest P value representing the most differentially
expressed ones. Eventually, the count data have been transformed, through variance stabilizing
transformation and normalization, to become a matrix of values approximately homoscedastic
(having constant variance along the range of mean values).
3.1.2 Hierarchical clustering
Hierarchical clustering was then performed on the approximately homoscedastic matrix where each
nucleus represented the mean value of its samples. This was done using Spearman’s correlation as a
distance metric and complete linkage for agglomeration. Groups were defined by splitting the tree at
the level of five branches, which represent what they termed "profiles".
Using Spearman’s correlation allows to assess how well the relationship between two variables can
be described by means of monotonic relationships. A Spearman correlation of +1 or
−
1 occurs when
each of the variables is a perfect monotone function of the other. Therefore, it allowed to determine
how the 22 thalamic nuclei were close to each other by comparing their expression profile of the
500 genes selected. Thus, after having determined those values it was possible to perform complete-
linkage clustering, which is one of several methods of agglomerative hierarchical clustering. At the
beginning of the process, each element is in a cluster of its own. The clusters are then sequentially
combined into larger clusters until all elements end up being in the same cluster.