Genomics and bioinformatics course
Project report
Date: 29 May 2020
Title:
Molecular analysis of thalamic nuclei via unsupersived hierarchical clustering and PCA
based on the data from paper:
A repeated molecular architecture across thalamic pathways (J. Phillips et al.)[1]
Evan Béal
Professor.:
Prof. Jacques Rougemont
Teaching assistant:
Jane Yi
1
1 Summary
The thalamus plays an important role in a mammalian brain. It performs hub-like exchanges of
information via nerve fibers that project out of it to the cerebral cortex in all directions. Many main
functions of the brain are thus directly involving the thalamus, which is considered as the central
communication hub of the forebrain. However, the organizational logic of thalamic projections has
remained abstract. Therefore, through unsupervised hierarchical clustering of the most differentially
expressed genes, obtained via gene expression fitting of a negative binomial distribution across 22
distinct nuclei of the thalamus, this study allows to determine three major molecular profiles in
this region of the brain. This new approach of using molecular architecture for classification of the
thalamus nuclei allows to obtain different results than the ones previously proposed based on the
output targets of each nucleus. It permits to identify an organization in a statistically unsupervised
manner based on transcriptomic data of a broad set of genes and describes a molecular framework
upon which an increasingly detailed understanding of the thalamus can be built. Another approach
to compute the most differentially expressed genes has been performed with the modelling of the
mean-variance relationship of the read counts in addition to the other analyzes. This also leads
to three major multinuclei thalamic gene expression profiles, with only differences for one of the
22 nuclei in the hierarchical clustering and no significant variations in the ordered relationships
between the nuclei established with principal component analysis. Therefore, the general similarity
in the results obtained assesses the two methods tested for differential expression analysis on
transcriptomic data.
2 Background
The thalamus has several functions, such as the relaying of sensory signals, like motor signals, to
the cerebral cortex and the regulation of consciousness, sleep, and alertness. In addition, it’s wide
ranging connectivity makes it being the central communication hub of the forebrain [
2
]. Thus, it’s
role is essential in the brain, which render the understanding of its organization of great interest.
Therefore, as it’s known that cell-type specialization underlies the functioning of all understood
organs, to determine whether there are multiple types of thalamic projection neurons would allow to
give insights in the logic of how thalamocortical projections support cortical function and increase
our comprehension of this part of the brain.
Several approaches have been performed to classify the thalamic nuclei. Based on cytoarchitectural
and connectomic features, the thalamus can be divided into approximately 30 distinct nuclei [
3
].
Thus the classification task aims to categorize those nuclei into subgroups in which there is a small
variance for nuclei of the same group, as they should share common features, and a large variance
for nuclei of different groups. The common approach was to group nuclei by the type of sensory
inputs that they receive, allowing to distinguish visual, auditory and somatosensory nuclei. This
approach was then extended to consider the output targets of each nucleus in order to be able to
also classify nuclei of non-sensory regions, allowing to define motor or prefrontal projection systems
for example. However, this classification approach based on projection considers only one feature to
determine the group of each nucleus and therefore does not explain why multiple thalamic nuclei
exist within each group. So, different nuclei classified in the motor projection system for instance can
vary considerably in various relevant features. Therefore, several theories such as the core/matrix [
4
]
2
or first/higher-order [
5
] theories have been proposed to categorize the thalamic projection nuclei.
Nevertheless, those theories have limitations. For example, significant variations can be found in the
functional and anatomical properties of nuclei within each proposed class, some of those theories
are based almost exclusively on test realized in a sensory context or even some categorizations rely
only on a single anatomical feature although neurons vary in many more parameters [1].
Thus, to take into account the different parameters of a cell, the analyzes have to be done at a molec-
ular level. The molecular analyzes have already provided a principled way to characterize neural
populations in other brain regions, offering frameworks for functional networks understanding.
However, there is a lack of understanding at this level for thalamic nuclei. The cellular architecture
based on molecular analyses is either limited to a set of molecular markers or limited to a set of
thalamic nuclei, or both [
1
]. In addition, some molecular markers that have been identified were not
determined in an unsupervised manner, which therefore introduce potential bias and reduce the
causality links. For all those reasons, here the thalamic nuclei will be studied at the molecular level
using transcriptomic data to classify the nuclei into subgroups in an unsupervised manner.
Therefore, to understand the organization of the thalamus, anatomical and transcriptomic ap-
proaches have been combined to obtain read counts values for more than 16’000 genes and 120
samples from mice corresponding to 22 nuclei representing 8 projection targets. The thalamic nuclei
were retrogradely labeled from forebrain areas and the separate nuclei were microdissected. RNA
was then isolated from pooled cells (bulk populations of 14 to 200 cells, median value of 52 cells)
and sequenced. One advantage of the pooled-cell approach over a single cell approach is that, by
averaging over cells, it enables a greater sensitivity for detecting low expressing genes, with some of
those ones encoding for receptors or ion channels, which could be key features to separate the nuclei
into different groups. Consequently, the transcriptome profiling defined via projection labeling
will be used to explore the major branches of thalamic molecular heterogeneity and identify major
subdivisions of nuclei through different analyses ranging from gene differential expression analysis
to principal component analysis, and including unsupervised hierarchical clustering.
3 Methods
3.1 Data analysis 1
In this section, we will get through the different methods of data analysis that have been used by the
authors and reproduced by myself to be able to obtain a hierarchical clustering of thalamic nuclei in
the form of a heatmap and to observe clusters of nuclei based on their molecular similarity. This was
done using the data available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE133911
and adapting the code available on github (https://github.com/aschulmann/ThalamoSeq).
Those methods can be reunited in three main groups. The first one is the differential gene expression
analysis between the samples. Then, the second one consists of the hierarchical clustering to
determine a hierarchy of the different thalamic nuclei. Eventually, the third one will consist of a
dimensionality reduction through principal component analysis.
3
3.1.1 Differential gene expression analysis
Differential expression analysis has been performed using the package EdgeR (https://www.rdocumentation.
org/packages/edgeR/versions/3.14.0). The process starts with a preprocessing step that aims to
remove the genes that were almost not expressed across the samples. To be kept, the genes had to
have a transcripts per million value greater than 10 in at least three samples out of the 120. Therefore,
the data set consisted of 120 samples representing 22 nuclei and 16’538 genes. Then, the counts
were fitted to a negative binomial generalized linear model, where each factor level represented one
of the 22 different thalamic nuclei. Fitting the data with a negative binomial distribution is useful
for over-dispersed count data making it a commonly used distribution to fit RNA-seq data. The
reasons why this distribution is often the first one considered is that since reads are count based, they
cant be normally distributed without processing such as log-transformation (we cant have negative
values or have floating-point numbers). Therefore, two distributions for count based data can be
taken into account, which are Poisson (that presumes the variance and mean of gene expression
in our case to be equal) or negative binomial (which does not make this assumption) [
6
]. Negative
binomial distribution will especially be useful for discrete data over an unbounded positive range
whose sample variance exceeds the sample mean. Therefore, the observations are overdispersed
with respect to a Poisson distribution, for which the mean is equal to the variance. Hence a Poisson
distribution is not an appropriate model. Since the negative binomial distribution has one more
parameter than the Poisson, the second parameter can be used to adjust the variance independently
of the mean. That’s why the negative binomial distribution is commonly used to model data in the
form of discrete sequence read counts from high-throughput RNA and DNA sequencing experiments
[7].
After that the counts have been fitted to a negative binomial generalized linear model, the 500 most
differentially expressed genes between any thalamic nuclei have been determined using a one-way
analysis of deviance (ANODEV) for each gene. This allowed to test for differences between any of
the 22 nuclei and to select the genes with the lowest P value representing the most differentially
expressed ones. Eventually, the count data have been transformed, through variance stabilizing
transformation and normalization, to become a matrix of values approximately homoscedastic
(having constant variance along the range of mean values).
3.1.2 Hierarchical clustering
Hierarchical clustering was then performed on the approximately homoscedastic matrix where each
nucleus represented the mean value of its samples. This was done using Spearmans correlation as a
distance metric and complete linkage for agglomeration. Groups were defined by splitting the tree at
the level of five branches, which represent what they termed "profiles".
Using Spearmans correlation allows to assess how well the relationship between two variables can
be described by means of monotonic relationships. A Spearman correlation of +1 or
1 occurs when
each of the variables is a perfect monotone function of the other. Therefore, it allowed to determine
how the 22 thalamic nuclei were close to each other by comparing their expression profile of the
500 genes selected. Thus, after having determined those values it was possible to perform complete-
linkage clustering, which is one of several methods of agglomerative hierarchical clustering. At the
beginning of the process, each element is in a cluster of its own. The clusters are then sequentially
combined into larger clusters until all elements end up being in the same cluster.
4
3.1.3 Principal Component Analysis (PCA)
The last main part of the analyses performed consisted of a dimensionality reduction through PCA.
PCA is a common technique to visualize genetic distance and relatedness between populations.
Therefore, in this case the PCA is used on the 120 samples to observe their variation between
each other and has been calculated through singular-value-decomposition and on the 500 most
differentially expressed genes.
The relation of the first PCs to the profiles obtained with the hierarchical clustering has eventually
been assessed by analysis of variance (ANOVA), to determine the ratio of between and within group
variance, and evaluated via an F-test.
3.2 Data analysis 2
For the additional figures, instead of defining the most differentially expressed genes through a
negative binomial fitting of the gene expression, those ones have been determined via the modelling
of the mean-variance relationship of the read counts. This has been done with the Limma package
(https://www.rdocumentation.org/packages/limma/versions/3.28.14) that is well known for data
analysis, linear models and differential expression analysis of microarray data. However, it can also
be used for count data and therefore gene expression analysis [8].
The first step of the process to determine the most differentially expressed genes through this
approach was to perform a preprocessing of the data to keep only genes that are expressed in
minimum 90% of the samples (108/120). Then, the mean-variance relationship was modelled with
precision weights approach called “voom (variance modeling at the observational level). This
transformation uses the variance of genes to create weights for use in linear models. First, gene-wise
linear models are fitted to the normalized log-count-per-million values accounting for all relevant
factors generating a residual deviation for each gene. A robust trend is then fitted to the residual
standard deviations as a function of the average log-count for each gene. The fitted log-cpm (from
the linear model fit) for each observation is converted into a predicted count and using this value,
the standard deviation is interpolated from the trend. Eventually, the inverse squared predicted
standard deviation for each observation becomes the weight for that observation [
9
]. This allows to
obtain the mean-variance trend estimated for all genes.
The next step of the process was to estimate the fold changes and standard errors by fitting a linear
model for each gene. Eventually, an empirical Bayes smoothing has been applied on the standard
errors, meaning that information from all genes has been borrowed to provide stable variance
estimation, and the 500 most differentially expressed genes between any thalamic nuclei have been
determined selecting the genes with the lowest P value.
This method is good when working with noisy data and dealing with several factors. In addition, it
has been shown than the variance modeling at the observational level approach has the best power
of methods that control the type I error rate as well as the lowest false discovery rate each time in
comparison with different methods for differential gene expression analysis and especially the one
used in the paper which is EdgeR glm fit (negative binomial) [9].
5
4 Results
4.1 Data exploration
The first part of the results concerns the data exploration. This part aims simply to determine what
the data looks like. Therefore, the count matrix representing the transcriptome of 120 samples that
came from 22 different thalamic nuclei has been examined. This has been done through three plots.
The first one (Figure 1) allows to observe the summary of the read count for 9 randomly selected
samples with a boxplot. We can see that the median appears to be between a read count of 100
and 1000 for each sample and almost no outliers emerge with this log10 scale. The second plot
(Figure 2) represents the read counts distribution for 9 genes randomly selected. Even if those genes
have different expression pattern across the samples, it can be observed that this expression could
be fitted using a negative binomial distribution, as the sample variance exceeds the sample mean.
Eventually, the third exploratory plot (Figure 3) is the result of a principal component analysis. It
can be seen that differences between samples of the same thalamic nucleus exist and that some
relatedness between samples of different nuclei can be observed with the first principal component.
However, no clear pattern results from this analysis, assessing the role of the next part to determine
the 500 most differentially expressed genes in order to enhance similarity or dissimilarity between
samples.
4.2 Data analysis 1
The figures in the original paper corresponding to the ones reproduced are the figures 1C, 2A and
2B. The hierarchical clustering on the 500 most differentially expressed genes using Spearmans
correlation as a distance metric and complete linkage for agglomeration leads to a dendrogram
(Figure 4). It allows to identify five subdivisions of nuclei across the thalamus, representing the top
five branches of the cluster tree. Three major multinuclei thalamic profiles were determined each
composed of five to eight nuclei, while the two last profiles were composed only either of the anterior
dorsal nucleus (AD) or the nucleus reuniens (RE). The heatmap allows to visualize the correlations
between the nuclei in a different way and to observe the three main clusters resulting from this
analysis on the most differentially expressed genes (Figure 5).
Then, a dimensional reduction analysis via PCA has been done to observe the patterns of gene
expression variation between the profiles. This analysis shows six significant components with
the two first PCs explaining more than 50% of the variation of the total variance and thus allowing
both a great separation of the functional nuclear profiles and to establish an ordered relationship
of increasing molecular differences from primary to secondary to tertiary profiles (Figure 6). In
addition, the comparison of how the PCs of gene expression relate to the profiles obtained with
hierarchical clustering to how the PCs of gene expression relate to the projection targets shows that
the PCs were explained by the profiles rather than the projection targets (Figure 7). Therefore, this
assesses that this new approach and way to classify the thalamus nuclei into profiles determined via
molecular analysis allows to explain more variance at the gene expression level than the one based
on projection targets.
6
4.3 Data analysis 2
The new analyses are based on a different approach to compute the 500 most differentially expressed
genes, the first main step of the reproduced analyses. Therefore, a different preprocessing was used
resulting of 10’246 genes being kept from the 16’538 for which transcriptomic data was available.
That’s because 6’292 genes were not present in minimum 90% of the samples.
The replicates in this study being genetically identical mice, the standard deviation of the log-
cpm should decrease steadily as a function of the mean before asymptote at a moderate level
corresponding to a biological coefficient of variation of about 10% [
9
]. This is what can be mainly
observed on Figure 8 with preprocessing (right) even if the trend doesnt show the asymptotic part.
The preprocessing step is thus assessed by this plot as we can see that without this step (Figure 8,
left), the standard deviation doesnt decrease in function of the count’s mean due to genes with a low
mean expression .
Diagnostic plots show the log-fold-change in function of the average log-expression for 9 randomly
selected nuclei out of the 22 (Figure 9). They allow to observe that there is no correlation between
those two features, which is important as the fold change should be a relative value and not be
correlated to the gene expression. Moreover, with the top 500 most differentially expressed genes
highlighted in red, we can see that its not all the most differentially expressed genes of a certain
nuclei that are the ones selected. Those two results assess the use of linear models to determine the
gene expression fold change and obtain the most differentially expressed genes across the 22 nuclei.
The correlation matrices of the nuclei obtained with those 500 most differentially expressed genes
or with the 500 genes obtained after the fitting to the negative binomial distribution have been
compared via a chi-squared test. This test revealed statistical significant differences between the
two correlation matrices with a P value around 3.4e-12. Therefore, a validation of this new approach
for differential gene expression analysis was necessary and has been performed by reproducing
the previous analyzes using this time the 500 most differentially expressed genes obtained with the
approach of variance modeling at the observational level.
4.4 Validation
We can observe with the dendrogram based on Spearmans correlation and complete linkage that
now four different clusters are present (instead of five using negative binomial distribution fitting
for the computation of the most differentially expressed genes) (Figure 10). The nucleus reuniens
(RE) is not anymore forming a profile on its own but its clustering with the secondary profile. The
primary and tertiary profiles remained unchanged, as well as the anterior dorsal (AD) nucleus that is
forming an independent profile. The four new profiles can also be observed in the new heatmap
(Figure 11) even if most of the nuclei of the secondary profile are also strongly correlated either
with the primary profile or the tertiary one. It’s also interesting to see that the correlation between
nuclei is globally higher than in the results obtained in the first data analysis part. This could be
explained by the preprocessing step of this second analysis that was more restrictive and discarded
some low expressed genes (if they were not expressed in more than 90% of the samples) that could
be responsible of high variance between nuclei, but also responsible of errors due to the higher
significance of noise in the variance modeling or distribution fitting. Then, although the percentage
of variance explained by each of the two first PCs, after the principal component analysis, is different
and a bit smaller (probably for the same reasons as the ones for the higher correlation), the pattern
7
that is observed is quite similar as when fitting the gene expression to negative binomial distributions
(Figure 12). Indeed, we can notice that the nucleus (RE) that is belonging to a different profile than
in the first analysis is not totally included in its new profile and therefore could again form a profile
on its own based on those results.
Therefore, we can conclude that, in this case, even if the correlation matrices are different and the
hierarchical clustering leads to different results, performing differential gene expression analysis by
fitting counts to a negative binomial generalized linear model where each factor level represents a
different thalamic nucleus or fitting a linear model for each gene to estimate the fold changes and
standard errors lead to comparable results. Indeed, the difference observed concerned only one of
the 22 thalamic nuclei and with the PCA it has been noticed that in fact this nucleus stand in the
same order as in the first analysis allowing to still establish an ordered relationship of increasing
molecular differences between the nuclei and their respective profiles.
5 Discussion and Conclusions
The thalamus is a central-hub for communication in the brain and regulates different states such
as consciousness. Thats why its of great interest to increase the knowledge on the properties and
organization of its pathways. Multiple analyses at different levels have thus been performed but
the molecular architecture of the thalamus was less described than for other regions of the brain.
Therefore, in this study, the aim was to bring insights on the organization of the thalamic pathways
in an unsupervised way and via analyses at the molecular level based on gene expression of 22 nuclei.
It allowed to determine three major multinuclei thalamic gene expression profiles and it can be
observed that in fact, nuclei included in those molecular profiles lie in a topographical arrangement
(Figure 13). The PCA on the 500 most differentially expressed genes separated as well the nuclei
by the profiles determined with hierarchical clustering. Those two consistent results allowed to
establish an ordered relationship of increasing molecular difference from the primary to the tertiary
profiles. Eventually, the additional analyses permitted to observe that close results are obtained with
a distinct approach of computing the most differentially expressed genes. Even if some differences
appeared that are probably due to a more restricted preprocessing step, the three main profiles
were almost unchanged and more importantly, the ordered molecular relationship observed in the
PCA between the nuclei and profiles was conserved. Consequently, this molecular organization,
consisting of those three main and ordered subgroups of thalamic nuclei, gives a first insight on the
architecture and organization of the thalamus and could be used in further studies to increase the
understanding of this central part of the brain.
8
0 1 2 3 4 5
Boxplot of PVT_AAV.st_1
log10 read counts
012345
Boxplot of IMD_AAV.st_3
log10 read counts
0 1 2 3 4 5 6
Boxplot of CM_AAV.st_2
log10 read counts
0123456
Boxplot of CL_AAV.st_1
log10 read counts
0123456
Boxplot of PF_AAV.st_2
log10 read counts
012345
Boxplot of VA_DiI.m1_2
log10 read counts
012345
Boxplot of VA_DiI.m1_6
log10 read counts
012345
Boxplot of SPA_AAV.st_4
log10 read counts
0 1 2 3 4 5
Boxplot of VM_DiI.m1_3
log10 read counts
Figure 1: Boxplots representing the log10 read counts of 9 randomly selected samples.
Histogram of Gm37329
Counts of Gm37329
Frequency
0 100 200 300 400 500 600
0 5 10 15 20
Histogram of Mrpl15
Counts of Mrpl15
Frequency
0 500 1000 1500
0 2 4 6 8 10 14
Histogram of Gm37895
Counts of Gm37895
Frequency
0 200 400 600 800 1000
0 2 4 6 8 10 12
Histogram of Mrps9
Counts of Mrps9
Frequency
0 200 400 600 800 1000 1200 1400
01234567
Histogram of Bcs1l
Counts of Bcs1l
Frequency
0 200 400 600
0 5 10 15 20 25
Histogram of Gm28941
Counts of Gm28941
Frequency
0 200 400 600
0 5 10 20 30
Histogram of Ube2f
Counts of Ube2f
Frequency
0 500 1000 1500 2000
0 5 10 15
Histogram of Mgat5
Counts of Mgat5
Frequency
0 2000 4000 6000 8000
0 2 4 6 8 10
Histogram of Dcaf6
Counts of Dcaf6
Frequency
0 1000 2000 3000 4000
0 2 4 6 8 10 12
Figure 2: Distribution of gene expression across the 120 samples for 9 randomly selected genes.
9
−50 0 50
−30 −20 −10 0 10 20 30
PC1: 18.1% variance
PC2: 4.8% variance
PVT PVT
PVT
IMD
IMD
IMD
IMD
PCN
CM_
PCN
CM_
PCN
CM_
PCN
CM_
CL_
CL_
CL_
CL_
PF_
PF_
PF_
PF_
VA_
VL_
VA_
VL_
VA_
VM_ VM_
VA_
VA_
SPA
SPA SPA
SPA
AM_
RE_
AM_
RE_
VM_
VM_
VM_
VM_
VL_
VL_
VL_ VL_
VA_
VA_
VA_
CM_
CM_
CM_ AM_
RE_
AM_
RE_
AD_
AD_
AD_
AD_
IAD IAD
IAD AV_
AV_
AV_
AV_
AV_
AM_
RE_
RE_
RE_
RE_ LP_
LP_
VPM
VPM
MD_
MD_
MD_
MD_
MD_
MD_
LD_
LD_
LD_
VPM
VPM
VPM
VPM
PO_
PO_
PO_
PO_
LGd
LGd
LGd
LGd
LGd
LP_
LP_LP_
LP_
LP_ LP_
LP_
VB_
VB_
VB_
VB_
PO_
PO_
PO_
AV_
AV_
AV_
LD_
LD_
Figure 3: PCA on the 120 samples using all the genes. The 120 samples are named in function of their
nucleus.
0.0 0.1 0.2 0.3 0.4 0.5
CL
PF
CM
PCN
SPA
IMD
PVT
RE
VPMpc
MD
LP
PO
AM
IAD
VA
VM
AD
AV
LD
VL
LGd
VB
Figure 4: Dendrogram of hierarchical clustering on the top 500 most differentially expressed genes
using Spearmans correlation and complete linkage defining the major profiles as the top five branches.
10
AD
AV
LD
VL
LGd
VB
VPMpc
MD
LP
PO
AM
IAD
VA
VM
RE
CL
PF
CM
PCN
SPA
IMD
PVT
AD
AV
LD
VL
LGd
VB
VPMpc
MD
LP
PO
AM
IAD
VA
VM
RE
CL
PF
CM
PCN
SPA
IMD
PVT
ρ(Spearman)
0.4
0.6
0.8
1
Figure 5: Correlation heatmap on the top 500 most differentially expressed genes of the nuclei labeled
in function of the profiles determined with the cluster dendrogram.
−30 −20 −10 0 10 20
−20 −10 0 10
PC1: 38.7% variance
PC2: 15% variance
PVT
PVT
PVT
IMD
IMD
IMD
IMD
PCN
CM
PCN
CM
PCN
CM
PCN
CM
CL
CL
CL
CL
PF
PF PF
PF VA
VL
VA
VL
VA
VM VM
VA VA
SPA
SPA
SPA
SPA
AM
RE
AM
RE
VM
VM
VM
VM
VL VL VL
VL
VA
VA
VA
CM
CM
CM
AM
RE
AM
RE
AD
AD
AD
AD
IAD IAD
IAD
AV
AV AV
AV
AV
AM
RE
RE
RE
RE
LP
LP
VPMpc
VPMpc
MD
MD
MD
MD
MD
MD
LD
LD
LD
VPMpc
VPMpc
VPMpc
VPMpc
PO PO
PO
PO
LGd
LGd
LGd
LGd
LGd
LP
LP
LP
LP
LP LP
LP
VB
VB
VB
VB
PO
PO
PO
AV
AV
AV
LD
LD
Primary
Secondary
Tertiary
Figure 6: Plot of the two first PCs resulting from the PCA. The 120 samples are named in function of
their nucleus and labeled in function of their profiles determined with the hierarchical clustering.
11
0
50
100
150
123456
PC
F−statistic
group
profiles
projections
Figure 7: Bar plot showing how the PCs of gene expression relate to the profiles from hierarchical
clustering and projection targets.
0 5 10 15
0.5 1.0 1.5
5 10 15
0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8
Figure 8: Mean-variance trend and relationship without (left) or with (right) preprocessing.
12
0 5 10
−5 0 5 10
AM
Average log−expression
log−fold−change
0 5 10
−5 0 5 10
AV
Average log−expression
log−fold−change
0 5 10
−5 0 5 10
CL
Average log−expression
log−fold−change
0 5 10
−5 0 5 10
CM
Average log−expression
log−fold−change
0 5 10
−5 0 5 10
IAD
Average log−expression
log−fold−change
0 5 10
−10 −5 0 5 10
IMD
Average log−expression
log−fold−change
0 5 10
−10 −5 0 5 10
LD
Average log−expression
log−fold−change
0 5 10
−10 −5 0 5 10
LGd
Average log−expression
log−fold−change
0 5 10
−5 0 5 10
LP
Average log−expression
log−fold−change
Figure 9: Mean-Average (MA) plot (expression level vs. log fold change) of differential expression for 9
randomly selected nuclei. Top 500 most differentially expressed genes are highlighted in red.
0.0 0.1 0.2 0.3 0.4 0.5 0.6
CM
PCN
CL
PF
SPA
IMD
PVT
RE
AM
IAD
VA
VM
VPMpc
MD
LP
PO
AD
AV
LD
VL
LGd
VB
Figure 10: Dendrogram of hierarchical clustering on the top 500 most differentially expressed genes
(obtained with voom) using Spearmans correlation and complete linkage defining the major profiles
as the top four branches.
13
AD
AV
LD
VL
LGd
VB
AM
IAD
RE
VA
VM
MD
LP
PO
VPMpc
CL
PF
CM
PCN
SPA
IMD
PVT
AD
AV
LD
VL
LGd
VB
AM
IAD
RE
VA
VM
MD
LP
PO
VPMpc
CL
PF
CM
PCN
SPA
IMD
PVT
ρ(Spearman)
0.4
0.6
0.8
1
Figure 11: Correlation heatmap on the top 500 most differentially expressed genes (obtained with
voom) of the nuclei labeled in function of the profiles determined with the cluster dendrogram.
−10 0 10 20
−20 −10 0 10 20 30
PC1: 25.9% variance
PC2: 22% variance
PVT PVT
PVT
IMD
IMD
IMD
IMD PCN
CM PCN
CM
PCN
CM PCN
CM
CL CL
CL
CL
PF
PF PF
PF
VA
VL
VA
VL
VA
VM
VM
VA VA
SPA SPA
SPA
SPA AM
RE
AM
RE VM
VM
VM
VM
VL VL VL
VL
VA
VA VA
CM
CM
CM
AM
RE
AM
RE
AD
AD
AD
AD
IAD IAD
IAD
AV
AV AV
AV
AV
AM
RE
RE
RE
RE
LP
LP
VPMpc
VPMpc
MD
MDMD
MD
MD
MD
LD
LD
LD
VPMpc
VPMpc
VPMpc
VPMpc
PO PO
PO
PO
LGd
LGd
LGd
LGd
LGd
LP
LP LP
LP
LP LP
LP
VB
VB
VB
VB
PO
PO
PO
AV
AV
AV
LD
LD
Primary
Secondary
Tertiary
Figure 12: Plot of the two first PCs resulting from the PCA. The 120 samples are named in function of
their nucleus and labeled in function of their profiles determined with the hierarchical clustering.
14
Figure 13: From [1]. Topological localization of gene expression profiles in the thalamus.
15
References
[1]
J. W. Phillips, A. Schulmann, E. Hara, J. Winnubst, C. Liu, V. Valakh, L. Wang, B. C. Shields,
W. Korff, J. Chandrashekar, A. L. Lemire, B. Mensh, J. T. Dudman, S. B. Nelson, and A. W. Hantman,
A repeated molecular architecture across thalamic pathways, vol. 22, no. 11, pp. 1925–1935.
Number: 11 Publisher: Nature Publishing Group.
[2]
K. Hwang, M. A. Bertolero, W. B. Liu, and M. D’Esposito, “The human thalamus is an integrative
hub for functional brain networks, vol. 37, no. 23, pp. 5594–5607.
[3]
R. Yuan, X. Di, P. Taylor, S. Gohel, Y.-H. Tsai, and B. B. Biswal, “Functional topography of the
thalamocortical system in human, vol. 221, no. 4, pp. 1971–1984.
[4]
E. G. Jones, “The thalamic matrix and thalamocortical synchrony, vol. 24, no. 10, pp. 595–601.
Publisher: Elsevier.
[5]
S. M. Sherman and R. W. Guillery, The role of the thalamus in the flow of information to the
cortex., vol. 357, no. 1428, pp. 1695–1708.
[6]
“Why do we use the negative binomial distribution for RNAseq?. Library Catalog: bridges-
lab.sph.umich.edu.
[7]
Y. Chen, D. McCarthy, M. Ritchie, M. Robinson, and G. Smyth, “edgeR: differential analysis of
sequence read count data users guide, p. 122.
[8]
R. Liu, A. Z. Holik, S. Su, N. Jansz, K. Chen, H. S. Leong, M. E. Blewitt, M.-L. Asselin-Labat, G. K.
Smyth, and M. E. Ritchie, Why weight? modelling sample and observational level variability
improves power in RNA-seq analyses, vol. 43, no. 15, p. e97.
[9]
C. W. Law, Y. Chen, W. Shi, and G. K. Smyth, “voom: precision weights unlock linear model
analysis tools for RNA-seq read counts, vol. 15, no. 2, p. R29.