Genomics and bioinformatics course

Project report

Date: 29 May 2020

Title:

Molecular analysis of thalamic nuclei via unsupersived hierarchical clustering and PCA

based on the data from paper:

A repeated molecular architecture across thalamic pathways (J. Phillips et al.)[1]

Evan Béal

Professor.:

Prof. Jacques Rougemont

Teaching assistant:

Jane Yi

1 Summary

The thalamus plays an important role in a mammalian brain. It performs hub-like exchanges of

information via nerve ﬁbers that project out of it to the cerebral cortex in all directions. Many main

functions of the brain are thus directly involving the thalamus, which is considered as the central

communication hub of the forebrain. However, the organizational logic of thalamic projections has

remained abstract. Therefore, through unsupervised hierarchical clustering of the most differentially

expressed genes, obtained via gene expression ﬁtting of a negative binomial distribution across 22

distinct nuclei of the thalamus, this study allows to determine three major molecular proﬁles in

this region of the brain. This new approach of using molecular architecture for classiﬁcation of the

thalamus nuclei allows to obtain different results than the ones previously proposed based on the

output targets of each nucleus. It permits to identify an organization in a statistically unsupervised

manner based on transcriptomic data of a broad set of genes and describes a molecular framework

upon which an increasingly detailed understanding of the thalamus can be built. Another approach

to compute the most differentially expressed genes has been performed with the modelling of the

mean-variance relationship of the read counts in addition to the other analyzes. This also leads

to three major multinuclei thalamic gene expression proﬁles, with only differences for one of the

22 nuclei in the hierarchical clustering and no signiﬁcant variations in the ordered relationships

between the nuclei established with principal component analysis. Therefore, the general similarity

in the results obtained assesses the two methods tested for differential expression analysis on

transcriptomic data.

2 Background

The thalamus has several functions, such as the relaying of sensory signals, like motor signals, to

the cerebral cortex and the regulation of consciousness, sleep, and alertness. In addition, it’s wide

ranging connectivity makes it being the central communication hub of the forebrain [

]. Thus, it’s

role is essential in the brain, which render the understanding of its organization of great interest.

Therefore, as it’s known that cell-type specialization underlies the functioning of all understood

organs, to determine whether there are multiple types of thalamic projection neurons would allow to

give insights in the logic of how thalamocortical projections support cortical function and increase

our comprehension of this part of the brain.

Several approaches have been performed to classify the thalamic nuclei. Based on cytoarchitectural

and connectomic features, the thalamus can be divided into approximately 30 distinct nuclei [

Thus the classiﬁcation task aims to categorize those nuclei into subgroups in which there is a small

variance for nuclei of the same group, as they should share common features, and a large variance

for nuclei of different groups. The common approach was to group nuclei by the type of sensory

inputs that they receive, allowing to distinguish visual, auditory and somatosensory nuclei. This

approach was then extended to consider the output targets of each nucleus in order to be able to

also classify nuclei of non-sensory regions, allowing to deﬁne motor or prefrontal projection systems

for example. However, this classiﬁcation approach based on projection considers only one feature to

determine the group of each nucleus and therefore does not explain why multiple thalamic nuclei

exist within each group. So, different nuclei classiﬁed in the motor projection system for instance can

vary considerably in various relevant features. Therefore, several theories such as the core/matrix [

]

or ﬁrst/higher-order [

] theories have been proposed to categorize the thalamic projection nuclei.

Nevertheless, those theories have limitations. For example, signiﬁcant variations can be found in the

functional and anatomical properties of nuclei within each proposed class, some of those theories

are based almost exclusively on test realized in a sensory context or even some categorizations rely

only on a single anatomical feature although neurons vary in many more parameters [1].

Thus, to take into account the different parameters of a cell, the analyzes have to be done at a molec-

ular level. The molecular analyzes have already provided a principled way to characterize neural

populations in other brain regions, offering frameworks for functional networks understanding.

However, there is a lack of understanding at this level for thalamic nuclei. The cellular architecture

based on molecular analyses is either limited to a set of molecular markers or limited to a set of

thalamic nuclei, or both [

]. In addition, some molecular markers that have been identiﬁed were not

determined in an unsupervised manner, which therefore introduce potential bias and reduce the

causality links. For all those reasons, here the thalamic nuclei will be studied at the molecular level

using transcriptomic data to classify the nuclei into subgroups in an unsupervised manner.

Therefore, to understand the organization of the thalamus, anatomical and transcriptomic ap-

proaches have been combined to obtain read counts values for more than 16’000 genes and 120

samples from mice corresponding to 22 nuclei representing 8 projection targets. The thalamic nuclei

were retrogradely labeled from forebrain areas and the separate nuclei were microdissected. RNA

was then isolated from pooled cells (bulk populations of 14 to 200 cells, median value of 52 cells)

and sequenced. One advantage of the pooled-cell approach over a single cell approach is that, by

averaging over cells, it enables a greater sensitivity for detecting low expressing genes, with some of

those ones encoding for receptors or ion channels, which could be key features to separate the nuclei

into different groups. Consequently, the transcriptome proﬁling deﬁned via projection labeling

will be used to explore the major branches of thalamic molecular heterogeneity and identify major

subdivisions of nuclei through different analyses ranging from gene differential expression analysis

to principal component analysis, and including unsupervised hierarchical clustering.

3 Methods

3.1 Data analysis 1

In this section, we will get through the different methods of data analysis that have been used by the

authors and reproduced by myself to be able to obtain a hierarchical clustering of thalamic nuclei in

the form of a heatmap and to observe clusters of nuclei based on their molecular similarity. This was

done using the data available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE133911

and adapting the code available on github (https://github.com/aschulmann/ThalamoSeq).

Those methods can be reunited in three main groups. The ﬁrst one is the differential gene expression

analysis between the samples. Then, the second one consists of the hierarchical clustering to

determine a hierarchy of the different thalamic nuclei. Eventually, the third one will consist of a

dimensionality reduction through principal component analysis.

3.1.1 Differential gene expression analysis

Differential expression analysis has been performed using the package EdgeR (https://www.rdocumentation.

org/packages/edgeR/versions/3.14.0). The process starts with a preprocessing step that aims to

remove the genes that were almost not expressed across the samples. To be kept, the genes had to

have a transcripts per million value greater than 10 in at least three samples out of the 120. Therefore,

the data set consisted of 120 samples representing 22 nuclei and 16’538 genes. Then, the counts

were ﬁtted to a negative binomial generalized linear model, where each factor level represented one

of the 22 different thalamic nuclei. Fitting the data with a negative binomial distribution is useful

for over-dispersed count data making it a commonly used distribution to ﬁt RNA-seq data. The

reasons why this distribution is often the ﬁrst one considered is that since reads are count based, they

can’t be normally distributed without processing such as log-transformation (we can’t have negative

values or have ﬂoating-point numbers). Therefore, two distributions for count based data can be

taken into account, which are Poisson (that presumes the variance and mean of gene expression

in our case to be equal) or negative binomial (which does not make this assumption) [

]. Negative

binomial distribution will especially be useful for discrete data over an unbounded positive range

whose sample variance exceeds the sample mean. Therefore, the observations are overdispersed

with respect to a Poisson distribution, for which the mean is equal to the variance. Hence a Poisson

distribution is not an appropriate model. Since the negative binomial distribution has one more

parameter than the Poisson, the second parameter can be used to adjust the variance independently

of the mean. That’s why the negative binomial distribution is commonly used to model data in the

form of discrete sequence read counts from high-throughput RNA and DNA sequencing experiments

[7].

After that the counts have been ﬁtted to a negative binomial generalized linear model, the 500 most

differentially expressed genes between any thalamic nuclei have been determined using a one-way

analysis of deviance (ANODEV) for each gene. This allowed to test for differences between any of

the 22 nuclei and to select the genes with the lowest P value representing the most differentially

expressed ones. Eventually, the count data have been transformed, through variance stabilizing

transformation and normalization, to become a matrix of values approximately homoscedastic

(having constant variance along the range of mean values).

3.1.2 Hierarchical clustering

Hierarchical clustering was then performed on the approximately homoscedastic matrix where each

nucleus represented the mean value of its samples. This was done using Spearman’s correlation as a

distance metric and complete linkage for agglomeration. Groups were deﬁned by splitting the tree at

the level of ﬁve branches, which represent what they termed "proﬁles".

Using Spearman’s correlation allows to assess how well the relationship between two variables can

be described by means of monotonic relationships. A Spearman correlation of +1 or

−

1 occurs when

each of the variables is a perfect monotone function of the other. Therefore, it allowed to determine

how the 22 thalamic nuclei were close to each other by comparing their expression proﬁle of the

500 genes selected. Thus, after having determined those values it was possible to perform complete-

linkage clustering, which is one of several methods of agglomerative hierarchical clustering. At the

beginning of the process, each element is in a cluster of its own. The clusters are then sequentially

combined into larger clusters until all elements end up being in the same cluster.

3.1.3 Principal Component Analysis (PCA)

The last main part of the analyses performed consisted of a dimensionality reduction through PCA.

PCA is a common technique to visualize genetic distance and relatedness between populations.

Therefore, in this case the PCA is used on the 120 samples to observe their variation between

each other and has been calculated through singular-value-decomposition and on the 500 most

differentially expressed genes.

The relation of the ﬁrst PCs to the proﬁles obtained with the hierarchical clustering has eventually

been assessed by analysis of variance (ANOVA), to determine the ratio of between and within group

variance, and evaluated via an F-test.

3.2 Data analysis 2

For the additional ﬁgures, instead of deﬁning the most differentially expressed genes through a

negative binomial ﬁtting of the gene expression, those ones have been determined via the modelling

of the mean-variance relationship of the read counts. This has been done with the Limma package

(https://www.rdocumentation.org/packages/limma/versions/3.28.14) that is well known for data

analysis, linear models and differential expression analysis of microarray data. However, it can also

be used for count data and therefore gene expression analysis [8].

The ﬁrst step of the process to determine the most differentially expressed genes through this

approach was to perform a preprocessing of the data to keep only genes that are expressed in

minimum 90% of the samples (108/120). Then, the mean-variance relationship was modelled with

precision weights approach called “voom” (variance modeling at the observational level). This

transformation uses the variance of genes to create weights for use in linear models. First, gene-wise

linear models are ﬁtted to the normalized log-count-per-million values accounting for all relevant

factors generating a residual deviation for each gene. A robust trend is then ﬁtted to the residual

standard deviations as a function of the average log-count for each gene. The ﬁtted log-cpm (from

the linear model ﬁt) for each observation is converted into a predicted count and using this value,

the standard deviation is interpolated from the trend. Eventually, the inverse squared predicted

standard deviation for each observation becomes the weight for that observation [

]. This allows to

obtain the mean-variance trend estimated for all genes.

The next step of the process was to estimate the fold changes and standard errors by ﬁtting a linear

model for each gene. Eventually, an empirical Bayes smoothing has been applied on the standard

errors, meaning that information from all genes has been borrowed to provide stable variance

estimation, and the 500 most differentially expressed genes between any thalamic nuclei have been

determined selecting the genes with the lowest P value.

This method is good when working with noisy data and dealing with several factors. In addition, it

has been shown than the variance modeling at the observational level approach has the best power

of methods that control the type I error rate as well as the lowest false discovery rate each time in

comparison with different methods for differential gene expression analysis and especially the one

used in the paper which is EdgeR glm ﬁt (negative binomial) [9].

4 Results

4.1 Data exploration

The ﬁrst part of the results concerns the data exploration. This part aims simply to determine what

the data looks like. Therefore, the count matrix representing the transcriptome of 120 samples that

came from 22 different thalamic nuclei has been examined. This has been done through three plots.

The ﬁrst one (Figure 1) allows to observe the summary of the read count for 9 randomly selected

samples with a boxplot. We can see that the median appears to be between a read count of 100

and 1000 for each sample and almost no outliers emerge with this log10 scale. The second plot

(Figure 2) represents the read counts distribution for 9 genes randomly selected. Even if those genes

have different expression pattern across the samples, it can be observed that this expression could

be ﬁtted using a negative binomial distribution, as the sample variance exceeds the sample mean.

Eventually, the third exploratory plot (Figure 3) is the result of a principal component analysis. It

can be seen that differences between samples of the same thalamic nucleus exist and that some

relatedness between samples of different nuclei can be observed with the ﬁrst principal component.

However, no clear pattern results from this analysis, assessing the role of the next part to determine

the 500 most differentially expressed genes in order to enhance similarity or dissimilarity between

samples.

4.2 Data analysis 1

The ﬁgures in the original paper corresponding to the ones reproduced are the ﬁgures 1C, 2A and

2B. The hierarchical clustering on the 500 most differentially expressed genes using Spearman’s

correlation as a distance metric and complete linkage for agglomeration leads to a dendrogram

(Figure 4). It allows to identify ﬁve subdivisions of nuclei across the thalamus, representing the top

ﬁve branches of the cluster tree. Three major multinuclei thalamic proﬁles were determined each

composed of ﬁve to eight nuclei, while the two last proﬁles were composed only either of the anterior

dorsal nucleus (AD) or the nucleus reuniens (RE). The heatmap allows to visualize the correlations

between the nuclei in a different way and to observe the three main clusters resulting from this

analysis on the most differentially expressed genes (Figure 5).

Then, a dimensional reduction analysis via PCA has been done to observe the patterns of gene

expression variation between the proﬁles. This analysis shows six signiﬁcant components with

the two ﬁrst PCs explaining more than 50% of the variation of the total variance and thus allowing

both a great separation of the functional nuclear proﬁles and to establish an ordered relationship

of increasing molecular differences from primary to secondary to tertiary proﬁles (Figure 6). In

addition, the comparison of how the PCs of gene expression relate to the proﬁles obtained with

hierarchical clustering to how the PCs of gene expression relate to the projection targets shows that

the PCs were explained by the proﬁles rather than the projection targets (Figure 7). Therefore, this

assesses that this new approach and way to classify the thalamus nuclei into proﬁles determined via

molecular analysis allows to explain more variance at the gene expression level than the one based

on projection targets.

4.3 Data analysis 2

The new analyses are based on a different approach to compute the 500 most differentially expressed

genes, the ﬁrst main step of the reproduced analyses. Therefore, a different preprocessing was used

resulting of 10’246 genes being kept from the 16’538 for which transcriptomic data was available.

That’s because 6’292 genes were not present in minimum 90% of the samples.

The replicates in this study being genetically identical mice, the standard deviation of the log-

cpm should decrease steadily as a function of the mean before asymptote at a moderate level

corresponding to a biological coefﬁcient of variation of about 10% [

]. This is what can be mainly

observed on Figure 8 with preprocessing (right) even if the trend doesn’t show the asymptotic part.

The preprocessing step is thus assessed by this plot as we can see that without this step (Figure 8,

left), the standard deviation doesn’t decrease in function of the count’s mean due to genes with a low

mean expression .

Diagnostic plots show the log-fold-change in function of the average log-expression for 9 randomly

selected nuclei out of the 22 (Figure 9). They allow to observe that there is no correlation between

those two features, which is important as the fold change should be a relative value and not be

correlated to the gene expression. Moreover, with the top 500 most differentially expressed genes

highlighted in red, we can see that it’s not all the most differentially expressed genes of a certain

nuclei that are the ones selected. Those two results assess the use of linear models to determine the

gene expression fold change and obtain the most differentially expressed genes across the 22 nuclei.

The correlation matrices of the nuclei obtained with those 500 most differentially expressed genes

or with the 500 genes obtained after the ﬁtting to the negative binomial distribution have been

compared via a chi-squared test. This test revealed statistical signiﬁcant differences between the

two correlation matrices with a P value around 3.4e-12. Therefore, a validation of this new approach

for differential gene expression analysis was necessary and has been performed by reproducing

the previous analyzes using this time the 500 most differentially expressed genes obtained with the

approach of variance modeling at the observational level.

4.4 Validation

We can observe with the dendrogram based on Spearman’s correlation and complete linkage that

now four different clusters are present (instead of ﬁve using negative binomial distribution ﬁtting

for the computation of the most differentially expressed genes) (Figure 10). The nucleus reuniens

(RE) is not anymore forming a proﬁle on its own but it’s clustering with the secondary proﬁle. The

primary and tertiary proﬁles remained unchanged, as well as the anterior dorsal (AD) nucleus that is

forming an independent proﬁle. The four new proﬁles can also be observed in the new heatmap

(Figure 11) even if most of the nuclei of the secondary proﬁle are also strongly correlated either

with the primary proﬁle or the tertiary one. It’s also interesting to see that the correlation between

nuclei is globally higher than in the results obtained in the ﬁrst data analysis part. This could be

explained by the preprocessing step of this second analysis that was more restrictive and discarded

some low expressed genes (if they were not expressed in more than 90% of the samples) that could

be responsible of high variance between nuclei, but also responsible of errors due to the higher

signiﬁcance of noise in the variance modeling or distribution ﬁtting. Then, although the percentage

of variance explained by each of the two ﬁrst PCs, after the principal component analysis, is different

and a bit smaller (probably for the same reasons as the ones for the higher correlation), the pattern

that is observed is quite similar as when ﬁtting the gene expression to negative binomial distributions

(Figure 12). Indeed, we can notice that the nucleus (RE) that is belonging to a different proﬁle than

in the ﬁrst analysis is not totally included in its new proﬁle and therefore could again form a proﬁle

on its own based on those results.

Therefore, we can conclude that, in this case, even if the correlation matrices are different and the

hierarchical clustering leads to different results, performing differential gene expression analysis by

ﬁtting counts to a negative binomial generalized linear model where each factor level represents a

different thalamic nucleus or ﬁtting a linear model for each gene to estimate the fold changes and

standard errors lead to comparable results. Indeed, the difference observed concerned only one of

the 22 thalamic nuclei and with the PCA it has been noticed that in fact this nucleus stand in the

same order as in the ﬁrst analysis allowing to still establish an ordered relationship of increasing

molecular differences between the nuclei and their respective proﬁles.

5 Discussion and Conclusions

The thalamus is a central-hub for communication in the brain and regulates different states such

as consciousness. That’s why it’s of great interest to increase the knowledge on the properties and

organization of its pathways. Multiple analyses at different levels have thus been performed but

the molecular architecture of the thalamus was less described than for other regions of the brain.

Therefore, in this study, the aim was to bring insights on the organization of the thalamic pathways

in an unsupervised way and via analyses at the molecular level based on gene expression of 22 nuclei.

It allowed to determine three major multinuclei thalamic gene expression proﬁles and it can be

observed that in fact, nuclei included in those molecular proﬁles lie in a topographical arrangement

(Figure 13). The PCA on the 500 most differentially expressed genes separated as well the nuclei

by the proﬁles determined with hierarchical clustering. Those two consistent results allowed to

establish an ordered relationship of increasing molecular difference from the primary to the tertiary

proﬁles. Eventually, the additional analyses permitted to observe that close results are obtained with

a distinct approach of computing the most differentially expressed genes. Even if some differences

appeared that are probably due to a more restricted preprocessing step, the three main proﬁles

were almost unchanged and more importantly, the ordered molecular relationship observed in the

PCA between the nuclei and proﬁles was conserved. Consequently, this molecular organization,

consisting of those three main and ordered subgroups of thalamic nuclei, gives a ﬁrst insight on the

architecture and organization of the thalamus and could be used in further studies to increase the

understanding of this central part of the brain.

0 1 2 3 4 5

Boxplot of PVT_AAV.st_1

log10 read counts

012345

Boxplot of IMD_AAV.st_3

log10 read counts

0 1 2 3 4 5 6

Boxplot of CM_AAV.st_2

log10 read counts

0123456

Boxplot of CL_AAV.st_1

log10 read counts

0123456

Boxplot of PF_AAV.st_2

log10 read counts

012345

Boxplot of VA_DiI.m1_2

log10 read counts

012345

Boxplot of VA_DiI.m1_6

log10 read counts

012345

Boxplot of SPA_AAV.st_4

log10 read counts

0 1 2 3 4 5

Boxplot of VM_DiI.m1_3

log10 read counts

Figure 1: Boxplots representing the log10 read counts of 9 randomly selected samples.

Histogram of Gm37329

Counts of Gm37329

Frequency

0 100 200 300 400 500 600

0 5 10 15 20

Histogram of Mrpl15

Counts of Mrpl15

Frequency

0 500 1000 1500

0 2 4 6 8 10 14

Histogram of Gm37895

Counts of Gm37895

Frequency

0 200 400 600 800 1000

0 2 4 6 8 10 12

Histogram of Mrps9

Counts of Mrps9

Frequency

0 200 400 600 800 1000 1200 1400

01234567

Histogram of Bcs1l

Counts of Bcs1l

Frequency

0 200 400 600

0 5 10 15 20 25

Histogram of Gm28941

Counts of Gm28941

Frequency

0 200 400 600

0 5 10 20 30

Histogram of Ube2f

Counts of Ube2f

Frequency

0 500 1000 1500 2000

0 5 10 15

Histogram of Mgat5

Counts of Mgat5

Frequency

0 2000 4000 6000 8000

0 2 4 6 8 10

Histogram of Dcaf6

Counts of Dcaf6

Frequency

0 1000 2000 3000 4000

0 2 4 6 8 10 12

Figure 2: Distribution of gene expression across the 120 samples for 9 randomly selected genes.

−50 0 50

−30 −20 −10 0 10 20 30

PC1: 18.1% variance

PC2: 4.8% variance

PVT PVT

PVT

IMD

PCN

CM_

PCN

CM_

PCN

CM_

PCN

CM_

CL_

PF_

VA_

VL_

VA_

VL_

VA_

VM_ VM_

VA_

SPA

SPA SPA

SPA

AM_

RE_

AM_

RE_

VM_

VL_

VL_ VL_

VA_

CM_

CM_ AM_

RE_

AM_

RE_

AD_

IAD IAD

IAD AV_

AV_

AM_

RE_

RE_ LP_

LP_

VPM

MD_

LD_

VPM

PO_

LGd

LP_

LP_LP_

LP_

LP_ LP_

LP_

VB_

PO_

AV_

LD_

Figure 3: PCA on the 120 samples using all the genes. The 120 samples are named in function of their

nucleus.

0.0 0.1 0.2 0.3 0.4 0.5

PCN

SPA

IMD

PVT

VPMpc

IAD

LGd

Figure 4: Dendrogram of hierarchical clustering on the top 500 most differentially expressed genes

using Spearman’s correlation and complete linkage deﬁning the major proﬁles as the top ﬁve branches.

LGd

VPMpc

IAD

PCN

SPA

IMD

PVT

LGd

VPMpc

IAD

PCN

SPA

IMD

PVT

ρ(Spearman)

0.4

0.6

0.8

Figure 5: Correlation heatmap on the top 500 most differentially expressed genes of the nuclei labeled

in function of the proﬁles determined with the cluster dendrogram.

−30 −20 −10 0 10 20

−20 −10 0 10

PC1: 38.7% variance

PC2: 15% variance

PVT

IMD

PCN

PF PF

PF VA

VM VM

VA VA

SPA

VL VL VL

IAD IAD

IAD

AV AV

VPMpc

PO PO

LGd

LP LP

Primary

Secondary

Tertiary

Figure 6: Plot of the two ﬁrst PCs resulting from the PCA. The 120 samples are named in function of

their nucleus and labeled in function of their proﬁles determined with the hierarchical clustering.

100

150

123456

F−statistic

group

profiles

projections

Figure 7: Bar plot showing how the PCs of gene expression relate to the proﬁles from hierarchical

clustering and projection targets.

0 5 10 15

0.5 1.0 1.5

log2( count size + 0.5 )

Sqrt( standard deviation )

voom: Mean−variance trend

5 10 15

0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8

log2( count size + 0.5 )

Sqrt( standard deviation )

voom: Mean−variance trend

Figure 8: Mean-variance trend and relationship without (left) or with (right) preprocessing.

0 5 10

−5 0 5 10

Average log−expression

log−fold−change

0 5 10

−5 0 5 10

Average log−expression

log−fold−change

0 5 10

−5 0 5 10

Average log−expression

log−fold−change

0 5 10

−5 0 5 10

Average log−expression

log−fold−change

0 5 10

−5 0 5 10

IAD

Average log−expression

log−fold−change

0 5 10

−10 −5 0 5 10

IMD

Average log−expression

log−fold−change

0 5 10

−10 −5 0 5 10

Average log−expression

log−fold−change

0 5 10

−10 −5 0 5 10

LGd

Average log−expression

log−fold−change

0 5 10

−5 0 5 10

Average log−expression

log−fold−change

Figure 9: Mean-Average (MA) plot (expression level vs. log fold change) of differential expression for 9

randomly selected nuclei. Top 500 most differentially expressed genes are highlighted in red.

0.0 0.1 0.2 0.3 0.4 0.5 0.6

PCN

SPA

IMD

PVT

IAD

VPMpc

LGd

Figure 10: Dendrogram of hierarchical clustering on the top 500 most differentially expressed genes

(obtained with voom) using Spearman’s correlation and complete linkage deﬁning the major proﬁles

as the top four branches.

LGd

IAD

VPMpc

PCN

SPA

IMD

PVT

LGd

IAD

VPMpc

PCN

SPA

IMD

PVT

ρ(Spearman)

0.4

0.6

0.8

Figure 11: Correlation heatmap on the top 500 most differentially expressed genes (obtained with

voom) of the nuclei labeled in function of the proﬁles determined with the cluster dendrogram.

−10 0 10 20

−20 −10 0 10 20 30

PC1: 25.9% variance

PC2: 22% variance

PVT PVT

PVT

IMD

IMD PCN

CM PCN

PCN

CM PCN

CL CL

PF PF

VA VA

SPA SPA

SPA

SPA AM

RE VM

VL VL VL

VA VA

IAD IAD

IAD

AV AV

VPMpc

MDMD

VPMpc

PO PO

LGd

LP LP

Primary

Secondary

Tertiary

Figure 12: Plot of the two ﬁrst PCs resulting from the PCA. The 120 samples are named in function of

their nucleus and labeled in function of their proﬁles determined with the hierarchical clustering.

Figure 13: From [1]. Topological localization of gene expression proﬁles in the thalamus.

References

[1]

J. W. Phillips, A. Schulmann, E. Hara, J. Winnubst, C. Liu, V. Valakh, L. Wang, B. C. Shields,

W. Korff, J. Chandrashekar, A. L. Lemire, B. Mensh, J. T. Dudman, S. B. Nelson, and A. W. Hantman,

“A repeated molecular architecture across thalamic pathways,” vol. 22, no. 11, pp. 1925–1935.

Number: 11 Publisher: Nature Publishing Group.

[2]

K. Hwang, M. A. Bertolero, W. B. Liu, and M. D’Esposito, “The human thalamus is an integrative

hub for functional brain networks,” vol. 37, no. 23, pp. 5594–5607.

[3]

R. Yuan, X. Di, P. Taylor, S. Gohel, Y.-H. Tsai, and B. B. Biswal, “Functional topography of the

thalamocortical system in human,” vol. 221, no. 4, pp. 1971–1984.

[4]

E. G. Jones, “The thalamic matrix and thalamocortical synchrony,” vol. 24, no. 10, pp. 595–601.

Publisher: Elsevier.

[5]

S. M. Sherman and R. W. Guillery, “The role of the thalamus in the ﬂow of information to the

cortex.,” vol. 357, no. 1428, pp. 1695–1708.

[6]

“Why do we use the negative binomial distribution for RNAseq?.” Library Catalog: bridges-

lab.sph.umich.edu.

[7]

Y. Chen, D. McCarthy, M. Ritchie, M. Robinson, and G. Smyth, “edgeR: differential analysis of

sequence read count data user’s guide,” p. 122.

[8]

R. Liu, A. Z. Holik, S. Su, N. Jansz, K. Chen, H. S. Leong, M. E. Blewitt, M.-L. Asselin-Labat, G. K.

Smyth, and M. E. Ritchie, “Why weight? modelling sample and observational level variability

improves power in RNA-seq analyses,” vol. 43, no. 15, p. e97.

[9]

C. W. Law, Y. Chen, W. Shi, and G. K. Smyth, “voom: precision weights unlock linear model

analysis tools for RNA-seq read counts,” vol. 15, no. 2, p. R29.