In this project, we were given a heterogenous dataset of mouse brain cells and the aim was to analyze this sample. First, we determined how many cell types were in our given dataset. Secondly, we found out which marker genes were characterizing our different cell types. Furthermore, we identified what cell types were likely present in our dataset with the help of the mouse brain website. Finally, we determined how many genes were expressed per cell and per cell type.
Loading the packages needed, loading the raw data & formatting the data to be in the form of a Seurat object.
require(Seurat)
require(dplyr)
require(ggplot2)
#Colors used for the plots
color9=c("tomato","orange","chartreuse4","green3","gray3","mediumorchid3","royalblue3",'red','yellow')
color8=c("tomato","orange","chartreuse4","green3","gray2","mediumorchid3","royalblue3",'red')
data = read.table("MOUSE_BRAIN_DATASET_1_COUNTS.tsv", header = TRUE)
data_Seurat = CreateSeuratObject(data, project = "Mouse Brain", assay = "RNA",
min.cells = 1, min.features = 1)
The following steps are performing the pre-processing needed for scRNA-seq data analysis using Seurat. It will select and filtrate the cells based on QC metrics. Then the data will be normalized and scaled. Finally, The highly variable features will be identified.
head(data_Seurat@meta.data, 5)
## orig.ident nCount_RNA nFeature_RNA
## Cell.1 Mouse Brain 587 344
## Cell.2 Mouse Brain 445 222
## Cell.3 Mouse Brain 1632 1054
## Cell.4 Mouse Brain 743 512
## Cell.5 Mouse Brain 1124 739
The percentage of mitochondrial reads assesses the quality of the cells. Cells with more than 20% of mitochondrial genes expressed need to be filtered as it would mean that those cells are apoptotic or lysing cells.
data_Seurat[["percent.mt"]] <- PercentageFeatureSet(data_Seurat, pattern = "^mt-")
VlnPlot(data_Seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
The QC metrics indicate that, overall, the quality of the reads is good, even if some cells may have abnormally high amount of RNA compared to the number of gene expressed. It doesn’t seem necessary to subset the data based on the unique feature count or on the pourcentage of mitochondrial DNA as this latter is way below 20%.
In order to only keep the cells that have a great quality, we can make a scatter plot between the number of gene expressed per cell (nFeature_RNA) and the total amount of RNA per cell (nCount_RNA). We expect that nFeature_RNA is linear as a function of nCount_RNA.
FeatureScatter(data_Seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
Following those analysis plots, we decided to subset the data to keep only cells that have less than 20’000 RNA counts. Indeed, the coefficient of correlation \(R^{2}\) is equal to 0.91 and it might be better if we filter the cells that have more than 20’000 RNA counts.
data_Seurat <- subset(data_Seurat, subset = nCount_RNA < 20000)
After the subset of the data, check the new scatter plot.
FeatureScatter(data_Seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
After the filtration of the cells with more than 20’000 RNA counts, the coefficient \(R^{2}\) is now equal to 0.95. This result indicates that the filtration worked fine.
To summarize, we did not need to remove any cells based on their mitochondrial content, and we removed cells with more than 20’000 RNA counts.
The normalization of the data is performed using a log transform.
data_Seurat <- NormalizeData(data_Seurat, normalization.method = "LogNormalize", scale.factor = 10000)
We calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e. they are highly expressed in some cells, and lowly expressed in others). We next identify the 10 most highly variable genes and eventually plot those variables genes.
data_Seurat <- FindVariableFeatures(data_Seurat, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(data_Seurat), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(data_Seurat)
LabelPoints(plot = plot1, points = top10, repel = TRUE,xnudge = 0,ynudge = 0)
Then, we scaled the data before performing a PCA. It enables to reduce the effect of highly expressed genes. This is done by setting the mean expression across cells to 0 and their variance to 1.
all.genes <- rownames(data_Seurat)
data_Seurat <- ScaleData(data_Seurat, features = all.genes)
## Centering and scaling data matrix
We then performed PCA on the scaled data.
data_Seurat <- RunPCA(data_Seurat, features = VariableFeatures(object = data_Seurat))
DimPlot(data_Seurat, reduction = "pca")
The results are plotted in the form of a heatmap. It allows to observe the genes with the highest PC values.
DimHeatmap(data_Seurat, dims = 1, cells = 500, balanced = TRUE)
The p values of the PCs are plotted in the form of a Jackstraw plot to observe which ones are significant. The PCs are also compared based on their percentage of variance explained in the form of an Elbow plot.
data_Seurat <- JackStraw(data_Seurat, num.replicate = 100)
data_Seurat <- ScoreJackStraw(data_Seurat, dims = 1:20)
JackStrawPlot(data_Seurat, dims = 1:15)
ElbowPlot(data_Seurat)
Here, it seems that the 10 first PCs are significantly contributing to the variability of the data based on the Elbow plot.
The next step is to find the different clusters. This is done by running the Louvain algorithm. As the 10 first PCs are the most informative components, the finding of the neighbors is done with those.
data_Seurat <- FindNeighbors(data_Seurat, dims = 1:10)
data_Seurat <- FindClusters(data_Seurat, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 1609
## Number of edges: 52563
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9184
## Number of communities: 9
## Elapsed time: 0 seconds
Then, in order to vizualize the different clusters that were found by the algorithm, we run a UMAP dimensionality reduction. UMAP preserves local data structure: the points close to each other in high-dimensional data set tend to be close to each other in low-dimension. As a consequence, UMAP is better than PCA to preserve local structure (and so clusters). In addition this plot is better than t-SNE to preserve the global structure, and it runs faster.
data_Seurat <- RunUMAP(data_Seurat, dims = 1:10)
DimPlot(data_Seurat, reduction = "umap", cols = color9)
The Louvain algorithm has found 9 different clusters. Nevertheless, there is not a clear separation between the cluster 2 and the cluster 6, which might mean that the algorithm has tried to cluster too much due to too many PC values taken into account. In order to determine whether they form two distinct clusters or a single, we can investigate the differential expression and compare their different biomarkers. If the biomarkers are the same, then we might consider that they form an unique cluster.
cluster1.markers <- FindMarkers(data_Seurat, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## Neurod1 2.102208e-177 2.644319 0.588 0.006 3.663938e-173
## Malat1 2.169034e-173 1.420688 0.997 0.999 3.780409e-169
## Gm2694 6.307059e-158 2.991680 0.778 0.160 1.099257e-153
## Cox8a 6.967026e-158 1.806281 0.971 0.838 1.214283e-153
## Ndufa4 6.822482e-130 1.728161 0.920 0.758 1.189090e-125
If the two clusters are different, they might express different marker genes.
cluster2.6.markers <- FindMarkers(data_Seurat, ident.1 = 2, ident.2 = 6, min.pct = 0.25)
head(cluster2.6.markers, n = 5)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## Tecr 3.559008e-35 -1.1489096 0.844 1.00 6.202995e-31
## Itm2b 9.204606e-34 -1.0787411 0.930 1.00 1.604271e-29
## Tuba1a 1.016999e-31 -0.9729299 1.000 1.00 1.772527e-27
## Ldhb 1.981397e-29 -0.8900984 0.668 0.99 3.453376e-25
## Apod 2.937089e-29 -0.7011571 0.995 1.00 5.119053e-25
This represent the five most differentially expressed genes between the cluster 2 and the cluster 6. Interestingly, pct1 is closed to pct2, meaning that the percentage of cells in the cluster 2(pct.1) expressing those 5 genes is relatively the same as the proportion of cells in cluster 6 expressing them. The adjusted p-values are significant, but quite high, especially when we compare them with the p-values of the biomarkers of cluster 1 (see above). However, we can explain this significance by the fact that the expression of these genes is very small, which might bias the p-values. At this point, we can say that these two clusters differs by only few-expressed genes and that they share the same biomarkers.
Then we can regroup those two clusters. In order to vizualize this, we can look at the expression heatmap.
data.markers <- FindAllMarkers(data_Seurat, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
data.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
In this case, we are plotting the top 10 markers (or all markers if less than 10) for each cluster.
top10 <- data.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(data_Seurat, features = top10$gene) + NoLegend()
We can observe than the heatmap of the biomarkers between cluster 2 and cluster 6 is similar, meaning that they form only one cluster. As a consequence, we need to modify the clustering algorithm so that it will merge the two clusters.
We also decided to observe whether the most differentially expressed genes from cluster 2 or cluster 6 are indeed also highly expressed in the other cluster, to ensure that cluster 2 and 6 are composed of cells from the same cell type.
FeaturePlot(data_Seurat, features = c("Gng11", "Ppp1r14a", "Ermn", "Ugt8a", "Fa2h", "Mog", "S100b", "Mobp", "Plp1"))
FeaturePlot(data_Seurat, features = c("Cdkn1c", "Apod", "Klk6", "Cldn11", "Gsn", "Mag", "Cnp", "Trf", "Aplp1"))
Therfore, we can see that the genes are highly expressed in both clusters, so they form one unique cluster.
Instead of taking the first 10 PCs, we decided to take into account only the 9th first ones.
data_Seurat <-FindNeighbors(data_Seurat,dims = 1:9)
data_Seurat <- FindClusters(data_Seurat, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 1609
## Number of edges: 53049
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9156
## Number of communities: 8
## Elapsed time: 0 seconds
data_Seurat <- RunUMAP(data_Seurat, dims = 1:9)
DimPlot(data_Seurat, reduction = "umap",cols=color8)
This time, the clustering seems to be more accurate. The different clusters are well separated.
data.markers <- FindAllMarkers(data_Seurat, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
top10 <- data.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(data_Seurat, features = top10$gene) + NoLegend()
The heatmap shows that the 8 clusters each express different genes, that appear as “gene markers”. This indicates that the clustering is consistent with the differential gene expression of the cells. Based on these genes, we can compare the clusters with a database, such as mousebrain.org, in order to associate each cluster with a cell type.
To find which cell type is asociated with each cluster, we computed the different marker genes in the Mouse Brain Atlas (and EnrichR for cluster 6), and we kept the cell types that corresponded the most by looking at the heatmap in the website. The results are the following.
Cell types:
We observed that it was difficult to determine the cell type of the cluster 6. Therefore, we decided to see how the most differentially expressed genes of this cluster are distributed in the other clusters.
FeaturePlot(data_Seurat, features = c("Meis2", "Nrgn", "Ly6h", "Chd3os", "Meg3", "Ahi1", "Nap1l5", "Pcp4", "6330403K07Rik"))
The distributions of the most differentially expressed genes of cluster 6 show indeed that some of those genes are also highly expressed in other clusters and are not fully specific to this cluster, which explains why it was more difficult to determine the cell type associated to this cluster.
new.cluster.ids <- c("Microglia", "Granule neurons", "Oligodendrocytes", "Serotonergic neurons", "Excitatory neurons", "Noradrenergic neurons", "Hypothalamus cells", "Schwann cells")
names(new.cluster.ids) <- levels(data_Seurat)
data_Seurat <- RenameIdents(data_Seurat, new.cluster.ids)
DimPlot(data_Seurat, reduction = "umap", label = TRUE, pt.size = 0.5, cols = color8,repel = TRUE,label.size = 5) + NoLegend()
freq_table = prop.table(x=table(data_Seurat@active.ident))
freq_table
##
## Microglia Granule neurons Oligodendrocytes
## 0.24300808 0.23244251 0.18645121
## Serotonergic neurons Excitatory neurons Noradrenergic neurons
## 0.12119329 0.08825357 0.06836544
## Hypothalamus cells Schwann cells
## 0.03231821 0.02796768
barplot(freq_table, col = color8, legend.text = TRUE,axisnames = FALSE)
Microglia and granule neurons are the most frequent cells type in our data set. Microglia are the immune cells in the central nervous system (CNS). They represent 5-20% of the brain’s glial cell population in normal condition but their density can increase in case of lesion, inflammation or cancer. These cells can also be involved in the regulation of neuronal activity. Oligodendrocytes and Schwann cells are responsible for the formation and maintenance of the myelin in CNS and in the peripheral nervous system respectively. Oligodendrocytes are adjacent to neuron cell bodies and can coat multiple axons. Granule cells are small neurons abundant in cerebellum cortex and in olfactory bulbs. Finally different type of neuron with different neurotransmitters are present such as serotonergic neurons, excitatory neurons and noradrenergic neurons. They transmit the information through electric signals. From the percentage found in our data set we can expect the cells to be from cerebellum.
Interestingly, the proportion of microglia is higher than expected. Indeed, we might have expected ~20% of microglia. This might be explained by different means. A first explanation could be due to a sampling bias corresponding to the filtration part: we might have removed some cells that are from other cell types, inducing an increase in the proportion of microglia. Another explanation could be due to a stress caused to the mice. Indeed, it has been shown that microglia act as neuroimmmune-sensors of stress and the amount of microglia increase [Matthew G. Frank et al. (2019)]. We could test it by analysing the release of immune markers of microglia such as the cytokine Il-1beta, in order to verify if the protocol used is stress-free. Unfortunately, such analysis cannot be done with the dataset. Thus, the explanations for this observation are hypothetical.
To verify that the clusters are composed of one unique cell population, we decided to do violin plots on top of the boxplots. Therefore, in addition to observe how many genes are expressed per cell type, we ensure that those cell type are not consisting of two different subpopulations.
df = data.frame("Cells" = names(data_Seurat@active.ident), "Clusters" = data_Seurat@active.ident,
"Detected_Genes" = sapply(names(data_Seurat@active.ident), function(x) sum(data[x] != 0)))
ggplot(df, aes(x = Clusters, y = Detected_Genes)) +
geom_violin()+
geom_boxplot(outlier.shape = 1,outlier.color='black',color=color8) +
stat_boxplot(geom ='errorbar',color = color8)
There is a wide disparity in the number of genes expressed per cell type. It could be because of the cell phenotype involving a higher number of genes expressed in one cell type than in other or due to the cell cycle state in which the cells are.
Some cell type shows many outliers in the boxplots. However, with the violin plot we can observe that those outliers are not a significative part of the cells forming the cluster.
Does the difference in gene expression of the cells come from their state in the cell cycle ?
To determine this we will use a scoring method from the Seurat package that allows to determine wether the cell is in G1, G2M or S phase of the cell cycle based on the expression of some of the genes that are key factors of those phases.
s.genes = c("Mcm5", "Pcna", "Tyms", "Fen1", "Mcm2", "Mcm4", "Rrm1", "Ung", "Gins2", "Mcm6", "Cdca7", "Dtl",
"Prim1", "Uhrf1", "Mlf1ip", "Hells", "Rfc2", "Rpa2", "Nasp", "Rad51ap1", "Gmnn", "Wdr76", "Slbp",
"Ccne2", "Ubr7", "Pold3", "Msh2", "Atad2", "Rad51", "Rrm2", "Cdc45", "Cdc6", "ExO1", "Tipin",
"Dscc1", "Blm", "Casp8ap2", "Usp1", "Clspn", "Pola1", "Chaf1b", "Brip1", "E2f8")
g2m.genes = c("Hmgb2","Cdk1","Nusap1","Ube2c","Birc5","Tpx2","Top2a","Ndc80","Cks2","Nuf2","Cks1b",
"Mki67","TmpO","Cenpf","Tacc3","Fam64a","Smc4","Ccnb2","Ckap2l","Ckap2","Aurkb","Bub1",
"Kif11","Anp32e","Tubb4b","Gtse1","Kif20b","Hjurp","Cdca3","Hn1","Cdc20","Ttk","Cdc25c",
"Kif2c","Rangap1","Ncapd2","Dlgap5","Cdca2","Cdca8","Ect2","Kif23","Hmmr","Aurka","Psrc1",
"Anln","Lbr","Ckap5","Cenpe","Ctcf","Nek2","G2e3","Gas2l3","Cbx5","Cenpa")
data_Seurat <- CellCycleScoring(data_Seurat, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)
new_df = data.frame("Cells" = names(data_Seurat@active.ident), "Clusters" = data_Seurat$old.ident,
"Count_gene" = sapply(names(data_Seurat@active.ident), function(x) sum(data[x] != 0)),
"Phase" = data_Seurat$Phase)
ggplot(new_df, aes(x = Clusters, y = Count_gene)) +
geom_boxplot(outlier.shape = 1,outlier.color='black') +
geom_jitter(shape=16, position=position_jitter(0.2), aes(color = Phase)) +
stat_boxplot(geom ='errorbar')
percentData <- new_df %>% group_by(Clusters) %>% count(Phase) %>%
mutate(ratio=scales::percent(n/sum(n)))
ggplot(new_df, aes(x = Clusters, fill=Phase)) +
geom_bar(position = "fill") +
theme_classic() +
geom_text(data=percentData, aes(y=n,label=ratio),
position=position_fill(vjust=0.5))
From this barplot we can observe that all the clusters have cells in the three different states (G1, G2M, S) but with different proportion. It appears to be difficult to find a correlation between the cell cycle state of the cells in a cluster and the number of detected genes for the cells in a same cluster. Therefore, the difference in the number of detected genes between the clusters might be due to another phenomenon than the cell state, such as the size or the functionality of the cells for example, or it may also be because the scoring method is not classifying correctly the cells in their corresponding state.
By performing this analysis on the heterogenous dataset of mouse brain cells provided, we were able to cluster those cells in 8 different groups, each representing a cell type.
So, by applying some preprocessing and a dimensionality reduction analysis (PCA and then UMAP) on the data, it rendered possible to separate the cells in function of their gene expression. Then, by analyzing the gene expression of each group formed after the UMAP, we managed to detect the genes that were differentially expressed in each group. Those genes represent the biomarkers of the cell type associated to the cluster considered. Therefore, after investigation of the biomarkers using the Mouse Brain Atlas and EnrichR, we have determined that the initial dataset was composed of Microglia, Granule neurons, Oligodendrocytes, Serotonergic neurons, Excitatory neurons, Noradrenergic neurons, Hypothalamus cells and Schwann cells. Eventually, we observed that there was a wide disparity in the number of genes expressed between those cell type, which seemed not be due to the cell cycle state of the cells. We hypothesized that this difference is due to the various role of each cell type, which therefore involve different gene expression rate.