Evan BéalProf. Darlene Goldstein GWAS Project Applied Biostatistics August 2020
1
Genome wide association analysis for identification of single-nucleotide
polymorphisms correlated to coronary artery disease predisposition
Introduction
Genome Wide Association Studies (GWAS) include a series of techniques trying to establish
associations between loci and phenotypes, typically predispositions to diseases, which are qualitative
phenotypes, or to specific physical traits, that are quantitative phenotypes, such as height for instance.
Therefore, the genomic analyses are performed at a population level with an experimental group
expressing a particular phenotype and a control group that doesn’t express this phenotype. It’s
common to classify genetic variations in two groups in function of the frequency in a population of the
less common allele of a diploid genome, which is the Minor Allele Frequency (MAF). Thus, the genetic
variations are classified as rare variants when the MAF is below 1% and common variants for MAF
above 1%. GWAS are generally used to discover common disease variants with MAF above 5%. For rare
variants identification, techniques with higher specificity are needed, such as targeted sequencing, as
it’s harder to detect associations with those variants.
GWAS scan an entire species genome for association between up to millions of Single-Nucleotide
Polymorphisms (SNPs) and the given trait of interest. SNPs are single-nucleotide substitutions that can
be present in coding or noncoding regions, leading to synonymous or nonsynonymous mutations.
Nonsynonymous SNPs can be missense or nonsense, meaning that an amino acid codon has been
converted into a codon coding for a different amino acid or into a stop codon respectively. Both SNPs
in coding and noncoding regions can be associated with diseases through different mechanisms of
action. They are usually detected with SNP array technology enabling a very accurate identification.
Human genomes analyses of the past years allowed to create some very large SNPs libraries, where
samples are classified according to many different criteria that can be, for example, frequency of
expression or ancestry. Nowadays, more than hundreds of millions of SNPs have been identified with
each individual having roughly 4 to 5 millions SNPs in its own genome.
Using genomic and especially the SNPs data many applications exist, among which there are of course
the association studies (GWAS) but also linkage disequilibrium studies (LD, defined as nonrandom
(correlated) loci associations), haplotype mapping or even forensic DNA fingerprint identification.
Genomic information from individuals can thus result to different functions. One of the most
interesting being the clinical application aiming to notice predisposition for particular disease
phenotypes by identifying SNPs highly correlated to those specific phenotypes. Knowing those
information for each person would allow to perform precision medicine and design personalized
therapies based on the individual genetic background aspiring to increase the therapies response and
provide more efficient treatments to patients.
Hence, genome wide association analyses are aimed for detecting variants at genomic loci that are
associated with complex traits in the population and, in particular, at detecting associations between
common SNPs and common diseases in order to stratify the patients and provide personalized
treatments in a second step. However, GWAS can be subjected to different bias. Therefore, the sources
of those potential bias have to be identified and managed before performing the analysis and
concluding on the potential correlation of certain SNPs and the phenotype studied. The main sources
of bias are samples size and definition of control and experimental groups. Some factors such as the
age, the sex or the population stratification (systematic ancestry differences between cases and
controls) can appear to be confounding factors. This results in some SNPs that seem to be correlated
to the phenotype of interest but that are in fact linked to another factor and not directly related to the
targeted disease.
Evan BéalProf. Darlene Goldstein GWAS Project Applied Biostatistics August 2020
2
How GWAS are performed ?
Fundamentally, knowing p SNPs from n samples (individuals), a GWA analysis will fit p independent
univariate linear models using the number of minor alleles of each SNP as predictor of the trait of
interest. Then, to identify significant SNPs within a group, odds ratios (OR) are used. This statistic
quantifies the strength of the association between two events and corresponds to the ratio of the odds
of case for individuals having a specific allele and the odds of case for individuals who do not have that
same allele. If the OR is greater than one, it means that the allele frequency in the population harboring
the phenotype of study is much higher than in the population without this phenotype and it’s the
opposite for OR smaller than 1. Therefore, GWAS try to identify significant odds greater than 1,
suggesting an association of the SNP with the disease studied. To conclude about the null hypothesis
and determine if this latter has to be rejected or not, the p-value has to be calculated. This can be done
via a chi-squared test for instance. The p-values are then often displayed via a Manhattan-plot in log10
negative scale to observe the significant SNPs. To define the level to consider for the conclusion about
the significance of each SNP, sampling extension must be taken into account and especially in such
population studies where a very large number of statistical tests are performed. Indeed, the problem
of multiple comparisons has to be counteracted and this can be done using Bonferroni correction for
example.
Limitations of GWAS
From what has been said so far, it’s possible to identify the limitations of GWAS. First of all, only the
common variants can be taken into account. Therefore, rare variants with MAF inferior to 1% are not
considered, as well as some other common variants depending on the threshold chose. In fact, this is
because only SNPs having a pairwise LD coefficient higher than a certain threshold are selected. This
score indicates the power that a SNP1 has to predict the genotype of a SNP2 and vice-versa, meaning
that the presence of a SNP could be sufficient to explain the presence of some other linked SNPs that
are not tagged by sequencing, reducing the data to be obtained and considered for the analysis but
also reducing the identification of rare variants.
GWAS are thus useful to detect common disease variants when data for a large population are
available in order to reach significance. However, causality is in general difficult to certify as ORs for
GWAS are typically in the 1.2-1.5 range. In such cases, a variation could be significant but not the cause
of the phenotype because of LD bins. Indeed, it could be another linked SNP (high LD), located at
hundreds of kilobases for example, that is the cause of the targeted disease. Therefore, the SNPs not
targeted by GWAS could be the cause of missing heritability of some phenotypic traits. That’s why, to
discover rare disease variants, targeted sequencing, allowing to reach ORs typically greater than 2 and
rendering thereby causality more likely, must be performed.
GWA study
The present study aims to review the PennCATH1 study, attempting to identify novel loci associated
with Coronary Artery Disease (CAD) via GWAS. CAD is also known as Ischemic Heart Disease (IHD) and
is the most common type of heart disease. It is characterized by a narrowing and hardening of arteries
supplying blood to heart muscle, resulting in limited provision of blood and therefore oxygen leading
to chest pain, heart attacks or contributing to heart failures and arrhythmias. CAD is usually caused by
atherosclerotic plaque rupture and if all patients with plaque rupture or myocardial infarction have
coronary atherosclerosis there are only a few with coronary atherosclerosis that develop myocardial
infarction. This suggests that unique (genetic) factors are likely to predispose to plaque rupture or
myocardial infarction in coronary atherosclerosis. Therefore, identifying SNPs that are unique to
patients with CAD via a GWA analysis may allow to describe some of the unique factors in questions.
Evan BéalProf. Darlene Goldstein GWAS Project Applied Biostatistics August 2020
3
The original dataset used for the study contained genomic data of 3’850 individuals for the 22
autosomal chromosomes, from which only an European-origin subset of 1’401 individuals (with
existing associated clinical data) was isolated to reduce population stratification bias. In total, there
are data on 861’473 SNPs, as well as clinical data including age, sex, high-density lipoprotein level, low-
density lipoprotein level and triglyceride level (all molecular factors of risk for cardiovascular diseases)
as well as CAD status of the individuals.
Methods and primary results
Data pre-processing
The first important step of a GWA analysis is to pre-process the data in order to keep only the SNPs
and individuals of interest after different filtering. Thus, the genomic data are initially loaded and only
samples associated to clinical data are selected, as those data are necessary to complete further
analyses. Filtering methods are then applied to remove biological samples introducing bias in the
association study and increasing the number of false positive and false negative associations. As the
number of SNPs is very large, it’s of great importance to decrease as much as possible the percentage
of false positive or negative results, otherwise many markers could be mistakenly associated to a given
phenotype. Removing some samples leads generally to a minimal impact on the overall association
power when dealing with large datasets as in this case. In the contrary, the removal of few SNPs could
result in missed relationships, due to the search of the disease-to-markers association. Therefore, it’s
safer to remove individuals than SNPs but the ideal scenario remains to perform both types of filtering
and this is what has been done in this analysis.
The first filtering occurred at the SNP level. It allowed to remove SNPs that fail to meet some minimum
criteria that could be due to factors such as missing data, low variability or genotyping errors. The
primary criterion considered was to remove all the SNPs for which we didn’t have data for minimum
95% of the individuals. It’s a quality control of the DNA samples acquisition and permits to ensure that
the potential significative results obtained later on will be linked to the effect of the SNPs and not due
to missing values. This filtering reduced significantly the number of SNPs to study in the next analyses
as we can see on figure 1 that many SNPs have a call rate lower than the 95% threshold applied.
Another filtering criterion was based on the MAF. A low threshold of 1% was used, to remove the rarest
SNPs that were below this threshold and which could have decreased the causative power and reduced
the robustness of the analysis. Indeed, the associative clusters containing those very rare SNPs would
have been very small and only driven by few individuals. This first filtering step, using those two
thresholds on the call rate and the MAF, allowed to discard 203’287 SNPs, going from an initial total of
861’473 SNPs to 658’186 SNPs for 1401 individuals.
Figure 1
Histogram of the SNPs call rate.
The threshold was set to discard SNPs with a lower value
than 0.95.
The second filtering was applied at the sample level. This included to remove individuals due to missing
data, sample contamination or any ambiguity in the clinical data. To accomplish this filtering, once
again different thresholds were used, starting with a threshold on the sample call rate. To be kept,
each sample had to express more than 95% of the SNPs that were retained after the filtering at the
1e+01
1e+03
1e+05
0.00 0.25 0.50 0.75 1.00
Call rate
Nb. of SNPs (log)
All SNPs
Evan BéalProf. Darlene Goldstein GWAS Project Applied Biostatistics August 2020
4
SNP level. The second used criterion was the autosomal heterozygosity at single SNPs loci. This
criterion is another measure of DNA sample quality and an excessive or reduced level of heterozygosity
could respectively indicate contamination or inbreeding. This heterozygosity value is computed using
F statistic calculation in the form of: |F| = (1O/E), thus proportional to the inverse of the ratio of the
observed proportion of heterozygous genotypes (O) and the expected proportion of heterozygous
genotypes (E) for a given sample based on the MAF of all the SNPs kept for this sample. Therefore, a
threshold was applied and the samples were kept if their F value was below 0.1. Overall, no individual
was discarded using the filtering based on the call rate and heterozygosity.
As an additional per sample filtering, an identity-by-descent (IBD) analysis was performed to filter the
samples on relatedness criteria. This analysis allowed to study allele frequencies at a population level
using the IBD score to look at alleles similarities for all SNPs computed between pairs of individuals.
This score is the estimated kinship coefficients between alleles and is computed via the method of
moments technique. To enhance the performance of this analysis, only alleles that are independently
assorted and not correlated were considered meaning that highly correlated LD regions were
excluded. Therefore, using a LD cut-off value of 0.2, only 72’890 SNPs out of the 658’186 were
considered for the IBD analysis. This is approximately 7.5% to 11% of SNPs for each chromosome. The
IBD pairwise sample relatedness score was then computed and used to potentially remove iteratively
samples that were too similar. However, using a kinship cut-off of 0.1, no samples were finally removed
as we can see on figure 2 that all the IBD kinship coefficients are below the selected threshold of 0.1.
Therefore, at the end of the per sample filtering, no sample was removed based on the IBD analysis as
well as with the filtering on the call rate and heterozygosity. The data were thus always of 658’186
SNPs for 1401 individuals. The call rate and heterozygosity values of the individuals are summarized
on figure 3.
Figure 2
Histogram of the IBD coefficients for the tested
samples couples. The threshold was set to
discard samples with a higher value than 0.1.
Figure 3
Summary of samples call rate and
heterozygosity rate.
Before a third filtering, a dimensionality reduction analysis using principal component has been
performed to test for ancestry. The data provided for this study are normally homogeneous and have
been acquired on individuals that all have European origin, to avoid the introduction of population
stratification as a confounding factor. By doing a principal component analysis (PCA), it allows to test
for macro stratification and have a primary observation on the potential stratification of the data
considered. If stratification is detected, then, the differences in loci could be related to both migrations
or to diseases without clear evident distinction in the genomic data. As this analysis is to determine
only migration clusters separations at the continent level, a plot of the first two principal components
is usually enough to observe if some patterns resulting to different ancestry appear. The results are
shown on figure 4. As no clear outliers emerged, no samples were removed. However, we can see that
1e+01
1e+03
1e+05
0.000 0.025 0.050 0.075 0.100
IBD kinship coeff.
Counts (nb. of tested samples couples) [log]
0.31
0.32
0.33
0.97 0.98 0.99 1.00
Call rate
Heterozygosity rate
Evan BéalProf. Darlene Goldstein GWAS Project Applied Biostatistics August 2020
5
two clusters are present, although linked by some individuals, which could indicate that the data come
from two main sub-populations of European origin.
Figure 4
Plot of the two first PCs from the PCA to test for ancestry.
Eventually, the last filtering was performed on the control samples to check for the Hardy-Weinberg
equilibrium. This model states that allele frequencies in a population don’t change from generation to
generation in the absence of other evolutionary influences. The frequencies for a simple case of one
locus with two alleles (A and a) are f(A) = p and f(a) = 1 - p = q, meaning that the expected genotype
frequencies under the HWE are f(Aa) = 2pq for the heterozygotes and f(AA) = p2 and f(aa) = q2 for the
two different homozygotes possible. It’s a classical theory of population genetics and it’s necessary to
check it because if a sample doesn’t follow this model it means that minimum one of the following
assumptions is violated: no selection, absence of mutation, no gene flow, infinite population size and
random mating. Therefore, deviations from HWE for the control group could mean that there are
genotyping errors (calling errors) or could be an indication of population substructures. This filtering,
based on the deviance from equilibrium, is realized by doing a Pearson chi-squared test to measure
differences between expectations and observations of SNPs couples. For instance, the observed
frequency p of an allele can be computed by doing:
𝑝 = # !"#"$%&
'
((
)
*$%&'(+)
!"#"'$%&
'
((
)
*$%&
'
(+
)
*$%&'++)
.
From this value the frequency q can be retrieved and a 1 degree of freedom chi-squared test can be
done:
𝒳!= #
(𝑂 𝐸)!𝐸
. If a value above 3.84 would be obtained, the null hypothesis that the
population is in HWE would be rejected at a 5% significance level. In addition to this test, if the number
of loci was too large in respect to the number of samples and the chi-square assumptions didn’t hold,
a Fisher’s exact test was performed. A threshold on the p-value was then set at 10-6, a low value to
ensure to not remove SNPs that deviate from HWE by chance, and all the SNPs below this value were
removed. This discarded 1’296 SNPs leading to a dataset after pre-processing of 656’890 SNPs for 1401
individuals.
Data generation
After that the data have been pre-processed to remove SNPs based on different criteria, a second step
was necessary before performing the GWA analysis to deal with confounding factors and to extend the
data at key locations by imputing SNPs.
The first data generation consisted to perform a PCA, to compute and to save the 10 first PCs for each
sample. This is similar to the ancestry analysis but by taking into account more than the 2 first it allows
to perform the stratification analysis and observe potential sub-populations patterns on a larger scale.
The goal being to correct genomic relations between individuals that could be confounding factors
instead of SNPs link to a disease risk. As stratification is strictly related to data variance, which is what
is isolated and encoded by the principal components, a PCA will show those possible stratification.
0.05 0.00 0.05
0.02 0.00 0.02 0.04 0.06 0.08
Ancestry Plot
Principal Component 2
Principal Component 1
Evan BéalProf. Darlene Goldstein GWAS Project Applied Biostatistics August 2020
6
Therefore, by incorporating the first 10 PCs (a randomly chosen number usually recommended in
GWAS) along with clinical covariates it will be possible to produce a model adjusted for confounders
and notably to compare it with an unadjusted model.
The second data generation consisted to extend the analysis to other known SNPs, that were not
present yet in the dataset because they were either not genotyped or were removed with the previous
filtering. Indeed, SNPs identification is typically done using micro-array techniques that allow to
identify around 1 million SNPs. However, like said previously, each individual has generally
approximately 4 to 5 million SNPs. Therefore, not all the SNPs are acquired using micro-array
techniques and the GWA analysis won’t be performed on every SNPs of the individual. This is a problem
of particular importance in disease studies, since often a single SNP can be phenotypically causative as
well as in the process of establishing correct genome-to-disease risks relationships. Thus, it’s of great
interest to complete the dataset for the missing SNPs and it’s possible thanks to the existence of very
large genome related SNPs datasets covering populations of different geographical origin and
containing LD-bins relationships. Using those datasets and setting a threshold on the LD-bins
relationship, full genotypes can be inferred from the detected SNPs, estimating the missing SNPs.
Different imputation methods exist. In this case, imputation analysis is performed at the chromosomal
level, testing only on the same chromosome and reducing by this way the computational cost and the
number of non-tagged SNPs. This local analysis can be done by determining for all the data-bank
genomes the posterior probability of matching a target genome given the observed (measured)
genome at a local position. The missing and common SNPs between the observed and the targeted
genotypes are used as binary labels for linear regression classification and new SNPs are added to the
regression model until the satisfaction of some stopping criteria. Eventually, the last step is a quality
control to remove un-typed SNPs where the imputation rules can’t be satisfied and to filter the ones
with a low estimated MAF and low imputation accuracy.
Consequently, this imputation analysis is performed at the chromosomal level and more specifically
on the chromosome 16. This is justified because the gene coding for cholesteryl ester transfer protein
(CETP) is present on this chromosome. This protein is involved in the transfer of cholesteryl ester from
high density lipoprotein (HDL) to other lipoproteins and therefore is strictly linked to HDL
concentration, which is the selected phenotypic trait studied here. Imputations rules were estimated
for 197888 SNPs and the quality control removed 35’323 of those imputed SNPs adding eventually
162’565 SNPs to the dataset.
Genome-wide association analysis
With the previous steps, the data have been correctly loaded and pre-processed to remove SNPs that
weren’t following some criteria. In addition, some SNPs were imputed on the chromosome of interest,
leading eventually to a dataset that can be used to run the GWA analysis. Case-control GWAS can be
of two types. Either the association between an endophenotype (trait underlying a disease, here HDL-
concentration) and SNPs or the identification of SNPs with significant higher frequency in experimental
(case) group in respect to the control one can be studied. In our case, since we dispose of clinical data
where each patient is associated to measured endophenotypic values, the first type of analysis was
carried on.
Therefore, the GWA analysis consisted of regressing each SNP separately to the adjusted HDL
cholesterol concentration. Thus, HDL concentration had to be adjusted for confounding factors. Part
of those factors have already been filtered and studied previously to potentially remove some samples
for instance but additional parameters, such as the age, the sex and the 10 PCs computed during the
data generation step, have to be considered to adjust the model. This is of great importance knowing
that the age and the sex are indeed risk factors to HDL-cholesterol trait, which is associated to
Evan BéalProf. Darlene Goldstein GWAS Project Applied Biostatistics August 2020
7
cardiovascular diseases. After being adjusted, the HDL concentration had to be normalized as well. The
normalization is useful to prevent scale-induced bias and consisted of an inverse-normal
transformation allowing to remove extreme values (figure 5). Eventually, as the endophenotype for
this GWA analysis was the HDL concentration, all the individuals without this clinical data were
removed of the study resulting to 1309 samples left.
Figure 5
Histogram of HDL concentration before and after
inverse-normal transformation.
To build a statistical model taking into account all the per-sample SNPs of the dataset, logistic
regression was used. The regression was done using only a simple single additive model due to the
large amount of data and extremely small proportion of SNPs that are link with the outcome. This
model is suitable for binary outcomes as here the presence or absence of disease phenotype linked to
the continuous phenotypical variable (transformed HDL concentration). Therefore the model will be
of the following form:
𝔼
[ Y | xi X = (x1, ..., xn), C = (c1, ..., ck) ] =
𝛼
0 +
𝛽
i
×
xi +
𝜔
1
×
c1 +...+
𝜔
k
×
ck +
𝜀
#
with Yi representing the continuous phenotypical variable, the N SNPs being tested one by one and
represented by the discrete variable x (xi when SNPi tested) and the k confounding factors,
corresponding to the age, the sex and the population stratification using PCs, being denoted by the
covariables ck. The analysis consisted therefore to test if the coefficients
𝛽
were significantly different
to zero using chi-squared tests. The tests are independent and are performed for each SNP ‘i’ under
the null hypothesis H0,i = {#
𝛽
i = 0 } and the alternative hypothesis Ai = {#
𝛽
i 0 }. The threshold of
significance had to be adjusted due to the large size of the dataset and the number of independent
tests performed. The corrective method used was the one of Bonferroni which considerably reduces
multiple comparisons bias in a simple way. Using this method, the significance level was thus set to
0.05/106 = 5·10-8, taking an initial alpha value of 0.05 and considering that around 1 million
independent tests were performed (order of magnitude of the number of SNPs considered in this
study).
A factor that can usually be considered when performing a GWA analysis is the genetic nature of each
SNPs which can be dominant or recessive. However, in this study, even if the information was present
in the dataset, it wasn’t taken into account as there is no a priori hypotheses that linked the phenotype
studied to a dominance type. In addition, by studying an additive model, it allows to detect both
recessive and dominant SNPs using a unique model. Indeed, this model makes an hypothesis that there
is a uniform and linear increase in risk for each copy of a dominant allele giving a reasonable power to
detect dominance effects, even if this model may be a bit underpowered for detecting some recessive
effects compared to other models.
Histogram of HDL
HDL
Frequency
20 60 100 140
0 100 200 300 400
Histogram of Tranformed HDL
Transformed HDL
Frequency
31 0 1 2 3
0 50 100 150 200 250
Evan BéalProf. Darlene Goldstein GWAS Project Applied Biostatistics August 2020
8
Results: significant SNPs, visualization, quality control and final analysis
The GWA analysis resulted to no significant SNPs for a Bonferroni significance level of 5·10-8. Thus, the
threshold has been changed and using a less stringent significance level of 10-6, 18 SNPs were
detected as significant. From those 18 SNPs, 2 were tagged SNPs, rs1532625 and rs247617, and the
other 16 were imputed SNPs. It’s important to remind that it’s not because those SNPs are statistically
significant with this threshold that there is necessarily a causality link between the SNPs and the
phenotype studied. However, it’s interesting to observe that all those SNPs are on chromosome 16
and close to the CETP gene, the gene of interest. Indeed, by analyzing the CETP region (± 5 Kb) using
genetic coordinates established with the Genome Reference Consortium Human Build 37 (GRCh37)
database, it has been found that a total of 77 SNPs were identified in this region (7 tagged and 70
imputed), with all the 18 significant SNPs belonging to this list.
Eventually, the final adjusted association model of transformed HDL concentration is defined for the
18 significant SNPs and is of the following form:
𝔼
[ Y | xi ∈#X = (x1, ..., x18), C = (age, sex, PC1, …, PC10) ]
=
𝛼
0 +
𝛽
i
×
xi +
𝜔
age
×
age +
𝜔
sex
×
sex +
𝜔
PC1
×
PC1 + … +
𝜔
PC10
×
PC10 +
𝜀
#
Manhattan plot
To observe the results of this GWA study, different visualization techniques were used. The first one
and probably the most commonly applied for picturing the outcomes of the analysis in a simple graph
is the Manhattan plot. In this type of plot, the x-axis represents the genetic coordinates at a
chromosome-scale and the y-axis the negative logarithm of the p-values computed for each SNP,
allowing to easily visualize significant SNPs that are on top of the graph as they are the largest value in
a negative log-scale. This plot is represented on figure 6 and shows all the genotyped (black) as well as
the imputed (blue) SNPs. Horizontal lines are drawn to observe the significance levels used in this
study. This confirms that indeed no SNP is above Bonferroni corrected threshold line meaning that
using this significance level, no SNP is significant. Although this type of representation is practical to
have a global view on the results, it has a low resolution and may not be optimal as all the significant
SNPs found are present in the same chromosomal region.
Figure 6
Manhattan plot of SNPs p-values.
The level of significance used are represented by the two horizontal lines.
Genotyped SNPs are in black or grey and imputed SNPs are in blue.
Evan BéalProf. Darlene Goldstein GWAS Project Applied Biostatistics August 2020
9
Quantile-Quantile (QQ) plot
It’s another common representation in GWAS that allows to compare the quantile distributions of
observed and expected p-values following a certain distribution. If no causal SNP was present, this plot
would simply represent a linear function meaning that all the SNPs would follow an uniform
distribution. If a tail was observed, this would be indicative of the presence of causal SNPs that would
produce significant p-values. Eventually, the absence of a straight line would be indicative of problems
such as unaccounted interaction terms (covariates) or in general of the presence of unaccounted
confounding terms, giving information on whether or not the data preprocessing has been correctly
performed.
Those plots have been produced for two different models. The one used for the whole analysis and
taking into account the potential confounding factors of the sex, the age and the PCs and another one
where those confounding factors were not used to adjust the model. The results were produced using
only the genotyped SNPs and not the imputed SNPs. The QQ-plots can be observed on figure 7. We
can see a tail on both of those graphs with 2 and 3 observed values appearing to be significantly
different to the expected values for the adjusted and non-adjusted model respectively. Therefore,
those graphs assess the data pre-processing steps as well as the model selection. The results observed
for the adjusted model are consistent with all the previous ones as the 2 values that are different to
the expected values are the 2 significant genotyped SNPs obtained previously. A deviation from the y
= x line can be observed for the unadjusted model and not for the adjusted one, highlighting the
importance to consider the confounding factors when building the model to have only significant SNPs
that deviate from the expected distribution.
To compare the model adjusted for confounders to the one unadjusted, the lambda statistic, also
known as genomic inflation factor, was used. This statistic corresponds to the ratio between the
median of the chi-squared test observed and the one expected:
𝜆 = ,-./+0"1%&-23-."4!
,-./+0"567-89-."4!
.
Therefore, a value of 1 is expected if the data follows the normal chi-squared distribution meaning that
no causal SNP is present. The lambda values obtained were
𝜆
#= 1.0016 for the adjusted model and
𝜆
#=
1.0108 for the unadjusted one. Those results mean that the tail of the distribution is brought closer to
the y = x line after accounting for the different confounding factors such as age, sex and population
stratification through the PCs in the model. However, as there is only a small difference between the
two models, it means that the population is relatively homogeneous.
Figure 7
QQ-plots of the model adjusted for confounding factors (left) and the unadjusted model (right).
Evan BéalProf. Darlene Goldstein GWAS Project Applied Biostatistics August 2020
Heatmap
The last graphic representation considered in this study is the heatmap (figure 8). This graphical tool
allows to visualize the LD pattern between different SNPs (significant or not) in the region of interest.
Therefore, the heatmap representation for this analysis is done in the CETP region of the chromosome
16. The LD measure used is r2. The correlation coefficient r is strictly related to the chi-squared test
and a value of r2 = 1 means that the two SNPs provide identical information whereas a value of r2 = 0
means that the two SNPs are in perfect inheritance equilibrium. Tagged SNPs (black) and imputed SNPs
(red) are included and can be identified with the plot of the p-values of the SNPs that have been used
to do this heatmap. It can be seen that, overall, the most significant SNPs are physically relatively close
to each other and are neither in total linkage equilibrium nor disequilibrium but in between. As an
example, the two most significant tagged SNPs have a correlation coefficient r2 » 0.7.
Figure 8
Heatmap representation
of the LD structure in the
region of the CETP gene
on chromosome 16.
Above the heatmap is
the plot representing the
p-values associated to
genotyped and imputed
SNPs on a negative
logarithmic scale in black
and red respectively.
Conclusion
The goal of this analysis was to identify novel SNPs in European patients correlated to the
predisposition of developing a CAD. This was done by performing a GWA analysis using generalized
models to regress the HDL cholesterol concentration data. HDL concentration was used as the
explanatory variable as this factor is known to be associated to cardiovascular risk. The analysis was
performed after a preprocessing step that allowed to complete some filtering on both the SNPs and
the individuals and remove the ones that weren’t following different criteria as well as after adjusting
for confounding factors in order to build a strong model. Additional SNPs on the chromosome 16 were
then imputed to enhance the chance of finding significant relations between SNPs and HDL cholesterol
concentration. It was done only on the chromosome 16 as prior knowledge indicated that a gene
coding for a protein (CETP) directly involved in a process linked to HDL concentration is present on this
specific chromosome. The analysis was then performed on the processed data and resulted on the
identification of significant SNPs linked to the explanatory variable. Those SNPs were all present in the
CETP region as expected. On the 18 significant SNPs detected, two were initially genotyped, rs1532625
and rs247617, and could be of particular interest to determine European patients with predisposition
for CAD. However, the causality can’t be established from this study and in order to specifically
associate those polymorphisms to a specific cardiovascular disease form such as CAD, an analysis of
association taking into account the CAD-status of the patients would have to be performed.
Physical Length:31.6kb
R2 Color Key
0 0.2 0.4 0.6 0.8 1
UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics
CETP
CETP
0
2
4
6
Negative Log Pvalue
Evan BéalProf. Darlene Goldstein GWAS Project Applied Biostatistics August 2020
Bibliography
Web tutorial for GWAS available at: http://stat-gen.org/tut/tut_intro.html
Reilly, M.P., Li, M., He, J., et al. Identification of ADAMTS7 as a novel locus for coronary atherosclerosis
and association of ABO with myocardial infarction in the presence of coronary atherosclerosis: two
genome-wide association studies. Lancet (2011)
Bush, W. S. & Moore, J. H. Chapter 11: Genome-Wide Association Studies. PLOS Comput. Biol. (2012)
de Abreu e Lima, F. poissonisfish: Genome-wide association studies in R. (2017)
Yang, J. et al. Genomic inflation factors under polygenic inheritance. Eur. J. Hum. Genet. (2011)
Jousilahti, P., Vartiainen, E., Tuomilehto, J. & Puska, P. Sex, age, cardiovascular risk factors, and
coronary heart disease: a prospective follow-up study of 14 786 middle-aged men and women in
Finland. Circulation (1999)
Sikorska, K., Lesaffre, E., Groenen, P.F., Eilers, P.H. GWAS on your notebook: fast semi-parallel linear
and logistic regression for genome-wide association studies. BMC Bioinformatics (2013)
Zhang, Z., Ersoz, E., Lai, C. et al. Mixed linear model approach adapted for genome-wide association
studies. Nat Genet (2010)
Liu, J. Statistical Methods for Genome-wide Association Studies and Personalized Medicine. University
of Wisconsin-Madison (2014)
Ehret, GB. Genome-wide association studies: contribution of genomics to understanding blood
pressure and essential hypertension. Curr Hypertens Rep. (2010)