Evan Béal – Prof. Darlene Goldstein GWAS Project – Applied Biostatistics – August 2020
Therefore, by incorporating the first 10 PCs (an arbitrary number, but one commonly recommended in
GWAS) along with clinical covariates, it will be possible to produce a model adjusted for confounders
and, notably, to compare it with an unadjusted model.
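As an illustration of how such PCs can be obtained, the following minimal sketch computes the first 10 principal component scores from a genotype matrix with a singular value decomposition. It is not the code used in this project: the function name `top_principal_components`, the toy data and the assumption of complete genotype calls coded as 0, 1 or 2 are all hypothetical.

```python
import numpy as np


def top_principal_components(genotypes, n_components=10):
    """Return the first n_components PC scores, one row per sample.

    genotypes: (n_samples, n_snps) array of allele counts (0, 1, 2),
    assumed here to contain no missing calls.
    """
    g = np.asarray(genotypes, dtype=float)
    # Centre each SNP, then scale it to unit variance.
    g = g - g.mean(axis=0)
    sd = g.std(axis=0)
    polymorphic = sd > 0  # monomorphic SNPs carry no information
    g = g[:, polymorphic] / sd[polymorphic]
    # Thin SVD: the columns of u * s are the sample scores on each PC.
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy data (hypothetical): 50 samples, 200 SNPs.
rng = np.random.default_rng(0)
toy_genotypes = rng.integers(0, 3, size=(50, 200))
pcs = top_principal_components(toy_genotypes, n_components=10)
```

The resulting columns can then be appended to the clinical covariates when fitting the adjusted model.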
The second data-generation step consisted of extending the analysis to other known SNPs that were not
yet present in the dataset, either because they were not genotyped or because they were removed by the
previous filtering. Indeed, SNP identification is typically done using microarray techniques, which can
identify around 1 million SNPs. However, as mentioned previously, each individual generally carries
approximately 4 to 5 million SNPs. Therefore, microarray techniques do not capture all SNPs, and the
GWA analysis cannot be performed on every SNP of an individual. This is a problem of particular
importance in disease studies, since a single SNP can often be phenotypically causative, and in
establishing correct genome-to-disease risk relationships. It is thus of great interest to complete the
dataset with the missing SNPs. This is possible thanks to very large reference SNP datasets that cover
populations of different geographical origins and contain LD-bin relationships. Using those datasets and
setting a threshold on the LD-bin relationship, full genotypes can be inferred from the detected SNPs,
thereby estimating the missing ones.
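To make the idea of an LD threshold concrete, here is a minimal sketch that measures LD between a typed SNP and an untyped SNP in a reference panel as the squared correlation of their allele counts, a common approximation of r². The genotype vectors and the cut-off `R2_THRESHOLD = 0.8` are hypothetical and only illustrate the principle; they are not the values used in this project.

```python
import numpy as np


def ld_r2(snp_a, snp_b):
    """Squared Pearson correlation between two vectors of allele counts."""
    a = np.asarray(snp_a, dtype=float)
    b = np.asarray(snp_b, dtype=float)
    r = np.corrcoef(a, b)[0, 1]
    return r ** 2


# Hypothetical reference-panel genotypes (allele counts per individual).
typed_snp = np.array([0, 1, 2, 1, 0, 2, 1, 0])
untyped_snp = np.array([0, 1, 2, 1, 0, 2, 1, 1])    # nearly identical pattern
unrelated_snp = np.array([1, 0, 1, 2, 1, 0, 2, 1])  # weakly correlated pattern

R2_THRESHOLD = 0.8  # assumed cut-off, for illustration only

# An untyped SNP is considered well tagged when its r^2 with a typed SNP
# reaches the threshold.
well_tagged = ld_r2(typed_snp, untyped_snp) >= R2_THRESHOLD
poorly_tagged = ld_r2(typed_snp, unrelated_snp) >= R2_THRESHOLD
```

In this toy example only the first untyped SNP passes the threshold, so only its genotype would be inferred from the typed SNP.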
Different imputation methods exist. In this case, the imputation analysis is performed at the chromosomal
level: only SNPs on the same chromosome are tested, which reduces both the computational cost and the
number of non-tagged SNPs. This local analysis can be done by determining, for all the genomes in the
data bank, the posterior probability of matching a target genome given the observed (measured)
genome at a local position. The missing and common SNPs between the observed and the target
genotypes are used as binary labels for linear regression classification, and new SNPs are added to the
regression model until some stopping criterion is satisfied. Finally, the last step is a quality
control that removes untyped SNPs for which the imputation rules cannot be satisfied and filters out those
with a low estimated MAF and those with a low imputation accuracy.
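The final quality-control step can be sketched as a simple mask over the imputed SNPs. This is only an illustration under stated assumptions: the function name `filter_imputed_snps`, the use of NaN to mark SNPs without a usable imputation rule, and the thresholds `min_maf=0.01` and `min_r2=0.7` are hypothetical and not taken from this project.

```python
import numpy as np


def filter_imputed_snps(maf, r2, min_maf=0.01, min_r2=0.7):
    """Return a boolean mask of the imputed SNPs that pass quality control.

    maf: estimated minor allele frequency of each imputed SNP.
    r2:  estimated imputation accuracy of each imputed SNP.
    NaN in either vector marks a SNP for which no rule could be estimated.
    """
    maf = np.asarray(maf, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    has_rule = np.isfinite(maf) & np.isfinite(r2)
    return has_rule & (maf >= min_maf) & (r2 >= min_r2)


# Hypothetical values for four imputed SNPs.
maf = [0.20, 0.005, 0.15, np.nan]
r2 = [0.95, 0.90, 0.40, np.nan]
keep = filter_imputed_snps(maf, r2)
n_kept = int(keep.sum())
```

Here only the first SNP is kept: the second has too low a MAF, the third too low an accuracy, and the fourth has no imputation rule.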
Accordingly, the imputation analysis is performed at the chromosomal level, and more specifically
on chromosome 16. This choice is justified because the gene coding for cholesteryl ester transfer protein
(CETP) is located on this chromosome. This protein is involved in the transfer of cholesteryl esters from
high-density lipoprotein (HDL) to other lipoproteins and is therefore closely linked to HDL
concentration, the phenotypic trait studied here. Imputation rules were estimated
for 197’888 SNPs, and the quality control removed 35’323 of those imputed SNPs, ultimately adding
162’565 SNPs to the dataset.
Genome-wide association analysis
With the previous steps, the data have been loaded and pre-processed to remove SNPs that
did not meet the quality criteria. In addition, some SNPs were imputed on the chromosome of interest,
leading to a dataset that can be used to run the GWA analysis. A GWAS can be
of two types: it can study either the association between an endophenotype (a trait underlying a disease,
here HDL concentration) and SNPs, or the identification of SNPs with a significantly higher frequency in
the experimental (case) group than in the control group. In our case, since we have clinical data
in which each patient is associated with measured endophenotypic values, the first type of analysis was
carried out.
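The first type of analysis, a test of association between a quantitative trait and a single SNP, can be sketched with an ordinary least-squares fit. This is a minimal illustration, not the implementation used in this project: the function name `snp_association`, the additive 0/1/2 genotype coding, the simulated data and the single `age` covariate are all assumptions.

```python
import numpy as np


def snp_association(trait, genotype, covariates=None):
    """Regress a quantitative trait on one SNP (additive coding).

    Returns the estimated SNP effect and its t-statistic. Optional
    covariates, shape (n_samples, n_covariates), are added to the model.
    """
    y = np.asarray(trait, dtype=float)
    g = np.asarray(genotype, dtype=float)
    columns = [np.ones_like(y), g]  # intercept and SNP
    if covariates is not None:
        columns.append(np.asarray(covariates, dtype=float).reshape(len(y), -1))
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = residuals @ residuals / dof
    # Standard error of the SNP coefficient (second column of X).
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1], beta[1] / se


# Simulated data (hypothetical): true SNP effect of 0.3 on the trait.
rng = np.random.default_rng(1)
n = 500
genotype = rng.integers(0, 3, size=n)
age = rng.normal(50, 10, size=n)
trait = 1.2 + 0.3 * genotype - 0.01 * age + rng.normal(0, 0.2, size=n)

effect, t_stat = snp_association(trait, genotype, covariates=age)
```

In a genome-wide setting the same fit would be repeated for every SNP, and the resulting test statistics compared against a significance threshold corrected for multiple testing.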
The GWA analysis thus consisted of regressing the adjusted HDL cholesterol concentration on each SNP
separately. HDL concentration therefore had to be adjusted for confounding factors. Some
of those factors have already been examined and filtered on previously, for instance to remove some samples,
but additional parameters, such as age, sex and the 10 PCs computed during the
data generation step, have to be considered to adjust the model. This is of great importance given
that age and sex are indeed risk factors for the HDL-cholesterol trait, which is associated with