This page gives a short description of the data and computer science projects, primarily applied to the life sciences, that I completed during my master’s degree. Each description links to the submitted report for a fuller picture of the work accomplished.

Contrast agnostic deep learning based registration pipeline

NeuroPoly lab (Polytechnique Montreal)

Medical image registration can be challenging, in that optimal solutions depend on the application domain (unimodal, multimodal, intra-subject, inter-subject), anatomical sites (e.g., brain, spinal cord, lungs), dimensionality of the data (2D, 3D, 4D), deformation constraints (rigid, affine, nonlinear, etc.) as well as computational time. Solutions that can accommodate a large variety of applications while producing satisfactory results are needed. SynthMorph was recently introduced as an unsupervised deep learning-based registration method. A particularly interesting feature of SynthMorph is that its models can be trained on synthetic data, rendering the registration method agnostic to image contrast and anatomy. However, SynthMorph is particularly sensitive to the initial closeness of the images. In this thesis, the SynthMorph method was extended by developing a cascaded pipeline of two models that accommodate large and fine deformations, respectively. This pipeline was validated for the registration of intra-subject multimodal and inter-subject uni/multimodal MRI data of the spinal cord. This task is known to be particularly difficult due to the vicinity of multiple tissue types whose morphometrics can vary substantially across subjects and contrasts. The method was evaluated on a publicly available dataset (spine-generic, 267 subjects) and compared to a benchmark (SCT and ANTs). Results demonstrate better registration accuracy than the benchmark, with registration about 24 to 30 times faster on CPUs depending on the image size. This pipeline represents an advance for the research community as it provides an easy-to-use, accurate and fast solution for multimodal 3D registration, a necessary task for many medical applications involving MRI. The code and trained models are freely available.

This project was carried out for my master’s thesis and consists of the implementation of a fully automated deep learning pipeline for multimodal registration of brain and spinal cord MRI images for the NeuroPoly lab led by Prof. J. Cohen-Adad at Polytechnique Montreal. The models and pipeline are publicly available on GitHub. The project was written in Python and Shell and resulted in the submission of a scientific article to the journal Aperture Neuro.
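
To give an intuition for the cascade idea, here is a minimal toy sketch in Python. It is not the SynthMorph code or the thesis pipeline: the two "models" are hypothetical stand-ins that estimate an integer translation of a 1D signal by exhaustive search, whereas the real models are trained networks that predict dense 3D deformation fields. The function names (`make_shift_model`, `register_cascade`) and the sum-of-squared-differences metric are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Signal = List[float]
ShiftModel = Callable[[Sequence[float], Sequence[float]], int]


def shift(signal: Sequence[float], offset: int) -> Signal:
    """Translate a 1D signal by `offset` samples, padding with zeros."""
    n = len(signal)
    return [signal[i - offset] if 0 <= i - offset < n else 0.0 for i in range(n)]


def ssd(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences, used here as the dissimilarity measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_shift_model(candidates: Sequence[int]) -> ShiftModel:
    """Toy stand-in for a trained model: pick the candidate shift with the lowest SSD."""
    def model(moving: Sequence[float], fixed: Sequence[float]) -> int:
        return min(candidates, key=lambda s: ssd(shift(moving, s), fixed))
    return model


def register_cascade(
    moving: Sequence[float],
    fixed: Sequence[float],
    coarse_model: ShiftModel,
    fine_model: ShiftModel,
) -> Tuple[int, Signal]:
    """Estimate a large shift first, then a residual shift on the pre-aligned signal."""
    coarse = coarse_model(moving, fixed)
    pre_aligned = shift(moving, coarse)
    fine = fine_model(pre_aligned, fixed)
    total = coarse + fine  # translations compose by addition
    # Apply the composed transform once to the original signal.
    return total, shift(moving, total)


def bump(center: int, length: int = 32) -> Signal:
    """Hypothetical test signal: a small triangular bump centred on `center`."""
    profile = {-2: 1.0, -1: 2.0, 0: 3.0, 1: 2.0, 2: 1.0}
    return [profile.get(i - center, 0.0) for i in range(length)]


fixed, moving = bump(20), bump(9)
coarse_model = make_shift_model(range(-16, 17, 4))  # large steps only
fine_model = make_shift_model(range(-3, 4))         # small residual corrections
total_shift, registered = register_cascade(moving, fixed, coarse_model, fine_model)
print(total_shift)  # 11
```

In this toy case the coarse model can only get within one sample of the true offset, and the fine model recovers the remainder once the signals are already close, which mirrors the motivation for splitting large and fine deformations between two models.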

Gene expression prediction from ChIP-seq data

Laboratory of Computational and Systems Biology (EPFL)

Understanding the contribution of the different transcription factors to the regulation of gene expression at a genome-wide level is a key challenge in molecular biology. With the progress of high-throughput technologies, more and more gene expression data are collected under specific conditions. Linking those data to predicted interactions between transcription factors and genes (ChIP-seq data) is therefore of great interest for understanding the regulatory process leading to the observed gene expression. This is the purpose of the TF-NN, a neural network developed to determine how transcription factors control gene expression. Given ChIP-seq data, the TF-NN predicts gene expression data obtained under certain conditions of interest. The network has been designed to account for some biological properties, which makes it possible to extract key information on the effect of each transcription factor on gene expression by looking at the parameters of the model trained for the prediction task. Using gene expression data collected in the liver at specific times, the results for the prediction task show that 53.3% of the variance in gene expression is explained by the ChIP-seq data. The transcription factors most involved in the prediction appear to be either global transcription factors or transcription factors known to be involved in the liver or in the circadian rhythm. This indicates that some mechanisms regulating gene expression in cells are reflected in the parameters of the model. The analysis of the TF-NN can be extended to multiple tissues following a specific process. Finally, an alternative architecture of the TF-NN can be used to reduce the dimensionality of the ChIP-seq data provided as input to the model, going from more than 10,000 experiments to only 10 dimensions.

In this project, a neural network was implemented for the Laboratory of Computational and Systems Biology led by Prof. F. Naef at EPFL to predict gene expression data from ChIP-seq data, and the biological meaning of the network was explored. The network was designed to be biologically interpretable. The project was written in Python and the networks were developed with Keras/TensorFlow. The complete report is available here.
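
As a rough illustration of what "variance explained" means for such a prediction task, the sketch below computes the coefficient of determination for a toy linear model in which each gene's expression is a weighted sum of transcription factor binding signals. This is not the TF-NN architecture: the function names, the binding matrix, the activity weights and the observed values are all hypothetical, and the report may define explained variance differently.

```python
from typing import List, Sequence


def predict_expression(
    binding: Sequence[Sequence[float]], activity: Sequence[float]
) -> List[float]:
    """Predict one expression value per gene as a weighted sum of TF binding signals."""
    return [sum(b * a for b, a in zip(row, activity)) for row in binding]


def explained_variance(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: 4 genes (rows) x 2 transcription factors (columns).
binding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
activity = [1.0, 2.0]              # one interpretable weight per transcription factor
observed = [1.0, 2.5, 3.0, 2.5]    # made-up measured expression values

predicted = predict_expression(binding, activity)
r2 = explained_variance(observed, predicted)
print(f"{r2:.1%}")  # 77.8%
```

In a model of this shape, the per-factor weights are the quantities one would inspect to interpret the contribution of each transcription factor.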

Analysis of Tesco and socio-economic data of Londoners

Applied Data Analysis (EPFL CS-401, Prof. R. West)

The goal of this study was to observe the impact of the ethnicity of Tesco’s customers in London on the kinds of food products they purchase. To achieve this, we used the Tesco Grocery 1.0 dataset, which reports the food items purchased in London’s Tesco supermarkets, and extended it with the LSOA atlas, a dataset containing census data that gives information on the representation of different ethnic groups across the city of London. We first studied the correlations between ethnicity and the type of products purchased at different levels of granularity. We determined whether ethnic diversity is linked to diversity in grocery purchases, whether the various represented ethnic groups correlate differently with diversity in grocery purchases, and how the represented ethnic groups correlate with the different food product categories reported in the Tesco dataset. We then tried to assess whether the distribution of the different ethnic groups and their diversity have a causal effect on the variation in food purchases, taking into account other socio-economic features reported in the LSOA atlas that could act as confounders. We first carried out a simple linear regression analysis, followed by a propensity score matching analysis. These analyses were conducted to try to answer the following research questions. Does ethnic diversity have an effect on food purchases at area level? If so, what is its nature? To what extent is ethnic diversity responsible for the purchase diversity of some food categories? Can we attribute particular purchase habits to specific ethnic groups?

The project’s report takes the form of a data story with many interactive visualizations of the results and is available here, while the code used for this study can be found here.
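
For readers unfamiliar with propensity score matching, the matching step can be sketched as follows. This is a generic illustration, not the code of our study: it assumes the propensity scores have already been estimated (for instance with a logistic regression on the confounders), and the area identifiers, scores, caliper value and the function name `match_on_propensity` are hypothetical.

```python
from typing import Dict, List, Tuple


def match_on_propensity(
    treated: Dict[str, float], control: Dict[str, float], caliper: float = 0.05
) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbour matching on propensity score, without replacement.

    A treated unit is left unmatched when its closest remaining control
    differs by more than `caliper` in propensity score.
    """
    available = dict(control)  # copy, so the caller's dictionary is not modified
    pairs: List[Tuple[str, str]] = []
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control area is used at most once
    return pairs


# Hypothetical propensity scores for "treated" (e.g. high-diversity) and control areas.
treated_areas = {"A": 0.30, "B": 0.62, "C": 0.90}
control_areas = {"w": 0.28, "x": 0.33, "y": 0.60, "z": 0.10}

pairs = match_on_propensity(treated_areas, control_areas)
print(pairs)  # [('A', 'w'), ('B', 'y')]
```

Area "C" stays unmatched because no control area has a comparable score. Outcomes such as purchase diversity would then be compared only within the matched pairs.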

Other projects

Instance segmentation of yeast cells

Implementation of a convolutional neural network for the Laboratory of the Physics of Biological Systems (LPBS) led by Prof. S. Rahi at EPFL to perform instance segmentation of yeast cells in microscope images. Neural network implemented: Mask R-CNN, using Python and Keras/TensorFlow (report available here).

Image analysis to describe the behaviour of a robot in a special environment

Use of image analysis and pattern recognition to build a pipeline (OOP) able to track a robot and recognize patterns (segmentation, description, classification). The presentation of the project is available here.

Heterogeneous mouse brain cells sample analysis

Determine the number of cell types present in a dataset and the marker genes characterizing the different cell types. Identification of the cell types present and the number of genes expressed per cell and per cell type. The project was realized in R (report available here).

Molecular analysis of thalamic nuclei via unsupervised hierarchical clustering and PCA

Determine three major molecular profiles across 22 distinct nuclei of the thalamus through unsupervised hierarchical clustering of the most differentially expressed genes. The project was realized in R (report available here).
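
The project itself was written in R, but the principle of cutting an agglomerative clustering into three groups can be sketched in Python. This is a minimal illustration under assumptions of my choosing (average linkage, Euclidean distance, made-up two-dimensional profiles), not the analysis from the report, and the function name `average_linkage_clusters` is hypothetical.

```python
import math
from typing import List, Sequence


def average_linkage_clusters(points: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Agglomerative clustering: merge the two closest clusters until k remain.

    The distance between two clusters is the mean Euclidean distance over
    all pairs of their members (average linkage). Returns lists of indices.
    """
    clusters: List[List[int]] = [[i] for i in range(len(points))]

    def cluster_distance(a: List[int], b: List[int]) -> float:
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical expression profiles (2 genes) for 6 samples forming 3 obvious groups.
profiles = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0), (10.0, 1.0)]
groups = average_linkage_clusters(profiles, k=3)
print(groups)  # [[0, 1], [2, 3], [4, 5]]
```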

GWAS for identification of SNPs correlated to coronary artery disease predisposition

Genome-wide association analysis for identification of single-nucleotide polymorphisms correlated to coronary artery disease predisposition using HDL cholesterol concentration data. The project was realized in R (report available here).
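
One common way to test a single SNP against a quantitative trait such as HDL cholesterol is to regress the trait on the allele count (0, 1 or 2 copies of the tested allele). The sketch below, in Python rather than R, computes only the per-allele effect size for one SNP. It is an illustration with made-up genotypes and trait values, not the analysis from the report, which may have used a different model; the function name `additive_effect` is hypothetical.

```python
from typing import Sequence


def additive_effect(genotypes: Sequence[int], trait: Sequence[float]) -> float:
    """Least-squares slope of the trait on allele count (additive genetic model).

    The slope is the estimated change in the trait per extra copy of the allele.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(trait) / n
    covariance = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, trait))
    variance = sum((g - mean_g) ** 2 for g in genotypes)
    return covariance / variance


# Hypothetical data: allele counts for six individuals and their HDL values.
allele_counts = [0, 1, 2, 0, 1, 2]
hdl = [1.0, 1.2, 1.4, 1.0, 1.2, 1.4]

beta = additive_effect(allele_counts, hdl)
print(round(beta, 3))  # 0.2
```

In a genome-wide setting this estimate is computed for every SNP, each with a significance test, and the significance threshold is corrected for the number of SNPs tested.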