Instance Segmentation of Yeast Cells using Mask
R-CNN
Bagnoud Alexandre, Béal Evan, Buu-Hoang Alix
EPFL Lausanne, Switzerland
Abstract—In this computer vision project, we worked in collaboration with the Laboratory of the Physics of Biological Systems (LPBS) to detect and generate masks of yeast cells from microscope images. In this report, we explain the architecture of the neural network Mask R-CNN, our work with it, and the results we obtained.
I. INTRODUCTION
Image segmentation is an image processing technique used to partition an image into different regions. In practice, the process assigns a label to each pixel of the image such that all pixels sharing the same label are connected and share certain properties.
In our project 'ML for Science', we collaborated with the Laboratory of the Physics of Biological Systems (LPBS) led by Prof. S. Rahi. This laboratory focuses on two model organisms: C. elegans and yeast. One of its aims is to develop machine-learning-based algorithms to track yeast cells, or neurons in C. elegans. By combining video recordings with instance segmentation of these models, the laboratory can learn about many different aspects of biological systems, such as genetics or even behavior. Our role in this project, alongside other groups, was to test different image segmentation algorithms on their yeast cell dataset, in order to help them decide which architecture suited their experiments best.
Our first objective was to choose a paper implementing an image segmentation algorithm and to try to reproduce its results. For this, we chose the Mask Region-based Convolutional Neural Network (Mask R-CNN). We selected this algorithm because it had produced strong results in several image segmentation competitions. As a second objective, we had to adapt the neural network to the requirements of the lab, i.e. to segment instances of yeast cells in images taken under the microscope.
II. IMAGE SEGMENTATION AND MASK R-CNN
As briefly stated in the introduction, Mask R-CNN is an image segmentation algorithm (a neural network) that performs object detection as well as instance segmentation, i.e. it not only detects objects of a class but also distinguishes the individual instances present. Applied to our project, it can create a bounding box around each yeast cell and label the pixels of each cell as a separate instance. Instance segmentation is a great asset for the laboratory, as it allows each cell to be tracked and analyzed individually.
III. ARCHITECTURE OF THE NEURAL NETWORK
The architecture of Mask R-CNN is an extension of its predecessor, Faster R-CNN. Its structure is composed of two stages. The first stage, which is highly similar to Faster R-CNN, is responsible for generating areas of the image that are likely to contain an object. The second stage is responsible for the classification of objects. Mask R-CNN mainly adds instance segmentation and mask generation to this second stage.
FIG. 1: General framework of Mask R-CNN algorithm. Adapted from
[1]
A. First stage: Generation of regions of interest
As stated above, the first stage is designed to propose regions of interest (RoIs). Its structure is composed of two main elements: the backbone and the region proposal network (RPN). First, the image is processed by a convolutional backbone, a combination of ResNet-101 and a Feature Pyramid Network, which extracts a feature map. The early layers detect simple features such as straight lines or corners, whereas the later layers detect more advanced features such as whole objects. The output is then fed into the RPN, which scans the feature map over roughly 200,000 regions (portions of the map of varying size). For each region, the RPN predicts in a binary fashion whether it contains an object (foreground) or nothing (background), together with a confidence value, and also estimates where within the region the object is located. The highest-rated regions are then selected and their bounding boxes are refined, merging the different regions that overlap the same object. At the end of this first stage, we have RoIs that are expected to contain an object, and these are analyzed by the second stage.
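The selection of the highest-rated, overlapping proposals essentially corresponds to non-maximum suppression. The following is a minimal NumPy sketch of that idea, not the repository's actual implementation; boxes are assumed to be given as (y1, x1, y2, x2).

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all given as (y1, x1, y2, x2)."""
    y1 = np.maximum(box[0], boxes[:, 0])
    x1 = np.maximum(box[1], boxes[:, 1])
    y2 = np.minimum(box[2], boxes[:, 2])
    x2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(y2 - y1, 0) * np.maximum(x2 - x1, 0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.7):
    """Keep the highest-scoring proposals, dropping any that overlap a kept box too much."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        overlaps = iou_one_to_many(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps < iou_threshold]
    return np.array(keep, dtype=int)
```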
B. Second stage: Classification and Masking
The second stage takes only the RoIs as input and resizes them all to the same shape. The RoI classifier predicts the class of the object, which in our case is either yeast cell or background, and refines the dimensions of the bounding box once more. Each RoI is also fed in parallel to the mask branch, where each object instance gets a 28x28-pixel mask. The masks are finally resized to match the original image size. A representation of the architecture can be seen in Figure 1.
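To make the last step concrete, here is a small sketch (assuming scikit-image is available and boxes are given in pixel coordinates) of how a low-resolution mask can be resized and placed back into the full image:

```python
import numpy as np
from skimage.transform import resize

def paste_instance_mask(small_mask, box, image_shape, threshold=0.5):
    """Resize a low-resolution soft mask (e.g. 28x28) to its bounding box and place it
    into a full-size binary mask; box is (y1, x1, y2, x2) in pixel coordinates."""
    y1, x1, y2, x2 = box
    resized = resize(small_mask, (y2 - y1, x2 - x1), order=1, preserve_range=True)
    full_mask = np.zeros(image_shape[:2], dtype=bool)
    full_mask[y1:y2, x1:x2] = resized >= threshold
    return full_mask
```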
IV. METHOD
A. Adaptation of the code and early testing
As mentioned in the introduction, we selected Mask R-CNN based on its results in image segmentation competitions. The original paper [1] describes the architecture and demonstrates its performance. While researching, we found a GitHub repository providing a replication of the paper with some minor changes [2]. The model was well documented and already trained on several examples, such as the COCO dataset. COCO is a publicly available dataset used in image segmentation competitions; it contains labelled data covering up to 91 different object categories. The repository also provided functions to visualize the dataset as well as the predictions.
After analyzing how the code was constructed, we were able to run the algorithm with the weights provided for the COCO dataset. Testing it on various images, we could verify that the image segmentation worked well: it was able to detect different objects, such as cars, various animals, or some types of food. An example is shown below in Figure 2.
FIG. 2: Example of image segmentation using the provided COCO
weights. Each animal is correctly labelled
At this point, we decided to keep this algorithm and adapt
it to our needs.
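For illustration, inference with the provided COCO weights can be run along these lines. This is a sketch based on the repository's demo; the weight and image paths, as well as the import location of the CocoConfig example, are assumptions and may differ from our actual setup.

```python
import skimage.io
import mrcnn.model as modellib
from samples.coco import coco          # assumed location of the repository's CocoConfig example

class InferenceConfig(coco.CocoConfig):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1                 # one image at a time for inference

config = InferenceConfig()
model = modellib.MaskRCNN(mode="inference", config=config, model_dir="logs")
model.load_weights("mask_rcnn_coco.h5", by_name=True)   # pre-trained COCO weights

image = skimage.io.imread("example.jpg")                # hypothetical test image
r = model.detect([image], verbose=1)[0]                 # dict with 'rois', 'masks', 'class_ids', 'scores'
print(len(r["class_ids"]), "objects detected")
```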
Our first task was to modify the code so that it could detect yeast. The GitHub repository provided several object classes as usage examples. One of them, named 'nucleus', was written to distinguish between two classes, nucleus and background. This example was of great use to us, since our adaptation of the code also had to differentiate between two classes, yeast and background. After studying the role each file plays in the execution of the algorithm, we created the yeast_Final.py file, overriding some functions of the model and adding new ones.
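The pattern we followed can be sketched as follows. This is not the exact content of yeast_Final.py, but the general structure of a two-class dataset in the repository's API, assuming one folder per image with 'images' and 'masks' subfolders:

```python
import os
import numpy as np
import skimage.io
from mrcnn import utils

class YeastDataset(utils.Dataset):
    """Minimal sketch of a two-class (yeast / background) dataset, following the
    pattern of the repository's nucleus example; details differ from yeast_Final.py."""

    def load_yeast(self, dataset_dir, image_ids):
        self.add_class("yeast", 1, "yeast")          # single foreground class
        for image_id in image_ids:
            self.add_image(
                "yeast",
                image_id=image_id,
                path=os.path.join(dataset_dir, image_id, "images", image_id + ".png"))

    def load_mask(self, image_id):
        # One binary mask per instance, stacked along the last axis.
        info = self.image_info[image_id]
        mask_dir = os.path.join(os.path.dirname(os.path.dirname(info["path"])), "masks")
        masks = [skimage.io.imread(os.path.join(mask_dir, f)).astype(bool)
                 for f in sorted(os.listdir(mask_dir))]
        mask = np.stack(masks, axis=-1)
        return mask, np.ones(mask.shape[-1], dtype=np.int32)
```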
B. Dataset
The data we trained the program on came directly from the LPBS laboratory. The pictures were provided as compressed '.tif' files containing images of size 2044x2048 pixels. For each 'data point', or image, we had two files: the first was a picture of the yeast cells under a microscope, and the second was a corresponding ground-truth image containing masks placed at the location, and with the shape, of each cell, as seen in Figure 3. Note that the mask image contains all the yeast cells at once. We had to address this, because our goal was to label each yeast cell differently (to perform instance segmentation), whereas the program interpreted the ground truth as one giant cell made out of smaller circles. To deal with this issue, for each mask image we detected how many yeast cells were labelled and then, for each cell, created a new image containing only that cell's mask.
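A minimal sketch of that splitting step, assuming each cell is encoded with a distinct non-zero pixel value in the ground-truth image (connected-component labelling would be needed if the masks were purely binary):

```python
import os
import numpy as np
import skimage.io

def split_instances(mask_path, out_dir):
    """Write one binary mask file per cell from a ground-truth image containing all cells.
    Assumes each cell carries a distinct non-zero pixel value; 0 is background."""
    os.makedirs(out_dir, exist_ok=True)
    labels = skimage.io.imread(mask_path)
    for i, value in enumerate(v for v in np.unique(labels) if v != 0):
        single = (labels == value).astype(np.uint8) * 255
        skimage.io.imsave(os.path.join(out_dir, "mask_{:04d}.png".format(i)), single)
```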
FIG. 3: Example of yeast cell image visualisation using 'run_visualize.py'. Each cell has an overlapping ground-truth mask
Looking through the data, we observed that the average number of yeast cells per image was generally around 50 to 150. However, we found some files where the number of masks exceeded a thousand cells, and others where masks consisted of a single pixel. After deciding to remove those data points, we ended up with a dataset of around 290 images.
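The sanity check behind this cleaning step can be sketched as follows (the thresholds shown are illustrative):

```python
import numpy as np

def is_valid_ground_truth(label_image, max_cells=1000, min_pixels=2):
    """Discard frames with an implausible number of labelled cells or with single-pixel masks."""
    values, counts = np.unique(label_image[label_image > 0], return_counts=True)
    if len(values) == 0:
        return False                      # no cells labelled at all
    return len(values) <= max_cells and counts.min() >= min_pixels
```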
The dataset needed a specific folder structure to be compatible with the algorithm. We wrote several functions (in yeast_Final.py, see ReadMe.md) that take any compressed '.tif' file containing labelled data and format it to work with the program, as well as separating the masks for instance segmentation.
C. Training
Once the modifications to the code and to the yeast cell dataset were done, we started training the model from the weights provided for the COCO dataset. We justify this choice by the fact that it seemed better to start with weights already trained to detect generic patterns, such as lines and corners, than to start from scratch. We split our dataset into 261 training images and 33 validation images. As our classification was binary, i.e. either yeast cell or background, we set the confidence threshold for a positive detection to 50%.
At the end of each training run, we mainly looked at four values to assess the quality of the new weights: the total loss on the training data (loss) and on the validation data (val_loss), the mask loss, i.e. the pixel-wise cross-entropy loss of the mask branch on the validation data (val_mask_loss), and the average precision (AP) in combination with the intersection over union (IoU). The loss gave us a direct indication of whether the model was getting better or worse, and as our main goal was to generate masks for the cells, the mask loss was equally useful, giving information on the pixel-wise precision. The average precision was calculated from precision and recall as a function of the IoU, which measures the area of the predicted bounding box overlapping the ground truth divided by the area of the union of the two. In summary, after each training run we looked at the losses and visually inspected the new weights by running predictions on an image from the validation set, for which we could compute the AP at different IoU thresholds.
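To make the two quantities just described concrete, here is a minimal sketch of a box IoU and of the pixel-wise cross-entropy used by the mask loss; the repository computes these internally, so this is only illustrative:

```python
import numpy as np

def box_iou(box_a, box_b):
    """IoU of two boxes (y1, x1, y2, x2): overlap area divided by the area of the union."""
    y1, x1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    y2, x2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(y2 - y1, 0) * max(x2 - x1, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

def mask_cross_entropy(pred_mask, gt_mask, eps=1e-7):
    """Pixel-wise binary cross-entropy between a predicted soft mask and a 0/1 ground truth."""
    pred = np.clip(pred_mask, eps, 1.0 - eps)
    return float(np.mean(-(gt_mask * np.log(pred) + (1.0 - gt_mask) * np.log(1.0 - pred))))
```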
A training session took on average 6 to 10 hours per epoch, so we could only run one at a time. We used Google Colab hosted servers, as our own computers could not handle the processing. Google Colab proved unstable, and many training sessions were aborted because of the way the company allocates its servers.
D. Improvement attempts
After completing the first training sessions and obtaining the first positive results, we tried to improve the model by modifying several aspects. Firstly, we adjusted the batch size (the number of data points the network takes at each iteration) and the number of steps per epoch to find the best compromise between execution speed and the number of images used for training. Secondly, all the pictures looked similar, with a colony of yeast in the center of the image. To make the algorithm more robust in case a new image deviated from the usual, and also to increase the amount of training data, we decided to use image augmentation. This function takes an original picture and modifies it randomly by rotation, symmetry, cropping and blurring. This way, the images the network is trained on have much more variability than the ones provided by the lab. The final size of the training images is 512x512, but the original images are cropped to a smaller size and magnified by a certain factor to match this size. This makes the yeast cells appear larger and allows the network to learn to generate precise masks. Note that different values were tested here and the results are presented in the next section. Another modification was to pre-process the images by increasing contrast and luminosity, which made the cells much easier to see with the naked eye, since the original images have low contrast.
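The augmentation pipeline can be expressed with the imgaug library, which the repository accepts directly as a training argument. The following is a sketch with illustrative parameters rather than our exact values:

```python
import imgaug.augmenters as iaa

# Apply between 0 and 4 of these transforms to each image, chosen at random.
augmentation = iaa.SomeOf((0, 4), [
    iaa.Fliplr(0.5),                      # horizontal mirror
    iaa.Flipud(0.5),                      # vertical mirror
    iaa.Affine(rotate=(-45, 45)),         # random rotation
    iaa.GaussianBlur(sigma=(0.0, 2.0)),   # random blur
    iaa.Multiply((0.8, 1.2)),             # pixel value multiplication
])

# Passed to the repository's training entry point, e.g.:
# model.train(dataset_train, dataset_val, learning_rate=config.LEARNING_RATE,
#             epochs=1, layers="all", augmentation=augmentation)
```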
V. RESULTS
In this section, we present the results of a number of training runs in which we tested different parameters and image pre-processing options in order to obtain the weights giving the most accurate results, based on the pixel-wise cross-entropy loss on the validation dataset (val_mask_loss).
Firstly, we aimed to determine whether all the layers of the neural network needed to be trained. Comparing the losses in Table I, we observe a significant difference between training all the layers and training only the head layers (the RPN, classifier and mask heads of the network). Moreover, we could choose to train from different stages of the ResNet-101 onwards (e.g. stage 4 and above), but this led to a major increase in the loss without reducing the running time, so this option was discarded.
          Head layers    All layers
Loss      1.2419         1.0246
TABLE I: Loss as a function of the layers trained
Secondly, we wanted to determine the optimal number of images per batch. To do so, we tested three different batch sizes, as shown in Table II. We observe that the loss decreases significantly as the number of images per batch increases. However, we were not able to increase the batch size above 4 to find the optimal value, due to the considerable training time and the lack of computing power and memory.
Batch size    1         3         4
Loss          1.2384    1.0684    1.0246
TABLE II: Loss as a function of the number of images per batch
Then, we compared the number of augmentations applied per image, as presented in Table III. We observe that increasing the number of augmentations per image makes the dataset more varied, leading to a small increase of the loss on the training data but a large decrease on the validation data.
Number of augmentations    Between 0 and 2    Between 0 and 4
Loss                       1.0246             1.0388
val_loss                   1.1712             1.0477
val_mask_loss              0.1828             0.1721
TABLE III: Loss on the training and validation data as a function of the number of augmentations
Afterwards, to try to reduce the training time without increasing the loss too much, we trained with and without a scaling factor of 2. The results in Table IV show that the scaling factor is necessary to decrease the loss, even though removing it divides the training time by almost a factor of 2.
Scaling factor    1x (no rescaling)    2x
Loss              1.2750               1.0246
TABLE IV: Loss as a function of the scaling factor
Eventually, we added an option in the yeast configuration class to select a pre-processing of the images before training. The different pre-processing options available (changing the contrast, the luminosity, or both) were tested. Although they allowed us to visualize the yeast cells better by eye, they did not decrease the loss; see Table V.
Following these steps, we were able to optimize our model for the yeast dataset, leading to the results reported below in Table VI.
Pre-processing    None      Contrast    Luminosity + Contrast
Loss              1.0246    1.4369      1.4538
TABLE V: Loss as a function of the pre-processing performed
                 Optimal model
Loss             1.0388
mask_loss        0.1829
val_loss         1.0477
val_mask_loss    0.1721
TABLE VI: Loss on the training and validation data for the optimal model
The final parameters used to obtain these weights are defined in the YeastConfig class in yeast_Final.py. For example, the number of images per batch is set to 4, and the image resize mode is a 256x256 crop with a scaling factor of 2 (to obtain 512x512). The number of augmentations per image (rotation, blur, symmetry or pixel value multiplication) is chosen at random between 0 and 4. The region proposal network anchors are set to smaller sizes than in the original configuration in order to deal with the yeast cells, which are small objects. Using these weights and parameters, an example of segmentation on an image from the validation set is shown in Figure 4.
FIG. 4: Results of detection on an image from the validation set (the
image has been cropped for better view)
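For reference, the essence of that configuration can be sketched as follows; the authoritative values live in yeast_Final.py, and the attribute names are those of the repository's Config class:

```python
from mrcnn.config import Config

class YeastConfig(Config):
    """Sketch of the final training configuration; see yeast_Final.py for the exact values."""
    NAME = "yeast"
    NUM_CLASSES = 1 + 1                        # background + yeast cell
    IMAGES_PER_GPU = 4                         # 4 images per batch
    IMAGE_RESIZE_MODE = "crop"                 # random crops during training
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    IMAGE_MIN_SCALE = 2.0                      # 2x magnification: a 256x256 field becomes 512x512
    RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)   # smaller anchors for small cells
    DETECTION_MIN_CONFIDENCE = 0.5             # 50% threshold for the binary yeast/background decision
```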
Using the optimal weights on new images and performing instance segmentation of yeast cells leads to the prediction of segmentation masks for the yeast, as seen in Figure 5. The masks predicted by our model attribute a different pixel value to each detected yeast cell, allowing the laboratory to separate each cell individually for further analysis.
FIG. 5: Detection results (right) on test data (left, original image modified to make the yeast cells visible to the naked eye)
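A sketch of how the detector's per-instance output can be collapsed into such a labelled image (r['masks'] being the boolean mask stack returned by the repository's detect call):

```python
import numpy as np

def instances_to_label_image(masks):
    """Collapse a boolean mask stack of shape (H, W, N) into one image in which
    the pixels of cell i carry the value i + 1 and the background stays 0."""
    label_image = np.zeros(masks.shape[:2], dtype=np.uint16)
    for i in range(masks.shape[-1]):
        label_image[masks[:, :, i]] = i + 1   # later instances overwrite overlaps
    return label_image

# Typical use with the detection output:
# r = model.detect([image])[0]
# labels = instances_to_label_image(r["masks"])
```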
VI. DISCUSSION
As presented in the results section, we obtained promising results with the training that was done. We were able to find relevant parameters given our dataset and the time at our disposal. The detection is not perfect, as can be seen in Figure 5, where one yeast cell in the center is not detected, but given more training we believe the algorithm would perform as intended.
Our main issue was the training time. Although the first stage of the algorithm, based on Faster R-CNN, was designed to optimize training and prediction time, one epoch still took at least 6 to 8 hours, which made it difficult to run many tests or to increase the number of epochs. We believe the reason may be that there are many yeast cells per image, which produces a large number of RoIs and could therefore slow down training. Upon research, we found suggestions on how to make the program run faster; however, after implementing those changes, the running time did not decrease significantly whereas the loss increased considerably. Another issue was that the architecture was designed to take images in RGB format (3 color channels), while our dataset consisted of grayscale images only (1 channel). Using the grayscale images without transformation produced errors, so we had to convert them to RGB before using the different functions. Although working directly with grayscale images would probably have reduced the running time, we were not able to modify all the functions that expect RGB images.
As mentioned in the introduction, Mask R-CNN was developed to compete on the COCO dataset, where it has to classify over 90 different object categories. In our case, there is only one object class to detect. We believe that the complexity of the architecture, i.e. the number of layers, could be reduced to produce a simpler and more efficient algorithm. Combined with greater computing power, more training could then have been done.
In sum, as non-computer scientists, we found the original code complex to understand and to work with. It was rewarding to adapt the code as much as we could and to observe what worked better or worse, but we were sometimes limited by our fresh knowledge of machine learning and neural networks.
VII. CONCLUSION
This project involved the first neural network we have worked on. Looking at the overall results, we believe our goal was met. We were able to work on a complex algorithm and adapt it to our needs, fulfilling the requirements of the LPBS laboratory. We gained insight into how computer vision operates and realized how important it can be for research. We also know that there is always room for improvement with further knowledge of the subject. We would like to thank Prof. S. Rahi for the opportunity to work at their side on this project, as well as everybody in the Laboratory of the Physics of Biological Systems for their help and insight.
VIII. REFERENCES
[1] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," arXiv:1703.06870 [cs], Jan. 2018.
[2] Matterport, Inc., "Mask R-CNN," GitHub repository: matterport/Mask_RCNN, 2019.
[3] "Splash of Color: Instance Segmentation with Mask R-CNN and TensorFlow." [Online]. Available: https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46. [Accessed: 19-Dec-2019].
[4] S. Ghosh, N. Das, I. Das, and U. Maulik, "Understanding Deep Learning Techniques for Image Segmentation," arXiv:1907.06119 [cs], Jul. 2019.
[5] E. Moen, D. Bannon, T. Kudo, W. Graf, M. Covert, and D. Van Valen, "Deep learning for cellular image analysis," Nature Methods, vol. 16, no. 12, pp. 1233–1246, Dec. 2019.