Instance Segmentation of Yeast Cells using Mask
R-CNN
Bagnoud Alexandre, Béal Evan, Buu-Hoang Alix
EPFL Lausanne, Switzerland
Abstract—In this computer vision project, we worked in collaboration with the Laboratory of the Physics of Biological Systems (LPBS) to detect and generate masks of yeast cells from microscope images. In this report, we explain the architecture of the neural network Mask R-CNN, our work with it, and the results we obtained.
I. INTRODUCTION
Image segmentation is an image processing technique used to partition an image into different regions. In practice, the process assigns a label to each pixel of the image such that all pixels sharing the same label are connected and share certain properties.
In our project 'ML for Science', we collaborated with the Laboratory of the Physics of Biological Systems (LPBS) led by Prof. S. Rahi. This laboratory focuses on two model organisms: C. elegans and yeast. One of its aims is to develop machine-learning-based algorithms to track yeast cells, or neurons in C. elegans. By combining video recordings with instance segmentation of these models, the laboratory can learn about many different aspects of biological systems, such as genetics or even behavior. Our role in this project, alongside other groups, was to test different image segmentation algorithms on their yeast cell dataset, in order to help them decide which architecture suited their experiments best.
Our first objective was to choose a paper implementing an image segmentation algorithm and to try to reproduce its results. For this, we chose the Mask Region-based Convolutional Neural Network (Mask R-CNN). We selected this algorithm because it had produced strong results in several image segmentation competitions. As a second objective, we had to adapt the neural network to the requirements of the lab, i.e. to segment instances of yeast cells in images taken under the microscope.
II. IMAGE SEGMENTATION AND MASK R-CNN
As briefly stated in the introduction, Mask R-CNN is an image segmentation algorithm (a neural network) that performs object detection as well as instance segmentation, i.e. it not only detects objects of a class but also distinguishes the individual instances present. Applied to our project, it can create a bounding box around each yeast cell and label the pixels of each cell as a separate instance. Instance segmentation is a great asset for the laboratory, as it allows each cell to be tracked and analyzed individually.
III. ARCHITECTURE OF THE NEURAL NETWORK
The architecture of Mask R-CNN is an extension of its predecessor, Faster R-CNN. Its structure is composed of two stages. The first stage, which is highly similar to Faster R-CNN, is responsible for generating areas of the image that are likely to contain an object. The second stage is responsible for the classification of objects. Mask R-CNN mainly adds instance segmentation and mask generation to this second stage.
FIG. 1: General framework of Mask R-CNN algorithm. Adapted from
[1]
A. First stage: Generation of regions of interest
As stated above, the first stage is designed to propose regions of interest (RoIs). Its structure is composed of two main elements: the backbone and the region proposal network (RPN). First, the image is processed by a convolutional backbone, a combination of ResNet-101 and a Feature Pyramid Network, which extracts a feature map. The early layers detect simple features such as straight lines or corners, whereas the later layers detect more advanced features such as whole objects. The output is then fed into the RPN, which scans the feature map over roughly 200,000 regions (portions of the map of varying size). For each region, the RPN predicts in a binary fashion whether it contains an object (foreground) or nothing (background), together with a confidence value, and also estimates where within the region the object is located. The highest-rated regions are then selected and their bounding boxes are refined, merging the different regions that overlap the same object. At the end of this first stage, we have RoIs that are expected to contain an object, and these are analyzed by the second stage.
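The selection of the highest-rated, overlapping proposals essentially corresponds to non-maximum suppression. The following is a minimal NumPy sketch of that idea, not the repository's actual implementation; boxes are assumed to be given as (y1, x1, y2, x2).

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all given as (y1, x1, y2, x2)."""
    y1 = np.maximum(box[0], boxes[:, 0])
    x1 = np.maximum(box[1], boxes[:, 1])
    y2 = np.minimum(box[2], boxes[:, 2])
    x2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(y2 - y1, 0) * np.maximum(x2 - x1, 0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.7):
    """Keep the highest-scoring proposals, dropping any that overlap a kept box too much."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        overlaps = iou_one_to_many(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps < iou_threshold]
    return np.array(keep, dtype=int)
```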
B. Second stage: Classification and Masking
The second stage takes only the RoIs as input and resizes them all to the same shape. The RoI classifier predicts the class of the object, which in our case is either yeast cell or background, and refines the dimensions of the bounding box once more. Each RoI is also fed in parallel to the mask branch, where each object instance gets a 28x28-pixel mask. The masks are finally resized to match the original image size. A representation of the architecture can be seen in Figure 1.
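To make the last step concrete, here is a small sketch (assuming scikit-image is available and boxes are given in pixel coordinates) of how a low-resolution mask can be resized and placed back into the full image:

```python
import numpy as np
from skimage.transform import resize

def paste_instance_mask(small_mask, box, image_shape, threshold=0.5):
    """Resize a low-resolution soft mask (e.g. 28x28) to its bounding box and place it
    into a full-size binary mask; box is (y1, x1, y2, x2) in pixel coordinates."""
    y1, x1, y2, x2 = box
    resized = resize(small_mask, (y2 - y1, x2 - x1), order=1, preserve_range=True)
    full_mask = np.zeros(image_shape[:2], dtype=bool)
    full_mask[y1:y2, x1:x2] = resized >= threshold
    return full_mask
```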
IV. METHOD
A. Adaptation of the code and early testing
As mentioned in the introduction, we selected Mask R-CNN based on its results in image segmentation competitions. The original paper [1] describes the architecture and demonstrates its performance. While researching, we found a GitHub repository providing a replication of the paper with some minor changes [2]. The model was well documented and already trained on several examples, such as the COCO dataset. COCO is a publicly available dataset used in image segmentation competitions; it contains labelled data covering up to 91 different object categories. The repository also provided functions to visualize the dataset as well as the predictions.
After analyzing how the code was constructed, we were able to run the algorithm with the weights provided for the COCO dataset. Testing it on various images, we could verify that the image segmentation worked well: it was able to detect different objects, such as cars, various animals, or some types of food. An example is shown below in Figure 2.
FIG. 2: Example of image segmentation using the provided COCO
weights. Each animal is correctly labelled
At this point, we decided to keep this algorithm and adapt
it to our needs.
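For illustration, inference with the provided COCO weights can be run along these lines. This is a sketch based on the repository's demo; the weight and image paths, as well as the import location of the CocoConfig example, are assumptions and may differ from our actual setup.

```python
import skimage.io
import mrcnn.model as modellib
from samples.coco import coco          # assumed location of the repository's CocoConfig example

class InferenceConfig(coco.CocoConfig):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1                 # one image at a time for inference

config = InferenceConfig()
model = modellib.MaskRCNN(mode="inference", config=config, model_dir="logs")
model.load_weights("mask_rcnn_coco.h5", by_name=True)   # pre-trained COCO weights

image = skimage.io.imread("example.jpg")                # hypothetical test image
r = model.detect([image], verbose=1)[0]                 # dict with 'rois', 'masks', 'class_ids', 'scores'
print(len(r["class_ids"]), "objects detected")
```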
Our first task was to modify the code so that it could detect yeast. The GitHub repository provided several object classes as usage examples. One of them, named 'nucleus', was written to distinguish between two classes, nucleus and background. This example was of great use to us, since our adaptation of the code also had to differentiate between two classes, yeast and background. After studying the role each file plays in the execution of the algorithm, we created the yeast_Final.py file, overriding some functions of the model and adding new ones.
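The pattern we followed can be sketched as follows. This is not the exact content of yeast_Final.py, but the general structure of a two-class dataset in the repository's API, assuming one folder per image with 'images' and 'masks' subfolders:

```python
import os
import numpy as np
import skimage.io
from mrcnn import utils

class YeastDataset(utils.Dataset):
    """Minimal sketch of a two-class (yeast / background) dataset, following the
    pattern of the repository's nucleus example; details differ from yeast_Final.py."""

    def load_yeast(self, dataset_dir, image_ids):
        self.add_class("yeast", 1, "yeast")          # single foreground class
        for image_id in image_ids:
            self.add_image(
                "yeast",
                image_id=image_id,
                path=os.path.join(dataset_dir, image_id, "images", image_id + ".png"))

    def load_mask(self, image_id):
        # One binary mask per instance, stacked along the last axis.
        info = self.image_info[image_id]
        mask_dir = os.path.join(os.path.dirname(os.path.dirname(info["path"])), "masks")
        masks = [skimage.io.imread(os.path.join(mask_dir, f)).astype(bool)
                 for f in sorted(os.listdir(mask_dir))]
        mask = np.stack(masks, axis=-1)
        return mask, np.ones(mask.shape[-1], dtype=np.int32)
```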
B. Dataset
The data we trained the program on came directly from the LPBS laboratory. The pictures were provided as compressed '.tif' files containing images of size 2044x2048 pixels. For each 'data point', or image, we had two files: the first was a picture of the yeast cells under a microscope, and the second was a corresponding ground-truth image containing masks placed at the location, and with the shape, of each cell, as seen in Figure 3. Note that the mask image contains all the yeast cells at once. We had to address this, because our goal was to label each yeast cell differently (to perform instance segmentation), whereas the program interpreted the ground truth as one giant cell made out of smaller circles. To deal with this issue, for each mask image we detected how many yeast cells were labelled and then, for each cell, created a new image containing only that cell's mask.
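A minimal sketch of that splitting step, assuming each cell is encoded with a distinct non-zero pixel value in the ground-truth image (connected-component labelling would be needed if the masks were purely binary):

```python
import os
import numpy as np
import skimage.io

def split_instances(mask_path, out_dir):
    """Write one binary mask file per cell from a ground-truth image containing all cells.
    Assumes each cell carries a distinct non-zero pixel value; 0 is background."""
    os.makedirs(out_dir, exist_ok=True)
    labels = skimage.io.imread(mask_path)
    for i, value in enumerate(v for v in np.unique(labels) if v != 0):
        single = (labels == value).astype(np.uint8) * 255
        skimage.io.imsave(os.path.join(out_dir, "mask_{:04d}.png".format(i)), single)
```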
FIG. 3: Example of yeast cell image visualisation using 'run_visualize.py'. Each cell has an overlapping ground-truth mask
Looking through the data, we observed that the average number of yeast cells per image was generally around 50 to 150. However, we found some files where the number of masks exceeded a thousand cells, and others where masks consisted of a single pixel. After deciding to remove those data points, we ended up with a dataset of around 290 images.
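The sanity check behind this cleaning step can be sketched as follows (the thresholds shown are illustrative):

```python
import numpy as np

def is_valid_ground_truth(label_image, max_cells=1000, min_pixels=2):
    """Discard frames with an implausible number of labelled cells or with single-pixel masks."""
    values, counts = np.unique(label_image[label_image > 0], return_counts=True)
    if len(values) == 0:
        return False                      # no cells labelled at all
    return len(values) <= max_cells and counts.min() >= min_pixels
```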
The dataset needed a specific folder structure to be compatible with the algorithm. We wrote several functions (in yeast_Final.py, see ReadMe.md) that take any compressed '.tif' file containing labelled data and format it to work with the program, as well as separating the masks for instance segmentation.
C. Training
Once the modifications to the code and to the yeast cell dataset were done, we started training the model from the weights provided for the COCO dataset. We justify this choice by the fact that it seemed better to start with weights already trained to detect generic patterns, such as lines and corners, than to start from scratch. We split our dataset into 261 training images and 33 validation images. As our classification was binary, i.e. either yeast cell or background, we set the confidence threshold for a positive detection to 50%.
At the end of each training run, we mainly looked at four values to assess the quality of the new weights: the total loss on the training data (loss) and on the validation data (val_loss), the mask loss, i.e. the pixel-wise cross-entropy loss of the mask branch on the validation data (val_mask_loss), and the average precision (AP) in combination with the intersection over union (IoU). The loss gave us a direct indication of whether the model was getting better or worse, and as our main goal was to generate masks for the cells, the mask loss was equally useful, giving information on the pixel-wise precision. The average precision was calculated from precision and recall as a function of the IoU, which measures the area of the predicted bounding box overlapping the ground truth divided by the area of the union of the two. In summary, after each training run we looked at the losses and visually inspected the new weights by running predictions on an image from the validation set, for which we could compute the AP at different IoU thresholds.
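To make the two quantities just described concrete, here is a minimal sketch of a box IoU and of the pixel-wise cross-entropy used by the mask loss; the repository computes these internally, so this is only illustrative:

```python
import numpy as np

def box_iou(box_a, box_b):
    """IoU of two boxes (y1, x1, y2, x2): overlap area divided by the area of the union."""
    y1, x1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    y2, x2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(y2 - y1, 0) * max(x2 - x1, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

def mask_cross_entropy(pred_mask, gt_mask, eps=1e-7):
    """Pixel-wise binary cross-entropy between a predicted soft mask and a 0/1 ground truth."""
    pred = np.clip(pred_mask, eps, 1.0 - eps)
    return float(np.mean(-(gt_mask * np.log(pred) + (1.0 - gt_mask) * np.log(1.0 - pred))))
```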
A training session took on average 6 to 10 hours per epoch, so we could only run one at a time. We used Google Colab hosted servers, as our own computers could not handle the processing. Google Colab proved unstable, and many training sessions were aborted because of the way the company allocates its servers.
D. Improvement attempts
After completing the first training sessions and obtaining the first positive results, we tried to improve the model by modifying several aspects. Firstly, we adjusted the batch size (the number of data points the network takes at each iteration) and the number of steps per epoch to find the best compromise between execution speed and the number of images used for training. Secondly, all the pictures looked similar, with a colony of yeast in the center of the image. To make the algorithm more robust in case a new image deviated from the usual, and also to increase the amount of training data, we decided to use image augmentation. This function takes an original picture and modifies it randomly by rotation, symmetry, cropping and blurring. This way, the images the network is trained on have much more variability than the ones provided by the lab. The final size of the training images is 512x512, but the original images are cropped to a smaller size and magnified by a certain factor to match this size. This makes the yeast cells appear larger and allows the network to learn to generate precise masks. Note that different values were tested here and the results are presented in the next section. Another modification was to pre-process the images by increasing contrast and luminosity, which made the cells much easier to see with the naked eye, since the original images have low contrast.
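The augmentation pipeline can be expressed with the imgaug library, which the repository accepts directly as a training argument. The following is a sketch with illustrative parameters rather than our exact values:

```python
import imgaug.augmenters as iaa

# Apply between 0 and 4 of these transforms to each image, chosen at random.
augmentation = iaa.SomeOf((0, 4), [
    iaa.Fliplr(0.5),                      # horizontal mirror
    iaa.Flipud(0.5),                      # vertical mirror
    iaa.Affine(rotate=(-45, 45)),         # random rotation
    iaa.GaussianBlur(sigma=(0.0, 2.0)),   # random blur
    iaa.Multiply((0.8, 1.2)),             # pixel value multiplication
])

# Passed to the repository's training entry point, e.g.:
# model.train(dataset_train, dataset_val, learning_rate=config.LEARNING_RATE,
#             epochs=1, layers="all", augmentation=augmentation)
```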
V. RESULTS
In this section, we present the results of a number of training runs in which we tested different parameters and image pre-processing options in order to obtain the weights giving the most accurate results, based on the pixel-wise cross-entropy loss on the validation dataset (val_mask_loss).
Firstly, we aimed to determine whether all the layers of the neural network needed to be trained. Comparing the losses in Table I, we observe a significant difference between training all the layers and training only the head layers (the RPN, classifier and mask heads of the network). Moreover, we could choose to train from different stages of the ResNet-101 onwards (e.g. stage 4 and above), but this led to a major increase in the loss without reducing the running time, so this option was discarded.
          Head layers    All layers
Loss      1.2419         1.0246
TABLE I: Loss as a function of the layers trained
Secondly, we wanted to determine the optimal number of images per batch. To do so, we tested three different batch sizes, as shown in Table II. We observe that the loss decreases significantly as the number of images per batch increases. However, we were not able to increase the batch size above 4 to find the optimal value, due to the considerable training time and the lack of computing power and memory.
Batch size    1         3         4
Loss          1.2384    1.0684    1.0246
TABLE II: Loss as a function of the number of images per batch
Then, we compared the number of augmentations applied per image, as presented in Table III. We observe that increasing the number of augmentations per image makes the dataset more varied, leading to a small increase of the loss on the training data but a large decrease on the validation data.
Number of augmentations    Between 0 and 2    Between 0 and 4
Loss                       1.0246             1.0388
val_loss                   1.1712             1.0477
val_mask_loss              0.1828             0.1721
TABLE III: Loss on the training and validation data as a function of the number of augmentations
Afterwards, to try to reduce the training time without increasing the loss too much, we trained with and without a scaling factor of 2. The results in Table IV show that the scaling factor is necessary to decrease the loss, even though removing it divides the training time by almost a factor of 2.
Scaling factor    1x (no rescaling)    2x
Loss              1.2750               1.0246
TABLE IV: Loss as a function of the scaling factor
Eventually, we added an option in the yeast configuration class to select a pre-processing of the images before training. The different pre-processing options available (changing the contrast, the luminosity, or both) were tested. Although they allowed us to visualize the yeast cells better by eye, they did not decrease the loss; see Table V.
Following these steps, we were able to optimize our model for the yeast dataset, leading to the results reported below in Table VI.
Pre-processing    None      Contrast    Luminosity + Contrast
Loss              1.0246    1.4369      1.4538
TABLE V: Loss as a function of the pre-processing performed
                 Optimal model
Loss             1.0388
mask_loss        0.1829
val_loss         1.0477
val_mask_loss    0.1721
TABLE VI: Loss on the training and validation data for the optimal model
The final parameters used to obtain these weights are defined in the YeastConfig class in yeast_Final.py. For example, the number of images per batch is set to 4, and the image resize mode is a 256x256 crop with a scaling factor of 2 (to obtain 512x512). The number of augmentations per image (rotation, blur, symmetry or pixel value multiplication) is chosen at random between 0 and 4. The region proposal network anchors are set to smaller sizes than in the original configuration in order to deal with the yeast cells, which are small objects. Using these weights and parameters, an example of segmentation on an image from the validation set is shown in Figure 4.
FIG. 4: Results of detection on an image from the validation set (the
image has been cropped for better view)
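For reference, the essence of that configuration can be sketched as follows; the authoritative values live in yeast_Final.py, and the attribute names are those of the repository's Config class:

```python
from mrcnn.config import Config

class YeastConfig(Config):
    """Sketch of the final training configuration; see yeast_Final.py for the exact values."""
    NAME = "yeast"
    NUM_CLASSES = 1 + 1                        # background + yeast cell
    IMAGES_PER_GPU = 4                         # 4 images per batch
    IMAGE_RESIZE_MODE = "crop"                 # random crops during training
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    IMAGE_MIN_SCALE = 2.0                      # 2x magnification: a 256x256 field becomes 512x512
    RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)   # smaller anchors for small cells
    DETECTION_MIN_CONFIDENCE = 0.5             # 50% threshold for the binary yeast/background decision
```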
Using the optimal weights on new images and performing instance segmentation of yeast cells leads to the prediction of segmentation masks for the yeast, as seen in Figure 5. The masks predicted by our model attribute a different pixel value to each detected yeast cell, allowing the laboratory to separate each cell individually for further analysis.
FIG. 5: Detection results (right) on test data (left, original image modified to make the yeast cells visible to the naked eye)
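A sketch of how the detector's per-instance output can be collapsed into such a labelled image (r['masks'] being the boolean mask stack returned by the repository's detect call):

```python
import numpy as np

def instances_to_label_image(masks):
    """Collapse a boolean mask stack of shape (H, W, N) into one image in which
    the pixels of cell i carry the value i + 1 and the background stays 0."""
    label_image = np.zeros(masks.shape[:2], dtype=np.uint16)
    for i in range(masks.shape[-1]):
        label_image[masks[:, :, i]] = i + 1   # later instances overwrite overlaps
    return label_image

# Typical use with the detection output:
# r = model.detect([image])[0]
# labels = instances_to_label_image(r["masks"])
```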
VI. DISCUSSION
As presented in the results section, we obtained promising results with the training that was done. We were able to find relevant parameters given our dataset and the time at our disposal. The detection is not perfect, as can be seen in Figure 5, where one yeast cell in the center is not detected, but given more training we believe the algorithm would perform as intended.
Our main issue was the training time. Although the first stage of the algorithm, based on Faster R-CNN, was designed to optimize training and prediction time, one epoch still took at least 6 to 8 hours, which made it difficult to run many tests or to increase the number of epochs. We believe the reason may be that there are many yeast cells per image, which produces a large number of RoIs and could therefore slow down training. Upon research, we found suggestions on how to make the program run faster; however, after implementing those changes, the running time did not decrease significantly whereas the loss increased considerably. Another issue was that the architecture was designed to take images in RGB format (3 color channels), while our dataset consisted of grayscale images only (1 channel). Using the grayscale images without transformation produced errors, so we had to convert them to RGB before using the different functions. Although working directly with grayscale images would probably have reduced the running time, we were not able to modify all the functions that expect RGB images.
As mentioned in the introduction, Mask R-CNN was developed to compete on the COCO dataset, where it has to classify over 90 different object categories. In our case, there is only one object class to detect. We believe that the complexity of the architecture, i.e. the number of layers, could be reduced to produce a simpler and more efficient algorithm. Combined with greater computing power, more training could then have been done.
In sum, as non-computer scientists, we found the original code complex to understand and to work with. It was rewarding to adapt the code as much as we could and to observe what worked better or worse, but we were sometimes limited by our fresh knowledge of machine learning and neural networks.
VII. CONCLUSION
This project involved the first neural network we have worked on. Looking at the overall results, we believe our goal was met. We were able to work on a complex algorithm and adapt it to our needs, fulfilling the requirements of the LPBS laboratory. We gained insight into how computer vision operates and realized how important it can be for research. We also know that there is always room for improvement with further knowledge of the subject. We would like to thank Prof. S. Rahi for the opportunity to work at their side on this project, as well as everybody in the Laboratory of the Physics of Biological Systems for their help and insight.
VIII. REFERENCES
[1] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," arXiv:1703.06870 [cs], Jan. 2018.
[2] Matterport, Inc., "Mask R-CNN," GitHub repository: matterport/Mask_RCNN, 2019.
[3] "Splash of Color: Instance Segmentation with Mask R-CNN and TensorFlow." [Online]. Available: https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46. [Accessed: 19-Dec-2019].
[4] S. Ghosh, N. Das, I. Das, and U. Maulik, "Understanding Deep Learning Techniques for Image Segmentation," arXiv:1907.06119 [cs], Jul. 2019.
[5] E. Moen, D. Bannon, T. Kudo, W. Graf, M. Covert, and D. Van Valen, "Deep learning for cellular image analysis," Nature Methods, vol. 16, no. 12, pp. 1233–1246, Dec. 2019.