Towards operational phytoplankton recognition with automated high-throughput imaging and compact convolutional neural networks

Plankton communities form the basis of aquatic ecosystems, and elucidating their role in increasingly important environmental issues is a persistent research question. The concealed plankton community dynamics reflect changes in environmental forcing, growth traits of competing species, and multiple food web interactions. Recent technological advances have made it possible to collect real-time big data, opening new horizons for testing core hypotheses in planktonic systems, derived from macroscopic realms, in community ecology, biodiversity research, and ecosystem functioning. Analyzing the big data calls for computer vision and machine learning methods capable of producing interoperable data across platforms and systems. In this paper we apply convolutional neural networks (CNNs) to classify a brackish-water phytoplankton community in the Baltic Sea. To solve the classification task, we utilize compact CNN architectures that require less computational capacity and can be trained quickly. This makes it possible to 1) test various modifications to the classification method, and 2) repeat each experiment multiple times with different training and test set combinations to obtain reliable results. We further analyze the effect of large class imbalance on CNN performance, and test relevant data augmentation techniques to improve the performance. Finally, we address the practical implications of the classification performance for aquatic research by analyzing the confused classes and their effect on the reliability of the automatic plankton recognition system, to guide further development of plankton recognition research. Our results show that good classification accuracy can be obtained with relatively shallow architectures and a small amount of training data, even with a very unbalanced dataset, when effective data augmentation methods are used.

laboration between experts and exchange with other disciplines, like modelers. Phytoplankton imaging is recognized as one of the main emerging technologies also within the pan-European coastal observing JERICO-Research Infrastructure consortium (Puillat et al., 2016; Farcy et al., 2019), and it forms an important part of the future European integrated coastal observation system. Our study is one step in the technology development towards these goals, making phytoplankton recognition with imaging systems more reliable.

2 Materials and methods
The data consist of images of phytoplankton captured with the IFCB (Olson and Sosik, 2007). The IFCB is an in situ automated submersible imaging flow cytometer. It captures images of suspended particles in the size range of 10 to 150 µm, with an image resolution of roughly 3.4 pixels per µm. The device samples seawater at a rate of 15 ml per hour and can produce tens of thousands of images per hour, depending on the cell densities. The IFCB also gives analog-to-digital converted data from the photomultiplier tubes of the device. The photomultiplier tubes are used to detect light scatter and fluorescence from particles excited by the device laser, and the analog-to-digital converted values of the photomultiplier tubes are used to determine whether an image should be captured or not. The data was collected from the northern Baltic Sea in autumn 2016 and from spring to summer 2017 (Fig. 1). The 2016 data was collected from the Alg@line ferrybox system of M/S Finnmaid and Silja Serenade (Ruokanen et al., 2003; Kaitala et al., 2014). The 2017 data originate from the Utö Atmospheric and Marine Research Station (Laakso et al., 2018). Example images of the data set can be seen in Fig. 2.
The dataset contains grayscale images and metadata of phytoplankton divided into 53 different classes (not representing taxonomic phytoplankton classes). Furthermore, the following classes were divided into subclasses based on visual characteristics: Ciliata, Chroococcus, Cryptophyceae, Dinophyceae, Mesodinium rubrum, Pennales, and Snowella/Woronichinia sp.
The reason for this was the assumption that dividing classes with two or more visually distinct clusters into subclasses would improve the classification result. The total number of classes, subclasses included, was 61.
There is large variation in the sizes and aspect ratios of the images. The vertical axes of the images range from 21 pixels to 770 pixels and the horizontal axes of the images from 52 pixels to 1,359 pixels. There is also a very large imbalance in the number of images per class, as shown in Table 1. The number of images per class varies from 10 (Apedinella radians) to 3,710 (Snowella/Woronichinia sp.).

The metadata used in the experiments contain the original image dimensions and analog-to-digital converted data of the average and peak values of the photomultiplier tubes during a laser pulse. The metadata was pseudo-normalized to have values between -1 and 1. Four data subsets named Subset100, Subset50, Subset20, and Subset10 were constructed from the data by selecting classes with the number of images equal to or greater than the selected threshold value. For example, Subset100 contains all the classes in the data with at least 100 images per class. The details of these data subsets are shown in Table 2.

Each data subset was randomly split into a training set and a testing set. For each class, the number of images assigned to the testing set equals 25% of the threshold value of the subset. The remaining images, up to one thousand per class, were assigned to the training set.
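The subset construction and split described above can be sketched as follows. The function name and data layout are illustrative assumptions, not the original implementation:

```python
import random
from collections import defaultdict

def make_subset(images_by_class, threshold, max_train=1000, test_frac=0.25, rng=None):
    """Build one data subset as described in the text: keep classes with at
    least `threshold` images, draw a testing set of 0.25 * threshold images
    per class, and assign up to `max_train` remaining images to training."""
    rng = rng or random.Random(0)
    n_test = int(test_frac * threshold)
    train, test = defaultdict(list), defaultdict(list)
    for cls, imgs in images_by_class.items():
        if len(imgs) < threshold:
            continue  # class has too few images for this subset
        imgs = imgs[:]
        rng.shuffle(imgs)  # random split
        test[cls] = imgs[:n_test]
        train[cls] = imgs[n_test:n_test + max_train]
    return train, test
```

For Subset100, for example, this keeps only classes with at least 100 images, places 25 images per class in the testing set, and at most 1,000 per class in the training set.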

Data preprocessing
A CNN typically requires the input images to have a constant, predefined size. Plankton images were resized to a uniform size as follows. The target frame refers to the desired size of the resized images. Images larger than the target frame were reduced with bicubic interpolation so that the larger axis of the image matches the target frame, keeping the aspect ratio as constant as possible. Images smaller than the target frame were correspondingly enlarged so that their larger axis matches the target frame. Each image was then padded to match the target frame dimensions. As there is a lot of background in the images, the mode of the image, that is, the most common pixel value in the image, was used as the padding color.
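The resize-and-pad step can be sketched with Pillow and NumPy as below. The function name and the centered placement of the image on the padded canvas are assumptions; the original implementation is not specified:

```python
import numpy as np
from PIL import Image

def resize_and_pad(img, target=256):
    """Resize so the larger axis equals `target` (bicubic interpolation),
    keeping the aspect ratio, then pad to target x target using the mode
    (most common pixel value) of the image as the padding color."""
    arr = np.asarray(img)
    h, w = arr.shape[:2]
    scale = target / max(h, w)
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    resized = img.resize((new_w, new_h), Image.BICUBIC)
    # mode of the pixel values, used as the padding color
    vals, counts = np.unique(arr, return_counts=True)
    pad_value = int(vals[np.argmax(counts)])
    canvas = Image.new(img.mode, (target, target), pad_value)
    # placement on the canvas (here centered) is an assumption
    canvas.paste(resized, ((target - new_w) // 2, (target - new_h) // 2))
    return canvas
```

Using the mode as the padding color makes the padding blend into the mostly uniform image background.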
Data augmentation was used to increase the number of training images. The implemented data augmentation methods are reflection, 90-degree rotation, and cropping 10 percent off both ends of the larger image axis of the original training images. The cropping was performed before the images were resized to the uniform size. Adding the reflected, cropped, and rotated versions of the training set images increased the amount of training data by a factor of two, two, and four, respectively. The proposed augmentation therefore allowed the generation of up to 16 times more training data per class.

CNN256 is a scaled-up version of CNN128 for input images with higher spatial resolution. The main difference between the two architectures is that CNN128 takes 128x128 sized images and CNN256 takes 256x256 sized images as the input.
CNN256 consists of approx. four million trainable parameters whereas CNN128 consists of approx. three million trainable parameters. As a comparison, the AlexNet architecture consists of approx. 60 million trainable parameters (Krizhevsky et al., 2012).
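The augmentation scheme described in the preprocessing section (2 reflections x 2 crop variants x 4 rotations = 16 images) can be sketched in NumPy as follows; the function name and the exact enumeration order are illustrative assumptions:

```python
import numpy as np

def augment(img):
    """Generate the 16 augmented variants of a training image: with and
    without a 10% crop off both ends of the larger image axis, with and
    without horizontal reflection, and the four 90-degree rotations."""
    h, w = img.shape[:2]
    # crop 10 percent off both ends of the larger image axis,
    # before any resizing to the uniform input size
    if h >= w:
        m = h // 10
        cropped = img[m:h - m, :]
    else:
        m = w // 10
        cropped = img[:, m:w - m]
    variants = []
    for base in (img, cropped):
        for flipped in (base, np.fliplr(base)):
            for k in range(4):  # rotations of 0, 90, 180, 270 degrees
                variants.append(np.rot90(flipped, k))
    return variants
```

The three transformations compose multiplicatively, which is how the factors of two, two, and four yield the 16-fold increase.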

Description of experiments
CNN128 was trained for 70 epochs on the Subset50 training data with Nesterov's method (Nesterov, 2013), using a learning rate of 0.01, momentum of 0.9, and batch size of 256. The parameters are based on small-scale empirical tests in which it was observed that the CNN can be trained successfully with these parameters. The network accuracy was tested every 5 epochs with the appropriate testing set.
CNN256 was trained for 70 epochs on the Subset100, Subset50, Subset20, and Subset10 training data with Nesterov's method, using a learning rate of 0.01, momentum of 0.9, and batch size of 256. Additionally, CNN256 was trained on a variation of Subset50 in which the images that were too small for the network input were only padded, instead of being enlarged first and then padded.
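The Nesterov accelerated gradient update used for training (learning rate 0.01, momentum 0.9) can be sketched in NumPy as below. This is a generic look-ahead formulation on a toy quadratic objective, not the authors' training code:

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.01, momentum=0.9):
    """One Nesterov accelerated gradient step: evaluate the gradient at the
    look-ahead point w + momentum * v, then update velocity and weights."""
    g = grad_fn(w + momentum * v)
    v = momentum * v - lr * g
    return w + v, v

# toy example: minimize f(w) = w^2, whose gradient is 2w
w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, v = nesterov_step(w, v, lambda x: 2 * x)
```

In practice the gradient would come from backpropagation over a mini-batch of 256 images, as in the experiments above.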
The CNNs were compared with a classifier based on a Random Decision Forest (RDF) utilizing handcrafted features (Sosik and Olson, 2007). The RDFs, each consisting of 1,000 decision trees, were created from the Subset100, Subset50, Subset20, and Subset10 training data.
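A baseline of this kind can be sketched with scikit-learn as below. The use of scikit-learn and the random feature vectors are assumptions for illustration; the original baseline used the handcrafted geometry and texture features of Sosik and Olson (2007):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# hypothetical handcrafted feature vectors standing in for the
# Sosik and Olson (2007) features (200 samples, 16 features, 4 classes)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16))
y_train = rng.integers(0, 4, size=200)

# a forest of 1,000 decision trees, as in the comparison above
rdf = RandomForestClassifier(n_estimators=1000, random_state=0)
rdf.fit(X_train, y_train)
preds = rdf.predict(X_train)
```

Unlike the CNNs, the RDF operates on precomputed feature vectors rather than on the images themselves.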

Evaluation criteria
Cross validation was used to evaluate the performance. The number of cross validation runs for each data subset was selected based on the number of images per class: classifiers trained on data subsets containing fewer testing images per class were cross validated more times in order to have more confidence in the result. The results for CNN128 were cross validated 30 times, whereas the results for CNN256 for Subset100, Subset50, Subset20, and Subset10 were cross validated 30, 30, 30, and 60 times, respectively. To evaluate the performance and to compare different methods, the classification performances were measured using top-k accuracies and confusion matrices. The top-k accuracy is defined as

top-k accuracy = T_k / (T_k + F_k),

where T_k is the number of test images for which the correct class is within the top-k model predictions, and F_k is the number of images for which the correct class is not within the top-k model predictions. The standard deviations were also calculated for the top-1 accuracies of the cross validations to represent the variation in the performance.
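The top-k accuracy defined above can be computed from a matrix of per-class scores as in the following sketch (function name and array layout are illustrative):

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """Top-k accuracy T_k / (T_k + F_k): the fraction of test images whose
    correct class is among the k highest-scoring model predictions.
    `scores` has shape (n_samples, n_classes); `labels` holds class indices."""
    # indices of the k highest-scoring classes for each sample
    topk = np.argsort(scores, axis=1)[:, -k:]
    hits = np.any(topk == np.asarray(labels)[:, None], axis=1)
    return hits.mean()
```

With k = 1 this reduces to the ordinary classification accuracy.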

Results
The overall performance of the classification was good, and the network was able to identify many key species of the Baltic Sea phytoplankton community. Tests with different subsets all reached high performance. A noticeable number of misclassifications occurred between subclasses of close taxonomic affiliation. When the subclasses were combined, CNN256 obtained classification accuracies of 0.874 for Subset10, 0.892 for Subset20, 0.889 for Subset50, and 0.887 for Subset100. The classification accuracies for each class with CNN256 on Subset10 with subclasses combined can be seen in Table 5.
The main classification problems evidently occurred with classes of similar appearance, typically of close taxonomic relation. The highest confusions were within different classes of dinoflagellates and between a species-level class and a more generic-level class belonging to the same order. Table 6 illustrates the most common classification errors. Each row shows a pair of classes that were commonly confused with each other. It should be noted that the example images in the table are randomly selected and were not necessarily misclassified during the experiments.
As a promising result for the real-time/near real-time method utilization, a relatively small number of trainable parameters in the CNN128 architecture kept the training time short, which made it ideal to test various modifications to the architecture and plankton classification algorithm. The results with the different variations of CNN128 are shown in Table 7. It can be seen that using metadata as additional features does not increase the top-1 classification accuracy. Using the mode of the images as the padding color results in higher classification accuracy than white or black as the padding color. Resizing images with the bicubic interpolation also results in a better top-1 accuracy than the nearest-neighbor interpolation. A small increase in the top-1 accuracy of CNN128 is observed when using the parametric ReLU activation function.
A small decrease in the top-1 accuracy of the network is observed when opting not to use cropping as a data augmentation method. A significant decrease in the network top-1 accuracy can be seen when no data augmentation methods are used. The importance of using dropout can also be seen: the top-1 accuracy of the network drops drastically when opting not to use dropout. Having 1024 nodes in the fully-connected layers resulted in a 0.804 ± 0.021 top-1 accuracy at 70 epochs; when the network was then trained to 100 epochs, the top-1 accuracy increased to 0.813 ± 0.018. Increasing the number of nodes in the fully connected layers has clear potential to increase the top-1 accuracies. The increase from 512 nodes to 1024 nodes roughly doubled the number of parameters in the network and resulted in a considerably longer training time. Removing the last pooling layer of the network clearly reduced the network top-1 accuracy and also roughly quadrupled the number of parameters in the