Linking satellites to genes with machine learning to estimate phytoplankton community structure from space

El Hourany, Roy; Pierella Karlusich, Juan; Zinger, Lucie; Loisel, Hubert; Levy, Marina; Bowler, Chris

doi:https://doi.org/10.5194/os-20-217-2024

Articles | Volume 20, issue 1

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article

21 Feb 2024

Download

Final revised paper (published on 21 Feb 2024)
Supplement to the final revised paper
Preprint (discussion started on 14 Dec 2022)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2022-1421', Anonymous Referee #1, 11 Feb 2023
The work by El Hourany et al. describes machine learning techniques for application to (blue) ocean color data to determine the global distribution of phytoplankton functional types. A special focus is on the description of ML techniques with the identification of crucial features based on parameters of the merged GlobColour dataset. The details of the methods used are often cryptically written and difficult to follow, and reproduction of the methods and results is not possible. The methods section should be revised accordingly. Besides the application of ML methods in the context, the advantage of the method remains unclear and is not further specified; it could well be higher accuracy or computing speed. I recommend a thorough revision of the paper to describe the methods in a more understandable way and to prove the added value (also of future ML methods).
Specific comments:
The title is a bit catchy and inaccurate. It is rather about pigments, which are typical for color groups, but which can be very different in type of phytoplankton and corresponding genes.

The figures should all be revised, e.g. Fig. 5. Axis labels with units are often missing. Partly chlorophyll concentrations are given in log10, this is better in Fig. 2.

Line 87: Only as a comment that size fractioning often damages the cells and such data should therefore be treated with caution.

For the discrimination of absorption features, rather the central visible region is necessary (e.g. Xi et al. 2015). In this respect, the use of the GlobColour data set with Rrs only up to 555 nm is unfavorable, as the correlation plots show. The OC-CCI dataset has more (MERIS) bands here and corresponding differences could be underlined. References to GlobColour and matchup procedure are missing.

It is a Case-1 approach for a medium range of chlorophyll concentrations, which should be communicated in a better way. Maybe flagging and an uncertainty product would be useful. However, in such open ocean conditions, HPLC methods are often at the limit (if low volumes of water are filtered) – extreme uncertainties may exist in the fundamental training data.

Besides SST, salinity is actually a strong indicator for some PFTs.

The method part is unclear, especially lines 163-212. Part of the problem could be that less common naming conventions are used, e.g. do you refer to the neural network architecture when you optimize the map size? What does the final map or architecture look like?

Line 269: The more parameters we utilize, the more we must trust the data quality. Nevertheless, seen over the global ocean, there are many uncertainties in all mentioned parameters and regions. Especially Rrs in blue bands and the retrieved chlorophyll concentration must be considered as critical, even more because reflectances are derived from multi-mission merged data with sensor-specific atmospheric correction.

6: The marine model of ocean color algorithms for atmospheric correction and chlorophyll retrieval is mostly based on a diatom-like chlorophyll-specific absorption and scattering behavior (e.g. Bricaud et al., 1995). Thus, it is good that there is a relatively high correlation between diatoms and chlorophyll concentration. But what about features that are not captured, e.g. the specific optical properties of coccolithophores (e.g. Balch, 2018)? There is a high abundance, e.g. in the Great Calcite Belt, where Fig. 7 indicates high reliability of the model, with a C2 distribution in Fig. 10 that seems to be different. I see some question marks and would ask for a more careful discussion of the model uncertainty.

It is unclear how the new method behaves compared to the mentioned operational model by Xi et al. (2020). What are the advantages of the presented method?
Citation: https://doi.org/10.5194/egusphere-2022-1421-RC1
- AC1: 'Reply on RC1', Roy El Hourany, 10 Jun 2023
  
  Dear Editors and Referees,
  We would like to express our gratitude for the valuable comments and suggestions provided for improving our manuscript. We acknowledge the referee's observations regarding communication ambiguities and technical issues in the initial version, and we have prepared this revised manuscript to address these concerns and clarify the highlighted aspects.
  This response aims to address the common major issues raised by both referees. We acknowledge that the development steps of the omic-based satellite algorithm in the paper were unclear, and the inclusion of a pigment-based approach for validation was misleading.
  To clarify, our method is based on the link between omic and satellite data. The pigment approach only played a role in the post-training process to compare the outputs of both approaches. We incorporated the pigment HPLC data in this study due to its widespread use in ocean color remote sensing techniques for estimating phytoplankton groups, primarily because of its high data availability. However, it is important to note that numerous studies have demonstrated significant uncertainties between the pigment approach and phytoplankton abundance observed through other methods (Chase et al., 2020). These uncertainties arise from factors such as the overlapping presence of pigments across phytoplankton classes, photoacclimation, and physiological processes. Therefore, it is crucial to recognize that our study addresses two types and levels of information: Omics and Pigments. The use of pigments in this work is more for comparison purposes rather than validation, and we acknowledge that our previous message regarding this matter was misleading.
  Additionally, the referees found it unconvincing to introduce physiological uncertainties when transforming omics data into Chla fraction per phytoplankton class. We introduced this aspect to compare the omic and pigment-based approaches.
  Based on the comments from both referees, we have chosen to thoroughly revise the methodology section. We have added flowcharts to simplify the process and enhance its applicability for readers. The entire methodology has been revised in light of the suggestions provided by the referees. To address the concerns regarding chlorophyll-a fractionation and enable the emergence of different levels of information as outputs, we trained two algorithms using the same satellite data and the Self-Organizing Map (SOM) methodology. One algorithm provides the relative cell abundance of phytoplankton (including the estimation of direct psbO relative abundance values), while the other algorithm estimates the phytoplankton Chla fraction per group. Importantly, this revised methodology now considers the psbO occurrence per size fraction, which was not taken into account in the initial version of the manuscript. In both algorithms, uncertainties on the outputs were evaluated and therefore are presented with the outputs.
  The outputs from both algorithms will allow us to address questions regarding phytoplankton diversity from an ecological perspective (through relative cell abundance) and a biogeochemical perspective (through Chla fraction per group), while considering physiological uncertainties.
  We sincerely hope that this response is convincing and meets your expectations. We appreciate the thorough review process and are confident that the revisions have significantly improved the manuscript.
  Figure S1: Different levels of information on phytoplankton groups. Noting that cell relative abundance (psbO relative abundance) and Chla contribution and fraction per group are two outputs of two different algorithms based on the same SOM-psbO methodology described in this paper.
  Comments Referee #1
  The work by El Hourany et al. describes machine learning techniques for application to (blue) ocean color data to determine the global distribution of phytoplankton functional types. A special focus is on the description of ML techniques with the identification of crucial features based on parameters of the merged GlobColour dataset. The details of the methods used are often cryptically written and difficult to follow, and reproduction of the methods and results is not possible. The methods section should be revised accordingly. Besides the application of ML methods in the context, the advantage of the method remains unclear and is not further specified; it could well be higher accuracy or computing speed. I recommend a thorough revision of the paper to describe the methods in a more understandable way and to prove the added value (also of future ML methods).
  We thank the referee for the valuable comments and suggestions provided for improving our manuscript. We acknowledge the referee's observations regarding communication ambiguities and technical issues in the initial version, and we have prepared this revised manuscript to address these concerns and clarify the highlighted aspects.
  In this paper, we approach the estimation of phytoplankton groups as a unified community structure to preserve inter-group coherence. Our objective was to develop a method capable of estimating all seven groups using a single set of satellite predictors. The challenge we faced was twofold: the problem was multivariate in nature, and the dataset was relatively small, with missing values in the satellite matchups.
  To address these challenges and ensure that no valuable psbO measurements were lost, we turned to the technique of Self-Organizing Maps (SOM) and topology conservation. SOM is a powerful unsupervised learning algorithm that allows for the establishment and reproduction of relationships between variables. By utilizing SOM, we were able to fill in the gaps in the dataset and exploit the preserved topology to estimate the phytoplankton groups.
  The advantage of using SOM in this context is its ability to handle multivariate data and preserve the underlying structure of the variables. It enables us to capture the complex relationships between the predictors and the phytoplankton groups, even with missing values. By leveraging the topology conservation property of SOM, we ensure that the estimated relationships are consistent with the overall structure of the data.
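  The gap-filling idea described above can be illustrated with a minimal sketch. It assumes that the distance to each neuron is computed only over the components that were actually observed, and that missing components are then read from the closest neuron's referent vector. The function names, the toy neurons, and the use of `None` for a missing value are hypothetical and are not taken from the manuscript.

```python
import math

def masked_distance(x, w):
    """Euclidean distance over the components of x that are observed (not None)."""
    pairs = [(xi, wi) for xi, wi in zip(x, w) if xi is not None]
    if not pairs:
        raise ValueError("observation has no observed components")
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in pairs))

def fill_gaps(x, neurons):
    """Replace missing components of x with those of the closest neuron."""
    best = min(neurons, key=lambda w: masked_distance(x, w))
    return [wi if xi is None else xi for xi, wi in zip(x, best)]

# Toy referent vectors (illustrative values only, three variables each).
neurons = [[0.0, 0.0, 0.1], [1.0, 1.0, 0.9], [2.0, 2.0, 2.5]]

# The second variable is missing; it is taken from the closest neuron.
filled = fill_gaps([0.9, None, 1.0], neurons)
```

  In this toy case the second neuron is the closest one on the two observed components, so its value fills the gap while the observed values are left unchanged.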
  Specific comments:
  The title is a bit catchy and inaccurate. It is rather about pigments, which are typical for color groups, but which can be very different in type of phytoplankton and corresponding genes.
  Indeed, in this work, pigments were used, but not for training the method. We appreciate the referee's concerns regarding clarity, and we would like to address them.
  The method we introduce in this manuscript, SOM-psbO, is based on a dataset of phytoplankton groups quantified using the psbO molecular method and expressed in terms of Chla fraction, in combination with satellite variables. As described in the text, psbO is a single-copy gene that is present across all phytoplankton groups. This gene encodes proteins that structure a compartment of the Chloroplast photosystems.
  It is important to note that pigments were not used for training the SOM-psbO method. Instead, they were employed solely for comparative purposes. The outputs of SOM-psbO were compared to in-situ phytoplankton groups estimated using diagnostic pigment analysis (DPA), as described in studies such as Soppa et al. (2014). DPA methods have been widely used in remote sensing studies to estimate phytoplankton functional types (PFT) or size classes, and current operational methods such as Xi et al. (2020) and PHYSAT are based on them. However, it is crucial to acknowledge the high uncertainties associated with DPA methods, as highlighted by Chase et al. (2020). These uncertainties can lead to misleading interpretations of real PFT and phytoplankton size class relative abundance.
  Therefore, it is important to clarify that diagnostic pigment data were not utilized in the development of the SOM-psbO method. Their inclusion was solely for comparison purposes, to highlight the differences and uncertainties associated with the pigment-based approach.
  We hope this clarification addresses the concerns regarding the use of pigments in our study.
  The figures should all be revised, e.g. Fig. 5. Axis labels with units are often missing. Partly chlorophyll concentrations are given in log10, this is better in Fig. 2.
  All figures were revised according to the referee’s suggestion.
  Line 87: Only as a comment that size fractioning often damages the cells, and such data should therefore be treated with caution.
  We are aware of the drawbacks of size fractionation. The filters may retain cells smaller than the nominal pore size because of net clogging, or because the cells were trapped in fecal pellets. Conversely, long needle-like species and broken cells and colonies can pass through small mesh sizes. The patterns that we described in the current work based on size-fractionated samples can be complemented in the future by exploring non-fractionated samples. However, there is still no standardized sampling of non-fractionated samples that covers the main ocean regions as the size-fractionated samples from Tara Oceans do.
  For the discrimination of absorption features, rather the central visible region is necessary (e.g. Xi et al. 2015). In this respect, the use of the GlobColour data set with Rrs only up to 555 nm is unfavorable, as the correlation plots show. The OC-CCI dataset has more (MERIS) bands here and corresponding differences could be underlined. References to GlobColour and matchup procedure are missing.
  As the referee correctly pointed out, the SOM-psbO method was trained using a dataset that included 17 variables, including satellite reflectance at 412, 443, 490, and 555 nm. We apologize for not clearly indicating in the initial version of the manuscript that this method is specifically developed for open ocean applications.
  In the clear open ocean, beyond 555 nm, the information contained in the remote sensing reflectance (Rrs) bands is limited due to the strong absorption by water, as also mentioned in Xi et al. (2015). Our choice of the range and number of satellite reflectance bands was inspired by the PHYSAT method (Alvain, 2005; Ben Mustapha et al., 2013), a classification method that uses reflectance anomalies in the four selected bands to identify dominant phytoplankton functional types.
  To further support our argument, we rebuilt and cross-validated our methodology using different combinations of 11 bands ranging from 412 to 645 nm. However, we found that increasing the number of bands did not lead to a significant improvement in the SOM performance.
  It is important to note that one of the advantages of using machine learning methods such as SOM is to reduce the complexity of the problem while capturing non-linear relationships that are present in the environment. The correlations with the Rrs bands, which the referee mentioned as unfavorable in Figure 6, are indeed essential and statistically significant. It is crucial to consider that the problem we are addressing is multivariate. Preserving the inter-variable relationships, even those with lower correlations, is a major advantage of utilizing such a machine learning method.
  It is a Case-1 approach for a medium range of chlorophyll concentrations, which should be communicated in a better way. Maybe flagging and an uncertainty product would be useful.
  Indeed, as clarified in the previous comment, the method developed in this paper is specifically designed for open ocean (case 1) applications. This statement has been further clarified in the revised version of the manuscript.
  The approach proposed in this paper to estimate phytoplankton groups from satellite data is based on an unsupervised neural classification technique, the Self-Organizing Map (SOM). The SOM summarizes the non-linear relationship between the satellite data and phytoplankton groups, effectively reducing noise and mitigating the influence of uncertainties within the dataset.
  The function that links the predictors (satellite data) to the predicted variables (phytoplankton groups) is represented by an allocation function based on a weighted Euclidean distance. In other words, this function searches for and associates the closest neuron in the SOM to a new or unfamiliar observation.
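  As a minimal sketch of such an allocation function, assuming one non-negative weight per predictor variable: the weights, the toy neurons, and the function names below are hypothetical, and the manuscript's actual weighting scheme is not reproduced here.

```python
import math

def weighted_distance(x, w, weights):
    """Weighted Euclidean distance between an observation x and a referent vector w."""
    return math.sqrt(sum(k * (xi - wi) ** 2 for xi, wi, k in zip(x, w, weights)))

def allocate(x, neurons, weights):
    """Return the index of the neuron closest to x (the best-matching unit)."""
    return min(range(len(neurons)),
               key=lambda i: weighted_distance(x, neurons[i], weights))

# Toy referent vectors with two predictors each (illustrative values only).
neurons = [[0.0, 0.0], [1.0, 5.0], [4.0, 1.0]]
pixel = [1.0, 1.0]

equal = allocate(pixel, neurons, [1.0, 1.0])       # both predictors count equally
first_only = allocate(pixel, neurons, [1.0, 0.0])  # second predictor is ignored
```

  The two calls select different neurons for the same pixel, which shows how the choice of weights changes the assignment.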
  The main source of uncertainties in the estimation process lies in the allocation function. Among the hundreds of neurons in the SOM, one neuron is chosen as the assignment based on the minimum distance between the neurons and the pixel, regardless of whether that distance is small or large. Since one of the properties of the SOM is the preservation of topology (neighboring neurons are similar), a pixel can be associated with several adjacent neurons ranked by distance, which together represent a neighborhood of close neurons.
  Now, how do uncertainties in the satellite variables influence the allocation function and, consequently, the results?
  If the distance between a pixel and a neuron is small, the influence of uncertainties is minimal and will not significantly affect the assignment of the pixel. However, if a large distance is observed between the observation and the assigned neuron, uncertainties in the variables can have a greater impact on the choice but remain within the bounds of the chosen neuron's neighborhood.
  To consider all the uncertainties associated with the allocation function, we have chosen to associate each pixel with a weighted standard deviation based on the first 10 closest neurons. The weights correspond to the distances between the first 10 matching neurons and the pixel. This allows us to incorporate uncertainties into the assignment process and provide a measure of confidence for each pixel's assignment.
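  A minimal sketch of this per-pixel uncertainty follows. The text states that the weights correspond to the distances; the sketch assumes inverse-distance weights (closer neurons count more), which is one possible reading and not necessarily the authors' choice. It also uses a plain Euclidean distance for brevity. All names and values are hypothetical.

```python
import math

def weighted_std(values, weights):
    """Weighted standard deviation of values."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return math.sqrt(var)

def pixel_uncertainty(x, neurons, outputs, k=10, eps=1e-12):
    """Spread of the predicted variable over the k neurons closest to pixel x.

    neurons: referent vectors restricted to the predictor variables
    outputs: value of the predicted variable carried by each neuron
    Assumption: weights are inverse distances; eps avoids division by zero.
    """
    dists = [math.dist(x, w) for w in neurons]
    nearest = sorted(range(len(neurons)), key=lambda i: dists[i])[:k]
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    return weighted_std([outputs[i] for i in nearest], weights)

# Toy example: two equally distant neurons with outputs 0 and 1, one far away.
toy_neurons = [[0.0, 0.0], [2.0, 0.0], [100.0, 0.0]]
toy_outputs = [0.0, 1.0, 50.0]
spread = pixel_uncertainty([1.0, 0.0], toy_neurons, toy_outputs, k=2)
```

  When the closest neurons carry similar output values the spread is small, and it grows when the neighborhood of a pixel is heterogeneous.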
  By considering the weighted standard deviation, we account for the influence of uncertainties in the satellite variables and provide a more comprehensive understanding of the allocation process within the SOM.
  Figure S2: Global uncertainties regarding phytoplankton groups’ cell relative abundance, Chla contribution, and Chla fraction. In this context, the following uncertainties on the outputs represent the interval (defined with a standard deviation calculated on the neighboring associated neurons per satellite pixel) of SOM-psbO to estimate the different phytoplankton groups.
  However, in such open ocean conditions, HPLC methods are often at the limit (if low volumes of water are filtered) – extreme uncertainties may exist in the fundamental training data.
  The very deep sequencing of the Tara Oceans metagenomes (between ~10^8 and ~10^9 total reads per sample) allows high detection power (e.g., for rare species). In addition, filtered volumes were high: 100 L for the 0.22-3, 0.8-5 and 5-20 µm size fractions, 1-20 m³ for the 20-180 µm fraction, and 10-100 m³ for the 180-2000 µm fraction.
  Besides SST, salinity is actually a strong indicator for some PFTs.
  SSS is a strong indicator of some PFTs due to intervariable correlations, and its patterns are related to physical conditions, as are those of SST. However, satellite SSS products are not as accurate as SST products and have a lower resolution (25 km at best vs. 4 km). Adding satellite SSS products might corrupt the output of the operational phase.
  The method part is unclear, especially lines 163-212. Part of the problem could be that less common naming conventions are used, e.g. do you refer to the neural network architecture when you optimize the map size? What does the final map or architecture look like?
  We thank the referee for highlighting these communication issues. We introduced a clearer definition of the SOM size: we refer to the number of neurons n = p × q, where p and q are the dimensions of the SOM 2D neuron grid.
  Line 269: The more parameters we utilize, the more we must trust the data quality. Nevertheless, seen over the global ocean, there are many uncertainties in all mentioned parameters and regions. Especially Rrs in blue bands and the retrieved chlorophyll concentration must be considered as critical, even more because reflectances are derived from multi-mission merged data with sensor-specific atmospheric correction.
  The question raised highlights the importance of considering data quality when utilizing parameters. In the context of the global ocean, there are numerous uncertainties associated with the mentioned parameters and regions. As mentioned in the previous comment regarding uncertainties, the SOM process attenuates uncertainties and makes it possible to estimate uncertainties in the outputs. This has been implemented in the second version of our algorithm.
  Regarding the number of variables used, the various combinations analyzed in this work showed that three combinations provided reliable estimations: Chla and the four Rrs parameters, with and without SST, as well as a 9-parameter combination out of 10. It is worth noting that performance does not depend solely on the number of variables employed, as indicated in the response to comment #4, but more on the total amount of variance explained by each combination of variables.
  The marine model of ocean color algorithms for atmospheric correction and chlorophyll retrieval is mostly based on a diatom-like chlorophyll-specific absorption and scattering behavior (e.g. Bricaud et al., 1995). Thus, it is good that there is a relatively high correlation between diatoms and chlorophyll concentration. But what about features that are not captured, e.g. the specific optical properties of coccolithophores (e.g. Balch, 2018)? There is a high abundance, e.g. in the Great Calcite Belt, where Fig. 7 indicates high reliability of the model, with a C2 distribution in Fig. 10 that seems to be different. I see some question marks and would ask for a more careful discussion of the model uncertainty.
  We admit that in the first version of the algorithm, since we did not take into consideration the effect of size per group and per sample, the Chla fraction concentration per group was biased. The post-training classification (figure 3 below) into dominant phytoplankton communities was revised accordingly after incorporating the phytoplankton size information as described in Sommeria-Klein et al. (2021, Science):
  \[ \text{Abundance integrating body size}(G) = \frac{\sum_{f} r_{G,f}\, s_{f}}{\sum_{f} r_{G,f}} \]
  where the sums run over the size fractions f, r_{G,f} is the proportion of psbO reads belonging to group G in size fraction f, and s_f is the mid-range value of size fraction f.
  Therefore, when converting psbO reads to relative abundance, considering the size of the phytoplankton cell for each group, we highlight the contribution of each group's size to the total chlorophyll-a (Chla) concentration.
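  The size-integration formula above can be sketched as follows. The sketch assumes that the "mid-range value" of a size fraction is the arithmetic midpoint of its bounds; the four size fractions listed, the example read proportions, and the function name are illustrative assumptions rather than values from the manuscript.

```python
def size_integrated_value(read_proportions, mid_sizes):
    """Read-proportion-weighted mean of the size-fraction mid-range values.

    read_proportions: proportion of psbO reads of one group in each size fraction
    mid_sizes: mid-range value of each size fraction (same order)
    Returns None when the group is absent from every size fraction.
    """
    denominator = sum(read_proportions)
    if denominator == 0:
        return None
    numerator = sum(p * s for p, s in zip(read_proportions, mid_sizes))
    return numerator / denominator

# Assumed size fractions in micrometres: 0.8-5, 5-20, 20-180, 180-2000.
bounds = [(0.8, 5.0), (5.0, 20.0), (20.0, 180.0), (180.0, 2000.0)]
mid_sizes = [(low + high) / 2 for low, high in bounds]

# Hypothetical group found in equal proportions in the two smallest fractions.
value = size_integrated_value([0.5, 0.5, 0.0, 0.0], mid_sizes)
```

  A group concentrated in the larger fractions receives a larger value, which is how the size information enters the conversion.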
  Compared to the previous version, and due to the data conversion, five clusters turned out to be sufficient to describe the dominant patterns. In the Southern Ocean, the C3 group emerges and dominates, while there is also a higher relative abundance of Haptophytes (including coccolithophores) and Diatoms. In the Arctic Ocean, the C4 group dominates. Although the phytoplankton communities of both C3 and C4 clusters were relatively similar, the optical signal was significantly different, allowing us to distinguish between the two clusters.
  Figure S3: Satellite-derived biomes of phytoplanktonic communities, obtained by unsupervised clustering (Hierarchical clustering) on the SOM’s referent vectors. The normalized (by the variance of the initial database) and original Rrs spectrum were also derived to characterize each cluster’s optical signature. The global map shows the most frequent community structure recorded during the 1997-2021 period.
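  The clustering of referent vectors into a small number of communities can be illustrated with a naive agglomerative procedure. The text only states that hierarchical clustering was used; the average-linkage criterion, the toy vectors, and the function name below are assumptions made for illustration.

```python
import math

def agglomerate(vectors, n_clusters):
    """Naive average-linkage agglomerative clustering.

    Returns a list of clusters, each a list of indices into vectors.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between the members of a and b.
        total = sum(math.dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the two closest clusters and merge them.
        candidates = [(linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        _, i, j = min(candidates)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy referent vectors forming three well-separated groups.
referents = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
groups = agglomerate(referents, n_clusters=3)
```

  Each resulting group of neurons can then be mapped back to the satellite pixels assigned to those neurons.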
  It is unclear how the new method behaves compared to the mentioned operational model by Xi et al. (2020). What are the advantages of the presented method?
  The method of Xi et al. (2020), like the SOM-Pigments method, is based on the DPA pigment approach to identify phytoplankton groups (4 functional types).
  The method described in this paper is developed with a harmonized database on the phytoplankton taxonomic community structure based on the psbO gene quantification. Molecular methods like this have a deep taxonomic resolution (including for cryptic species) as well as high detection power (e.g., for rare species). In addition, this particular gene is present in all phytoplankton groups, eukaryotes, and prokaryotes alike, with a single copy per cell.
  Quantifying it using satellite data provides an unbiased picture of phytoplankton cell relative abundances.
  
  Citation: https://doi.org/10.5194/egusphere-2022-1421-AC1
RC2:
'Comment on egusphere-2022-1421', Anonymous Referee #2, 19 Mar 2023

The authors develop a machine learning approach to link ocean colour data and in situ omics to improve detection of phytoplankton functional types and groups from space. The topic they are dealing with is innovative. However, the methodology and algorithm development steps are hard to follow and need to be revised to make the workflow clearer to the reader. In this scope, a flowchart is essential.
I am not fully convinced by the validation approach of the method. The training is done using the whole omics database and cross-validation statistics show the good prediction capabilities of the model. Then, the validation is made with an external database built on HPLC-based information. From my point of view, this cannot be considered a proper validation because one quantity is based on HPLC data, the estimated one on omics data. Such a comparison thus implies that the two approaches bring the same level of information on phytoplankton taxonomy. In this case, there would be no need to develop a new approach based on omics. However, as discussed at the end of the paper, HPLC- and omics-based phytoplankton information have some degree of correlation, which is good because this means that OMICS information can be found in optical properties to some extent and OMICS based approaches are welcome because they will bring new and complementary information on phytoplankton from space.
I realize that the OMICS database used to develop the new ML approach is small, but probably the authors might think to train the model over 70% of the database and validate it with the remaining 30%.
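A hold-out split of the kind suggested here could look like the following minimal sketch. The function name, the fixed seed, and the stand-in sample list are hypothetical; with a small database the resulting statistics would depend on the particular random split.

```python
import random

def holdout_split(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into training and validation subsets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = round(train_fraction * len(samples))
    train = [samples[i] for i in indices[:cut]]
    validation = [samples[i] for i in indices[cut:]]
    return train, validation

# Ten stand-in samples: seven go to training, three to validation.
train, validation = holdout_split(list(range(10)))
```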
The results need to be discussed more, and the text about the retrieved global distribution of phytoplankton and biomes needs to be thoroughly checked and revised.
The work thus needs to be deeply revised to improve the methodology and make the validation stronger as well as the text more readable.
Specific comments:
Figure 1 is misleading as the same color palette has been used for both columns though the % axes are different from left to right. A quick reader could interpret the yellow dots of (e.g.) Cryptophytes as being as abundant as Green Algae or Diatoms.
Line 91: this statement means that we have phytoplankton also in the 180-2000 µm size class, which is possible in the case of diatom chains. Could you provide a frequency distribution of phytoplankton groups within each size class? This would help the reader to have a wider image of the type of phytoplankton in the database (and especially for those chain-forming species and classes spanning a wide size range).
Line 115: why normalize omics data on Chl? Because Chl varies according to the physiological status of phytoplankton, a photoacclimation component is re-introduced (which is a major problem in the DPA analysis). Why not use the OMICS-based % of the whole population?
Lines 121-123: it is not clear which data are interpolated. In situ or satellites?
Table 2 contains mistakes in the coefficients. From Uitz et al. (2006), the coefficient for Chl-b is 1.01 while 0.35 is for 19-BF. In addition, 19-BF is here only attributed to the pelagophytes while it is also a pigment within haptophytes (except coccolithophores). So, with the current coefficients, all haptophytes only contain 19-Hex.
Line 155: which cross-validation procedure? Do these statistics refer to all pigments or is it a global indicator for the technique?
Line 163: please indicate and explain better which are the “several machine learning algorithms” you tested and why a SOM has been chosen. This will be very helpful for scientists approaching the same problem.
Section 3.1 needs to be rewritten and a flowchart added. It is strange to see 3.1.1 and 3.1.2 as two different sections when (if I have understood correctly) the work is done simultaneously. Figure 5: y- and x-axes should be the same, and the caption should name the solid and dashed lines.
Line 191: which several experiments? How many? Please explain better.
Section 3.1.3 needs to be clearer
Line 269: what is the impact of interpolation on bbp and Kd? (i.e., Interpolation declared in the methods)
Line 275: from Table 3, pelagophytes instead of cryptophytes
Line 314: generally speaking, are you referring to the surface-to-volume ratio?
Line 329 and Line 331: please check and discuss: C4, C5 and C6 are dominated by Prokaryotes, but these areas are generally known to be dominated by large phytoplankton. Same for C1, dominated by diatoms but in the subtropics. In addition, it would be nice to see these clusters plotted on a map in Figure 10.
Figure 10: How have the spectra been normalized? By the minimum? The spectral shape should be discussed.

Citation: https://doi.org/10.5194/egusphere-2022-1421-RC2
- AC2:
  'Reply on RC2', Roy El Hourany, 10 Jun 2023
  General answer
  Dear Editors and Referees,
  We would like to express our gratitude for the valuable comments and suggestions provided for improving our manuscript. We acknowledge the referee's observations regarding communication ambiguities and technical issues in the initial version, and we have prepared this revised manuscript to address these concerns and clarify the highlighted aspects.
  This response aims to address the common major issues raised by both referees. We acknowledge that the development steps of the omic-based satellite algorithm in the paper were unclear, and the inclusion of a pigment-based approach for validation was misleading.
  To clarify, our method is based on the link between omic and satellite data. The pigment approach only played a role in the post-training process to compare the outputs of both approaches. We incorporated the pigment HPLC data in this study due to its widespread use in ocean color remote sensing techniques for estimating phytoplankton groups, primarily because of its high data availability. However, it is important to note that numerous studies have demonstrated significant uncertainties between the pigment approach and phytoplankton abundance observed through other methods (Chase et al., 2020). These uncertainties arise from factors such as the overlapping presence of pigments across phytoplankton classes, photoacclimation, and physiological processes. Therefore, it is crucial to recognize that our study addresses two types and levels of information: Omics and Pigments. The use of pigments in this work is more for comparison purposes rather than validation, and we acknowledge that our previous message regarding this matter was misleading.
  Additionally, the referees found it unconvincing to introduce physiological uncertainties when transforming omics data into Chla fraction per phytoplankton class. We introduced this aspect to compare the omic and pigment-based approaches.
  Based on the comments from both referees, we have chosen to thoroughly revise the methodology section. We have added flowcharts to simplify the process and enhance its applicability for readers. The entire methodology has been revised in light of the suggestions provided by the referees. To address the concerns regarding chlorophyll-a fractionation and enable the emergence of different levels of information as outputs, we trained two algorithms using the same satellite data and SOM methodology. One algorithm provides the relative cell abundance of phytoplankton (including the estimation of direct psbO relative abundance values), while the other algorithm estimates the phytoplankton Chla fraction per group. Importantly, this revised methodology now considers the psbO occurrence per size fraction, which was not taken into account in the initial version of the manuscript. In both algorithms, uncertainties on the outputs were evaluated and therefore are presented with the outputs.
  The outputs from both algorithms will allow us to address questions regarding phytoplankton diversity from an ecological perspective (through relative cell abundance) and a biogeochemical perspective (through Chla fraction per group), while considering physiological uncertainties.
  We sincerely hope that this response is convincing and meets your expectations. We appreciate the thorough review process and are confident that the revisions have significantly improved the manuscript.
  (Supplementary figures attached)
  Figure S1: Different levels of information on phytoplankton groups. Note that cell relative abundance (psbO relative abundance) and Chla contribution and fraction per group are two outputs of two different algorithms based on the same SOM-psbO methodology.
  Comments Referee #2
  The authors develop a machine learning approach to link ocean colour data and in situ omics to improve detection of phytoplankton functional types and groups from space. The topic they are dealing with is innovative. However, the methodology and algorithm development steps are hard to follow and need to be revised to make the workflow clearer to the reader. In this scope, a flowchart is essential.
  I am not fully convinced by the validation approach of the method. The training is done using the whole omics database and cross-validation statistics show the good prediction capabilities of the model. Then, the validation is made with an external database built on HPLC-based information. From my point of view, this cannot be considered a proper validation because one quantity is based on HPLC data, the estimated one on omics data. Such a comparison thus implies that the two approaches bring the same level of information on phytoplankton taxonomy. In this case, there would be no need to develop a new approach based on omics. However, as discussed at the end of the paper, HPLC- and omics-based phytoplankton information have some degree of correlation, which is good because this means that OMICS information can be found in optical properties to some extent and OMICS based approaches are welcome because they will bring new and complementary information on phytoplankton from space.
  I realize that the OMICS database used to develop the new ML approach is small, but probably the authors might think to train the model over 70% of the database and validate it with the remaining 30%.
  Results need to be discussed more and the text about retrieved global distribution of phytoplankton and biomes needs to be profoundly checked and revised.
  The work thus needs to be deeply revised to improve the methodology and make the validation stronger as well as the text more readable.
  We would like to express our gratitude for the valuable review. We acknowledge the referee's observations regarding technical issues in the initial version, and we have prepared this revised version while applying the referee’s suggestions.
  We would like to admit that the reasoning behind validating with a pigment-based approach was misleading. For that, we chose to follow the referee’s major comment and evaluate the algorithm using a two-step procedure:
  We split the Tara Oceans psbO dataset into 80% to train the SOM, and 20% as a test set.
  1- During the SOM training on 80% of the dataset, different combinations of satellite variables were tested to determine the best set of variables for estimating the 7 phytoplankton groups in terms of relative cell abundance and Chla fraction.
  o For each combination of variables, we increased the number of neurons from 10 to 1000 to determine the optimal size of the SOM.
  o For each number of neurons, the quantization and topographic errors of the SOM were calculated and a leave-one-out cross-validation procedure was performed to assign performance metrics (R2 and RMSE), which help choose the best SOM size and satellite variable combination.
  o The best SOM configuration and variable combination are those with the lowest errors and highest R2 values.
  2- The chosen SOM is tested using the 20% test set, providing an independent set of performance metrics.
  As a result, we present in the paper the performance metrics of the best SOM configuration based on the cross-validation procedure and the test set.
  The comparison with the HPLC DPA approach will be introduced for comparative purposes between satellite products only.
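  The two-step evaluation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, R2 is taken here as the coefficient of determination, and a nearest-neighbour predictor stands in for the trained SOM, which is not reproduced here.

```python
import math
import random

def split_train_test(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def rmse(observed, predicted):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

def r_squared(observed, predicted):
    """Coefficient of determination (assumed definition of R2)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def leave_one_out(xs, ys, predict):
    """Predict each sample from a model that sees every other sample."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predictions.append(predict(train_x, train_y, xs[i]))
    return predictions

def nearest_neighbour(train_x, train_y, x):
    """Stand-in predictor (NOT a SOM): target of the closest training sample."""
    best = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[best]
```

  In this sketch, the cross-validation metrics would be computed on the 80% training subset for every candidate configuration, and the held-out 20% would be scored only once, with the configuration finally chosen.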
  Specific comments:
  Figure 1 is misleading as the same color palette has been used for both columns though the % axis are different from left to right. A quick reader could interpret the yellow dots of (e.g.) Cryptophytes as abundant as Green Algae or Diatoms.
  Indeed, we adjusted the colorbars according to the referee’s comment. For that, a different color palette was used.
  Line 91: this statement means that we have phytoplankton also in the 180-2000 um size class, which is possible in case of diatoms chains. Could you provide a distribution of frequency of phytoplankton groups within each size class? This would help the reader to have a wider image of the type of phytoplankton in the database (and especially for those chain-forming species and classes spanning a wide size range).
  The distribution of taxonomic groups between size fractions in the psbO dataset is displayed in Fig 2a-b, Fig 7a, Fig 8 and Figure S16 in Pierella Karlusich et al 2023 Mol Ecol Res.
  We provide boxplots to illustrate the distribution of the phytoplankton groups per size filter.
  The filters may retain cells smaller than the nominal pore because of net clogging, or because they were trapped in fecal pellets. On the contrary, long needle-like species and broken cells and colonies can pass through small mesh sizes. The patterns that we described in the current work based on size-fractionated samples can be complemented in the future by exploring non-fractionated samples.
  Line 115: why normalizing omics data on Chl? Because Chl varies according to the physiological status of phytoplankton, a photoacclimation component is re-introduced (which is a major problem in the DPA analysis). Why not using OMICS-based % of the whole population?
  Indeed, this type of normalization introduces physiological uncertainties in the data. However, it was judged important to achieve a quantity comparable to what we can observe using the DPA pigment approach and current operational satellite products.
  But, as mentioned in the general answer:
  **To address the concerns regarding chlorophyll-a fractionation and enable the emergence of different levels of information as outputs, we trained two algorithms using the same satellite data and SOM methodology. One algorithm provides the relative cell abundance of phytoplankton (including the estimation of direct psbO relative abundance values), while the other algorithm estimates the phytoplankton Chla fraction per group. Importantly, this revised methodology now considers the psbO occurrence per size fraction, which was not taken into account in the initial version of the manuscript. In both algorithms, uncertainties on the outputs were evaluated and therefore are presented with the outputs.
  The outputs from both algorithms will allow us to address questions regarding phytoplankton diversity from an ecological perspective (through relative cell abundance) and a biogeochemical perspective (through Chla fraction per group), while considering physiological uncertainties.**
  Lines 121-123: it is not clear which data are interpolated. In situ or satellites?
  Since both data sources contain missing values, in situ and satellite data are both interpolated automatically within the initial data space during the training phase.
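  One common way for a SOM to handle incomplete samples is to measure distances on the observed components only and to read the missing components from the best-matching neuron. The sketch below illustrates that general idea; it is an assumption about the mechanism, not a description of the authors' implementation, and the function names are hypothetical.

```python
def best_matching_unit(sample, neurons):
    """Index of the neuron closest to `sample`, ignoring missing (None) components."""
    def partial_distance(neuron):
        # Squared Euclidean distance restricted to the observed components.
        return sum((x - w) ** 2 for x, w in zip(sample, neuron) if x is not None)
    return min(range(len(neurons)), key=lambda k: partial_distance(neurons[k]))

def fill_missing(sample, neurons):
    """Replace missing components with those of the best-matching neuron."""
    referent = neurons[best_matching_unit(sample, neurons)]
    return [w if x is None else x for x, w in zip(sample, referent)]
```

  For instance, `fill_missing([0.9, None, 1.2], [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])` keeps the two observed values and takes the second component from the closer neuron.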
  Table 2 contains mistakes on the coefficients. From Uitz et al. (2006), the coefficient for Chl-b is 1.01 while 0.35 is for 19-BF.
  Fixed, we apologize for this mistake.
  In addition, 19-BF is here only attributed to the pelagophytes while it is also a pigment within haptophytes (except coccolithophores). So, from the current coefficients all haptophytes only contain 19-Hex.
  Indeed, in the study by Chase et al. (2020), it was demonstrated that the presence of pigments overlaps within size classes and types of phytoplankton. Several ocean color studies, such as those by Hirata et al. (2011) and Xi et al. (2020), have attributed the 19Hf pigment to nanophytoplankton in general, and to Haptophytes. Taking into account the reviewer's comments and the uncertainties associated with pigments, we decided to list the major phytoplankton groups and indicate the most representative pigment for each group.
  Line 155: which cross-validation procedure? Do these statistics refer to all pigments or is it a global indicator for the technique?
  We acknowledge that the sentence citing the statistics was unclear. The reported statistics, a regression coefficient of 0.75 and an average RMSE of 0.016 mg.m^-3, represent a global indicator for the technique. They reflect the mean error and regression coefficient across the 10 estimated pigments. The cross-validation was conducted using a leave-one-out random pick from the initial dataset of 12 000 HPLC observations. This has been further clarified in the new version of the manuscript.
  Line 163: Please indicate and explain better which are the “several machine learning algorithms” you tested and why a SOM has been chosen. This will be very helpful for scientists approaching the same problem.
  We would like to clarify the sentence introducing the machine learning algorithms used in our study: SOM, hierarchical ascending clustering (HAC), and Random Forest.
  These algorithms were used to develop an operational algorithm that estimates the abundance of phytoplankton groups from satellite information. First, the SOM algorithm was used to train a model based on the psbO pigment dataset. This allowed us to identify global large-scale patterns and characterize phytoplankton biomes. Finally, to explain the potential divergence between the DPA approach and psbO measurements, we employed a Random Forest approach. This analysis highlighted the cumulative importance of pigment composition in estimating the abundance of phytoplankton groups.
  We also tested approaches based on Feed-Forward Neural Networks. However, due to the limited number of observations in the dataset, these approaches were not very conclusive. The choice of SOM was based on the previous work by El Hourany et al. (2019), which demonstrated improved performance with an increasing number of neurons, with the number of neurons reaching almost twice the number of observations in the initial dataset, which accounts for missing values.
  In the following section, each methodology and algorithm are explained in detail.
  Section 3.1 needs to be rewritten and a flowchart added. It is strange to see 3.1.1 and 3.1.2 as two different sections when (if I have understood correctly) the work is done simultaneously.
  Figure 5: y- and x-axes should be the same; please indicate the names of the solid and dashed lines in the caption.
  We apologize for the misleading sectioning. To clarify the methodology, a flowchart was added and both above-mentioned sections were merged, as the reviewer suggested. Indeed, Sections 3.1.1 and 3.1.2 are carried out simultaneously and iteratively, as shown in the new flowchart 2.
  (Flowcharts added in supplementary)
  Flowchart 1: General scheme of the SOM-psbO methodology to estimate phytoplankton groups from satellite data.
  Flowchart 2: A focus on the training phase of the SOM-psbO which is based on an iterative procedure between different satellite variable combinations and SOM grid size. The best satellite variables combination and SOM size were based on the lowest errors and highest R2.
  Line 191: which several experiments? How many? Please explain better.
  The SOM grid size was sampled from 10 to 1000 neurons with a step of 10. Therefore, 100 SOM grids were tested for each variable combination.
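  The size of this search can be made concrete with a short enumeration. The grid sizes follow the statement above; the list of variable families and the choice of testing every non-empty subset are assumptions made only for illustration, as is the tie-breaking rule in `select_best`.

```python
from itertools import combinations

# SOM grid sizes stated in the reply: 10 to 1000 neurons in steps of 10.
grid_sizes = list(range(10, 1001, 10))

# Hypothetical variable families; the exact candidate sets are an assumption.
satellite_variables = ["Rrs", "bbp", "Kd490", "SST", "Chla"]
variable_combinations = [
    combo
    for size in range(1, len(satellite_variables) + 1)
    for combo in combinations(satellite_variables, size)
]

def select_best(results):
    """Pick the run with the lowest RMSE, breaking ties by the highest R2."""
    return min(results, key=lambda run: (run["rmse"], -run["r2"]))
```

  Each entry of `results` is assumed to be a dictionary holding the metrics of one (variable combination, grid size) run, so the full search under these assumptions would contain `len(variable_combinations) * len(grid_sizes)` runs.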
  Section 3.1.3 needs to be clearer.
  Line 269: what is the impact of interpolation on bbp and Kd? (i.e., Interpolation declared in the methods)
  Below is a comparison of the SOM-psbO values and the initial dataset's values for each variable, including bbp and Kd. For a SOM grid size of 242 neurons, the SOM was able to capture the distribution of values for both parameters (which have missing values in the initial dataset).
  Figure S2: Distribution of values for each variable in the initial dataset D and SOM neurons.
  Line 275: from Table 3, pelagophytes instead of cryptophytes
  Indeed, we apologize for this error.
  Line 314: generally speaking, are you referring to the surface-to-volume ratio?
  We have corrected the term: 'biovolume-to-size' was replaced by 'surface-to-size'
  Line 329 and Line 331: please check and discuss: C4, C5 and C6 are dominated by Prokaryotes, but these areas are generally known to be dominated by large phytoplankton. Same for C1, dominated by diatoms but in the subtropics. In addition, it would be nice to see these clusters plotted on a map in Figure 10.
  We admit that within the first version of the algorithm, since we didn't take into consideration the effect of size per group and per sample, the Chla fraction concentration per group was biased.
  The post-training classification into dominant phytoplankton communities was revised accordingly after incorporating the phytoplankton size information as described in Sommeria-Klein et al. (2021, Science):
  Abundance integrating body size = Sum_over_the_size_fractions_of (proportion_of_psbO reads_belonging_to_group_G x mid-range_value_of_size_fraction) / Sum_over_the_size_fractions_of (proportion_of_psbO reads_belonging_to_group_G ).
  Therefore, upon converting psbO reads to relative abundance accounting for the size of the phytoplankton cell per group, we highlight the size contribution of each group to the total Chla.
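  The size-integrated abundance defined above is a weighted mean of the mid-range sizes of the size fractions, with the group's read proportions as weights. A minimal Python sketch follows; the function name, the argument layout, and the handling of a group with no reads are assumptions made for illustration.

```python
def size_integrated_abundance(read_proportions, mid_range_sizes):
    """Weighted mean size of a group across size fractions.

    read_proportions: proportion of psbO reads belonging to the group,
        one value per size fraction.
    mid_range_sizes: mid-range value of each size fraction, in the same order.
    Returns None when the group has no reads in any fraction (assumed choice).
    """
    if len(read_proportions) != len(mid_range_sizes):
        raise ValueError("one proportion is required per size fraction")
    denominator = sum(read_proportions)
    if denominator == 0:
        return None
    numerator = sum(p * s for p, s in zip(read_proportions, mid_range_sizes))
    return numerator / denominator
```

  With equal read proportions in two fractions whose mid-range values are 2 and 10, the result is 6, halfway between the two sizes.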
  Compared to the previous version, and due to the data conversion, five clusters turned out to be sufficient to describe the dominant patterns (Figure S3).
  Figure S3: Satellite-derived biomes of phytoplanktonic communities, obtained by unsupervised clustering (Hierarchical clustering) on the SOM’s referent vectors. The normalized (by the variance of the initial database) and original Rrs spectrum were also derived to characterize each cluster’s optical signature. The global map shows the most frequent community structure recorded during the 1997-2021 period.
  Figure 10: How have the spectra been normalized? By the minimum? The spectral shape should be discussed.
  Each wavelength was normalized by the variance of its value distribution within the dataset. We provide a discussion of both the phytoplankton distribution and the spectral signal in the revised manuscript.
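  The per-wavelength normalization can be illustrated as follows. This is a sketch under assumptions: spectra are given as lists of Rrs values with one entry per band, the population variance is used (the reply does not say whether population or sample variance was taken), and every band is assumed to have non-zero variance.

```python
from statistics import pvariance

def normalize_by_band_variance(spectra):
    """Divide each band of each spectrum by that band's variance across the dataset."""
    n_bands = len(spectra[0])
    # Variance of each wavelength band, computed over all spectra.
    variances = [pvariance([s[b] for s in spectra]) for b in range(n_bands)]
    return [[s[b] / variances[b] for b in range(n_bands)] for s in spectra]
```

  Because each band is scaled by its own variance, bands with a wide spread of values in the dataset are damped relative to bands with a narrow spread.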
  
  Citation: https://doi.org/10.5194/egusphere-2022-1421-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Roy El Hourany on behalf of the Authors (11 Jul 2023) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (27 Jul 2023) by Jochen Wollschlaeger

RR by Anonymous Referee #3 (17 Sep 2023)

Suggestions for revision or reasons for rejection

Hourany et al. have been developing a machine learning based algorithm trained on several remotely sensed products (RRS, bbp, Kd490, SST, CHL) combined with omics-based biomarker developed from the RV Tara Ocean data set to obtain cell abundance and fraction to total Chla of seven major marine phytoplankton groups. They have evaluated their algorithm with cross-comparison, independent validation and intercomparison to similar satellite products. While I think overall the method development seems to be robust and documented, the manuscript lacks especially
a) correctly referencing other work done in the field of phytoplankton measurements, analysis and especially PFT algorithm development,
b) several details in the two chapters “Materials” and “Methods”, and
c) discussion on their algorithm performance regarding pixel uncertainty, cross-validation, independent validation and intercomparison results.
Below I detail further these shortcomings.
Because of this, I think the manuscript requires substantial revision in these aspects before it can be accepted, while most of the other parts can largely remain.

Detailed comments:
1. It would also be good to have a list of abbreviations in the supplement; so many abbreviations and parameters are used in the manuscript that it becomes confusing.
2. Introduction: at several sentences the references provided are not clear or correct or do not merit former work executed in the field:
a) Line 34-35: that is a very sloppy statement “… a range of ecological and biogeochemical problems” what is meant with problems?
b) Line 39 ff.: it is not clear if the methods developed to detect “… abundance of PFT and SC” are also meant to be based on optical characteristics – since this is clearly stated for the “specific taxa”, this should also be clarified here and the references provided should then match the specific method principle. I recommend citing overview papers here (see IOCCG 2014, Mouw et al. 2017, Bracher et al. 2017) or at least adding “e.g.” since the citations provided are far from complete. In addition, Alvain et al. 2005 and Ben Mustapha et al. 2013 retrieve dominant groups and no abundances, and the Chase et al. 2020 method does not retrieve PSC from satellite ocean color data; it assessed the diagnostic pigment method based on in-situ data for phytoplankton size classes.
Citations:
Bracher A., Bouman H. A., Bricaud A., Brewin R.W.J., Brotas V., Ciotti A. M., Clementson L., Devred E., Di Cicco A. M., Dutkiewicz S., Hardman-Mountford N. J., Hickman A. E., Hieronymi M., Hirata T., Losa S. N., Mouw C. B., Organelli E., Raitsos D. E., Uitz J., Vogt M., Wolanin A. (2017a) Obtaining Phytoplankton Diversity from Ocean Color: A Scientific Roadmap for Future Development. Frontiers in Marine Science 4: 00055, doi: 10.3389/fmars.2017.00055
IOCCG (2014). Phytoplankton functional types from space. S. Sathyendranath and V. Stuart (eds), Reports of the International Ocean Color Coordinating Group, No. 15, IOCCG, Dartmouth, Canada.
Mouw, C.B., Barnett, A., McKinley, G.A., Gloege, L., Pilcher, D. (2016), Phytoplankton size impact on export flux in the global ocean, Global Biogeochem. Cycles, 30, 1542–1562, doi:10.1002/2015GB005355
c) Line 48 ff. : should also merit Brewin et al. 2010. A three-component model of phytoplankton size class for the Atlantic Ocean. Ecological Modelling, 221(11), pp.1472-1483. – I would put “e.g.” since this list is far from complete!
d) Line 52 should reference Brewin et al. 2015, not 2014!
e) Line 62 (also Methods chapter 2.3.1): You say you downloaded the Xi et al. product from the Copernicus website – if it was after July 2021, it is most probably the product based on Xi et al. 2021, which includes SST as a variable to constrain the algorithm.
Citations:
Xi H., Losa S.N., Mangin A., Garnesson P., Bretagnon M., Demaria J., Soppa M.A., Hembise Fanton d'Andon, O., Bracher A., 2021. Global chlorophyll a concentrations of phytoplankton functional types with detailed uncertainty assessment using multi-sensor ocean color and sea surface temperature satellite products. Journal of Geophysical Research-Oceans, doi: 10.1029/2020JC017127

3. Material & Method sections:
a) a flow chart (Fig. 4) is provided for the SOM DRCA & DChlF data sets – however, everything else connected to methods applied in study is lacking. Since you did many different other parts (DPA three coefficients averaging for HPLC data global and Tara, uncertainty assessment, satellite product intercomparison, cross validation, etc.) – it would be good to have an overview.
b) Chapter 2.1.1 – line 75 ff.: It is not clear why stations are discarded when not all 5 size fractions were contained in a station sample – for me it does not make sense from an ecological standpoint. In addition, you do not mention how many stations were then excluded. Also add the exact values of the weights taken for each size fraction to obtain their chl-a fraction. Why do these weight values make sense for the conversion? In Line 82 it is not clear what 5% means here – relative to the total abundance in each size class or for each size class?
c) Chapter 2.1.2, line 205: Add more information by providing exactly the 11 bands used from 412 to 670 nm from the RRS data set.
d) Chapter 2.2: Overall, I wonder why not much more HPLC data have been used for your algorithm validation. E.g., you cite Xi et al. 2020 – then you should be aware of the much bigger pigment data set used in this work (taking advantage of the compilation in Losa et al. 2017). Further check also identification on the error in LTER Palmer HPLC data in Xi et al. (2021) – it may also affect already your compiled data set. Finally before your paper becomes accepted, the compiled HPLC data set with the diagnostic pigments, total chl, and retrieved PFT chl-a conc. should be made available to the readers (e.g., by storage in a public repository).
Citation:
Losa S., Soppa M. A., Dinter T., Wolanin A., Brewin R. J. W., Bricaud A., Oelker J., Peeken I., Gentili B., Rozanov. V. V., Bracher A., Synergistic exploitation of hyper- and multispectral precursor Sentinel measurements to determine Phytoplankton Functional Types at best spatial and temporal resolution (SynSenPFT). Frontiers in Marine Science 4: 203; doi: 10.3389/fmars.2017.00203
e) Chapter 2.2: Why did you choose to apply the DPA method using the 3 sets of coefficients proposed by Uitz, Brewin and Soppa, and then take the average fraction from these calculations? You should at least discuss somewhere why you followed this method instead of just using the coefficients proposed by one of the authors (I would rather recommend the newest citation – actually newer ones have been published since then).
f) Chapter 2.3.1: please check whether the basis of the CMEMS global PFT product is really Xi et al. 2020 (see comment 2e) – also add the version number of the product in the description. In any case, the product is not provided from 1997, but only from 2002 onward. In any case, your description that this algorithm uses 15 bands is not correct at all. Please carefully check and provide a correct description.
g) Chapter 2.3.2: it is unclear if also the PFT-chla derived from SOM predicted pigments using Hourany et al. 2019a have been produced by using the average value from applying in the DPA the 3 sets of coefficients proposed by Uitz, Brewin, Soppa. Please clarify.
h) Chapter 3.1 – line 162: it seems that, except for matching the data based on a 3x3 pixel box +/-1 day, no further criteria to select “valid” matchups have been used. Protocols recommend that at least 50% of the pixels are valid (unflagged) and the coefficient of variation is within 20% (e.g., see EUMETSAT protocol: https://www.eumetsat.int/media/44087 ). Can you provide more details or comment on why no further quality control was applied?
i) Chapter 3.2.2 – line 227 ff: Since you noticed that using 670nm in the algorithm did not improve it, why did you keep it? Further, in Line 230 the reference of Xi et al. (2015) is not suited since the paper is focusing on simulated data sets across many (all) water types – probably much better to cite here Torrecilla et al. (2011) or Taylor et al. (2011) where the HCA method (or Alvain et al. 2005 with Physat) has been applied to RRS data from the open ocean in order to derive information on phytoplankton community structure.
Citations: Taylor B.B., Torrecilla E., Bernhardt A., Taylor M. H., Peeken I., Röttgers R., Piera J., Bracher A. (2011) Bio-optical provinces in the eastern Atlantic Ocean. Biogeosciences 8: 3609-3629. doi:10.5194/bg-8-3609-2011
Torrecilla, E., Stramski, D., Reynolds, R. A., Millan-Nunez, E., and Piera, J.: Cluster analysis of hyperspectral optical data for discriminating phytoplankton pigment assemblages in the open ocean, Remote Sens. Environ., 115, 2578–2593, doi:10.1016/j.rse.2011.05.014, 2011b.
j) Chapter 3.2.4: I miss a discussion about the input data uncertainty influencing the uncertainty of the retrieved PFT products (should be put then in chapter 4).
k) Chapter 3.4: The cross-validation results should also provide the mean or median relative deviation (MRD) in order to be comparable to other approaches (e.g., Xi et al. 2020, 2021, Lange et al. 2020) – it would be good to have more statistical measures here.
4. Section Results and Discussion
a) Figure 7 caption: provide n (number of observations) for both data sets, the cross-val set and the test set. As stated above also show (and discuss) results for RMSD and MRD since R^2 is not a very robust measure of accuracy of a product. For the PG-Chla comparisons it should be clearly stated in chapter 3 that R^2 results from calculations based on log-transformed data, while MRD and RMSD are based on non log-transformed data.
b) Line 320 ff.: I think it is difficult to understand what is presented in Figure 8 and discussed here, and no values specific to each group, separately for Chla fraction and abundance, are provided. Your pixel-by-pixel uncertainty assessment, in terms of values and what is actually considered, should be compared to the results of other PFT/PSC algorithms (e.g. see Brewin et al. 2017, Xi et al. 2021, Lange et al. 2021) - probably in chapter 4.3.
Citation: Brewin, R.J.W., Tilstone, G.H., Jackson, T., Cain, T., Miller, P.I., Lange, P.K., Misra, A. and Airs, R.L., 2017. Modelling size-fractionated primary production in the Atlantic Ocean from remote sensing. Progress in Oceanography, 158, pp.130-149.
Lange P. K., Werdell P. J., Craig S., Erickson Z. K., Dall’Olmo G., Brewin R., Zubkov M., Tarran G., Bouman H. A., Bracher A., Poulton N., Lomas M., Slade W., Cetinić I. (2020) Radiometric approach for the detection of picophytoplankton assemblages across oceanic fronts. Optics Express 28 (18): 25682, doi: 10.1364/oe.398127
c) In addition, in chapter 4.1 and 4.2 a discussion of your two gene-SOM algorithms' performance with respect to cross-validation (e.g. as done in Brewin et al. 2015, Xi et al. 2020, 2021) and independent validation relative to other PFT/PSC algorithms presented in the literature (see Mouw et al. 2017 and search newer literature on PSC algorithms) should be added.

d) Figure 9, also add the number of matchups (at least in the figure caption), add also the MRD!

e) Fig. 11 color scale for Chl-a should contain more colors, as in Fig.11 abundance presentation and in Fig. 13, so differences in Chl-a are more visible.

f) Typos: in line 358 and 370 – this should cite the correct subfigures of Fig. 11.


RR by Alison Chase (04 Oct 2023)

ED: Reconsider after major revisions (11 Oct 2023) by Jochen Wollschlaeger

AR by Roy El Hourany on behalf of the Authors (09 Nov 2023) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (10 Nov 2023) by Jochen Wollschlaeger

RR by Anonymous Referee #3 (29 Nov 2023)

RR by Alison Chase (29 Nov 2023)

ED: Publish subject to technical corrections (05 Dec 2023) by Jochen Wollschlaeger

AR by Roy El Hourany on behalf of the Authors (15 Dec 2023) Author's response Manuscript

Short summary

Satellite observations offer valuable information on phytoplankton abundance and community structure. Here, we employ satellite observations to infer seven phytoplankton groups at a global scale based on a new molecular method from Tara Oceans. The link has been established using machine learning approaches. The output of this work provides excellent tools to collect essential biodiversity variables and a foundation to monitor the evolution of marine biodiversity.