Using feature-based verification methods to explore the spatial and 1 temporal characteristics of forecasts of the 2019 Chlorophyll-a 2 bloom season over the European North-West Shelf

9 A feature-based verification method, commonly used for atmospheric model applications, has been 10 applied to Chlorophyll-a (Chl-a) concentration forecasts from the Met Office Atlantic Margin Model at 11 7 km resolution (AMM7) North West European Shelf Seas model, and compared against gridded 12 satellite observations of Chl-a concentration from the Copernicus Marine Environmental Monitoring 13 Service (CMEMS) catalogue. A significant concentration bias was found between the model and 14 observations. Two variants of quantile mapping were used to mitigate against the impact of this bias on 15 feature identification (determined by threshold exceedance). Forecast and observed Chl-a objects for the 16 2019 bloom season (March 1 to 31 July), were analysed, firstly in space only, and secondly as space17 time objects, incorporating concepts of onset, duration and demise. It was found that forecast objects 18 tend to be too large spatially, with lower object numbers produced by the forecasts compared to those 19 observed. Based on an analysis of the space-time objects the onset of Chl-a blooming episodes at the 20 start of the season is almost a month too late in the forecasts, whilst several forecast blooms did not 21 materialise in the observations. Whilst the model does produce blooms in the right places, they may not 22 be at the right time. There was very little variation in forecasts and results as a function of lead time. A 23 pre-operational AMM7 analysis, which assimilates Chl-a concentrations was also assessed, and found 24 to behave more like the observations, suggesting that forecasts driven from these analyses could 25 improve both timing errors and the bias. 26 https://doi.org/10.5194/os-2020-100 Preprint. Discussion started: 2 November 2020 c © Author(s) 2020. CC BY 4.0 License.

1 Introduction 27 The advancements in atmospheric numerical weather prediction (NWP) such as the improvements in 28 model resolution began to expose the relative weaknesses in so-called traditional verification scores 29 (such as the root-mean-squared-error for example), which rely on the precise matching in space and 30 time of the forecast to a suitable observation. These metrics and measures no longer provided adequate 31 information to quantify forecast performance (e.g. Mass et al. 2002). One key characteristic of high-32 resolution forecasts is the apparent detail they provide, but this detail may not be in the right place at the operational version (hereafter referred to as AMM7v8). Note that the analysis (1-day hindcast) and 136 forecasts used here are available from the CMEMS catalogue.  3) Define a threshold which captures the feature of interest and apply it to both the smoothed 203 forecast and observed fields to identify simple objects. 204 4) The original intensity information in the field is then reinstated in the identified features (i.e. the 205 analysis of the object attributes is not based on the smoothed fields). 206 5) Depending on the merging option that is chosen, simple objects that are identified as being 207 related to each other are merged to form cluster (complex) objects. 208 6) Lastly, objects in the forecast and observed fields can be matched based on a range of criteria 209 using a fuzzy logic engine (low level artificial intelligence), which together are expressed as the 210 so-called "interest" score. The higher the score the stronger the match. All objects are compared 211 in both fields and interest scores are computed for all. A threshold is set on the interest score 212 value (typically 0.7) to denote which are the best matches to provide a unique best match for 213 each object pair. Some objects will remain unmatched (either because there is none or because 214 there are no interest values high enough to provide a credible match) and these can be analysed 215 separately. 216 Simple forecast and observed (analysis) object attributes which can be evaluated include centroid 217 location, area, axis angle, curvature and aspect ratio. 1) The sensitivity to threshold and smoothing (convolution) radius should be explored. 228 Numerically information such as the object counts, and areas associated with each combination 229 of threshold and smoothing radius can be summarised into what is known as a "quilt plot". 230 2) The sensitivity to the merging option must also be investigated. The options provided include 231 none, threshold only (using double thresholds), a fuzzy logic engine, or a combination of both 232 threshold and fuzzy logic. Depending on the field this could have an impact. In this instance the 233 merging option had very little impact.

234
3) The behaviour of the matching can also be configured. The interest values that are computed for 235 each possible pair of forecast-observation objects are thresholded to define which objects match. 236 Options include no_merge, merge_forecast and merge_both. There is an increase in computation 237 expense for the merge_both option, which may, or may not, be necessary for a given application. Note also that a minimum size (area) is set for object identification. This is often a somewhat pragmatic 240 choice. If the size is set too small, too many objects are identified, which end up being merged. If too 241 large, very few objects are identified. In this study the merge_both option was used for MODE with a 242 minimum area of 10 grid squares (~70 km 2 ). 243 244 Identical to MODE, identifying time-space objects in MTD uses smoothing and thresholding. Applying 245 a threshold yields a binary field where grid points exceeding the threshold are set to one. At this stage 246 each contained region of non-zero grid points in space and time is considered a separate object, and the 247 grid points within each object are assigned a unique object identifier. For MTD the search for 248 contiguous grid points not only means examining adjacent grid points in space, but also the grid points 249 in the same or similar location at adjacent times to define a space-time object. The same fuzzy logic-250 based algorithms used for merging and matching in MODE apply to MTD as well. Similarly, to MODE 251 a minimum volume of 1000 grid squares was imposed for space-time object identification. For MTD a 252 lower interest score of 0.5 is used for matching objects. 253 MODE and MTD produces object attributes for both "single" and "paired" objects (when matching), as 255 well as for "simple" or "cluster" (when merging objects within either the forecast or analysis field) 256 object attributes. Throughout this analysis the single simple objects have been used when considering 257 forecast-only or analysis-only attributes. In addition to the interpolation of the L4 ocean colour product onto the AMM7 grid a smoothing radius 270 of 5 grid points was also applied to the observed fields to remove some of the very small and noisy 271 objects typically found near the coast (which neither AMM7v8 nor AMM7v11 can resolve). No 272 smoothing was applied to the forecasts or model analyses as these were considered to be smooth 273 enough. This radius was identified based on the sensitivity analysis, which will be described in more 274 detail in Section 4. This sensitivity analysis also identified the concentration thresholds which were 275 viable for analysis. Only the 2.5 mg.m -3 threshold will be discussed here. For this study the default 276 settings in MODE were used to compute the interest score.    AMM7v8 analyses, the two that clearly differ more dramatically from Fig 2. These are plotted in Fig. 3, 324 showing that the differences are not just due to an offset in the concentrations but a more complex 325 difference. Close to half of the AMM7v8 analyses concentrations are significantly lower than observed, In practice the application of a quantile mapping method means that the threshold-exceedance seen in 338 the forecasts occurs at the same proportion as that seen in the observations. This frequency equivalence, 339 applied across the whole field, behaves as a bias removal tool. To explain quantile mapping another 340 way, the observed values at that time are ranked and the threshold value is determined as a quantile of 341 that distribution. The equivalent quantile is then selected from the ranked forecast values. For the three-dimensional MTD analysis tool this option was not available as yet. In this instance the 361 seasonal distribution shown in Fig. 3 was used to derive a seasonal threshold denoting a percentile 362 https://doi.org/10.5194/os-2020-100 Preprint. Discussion started: 2 November 2020 c Author(s) 2020. CC BY 4.0 License. equivalence across the two datasets. The reference (fixed) threshold is based on the L4 product. In this 363 instance the day-to-day frequency bias will not necessarily be 1 but the frequency bias will be 364 approximately 1 when the season is taken as a whole.

366
Once the bias has been taken account of in this manner, the spatial properties of the subsequent 367 identified objects can be analysed without the concern that the concentration differences are leading to a 368 misinterpretation of results (remembering that the primary purpose of a feature-based assessment is to 369 determine whether features of interest can be identified with any skill).

420
For the MTD analysis objects in the L4 ocean colour product and the AMM7v11 analyses were defined 421 using a Chl-a concentration threshold of 2.5 mg.m -3 , whereas for the AMM7v8 forecasts and analysis a 422 threshold of 6 mg.m -3 was used, derived from the CDFs plotted in Fig. 3. This is slightly higher than the In order to ensure that MODE used optimal settings for the ocean forecasts under study, the sensitivity 429 of results to smoothing and Chl-a concentration were investigated to find the best object identification 430 results, balancing the need for identifying objects with keeping the number of objects manageable. Much of the initial identification of thresholds and smoothing requirements was done using data from 433 the 2018 bloom season. It is worth noting that this work was done without accounting for the 434 concentration differences but simply analysing the distributions inherent within the data sets. Figure 6 435 provides a selection of quilt plots derived from using the L4 ocean colour products and AMM7v8 436 analyses during July 2018, using one of the merging options which was tested. As stated earlier, results 437 for other options were very similar and will not be shown.  Fig. 6 it is clear there is switch in the sign of the object count "bias" for thresholds above 2.5 451 mg.m -3 , where the AMM7v8 analysis has far more objects than the L4 ocean colour product. 452 Conversely at or below this threshold there are far more L4 objects identified than AMM7v8 objects. 453 Further examination shows that there are very few L4 objects above 2.5 mg.m -3 of any sensible size, so 454 this was chosen as the threshold for identifying Chl-a bloom objects. The median object area increases 455 with increasing smoothing so that the largest areas occur for the largest smoothing radii. It is therefore 456 logical that the potential for variations and larger differences increases also with increasing smoothing 457 radius. This is shown in (b) where it is apparent that the differences between the data sets becomes 458 larger with increasing smoothing, thus suggesting an upper limit of 6 grid squares on the smoothing 459 radius for the L4 product. The starkest differences, and hence the need for addressing the concentration

505
Overall, it will be shown that the AMM7v11 analysis is much closer to the L4 observations than the 506 AMM7v8 analysis. Therefore, the AMM7v11 can be used as a credible source for assessing the AMM7 507 forecast model system going forward. The AMM7v8 analysis on the other hand, does not resemble the 508 L4 observations sufficiently, and should not be used for assessing the forecasts. The major benefit of 509 using a model analysis is that it is at the same spatial resolution, with the same ability to resolve Chl-a 510 bloom objects (i.e. limits the uncertainty due to whether an object could be missing due to the inability 511 of the model to resolve the feature). At this model resolution any coastal objects do not feature in any 512 subsequent data analysis.    In Fig. 10 the number of objects in both sets of observations (AMM7v11 and L4) starts off small and 574 increases as the bloom develops. In general, the number of matched forecast objects in Fig. 10(a) 575 evolves in the same way as the number seen in Fig. 10(b). A spike in the number of matched objects 576 seen in early April can be attributed to several coastal locations, which appear to be spatially well-

588
The identified objects in each of the data sets: AMM7v8, AMM7v11 and L4 ocean colour product can 589 also be considered spatially, by counting the frequency with which a given grid square falls within an 590 identified object on any given day. These can be added up over the entire season to produce a spatial 591 https://doi.org/10.5194/os-2020-100 Preprint. Discussion started: 2 November 2020 c Author(s) 2020. CC BY 4.0 License. composite object or "frequency-of-occurrence" plot. Figure 11 shows this spatial composite identified 592 through the 2019 bloom season for each of the AMM7v8 Day 1 forecast objects (a), the L4 ocean 593 colour product objects (b) and the AMM7v11 objects (c). All objects are identified using the 2.5 mg.m -3 594 threshold. The AMM7v8 objects in (a) are clearly larger and cover more locations but each location 595 with a lower frequency; there are more grid squares where there is an object identified between 0-20% 596 of the time than for the L4 observed objects, as seen in Fig. 11(b). Noticeably, there is a patch in the 597 central North Sea where the AMM7v8 forecasts identify objects some of the time, but the L4 ocean 598 colour product does not have objects there at all. The AMM7v11 analysis, shown in Fig. 11(c) has 599 objects there some of the time, but looks more like the L4 composite; this could indicate the model 600 tends to generate high Chl-a concentrations in this area, but the data assimilation is able to constrain it.

618
For (a) the thresholds varied to but were anchored to "truth" threshold of 2.5 mg.m -3 , which was used for (b) and (c).

620
Thus far all the attributes have been based on only the forecast or only the observed objects. Figure 12   621 gives an example of a paired object attribute using box-and-whisker plots, which are produced by 622 comparing the AMM7v8 day 0 forecast to L4 and AMM7v11 (labelled AMM7v8 vs AMM7v11, and 623 AMM7v8 vs L4) and a third option of comparing the two truth sources (labelled AMM7v11 vs L4). 624 Figure 12 shows the intersection-over-area diagnostic, which essentially gives a measure of how much 625 the paired forecast-observed objects overlap in space. If the objects do not intersect, this metric is 0. The 626 IQR is ~0.45 with 50% of paired objects having an intersection-over-area of 0.6 or greater (it is easy for 627 smaller L4 ocean colour product areas to be completely enveloped by the model analyses, even with the 628 concentration bias accounted for). However, the whisker spans the entire range of values (between 0 629 and 1) which shows that there are instances where this metric is 0. It clearly shows that the AMM7v11 630 analysis is closest to the L4 ocean colour product, with all pairs overlapping in some way. Finally, the 631 AMM7v11 vs L4 shows the most compact distribution of values. There is quite a difference between 632 the median (notch) and the mean (dashed line) for this metric, suggesting the distribution is skewed with 633 the mean affected more by many small values.  The focus shifts to MTD output in subsequent sections. Having information in space and time enables 641 one to ask, and hopefully answer, many questions related to how the bloom season was initially 642 detected and subsequently forecast. What is particularly helpful is that elements such as location and 643 timing errors can be treated separately to answer: "did the model predict the bloom to start in the 644 observed location?" or "did the model predict the onset at the right time?" and "did the model predict 645 the peak and duration of the bloom correctly?". We address location errors first.  There is a significant concentration bias in the forecasts compared to the satellite ocean colour product. 766 This needs to be mitigated against before using a threshold-based methodology such as MODE or 767 MTD, which aims to understand the spatial properties of the forecasts (i.e. the spatial extent is affected).

768
A quantile mapping approach was used to mitigate against this concentration bias to ensure that the 769 frequency of occurrence of specific concentrations remained the same, either precisely (for MODE) or Once AMM7v8 has picked up the start of the season, subsequent events are handled somewhat better. 795 Beyond the timing issues, the model does generally produce Chl-a blooms in roughly the right locations 796 but not necessarily at the right time, though the overlap between blooms can still be limited, despite the 797 apparent size advantage of the AMM7v8 bloom objects.