Clustering analysis of the Sargassum transport process: application to stranding prediction in the Lesser Antilles

The massive Sargassum algae strandings observed over the past decade are a new natural hazard that currently impacts the island states of the Caribbean region (human health, environmental damage, and economic losses). This study aims to improve the prediction of the surface current dynamics leading to beachings in the Lesser Antilles, using clustering analysis methods. The input surface currents, including the windage effect, were derived from the Mercator model and the Hybrid Coordinate Ocean Model (HYCOM). Past daily observations of Sargassum stranding on the Guadeloupe coasts were also integrated. Four representative current regimes were identified for both Mercator and HYCOM data. The analysis of the backward current sequences leading to strandings showed that the recurrence of two current regimes is related to the beaching peaks observed in March and in August, respectively. A decision tree classifier was built; its accuracy reaches 73.3% with 0.04°-scale HYCOM data and 50.8% with 0.08°-scale Mercator data. This significant accuracy difference highlights the need for very small-scale current data (i.e., finer than 5 km) to assess coastal Sargassum hazard in the Lesser Antilles. The present clustering-based predictive system would help improve the management of this risk in the islands of this region.

given to areas at stake such as inhabited areas, shores with economic or tourist activities, and ecosystems or other environmental niches. The particularity and the difficulty lay in the fact that 60% of this coastline and/or of the stranded volume remained inaccessible to the currently proven techniques at currently bearable costs.
(2) An impact on human health and ecosystems, because in shallow and small bays the accumulated algae degrade by fermentation and emit chemical compounds such as hydrogen sulfide (H2S) and ammonia (NH3) (Anses, 2017; Van Tussenbroek et al., 2017; Resiere et al., 2018).
(3) The survey conducted by the organizations responsible for socio-economic development estimated that the decline in tourism resulted in an economic loss of $5.5 million for the first half of 2015 (https://eos.org/features/sargassum-watchwarns-of-incoming-seaweed). The volumes to be collected are considerable compared with the size of these islands (< 1200 km² each) and the vulnerability of these territories. This new phenomenon has raised several scientific questions, such as the transport and origins of the algae, the sources of nutrients promoting their growth and, above all, the physical factors that led to the occurrence and development of Sargassum rafts in the tropical and equatorial Atlantic.
Using large-scale observations from ocean color satellite remote sensing, historical hydrographic observations, time series of Sargassum volume collected on ships, and multi-year reanalyses of wind and current, numerical models estimated the role of both subsurface nutrient supply and surface current transport. Several authors have contributed to the understanding of the mechanisms and physicochemical processes governing the phenomenon (Gower et al., 2006; Gower and King, 2011; Gower et al., 2013; Maréchal et al., 2017; Johns et al., 2020). Operational systems have been developed, such as the satellite-based Sargassum Watch System SaWS (Hu, 2009; Hu et al., 2015) and the Sargassum Early Advisory System (SEAS) (Webster and Linton, 2013). They provide a temporal and spatial assessment of annual seasonal increases and decreases in Sargassum algae amount over wide areas of the tropical Atlantic and Caribbean (Wang and Hu, 2016; Wang and Hu, 2017; Wang et al., 2019). Time series from remote sensing were coupled with spatial distribution models to determine the mechanisms that aggregate Sargassum algae along a zonal band in the tropical Atlantic, considering possible nutrient sources promoting the observed annual blooms (Wang et al., 2018; Wang et al., 2019; Johns et al., 2020). Tropical Atlantic currents and winds seasonally aggregate and carry these algae towards the Caribbean (Franks et al., 2016; Brooks et al., 2018; Cuevas et al., 2018). Modeling studies mainly focused on the transport properties of Sargassum rafts by offshore currents (Wang and Hu, 2017; Brooks et al., 2018; Maréchal et al., 2017; Putman et al., 2018, 2020; Wang et al., 2019; Berline et al., 2020). Johns et al. (2020) extended this analysis to highlight anomalous transport due to the 2009-2010 NAO anomaly and seasonal aggregation by the Inter Tropical Convergence Zone (ITCZ). A combination of MODIS AFAI satellite images and HYCOM surface current forecast data was used by Maréchal et al. (2017) for the short-term prediction of Sargassum strandings in Guadeloupe and the French Antilles islands. Maréchal et al. (2017) showed that this short-term prediction system (i.e., detection starting within 50-100 km of the coasts) worked efficiently during the year 2015, with a performance percentage of 62% and a stranding forecast date uncertainty below one day.
In the above works, the implementation of methods based on several independent data sets has led to the production of scientific knowledge and even to the development of large-scale forecasting systems. None of them used predictive modelling, including classifiers, to determine the probability of a set of data belonging to another set in order to discover repeatable patterns and thereby produce a decision for risk prevention managers.
In this paper, we propose to use clustering and decision tree classifier methods, combining ocean surface current and wind reanalysis with past observed strandings to obtain a first predictive model of Sargassum beaching on the Caribbean coasts. This model will be used with forecast data as input to produce an operational decision support system.
As ocean data are spatio-temporal fields, machine learning methods such as K-Means (KMS) may be used to obtain a finite number of possible k-cluster partitions of the surface currents. These methods have been widely used in weather forecasting (Michelangeli et al., 1995; Cassou et al., 2004; Boé and Terray, 2008) but are much less common in physical oceanography (Harms and Winant, 1998; Hisaki, 2013). To optimize the final partitioning, an additional metric based on the Kullback-Leibler divergence (Kullback and Leibler, 1951; Biabiany et al., 2020) will be included.
We focused on the offshore region covering either side of the Lesser Antilles, between 55-66°W and 8-17°N (Fig. 1a). Visual analysis of the monthly SaWS maps indicates that this region remains the primary pathway for Sargassum rafts from the Atlantic Ocean to the Caribbean Sea. The North Equatorial Current (NEC), the Guiana Current (GC), and the eddies and retroflection front of the North Brazil Current (NBC) are the main contributors to this transport. Figure 1b describes the study area divided into a first sub-set, "LA1", for the Caribbean Sea, a second one, "LA2", between 18°N and 14.5°N (Guadeloupe, Dominica, Martinique, Saint Lucia), and a third one, "LA3", south of 14.5°N (Saint Vincent, Barbados, Trinidad and Tobago). This ocean region corresponds to the CA and TA1 boxes in Johns et al. (2020).
The questions are as follows. Can dynamic patterns of surface currents in the Lesser Antilles be summarized as a discrete set of cases? What is their temporal recurrence? What combinations of currents enhance the arrival and stranding of Sargassum rafts on the Lesser Antilles coasts? What is the contribution of this type of predictive modelling to the prevention of this new natural hazard?
The database, clustering methods and decision tree used in this study are described in Sect. 2. The obtained current regimes, their relationship to Sargassum hazard and the decision support system performances are presented in Sect. 3. These results are discussed in Sect. 4.

HYCOM surface current dataset
Fine-scale surface current data from the 1/25-degree HYCOM + NCODA Gulf of Mexico analysis model (GOMu0.04/expt_90.1m000 version; Hogan et al., 2014; Helber et al., 2013; Cummings and Smedstad, 2013; Cummings, 2005) between 1 January 2019 (i.e., the starting date of the available data) and 31 December 2020 were analyzed. Daily 12Z fields giving the u and v components of the current at 50 cm depth were used. These fine-resolution current data were not used in previous studies dealing with Sargassum hazard (Putman et al., 2018; Johns et al., 2020).

Mercator surface current dataset
The daily 50-cm depth current components from the PSY4V3R1 Mercator 1/12-degree 3D analysis system, including version 3.1 of the NEMO ocean model (Lellouche et al., 2018; Gasparin et al., 2019), were also analyzed over the same period as HYCOM
https://doi.org/10.5194/os-2021-109 Preprint. Discussion started: 18 November 2021 © Author(s) 2021. CC BY 4.0 License.
the method used here is to consider the spatial variability in the dynamics of the analyzed daily surface currents from L2. The LA study area was separated into three parts (Fig. 1b) based on the Sargassum raft transport centers of action reported in the literature (Franks et al., 2016; Berline et al., 2020). To the west of LA, the first zone, LA1, is centered on the Caribbean Sea. To the east, the Atlantic zone has been split into two areas at about 13.5°N, just above Barbados island. To the south-east is the LA3 zone, under the influence of the North Equatorial Recirculation Region (NERR) and its retroflection rings, while to the north-east is the LA2 zone, more representative of the North Equatorial Current. The analyzed daily fields include a total of 14 279 meshes (4 282 meshes in LA1, 3 407 meshes in LA2 and 4 536 meshes in LA3). The remainder corresponds to land areas.
The second step was to group the information carried by the daily current velocity fields, conditionally on the three given zones, into histograms. The similarity between fields is estimated per pair and per zone from the symmetrized Kullback-Leibler (KL) divergence computed between the histograms (Kullback and Leibler, 1951). This allows the relative entropy between two distributions to be expressed without any a priori assumption about the probability distribution. The last step consisted in averaging the divergence values obtained for the zones. This yields a single value, named Expert Distance (ED), quantifying the similarity between the individuals of the database during clustering. The clustering results were evaluated using the Silhouette Index (Rousseeuw, 1987).
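The three steps above can be sketched as follows. This is a minimal illustration, not the implementation of Biabiany et al. (2020): it assumes that the histograms are built on current speed only, that the symmetrization is the sum of the two directed KL divergences, and that empty bins are smoothed with a small constant. The function names, the bin edges and the `eps` parameter are hypothetical.

```python
import math
from bisect import bisect_right

def histogram(values, bin_edges):
    """Count values per bin; values outside [first edge, last edge) are ignored."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        idx = bisect_right(bin_edges, v) - 1
        if 0 <= idx < len(counts):
            counts[idx] += 1
    return counts

def symmetrized_kl(counts_p, counts_q, eps=1e-9):
    """KL(p||q) + KL(q||p) for two histograms (eps smoothing is an assumption)."""
    p = [c + eps for c in counts_p]
    q = [c + eps for c in counts_q]
    sum_p, sum_q = sum(p), sum(q)
    p = [x / sum_p for x in p]
    q = [x / sum_q for x in q]
    # KL(p||q) + KL(q||p) simplifies to sum((p - q) * log(p / q)).
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))

def expert_distance(speeds_a, speeds_b, bin_edges):
    """Average over zones of the symmetrized KL divergence between two days.

    speeds_a and speeds_b map a zone name (e.g. "LA1") to the list of
    current speeds of that zone for one day.
    """
    divergences = [
        symmetrized_kl(histogram(speeds_a[zone], bin_edges),
                       histogram(speeds_b[zone], bin_edges))
        for zone in speeds_a
    ]
    return sum(divergences) / len(divergences)
```

With this formulation, two days with identical per-zone histograms are at distance zero even if the speeds are arranged differently in space within a zone, which is why the split into LA1, LA2 and LA3 matters.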
The SaMk index defined in Biabiany et al. (2020) was used. It expresses the quality of a clustering as the average of the quality of each cluster, which is itself the average of the silhouette indices s(i) over the cluster elements. This index is defined as follows: (2)
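Written out from the verbal definition above, the index would take the following form. The notation is assumed here rather than taken from Biabiany et al. (2020): k is the number of clusters, C_j the j-th cluster and |C_j| its number of elements. The silhouette s(i) is the standard one of Rousseeuw (1987), where a(i) is the mean distance from element i to the other elements of its own cluster and b(i) is the smallest mean distance from i to the elements of any other cluster.

```latex
\mathrm{SaMk} = \frac{1}{k} \sum_{j=1}^{k} \left( \frac{1}{|C_j|} \sum_{i \in C_j} s(i) \right),
\qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

Averaging per cluster first gives each cluster the same weight whatever its size, unlike the plain mean of s(i) over all days.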

Clustering analysis on stranding backward sequences
To better understand the current regime dynamics that may lead to Sargassum strandings on the coasts of Guadeloupe, the 30-day backward current sequences preceding past strandings were analyzed. While 110 observed stranding days were registered between January 2019 and December 2020, only 107 back-sequences were studied here, because the stranding days registered in January 2019 were removed to avoid back-sequences with missing data for the December 2018 period. These 107 stranding back-sequences were examined with the highest-resolution surface current model, i.e., HYCOM fields. Dissimilarities between these backward sequences were calculated with optimal matching methods before dividing the population into several groups using a hierarchical classification (Larmarange et al., 2015). The Longest Common Subsequence (LCS) method was used to compute the distances between the backward sequences (Elzinga and Studer, 2015; Studer and Ritschard, 2016). A dendrogram was calculated using Ward's algorithm. The highest relative inertia loss criterion was used to determine the optimal number of partitions (TraMineR package; Gabadinho et al., 2011).
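As an illustration of the LCS dissimilarity between two backward sequences of cluster numbers, the sketch below uses the common definition d(a, b) = |a| + |b| - 2 LCS(a, b). It is a standalone Python example with hypothetical function names, not the TraMineR code used in the study, and the hierarchical clustering step is not shown.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_distance(a, b):
    """Number of elements of both sequences left outside the common subsequence."""
    return len(a) + len(b) - 2 * lcs_length(a, b)

# Two short, made-up backward sequences of current regimes (1 to 4).
seq_a = [3, 3, 2, 2, 1]
seq_b = [3, 2, 2, 2, 1]
distance = lcs_distance(seq_a, seq_b)
```

The pairwise distances computed this way form the dissimilarity matrix on which the dendrogram is built.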

Decision support system
To determine the probability of Sargassum stranding at a given location, a decision tree was built using complementary elements called "modules" (Fig. 2). Each module generates information based on input data including surface currents with windage effects (Mercator, HYCOM and ERA-5) and past observations of strandings in Guadeloupe. Thus, for a given day, the proposed system works as follows:
• Module A takes as input the month of the selected day and returns the associated monthly probability (frequency) of stranding;
• Module B assigns a cluster number to the day of interest after the ED clustering of the daily surface currents. Then, starting from this day, it builds an empirical backward sequence of numbers between 1 and 4 (type of cluster) over the 30 previous days;
• Module C takes as input the daily cluster number produced by module B and returns the probability (frequency) of stranding associated with the type of cluster. This probability is calculated, by cluster type, from the strandings observed on the coasts of the Guadeloupe archipelago. The system has 107 30-day stranding backward sequences. These backward sequences start on the day of stranding on the coasts of Guadeloupe. This set of reference stranding backward sequences is called BASE (Fig. 2b);
• Module D compares the backward sequence of the given day to the stranding backward sequences using the Jaccard distance. Module D is connected to BASE and to module B. It returns the percentage of correspondence between them.
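The frequency-based modules (A and C) and the sequence comparison of module D can be sketched as below. Several choices are assumptions made for illustration only: each sequence is turned into a set of (lag, cluster) pairs before the Jaccard index is applied, and module D is taken to return the best match over BASE. The function names are hypothetical.

```python
from collections import Counter

def empirical_frequency(labels, stranded):
    """Fraction of stranding days per label (a month for module A, a cluster for module C)."""
    totals = Counter(labels)
    hits = Counter(label for label, s in zip(labels, stranded) if s)
    return {label: hits[label] / totals[label] for label in totals}

def jaccard_similarity(seq_a, seq_b):
    """Jaccard index between two sequences seen as sets of (lag, cluster) pairs."""
    set_a = set(enumerate(seq_a))
    set_b = set(enumerate(seq_b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def module_d(sequence, base):
    """Best correspondence between a day's backward sequence and the BASE sequences."""
    return max(jaccard_similarity(sequence, reference) for reference in base)
```

The Jaccard distance mentioned in the text is one minus the index computed here, so a correspondence close to 1 means a distance close to 0.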
In the literature, the average of the different modules is often used as the decision operator (Bo et al., 2020; Swain and Hauska, 1977). In the present work, the percentage of stranding for a given day was determined using the percentages provided by modules A, C and D, according to the following formula: (3) where P(i) is the quantity used in the design of the decision rule. This rule is simply the linear combination of the percentages from modules A, C and D, calculated according to: where i belongs to the set of past days (2019-2020) and DECISION(i) is the (logical) response of the decision tree for a given day, i.e., expressed in binary form. The tree proposed in Fig. 2 was tested on the first 120 days of the year 2021, from 1 January 2021 to 30 April 2021, i.e., 120 tests.
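A decision rule of this kind can be sketched as follows. The coefficients of the linear combination and the decision threshold are not given in the text above, so the equal weights and the 0.5 threshold below are placeholders chosen for illustration, and the function names are hypothetical.

```python
def stranding_score(p_a, p_c, p_d, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Linear combination P(i) of the outputs of modules A, C and D.

    The equal weights are an assumption, not the coefficients of the study.
    """
    w_a, w_c, w_d = weights
    return w_a * p_a + w_c * p_c + w_d * p_d

def decision(score, threshold=0.5):
    """Binary response DECISION(i); the threshold value is an assumption."""
    return score >= threshold

# Hypothetical module outputs for one day.
score = stranding_score(p_a=0.20, p_c=0.30, p_d=0.95)
alert = decision(score)
```

Changing the weights is one way of prioritizing a module, for instance giving more influence to the sequence comparison of module D than to the monthly frequency of module A.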

Surface current patterns in the focused area
The deciles of surface current velocities including windage, according to equation (1), are presented in Table 1. For both models, HYCOM and Mercator, the velocities do not exceed 2.57 m s⁻¹ and 90% of them remain below 0.65 m s⁻¹. The Mercator data have a median of 0.28 m s⁻¹ and a mean of 0.33 m s⁻¹, while for HYCOM these values are equal to 0.32 m s⁻¹ and 0.36 m s⁻¹, respectively. The ratio between the first and the last decile is close to 6. Figure 3 shows skewed distributions, with skewness equal to 1.31 and 1.21. The distribution mass is concentrated on the left. There are extreme values indicating surface current speeds with deviations 5 times greater than the standard deviation.
To assess the contribution of each of the three regions (i.e., LA1, LA2, LA3) to the deciles, the relative frequency against the decile thresholds given in Table 1 is shown in Fig. 4. Three different shapes can be seen. In the Caribbean Sea, the LA1 relative frequency
The frequency distributions show two opposite behaviors for LA2 and LA3. In the northern Atlantic part of LA, the LA2 area, the frequency decreases with current speed. Current speeds above 0.65 m s⁻¹ are very uncommon. On the contrary, in the southern Atlantic part of LA, the LA3 area, the frequency increases with current speed, with the maximum frequency associated with current speeds above 0.65 m s⁻¹.
These three distinct current speed distributions associated with LA1, LA2 and LA3 confirm the need to separate these three areas in the ED metric clustering process.
The differences between HYCOM and Mercator current vectors were also examined for each grid cell (Fig. 5). Overall, offshore, the current speed differences are small and remain below 0.15 m s⁻¹. These differences between HYCOM and Mercator increase close to the islands, with an average value of 0.3 m s⁻¹. The largest differences, above 0.5 m s⁻¹, are observed in the southern part of the LA arc, around Trinidad and Tobago.
At each grid point, the angular deviations found between the medians of the surface current velocity vector directions can be divided into three magnitude groups with a 45° step. Current direction differences between 0 and 45° form the most frequent group in the region, while those between 45 and 90° remain localized downstream of the islands. Finally, those above 90° occur exclusively around Trinidad.

Clustering analysis
To identify surface current patterns in the region, and then those that drive the transport of Sargassum rafts to the coasts of the LA islands, the clustering of the gridded data according to equation (1) was performed.

Clustering assessment
One of the known uncertainties in the k-means method is induced by the selected number of clusters. To find an optimal number of clusters and identify the best partition (Biabiany et al., 2020), the evolution of the silhouette index (SaMk) against the number of clusters, k, is shown in Fig. 6. The silhouette indices obtained by the KMS-ED method are in general above 0.2 for any k<15 and remain higher than those from the KMS-L2, HAC-L2 and HAC-ED methods. These values indicate that the quality of the clusters is much better with the KMS-ED method. The inflection point of the KMS-ED curve occurs for the same number of clusters, k=4, for both Mercator and HYCOM data. This highlights four representative current regimes in the studied region, named MC1, MC2, MC3 and MC4 for Mercator and HC1, HC2, HC3 and HC4 for HYCOM.

Visual analysis of current regimes
The four types of surface current circulation, obtained in intensity and direction, are shown in Figs. 7 and 8 for the Mercator and HYCOM analyses, respectively. The paragon, which is the closest day to the centroid, was chosen to represent each type of cluster.
The four clusters may be distinguished by the NBC expansion and by the locations of the induced retroflection rings. The surface current velocities and their associated streamlines are driven by the following structures:
- Those which enter through the Caribbean Sea from the south, remaining almost parallel to the continental shelf;
- Those due to the propagation of the dynamic characteristics of the eddies related to the retroflection rings of the NBC. They come from the south of the LA3 region, along the Atlantic side of the Lesser Antilles arc, before passing through the Caribbean Sea towards 12-14°N;
- Those generally coming from the northeast of the LA1 and LA2 regions, representing the southern limit of the subtropical gyre, which cuts the Lesser Antilles at about 15°N. They keep their initial direction and are sheared by the south-east currents.
The number of days corresponding to each cluster is given in Table 2. MC1, HC2 and HC3 are the most common over the studied period. Each of them represents almost 30% of the daily outputs. However, none of the four clusters really stands out. For both analyses, the differences between cluster occurrences stay below 10%.

Matching days between clusters
The clusters found are also related by a set of days in common. Match percentages have been calculated using the following formula.
where p(m,h) is the percentage of correspondence between cluster Cm and cluster Ch, derived from the Mercator and HYCOM datasets respectively, and N(m, h) is the number of days shared by these two clusters. Table 3 shows the results.
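The count N(m, h) and a match percentage can be obtained from the two daily label series as sketched below. The normalization is not recoverable from the text above; dividing by the number of days in the Mercator cluster is an assumption made for illustration, and the function names are hypothetical.

```python
from collections import Counter

def shared_days(mercator_labels, hycom_labels):
    """N(m, h): number of days assigned to Mercator cluster m and HYCOM cluster h."""
    return Counter(zip(mercator_labels, hycom_labels))

def match_percentage(counts, m, h):
    """Share of the days of Mercator cluster m that fall in HYCOM cluster h.

    The choice of denominator is an assumption, not the formula of the study.
    """
    size_m = sum(n for (mm, _), n in counts.items() if mm == m)
    return 100.0 * counts[(m, h)] / size_m

# Made-up daily cluster labels for five days.
counts = shared_days([1, 1, 2, 2, 3], [2, 2, 2, 3, 3])
```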

Distribution and comparison of intensities
Deciles were used to study and analyze the velocity distributions characterizing each cluster. Evolutions of the relative frequency of Us(x,y,t) as a function of the deciles (Table 1)
For both models, overall, three main patterns are identified. The first pattern includes the clusters MC1, MC3, HC1 and HC3. This pattern is characterized by an increase of the relative frequency curve in the LA1 and LA3 regions and a decrease in the LA2 region. The elements of these clusters include strong current velocities above the median of 0.28 m s⁻¹. The second pattern includes the MC2 and HC2 clusters, which are characterized by a decrease of the relative frequency for the three regions (i.e., LA1, LA2, LA3).
The last pattern includes the MC4 and HC4 clusters and corresponds to three concave curves with maxima located at different velocity thresholds depending on the region under study.
To examine possible relationships, for a given region, between the two variables (decile speed thresholds and identified clusters), contingency tables were constructed (not shown) and the chi-squared test was performed. For the three areas, the p-value was much lower than 0.01. The chi-squared test results indicated that, for LA1, LA2 and LA3, the speed distribution depends on the identified cluster.
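For reference, the chi-squared statistic of a contingency table can be computed as in the sketch below. The table used here is made up, since the contingency tables of the study are not shown, and the function name is hypothetical. The statistic is compared with the tabulated critical value rather than converted to a p-value; a library routine such as `scipy.stats.chi2_contingency` would return the p-value directly.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom of a contingency table.

    table is a list of rows of observed counts (e.g. clusters x speed classes).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical 2 x 2 table: two clusters against two speed classes.
statistic, dof = chi_squared([[10, 20], [20, 10]])
# For 1 degree of freedom, the critical value at the 0.01 level is about 6.635.
dependent_at_1_percent = statistic > 6.635
```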

Seasonality
The monthly distribution of each cluster is plotted (Figs. 11 and 12)
followed by MC2 and HC1 from April to July. The last two regimes are observed from August to December. The pair (MC4, HC2) reaches a maximum in September, while MC1 and HC4 persist until February of the following year.

Links with Sargassum strandings
As with many floating objects, before coming ashore on the coasts of the LA, Sargassum algae accumulate on the ocean surface in large amounts and form slicks, or filamentary structures, interspersed with void areas, under the influence of currents. These dynamic structures regularly observed from satellites, aircraft, and ships, have a certain inertia (Maximenko et al., 2012).
Beyond biological production, it is therefore the specific dynamic conditions of the surface currents and the surface winds which may lead to massive Sargassum strandings on Caribbean coastal areas.
The monthly evolution of observed stranding days on the Guadeloupe coasts and the monthly evolution of Sargassum abundance over the Central Atlantic region (SaWS, https://optics.marine.usf.edu/projects/SaWS.html) were also analyzed over the study period 2019-2020 (Figs. 11 and 12). During these two years, the amount of Sargassum over the Central Atlantic region increased significantly from February to July, then decreased from July to November. Two stranding peaks are found: one in March and the second in August. The stranding dates and the cluster occurrence dates were also compared in Table 4. The MC3-HC3 pair gathers the greatest number of similarities, followed by the MC1 and HC2 clusters.
The pairs (MC1, HC2) and (MC3, HC3) include the greatest number of observed stranding days in Guadeloupe (Table 4). These pairs of clusters would be favourable to the transport of these algae toward the coasts of the Lesser Antilles islands. MC2 and HC1 are the two clusters with the smallest number of stranding days.

Current regime backward sequences leading to strandings
The HAC clustering analysis of the current regime backward sequences leading to observed stranding days distributed the 107 backward sequences into four classes, called Seq1, Seq2, Seq3 and Seq4. This analysis integrated only the HYCOM surface current data, which have a finer resolution than Mercator. During the study period (i.e., 2019-2020), Seq4 (39.3%) and Seq2 (37.4%) have the greatest occurrence (Table 5). Seq1 and Seq3 have occurrences of 16.8% and 6.5%, respectively. Figure 13 shows that Seq2, Seq3, and Seq4 are characterized by the respective modal current regimes HC3, HC1, and HC2. For the Seq1 backward sequences, there is no clear prevalent current regime. The monthly distribution of the main backward sequence classes Seq2 and Seq4 highlights a significant seasonal splitting (Fig. 14). The Seq2 backward sequences occurred from December to June, while the Seq4 ones occurred from July to November. These two distributions also seem significantly correlated with the monthly occurrences of observed strandings. While the first stranding peak, occurring in March, is linked with the Seq2 maximum occurrences, the second stranding peak, occurring in August, is linked with the Seq4 maximum occurrences.
(Table 6). Overall, the performance of the decision tree reached 50.8% for the Mercator database and 73.3% for HYCOM. The behavior of each module is presented in Fig. 15. In general, modules A and C remain with probabilities
The percentages of stranding per cluster associated with module C show empirical probabilities close to 0.3, indicating that one third of the days in the concerned clusters are stranding days. Module D produces empirical probabilities related to the links between the past observed sequences and the sequences corresponding to the forecast day. In our case, they can reach 0.95 (Fig. 15a), indicating strong similarities between the sequences.

Performance indices and clustering quality
The performance of the clustering and the quality of the clusters were assessed using the silhouette coefficient. The evolution of this coefficient (Fig. 6) clearly shows that, on the one hand, the methods based on the HAC algorithm produce lower values than those obtained by the KMS algorithms. On the other hand, for ED, silhouette indices are well above those found with the L2 distance, as reported by Biabiany et al. (2020). This silhouette coefficient evolution allows us to keep four representative types of current regimes in this part of the Caribbean region. However, due to the lack of works for this region, comparisons between the present results and other studies were very limited. In other studies, authors have proposed a similar number of dominant regimes on a large scale, in the tropical Pacific (Fereday et al., 2008), for the determination of robust modes of Northern Hemisphere sea ice variability (Fučkar et al., 2016), or for ocean mapping from environmental data (Zhao et al., 2020).
In our case, the velocity distributions show four singular profiles, confirming the good performance of the clustering. Each cluster also had a distinct monthly distribution. This analysis provided a better understanding of the variability of the surface current circulations in this region.

Surface current analysis applied to Sargassum hazard
In terms of spatial distribution, clusters show notable differences for both types of model analysis and three variability factors can be identified.
The first one is the seasonal evolution of the NBC retroflection front (Baklouti et al., 2007). The NBC feeds the Guiana Current (GC) but also separates sharply from the South American coastline near 6°-8°N and retroflects to feed, this time, the eastward NECC.
Isolated large rings move northwestward toward the Caribbean Sea, on a course parallel to the South American coastline, then interact with the Lesser Antilles (Fratantoni et al., 2002, 2006). These two dynamic structures, GC and NBC rings, contribute significantly to the transfer of South Atlantic surface water to the Caribbean. These dynamic structures were found in the four identified clusters and seem to operate year-round with intensity variations.
Another part of this variability is caused by the rings of the NBC that move northwestward from the equatorial Atlantic and interact with the steep topography of the Lesser Antilles arc. MC2 and HC1 are two typical cases. Interactions with the island chain cause significant disturbances of the inflow through the southern passages, with a blocking effect. This provides a meridional transport of surface water northward, along the LA arc (Fratantoni and Richardson, 2006; Huang et al., 2021). The Lesser Antilles arc clearly diverted the initially northwestward drift of the NBC rings to a more northward course parallel to the island arc. Johns et al. (2002)
Antilles south islands from Trinidad to Martinique) has a highly asymmetric seasonal cycle, with a maximum in June and a minimum in September-October. The annual distribution of MC2 and HC3 clusters is close to that found by Johns et al. (2002).
The last identified factor is related to the surface currents present in the North Atlantic region due to the North Current and the associated gyre circulation. In this part of the study area, several clusters show lower current speeds, and areas with large angular deviations in direction have also been identified. In the LA2 area (i.e., the Atlantic area between 14.5°N and 18°N), the relative frequencies of above-average speeds are the lowest with the northeast Trade Winds. The wind-current shear zones are also the most extensive. The wind-driven flow occurs from the subtropical gyre location to 15°N, near Martinique island (Johns et al., 2002). Passages through the Leeward islands have a maximum inflow in September and a minimum in June.
The comparison between the large-scale meteorological situations corresponding to the paragons showed that the main differences between the current regime clusters are related to the location and extension of the high-pressure centers, the positioning of the ITCZ, and the intensity of the Caribbean Low-Level Jet.
All clusters contain stranding days in relative abundance, from 12 to 36% of stranding days for the two years 2019 and 2020. The monthly distribution of clusters and the distribution of observed strandings in Guadeloupe are out of sync. The first peak of strandings occurs in March and seems linked with the maximum frequency of the MC3 and HC3 clusters. The second peak of observed strandings occurs in August and seems associated with the MC1, HC2 and HC4 clusters. Johns et al. (2020) found that windage forcing induced by the wind convergence accumulates Sargassum rafts within the ITCZ between April and September. This accumulation would contribute to the observed stranding peak in August. The clustering analysis of the stranding current backward sequences confirmed that the recurrence of HC3 (between December and June) and HC2 (between July and November) would induce large strandings on the Guadeloupe coasts during these respective periods. The HC2 current regime is characterized by the prevalence of the North Atlantic gyre, with weak velocities in the Western Central Atlantic and zonal streamlines. The HC3 current regime is characterized by a strong Guiana Current, with high velocities in the LA3 region and meridional streamlines almost parallel to the Lesser Antilles arc.

Predictive model performance
A machine learning-based method for predicting Sargassum beaching was proposed, built from a decision tree. This method has already been used for other parameters; it makes it possible to improve the prediction accuracy and to mitigate the fully black-box effect of neural networks. Compared to usual parametric statistical methods, it can effectively overcome the multicollinearity of independent variables (e.g., ocean current and surface wind). The accuracy of the decision tree reaches 73.3% for HYCOM against 50.8% for Mercator. Similar performance scores were found for decision trees predicting summer rainfall in Chongqing (China) (Bo et al., 2020) or landslide hazard in the Yen Bai Province (Vietnam) (Pham et al., 2020). However, asymmetric performances have been highlighted, with better results for true negatives than for true positives (Table 6). These can be attributed to the algorithm and to the weak ability of the model to handle different data sets. These prediction errors are greater for Mercator.
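The accuracy figures and the asymmetry between true positives and true negatives can be illustrated with the usual confusion-matrix scores. With 120 test days, 73.3% and 50.8% are consistent with 88 and 61 correct decisions respectively; these counts are inferred here from the rounded percentages. The split between true positives, true negatives, false positives and false negatives used below is purely hypothetical, since Table 6 is not reproduced in this text.

```python
def accuracy(tp, tn, fp, fn):
    """Share of correct decisions among all test days."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """Share of actual stranding days that were predicted (true positive rate)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of non-stranding days that were predicted (true negative rate)."""
    return tn / (tn + fp)

# Hypothetical split of 120 test days giving 88 correct decisions.
tp, tn, fp, fn = 10, 78, 12, 20
hycom_like_accuracy = round(100 * accuracy(tp, tn, fp, fn), 1)
true_positive_rate = sensitivity(tp, fn)
true_negative_rate = specificity(tn, fp)
```

In such a configuration, the overall accuracy is driven mainly by the non-stranding days, which is why sensitivity and specificity are worth reporting alongside it.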
Several ways to improve the predictive model were identified. The short period of observational data (i.e., only two years) may weaken the final decision and induce overfitting. The tree could also be improved by weighting and prioritizing the different modules to increase their relevance. The results could also be improved by optimizing the proposed decision rule (3) to better integrate the characteristics of the observed phenomenon.

Conclusion
For a decade, the Caribbean countries, and particularly the LA, have suffered from the impacts induced by the massive and regular arrival of Sargassum on their coastal areas. This study presents the application of a clustering approach to determine the types of surface current circulations integrating the additional wind drift and their possible links with the Sargassum strandings observed on the LA coasts. The Guadeloupe archipelago was chosen as the stranding observational site for the period 2019-2020. This analysis was performed using the most recent versions of the 3D ocean current models Mercator and HYCOM. The surface wind speed data from the ERA-5 model were also used. The clustering of the spatiotemporal surface current fields including windage was produced using the k-means algorithm combined with the expert distance metric. The silhouette index was used to determine the optimal number of clusters.
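The selection of the number of clusters by silhouette index can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study. The Euclidean distance stands in for the expert distance metric, the deterministic farthest-point seeding is an illustrative choice, and each toy point stands for a flattened daily current field. All function names and data are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance (assumed stand-in for the expert distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def init_centers(points, k):
    """Deterministic farthest-point seeding (an illustrative choice)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, n_iter=100):
    """Plain k-means; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the previous center if a cluster ends up empty.
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Mean silhouette index; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                  # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())    # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def best_k(points, k_values):
    """Return the k with the highest mean silhouette, plus all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores

# Synthetic data: four tight, well-separated groups (illustrative only).
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10), (10, 10)]
          for dx, dy in [(0, 0), (0.5, 0), (0, 0.5)]]
best, scores = best_k(points, range(2, 6))
```

On this toy data the highest mean silhouette is obtained for four clusters. With real current fields the scores are typically much closer to each other, so the choice usually deserves a visual check of the resulting patterns.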
For this region (8-17°N, 66-55°W), divided into three sub-regions, we identified four coherent patterns from the data sets. They contain the current structures related to the Guiana Current, the branches of the subtropical Atlantic gyre, the front and the retroflection rings related to the NBC.
The finer resolution of HYCOM analysis provided more detailed information on surface current velocities near the islands than Mercator fields (i.e., mean local velocity difference of 0.3 m s⁻¹). Offshore, these differences remain very small.
Links between clusters and observed strandings in Guadeloupe were studied considering windage, paragon velocity distributions and monthly abundance maps. The surface current circulations characterizing the (MC3; HC3) and (MC4; HC2) cluster pairs seemed the most favorable for the transport and the beaching of Sargassum on the Lesser Antilles coasts.
The clustering analysis of the stranding current backward sequences based on HYCOM fields confirmed that the recurrence of HC3 (Seq2, between December and June) and HC2 (Seq4, between July and November) would induce large strandings on the Guadeloupe coasts during these respective periods. While the HC2 current regime is characterized by the prevalence of the North Atlantic gyre with weak zonal velocities, the HC3 current regime is marked by the influence of the NBC, the induced retroflection rings and a strong Guiana Current leading to higher meridional velocities in the LA3 region.
Machine learning algorithms (KMS, ED, decision tree classifier) were applied to estimate the probability of Sargassum strandings in Guadeloupe, based on: surface current forecasts, current regime backward sequences and several combinations of probabilities.
The performance score of this predictive model showed that the finer resolution of HYCOM (i.e., lower than 5 km scale) seems more suitable to reproduce the small-scale current patterns that do or do not induce strandings in the Lesser Antilles. The decision tree accuracy reached 50.8% for Mercator and 73.3% for HYCOM. This accuracy could be improved by weighting and prioritizing the different modules. New modules could also be added, such as Sargassum remote sensing observations. Due to the very recent availability of the selected new-generation HYCOM version, the present study was conducted on only two years (i.e., 2019-2020). The studied period could be extended to more years to integrate the inter-annual variability of the surface currents. The present clustering analysis predictive system could be applied to other Lesser Antilles islands by changing the observational stranding site.

Nevertheless, the association of clustering methods and decision trees, which requires low computational costs, may enhance existing operational systems to help decision-makers in Sargassum risk management. Maréchal et al. (2017) restricted the starting point of their operational short-term forecast system to within 50-100 km of the LA coasts in order to reduce prediction errors. This geographical limit would correspond to a forecast period of 1-2 days before beaching. The present regional information on current dynamics leading to the arrival of Sargassum near the islands would be useful to extend this limit. In this way, it could be easier to anticipate the implementation of the resources needed to collect the Sargassum algae on the shorelines.
Data availability. Data from this research are not publicly available. Interested researchers can contact the corresponding author of this article.
Author contributions. The study was mainly conceptualized and written by DB and EB. RC1, RC2, and NS provided comments on the results and reviewed the manuscript. RC2 and NS helped with stranding observational data processing.
Competing interests. The authors declare that they have no conflict of interest.