Filtering method based on cluster analysis to avoid salinity drifts and recover Argo data in less time

Currently there is a huge amount of freely available hydrographic data, and it is increasingly important to access it efficiently and easily, provided with as much information as possible. Argo is a global array of around 4000 active autonomous hydrographic profilers. Argo data go through two quality-control processes: real time and delayed mode. This work presents a methodology to filter profiles within a given polygon using the even-odd algorithm, which allows analysis of a study area regardless of its size, shape or location. It also provides two filtering methods to discard only the real-time quality-control data that present salinity drifts, thus taking advantage of the largest possible amount of valid data within a given polygon. In the study area selected as an example, it was possible to recover around 80% of the real-time quality-control data that are usually discarded due to problems such as salinity drifts in the case of the first filter, and around 30% in the case of the second. This allows researchers to use either filter, or a combination of both, to obtain a greater amount of data within the study area of their interest in a matter of minutes, instead of waiting for the delayed-mode quality control, which takes up to 12 months to be completed.

TEOS-10 was applied, and the results were compared with the data from the DMQC of the Argo HAPs to review the quality of the DMQC data in the area.
The data and the number of hydrographic profiles within the TPCM were also analyzed; it was found that there are few profiles within the area and that around 30% of the data belong to the RTQC. The Argo manual (Argo Data Management Team, 2019) indicates that there are flags that establish the quality of the adjusted data in both quality controls, with the first flag being the best and the fourth being the worst. Tests were performed by plotting the TS diagrams using these flags, adding the density isolines and the water masses according to Portela et al. (2016). Although only the data with the best RTQC flag were used, salinity drifts still appeared, so it is not feasible to use these indicators alone to filter the data in RTQC. To increase the amount of available data, cluster analysis was applied, since two groups of data can be visually identified in the TS diagrams: those that follow the same patterns as the DMQC data and those that do not. Cluster analysis groups a set of objects in such a way that the objects of the same group are more similar to each other than to those of the other groups (Everitt et al., 2011). In this case, the aim is to separate the RTQC data into a group whose characteristics are similar to the DMQC data and other groups with salinity-drift problems.
To perform the cluster analysis, the unsupervised K-means classification algorithm was chosen. This algorithm partitions the data into k groups, minimizing the distance between each datum and the centroid of its group (Hartigan and Wong, 1979). The algorithm starts by placing the k centroids in the data space and assigning each datum to its closest centroid. Then, it updates the position of the centroid of each group to the mean of the data belonging to that group, and the data are reassigned to their closest centroids. This process is repeated until the centroids no longer change position. A distance-based algorithm was selected because the goal is to retain only the RTQC data closest to the DMQC data.
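The assign-update loop described above (assign each datum to the nearest centroid, recompute centroids as group means, repeat until convergence) can be sketched in a few lines of NumPy. The function name and the synthetic "salinity" data below are illustrative only and are not taken from the paper's cluster_qc code:

```python
import numpy as np

def kmeans(data, centroids, max_iter=100):
    """Lloyd's K-means: alternate assignment and centroid-update steps.

    data      : (n, d) array of observations
    centroids : (k, d) array with the initial centroid positions
    Returns (labels, centroids) once the centroids stop moving.
    """
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for each datum.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its group.
        new_centroids = np.array([
            data[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic "salinity" clusters.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(34.6, 0.01, (50, 1)),
                  rng.normal(35.0, 0.01, (50, 1))])
labels, cents = kmeans(data, centroids=np.array([[34.5], [35.1]]))
```

With the centroids seeded near each cluster, the two groups separate after a few iterations, which is the behavior the distance-based selection above relies on.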
Since it is necessary to indicate the number of k centroids when using K-means, a manual enumeration of the groups to be searched would be required. To automate this process and retrieve RTQC data, Algorithm 1 was programmed. Algorithm 1 receives the data from the DMQC and the RTQC, separates them by month in an array, and iterates over it. Within each iteration, it calculates the mid-ranges of each quality control and divides the data into two groups (using the mid-ranges as the starting positions of the centroids), up to a maximum of ten iterations, each time verifying whether there are DMQC data in both groups. If so, the algorithm stops and returns the data without grouping them; on the contrary, if only one group contains DMQC data, it associates the data of that group with the data at depths less than 1500 m, taking into consideration the month, the profiler code and the profile number, and replaces the group data with the associated data. The mid-ranges are used as the initial positions of the centroids to prevent them from being generated randomly. The procedure described above is the first filter of the RTQC data. To increase the reliability of the filtering, a second filter was created; the second filter additionally discards the profiles of the profilers that presented problems.
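One possible reading of the bisection loop above is sketched below, under the assumption that each measurement carries a flag marking whether it came from the DMQC: the data are repeatedly split into two groups with centroids initialized at the mid-ranges, the DMQC-containing group is kept, and the loop stops when DMQC data appear in both groups. All function names are hypothetical, and the monthly separation and the 1500 m association step of the actual Algorithm 1 are omitted:

```python
import numpy as np

def midrange(x):
    """Mid-range: midpoint between the minimum and the maximum."""
    return (x.min() + x.max()) / 2.0

def split_two(values, c0, c1, max_iter=100):
    """k=2 K-means on a 1-D array with fixed initial centroids."""
    cents = np.array([c0, c1], dtype=float)
    for _ in range(max_iter):
        labels = (np.abs(values - cents[0]) > np.abs(values - cents[1])).astype(int)
        new = np.array([values[labels == k].mean() if np.any(labels == k) else cents[k]
                        for k in (0, 1)])
        if np.allclose(new, cents):
            break
        cents = new
    return labels

def filter_rtqc(salinity, is_dmqc, max_splits=10):
    """Repeatedly bisect the data, keeping the half that contains DMQC data.

    Stops when a split leaves DMQC data in both halves (no further
    separation is possible) or after max_splits iterations.
    Returns a boolean mask of the admitted data.
    """
    keep = np.ones(len(salinity), dtype=bool)
    for _ in range(max_splits):
        vals, dm = salinity[keep], is_dmqc[keep]
        if (~dm).sum() == 0:          # no RTQC data left to separate
            break
        # Initial centroids at the mid-ranges of each quality control.
        labels = split_two(vals, midrange(vals[dm]), midrange(vals[~dm]))
        if dm[labels == 0].any() and dm[labels == 1].any():
            break                     # DMQC present in both groups: stop.
        good = 0 if dm[labels == 0].any() else 1
        mask = np.zeros(len(salinity), dtype=bool)
        mask[np.flatnonzero(keep)[labels == good]] = True
        keep = mask
    return keep
```

On synthetic data where drifting RTQC salinities sit well apart from the DMQC values, this keeps the DMQC data plus the RTQC data that overlap them, which matches the intent of the first filter.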
To test the above methods on a more extensive and irregular polygon, a web application was developed. The study area for this web application was delimited by the Exclusive Economic Zone (EEZ) of Mexico as an example (Fig. 2); the geographical locations of profiles from around the world are filtered by the PIP algorithm, and the data within this irregular polygon are automatically downloaded every 24 hours through the IFREMER synchronization service (Argo, 2020). In Figure 2, the blue line delimits the EEZ of Mexico and the yellow box delimits the TPCM. Every time new data from HAPs are downloaded, they go through a processing phase in which the data are cleaned and transformed to be integrated into the web application. For example, the variables of temperature and salinity are converted to conservative temperature and absolute salinity, as defined by the Thermodynamic Equation of SeaWater 2010 (TEOS-10), the current description of the properties of seawater. Afterwards, graphs and useful files are generated to show information about the HAPs and their profile data.
The web application was developed on a satellite map, to which tools were added for data management and visualization, such as drawing irregular polygons to define study areas within the main polygon, filtering data to display statistics and graphs according to the selected filter, and trajectory tracing, among others. RTQC data filtering was also implemented in the web application: the same irregular polygons with which statistical data are obtained can be used to indicate a study area in which to obtain as much data as possible without salinity drifts.

Results
The chosen PIP algorithm correctly filtered the measured profiles within the polygon (Fig. 3); in addition, establishing the ranges of maximum and minimum latitude and longitude of the polygon to discard the profiles measured far outside it allowed the PIP algorithm to evaluate only the profiles made near or inside the polygon. Figure 3a shows the geographical locations of the profiles from HAPs made within the polygon, filtered by the even-odd algorithm; in the same way, Fig. 3b shows the locations of the filtered data belonging to WOA18. The blue line represents the given polygon, and the locations of the filtered profiles inside and outside the polygon are represented by dots in red and black, respectively. Figure 4 shows the comparison of the TS (temperature and salinity) diagrams of the DMQC data and the WOA18 data. The DMQC and WOA18 data are located in the same water masses, and the data overlap at depths greater than 1500 m, which validates that the DMQC data follow the same patterns as the data from other international DBs. According to Portela et al. (2016), this region is made up of the California Current Water (CCW), Tropical Surface Water (TSW), Gulf of California Water (GCW), Subtropical Subsurface (SS) and Pacific Intermediate Water (PIW).
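The bounding-box pre-check and the even-odd (ray-casting) test described above can be sketched as follows; the function names are illustrative rather than taken from the paper's code:

```python
def point_in_polygon(lon, lat, polygon):
    """Even-odd rule: cast a ray to the right of (lon, lat) and count
    how many polygon edges it crosses; an odd count means inside.

    polygon: list of (lon, lat) vertices, closed implicitly.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does edge (x1,y1)-(x2,y2) straddle the horizontal line y = lat?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def filter_profiles(profiles, polygon):
    """Discard profiles outside the polygon's bounding box first, then
    apply the even-odd test only to the remaining candidates."""
    lons = [x for x, _ in polygon]
    lats = [y for _, y in polygon]
    lo_x, hi_x, lo_y, hi_y = min(lons), max(lons), min(lats), max(lats)
    return [
        (lon, lat) for lon, lat in profiles
        if lo_x <= lon <= hi_x and lo_y <= lat <= hi_y
        and point_in_polygon(lon, lat, polygon)
    ]
```

The bounding-box check is cheap, so for a polygon with hundreds of vertices (like the EEZ of Mexico) most of the world's profiles are rejected without ever running the edge-crossing loop.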
On the contrary, the RTQC data with the best quality flag present salinity drifts. The RTQC and DMQC data of the TPCM were plotted together in TS diagrams per month; some of the RTQC data caused salinity drifts in almost all the months (Fig. 5).
Figure 5 makes clear that the salinity drift in the RTQC data is considerable, and therefore these data are labeled as erroneous; however, it also shows that certain data follow the structure (shape) of the DMQC data. To avoid discarding the entire RTQC data set, cluster analysis is proposed. When cluster analysis is applied to all the RTQC data with the K-means algorithm and different values of k, the resulting groups mix data that show salinity drifts with data that follow the same patterns as the DMQC data; this is because, at depths less than 1500 m, the salinity data are more dispersed than at greater depths.
Considering that at depths greater than 1500 m the variations in salinity and temperature are imperceptible, the cluster analysis was performed with the salinity data measured at depths greater than 1500 m. The resulting groups are shown in Fig. 6a and b; one of them contains the data that follow the same patterns as the DMQC data, while the rest contain data with salinity drifts. The next step was therefore to associate the data of these groups with the rest of the data, taking into consideration the profiler code and the profile number, thus obtaining complete groups (Fig. 6c and d). Figure 6 shows how the groups are separated with the chosen algorithm. In the months of January and December, DMQC data are displayed as yellow dots, and the orange groups contain the RTQC data that follow the patterns of the DMQC data.
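The association step described above — extending the cluster label found for a profile's deep (>1500 m) measurements to all of that profile's measurements, keyed by profiler code and profile number — can be sketched like this (the function name and tuple layout are illustrative):

```python
def propagate_labels(deep_rows, all_rows):
    """Assign each full profile the cluster label found for its
    deep (>1500 m) measurements.

    deep_rows: list of (platform_code, profile_number, label) tuples from
               the clustering of the >1500 m salinity data
    all_rows : list of (platform_code, profile_number, salinity) tuples
               covering every depth of every profile
    Returns (platform_code, profile_number, salinity, label) tuples;
    rows whose profile received no deep label are dropped.
    """
    label_by_profile = {(p, c): lab for p, c, lab in deep_rows}
    return [
        (p, c, s, label_by_profile[(p, c)])
        for p, c, s in all_rows
        if (p, c) in label_by_profile
    ]
```

This is how the partial groups of Fig. 6a and b would become the complete groups of Fig. 6c and d: one deep label per profile, copied onto all depths of that profile.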
The blue, green and red groups contain the data showing salinity drifts.
To avoid manually indicating the number of k centroids, Algorithm 1 was developed. Figure 7 shows the first three iterations for the month of January as an example. In Figure 7a and b, the blue data represent the group that contains DMQC data, and the orange groups represent the RTQC data; the data contained in the orange groups are discarded. Figure 7c is the third iteration; both groups contain DMQC data, and therefore the algorithm stops.
The results of the first filtering of the proposed algorithm are shown in Fig. 8a; the filtered RTQC data show the same patterns as the DMQC data, except for the months of July, August and September. In July and August, the salinity drifts are found at depths less than 1500 m, while in September the drifts present values very close to the DMQC data, which prevents the algorithm from separating them. This filter allows obtaining a greater amount of admitted RTQC data but, as seen in the figure, it still shows salinity drifts in some cases. For this reason, the second filter was incorporated; Fig. 8b shows its results. Since it discards the profilers that have presented salinity drifts, a significant reduction in admitted RTQC data is observed, but these data no longer show salinity drifts. Table 1 shows the total measurements made in the TPCM area and the measurements filtered by the aforementioned algorithms.

The total usable data in the TPCM with the first and second filters represent ∼95% and ∼80% of the data, respectively, compared to the ∼70% that would be obtained by automatically discarding the RTQC data. By presenting this option to the researcher and filtering the RTQC data instead of discarding ∼30% of the total, only ∼5% would be discarded in the case of the first filter and ∼20% in the case of the second, which means a considerable increase in the data available for use. After all, the admitted data present characteristics similar to data that have already been evaluated with the DMQC; they have a high probability of not needing adjustments and could therefore be used in research without waiting for the DMQC to be applied to them.
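These percentages follow directly from the ∼70/30 DMQC/RTQC split and the recovery rates of each filter reported for the TPCM example (∼80% and ∼30% of the RTQC data); a quick arithmetic check:

```python
dmqc_share = 0.70   # share of data already validated by the DMQC
rtqc_share = 0.30   # share that would normally be discarded

# First filter recovers ~80% of the RTQC data, the second ~30%.
usable_filter1 = dmqc_share + 0.80 * rtqc_share
usable_filter2 = dmqc_share + 0.30 * rtqc_share

print(round(usable_filter1 * 100))  # ~94, i.e. ~95% of all data usable
print(round(usable_filter2 * 100))  # ~79, i.e. ~80% of all data usable
```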
Although in the first filter some months were not filtered in the desired way in the study area, the researcher may simply not use the data from those months, or may use the second filter if only the most reliable data are desired.
Also, the possibility of using a combination of both filters is not ruled out: if the researcher uses the months of the first filter that no longer present salinity drifts and the data of the second filter for the months that do, the largest possible amount of admissible data would be available in any study area.

The web application yielded interesting results, and it can be accessed through the cluster_qc library repository. In Figure 9, it is observed that the PIP algorithm correctly filters the profiles made within the EEZ of Mexico, even when the irregular polygon that comprises the study area is defined by more than 350 vertices. The blue line represents the given polygon, and the locations of the filtered profiles inside and outside the polygon are represented by dots in red and black, respectively. Once the data have been downloaded and transformed, statistics specific to the EEZ of Mexico can be obtained, such as the number of profilers within the polygon, the number of profiles or profilers per year, and the DACs to which these profilers belong, among others. Table 2 shows the profilers that carried out measurements within the given polygon in the month of November 2019. The table shows that there is a shortage of biogeochemical profilers within the polygon. These 4 biogeochemical HAPs are capable of measuring oxygen in addition to temperature and salinity, but none of their oxygen data satisfactorily finished the quality-control process, so they are not available. We can therefore conclude that within the Mexican EEZ there are no good-quality biogeochemical data from Argo HAPs.
For each of these profilers, the profiles of temperature (Fig. 10a) and salinity (Fig. 10b), the temperature-salinity (TS) diagram (Fig. 10c), the estimation of the profiler trajectory (Fig. 10d), and the profiles of temperature (Fig. 10e) and salinity (Fig. 10f) with respect to time were generated. These diagrams are basic for analysis in scientific ocean research; profiler 4901635 is shown as an illustrative example in Fig. 10. The satellite map of the web application is interactive: it shows the active and inactive HAPs, filters the data, and shows statistics, trajectories and diagrams (Fig. 11a), and it has other tools to facilitate the visualization and management of the data, such as displaying statistics of a given study area within the main polygon (Fig. 11b and c).
Finally, the filtering of RTQC data with patterns similar to the DMQC data is offered in the web application, so that filtering the data of a study area within the EEZ of Mexico requires no programming knowledge. Access to the web application is through the cluster_qc library repository.

Discussion
Despite the existence of reports on salinity drifts, such as the one announced by Argo Data Management on September 25, 2018, the real-time quality-control processes are not yet robust enough to identify them, since these processes are automatic and search for data that are impossible or outside the global and regional ranges. Therefore, the quality established by the flags