Defining Southern Ocean fronts using unsupervised classification

Oceanographic fronts are transitions between thermohaline structures with different characteristics. Such transitions are ubiquitous, and their locations and properties affect how the ocean operates as part of the global climate system. In the Southern Ocean, fronts have classically been defined using a small number of continuous, circumpolar features in sea surface height or dynamic height. Modern observational and theoretical developments are challenging and expanding this traditional framework to accommodate a more complex view of fronts. Here, we present a complementary new approach for calculating fronts using an unsupervised classification method called Gaussian mixture modelling (GMM) and a novel inter-class parameter called the I -metric. The I -metric approach produces a probabilistic view of front location, emphasising the fact that the boundaries between water masses are not uniformly sharp across the entire Southern Ocean. The I -metric approach uses thermohaline information from a range of depth levels, making it more general than approaches that only use near-surface properties. We train the GMM using an observationally constrained state estimate in order to have more uniform spatial and temporal data coverage. The probabilistic boundaries defined by the I -metric roughly coincide with several classically defined fronts, offering a novel view of this structure. The I -metric fronts appear to be relatively sharp in the open ocean and somewhat diffuse near large topographic features, possibly highlighting the importance of topographically induced mixing. For comparison with a more localised method, we also use an edge detection approach for identifying fronts. We find a strong correlation between the edge field of the leading principal component and the zonal velocity; the edge detection method highlights the presence of jets, which are supported by thermal wind balance. This more localised method highlights the complex, multiscale structure of Southern Ocean fronts, complementing and contrasting with the more domain-wide view offered by the I -metric. The Sobel edge detection method may be useful for defining and tracking smaller-scale fronts and jets in model or reanalysis data. The I -metric approach may prove to be a useful method for inter-model comparison, as it uses the thermohaline structure of those models instead of tracking somewhat ad hoc values of sea surface height and/or dynamic height, which can vary considerably between models. In addition, the general I -metric approach allows front definitions to shift with changing temperature and salinity structures, which may be useful for characterising fronts in a changing climate.

. In part because of its unique structure, the SO is a critical regulator of global climate, having thus far absorbed more than 75 % of the excess energy and 50 % of the excess carbon added to the climate system from anthropogenic emissions (Mikaloff-Fletcher et al., 2006;Frolicher et al., 2015). As such, the thermohaline structure of the Southern Ocean may be considered an important climate system parameter, as it affects how heat and carbon are partitioned between the atmosphere and ocean.
Through decades of observational and theoretical effort, the global oceanographic community has curated a detailed theoretical understanding of the structure of the Southern Ocean. One of the hallmarks of this view is the presence of fronts, i.e. transitions in temperature, salinity, and/or biogeochemical properties (Deacon, 1937;Orsi et al., 1995). Although fronts are not identical to the sharp jets found in the SO, fronts and jets at the mesoscale share a close relationship partly due to thermal wind balance Rintoul, 2002, 2009). Traditionally, oceanographers have defined SO fronts using a small number of continuous, circumpolar features that follow contours of sea surface height or dynamic height (Kim and Orsi, 2014). However, satellite altimetry shows that the ACC features a braided and meandering structure that is not necessarily reflected in the traditional, time-averaged view of fronts as continuous property contours (Chapman, 2017;Mackie, 2018). Using individual property contours to define fronts, for example, contours of temperature or sea surface height, is somewhat limited by the fact that such contours do not always line up with the locations of strong gradients (Thompson et al., 2010;Thompson and Sallée, 2012;Graham et al., 2012;Chapman, 2017). In response to more detailed SO observations, the global oceanographic community has been developing a variety of new approaches for defining and tracking fronts in more application-specific ways (Chapman et al., 2020). For example, coastal applications and open-ocean applications may benefit from conceptually different treatments of ocean fronts, which are characterised by different spatial and temporal scales. For a historical view and summary of advances in the area of front definition and detection, see the recent review article by Chapman et al. (2020).
In order to help us broaden our view of Southern Ocean fronts, we look to a branch of machine learning called unsupervised classification (also known as clustering). Broadly speaking, unsupervised classification attempts to identify subpopulations in data distributions that have not already been labelled or sorted. Although such methods have existed for decades, the amount of SO data has only in recent years become large enough for clustering approaches to be suitable; the application of unsupervised classification to oceanographic data is in its infancy. Several recent studies have used unsupervised classification to identify coherent regimes of thermohaline structure and the transitions between them, specifically in the North Atlantic (Maze et al., 2017), Southern Ocean , and Indian sector of the Southern Ocean (Rosso et al., 2020). These methods have also been used to define coherent dynamical and biogeochemical regimes from depth-averaged ocean structure Le Bras et al., 2019;Jones and Ito, 2019). Recently, unsupervised classification has been used to define coherent ecological regimes from physical and biogeochemical data (Sonnewald et al., 2020). Researchers are also exploring potential connections between changes in class properties and large-scale climate phenomena. For example, a recent study tied evolution in the longitudinal extent of an algorithmically defined class to the onset of El Niño, suggesting that unsupervised classification methods could complement existing index-based assessments of large-scale climate modes (Houghton and Wilson, 2020).
Unsupervised classification does not use specific property contours to define boundaries between thermohaline structures, so it avoids one of the fundamental limitations of many traditional front definition approaches. Given the required information, unsupervised classification methods can use more detailed thermohaline data from throughout the water column to define classes and their boundaries. Across a given front, one might expect to find not only a transition in surface values but also a change in the thermohaline structure, as indicated by a change of profile class with latitude and/or longitude. In this work, we use an unsupervised classification technique called Gaussian mixture modelling (GMM), which attempts to represent subpopulations in the data distribution using multidimensional Gaussian functions. Because GMM is a probabilistic method, in addition to automatically clustering the thermohaline profiles into classes, it returns a set of weights across the different classes for each data point. That is, it returns a probability distribution that can be exploited to define boundaries between coherent regimes in a novel way. In this paper, we propose that GMM can be used to represent the boundaries as "fuzzy" regions, which reflects the fact that not all transitions in the SO are uniformly sharp.
In Sect. 2, we introduce the observationally constrained state estimate from which we draw our temperature and salinity data (Sect. 2.1), discuss principal component analysis (PCA) for dimensionality reduction (Sect. 2.2), and cover our application of GMM (Sect. 2.3). We then define the interclass comparison metric (i.e. the I -metric) that we use to quantify water mass boundaries (Sect. 3.1). Next, we apply the I -metric to the reduced-dimension state estimate data (Sect. 3.2). For comparison, we contrast this method with a more local front detection approach (Sect. 3.4). Finally, we discuss some caveats (Sect. 4) and offer our summary and conclusions (Sect. 5). Our front identification method uses a combination of principal component analysis, unsupervised classification, and a new probabilistic metric to quantify the boundaries between coherent thermohaline structures. First, we describe the dataset that we used for developing and training our method.

The Southern Ocean State Estimate
We developed our method using the Biogeochemical Southern Ocean State Estimate (B-SOSE) (Verdy and Mazloff, 2017). B-SOSE is an observationally constrained numerical simulation created using MITgcm (https://mitgcm.org/, last access: 10 August 2021) (Marshall et al., 1997a, b) zloff et al., 2010;Forget et al., 2015). We chose to develop our method using a state estimate because such products offer (1) uniform coverage in latitude, longitude, and time as well as (2) relatively high fidelity with respect to observations. We chose B-SOSE, in particular, because it represents the Southern Ocean using a spatial resolution of 1/6 • , which is eddypermitting in the latitude range of the ACC, enabling a realistic representation of mesoscale eddy structure (Hallberg, 2013). We expect that training our model on the physics-only SOSE would produce similar results, although we did not attempt that here. In principle, our methods can be readily applied to any gridded temperature and salinity profile dataset. It may be possible to apply these methods to in situ data as well, if the user addresses the problem of non-uniform spatial and temporal sampling. In this paper, we focus only on applications to gridded datasets. To construct a state estimate, researchers bring a numerical simulation into better consistency with an observational dataset using the 4D-Var method. This method uses adjoint sensitivities to calculate the required changes in the "controls" (e.g. initial conditions, mixing parameters, boundary conditions) needed to improve the agreement between the simulation and the observational dataset (Stammer et al., 2002;Wunsch and Heimbach, 2007).
The B-SOSE domain extends from the Equator to 78 • S, but we only use data south of 30 • S to focus on the Southern Ocean and to avoid the model boundary. It uses bathymetry and coastline based on Amante and Eakins (2009). B-SOSE solves the heat, salt, and momentum equations using a third- order direct space and time advection scheme with a 1 h time step. The time-evolving atmospheric boundary conditions use bulk formulae to solve for fluxes of heat, freshwater, and momentum, with 6-hourly atmospheric state variables as inputs (Large and Yeager, 2009;Dee et al., 2011). The state estimation process iteratively adjusts the atmospheric state variables and oceanic initial conditions to improve modeldata agreement. B-SOSE uses dynamic sea ice (Losch et al., 2010;Fenty and Heimbach, 2013). For vertical mixing, it uses the GLL90 mixed layer parameterisation (Gaspar et al., 1990). It also uses horizontal and vertical viscosity and diffusivity. River runoff comes from the product of Dai and Trenberth (2002) augmented with an estimate of Antarctic freshwater input from iceberg and ice sheet melting (Hammond and Jones, 2016). It does not include mesoscale eddy parameterisation, as this particular configuration falls into the horizontal resolution range wherein mesoscale parameterisation may actually worsen the representation of the mesoscale (Hallberg, 2013). Because we are interested in quantifying physical, large-scale fronts, we only used monthly mean temperature and salinity data. Also, because we are not interested in the surface seasonal cycle at present, we only used temperature and salinity data between 300 and 2000 m, following Rosso et al. (2020). We used the whole period of iteration 106 of this state estimate, which covers January 2008 to December 2012. Some key properties of B-SOSE iteration 106 are listed in Table 1. For further details, see Verdy and Mazloff (2017).

Principal component analysis
Each vertical profile in the full B-SOSE dataset is comprised of temperature and salinity values at multiple depth levels, at every grid cell and every output month. Values close to each other in the water column are correlated to some degree. Therefore, we do not necessarily need values of potential temperature (θ ) and salinity (S) at every depth level to capture most of the variability, and reducing the dimensionality of the data can improve the convergence of the training process. One specific dimension reduction technique is prin-cipal component analysis (PCA), which identifies the functions that capture most of the variability with depth in the dataset. The result is a representation of the dataset as a linear combination of eigenfunctions (i.e. principal components), sometimes called a principal component expansion or principal component decomposition. Using this procedure, we can describe each profile using a small set of eigenvalues (i.e. coefficients of the principal component expansion) instead of a full set of temperature and salinity values. In addition to improving the speed and efficiency of the GMM algorithm, PCA reveals potential physical structures that may be useful for understanding the stratification of the SO (Pauthenet et al., 2017). We choose the number of principal components such that the percentage of variability explained (in a statistical sense) by the PCA expansion is sufficiently high for our purposes.
Following Rosso et al. (2020), we only keep values between 300 and 2000 m to exclude most of the surface seasonal variability from the dataset. Because the data are spaced on an irregular grid in the vertical direction, we first interpolate the temperature and salinity profiles onto a regular grid with 10 m cells in the vertical. Following Pauthenet et al. (2017), at each grid cell and time we concatenate the temperature and salinity profiles into a single vector. We normalise each depth level for both temperature and salinity separately: subtracting the mean and dividing by the standard deviation calculated for all time periods on that particular depth level and variable. That is, we standardise the temperature values at each level using the distribution of temperatures at that same depth level, and we standardise the salinity values using the distribution of salinities at that same depth level. This is a slightly different approach from Pauthenet et al. (2017), in which the authors standardise across the entire dataset. We found that, for the work shown in this paper, the choice of normalisation approach does not make a large difference in the results (not shown). After normalisation, we carry out PCA expansion. We keep the first three principal components (PCs), which together statistically explain 98 % of the variability across the thermohaline dataset. For completeness, we show the structure of the principal components in Appendix B.
The coefficients associated with PC1 indicate a broad division between polar, high-latitude Southern Ocean waters and the subtropics (Fig. 1a). The most negative PC1 coefficients are found in the Weddell Gyre, and we also see the imprints of the South Pacific Gyre and the ACC (Vernet et al., 2019). The coefficients of PC2 bear the imprint of the ACC and of its northward flow along the eastern Pacific Basin (Fig. 1b). This northward flow is associated with the formation and export of Subantarctic Mode Water and Antarctic Intermediate Water (Iudicone et al., 2007;Sallee et al., 2010;Jones et al., 2016). PC2 also has the imprint of the Agulhas Current around South Africa. Finally, PC3 has strong negative values in the Weddell Gyre and over most of the Pacific, with a band of circumpolar positive values that somewhat mirrors the southward drift of the ACC when considered from west to east. The spatial structure of PC1 and PC2 are largely consistent with those of Pauthenet et al. (2017), but the structure of PC3 is somewhat different from theirs, particularly in the subtropics. These differences are possibly a result of our choice of a different depth range. Given that PC3 explains a small fraction of the variability (7 % of the variance explained), we do not expect these differences to impact our results.
After we perform dimensionality reduction, each monthly output at each model grid cell in latitude and longitude is represented using the first three coefficients of the PC expansion. The three PC values contain combined information about both temperature and salinity, simplifying our analysis. This approach defines an abstract three-dimensional space in which we can perform unsupervised classification. In typical machine learning terminology, this abstract threedimensional space can be called the "feature space", in which each PC axis is a "feature". To be explicit, we can say that each combined temperature-salinity profile in latitude, longitude, and time is represented by a three-dimensional vector of PC values. Each three-dimensional PC vector derived from B-SOSE is an "observation". In the next section, we use unsupervised classification to identify subpopulations in the three-dimensional distribution of PC values.

Gaussian mixture modelling
Unsupervised classification attempts to identify subpopulations within a data distribution, without the assistance of any predefined labels. In our application, we attempt to identify data subpopulations in the abstract three-dimensional space defined by the PC coefficients (i.e. the "feature space"). Here, we use GMM, an algorithm that attempts to fit a set of multidimensional Gaussian functions to the data by iteratively adjusting the means and covariances of the Gaussians (McLachlan and Basford, 1988; see Appendix C for more detail). This method has recently been used to classify Argo temperature profiles in the top 2 km of the North Atlantic Ocean and the Southern Ocean (Maze et al., 2017;. GMM is well-suited to ocean applications because it offers a probabilistic measure of classification in the form of posterior probabilities, which is useful when working with a highly correlated dataset. Because GMM-derived clusters will likely feature some overlap due to the highly correlated nature of ocean data, such posterior probabilities offer an important complement to the GMM-derived class labels. In this application, we use the posterior probabilities to define coherent thermohaline regimes and their boundaries. The GMM method attempts to represent the underlying data distribution using a set of K Gaussian functions in D dimensions (in our case D = 3): Figure 1. Each combined temperature and salinity profile can be approximated using a three-term PC expansion. Above are monthly mean coefficients of the PC expansion from June 2011. In order to limit the influence of seasonal variability, we use temperature and salinity profiles between 300 and 2000 m. The first three PCs explain (a) 75 %, (b) 16 %, and (c) 7 % of the variance respectively, together explaining a total of 98 % of the variability. The white space represents bathymetry shallower than 2000 m, and its boundary is marked by a grey line. (1) where x ∈ R D×1 is a vector in the PC space, µ ∈ R D×1 is the centre of the Gaussian distribution expressed in vector form, k ∈ R D×D is the covariance matrix, and | k | is its determinant. The covariance matrix determines the orientation of the Gaussian ellipsoids in PC space. We model the dataset, in the statistical sense of representing the dataset using a probability distribution, as a weighted sum of Gaussians: where λ k is the weight associated with the kth Gaussian. The process of fitting the GMM uses expectation maximisation (EM), which consists of iteratively adjusting λ k , µ k , and k to decrease the model-data misfit. For additional details, see Appendix C.
Once the weights, means, and covariances are fitted, each data vector x is associated with a posterior probability distribution across all of the K classes. Although we kept the random seed used in the initial guess fixed for this paper, our results are robust to the choice of random seed (not shown). This distribution is the set of likelihoods that the data vector belongs to any particular class, and the probabilities sum to one. GMM assigns each data vector to the class with the maximum posterior probability. We will now use this distribution to define an inter-class metric, which gives us a novel perspective on fronts as transitions in thermohaline structures.
3 The inter-class comparison metric (the I -metric) First, we examine the structure of our profile data in PC space and introduce the I -metric for identifying boundaries between coherent hydrographic regimes (Sect. 3.1). Next, we examine the I -metric in both a monthly averaged and multiyear averaged view in latitude-longitude space, and we explore the class structure in more detail by examining the associated coherent regions and vertical profile types (Sect. 3.2). Following that, we compare our results with a local edge detection method (Sect. 3.4).

Defining the I -metric
For each combined temperature-salinity profile, GMM returns a probability distribution across all of the K classes. This distribution is called the posterior probability distribution, and it quantifies the probability that a particular profile is in a particular class. If the posterior probability is close to 1.0 for class k and very small for the other classes, then within the context of the Gaussian statistical model (i.e. GMM), the classification of the profile into class k is unambiguous and clear. However, if the posterior probability is close in value for the two classes with highest probabilities, then the classification is ambiguous and less clear. With this in mind, we can use the difference between the highest probability and the second-highest probability to quantify how clearly the profile has been classified. If the classification is unambiguous, then the profile is less likely to be associated with a boundary between coherent thermohaline regimes. If the classification is ambiguous, then the profile is more likely to be associated with a boundary. With this in mind, we propose a probabilistic inter-class comparison metric of the following form: where x n is the nth profile's PC values, and P(c = c k ) highest is the highest posterior probability that GMM has assigned the nth profile as belonging to class k. The term P(c = c l ) runner-up is the second-highest posterior probability belonging to class l. If the difference between the highest and runner-up poste-rior probabilities is close to one, then I is small. This would indicate that the profile is not likely to be associated with a boundary between thermohaline regimes. If the difference between the highest and runner-up posterior probabilities is small, then I is close to one, indicating that the profile is likely to be associated with a boundary between different thermohaline regimes. The I -metric offers an alternative method for defining boundaries as fuzzy transitions between coherent regimes. In general, some regions will feature sharp transitions across boundaries, whereas other regions will feature more gradual transitions. The relative sharpness of a transition is influenced by the processes that form, mix, and destroy water masses. In contrast with approaches that define fronts as sharp transitions located along property contours or local gradients, the I -metric approach allows for a wider variety of transition types between regimes. In our I -metric application, GMM clusters the profiles in feature space (Fig. 2a). The structure of the data shown in PC space is broadly consistent with that found in other studies (e.g. Pauthenet et al., 2017Pauthenet et al., , 2018Pauthenet et al., , 2019. The data distribution is reasonably well represented by a linear combination of multidimensional Gaussian functions (Fig. 2). The I -metric values indicate transition regions between classes, where the class labelling is relatively ambiguous (Fig. 2b). We choose K = 5 to represent the general, large-scale pattern of the data; we explore the sensitivity of our results to K in Sect. 4.4. In the next section, we examine the I -metric and class structure in physical space.

Geographic view of the I -metric
The I -metric viewed in latitude-longitude space illustrates the rich variety of transition types found in the Southern Ocean (Fig. 3). In all sectors of the SO, we see sharp transitions where the regions of high I values are narrow and more gradual transitions where the regions of high I values are more spread out. Some features are circumpolar, which is consistent with the view of SO fronts as continuous lines that encircle Antarctica. However, we also see regions where the continuity and circumpolar nature of the fronts is not as clear, suggesting that a broader view may be appropriate (Chapman et al., 2020). The fronts are not uniformly sharp across all longitudes; for example, the northernmost transition is broad and gradual in the Atlantic sector, sharp in the Indian sector, and relatively broad in the Pacific sector. The southernmost band of high I -metric values is relatively sharp in the Atlantic sector, becoming increasingly broad as we follow it into the Indian and Pacific sectors. In the Pacific sector, it extends into an especially broad region in the Amundsen Sea, which is consistent with the intersection of the classically defined southern boundary (SBdy) with the Antarctic continent (Kim and Orsi, 2014). Upstream of Kerguelen Plateau, there is a region where the I -metric is spread-out and diffuse between classes 2 and 3; this region also features a standing meander associated with enhanced eddy kinetic energy  (Frenger et al., 2015;Siegelman et al., 2019). The enhanced mesoscale eddy kinetic energy associated with the meander is consistent with increased lateral mixing and the spreadout pattern in the I -metric found in the same region. Closer to the Antarctic continent, we also see the imprints of both the Weddell Gyre and the Ross Gyre, in regions of coherent structures with low I -metric values, in part enforced by the gyre circulation.
The monthly mean I -metric (Fig. 3a) also highlights individual ring-like eddies; although these features are not typically considered fronts, they are small-scale transition regions between different hydrographic structures. We do expect the I -metric to be non-zero across these features. The monthly view also features mesoscale meanders, highlighting the detailed structure of the SO, which is partly a result of the energetic mesoscale eddy field. The I -metric does feature some month-to-month variability: in some locations, the fronts meander in their north-south extent, whereas in others, they are relatively stationary, likely due to bathymetric constraints (see animations in Thomas, 2021, in the "gifs" directory).
By averaging the 4 years worth of monthly means, we obtain a map of the climatological I -metric, which is averaged over many eddy lifetimes (Fig. 3b). Comparing an example monthly field with the climatological field, we can examine the imprint of eddy spatial variability and the meandering of the fronts on the I -metric pattern. Most of our observations about the metric are unchanged by this averaging; we identify three roughly circumpolar bands of high I -metric values, with significant spatial variability and some overlap. The three bands are fairly distinct in the Atlantic sector, with the northernmost transition being the broadest. Upstream of Kerguelen Plateau, the two northernmost bands become somewhat hard to distinguish. This is possibly a consequence of the eddy mixing and upwelling hotspot in that region, which tends to spread out hydrographic features in latitudelongitude space, increasing the degree of spatial correlation found there. Upstream of Kerguelen Plateau, the Polar Front features strong seasonal variability (Pauthenet et al., 2018). Note that the I -metric band aligned roughly with the Polar Front only passes south of the plateau (e.g. south of Heard Island), which is consistent with other studies of the subsurface component of the Polar Front (e.g. Pauthenet et al., 2018).
The three bands of higher I -metric values are distinct downstream of the Kerguelen Plateau in the Indian sector; notably, the southernmost band features especially high Imetric values in this sector. This pattern is associated with the transition between the Antarctic Circumpolar Current and the Antarctic Slope Current (ASC), which tend to flow in opposite directions (Thompson et al., 2018;Pauthenet et al., 2021). In the Pacific sector, we see the southernmost band turn into the Amundsen Sea and intersect with the Antarctic continental slope, spreading out in a diffuse region that is consistent with the behaviour of the southernmost extent of the ACC, the eastern boundary of the Ross Gyre, and the eastward shelf circulation along the West Antarctic Peninsula (Nakayama et al., 2018). In this same sector, two large regions of low I -metric values spatially correspond to export pathways of Subantarctic Mode Water and Antarctic Intermediate Water (Iudicone et al., 2007;Sallee et al., 2010Sallee et al., , 2012Jones et al., 2016). The higher I-metric values delimit the edges of these more coherent thermohaline regimes (Fig. 3b), which are influenced by basin-scale stratification and the structure of the South Pacific Gyre.

Properties of the thermohaline regimes
In order to better understand the coherent thermohaline regimes underlying our I -metric results, we examine their lateral extents and their vertical properties. Despite not being given any latitude or longitude information, the underlying GMM captures several coherent, large-scale features of Southern Ocean thermohaline structure (Fig. 4a). Class 1 contains the coldest waters in the SO, covering both the Weddell and Ross gyres near Antarctica. The mean profile in this class features cold temperatures that are nearly uniform with depth; in general, they are salt stratified in that the near-surface waters are fresher than the subsurface waters, ensuring that the density profile is stable overall (Fig. 5). The boundary between class 1 and class 2 broadly lies between the classically defined southern ACC Front (SACCF) and the southern boundary (SBDY) (Kim and Orsi, 2014), including the turn of the SBDY towards almost being perpendicular with the Antarctic continent in the Pacific sector of the SO. Class 2 is circumpolar, with excursions into the Amundsen Sea and the area just south of Kerguelen Plateau. It features salt stabilisation, with a fresh layer near 300 m (Fig. 5). Class 3 is also circumpolar, with a northward excursion in the Atlantic sector. The boundary between classes 2 and 3 roughly follows the Polar Front (PF), separating the colder, fresher Antarctic waters from the warmer, saltier subtropical waters further north. Finally, there are two subtropical classes: class 4 represents the Atlantic and Indian sectors of the subtropics, and class 5 represents the large-scale South Pacific Gyre. The boundary between classes 3 and 4 roughly aligns with the Subantarctic Front (SAF), particularly over large portions of the Indian and Pacific sectors (Fig. 4). The mean of class 5 has a salinity minimum around 700 m, corresponding to the presence of the Antarctic Intermediate Water layer (Iudicone et al., 2007).

An edge detection approach towards identifying fronts
For comparison with the GMM method, which uses properties of an entire training dataset to detect changes in thermohaline structure, we use a more local front detection method implemented by Hjelmervik and Hjelmervik (2019) in the North Atlantic. This method, called the Sobel method, directly examines spatial gradients in the principal component fields using a Sobel operator (Duda and Hart, 1973). To do this, the PCs of each grid point are placed onto a rectangular grid with the same spacing as the data sampling, where points without data are masked. The strength of an edge at a point is found by the two dimensional convolution (represented by  salinity. This is calculated from the profiles classified using the statistical model fitted on the training data itself. The central line is the mean, and the envelope on either side indicates 1 standard deviation. * ) of the gridded PCs and the following two matrices. In the x direction the Sobel operator is and in the y direction the Sobel operator is The effect of this operator is similar to a gradient operator with some smoothing. There is a correlation coefficient of 0.99 between G y * PC1 and the y gradient, and there is a correlation coefficient of 0.999 between G x * PC1 and the x gradient of PC1. The motivation for using the Sobel operator rather than the gradient operator is principally that it can reduce the noise in data, as shown by application to photographs (Vincent and Folorunso, 2009). Hjelmervik and Hjelmervik (2019) used the magnitude of the x and y Sobel operators, which approximates the magnitude of the gradient, to examine fronts in the Arctic and North Atlantic. They show that the magnitude of the Sobel gradient can be thresholded to highlight features such as the Gulf Stream. Rather than working with the gradient magnitude, Fig. 6 shows G y * PC1, G y * PC2, and G y * PC3 alone. This is more interpretable, as the G y * PC1 component is strongly correlated with the zonal velocity U (r = 0.85). Hjelmervik and Hjelmervik (2019) use a threshold value to define fronts, but we plot the gradient directly as a colourmap for each PC instead, which is useful as it does not obscure any information about the fronts themselves. Appendix A shows that the correlation between the Sobel G y gradient of PC1 and the meridional velocity, V , and the correlation between G x * PC1 and zonal velocity U increase for roughly the first 2 years of B-SOSE iteration 106, suggesting that the model is still spinning up to geostrophic balance.
The GMM and Sobel methods are complementary. GMM reveals the large-scale temperature and salinity structure associated with changes in stratification, which has traditionally been used to define the fronts, whereas edge detection methods like the Sobel method used here reveals the smallerscale structure of multiple jets, which can merge and separate. As such, both approaches may be useful ways of characterising ocean structure without making ad hoc assumptions related to particular property values or strict requirements that the structures be circumpolar and continuous. The present proliferation of front definition and analysis methods is driven by the need to expand how the oceanographic community deals with ocean structure across a wide variety of spatial and temporal scales (Chapman et al., 2020).

Discussion
In this section, we discuss the sensitivity of our results to our choice of dataset (Sect. 4.1), touch on the temporal variability in our results (Sect. 4.2), discuss a possible connection with the Antarctic Slope Current (Sect. 4.3), examine the sensitivity of the results to the choice of the number of classes K (Sect. 4.4), and discuss the interpretation of posterior probabilities (Sect. 4.5).

Sensitivity to choice of dataset
We chose to use B-SOSE data for this study in order to (1) work with a dataset that features relatively uniform latitude-longitude coverage and (2) to allow us to examine temporal variability as well as spatial variability. B-SOSE is an observationally constrained estimate of the hydrographic structure of the Southern Ocean, so it accurately captures many features of large-scale and mesoscale structure (Verdy and Mazloff, 2017). However, because B-SOSE is a numerical model run, it will no doubt have some biases with respect to observations, particularly on smaller scales. We expect that our results would not change dramatically on basin-wide scales across different state estimate and reanalysis products.
To examine the differences of this bias on the class structure and the structure of the inter-class comparison metric, this study could be repeated with a purely observational dataset such as Argo. One trade-off for such a study would be the fact that observational datasets are relatively sparse in terms of both spatial and temporal coverage relative to a state estimate or other numerical model run. One could attempt to use the same GMM trained on B-SOSE with Argo data, but possible biases between B-SOSE and the Argo dataset could make this challenging. It might be possible to adjust for those biases in the data cleaning and preparation step of the analysis; the standardisation process, which is already a part of the analysis presented here, is a step towards this bias removal and correction that may facilitate comparisons between models and observations. Alternatively, one could attempt to retrain the GMM using Argo data alone. This has been done in other studies, so it should be possible in principle (e.g. Maze et al., 2017;Rosso et al., 2020).

Temporal variability of the fronts
We found that the class structure and boundary positions did not feature large temporal variability with respect to the mean state, but much more work could be done to examine this variability and its connection to the processes that determine thermohaline structure (e.g. surface forcing, subsurface mixing, and advection). This is outside the scope of the present study, which is focused on proposing a new metric for identifying and tracking boundaries in Southern Ocean structure.

The Antarctic Slope Current
The Antarctic Slope Current (ASC) that separates warmer open-ocean waters from the colder waters on the Antarctic continental shelf is an important component of heat transport in the Southern Ocean. It acts to control the flow of warm water onto the continental shelf and eventually under the floating ice shelves. In a recent paper, Thompson et al. (2020) suggest that if the source of the Antarctic Slope Current (ASC) intersects with the ACC in the Bellingshausen Sea, then the ASC source would be considered a major component of the overturning circulation. In our study, we found a diffuse boundary between classes in the Bellingshausen Sea region, which may be relevant for the physical context of the ASC, which is still under investigation (Fig. 4b).

Sensitivity to the maximum number of classes
In this study, we chose K = 5 as the number of classes based on sensitivity tests and also based on a priori knowledge. Specifically, previous studies used a front structure with five broad regions, delineated by four fronts, so we might expect a value of around K = 5 based on this (e.g. Orsi et al., 1995;Pollard et al., 2002;Kim and Orsi, 2014).
Generally, the choice of the maximum number of classes K can be thought of as a way to select models of varying degrees of complexity. Statistical models with lower K values are potentially easier to interpret, only capturing the most dominant structures in the dataset. For example, the probabilistic boundary between the two classes in a K = 2 statistical model roughly separates colder, fresher Antarctic waters from the warmer, saltier subtropical waters (Fig. 7a). Notably, in this case, the magnitude of the I -metric appears to largely decrease as we follow it from the Atlantic and Indian basins and into the Pacific Basin, indicating that the boundary becomes less sharp with longitude. This possibly reflects the fact that the Pacific Basin hosts some of the dominant northward export pathways of Subantarctic Mode Water and Antarctic Intermediate Water, consistent with a less sharp transition between polar and subtropical waters (Iudicone et al., 2007;Herraiz-Borreguero and Rintoul, 2011;Jones et al., 2016). A statistical model with K = 4 retains most of the features of our analysis with K = 5, but the transition region closest to Antarctica in K = 5 is no longer present.
The K = 5 statistical model that we used in this work captures near-Antarctic and circumpolar structure, as well as some subtropical structure. A more complex statistical model with higher K would capture more of the subtropical structure (not shown). This is consistent with sensitivity studies using temperature-only Argo data, where increasing K added details to the subtropical class structure while leaving the circumpolar class structure largely unchanged . Statistical models with much higher K values may capture more structure in the data, but increasing K also risks overfitting. That is, if we tune the GMM statistical model to match an increasing number of structures in PC space, we risk losing generality; the goal is to represent the dominant structures of the dataset without overfitting every small variation, some of which could represent noise in the data. This has a direct analogue with overfitting in terms of simple statistical models; it is unwise to use a 10th-order polynomial when a quadratic captures the dominant features of the dataset, because the higher-order polynomial is less likely to generalise to other similar datasets. In addition, statistical models with very high K values are increasingly difficult to interpret in terms of our current physical and biogeochemical understanding. Note that regional studies, such as those carried out in specific sectors of the SO, may find it useful to increase K based on local structure (e.g. Rosso et al., 2020). This is consistent with the suggestion by Chapman et al. (2020) that front definitions may need to be more flexible and region-specific, as opposed to expecting a particular definition to apply globally (or even across a single ocean basin).

Interpreting posterior probabilities
The posterior probabilities returned by a Gaussian mixture model are affected by our choice of K. We should be careful not to over-interpret the posterior probabilities as confidences in the correctness of the assigned labels. Notably, GMM does not indicate the probability that a given profile belongs to none of the classes in a given statistical model. With that in mind, we can interpret the posterior probability as a measure of unambiguity in the context of a given statistical model. When one probability is larger than all others with some margin, the profile is unambiguously classified, while probabilities of similar magnitude indicate that the profile cannot be unambiguously classified in the current statistical model with the specified number of classes. In this study, we used the posterior probability distribution to iden- tify boundaries between coherent thermohaline regimes, taking advantage of this property of GMM.

Conclusions
In this study, we proposed a new metric for defining and identifying boundaries between coherent regimes of temperature and salinity structure. Our method uses Gaussian mixture modelling, a type of unsupervised machine learning, to establish a statistical model of thermohaline structure that is intended to capture the large-scale features of the dataset in both PC space and in geographic space. We developed our method in the Southern Ocean due to the presence of circumpolar structures and relatively clear fronts, but our approach could be applied to other regions or even to the global ocean as a whole. The I -metric provides a flexible, probabilistic method to define and identify boundaries in an oceanographic dataset without using ad hoc property contours; the boundaries are derived in a generalised method that reflects the structure of the dataset. The I -metric has potential as a method for comparing different observational and numerical modelling datasets in a robust, algorithmic way that is not heavily affected by biases in the mean state between datasets. It features a parameter K that allows users to increase and decrease the level of complexity of the statistical model; the optimal value of K will vary between applications. The Sobel edge detection method may be useful for defining and tracking smaller-scale fronts and jets in model or reanalysis data. As discussed in Chapman et al. (2020), the field of oceanography needs to consider fronts and boundaries in a more general, application-specific way, due in part to the richness of ocean structure on different spatial scales. The I -metric was designed with this problem in mind; it is intended to be a complementary addition to the oceanographic toolbox as opposed to a replacement for any particular method.
Appendix A: The relation between edge detection and the velocity field Figure A1. A comparison between the G x * PC1 Sobel edge detection field and the zonal velocity, U , at 2 m. Panels (a) and (b) are from the monthly mean over June 2011, whereas panels (c) and (d) are the mean over all of the monthly means in the dataset. Figure A2. The correlation between G x * PC1 and the zonal velocity, U , at 2 m for each monthly mean in the B-SOSE iteration 106 dataset. The increase in the correlation over the first 2 years could be interpreted as the spin-up.  Figures A1 and A3 illustrate the spatial resemblance between G y * PC1 and U and between G x * PC1 and V respectively, compared over June 2011 or as an average over the full B-SOSE period. The domain-averaged correlation is shown quantitatively in Figs. A2 and A4, where there is an especially high correlation between the two in the last couple of years of the reanalysis product. That Figs. A2 and A4 show opposite signs in the correlation is equivalent to the reversal in sign that we would expect (Cushman-Roisin and Beckers, 2011, chap. 15 and 18). Those figures also show that the magnitude of the correlation between the respective variables increases during the first 2 years of the dataset before flattening off. This is suggestive of the model spinning up towards geostrophic balance. As the first principal component statistically explains the first-order structure in the ocean, it primarily represents the density contrast produced by the thermohaline structure from the tropics to the poles.

Appendix B: Principal component structure
In this work, we use principal component expansion for dimension reduction and to examine the structure of the Southern Ocean. In this appendix, we display the principal components (i.e. eigenvectors) used in this expansion. First, we examine the means of the temperature and salinity structure across the entire dataset (Fig. B1). The temperature decreases with depth, whilst the salinity has a minimum around 750 m, in part associated with the presence of Antarctic Bottom Water (AAIW).
The structure of the first three principal components (i.e. eigenvectors) reflects small variations on the mean structure (Fig. B2). The mean profiles are similar, but the variation associated with an increase or decrease in the principal component value changes with depth. The first principal component (PC1) explains 76 % of the variability in the dataset, notably consisting of variations throughout the mid-range of the profiles in temperature and throughout nearly the entire depth range in salinity. The second principal component explains 16 % of the variability and consists of larger changes in the upper part of the profile, above roughly 1000 m. The third principal component (PC3) explains 7 % of the variance and exhibits variations above and below a somewhat fixed mid-point. After principal component expansion is applied, each profile is represented by just three numbers, i.e. the eigenvalues of the principal component expansion. Figure B1. The means and standard deviation of samples taken from B-SOSE iteration 106. Figure B2. The temperature and salinity components of the three retained principal components that were used in this work. Specifically, they are PC1 (left column), PC2 (middle column), and PC3 (right column). Shown are the mean structures (black lines) with the effect of adding (red line) or removing (blue line) one unit of a principal component as a deviation from the mean profile, after Fig. 4 in Pauthenet et al. (2017). When compared to Fig. 1 we can see that PC1 corresponds to the hot-cold north-south contrast. 1558 S. D. A. Thomas et al.: Southern Ocean front detection Appendix C: Gaussian mixture modelling A Gaussian mixture model (GMM) attempts to represent a dataset using a linear combination of multidimensional Gaussian distributions. A multidimensional Gaussian (Eq. C1) is a simple generalisation of a Gaussian to D dimensions.
where k is the index for the kth cluster of K clusters, n is the index for the nth data point of N data points, and D corresponds to the three principal components of the data. We make the assumption that the probability distribution that generated the dataset can be approximated by a set of multivariate Gaussians (Eq. C2): Any probability distribution function (PDF) could be described by an arbitrarily large number of Gaussians (Eq. C3), but to be a good method of describing the data, this should be a manageable number.
In this paper, we showed that our Southern Ocean thermohaline dataset can be fairly represented as a series of plateaulike regions in PC variable space; thus, it can be approximated by a PDF made from a set of multivariate Gaussians, where the boundaries between these Gaussians correspond to the fronts (Fig. 2).

C1 Expectation maximisation
To initialise the method, the first K clusters are created randomly. Next, the set of Gaussians is iteratively adjusted (Eqs. C4, C5, and C6) until it reaches a local minimum in the cost function (Maze et al., 2017). It is generally expected that reducing the number of dimensions in the preprocessing step helps improve the convergence. The following section draws heavily from Maze et al. (2017).
The expectation of the model given the data is increased by updating the weights λ k , means µ k , and covariance matrices k in the following way: λ k (t+1) = 1 N N n=1 P c n = k | x n ; λ k , µ k , k (t) , (C4) µ k (t+1) = N n=1 P c n = k | x n ; λ k , µ k , k (t) x n N n=1 P c n = k | x n ; λ k , µ k , k , (C5) where c n is the classification of the nth cluster which could be any one of the K clusters. This is repeated until the parameters of the model have converged.

C2 Information criterion
GMM needs an input hyperparameter K that sets the number of clusters that will be fitted to the data. GMM is relatively cheap to run, and so it is reasonable to run it with a large range of K and choose the K which best describes them. The commonly used criterions are the Bayesian information criterion (BIC) (Eq. C7) and the Akaike information criterion (AIC) (Eq. C8). They both essentially contain a term that measures the agreement of the model to the data, and they have a penalty term for the number of parameters that have been used to achieve this (related to K). Thus, we are looking for minima in AIC/BIC to guide our choice of K. There is no clear minimum for this dataset in K for 2 ≤ K ≤ 100, which is typical of oceanographic applications due in part to the highly correlated nature of the data (e.g. Sonnewald et al., 2019;. Because K is weakly constrained, we are able to select a lower value of K for ease of interpretation, having verified that it captures the large-scale structure of the data in PC space, which is suitable for our purposes. BIC and AIC take the following forms: with η f = K − 1 + KD + KD(D − 1) 2 , and where the log-likelihood is expressed as L = log[P(X)] = N−1 n=0 log K k=1 λ k N x n ; λ k , µ k , k . (C9)

C3 Labelling the dataset
Each data point is assigned a posterior probability distribution across the K clusters (Eq. C10). This uncertainty information is one of the useful features of GMM. The probability takes the following form: P c n = k | x n ; λ k , µ k , k = λ k N x n ; µ k , k K k=1 λ k N x n ; µ k , k .
To label a dataset, each data point is assigned a label from the cluster that it would be the most likely to be generated by, in a statistical sense (Eq. C11). C = arg max k P c n = k | x n ; λ k , µ k , k , 1 : k (C11) Code and data availability. B-SOSE iteration 106 state estimate data are available from the Scripps Institution of Oceanography (http://sose.ucsd.edu/BSOSE6_iter106_solution.html, Verdy and Mazloff, 2021 Thomas, 2021). This software uses scikit-learn (Pedregosa et al., 2011) and pyxpcm (Maze, 2020) as foundations. We used Cartopy for mapping (Met Office, 2010. Author contributions. DCJ designed the initial project, and SDAT and DCJ developed it further. SDAT wrote the software (Thomas, 2021), performed the analysis, and created the figures. AF proposed a significant improvement to the inter-class metric. EM and EP provided expert guidance on fronts, Southern Ocean structure, and dynamics. SDAT and DCJ wrote the initial manuscript, and all authors assisted with edits.
Competing interests. The contact author has declared that neither they nor their co-authors have any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.