Four-dimensional temperature, salinity and mixed-layer depth in the Gulf Stream, reconstructed from remote-sensing and in situ observations with neural networks


Abstract. Despite the ever-growing number of ocean data, the interior of the ocean remains undersampled in regions of high variability such as the Gulf Stream. In this context, neural networks have been shown to be effective for interpolating properties and understanding ocean processes. We introduce OSnet (Ocean Stratification network), a new ocean reconstruction system aimed at providing a physically consistent analysis of the upper ocean stratification. The proposed scheme is a bootstrapped multilayer perceptron trained to predict simultaneously temperature and salinity (T − S) profiles down to 1000 m and the mixed-layer depth (MLD) from surface data covering 1993 to 2019. OSnet is trained to fit sea surface temperature and sea level anomalies onto all historical in situ profiles in the Gulf Stream region. To achieve vertical coherence of the profiles, the MLD prediction is used to adjust a posteriori the vertical gradients of predicted T − S profiles, thus increasing the accuracy of the solution and removing vertical density inversions. The prediction is generalized on a 1/4° daily grid, producing four-dimensional fields of temperature and salinity, with their associated confidence interval issued from the bootstrap. OSnet profiles have root mean square error comparable with the observation-based Armor3D weekly product and the physics-based ocean reanalysis Glorys12. The lowest confidence in the prediction is located north of the Gulf Stream, between the shelf and the current, where the thermohaline variability is large. The OSnet reconstructed field is coherent even in the pre-Argo years, demonstrating the good generalization properties of the network. It reproduces the warming trend of surface temperature, the seasonal cycle of surface salinity and mesoscale structures of temperature, salinity and MLD. While OSnet delivers an accurate interpolation of the ocean stratification, it is also a tool to study how the ocean stratification relates to surface data.
We can compute the relative importance of each input for each T − S prediction and analyse how the network learns which surface feature most influences which property, and at which depth. Our results demonstrate the potential of machine learning methods to improve predictions of ocean interior properties from observations of the ocean surface.

Introduction
In situ observations of the ocean vertical structure are accurate but sparsely distributed in time and space, hampering the study of mesoscale features (Siegelman et al., 2020a) and the computation of large-scale integrated variables such as ocean heat content (Wang et al., 2018; Durack et al., 2014). Meanwhile, the ocean surface has been observed at high temporal and spatial resolution with satellites since the early 1990s. Remote sensing allows us to observe the surface signature of mesoscale to submesoscale features (Siegelman et al., 2020b) and to track climatic trends of sea surface height (Nerem et al., 2018), temperature (Merchant et al., 2019) and salinity (Reul et al., 2020). It is therefore highly valuable to combine sparse in situ profiles and high-resolution remote sensing observations in order to predict the ocean stratification at higher resolution and frequency.
This problem can be approached from two main points of view. First, the physical approach aims at constraining a global circulation model with all available observations (e.g., Lellouche et al., 2021; Forget et al., 2015). Numerical models have the advantage of offering a product that is physically consistent but that can contain drifts and biases (Stammer, 2005). Data assimilation is a practical means of reducing the spurious model drifts and biases, but the model can still diverge from observations and even drift into uncharted states in poorly sampled regions (Forget et al., 2015). Second, the statistical approach aims at finding the empirical relationship between the surface ocean and the interior. The simplest method is to use a multiple linear regression between sea-level anomaly (SLA), sea surface temperature (SST) and T − S profiles (Jeong et al., 2019). According to Guinehut et al. (2012), this method can only reconstruct 50 % to 30 % of the temperature and 20 % to 30 % of the salinity at depth. An improvement of the linear reconstruction method is to first reduce the T − S profiles and to relate the reduced variables to the satellite data. Indeed, it was found that only a few modes are needed to explain most of the variance or covariance of the temperature fields (Meinen and Watts, 2000) or of combined T − S profiles using the gravest empirical mode (GEM) projection (Sun and Watts, 2001). The GEM technique is a projection of hydrographic profiles onto a geostrophic stream function plane, which was used to estimate the four-dimensional structure of the Southern Ocean (Meijers et al., 2010). However, it requires that each dynamic height be associated with just one T − S profile at each longitude, so outside of the Antarctic Circumpolar Current or boundary currents the approach is questionable. Buongiorno Nardelli and Santoleri (2005) developed the multivariate Empirical Orthogonal Function Reconstruction (mEOF-r) based on a similar idea.
It is a linear system that uses surface data to predict the three leading modes of the EOFs applied to profiles of temperature, salinity, and geopotential thickness. They later showed that mEOF-r is outperformed by an artificial neural network for the North Atlantic region (Buongiorno Nardelli, 2020).
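The reduced-space regression idea behind mEOF-r can be sketched with synthetic data: project the profiles onto a few leading EOF modes, then linearly regress the reduced variables on surface predictors. This is an illustrative sketch with random stand-in arrays, not the actual mEOF-r implementation (which couples three variables and uses its own normalization); all array sizes and names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 500 profiles on 51 depth levels, 3 surface predictors
# (e.g. SST, SLA, MDT). Sizes are illustrative only.
n_prof, n_depth, n_surf = 500, 51, 3
profiles = rng.normal(size=(n_prof, n_depth))
surface = rng.normal(size=(n_prof, n_surf))

# Step 1: reduce the profiles to their three leading EOF modes (via SVD).
mean_prof = profiles.mean(axis=0)
anomalies = profiles - mean_prof
_, _, vt = np.linalg.svd(anomalies, full_matrices=False)
eofs = vt[:3]                      # (3, n_depth) leading vertical modes
pcs = anomalies @ eofs.T           # (n_prof, 3) principal components

# Step 2: linearly regress the reduced variables on the surface data.
coef, *_ = np.linalg.lstsq(surface, pcs, rcond=None)

# Prediction: surface data -> PCs -> reconstructed profile.
pcs_hat = surface @ coef
profiles_hat = mean_prof + pcs_hat @ eofs
```

The key design point is that the regression acts in a three-dimensional reduced space instead of on 51 depth levels directly, which is why only a few modes explaining most of the variance are needed.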
Machine learning approaches are increasingly used to deal with the ever-growing stream of geospatial data (Reichstein et al., 2019; Sonnewald et al., 2021; Wang et al., 2019). More specifically, deep learning methods are characterized by artificial neural networks (NNs) usually involving more than two hidden layers. They exploit feature representations learned exclusively from data. Multiple studies recently presented deep learning methods for reconstructing hydrographic profiles from satellites. Proof-of-concept papers established the important capabilities of self-organizing maps (SOM; e.g., Charantonis et al., 2015; Gueye et al., 2014) and of feed-forward or long short-term memory (LSTM) neural networks for hydrographic profile predictions (e.g., Lu, 2019; Jiang et al., 2021; Contractor and Roughan, 2021; Buongiorno Nardelli, 2020; Su et al., 2021; Sammartino et al., 2020). NNs can also efficiently reconstruct Argo interpolated fields (Gou et al., 2020; Meng et al., 2021). A recent study focused on predicting the mixed-layer depth (MLD) from satellites using probabilistic machine learning (Foster et al., 2021). However, to our knowledge these deep learning studies do not explore the vertical coherence of the predicted profiles, i.e., the presence of density inversions and the accuracy of the MLD prediction. The presence of density inversions makes an ocean product more difficult to use to initialize regional forecast models, and statically unstable profiles have to be removed when using a product to analyse ocean dynamics (e.g., New et al., 2021). The accuracy of the MLD prediction also has large implications for the pertinence of an ocean product. Indeed, the MLD and the strength of the underlying stratification regulate the rate at which the ocean exchanges heat and gas with the atmosphere, which directly impacts our climate (Sallée et al., 2021).
To understand and quantify ongoing climate changes, we need to document the variability of the vertical gradients of temperature, salinity and density in the water column. Physically consistent products, in the spirit of the MIMOC climatology (Schmidtko et al., 2013) but with higher resolution in time and space, are required to validate models used for climate projections. In particular, western boundary currents such as the Gulf Stream play a major role in climate variability by carrying warm and salty near-surface waters northwards (Smeed et al., 2018) and by deeply impacting the atmosphere (Minobe et al., 2008). In the present paper we focus on the Gulf Stream region for its challenging high variability and its dense in situ sampling coverage.
Here we present a method to estimate the ocean stratification from surface data and the associated confidence intervals using a prediction model fitted with in situ historical data. We train a NN to predict temperature and salinity (T − S) profiles down to 1000 m and the MLD, in the Gulf Stream region, from satellite data covering 1993 to 2019. Our goal is to combine MLD, T and S predictions to produce T − S profiles that are physically consistent. Model training is done on raw in situ profiles alone (not interpolated fields), and predictions are generalized on a grid with 1/4° horizontal resolution and daily time steps. Our framework further delivers a quantification of uncertainties through a confidence interval of the model prediction as well as the relative importance of each input variable. The proposed reconstruction method and resulting product are named OSnet for Ocean Stratification network.
The paper is organized as follows. Section 2 introduces the datasets used as inputs and outputs of OSnet as well as the products used as benchmarks to evaluate the performance of our reconstruction. Section 3 presents the method, composed of the neural network and the MLD adjustment. Section 4 evaluates the accuracy of OSnet predictions and presents property maps and sections; OSnet profiles are compared to mooring time series, and the relative importance of each input for each output is analysed. In Sect. 5 we explore the potential of OSnet by estimating profiles from synthetic satellite data. Our conclusions are presented in Sect. 6.

Temperature and salinity in situ profiles
We use the in situ temperature and salinity (T − S) vertical profiles sampled by Argo floats (Argo, 2022) and ships from the CMEMS quality-controlled Coriolis Ocean Dataset for Reanalysis (CORA) database (Cabanes et al., 2013; Szekely et al., 2019). We keep only the profiles extending at least from 25 to 1000 m for the period 1993 to 2019, totaling 67 767 T − S profiles for the region 80 to 30° W and 23 to 50° N. All the profiles that do not reach 1000 m or start deeper than 25 m are discarded. Profiles are interpolated on an uneven vertical grid with 51 levels, with spacing increasing with depth (27 levels are within the first 100 m, leading to a vertical resolution of 1 m in the upper levels and 45 m at 1000 m depth). Profiles without data at the surface are extrapolated by repeating the shallowest observation point. There is little seasonal bias in the distribution of data, with 5647 profiles per calendar month on average, a minimum of 5027 in February and a maximum of 6257 in October. The spatial distribution of profiles kept in the analysis is shown in Fig. 1b. It reveals a general lack of data in the center of the subtropical gyre compared to the Gulf Stream region west of 60° W. The temporal distribution (Fig. 1a) reveals a significant increase in sampling after 2000 thanks to the Argo program (Wong et al., 2020). After 2012, the number of T − S profiles stabilizes at ∼ 400 profiles per month.
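The vertical interpolation described above can be sketched as follows. The exact OSnet depth levels are not listed in the text, so the geometric grid below is only an assumed stand-in with the same character (fine spacing near the surface, coarse at depth); the surface extrapolation by repeating the shallowest observation falls out of `np.interp`, which clamps outside the observed range.

```python
import numpy as np

# Illustrative uneven vertical grid: 51 levels with spacing increasing
# with depth (an assumption; not the actual OSnet levels).
depth_grid = np.geomspace(1.0, 1000.0, 51)  # positive depth, 1 m to 1000 m

# A raw profile sampled between 25 and 1000 m (depths must be increasing
# for np.interp).
z_obs = np.array([25.0, 100.0, 300.0, 600.0, 1000.0])
t_obs = np.array([18.0, 15.0, 10.0, 6.0, 4.0])

# np.interp clamps to the end values outside the observed range, which
# reproduces the "repeat the shallowest observation" surface extrapolation.
t_grid = np.interp(depth_grid, z_obs, t_obs)
```

A profile that starts deeper than 25 m would be discarded rather than extrapolated this far, per the selection criteria above.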

Input data
Input data (Table 1) are surface satellite data and bathymetry. The mean dynamic topography (MDT) is from the Centre National d'Études Spatiales (CNES-CLS18) (Mulet et al., 2021), and the bathymetry is the ETOPO1 bedrock, distributed by NOAA (2009). The SLA is the level-4 daily product from CNES-CLS (6.2_DUACS_DT2018), with a 1/4° horizontal grid resolution. We also use geostrophic surface velocities derived from the SLA and distributed by CNES-CLS (Table 1). The MDT is calculated by merging altimeter data, the GRACE and GOCE gravity fields and oceanographic in situ measurements (drifting buoy velocities, hydrological profiles) (Mulet et al., 2021), while the SLA is derived from altimeter data only. Keeping MDT and SLA separated in the inputs allows us to determine their respective importance in the prediction (see Sect. 4.4).

Additional datasets
We validate our results against other observational and synthetic datasets. The sea surface salinity (SSS) CCI dataset is distributed at 1/4° horizontal grid resolution from 2010 to 2019 (Boutin et al., 2021). We do not use SSS as an input variable for several reasons. SSS satellite observations only cover the period 2010–2019, and their quality is questionable at high latitudes and in cold water (Boutin et al., 2021). As our prediction only depends on the input variables, it is risky to rely on data with systematic errors. Moreover, we tested an architecture with SSS as input, and the results were not improved significantly: only the surface salinity was slightly better. The relative importance algorithm also showed that SSS was not used significantly in predictions; it has the same order of importance as the geostrophic currents (see Sect. 3.4 on the explainability of the NN). However, SSS is a useful product to compare with, and we further discuss this in Sect. 4.5 and Appendix B.
We compare OSnet gridded fields to Armor3D and Glorys12 because they are, to our knowledge, the only ocean products that extend from 1993 to today with a spatial resolution of at least 1/4° and a sub-monthly frequency (weekly for Armor3D and daily for Glorys12). The global eddy-resolving reanalysis Glorys12 (Lellouche et al., 2021) is based on the physical model NEMO (Madec, 2015) and ocean observations assimilated by means of a reduced-order Kalman filter. It is provided on a 1/12° horizontal grid as daily means. We also compare our results to the observation-based Armor3D weekly product. This latter product is built in two steps: (i) prediction of synthetic T − S fields by multiple linear regression from SST and SLA and (ii) optimal interpolation combining synthetic and in situ T − S profiles. A section of OSnet is compared to hydrographic section AT20, sampled along 52.3° W by the research vessel Atlantis from 1 to 11 May 2012 (McCartney, 2012). Finally, we use T − S data sampled at moorings of the Line W array, deployed in April 2004 between Cape Cod and Bermuda. We use profiles from the third mooring, located at 69.11° W, 38.51° N.

Method
The method is composed of three steps (Fig. 2). Firstly, a neural network is trained to predict T − S profiles and the MLD from satellite data. Secondly, an adjustment of the predicted T , S and MLD corrects the vertical shape of the profiles towards a physically consistent solution. Finally, an operational phase uses the trained network and the MLD adjustment to predict T , S and MLD on daily grids from the satellite data. The procedure is coded in Python with the help of several useful modules. The NN algorithm is coded with TensorFlow (Abadi et al., 2016) and the Keras application programming interface (Chollet and others, 2015). It is explained with Shap (Kaur et al., 2020). Xarray (Hoyer and Hamman, 2017), Dask (Rocklin, 2015) and Numba (Lam et al., 2015) are used for fast computation and the management of large datasets. The color palettes for maps are from Cmocean (Thyng et al., 2016) and Colorcet (Kovesi, 2015). All the code to build OSnet models is available at https://github.com/euroargodev/OSnet (last access: 27 June 2022; Pauthenet et al., 2022b), and the models and prediction tools specific to the Gulf Stream region are available at https://github.com/euroargodev/OSnet-GulfStream (last access: 27 June 2022; Pauthenet et al., 2022c).

The multilayer perceptron
The neural network used is a multilayer perceptron (MLP), which is a class of feed-forward artificial neural networks (Rosenblatt, 1961; Gardner and Dorling, 1998). The MLP learns the non-linear relation between inputs and outputs through one or more hidden layers with many neurons stacked together. The learning mechanism that allows the MLP to iteratively minimize the loss function is called backpropagation. We keep the architecture simple, with only two layers of 256 neurons each. Dropout is used as a regularization method to reduce overfitting and improve generalization (Srivastava et al., 2014). Activation functions are a rectified linear unit (ReLU) for the hidden layers, linear for the T and S outputs, and a sigmoid for the MLD output. We tested more complex architectures (additional layers, convolutional layers, bottleneck architecture) but could not improve the accuracy of the results with them. A simple architecture is advantageous for its lower computation time. The input consists of 12 values, listed in Table 1: latitude, longitude, day of the year (cosine and sine), bathymetry, MDT, SST, SLA, and the geostrophic velocities U and V with their two anomalies. Inputs are linearly interpolated at the in situ location of each profile. The outputs are predictions of three vectors of 51 depth levels (temperature, salinity and the MLD mask). The depth levels are the ones presented in the data section, onto which the CORA profiles are interpolated.
(Figure 2: the inputs of Table 1 are used to predict profiles of temperature (T ), salinity (S) and the MLD mask (K); T − S profiles are then adjusted using the profile K for a better prediction of the MLD.)
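For illustration, the architecture just described can be sketched as an inference-time forward pass in plain NumPy (the actual OSnet models are built with Keras/TensorFlow). The weights below are random stand-ins for trained parameters, and dropout is omitted because it is only active during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_hidden, n_depth = 12, 256, 51  # 12 inputs, two 256-neuron layers

# Random stand-ins for trained weights.
w1 = rng.normal(scale=0.1, size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=(n_hidden, n_hidden)); b2 = np.zeros(n_hidden)
w_t = rng.normal(scale=0.1, size=(n_hidden, n_depth))  # linear T head
w_s = rng.normal(scale=0.1, size=(n_hidden, n_depth))  # linear S head
w_k = rng.normal(scale=0.1, size=(n_hidden, n_depth))  # sigmoid MLD-mask head

def forward(x):
    """Two ReLU hidden layers, then three 51-level output heads."""
    h = relu(x @ w1 + b1)
    h = relu(h @ w2 + b2)
    return h @ w_t, h @ w_s, sigmoid(h @ w_k)

x = rng.normal(size=(4, n_in))        # a batch of 4 input vectors
t_hat, s_hat, k_hat = forward(x)
```

The sigmoid on the K head keeps the mask values in (0, 1), consistent with a profile of zeros above the MLD and ones below it.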
The dataset is split (randomly, without replacement) as follows: 20 % of the profiles are set aside for testing. Of the remaining 80 %, we use 80 % as training data and 20 % as validation data. Validation data are used to avoid overfitting by assessing the performance of the trained model after each epoch (one epoch sees all the training data). Be aware that the train and test data are not truly independent: the selection is random and does not account for spatial and temporal autocorrelation. Once the training of the NN is finished, we select the model with the best performance on the validation dataset. We then run this model on the test dataset (data not seen in training or validation) to confirm the good generalization of the model (i.e., training and test errors are similar; Fig. 3). Given a NN architecture with good generalization properties, we retrain a NN using all 67 767 profiles (Fig. 3, orange) as training data. The training of one model takes ∼ 20 min on an eight-core CPU with 32 GB of RAM.
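The split described above (20 % test, then 80 %/20 % train/validation within the remainder) can be sketched as a simple random permutation of profile indices; this is a minimal illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 67767                                  # number of profiles in the dataset

idx = rng.permutation(n)                   # random split, no replacement
n_test = int(0.2 * n)
test_idx, rest = idx[:n_test], idx[n_test:]
n_val = int(0.2 * rest.size)
val_idx, train_idx = rest[:n_val], rest[n_val:]
```

As noted in the text, such a purely random split ignores spatial and temporal autocorrelation, so nearby profiles may land in different subsets.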
To further improve the prediction performance and assess the associated confidence intervals, we exploit a bootstrapping scheme (Breiman, 1996). More precisely, we bootstrap the training procedure 15 times, using a different initialization and training dataset each time. Indeed, because of the instability of the prediction method, bootstrapping can give substantial gains in accuracy. Overall, given 15 trained models, we compute the mean T , S and K profiles for all input data and their standard deviation (Fig. 3, grey). The latter delivers an estimate of the confidence interval. This bootstrap method reduces the estimation bias. Finally, the T − S prediction is generalized on a daily 1/4° horizontal grid. The spatial resolution of the input data (Table 1) is unified to 1/4° in longitude and latitude by nearest-neighbor interpolation. This produces T and S fields with 51 depth levels from the surface to 1000 m for each day between 1 January 1993 and 31 December 2019. The product is freely available (Pauthenet et al., 2022a).
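The bootstrapping scheme can be sketched as follows. The training of each member is replaced by stand-in predictions, since only the resampling of the training set and the ensemble statistics (mean as the estimate, standard deviation as the confidence interval) are illustrated here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_models, n_depth = 1000, 15, 51

# Each of the 15 ensemble members is trained on a resampled training set
# (drawn with replacement) and its own random initialization.
boot_indices = [rng.choice(n_train, size=n_train, replace=True)
                for _ in range(n_models)]

# Stand-in predictions from the 15 trained members for one input point.
preds = rng.normal(loc=10.0, scale=0.5, size=(n_models, n_depth))

t_mean = preds.mean(axis=0)   # the ensemble estimate
t_std = preds.std(axis=0)     # confidence interval (bootstrap spread)
```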

Prediction of the mixed-layer depth
We define the mixed-layer depth H with a density-deviation-from-the-surface method, as proposed by de Boyer Montégut et al. (2004). It is the depth at which the potential density referenced to the surface, σ0, exceeds the density of the water at 10 m by a threshold of 0.03 kg m−3: σ0(z = −H) = σ0(z = −10 m) + 0.03 kg m−3. This definition is chosen for its simplicity of application but tends to overestimate deep winter MLDs compared to the more sophisticated hybrid algorithm of Holte et al. (2017). Deep winter MLDs are rarely observed in our dataset (1145 profiles with an MLD deeper than 300 m, i.e., 1.7 % of the dataset), supporting our choice of a simple density threshold. For the NN, the mixed layer is represented in the form of a unitless profile K of size Z = 51 that is filled with zeros between the surface and the mixed-layer depth H and with ones from H to −1000 m.
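The density-threshold MLD and the binary mask K can be sketched as follows, for an idealized linearly stratified profile (positive depths and a prescribed σ0 profile; in practice σ0 would be computed from the T − S profile).

```python
import numpy as np

def mld_and_mask(z, sigma0, threshold=0.03, ref_depth=10.0):
    """Depth H where sigma0 first exceeds its 10 m value by `threshold`
    (de Boyer Montegut et al., 2004), plus the 0/1 mask profile K
    (zeros above H, ones from H downwards)."""
    sigma_ref = np.interp(ref_depth, z, sigma0)   # density at 10 m
    below = sigma0 > sigma_ref + threshold
    h = z[below][0] if below.any() else z[-1]
    k = (z >= h).astype(float)
    return h, k

z = np.linspace(0.0, 1000.0, 51)       # 51 levels, positive depth
sigma0 = 25.0 + 0.005 * z              # idealized stably stratified profile
h, k = mld_and_mask(z, sigma0)
```

The NN does not output this hard 0/1 mask exactly; its sigmoid head produces a smooth, sigmoid-like version of K whose gradients carry information about the transition layer.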
This formulation allows the NN to give an estimation of the gradients around the MLD instead of a single depth value (Fig. 4d). The resulting mask K is also convenient for the MLD adjustment performed on predicted profiles.
(Figure 3: normalized root mean square error (nRMSE) between observed (CORA) and predicted profiles (Glorys12, Armor3D, NN and OSnet) for temperature (a) and salinity (b). The normalization is done with the standard deviation of the observed temperature and salinity by depth. The upper panels are a zoom of the first 100 m of the full-depth lower panels. The Glorys12 (blue) and Armor3D (purple) profiles are collocated with the CORA profiles, and the error is calculated between these subsamples. The NN profiles are predicted with the NN alone, without MLD adjustment, for 15 training datasets (dark grey) and 15 test datasets (light grey). NN full (orange dotted) corresponds to predictions using the full dataset (test + train), averaged over the 15 bootstrapped models. Finally, the OSnet profiles (orange) are predicted with a NN bootstrapped 15 times followed by the MLD adjustment, which slightly increases the error at the surface.)

Adjustment of the mixed layer
We identified two types of error in the direct prediction of T − S profiles. Firstly, the MLD predicted by the K̂ profile has a better accuracy (MLD RMSE of 40 m) than the MLD computed from the T − S profiles directly (MLD RMSE of 50 m). The latter is systematically too shallow due to unrealistic vertical T − S excursions in the mixed layer, causing the density threshold to be reached too close to the surface. These sharp variations of T and S in the mixed layer also create density inversions. Secondly, the gradients in the layer under the MLD are systematically underestimated compared to the observed profiles. The mean and standard deviation of the gradients of σ0 over a 200 m thick layer under the MLD are 1.33 ± 0.9 kg m−3 for the observed profiles and 1.24 ± 0.8 kg m−3 for the predicted ones. The presence of strong gradients under the MLD has been documented (Johnston and Rudnick, 2009), and the profile K̂ seems to contain this information. Indeed, K̂ is a sigmoid-like profile (Fig. 4d), and its vertical gradients are proportional to the T − S gradients around the MLD. The summer K̂ profiles have sharp vertical gradients compared to the winter ones (not shown), which is coherent with the seasonal cycle of the transition-layer thickness (Johnston and Rudnick, 2009).
We thus choose to apply an MLD adjustment to the predicted profiles, in the same spirit as the convective adjustment schemes used in numerical hydrostatic models (Madec, 2015). We want to weight the vertical gradients of T̂ and Ŝ by K̂ in order to reduce the gradients of T̂ and Ŝ in the mixed layer and increase the gradients of T̂ and Ŝ just below the MLD (where K̂ = λ), while keeping the deeper gradients unchanged. We first modify K̂ into a new mask K̂* such that K̂* < 1 in the mixed layer, K̂* > 1 just below the MLD and K̂* = 1 at depth, with λ = 0.57 the value of K̂ corresponding to the MLD. The calibration of λ is done by a cross-validation procedure according to the estimation bias between T̂ at the sea surface and the SST value (Fig. 5). In other words, λ = 0.57 allows us to adjust the MLD while keeping the mean difference between T̂ and SST null (green in Fig. 5). We expect this value of λ to be specific to our region and NN parameterization; it would likely require a new calibration for another study. After computing the K̂* profiles, we reconstruct the T − S profiles iteratively, with the gradients multiplied by K̂*, starting from the bottom value (at z = 1000 m) where K̂* = 1 (the deepest MLD never reaches 1000 m; the maximum observed in the CORA dataset is H = 628 m for our region).
(Figure 4: temperature (a), salinity (b), potential density (c) and profile K, which is a mask of the MLD (see Eq. 1). The T − S profile predicted by the NN is in red and the one adjusted using the K̂* profile (OSnet) in green (see Sect. 3.3). The green and red bands are the confidence intervals for each profile, i.e., the standard deviation of the 15 bootstrapped models. Observed, predicted and adjusted MLDs are shown with dotted horizontal lines. SST is added as a purple dot with a horizontal bar for its mapping uncertainty.)
On a predicted temperature profile T̂ (the same is applied to salinity), the adjusted temperature profile T̂* is computed by weighting the vertical gradients with K̂*, T̂*_z − T̂*_{z+1} = K̂*_z (T̂_z − T̂_{z+1}), and we retrieve the temperature profile iteratively along depth by starting from the bottom z + 1 = 1000 m, where K̂*_{z+1} = 1 and T̂*_{z+1} = T̂_{z+1}. Figure 5 presents the overall bias of surface T̂ and Ŝ relative to SST and SSS. If we adjust the gradients with a K̂ profile, the surface temperature is systematically too warm compared to SST (Fig. 5a, red). If instead the adjustment also increases the gradients under the MLD (K̂*), the surface temperature bias is null (Fig. 5a, green). This supports our choice to amplify gradients just under the MLD (K̂* > 1) and to reduce them in the mixed layer (K̂* < 1). Note that the direct prediction of temperature at the surface (Fig. 5a, blue) is closer to SST than the in situ observations are, because OSnet learns from SST. The salinity difference relative to SSS is too large for the adjustment to cause a significant issue (Fig. 5b). Still, the salinity profiles adjusted with the predicted K̂ have a fresh bias, and the use of K̂* corrects it.
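The iterative bottom-up reconstruction described above (vertical gradients multiplied by K̂*, starting from the bottom where K̂* = 1) can be sketched on a toy profile. The K̂* values below are hand-picked for illustration (damped in the mixed layer, amplified just below it), since the exact K̂* formula depends on the calibrated λ.

```python
import numpy as np

def adjust_profile(t_hat, k_star):
    """Rebuild a profile from the bottom up with vertical gradients
    weighted by K*: T*_z = T*_{z+1} + K*_z (T_z - T_{z+1})."""
    t_adj = np.empty_like(t_hat)
    t_adj[-1] = t_hat[-1]                  # bottom value kept (K* = 1 there)
    for z in range(len(t_hat) - 2, -1, -1):
        t_adj[z] = t_adj[z + 1] + k_star[z] * (t_hat[z] - t_hat[z + 1])
    return t_adj

t_hat = np.array([20.0, 19.0, 15.0, 10.0, 5.0])   # toy 5-level profile
k_star = np.array([0.2, 0.2, 2.0, 1.0, 1.0])      # damp ML, amplify below it
t_adj = adjust_profile(t_hat, k_star)
```

With K̂* = 1 everywhere the reconstruction telescopes back to the original profile, which is the "deeper gradients unchanged" property.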
A good example of profile prediction and adjustment is presented in Fig. 4. In this case the adjustment perfectly corrects the MLD estimate, from 30 m predicted (red) to 133 m adjusted (green). It reduces the T − S gradients above the MLD estimated from the K profile (K = 0.5) and increases the T − S gradients just under this MLD. The decrease in T − S gradients in the mixed layer also removes the density inversion (Fig. 4c).
The adjustment proposed here reduces the variance of T̂ and Ŝ above the MLD, reduces the number of density inversions, improves the prediction of the MLD (see Table 2) and increases the gradients in the transition layer under the MLD. For the adjusted profiles, the mean and standard deviation of the gradients of σ0 over a 200 m thick layer under the MLD are 1.38 ± 0.9 kg m−3, which is closer to the observed value of 1.33 ± 0.9 kg m−3 than the predicted value of 1.24 ± 0.8 kg m−3. The large summer gradients are especially well retrieved (not shown).
An alternative idea to solve the density inversion issue is to constrain the NN to predict profiles that are hydrostatically stable. The physical relationship can be implemented in the NN to enforce consistency of the predictions (Karpatne et al., 2017). This can be done by modifying the loss of the NN and penalizing predictions with density inversions (Appendix A). This solution is elegant and allows us to directly predict profiles without density inversions. We provide this alternative approach here for the record of a negative result with regard to the design of such a NN. Indeed, profiles predicted with this custom loss still have poor MLD estimates compared to the K̂ profiles and hence still need a posteriori adjustment. The modified loss (Appendix A) is not needed in this case because the MLD adjustment presented above happens to remove density inversions efficiently.
(Figure 5: density distribution of the difference between the in situ surface temperature and salinity and the remote-sensing SST and SSS. The predicted profiles (blue) correspond to the profiles produced by the NN alone. The distribution adjusted with K̂ is in red (also red in Fig. 4d). The distribution adjusted with K̂* is in green and corresponds to the OSnet product (i.e., NN + adjustment), also shown in green in Fig. 4d.)

Explainability of the neural network
Explaining the predictive skills of the neural network is key to interpreting the prediction and strengthening trust in the model. It is also a useful tool in the development phase. Here, it gives insights into the relationship between surface data and in situ profiles. In this section, we use a game-theoretic approach to retrospectively estimate the relative importance of each input for each output. The algorithm, called SHapley Additive exPlanations (SHAP) (Lundberg and Lee, 2017), is a unified framework combining state-of-the-art methods to explain deep neural networks. It is based on a method called Deep Learning Important FeaTures (DeepLIFT) (Shrikumar et al., 2017) and Shapley values. DeepLIFT is a method for computing the effect of changing the original input to a reference value (uninformative background value for the input).
The change in the output is representative of the importance of the input for predicting the output. Shapley regression values (Shapley et al., 1953) represent the impact of an input on the output by removing it from the input set and retraining the model with the subset of inputs. As this is computationally expensive, the effect of removing a variable from the model can be approximated by integrating over samples from the training dataset using Shapley sampling values (Štrumbelj and Kononenko, 2014). This produces an "importance" value for each particular prediction. The importance value is positive or negative, indicating the direction in which the input influences the output relative to the averaged output. The SHAP algorithm being computationally expensive, it was not possible to run it over the full dataset. After some tests, we found that 300 random samples were representative enough to obtain stable results for the average feature importance across the entire dataset.

Table 2 presents several metrics to evaluate OSnet, Armor3D and Glorys12 relative to CORA. Each dataset is predicted or subsampled at the locations of the CORA profiles (in longitude, latitude, depth and time). The first two rows give the number and size of vertical density inversions. They indicate that the MLD adjustment of OSnet suppresses almost all density inversions, from 17.3 % to 0.32 % of profiles. Meanwhile, Armor3D has about 50 % of profiles with density inversions and Glorys12 almost none (0.01 %). Regarding the amplitude of density inversions, the MLD adjustment suppresses the large inversions and decreases the mean amplitude of inversions by 1 order of magnitude, from 10−3 to 10−4 kg m−3. The root mean square error (RMSE) of the MLD (Table 2) indicates that the MLD adjustment improves the MLD RMSE of OSnet from 50 to 40 m, which is of the same order of magnitude as Glorys12 (38.6 m) and Armor3D (39.4 m).
Note that the MLD of Armor3D is computed with a different criterion to bypass density inversions. Armor3D uses the minimum of a temperature threshold and a density threshold equivalent to a 0.2 °C decrease from the surface. The MLD of Armor3D computed with a density criterion of 0.03 kg m⁻³ yields an RMSE of 62.6 m. Finally, the global errors of temperature, salinity and density indicate that Armor3D profiles are the closest to the observed profiles. The metric is the normalized RMSE (nRMSE), which is the RMSE between predicted and observed profiles divided by the standard deviation of the observed profiles. It is thus a ratio of error to observed variability. OSnet has a smaller nRMSE than Glorys12, and the MLD adjustment slightly increases the temperature nRMSE at the surface. Figure 3 reveals the vertical distribution of nRMSE. OSnet gives a more accurate prediction at the surface compared to both Armor3D and Glorys12, but Armor3D is closer to observations for the rest of the water column. Overall, the nRMSE of OSnet predictions is of the same order of magnitude as that of the other products, and OSnet does not contain significant density inversions.
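The nRMSE described above reduces to a one-line computation; a minimal numpy sketch (the sample values below are purely illustrative, not taken from any of the evaluated products):

```python
import numpy as np

def nrmse(predicted, observed):
    """Normalized RMSE: RMSE between predicted and observed values,
    divided by the standard deviation of the observations."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    rmse = np.sqrt(np.mean((predicted - observed) ** 2))
    return rmse / np.std(observed)

# Illustrative example: a constant 0.5-unit bias against an observed
# series whose standard deviation is about 2.83 gives nRMSE ~0.18.
obs = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
pred = obs + 0.5
print(nrmse(pred, obs))
```

A value well below 1 means the prediction error is small compared to the natural variability of the observations.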

Temperature and salinity maps
Let us examine a daily map of temperature and salinity at 1/4° resolution (Fig. 6). We chose a date in the pre-Argo era to illustrate the generalization skill of the OSnet product. All the maps reveal coherent horizontal structures. At the surface, the warm Gulf Stream detaches from Cape Hatteras and meanders further east, transforming into the North Atlantic Current (Fig. 6a, b). The surface confidence intervals are maximum for the cold and fresh waters near the edge of the continental slope and inside the cold- and fresh-core eddies and meanders (Fig. 6c, d). On average, confidence intervals highlight cold waters north of the Gulf Stream (Fig. 7a, b), which is consistent with the error of prediction maps presented in Buongiorno Nardelli (2020). This could be due to the lack of profiles containing these cold waters in our dataset. At depth (1000 m in Fig. 6e, f), the signature of large eddies is visible, again associated with a maximum of the confidence interval (Fig. 6g, h). The salty and warm Mediterranean Overflow Waters are seen in the southeast of the region. The average confidence interval at 1000 m is maximum along the Gulf Stream and its meanders, rather than in waters north of the Gulf Stream as at the surface (Fig. 7c, d). It corresponds to the areas with the largest variability (Gaillard et al., 2016; Forget and Wunsch, 2007). Note the different color scale: the maximum confidence interval values at 1000 m depth are half as large for temperature and one-fifth as large for salinity compared to the surface.
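The confidence intervals shown on these maps are simply the spread of the bootstrap ensemble at each grid point; a minimal sketch with stand-in scalar values (in OSnet each ensemble member is a full temperature and salinity prediction from one of the 15 bootstrapped networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 15 bootstrapped model outputs at one grid point;
# the values are synthetic, centred on an arbitrary 18 degC.
n_models = 15
predictions = 18.0 + 0.3 * rng.standard_normal(n_models)

mean_prediction = predictions.mean()   # the mapped field value
confidence = predictions.std()         # the mapped confidence interval

print(f"{mean_prediction:.2f} +/- {confidence:.2f}")
```

The same reduction, applied per grid point and per depth level, yields the confidence maps of Figs. 6 and 7.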

MLD maps
To illustrate the quality of the predicted MLD of OSnet, we show MLD maps for a given day (5 January 2018) in Fig. 8. We picked a winter day to emphasize deep MLD areas. The direct prediction of the NN (Fig. 8a) has shallow patches in a few places that are due to density inversions: because of these artifacts in the water column, the density threshold is met at too shallow a depth (see the profile example in Fig. 4). The MLD adjustment corrects these shallow patches, and the MLD field of OSnet looks more consistent and more similar to Glorys12. The OSnet MLD does not exhibit any very deep patch (MLD > 300 m). These deep MLD events are rarely observed in CORA (1.7 % of the profiles have an MLD deeper than 300 m) but are often present in the MLD fields from Glorys12 (Fig. 8). They occur in warm meanders and eddies of the Gulf Stream. The MLD field of Armor3D (Fig. 8d) is for the week that contains 5 January 2018. It has several patches of either shallow or deep MLD (e.g., at 28° N, 52° W or 48° N, 32° W) which look very sharp compared to OSnet and Glorys12. These patches might be regions around observed profiles for the given week, with the optimal interpolation of Armor3D overfitting the profiles at the expense of the overall coherence of the property fields. The Armor3D MLD is computed with a different criterion to bypass their density inversion issues. Still, some patches have no values where the criterion could not be met (Fig. 8d).

Figure 6. OSnet temperature and salinity maps for 7 January 1993, at the surface (a, b) and at 1000 m (e, f). Their respective confidence intervals, i.e., the standard deviation of the 15 bootstrapped models, are displayed too (c, d, g, h).
Monthly MLD averages are presented for March and August in Fig. 9. The averages in 1° × 1° boxes for the in situ profiles (Fig. 9a, d) are compared to OSnet (b, e) and Glorys12 (c, f). The three estimates are in good agreement with each other and with climatologies not shown here (Holte et al., 2017; Sallée et al., 2021). The main structures are well reproduced, with a large winter patch of deeper MLD extending between the Gulf Stream and the subtropical gyre. In winter, vigorous air-sea fluxes extract heat from the ocean and erode the superficial stratification. This process activates convective mixing and deepens the mixed layer, ventilating and creating the Eighteen Degree Mode Water (Speer and Forget, 2013; Maze and Marshall, 2011). In summer the near-surface water warms and caps the mode water layer. The summer MLD is shallower everywhere, with a slightly deeper signature in the core of the Gulf Stream as it separates from the coast at Cape Hatteras. A deeper summer MLD is also found south of ∼ 30° N along the equatorward edge of the subtropical gyre (Fig. 9d, e, f), a feature also observed in the climatology of de Boyer Montegut et al. (2004). This tropical summer MLD deeper than 30 m is associated with the trade winds (Stramma and Schott, 1999). The large permanent anticyclonic "Mann eddy" (Mann, 1967; Rossby, 1996) is clearly visible as a deep mixed-layer patch at 43° N, 42° W (Fig. 9). A region of deeper MLD is also visible along the North Atlantic Current, deeper in OSnet than in Glorys12.
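The MLD diagnostics above rest on a density-threshold definition; a minimal sketch of such a computation on a single profile (the 0.03 kg m⁻³ threshold is the one quoted for Armor3D earlier, and the 10 m reference depth is a common convention, e.g. de Boyer Montegut et al., 2004, assumed here rather than taken from the OSnet configuration):

```python
import numpy as np

def mld_density_threshold(depth, sigma0, threshold=0.03, ref_depth=10.0):
    """First depth where potential density exceeds its value at the
    reference depth by `threshold` (kg m^-3). Reference depth and
    threshold values are conventional choices, assumed here."""
    sigma_ref = np.interp(ref_depth, depth, sigma0)
    below = depth >= ref_depth
    exceed = below & (sigma0 >= sigma_ref + threshold)
    if not exceed.any():
        return depth[-1]          # mixed layer reaches the profile bottom
    return depth[np.argmax(exceed)]

# Idealized profile with a pycnocline centred near 150 m.
depth = np.arange(0.0, 1000.0, 10.0)
sigma0 = 26.5 + 0.5 * np.tanh((depth - 150.0) / 40.0)
print(mld_density_threshold(depth, sigma0))  # -> 90.0
```

On a profile with a spurious near-surface density inversion, the threshold can be crossed far too early, which is exactly the shallow-patch artifact that the OSnet MLD adjustment removes.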

Importance of each input for the reconstruction
In Figs. 10 and 11 we present the absolute values of the relative importance of each input, on each output, averaged over depth, over the 300 test profiles and over the 15 bootstrapped models. Error bars correspond to the standard deviation of the 15 models. To be comparable, the importance values of the inputs are normalized so that the sum per output is equal to 1. Figures 10 and 11 give a general overview of what the NN uses for the predictions. These importance values can also be displayed for a specific profile or by depth levels, seasons, or geographical regions, providing insights to elucidate the behavior of the NN. The main result here is that SST is the main driver for estimating T, S and MLD profiles (Fig. 10). As expected, it is especially important for predicting surface temperature. MDT is the second-most important variable, and it is the most important deeper than ∼ 200 m (Fig. 11). The latitude, longitude, SLA and day of the year contribute roughly equally to the predictions. The rest of the input variables, i.e., bathymetry and the surface geostrophic currents derived from SLA, make smaller contributions to the predictions. Even if they have small contributions on average, they can be important for a specific profile. The cosine of the day of the year is very important for the prediction of the MLD (Fig. 10), probably because it is in phase with the MLD seasonal cycle, while the temperature and salinity cycles are in phase with the sine of the day of the year (not shown). Still, it means that the day of the year alone drives ∼ 29 % of the MLD predictions, which is equivalent to the importance of SST for MLD predictions (∼ 29 % too).
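The paper uses the SHAP framework for these importances; the underlying definition can be illustrated with an exact Shapley computation on a toy linear model, where each coalition's value is approximated by averaging the model output over background rows with the coalition's features set to the instance of interest (a didactic sketch, not the DeepLIFT-based estimator actually used):

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values for one prediction of model f at instance x,
    using background rows to represent 'absent' features."""
    d = len(x)

    def value(S):
        X = background.copy()
        X[:, list(S)] = x[list(S)]   # features in S take the instance values
        return f(X).mean()

    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):
            for S in combinations(others, k):
                w = factorial(k) * factorial(d - k - 1) / factorial(d)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

# Toy linear model: Shapley values equal coef * (x - background mean).
rng = np.random.default_rng(1)
coef = np.array([2.0, -1.0, 0.5])
f = lambda X: X @ coef
background = rng.standard_normal((50, 3))
x = np.array([1.0, 2.0, 3.0])
phi = shapley_values(f, x, background)

# Normalized absolute importance, summing to 1 as in Figs. 10-11.
importance = np.abs(phi) / np.abs(phi).sum()
print(importance)
```

The exact computation scales as 2^d and is only tractable for a handful of features; for the network itself, SHAP's sampling and DeepLIFT-based approximations are required.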

Time series of surface properties
To assess the accuracy of OSnet through time, we analyze the time series of the spatially averaged surface temperature and compare it to the observed SST time series as well as the Glorys12 and Armor3D products. We average data over the region after removing values on the shelf (i.e., keeping only points where the bathymetry exceeds 1000 m). The long-term trends are obtained by applying a seasonal-trend decomposition based on local regression (STL) (Cleveland et al., 1990). STL is a filtering procedure that extracts three components: (i) the variations in the data at the seasonal frequency (Fig. 13), (ii) the low-frequency variation together with nonstationary and long-term changes (Fig. 12) and (iii) a remaining high-frequency component.
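The three-way split can be illustrated with a classical moving-average decomposition, a simplified stand-in for STL (which instead iterates loess fits; an STL implementation is available as `statsmodels.tsa.seasonal.STL`). The synthetic series below is purely illustrative:

```python
import numpy as np

def decompose(y, period):
    """Classical additive decomposition into trend, seasonal and residual.
    A simplified stand-in for STL (Cleveland et al., 1990)."""
    n = len(y)
    # Trend: centered moving average over one full period,
    # with crude constant extension at the edges.
    kernel = np.ones(period) / period
    trend = np.convolve(y, kernel, mode="same")
    half = period // 2
    trend[:half] = trend[half]
    trend[n - half:] = trend[n - half - 1]
    detrended = y - trend
    # Seasonal: average of the detrended series at each phase of the cycle.
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal -= seasonal.mean()               # zero-mean seasonal component
    seasonal = np.resize(seasonal, n)
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Synthetic SST-like daily series: warming trend + annual cycle + noise.
rng = np.random.default_rng(2)
t = np.arange(10 * 365)
y = (15.0 + 0.0001 * t
     + 3.0 * np.sin(2 * np.pi * t / 365)
     + 0.2 * rng.standard_normal(len(t)))
trend, seasonal, resid = decompose(y, 365)
```

The recovered trend tracks the imposed warming while the seasonal component recovers the 3-degree annual cycle, mirroring how the SST trend and seasonal cycle are separated in Figs. 12 and 13.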
OSnet follows best the long-term trend of SST, with a linear trend of 0.197 °C per decade, close to the SST trend of 0.190 °C per decade (Fig. 12a). For comparison, the surface-averaged temperature of Armor3D is warmer in the pre-Argo years, probably due to its background field, a global average mostly built from Argo floats. This is a validation that OSnet generalizes well and is not biased towards the recent years of in situ observations. Regarding surface salinity long-term trends, no significant trend is captured over the 1993-2019 period (Fig. 12b). There is no clear agreement between the different datasets, except in the last period, 2010 to 2019, where all averages increase along with the SSS signal. Armor3D mean surface salinity drops significantly during the last 2 years, 2018 and 2019, outside the SSS error margin. Note that OSnet does not include the areas over the continental shelf and does not predict deeper than 1000 m. A significant part of the climatic signal takes place in the coastal regions (Ezer et al., 2013; Davis et al., 2017), and an improvement of OSnet would be to deal with profiles of different lengths in order to include these regions.
The mean seasonal variation of surface temperature, salinity and MLD is presented in Fig. 13. We compute it by averaging the signal by day of the year (week of the year for Armor3D). The surface temperature seasonal signal is well reproduced by each dataset, as expected considering that SST is included in the input of OSnet and used to produce both Armor3D and Glorys12 (Fig. 13a). A closer look at the curves in Fig. 13a shows that the seasonal cycle of OSnet surface temperature is too cold between May and September (Fig. 13c). This is due to the MLD adjustment, since the direct prediction of surface temperature gives a very precise seasonal cycle of SST (red line in Fig. 13c). Armor3D seasonal temperature is warmer from August to April, which could be a bias caused by the undersampling of the pre-Argo, colder years (Fig. 13c). This bias is also present in the temperature time series (Fig. 12a).
The surface salinity seasonal signal is noisier, in part because it presents inter-annual variations of the same order of magnitude as the seasonal variations (Fig. 13b). We compare the 2010-2019 period because it is the only one available for the SSS product. OSnet surface salinity is the closest to SSS compared to the predictions of Armor3D and Glorys12, which are both fresher than the observed SSS. We observe a delay relative to the SSS seasonal variations: OSnet is fresher by almost 0.1 psu from January to March. We discuss this in Appendix B and Fig. A1.

Figure 10. Relative importance of each input for each output, averaged over depth. The inputs (x axis) are sorted by importance for the temperature to have the largest importance on the left of the plot. The cosine of the day of the year is more important than the sine for the MLD prediction because the cosine is in phase with the seasonal cycle of the MLD.

Figure 11. Relative importance of each input for each output, by depth. The inputs are sorted by variance to have the input with the largest difference of impact by depth on the right of each panel.
Finally, the seasonal cycle of MLD (Fig. 13d) in the region is asymmetrical, with slow deepening from summer to winter and fast shoaling during the early spring. This asymmetry is expected since buoyancy loss at the surface leads to convective mixing (hence deepening the mixed layer requires buoyancy loss over an ever-increasing water column depth), while buoyancy gain directly leads to a stable stratification and shallow mixed layer (e.g., Taylor and Ferrari, 2010). OSnet MLD compares well to the MLD computed on Glorys12. We do not present the MLD of Armor3D here because it is computed with a different criterion. The winter MLD variance is larger in Glorys12 than in OSnet, which is also observed on daily maps of MLD: events of deep MLDs are not represented in OSnet (Fig. 8). In Fig. 13, the green shaded areas are the OSnet confidence intervals, i.e., the standard deviation of the 15 bootstrapped models.
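The day-of-year averaging used for these seasonal cycles can be sketched as follows (the synthetic series is illustrative only):

```python
import numpy as np

def day_of_year_climatology(doy, values, n_days=365):
    """Average a daily signal by day of year, as done for the mean
    seasonal cycles of Fig. 13."""
    clim = np.full(n_days, np.nan)
    for d in range(n_days):
        sel = doy == d + 1
        if sel.any():
            clim[d] = values[sel].mean()
    return clim

# Synthetic multi-year daily series with an annual cycle peaking on day 220.
years = 5
doy = np.tile(np.arange(1, 366), years)
values = 20.0 + 5.0 * np.cos(2 * np.pi * (doy - 220) / 365)
clim = day_of_year_climatology(doy, values)
print(int(np.argmax(clim)) + 1)  # day of the climatological maximum: 220
```

For Armor3D the same reduction is applied per week of the year rather than per day.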

Discussion
In the results section we have seen that OSnet predictions are overall coherent. We now want to assess whether OSnet can be used to help interpret local oceanographic measurements or for process studies. The goal is to be as close as possible to observations while being physically consistent. To do so, OSnet is compared to observations (remote and in situ) and to the other two products, Armor3D and Glorys12. In this section we present these comparisons and discuss the quality of our predictions.

Comparison to observed data
One major feature of OSnet is the possibility of estimating T and S profiles at any location, given that surface data are available. Here we predict T − S at the location of a mooring of Line W (Fig. 14) and along the hydrographic section AT20 (Fig. 15). Temperature and salinity at 1000 m are plotted in comparison to mooring data for a period of 3 years (10 May 2004 to 11 March 2007). OSnet agrees well with the observations, apart from a slight warm and fresh offset in the first year (Fig. 14). A warm-core eddy crosses the location of the mooring in 2006 (see the SLA map in Fig. 14a), and its warm and salty deep signature is well captured by OSnet. Smaller warm and salty spikes appear in October 2004 and July and November 2005 but are not visible in the mooring data. They correspond to warm meanders of the Gulf Stream revealed by the SST and SLA at these three periods (not shown), causing deep T − S changes in OSnet.
We compare the OSnet T − S structure along the hydrographic section AT20 sampled in May 2012 by the research vessel Atlantis (Fig. 15). The OSnet prediction is done at the exact locations of the conductivity-temperature-depth (CTD) profiles, with the input data interpolated linearly from the gridded maps (Table 1). The comparison is also done for Glorys12, but by collocating the profiles with a nearest-neighbor method. Both Glorys12 and OSnet predictions are coherent, but we note two specific biases in Glorys12. First, Glorys12 displays a deep patch of MLD around 41° N, north of the Gulf Stream, that is neither observed in the CTD data nor predicted by OSnet (Fig. 15d). This deep MLD could be due to the nearest-neighbor collocation, which is not exact in the case of Glorys12, or it could be an artifact of their model. Indeed, deep patches of MLD are also visible on daily MLD maps of Glorys12 but are absent on the OSnet daily MLD maps (Fig. 8). Second, the salinity reconstruction of Glorys12 differs more from the observed data north of the Gulf Stream (Fig. 15g). OSnet performs very well along this section in comparison to Glorys12.

OSnet to explore theoretical inputs
Since OSnet is fast and easy to run, it can be used to make predictions from theoretical inputs. To illustrate this, we seek the interior signature of eddies detected with altimetry. We make two predictions of temperature and salinity for a section across two particular eddies observed on 6 October 2006 (Fig. 16a): one prediction is based on all observational inputs (Fig. 16b, c), and the second prediction is made by removing the eddy signature in SLA; we simply set it to 0 (Fig. 16e, f, g). The interior temperature and salinity anomalies associated with the eddies are obtained by difference (Fig. 16h, i). Anomalies are the largest at depth, around 400-500 m, with amplitudes of a few degrees per meter of SLA. These are reasonable amplitudes and structures for the region (Castelao, 2014) and illustrate how OSnet could easily be used to extract more knowledge than the standard realistic T − S predictions on a grid.
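The experiment above boils down to differencing two predictions that share all inputs except SLA. A sketch with a stand-in predictor (`predict_profile` below is a hypothetical, purely illustrative function, not the OSnet API; its analytic profile merely mimics the deep bowing of isotherms under an anticyclonic eddy):

```python
import numpy as np

def predict_profile(sst, sla, lat, lon):
    """Stand-in predictor, not the OSnet network: returns a temperature
    profile whose subsurface anomaly scales with SLA, mimicking the
    deep signature of an eddy."""
    depth = np.linspace(0.0, 1000.0, 51)
    background = sst - 12.0 * (1.0 - np.exp(-depth / 300.0))
    eddy = 4.0 * sla * np.exp(-((depth - 450.0) / 250.0) ** 2)
    return depth, background + eddy

# Prediction with the observed SLA, then with the eddy signature removed.
depth, with_eddy = predict_profile(sst=22.0, sla=0.4, lat=38.0, lon=-60.0)
_, no_eddy = predict_profile(sst=22.0, sla=0.0, lat=38.0, lon=-60.0)
anomaly = with_eddy - no_eddy          # interior signature of the eddy
print(depth[np.argmax(anomaly)])       # anomaly peaks near 450 m here
```

With the real network, the same two-call pattern (observed inputs versus SLA set to 0) yields the anomaly sections of Fig. 16h, i.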

Potential improvement of the method
Several improvements of the method could be made in the future. One important limitation is the current NN architecture, which prevents predictions of profiles of different lengths. It constrains the analysis to a fixed maximum depth and thus excludes profiles shallower than that maximum depth. A solution would be to develop a custom loss that can deal with missing values. Another way would be to add depth as an input variable, as in Buongiorno Nardelli (2020). In that case, we could predict properties over shallow bathymetry and also deeper than our rather arbitrary 1000 m limit.
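Such a loss could, for instance, be a masked mean square error in which missing depth levels simply carry zero weight; a framework-agnostic numpy sketch (the mask convention and profile values are assumptions for illustration):

```python
import numpy as np

def masked_mse(y_pred, y_true, mask):
    """MSE computed only where mask is true; levels below the seafloor
    or beyond the end of a short profile do not contribute."""
    mask = mask.astype(float)
    se = mask * (y_pred - y_true) ** 2
    return se.sum() / mask.sum()

# Two profiles on 5 depth levels; the second only reaches level 3.
y_true = np.array([[10.0, 9.0, 8.0, 7.0, 6.0],
                   [12.0, 11.0, 10.0, np.nan, np.nan]])
mask = ~np.isnan(y_true)
y_pred = np.array([[10.5, 9.0, 8.0, 7.0, 6.0],
                   [12.0, 11.0, 10.5, 3.0, 3.0]])
print(masked_mse(y_pred, np.nan_to_num(y_true), mask))  # -> 0.0625
```

Only the two 0.5-unit errors at valid levels contribute; the nonsense predictions at the masked levels are ignored, which is what would allow short profiles to enter the training set.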
OSnet produces coherent horizontal and temporal patterns even though each profile is predicted independently. Yet we wonder how using horizontal and temporal surface gradients as inputs could improve predictions, especially in frontal regions. To test this, we would need to work with three-dimensional (lon-lat-time) patches of input data for each profile location and to build a different NN architecture (e.g., Ouala et al., 2018; Jouini et al., 2013; Tandeo et al., 2013; Lguensat et al., 2018) that takes patches of data as input and profiles as output. Convolutional neural architectures accounting for irregularly sampled space-time observations might also be appealing (Fablet et al., 2021). The expected result would be sharper fronts, as we observe that fronts from OSnet are smoother than in observations, e.g., Fig. 15. We wonder whether the NN could learn from the temporal surface gradient to anticipate vertical changes in stratification.
Prediction intervals (PIs), with their associated coverage probability, could be computed as a complement to confidence intervals (Khosravi et al., 2011). While the confidence intervals (Fig. 7) give the range of variation of a set of NNs with different initializations and training datasets, the PI gives the probability of the observation associated with the prediction being in a range of values. The PI can be obtained by adding output variables to the proposed architecture. Here, the PI could be represented as temperature and salinity profiles around the prediction, representing the 95 % interval. It means there would be a 95 % probability of the true value associated with the prediction lying within the interval.
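One standard way to obtain such interval outputs is quantile regression: each extra output is trained with a pinball (quantile) loss, so that it converges to the desired quantile of the conditional distribution. A minimal sketch of why this works (the Gaussian sample is illustrative; this is a generic technique, not the specific scheme of Khosravi et al., 2011):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss: an output trained with this loss
    converges to the q-quantile of the target distribution, so
    q = 0.025 and q = 0.975 outputs bound a 95 % interval."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

# The loss is minimized when the prediction equals the q-quantile.
rng = np.random.default_rng(3)
y = rng.standard_normal(50_000)
candidates = np.linspace(-3.0, 3.0, 601)
losses = [pinball_loss(y, c, 0.975) for c in candidates]
best = candidates[int(np.argmin(losses))]
print(best)  # close to the 97.5 % quantile of N(0, 1), about 1.96
```

Adding two such outputs per predicted variable would turn the network's point prediction into a calibrated interval without changing the rest of the architecture.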
Given the very promising results of our study, obvious future work would be to apply OSnet in other, more challenging regions, with fewer data, more complicated vertical structures and different dynamics. Also, the 3D geostrophic velocities of the OSnet gridded product could be estimated using the thermal wind equation combined with surface altimeter geostrophic currents (Mulet et al., 2012). Finally, since we found that OSnet correctly captures the SST warming trend (Fig. 12a) and mesoscale structures, it would be interesting to apply OSnet to other boundary currents and to compare the resulting ocean heat content (OHC) estimates to previous reconstructions (e.g., Cheng et al., 2017). Other global ocean indicators such as ocean freshwater content or steric sea level could be investigated as well.

Confidence intervals are shown as blue bands around the mean value for OSnet, i.e., the standard deviation of the 15 bootstrapped models. The contour intervals for the section plots are 0.5 psu from 33 to 37 psu for salinity and 2 °C from 4 to 24 °C for temperature.

Conclusions
We proposed a method to estimate the ocean stratification from surface data using a neural network trained on in situ historical data. The originality of this study is the attention we gave to the vertical coherence of T − S profiles, in particular the accuracy of MLD predictions and the absence of unrealistic vertical density inversions. The global nRMSEs of T and S are better than those of a state-of-the-art ocean reanalysis (Glorys12) but worse than those of Armor3D. However, OSnet predictions do not contain any unrealistic density inversions, while Armor3D predictions do. Each OSnet profile is predicted independently, but OSnet produces coherent horizontal patterns on a 1/4° daily grid, especially for MLD. In addition, the pre-Argo years are well reconstructed, which supports the good generalization skill of OSnet. Confidence intervals issued from the bootstrap method provide an estimate of the prediction variability. Confidence is lower in the cold surface waters north of the Gulf Stream and in the jet at depth, which corresponds to the most variable areas. The reconstructed surface temperature reproduces the observed warming trend. The seasonal cycle of surface salinity matches the SSS observations better than those of Glorys12 and Armor3D.
One convenient feature of OSnet is the possibility of estimating profiles at any location, given that the surface data are provided. This allows us to compare predicted profiles at the exact location of an observed CTD, for example. It is computationally inexpensive to run, and we encourage anyone who needs to predict ocean stratification from surface data to use OSnet. Another feature is the possibility of computing the relative importance of each input for each T − S prediction and analyzing which surface features influence which properties most. This is a development tool that can also be used to study how the ocean stratification is reflected in the surface data. Finally, the horizontal resolution of OSnet is constrained at 1/4° by the resolution of the SLA. The upcoming satellite mission SWOT should provide higher-resolution observations for OSnet to learn and predict smaller-scale features.
Appendix A: Alternative way to suppress the density inversions with a physics-constrained loss function

A1 Custom loss function
Without constraining the predictions in a physical space, most profiles show spurious density inversions that make the MLD computation impossible. To alleviate these issues, we develop a custom loss that constrains the density profile to be monotonic and the properties in the mixed layer to be well mixed. The loss function minimizes the mean square error between predicted and target profiles, complemented with two physics-constrained terms, Loss_Phy and Loss_H:

Loss = (1/N) Σ_{n=1..N} (ŷ_n − y_n)² + Loss_Phy + Loss_H,

with N the batch size and ŷ the predicted and y the observed profiles of temperature and salinity, as tensors of size N × D × 2 (D = 51 depth levels). To that standard loss we add two more terms. First, we include the potential density σ0 profile in the target y and prediction ŷ. It ensures that the T − S predictions correspond to a density profile closer to the observed one. We also take the MLD H out of the standard loss and multiply it by a coefficient λ_MLD:

Loss_H = (λ_MLD / N) Σ_{n=1..N} (Ĥ_n − H_n)².

Second, we add a constraint of monotonicity on the density profile to penalize predictions that contain density inversions. A positive value of the vertical density gradient ∂σ0/∂z (with z oriented upward) is a violation of the hydrostatic stability of the water column. Such density inversions can exist in observed profiles at small temporal and vertical scales. As our predicted profiles are daily averages, we assume that they should not present any density inversions, i.e., ∂σ0/∂z < 0, strictly.
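The composite loss can be sketched in numpy as follows (an illustrative sketch, not the training code: the λ values are placeholders rather than the tuned OSnet coefficients, and the inversion penalty is written as a hinge on consecutive density differences):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def physics_loss(y_pred, y_true, mld_pred, mld_true,
                 lam_mld=1e-4, lam_phy=1.0):
    """Sketch of the composite loss: standard MSE on the (T, S, sigma0)
    profiles, an MLD term weighted by lambda_MLD, and a penalty on
    density inversions (density decreasing with depth). Lambda values
    are placeholder assumptions."""
    loss_mse = np.mean((y_pred - y_true) ** 2)
    loss_h = lam_mld * np.mean((mld_pred - mld_true) ** 2)
    # Last variable holds sigma0 on levels ordered surface -> bottom;
    # a hydrostatically stable column has sigma0 non-decreasing downward.
    sigma0 = y_pred[..., 2]
    inversions = relu(sigma0[:, :-1] - sigma0[:, 1:])
    loss_phy = lam_phy * np.mean(inversions)
    return loss_mse + loss_h + loss_phy

# Batch of 2 profiles x 4 depth levels x 3 variables (T, S, sigma0).
rng = np.random.default_rng(4)
y_true = rng.standard_normal((2, 4, 3))
y_true[..., 2] = np.sort(y_true[..., 2], axis=1)  # stable observed density
y_pred = y_true.copy()
mld = np.array([50.0, 80.0])
print(physics_loss(y_pred, y_true, mld, mld))  # -> 0.0 (perfect and stable)
```

A prediction that matches the target and is hydrostatically stable incurs zero loss; introducing an inversion in the predicted density immediately adds a positive penalty, which is what drives the network away from unstable profiles.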
A2 Optimization of the λ coefficient

The λ coefficient of our custom loss needs to be optimized in order to minimize three metrics: m1, a metric of accuracy, the root mean square error of the prediction relative to the target; and two metrics of physical consistency, m2, the root mean square error of the MLD H, and m3, the count of density inversions. Note that the Ĥ in m2 is directly predicted by the NN; it is not computed on the predicted profiles with the density criterion. This is a multiobjective problem that we solve with the NSGA-II genetic algorithm (Deb et al., 2002).

Financial support. This research has been supported by the Euro-Argo RISE project of the European Union's Horizon 2020 research and innovation programme (grant no. 824131).
Review statement. This paper was edited by Anna Rubio and reviewed by Michel Crepon and one anonymous referee.