Enhancing temporal correlations in EOF expansions for the reconstruction of missing data using DINEOF

Enhancing temporal correlations in EOF expansions for the reconstruction of missing data using DINEOF A. Alvera-Azcárate, A. Barth, D. Sirjacobs, and J.-M. Beckers AGO-GHER, University of Liège, Allée du Six Août, 17, Sart Tilman, Liège, 4000, Belgium National Fund for Scientific Research, FNRS-FRS, Belgium Department of Life Sciences, University of Liège, Sart Tilman, Liège, 4000, Belgium Received: 15 June 2009 – Accepted: 26 June 2009 – Published: 9 July 2009 Correspondence to: A. Alvera-Azcárate (a.alvera@ulg.ac.be) Published by Copernicus Publications on behalf of the European Geosciences Union.

1 Introduction DINEOF (Data INterpolating Empirical Orthogonal Functions) is a technique to reconstruct missing data in geophysical data sets using an EOF basis to infer the missing data (Beckers and Rixen, 2003).This technique is typically applied to satellite data with gaps due to, for example, clouds present in the atmosphere that impede the surface IR radiation to reach the satellite sensor.Even in satellite measures taken with microwave-based sensors (such as the SST provided by TRMM (Tropical Rainfall Measuring Correspondence to: A. Alvera-Azcárate (a.alvera@ulg.ac.be)Mission) Microwave Imager (TMI) satellites, or the Advanced Microwave Scanning Radiometer (AMSR-E), http: //www.remss.com,Gentemann et al., 2004) large gaps can be present in rain zones and at days when the satellite swath does not completely cover the zone of study.
These clouds and gaps can obscure almost completely the studied domain, with a few clustered observations remaining in some parts of it.These observations are therefore not representative of the whole spatial variability of the domain, and an EOF decomposition, such as the one realised within DI-NEOF to retrieve missing data, can result in an over-fitting of the EOFs to the few present data.As a consequence the reconstructed image does not necessarily reflect the spatial variation of that given image.Satellite data sets typically used for reconstruction with DINEOF may contain some images with the characteristics mentioned above.It is therefore possible that the reconstruction presents some discontinuities in time (i.e.consecutive images with very different values), because of the presence of these highly clouded images.We propose a filter that, applied to the temporal covariance matrix of the data before the EOF decomposition step in DI-NEOF, will reduce the temporal discontinuities in the temporal EOFs and therefore in the reconstructed data set.
This work is organized as follows: the data used to test the filter is described in Sect. 2. Section 3 briefly explains the reconstruction technique, DINEOF, and the proposed filter to decrease spurious temporal variability in the temporal EOFs.The results of this approach are then compared to the DI-NEOF reconstruction without the filter in Sect. 4. This work is concluded in Sect. 5.The average cloudiness of this data set is about 70%.Temporal variability of total cloud cover presents an annual cycle with minimum average cloud cover of about 40 to 50% in summer and maximum average cloud cover of about 90% in winter (Fig. 1).Note the very high cloudiness in winter of in-dividual time frames in this figure.Spatially, regions with the highest cloud cover percentage are the Sea of Azov and the northwestern shelf of the Black Sea, with an average cloud cover higher than 75%.In the rest of the Black Sea, the average cloud cover varies from 60 to 80%.Typical cloud sizes are very large, covering most of the Black Sea at a time.

DINEOF
Consider a matrix X with size M×N, with M the spatial size and N the temporal size.DINEOF proceeds as follows to calculate the missing data: anomalies are computed and the missing data are initialised to the mean (i.e. to a zero anomaly).Starting with the calculation of the most dominant EOF mode of this matrix, the missing data are calculated by: (1) with m=1...M and n=1...N.U are the spatial modes, V are the temporal modes and are the singular values.This new estimation for the missing data is reintroduced in the data matrix, and the EOF mode calculation is repeated.This process is repeated until convergence of the missing values, and then the EOF modes calculated are increased to two, then three, etc.The EOF mode calculation is done using a Lanczos solver provided by Toumazou and Cretaux (2001), which uses the ARPACK freeware (Lehoucq et al., 1997).The optimal number of EOF modes retained are calculated by cross-validation (i.e., a few valid data are set aside and the error of the reconstruction is assessed by comparing the reconstructed data to these cross-validation data).
To test the new filtering approach, the data used for crossvalidation are not taken randomly as it is done in DINEOF by default (e.g.Beckers and Rixen, 2003;Alvera-Azcárate et al., 2005, 2007).For this work, the 15 cleanest images are artificially covered by clouds extracted from other images, following the approach described in Beckers et al. (2006).Three percent of data is covered in this way for the crossvalidation.The cross-validation clouds are a more realistic case than the random points, and therefore the reconstruction of the cross-validation data is done in the same conditions as the rest of the missing data.The cross-validation error using this approach can be expected to be a more realistic estimation of the reconstruction error.
Concerning the anomalies calculated within DINEOF, a space and time average is subtracted from the data to compute them.This choice guarantees that the method is symmetric in space and time.In addition, it allows to compute anomalies in any data set, without prior knowledge of its variability.For satellite SST, it can be useful to remove e.g. the annual cycle, and provide these anomalies to DI-NEOF to reconstruct the missing data.However, when the annual cycle is not removed from the initial data, the EOF decomposition identifies this as the most dominant signal, so the first EOF contains in most cases the annual cycle.One can see the EOF decomposition then as a way to separate the annual cycle from the rest of the variability of the data set.In this work we chose not to remove the annual cycle from the data, but rather the general spatial and temporal mean.This way, the effect of the new proposed filter can be assessed independently, as removing a different average might have also an impact on the reconstructed data.
For an extended description of DINEOF, and recent developments, the reader is referred to Beckers and Rixen (2003)

Filtering the covariance matrix
Matrix X (if complete) can be filtered in the time domain as follows: where F is a N×N matrix containing the filter.A singular value decomposition (SVD) of the filtered matrix X can be calculated: The temporal covariance matrix is defined as: Therefore, Eq. ( 3) can be also written as: where we made appear the filtered covariance matrix B=F T BF.
The filter F is implemented as a Laplacian filter on the terms of X, in which the temporal increment between two images is also taken into account: two consecutive images in matrix X with a large time gap between them are less related than two consecutive images that are in addition consecutive in time.The latest will be more strongly weighted to each other than the first mentioned case.Considering a vector x containing the data to be filtered, the filter can be expressed as follows: where i=1...N, s=1...p, and α is a parameter specifying the strength of the filter, t is a vector containing the time and x is a vector containing the data, all of length N .The filter is applied first to the columns of B and then to its rows, according to Eq. ( 5).This step can be easily incorporated to DINEOF, before the SVD decomposition.Filtering B instead of X is more appropriate since B is much smaller and is less sensitive to missing data.
The filter in Eq. ( 6) implies three consecutive terms of the times series.The reach of the filter can be increased by iteratively reapplying it a predefined number of times.This iteration acts then over 2p+1 members, with p the number of iterations of the filter.
The highest frequency removed by the filter depends on the values of α and p.In this work, α was fixed to 0. (if α> min( t) 2 2 the filter will be unstable) and the value for p was then decided by considering the cross-validation error provided by DINEOF.Figure 2 shows the cross-validation error for the reference run using DINEOF, and for five runs (with p=1, 3, 10, 30 and 100) with the filter included in DI-NEOF.The use of the filter, for all values of p, produces a smaller cross-validation error than the reference run.Of all the five runs with the filter, it appears that p=3 gives the best reconstruction, which means that frequencies higher than 2π √ αp=1.1 days are being filtered from the data set.Note that the choice of the number of iterations can be considered as a fine-tuning experiment to obtain the smallest cross-validation error, as all experiments with filtering obtained better cross-validation error than the non-filtered version.

Analysis of the covariance matrix
Figure 3 shows the covariance matrix B=X T X (non filtered -left panel -and filtered -right panel -) for the 2003 Black Sea SST data reconstructed using DINEOF.The diagonal in this matrix represents the variance of each day's data with respect to the mean of the data set.We can see a high variance in summer and winter, with spring and fall nearer to the average temperature, and therefore with a lower variance.The off-diagonal terms are related to the degree of similitude of each image with respect to the rest of the images (or covariance), so winter is negatively correlated with summer, for example.The comparison between the left and right panels allows to see the effect of the filtering in the covariance matrix.In the non-filtered covariance matrix, discontinuities in the form of vertical and horizontal lines of low correlation are apparent.These discontinuities are caused by the irregularities in time and space mentioned above, and can lead to an unrealistic reconstruction of the data.The filtered version of B, which is used within DINEOF in the version presented in this work, eliminates or reduces some of these discontinuities (Fig. 3 right).matrix.The filter is applied in the time dimension, therefore the spatial structure of the first EOF mode is very similar in both approaches.The first temporal mode of Fig. 4, however, does present a difference between both reconstructions.While representing the annual cycle of cooling and warming temperatures in both reconstructed data sets, the non-filtered version presents several peaks at moments when the cloud coverage of the original data set is higher, for example during winter months.These peaks are smoothed in the filtered DINEOF version, which will lead to smoother reconstructions in time, as it will be shown later.The first EOF mode (accounting for 98.11% of the total variability) represents the annual cycle variations of the Black Sea SST, because as we mentioned in Sect.3.1, this cycle was not removed from the initial data.The cyclonic, coastal-trapped, basin-wide Rim Current is the most prominent feature in the spatial maps of Fig. 4.This current presents the warmest temperature during winter, and it is colder than it surroundings during summer, in part influenced by coastal upwelling along the Turkish coast (Sur and Ilyin, 1997) and the Crimean peninsula (Sur and Ilyin, 1997;Gawarkiewicz et al., 1999).The coldest regions during winter are the northwestern shelf and the Sea of Azov, exposed to rapid temperature changes related to the atmospheric conditions due to its shallow bathymetry (Sur et al., 1996;Kosarev et al., 2008).

EOF mode analysis
The second EOF is shown in Fig. 5.The smoothing of the temporal EOF is more evident here than in the first EOF, and again the spatial modes present the same structure.The second temporal mode presents a semi-annual periodicity and a northeast-southwest spatial structure, mainly separating the Sea of Azov and the northwestern shelf from the southeast Black Sea.This mode accounts for a warming/cooling anomaly in spring/fall in the northwest shelf, and the opposite trend for the southeast zone of the Black Sea.This mode accounts for almost 1% of the total variability (or 53% of the remaining variability of modes of higher order than the first mode).
The third EOF mode, accounting for only 0.3% of the total variability (or 34% of the remaining variability of modes of higher order than the first and second modes), is presented in Fig. 6.The same comments as for the other modes stand here, as the spatial mode is almost the same for the filtered and non-filtered versions of DINEOF, and the temporal mode is smoother in the filtered DINEOF version.

SST Reconstruction
The three years of Black Sea SST have been reconstructed using DINEOF with no filtering, and including the filtering of the covariance matrix with p=3.In the classic approach (with no filtering), 17 EOF modes are retained by DINEOF, reaching a minimal cross-validation error of 0. explained variance of 99.92%.When the filter of the covariance matrix is included, 42 EOF modes are retained, with a minimal cross-validation error of 0.46 • C, and a total explained variance of 99.93%.It might appear surprising that when the filtering is included, more EOF modes are retained by DINEOF, although this can be explained by these modes being better constrained (i.e.determined) because the filter introduces information about the temporal sequence of the data.Also, note that the cross-validation error obtained with the inclusion of the filter is close to the accuracy of satellite SST data, when compared to in situ data (e.g.McClain et al., 1985;Kumar et al., 2000;Marullo et al., 2007;Castro et al., 2008).
In Fig. 7 a sequence of initial data and reconstructions is shown.An example when a cross-validation cloud has been added is shown, so that readers can see the original data (first row of Fig. 7) and the data set used for the reconstruction (second row).Of course the three days shown are only a small part of the whole 903 images used in this work.The third row in Fig. 7 shows the reconstructed data using DI-NEOF, and the fourth row shows the reconstruction using DINEOF with filtering of the covariance matrix.As the information surrounding a given date is used for the reconstruction of a particular image in the filtering approach, the reconstruction using this version shows improvements respect to the classic DINEOF approach.For example, due to the added cross-validation cloud, information on 3 July 2004 is almost non-existent, although we can see in the original data set that a cold filament started to develop south of the Crimean peninsula.This filament is still detected on 4 July 2004.The reconstruction with the classic DINEOF approach does not present this filament, but the filtered DINEOF version does, mainly because this information was taken from the surrounding data.As a result, the filtered DINEOF version is more realistic, and is able to reconstruct features even when these are completely missing on a given image.One benefit of retaining more EOFs (in the filtered DINEOF version) is that this increases the number of degrees of freedom.Typically, the lower rank EOFs contain small scale spatial information, which is now retained for the reconstruction.On the example given in Fig. 7 there are features that are also better represented in the filtered version of DINEOF, as the warm filament going from the Bosphorus strait to the Cape Kerempe, and the cold eddy in the Batumi region (southeast Black Sea).
Another example of the effect of the covariance matrix filtering on the reconstruction results can be seen in Fig. 8.The initially cloudy data from 24, 25 and 27 December 2003 present a large cloud cover on the last two days.On 25 December, only a few data near the east coast of the Black Sea are cloudless, which represent values typically warmer than the western part in this period of the year.The reconstruction using DINEOF with no filtering leads to a very warm reconstruction on 25 December.The temperature change of up to 6 • C within one day observed in the classic reconstruction is not realistic in the Black Sea.The reconstruction using the filtering of the covariance matrix leads to a more realistic result, with a temperature distribution similar to what is observed on 24 December 2003.On 27 December (the next day in the reconstructed images sequence), the information available in the initial matrix is from the colder western Black Sea.The reconstruction without the filtering leads to cold SST over the whole Black Sea, and the southeastern part of the basin does not present warmer water as it would be expected from past information.The filtered DINEOF reconstruction does present this east-west gradient, with warmer temperatures in the southeast and colder in the west.
To compare our filtering approach with a more direct a posteriori filtering of the data, a fourth row has been added to Fig. 8.This a posteriori approach consists in applying the Laplacian filter directly to the reconstructed data (i.e. to the data in the second row of Fig. 8).To establish the optimal number of iterations p, a cross-validation technique similar to the one present in DINEOF was used.A set of crossvalidation points (about 1% of the initial data) in the form of clouds was identified and removed from the initial data set, before reconstructing with DINEOF.The reconstructed data set is then filtered with different values of p, and the initial cross-validation data is used to asses the error.The error was minimised for p=300 iterations.The filtered data present a much better representation of the surface temperature on 25 December 2003 than the classic DINEOF reconstruction, however at the expense of much smoother SST for the surrounding dates.gyre presents lower temperatures than observed, and in general, the spatial structures are much smoother than when the filter is applied within DINEOF.Note that the same procedure (i.e. a cross-validation error calculation) has been used to establish the number of iterations that yield the best results in the embedded filter and the a posteriori filter.When the filter is applied a posteriori a much higher number of iterations is needed to obtain an accurate result.This is because the a posteriori filter has to counterbalance the effect of the noisy temporal EOFs that lead to the reconstruction in the second row of Fig. 8, whereas the embedded filter prevent the temporal EOFs to become noisy in the first place.

Effect of the number of EOFs on the reconstructions
The increased accuracy obtained with DINEOF when including the filter of the covariance matrix can be due to two reasons: the filtering of spikes on the temporal EOFs, and the increased number of EOFs that were retained in the Black Sea example shown here.To elucidate which of these effects account for the improvements seen in Figs.7 and 8, we have repeated the DINEOF reconstruction without filter, but forcing DINEOF to retain the same number of EOF modes than with the filter (i.e.42 EOF modes).As it might be expected, this reconstruction is able to better represent the small-scale variability of Fig. 7, to an extent that the reconstruction with and without filter present very similar results (the new results are not shown).However, this new reconstruction does still present the very warm reconstruction on 25 December 2003, and in general, it exhibits spurious high frequency variations (as those seen in the temporal EOFs of Figs. 4 to 6).The cross-validation error of the classic DI-NEOF reconstruction retaining 42 EOF modes is of 0.61 • C (i.e., higher than the reconstruction with 17 EOFs, and higher than the reconstruction with the filter).These results confirm that the filtering of the covariance matrix is able to reduce the spikes in the temporal EOFs, therefore obtaining more realistic reconstructions.The increased number of EOFs retained when including a filter within DINEOF accounts for a better reconstruction of the small-scale features.
The improved results obtained when using a higher number of EOFs brings us to the cross-validation technique used within DINEOF.It appears that a higher number of EOFs allows for a reconstruction of the small scale variability, but one must not forget that the cross-validation error is higher Ocean Sci., 5, [475][476][477][478][479][480][481][482][483][484][485]2009 www.ocean-sci.net/5/475/2009/ for those reconstructions.The higher order EOFs do not only contain small-scale information, but also noise, so forcing DINEOF to retain a higher number of EOFs than what is calculated with the cross-validation method is not to be done without caution, as it might degrade the overall accuracy of the reconstruction.The cross-validation approach used within DINEOF might however not be optimal, retaining less EOFs than necessary.This remains an open question that will be addressed in future work.

Conclusions
A technique to decrease the amount of noise in the temporal EOFs used for reconstruction of missing data in DI-NEOF (Data Interpolating Empirical Orthogonal Functions) has been described.The noise is caused by the presence of few valid observations at given time steps which are in addition clustered in space.The approach presented in this work consists on a Laplacian filter which is applied to the www.ocean-sci.net/5/475/2009/Ocean Sci., 5, 475-485, 2009 The combination of a lower reconstruction error and a higher number of temporal EOFs retained for estimating the missing values results in an improved estimation of the oceanographic features on the Black Sea sea surface temperature.An example has been shown in which a filament south of the Crimean peninsula was more accurately represented in the filtered reconstruction, because of the presence of this filament in the surrounding images, an information not exploited in the reconstruction with no filter.
It has also been shown that the approach presented here gives better results than simply filtering the data after the reconstruction, as in order to obtain a realistic temperature distribution a large amount of filtering is needed, therefore reducing gradients and losing some mesoscale information.This justifies the increased complexity of our filtering technique.

Fig. 1 .
Fig. 1.Top panel: temporal variation of cloud cover percentage in the Black Sea (in black) and with a low-pass filter of 30 days (red line).The year labels show the 1st January of each year.Bottom panel: spatial variation of cloud cover percentage in the Black Sea.CR stands for Crimean peninsula, and BA stands for Batumi region.

Fig. 2 .
Fig. 2.Cross-validation error decreasing with the increasing number of EOFs used for the reconstruction.The minimum error is marked with an asterisk for each line.The lines show the cross-validation error for DINEOF without filtering, and for DINEOF with filtering of the covariance matrix using 1, 3, 10, 30 and 100 iterations of the filter.

Figure 4 Fig. 3 .
Figure 4 presents the first spatial and temporal EOF modes of the reconstructed data set with the classic DINEOF approach and with DINEOF including the filtered covariance

Fig. 4 .
Fig. 4. First EOF spatial mode for the DINEOF reconstruction with no filtering of the covariance matrix (top left), with filtering of the covariance matrix (top right) and first temporal mode for both versions (bottom).The x-axis labels in the bottom panel indicate the 1st January of each year.The spatial EOF has units of • C and the temporal EOFs have no units.

Fig. 5 .
Fig. 5. Second EOF spatial mode for the DINEOF reconstruction with no filtering of the covariance matrix (top left), with filtering of the covariance matrix (top right) and second temporal mode for both versions (bottom).The x-axis labels in the bottom panel indicate the 1st January of each year.The spatial EOF has units of • C and the temporal EOFs have no units.

Fig. 6 .
Fig. 6.Third EOF spatial mode for the DINEOF reconstruction with no filtering of the covariance matrix (top left), with filtering of the covariance matrix (top right) and third temporal mode for both versions (bottom).The x-axis labels in the bottom panel indicate the 1st January of each year.The spatial EOF has units of • C and the temporal EOFs have no units.

Fig. 7 .
Fig. 7. Example of large cloud added for the cross-validation.Top row: initial cloudy data at three consecutive days, 2 to 4 July 2004.Second row: the same data set, but with a cross-validation cloud added on the 3rd July 2004, used for the reconstruction.Third row: reconstruction using no filtering of the covariance matrix.Fourth row: reconstruction with filtering of the covariance matrix.

Fig. 8 .
Fig. 8. SST initial data on 24, 25 and 27 December 2003 (top row) and two reconstruction by DINEOF: with no filtering of the covariance matrix (second row) and with filtering of the covariance matrix (third row).The last row show the result of a temporal filter applied after the reconstruction (i.e. to the data shown in the second row).