Potential Artifacts of Sequential State Estimation: Invariants

In sequential estimation methods often used in general climate or oceanic calculations of the state and of forecasts, observations act mathematically and statistically as forcings, as is obvious in the innovation form of the equations. For purposes of calculating changes in important functions of state variables, such as total mass and energy, or in volumetric current transports, results are sensitive to mis-representation of a large variety of parameters, including initial conditions, various uncertainty covariances, and systematic and random errors in observations. Errors can be both stochastic and systematic, with the latter, as usual, being the most intractable. Here, in Part 1, some of the consequences of such errors are analyzed in the context of a simplified mass-spring oscillator system exhibiting many of the issues of far more complicated realistic problems. Part 2 applies the same methods to a slightly more geophysical barotropic Rossby wave plus western boundary current system. The overall message is that convincing trend and other time-dependent determinations in "reanalysis"-like estimates require a full understanding of both models and observations.

Part 1. Formalism and Simplified System


Introduction. Formalism and Simplified System
Intense scientific and practical interest exists in understanding the time-dependent behavior, past and future, of elements of the climate system. Best estimates of past, present, and future states invoke knowledge of both observations and models, and both can involve physical-dynamical, chemical, and biological elements.
Fundamental to understanding many physical systems is analysis of long-term changes in quantities such as energy, enstrophy, total mass, and mean concentrations, that are subject to various conservation rules. These elements, absent external perturbations or internal sources or sinks, can be usefully regarded as potential "invariants" of the system.1 In conventional science, violation, e.g., of mass or energy conservation, not attributable to specific disturbances, would preclude any claim to understanding of the physics, chemistry, etc. governing the temporal evolution. Observational scientific fields in which time series data are of basic importance thus struggle with inferences from changing observation systems, either or both of changing technology or of spatial and temporal distributions. In climate science particularly, both of these factors limit the ability to determine trends over months, decades, and longer.

1 The modifier "potential" is normally omitted here, being implicit, as true invariants require the absence of generalized dissipation and external forces.
"Invariant" is also used as a label for specific sub-elements such as a fixed current transport.
Much interest also exists in the possibility of trends in major sub-elements of the system, oceanographically for example, in the transports of mass or heat or other properties in major currents such as the Gulf Stream. "Best estimates" of these values are also made using combinations of kinematic and dynamical models plus observations. Methods for combining data with models fall into the general category of control theory, in both mathematical and engineering forms, although full understanding is made difficult by the need to combine major sub-elements of different disciplines, including statistics of several types, computer science, numerical approximations, oceanography, meteorology, climate, dynamical systems theory, and the observational characteristics of very diverse instrument types and distributions. Within the control theory context, distinct goals include "filtering" (what is the present system state?), "prediction" (what is the best estimate of the future state?), and "interval smoothing" (what was the time history over some finite past interval?), along with their corresponding uncertainties.
In oceanography, and in climate physics and chemistry more generally, a central tool has become what meteorologists call a "reanalysis", a sequential estimation method based ultimately on long experience specifically with weather prediction. Particular attention is called, however, to the paper of Bengtsson et al. (2004), who showed the impacts of observational system shifts on outcomes with some sequential methods. A number of subsequent papers (e.g., Bromwich and Fogt, 2004; Bengtsson et al., 2007; Carton and Giese, 2008; Thorne and Vose, 2010) have noticed difficulties in using "reanalyses" for long-term climate properties, sometimes ending with advice such as "minimize the errors" (and see Wunsch, 2020 for one global application).
For some purposes, e.g., short-term weather or other prediction, system failure to conserve mass or energy or enstrophy may be of no concern, as the time-scale of emergence for measurable consequences of that failure can greatly exceed the forecast time. In contrast, for reconstruction of past states, those consequences can destroy any hope of physical interpretation. In long-duration forecasts with rigorous models, but by definition no observational data at all, invariants are likely to be preserved, albeit tests of model elements, and in particular of accumulating errors, are not possible.
Somewhat notorious, contradictory results in the public domain (e.g., Hu et al., 2020; Boers, 2021) suggest that, given the technical complexities of a full reanalysis computation, some simple examples of the known difficulties with sequential reanalysis methods could be helpful. Expert practitioners of the methodology, particularly on the atmospheric side (e.g., Dee, 2005; Cohn, 2010; Janjic et al., 2014; Gelaro et al., 2017), clearly understand the pitfalls of the methodologies, but many of these discussions are couched in the mathematical language of continuous space-time (requiring the full apparatus of functional analysis) and/or the specialized language of the atmospheric sciences. The intention here is to produce simple examples, in discrete time, as used on computers with real data, and with a mainly generic vocabulary. Dee (2005) is close in spirit to what is attempted here.

Some Concepts and Notation
Models

Some basic notation is necessary to analyze even the simplest, linear, time-evolving system with data. A fuller account is given in Wunsch, 2006, hereafter W06, and in many other textbook references. (Notation here is similar to that in W06.) Let the state vector x(t) evolve at discrete times t = nΔt as,

x(t) = A(t − Δt) x(t − Δt) + B(t − Δt) q(t − Δt) + Γ(t − Δt) u(t − Δt),   (1)

where A(t) is the state transition matrix, B(t)q(t) represents the known elements of forcing, boundary/initial conditions, and internal parameterizations, and the flexible structure Γ(t)u(t) is introduced as the unknown elements and corrections to B(t)q(t) for boundary/initial conditions, internal parameterizations, and forcing generally. The corresponding equation,

x̃(t, −) = A(t − Δt) x̃(t − Δt) + B(t − Δt) q(t − Δt),   (2)

will be referred to below as the "prediction model," as it is used in practice to make the best prediction at any future time, given the immediate past best-estimate. In such a calculation, ũ(t) = 0, as it is otherwise unknown. In many circumstances (e.g., Brown and Hwang, 1997; W06), Eqs. (1) or (2) are linearized about some reference state. That A(t) is itself then, and always, subject to significant error is a very important point, but that possibility renders the problem non-linear, and for present purposes the implications and approaches are set aside. Time-evolving systems require initial conditions, x̃(0), having some known or assumed error (uncertainty), written for linear systems as a covariance matrix,

P(0) = ⟨(x̃(0) − x(0)) (x̃(0) − x(0))^T⟩,

with the further, sometimes wholly implicit, assumption that the mean error ⟨x̃(0) − x(0)⟩ = 0. The brackets denote an expected value, whether theoretical or estimated. Uncertainties generally determine the utility of any solution of any problem.
Part of the estimation problem is to cope with the likelihood that P (0) itself is not wholly accurate, and with implications depending on how long the system "remembers" its initial conditions (typically a function of A).
Let P(t, −) represent the error covariance (uncertainty) of the prediction, x̃(t, −):

P(t, −) = A(t − Δt) P(t − Δt) A(t − Δt)^T + Γ(t − Δt) Q(t − Δt) Γ(t − Δt)^T,   (3)

a matrix Riccati equation, which is just the sum of the error covariance propagated from the predicted estimate, plus that generated by the unknown forcing, etc., elements, u(t), with Q(t) = ⟨u(t) u(t)^T⟩, the bracket defining an ensemble average.
(See any of numerous textbooks cited above.) Q(t) represents the error covariance of q(t), and thus of u(t). Inaccuracies in this equation are discussed by Konstantinov et al. (1993) and Zhou et al. (2009), and P must always itself be regarded as an estimate, not "truth." Note that only precision, not accuracy, is the subject here.
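The covariance propagation step can be sketched numerically. The matrices below are small illustrative placeholders (not the oscillator values used later); the function simply implements the sum of propagated and forcing-generated error:

```python
import numpy as np

# Minimal sketch of the covariance propagation (matrix Riccati) step,
# P(t,-) = A P(t-dt) A^T + Gamma Q Gamma^T, with hypothetical matrices.
def propagate_covariance(P, A, Gamma, Q):
    """Propagate the state-estimate uncertainty one step forward."""
    return A @ P @ A.T + Gamma @ Q @ Gamma.T

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])       # hypothetical state transition matrix
Gamma = np.eye(2)                # controls enter every state element
Q = 0.01 * np.eye(2)             # assumed control (forcing) error covariance
P0 = 0.1 * np.eye(2)             # assumed initial-condition uncertainty

P_pred = propagate_covariance(P0, A, Gamma, Q)
# With positive-definite Q, the propagated uncertainty can only grow:
print(np.trace(P_pred) > np.trace(P0))  # True
```

Both terms are positive semi-definite, so repeated prediction steps without data can only inflate P, which is the formal statement that an unobserved forecast degrades.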
A useful conceptual generalization of these methods is to create ensembles of solutions, e.g., generated by random selection of different initial conditions from the probability density of the initial conditions (e.g., Evensen, 2009), and then using the results to calculate variances of the corresponding solutions. Difficulties lie with the very large number of elements subject to random and systematic errors, the choice of the correct probability densities, and the usual very small number of ensemble members feasible to compute relative to the dimension, e.g., of x(t). For trend determination, accurate knowledge of the overall uncertainties remains important.

Data
Suppose now that at time t = τ > 0 some data are available, written generally, but linearly, as,

y(τ) = E(τ) x(τ) + n(τ),   (4)
where n(τ) is usually assumed to be a zero-mean, unimodally distributed noise process in the observations, with known covariance matrix, R(τ), which is often time-dependent. (Non-linear observations, for example that of a speed, require special treatment.) Observation matrix E(τ), possibly itself having inaccuracies, appropriately distributes the elements of x(τ) making up the observations, which can range from observation of a single element, x_j(τ), to some arbitrarily complicated linear combination of different elements (e.g., weighted averages or differences).

Formally, one can deduce another estimate of x(τ) directly from Eq. (4) as,

x̃_E(τ) = E(τ)^+ y(τ),   (5a)

where E(τ)^+ is a generalized inverse deduced from standard, static, linear inverse methods using appropriate row and column scaling, and would be accompanied by an uncertainty P_E(τ) and a resolution analysis (e.g., W06). Such static, fixed-time calculations for time-dependent systems are uncommon, as E(τ)^+ usually has a vast unknown nullspace in P_E(τ), dependent upon how comprehensive and accurate the data are.

Combining Data and Models
Suppose, as is commonplace in both numerical weather prediction and in reanalyses, that the prediction model is used to forecast the state at time τ, written as x̃(τ, −), from a previous estimated state. Given the initial condition error P(0), a straightforward calculation produces an expected error of the forecast, P(τ, −), from above.
If data also exist at time τ, then a linear inversion, if carried out as in Eq. (5a), provides another estimate of the state, with its own uncertainty, dependent upon R(τ) and the structure of E(τ). Evidently, a better estimate than either alone is to combine them in a weighted average, inversely proportional to their uncertainties, as is conventional in recursive least-squares (e.g., Lawson and Hanson, 1995; W06), rewritten with minor manipulation as,

x̃(τ) = x̃(τ, −) + K(τ) [y(τ) − E(τ) x̃(τ, −)].   (6)

The "gain" matrix is,

K(τ) = P(τ, −) E(τ)^T [E(τ) P(τ, −) E(τ)^T + R(τ)]^(−1).   (7)

In this form, K is the "Kalman (1960) gain" and the operation is the "Kalman filter," which includes, for discrete time, the uncertainty of the combined estimate,

P(τ) = P(τ, −) − K(τ) E(τ) P(τ, −),   (8)

a matrix Riccati equation which is again a result of recursive least-squares. Textbooks prove that the norm ‖P(τ)‖ ≤ ‖P(τ, −)‖, that is, if used realistically, the data cannot worsen the error in the forecast, but can potentially improve it, perhaps greatly, depending upon E(τ), R(τ). (A tilde can sensibly be placed on P, R, but is omitted here.)
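The gain-and-update step can be sketched as follows. The forecast, E, and R values are illustrative placeholders, and the final check mirrors the textbook property that introducing data cannot increase the total uncertainty:

```python
import numpy as np

# Sketch of the Kalman gain and update, combining a forecast and a datum
# inversely weighted by their uncertainties; all numbers are hypothetical.
def kalman_update(x_pred, P_pred, y, E, R):
    """One filter update: gain, state correction, uncertainty reduction."""
    S = E @ P_pred @ E.T + R                 # innovation covariance
    K = P_pred @ E.T @ np.linalg.inv(S)     # gain matrix
    x_new = x_pred + K @ (y - E @ x_pred)   # weighted-average estimate
    P_new = P_pred - K @ E @ P_pred         # updated uncertainty
    return x_new, P_new

x_pred = np.array([1.0, 0.0])               # forecast state
P_pred = 0.5 * np.eye(2)                    # forecast uncertainty
E = np.array([[1.0, 0.0]])                  # observe only the first element
R = np.array([[0.05]])                      # observation-noise covariance
y = np.array([1.2])                         # the (noisy) datum

x_new, P_new = kalman_update(x_pred, P_pred, y, E, R)
# The data cannot worsen the forecast uncertainty:
print(np.trace(P_new) <= np.trace(P_pred))  # True
```

Note that only the observed element's variance collapses; the unobserved second element keeps its full forecast uncertainty, a small-scale version of the nullspace issue discussed below.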

For all these and related methods, a potentially very important, but implicit, assumption is ⟨n(t) n(t′)^T⟩ = 0, t ≠ t′, that is, observational noise is uncorrelated over time. Similarly, ⟨u(t) u(t′)^T⟩ = 0, t ≠ t′, and ⟨n(t) u(t′)^T⟩ = 0. If the assumptions fail, a general approach is to model the structures of n(t), u(t) as part of the problem, essentially augmenting the state vectors.

A slight modification of the system is to combine Eqs. (2) and (6) into the "innovation" forms, again setting the unknown Γ(t)u(t) = 0,

x̃(t) = A(t − Δt) x̃(t − Δt) + B(t − Δt) q(t − Δt) + K(t) [y(t) − E(t) x̃(t, −)],   (9)

or,

x̃(t) = [I − K(t) E(t)] [A(t − Δt) x̃(t − Δt) + B(t − Δt) q(t − Δt)] + K(t) y(t),   (10)

(Goodwin and Sin, 1984, p. 251), whose importance is that both show explicitly that data introduction acts as an analogue of externally imposed forcing.

Simple Example: Mass-Spring Oscillator
For a simple, intuitively accessible analogue system, consider the mass-spring oscillator (following any of McCuskey, 1959; Goldstein, 1980; W06; Strang, 2007) in the conventional continuous-time formulation of simultaneous differential equations. This choice is obviously somewhat arbitrary, but it is both intuitively accessible and contains sufficient structure to permit the discussion, e.g., of spatial-average data, fixed mass positions, etc.
Three identical masses, m = 1, are connected to each other and to a wall at either end by springs of identical constant, k, with linear damping coefficient r and forcings q_i(t), so that,

m d²ξ1/dt² = −k ξ1 + k (ξ2 − ξ1) − r dξ1/dt + q1(t),
m d²ξ2/dt² = −k (ξ2 − ξ1) + k (ξ3 − ξ2) − r dξ2/dt + q2(t),   (11)
m d²ξ3/dt² = −k (ξ3 − ξ2) − k ξ3 − r dξ3/dt + q3(t),

where ξi(t) is the displacement of mass i. This second-order system is reduced to a canonical form of coupled first-order equations by introduction of a continuous-time state vector, the column vector,

x_c(t) = [ξ1(t), ξ2(t), ξ3(t), dξ1/dt, dξ2/dt, dξ3/dt]^T,

where superscript T denotes the transpose. Note the mixture of dimensional units in the elements of x_c(t), identifiable with the Hamiltonian variables of position and momentum. Then Eqs. (11) become (setting m = 1, or dividing through by it),

dx_c(t)/dt = A_c x_c(t) + B_c q_c(t),   A_c = [ 0    I
                                               K_c  R_c ],   (12)

defining the 3x3 block matrices K_c, R_c, symmetric and diagonal respectively, and constant. B_c distributes inputs, q_c = [q_c1, ..., q_c6]^T, variously amongst the six sub-equations. Putting, e.g., r = 0.5, k = 30, A_c is full-rank with 3 pairs of complex conjugate eigenvalues, but non-orthonormal right eigenvectors. These parameter values are generally used throughout. Here, and in what follows, the system is notationally simplified by using time-constant A, B. The physics and mathematics of such small oscillations is discussed in most classical mechanics textbooks and is omitted here. Although a non-dimensionalization is straightforward, what follows is left in dimensional form to make the results most intuitively accessible.
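A minimal numerical sketch of A_c follows, using the stated values r = 0.5, k = 30; the block layout and sign conventions are the standard first-order reduction assumed here, not quoted verbatim from the text:

```python
import numpy as np

# Sketch of the continuous-time system matrix A_c for the three-mass chain
# (m = 1), with the stated parameter values r = 0.5, k = 30.
k, r = 30.0, 0.5
Kc = k * np.array([[-2.0,  1.0,  0.0],
                   [ 1.0, -2.0,  1.0],
                   [ 0.0,  1.0, -2.0]])   # negative-definite spring coupling
Rc = -r * np.eye(3)                        # linear damping
Ac = np.block([[np.zeros((3, 3)), np.eye(3)],
               [Kc,               Rc]])

eigvals = np.linalg.eigvals(Ac)
# Full rank, with three complex-conjugate eigenvalue pairs:
print(np.linalg.matrix_rank(Ac))          # 6
print(np.all(np.abs(eigvals.imag) > 0))   # True
```

With uniform damping the eigenvalues all share the real part −r/2, so every mode decays on the same time-scale while oscillating at a distinct frequency.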

Energy
Consider now an energy principle. Define, without dissipation (R_c = 0) or forcing,

E_c(t) = (1/2) [ (dξ/dt)^T (dξ/dt) − ξ^T K_c ξ ],   ξ = [ξ1, ξ2, ξ3]^T,   (13)

the sum of the kinetic and potential energies (the minus sign compensates for the negative definitions in K_c), which is here a Hamiltonian. The non-diagonal elements of K_c redistribute the potential energy amongst the masses through time.
With finite dissipation and forcing, dE_c(t)/dt ≠ 0; the energy is conserved, dE_c(t)/dt = 0, only if the forcing and dissipation vanish (see, e.g., Cohn, 2010, for a formal discussion of such continuous-time systems).

Discrete Version
Write the continuous system, Eq. (12), in discrete time, with time step Δt, as,

x(t) = A x(t − Δt) + B q(t − Δt),

and the prediction model is unchanged except that now A, B are the discrete-time matrices. An example for the nearly dissipationless, unforced example of the oscillator solution, from the discrete formulation, is shown in Fig. 2. The energy, computed from the discrete state vector, is formally identical to that in the continuous case, E_c(t), and the potential and kinetic energies through time are also shown. The basic oscillatory nature of the state vector elements is plain, and the decay time is also visible.
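A sketch of the discrete integration with energy tracking; a forward-Euler discretization, A = I + Δt A_c, is assumed here purely for illustration, since the exact discrete transition matrix is not specified above:

```python
import numpy as np

# Sketch: integrate the discrete oscillator and track its mechanical
# energy, using the stated r = 0.5, k = 30, dt = 0.001. The Euler step
# A = I + dt*A_c is an assumption made for this illustration.
k, r, dt = 30.0, 0.5, 0.001
Kc = k * np.array([[-2.,  1.,  0.],
                   [ 1., -2.,  1.],
                   [ 0.,  1., -2.]])
Ac = np.block([[np.zeros((3, 3)), np.eye(3)],
               [Kc, -r * np.eye(3)]])
A = np.eye(6) + dt * Ac                  # assumed discrete transition matrix

def energy(x):
    xi, v = x[:3], x[3:]
    return 0.5 * (v @ v - xi @ Kc @ xi)  # minus sign: K_c is negative definite

x = np.array([1., 0., 0., 0., 0., 0.])   # initial displacement of mass 1
E0 = energy(x)                            # = 30.0 here
for _ in range(5000):                     # integrate 5 time units
    x = A @ x
# The unforced, damped energy decays but stays positive:
print(E0 > energy(x) > 0)  # True
```

The decay over the run is set by the dissipation time 1/r; a discretization that exactly conserved the undamped Hamiltonian would require a symplectic rather than an Euler step.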

Mass-Spring Oscillator with Observations
If the innovation form of the evolution, Eq. (9), is used, the energy change acquires an explicit contribution from the data increment, K(t)[y(t) − E(t)x̃(t, −)], showing explicitly the influence of the observations on the computed energy. With intermittent observations and/or with changing structures, E(t), the estimated E_c(t) will undergo forced abrupt changes, as expected.
Given the very large number of potentially erroneous elements in any choice of model, data, and data distributions, and the ways in which they interact when integrated through time, a comprehensive discussion even of the 6-element state vector mass-spring oscillator system is difficult. Instead, some simple examples exploring primarily the influence of data density on the state estimate and on its mechanical energy are described. One can experiment with the model and its time-constants, model time-step, accuracies and corresponding covariances of initial conditions, boundary conditions, data, etc. The basic problems of any linear system already emerge in this simple example.
Consider, using the same k, r, and Δt = 0.001, a "truth" in which the forcing is Bq(t) = q1(t) = 0.1 cos(2πt/(2.5 T_diss)) + ε(t); that is, only mass 1 is forced in position, with a low frequency not equal to one of the natural frequencies. T_diss = 1/r is the dissipation time, and ε(t) is a white noise element. The initial condition is ξ1(0) = 1, all other elements vanishing; see Fig. 3.
Accumulation of the influence of the stochastic element in the forcing clearly depends upon details of the model time-scales and, if ε(t) were not white noise, on its spectrum as well. In all cases, the cumulative effect of a random forcing will have the nature of a random walk, with details dependent upon the forcing structure, as well as the memory elements of the model time scales.
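The random-walk character of accumulated stochastic forcing can be illustrated with a purely scalar sketch; none of these numbers come from the oscillator system:

```python
import numpy as np

# Sketch: summing white-noise forcing through time produces a random walk
# whose ensemble variance grows with the number of steps. Illustrative only.
rng = np.random.default_rng(0)
n_members, n_steps = 500, 1000
eps = rng.standard_normal((n_members, n_steps))   # white-noise forcing
walk = np.cumsum(eps, axis=1)                     # accumulated effect

# Ensemble variance grows roughly linearly with step number:
var_early = walk[:, 99].var()
var_late = walk[:, 999].var()
print(var_late > 5 * var_early)  # True (expected ratio near 10)
```

For a model with finite memory (here, the dissipation time), the growth eventually saturates instead of continuing indefinitely, but over intervals short compared with that memory the walk-like accumulation dominates.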

210
The prediction model (Figs. 4, 5) has correct initial conditions and A, B matrices, but is forced by the deterministic component with 1/2 the correct amplitude, and with the stochastic component being treated as fully unknown, replaced by its zero mean. The added noise in the measurements has a standard deviation of 0.2 of that of the total forcing, the latter standard deviation including that of the deterministic contribution.

To demonstrate the most basic problem of energy, consider a nearly-perfect observation of all 6 positions at two times, τ1 and τ2, as displayed in Fig. 4, with E = I6. No observational null-space exists. Although the new estimate of the state vector is an improvement over that from the pure forecast, any effort to calculate a trend in the energy of the system will fail unless very

Quadratic Variability
In a linear system, a Gaussian assumption for the dependent variables is commonly appropriate. By focussing here on the quadratic invariant of energy, the variables become χ²-distributed. Thus the ξi², ξ̂i² have such distributions, but with differing means and variances, and with potentially very strong correlations, so that they cannot be regarded as independent variables.

Determining the uncertainties of the six uncertain, covarying elements making up E(t) involves some intricacy. A formal analysis can be made of the resulting probability distribution for the sum in E(t), involving non-central χ² distributions (Imhof, 1961; Sheil and O'Muircheartaigh, 1977; Davies, 1980). In view of the purpose and simplicity of this example, however, an estimate of the uncertainty was made by simply generating 50 different versions of the observations, differing in the particular choice of noise value in each one, and tabulating the resulting range. These uncertainties can be used to calculate, e.g., the significance of any apparent trend in E(t), and although the result is not displayed here, use of reliable uncertainties can make an obvious, important change in any physical inferences about the state. In these examples, the observational errors are intentionally made relatively small, with no implications for what could be the case in geophysically realistic situations. Notice that even in the observation interval, the estimated mechanical energy remains too low. This bias error is a systematic one, owing to the availability of observations only of the velocity of one of the masses. Even if the observations are made perfect ones (not shown), this bias error in the energy persists (see, e.g., Dee, 2005).
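The brute-force uncertainty estimate described above can be sketched as follows; the observed quantity and noise level are hypothetical stand-ins, not the filter output of the figures:

```python
import numpy as np

# Sketch of the tabulation described in the text: generate many versions of
# an observation differing only in the noise realization, and record the
# spread of the derived quadratic (energy-like) quantity. Illustrative only.
rng = np.random.default_rng(1)
true_velocity = 0.8                    # hypothetical true observed element
noise_std = 0.05
n_versions = 50                        # as in the text

energies = []
for _ in range(n_versions):
    v_obs = true_velocity + noise_std * rng.standard_normal()
    energies.append(0.5 * v_obs**2)    # kinetic-energy contribution
energies = np.array(energies)

spread = energies.max() - energies.min()
# The noisy quadratic estimates scatter over a finite, strictly
# positive range, which bounds any claimed trend significance:
print(energies.min() > 0 and spread > 0)  # True
```

Because the quantity is quadratic in the noisy variable, its expected value exceeds the noise-free value, a small-scale version of the systematic bias noted in the text.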
As seen in the figure, with full-rank, near-perfect observations, the elements of x_i(t) and the total energy are forced to nearly the correct values at the two observation times, τ_i, but they diverge in following times.

A Fixed Position
Exploration of the dependencies of energies of the mass-spring system is interesting and a great deal more can be said.

Turn, however, to a somewhat different invariant: suppose that one of the mass positions is fixed, but with a value unknown to the analyst. A significant literature exists devoted to finding changes in scalar quantities such as global mean atmospheric temperatures, or oceanic currents, with the Atlantic Meridional Overturning Circulation (AMOC) being a favorite focus. These quantities are typically sub-elements of complicated models involving very large state vectors. With this very simple mass-spring oscillator system, it is useful to consider a situation in which an element is a constant, an invariant, but one which must be determined from the sequential estimation procedure.
Using the same situation as above, added constraints, x3(t) = ξ3(t) = 2, x6(t) = dξ3(t)/dt = 0, that is, an unmoving, fixed displacement of mass 3, are used in computing the true state vector. The observations are of the velocity of moving mass i = 2, with similar noise, in the interval shown in Fig. 6. The question is whether one can infer accurately that ξ3(t) is a constant through time. Note that the fixed displacement means that the potential energy can never vanish. The resulting estimate for the position, ξ̂3(t), is shown in Fig. 6 and includes a significant error at all times.
Position variation occurs even during the data-dense period and arises both from the entry of the data and from the noise in the observations of dξ2(t)/dt. An average taken over the two halves of the observation interval might lead to the erroneous conclusion that a decrease had taken place. Such an incorrect inference can be precluded by appropriate use of the computed uncertainties (Fig. 6).

Observations of Averages
Consider now a set of observations of the average of the positions of masses 2 and 3, and of the average velocity of masses 1 and 2, mimicking the type of observations that might be available in a realistic setting. Again for simplicity, the observations are very accurate and occur in two different sets of periodic time intervals. The results are in Fig. 7. Position estimates shown are good, but not perfect, as is also true for the total energy. Visually, it is clear that the energy estimate carries oscillatory power with the periodicity of the oncoming observation intervals, which appears in the spectral estimate (not shown) as excess energy in the oscillatory band and somewhat low energy at the longest periods. Irregular spacing would introduce a potentially complex spectrum in the result.
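The averaging observations can be written as rows of E acting on the 6-element state; a sketch, with a hypothetical state vector:

```python
import numpy as np

# Sketch of an observation matrix E for averaged measurements: row 1 is
# the average position of masses 2 and 3, row 2 the average velocity of
# masses 1 and 2, for the state [xi1, xi2, xi3, v1, v2, v3]. The state
# values below are illustrative placeholders.
E = np.array([[0.0, 0.5, 0.5, 0.0, 0.0, 0.0],   # (xi2 + xi3)/2
              [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]])  # (v1 + v2)/2

x = np.array([1.0, 2.0, 4.0, 0.2, 0.6, -0.1])   # hypothetical state
y = E @ x
print(y)  # -> [3.  0.4]
```

Each averaging row has a two-dimensional nullspace contribution: any perturbation with equal and opposite signs in the averaged pair leaves y unchanged, which is why averaged data alone cannot resolve the individual elements.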
A more general discussion of nullspaces involves that of the weighted P(τ, −)E^T appearing in the Kalman gain. If E is the identity, and R(τ) has sufficiently small norm, all elements of x(τ) are resolved. If the noise is uniform in all elements,

Similar results will apply, e.g., to changing the observational accuracies (and biases), as well as the number of observations of individual or average elements x_i(t). With these parameters, the change is not large relative to the background, but as a climate analogue, the importance would depend upon the physical significance of a small change (e.g., Wunsch, 2020).

The Uncertainties
The structure of the uncertainties depends upon both the model and the detailed nature of the observations. Consider P(t) for the case above. Notice the changing variances along the diagonal, and the sometimes strong covariances implied amongst the different elements of x̃(t), after 10 observations have been used. One of the eigenvalues of P(t) is almost zero, meaning that P(t) is nearly singular. In this case, the only observation was a relatively accurate one, of the velocity of the second mass. The eigenvector corresponding to the near-zero eigenvalue is close to 1 in position 5 (corresponding to the observed dξ2/dt) and zero elsewhere. The implication is that, because very good observations were made of dξ2/dt, its uncertainty almost vanishes here, and a weighting of values by P(t)^(−1) gives it a near-infinite weight at that time.
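The near-singularity of P after an accurate observation of a single element can be reproduced in a minimal sketch; the uniform prior uncertainty is an assumption made here for illustration:

```python
import numpy as np

# Sketch: one accurate observation of a single state element collapses the
# corresponding variance, leaving P with a near-zero eigenvalue whose
# eigenvector points at that element. All numbers are illustrative.
P_pred = 0.5 * np.eye(6)                  # uniform prior uncertainty
E = np.zeros((1, 6)); E[0, 4] = 1.0       # observe element 5 (index 4)
R = np.array([[1e-6]])                    # very accurate observation

S = E @ P_pred @ E.T + R
K = P_pred @ E.T @ np.linalg.inv(S)
P = P_pred - K @ E @ P_pred

w, V = np.linalg.eigh(P)                  # eigenvalues in ascending order
# Smallest eigenvalue nearly zero; its eigenvector is the observed element:
print(w[0] < 1e-5)                        # True
print(np.abs(V[4, 0]) > 0.99)             # True
```

Weighting by P^(−1), as in least-squares objective functions, then assigns that direction an enormous weight, which is the numerical hazard noted in the text.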

A Fixed-Interval Smoother
As already noted, most physical models in use include some form of invariant principles, including quadratic ones related to energy, and linear ones related, e.g., to vorticity, or to positions or flows. These principles are violated whenever the model is combined with data. A reasonable inference for science generally is that no system that violates conservation rules for mass or energy, etc., can be physically understood in a meaningful way. The need for system descriptions over finite intervals that do satisfy such principles leads to the notion of "smoothers", in which the state vector over a finite interval satisfies simultaneously a fully-known model, and the data within error bars, such that no invariant violation occurs.
The basic notion is to find the corrections, Γũ(t), the controls, such that the suitably modified prediction model produces a new, third, state estimate, x̃(t, +), obeying the model time-evolution while simultaneously being consistent, within error bars, with all the data. The plus sign denotes the use of formally future data. In that way, the usual invariants of energy, etc., are restored. x̃(t, +) is generally a better estimate than is x̃(t), because it "knows" of the occurrence and values of observations future to the time t and accounts for them, with potentially important implications for the new estimates of initial conditions. Application is made below to a slightly more geophysically identifiable example.

The idea of smoothers is again a control theory construct (see Liebelt, 1967; Anderson and Moore, 1979; Brogan, 1991; W06, among many others), and algorithmically a number of different approaches for linear systems have been developed.

Consider now the problem of reconstruction of invariants over the entire time-span of estimation. Realism is still not the goal; rather it is the demonstration of various elements making up estimates in simplified settings, in another exercise in geophysical fluid statistics (GFS). A highly idealized oceanographic problem is described.

The Smoothing Problem
A variety of smoothing approaches exists (e.g., Anderson and Moore, 1979; Goodwin and Sin, 1984; Stengel, 1986). Here the "fixed interval" smoother is of most interest. The fundamental idea is straightforward: to find a weighted least-squares fit of the invariant-conserving model, Eq. (2), to the data, Eq. (4), at the sampling times. The ECCO Project (e.g., Fukumori et al., 2018, 2019) is a global ocean example of a smoothed solution consistent with the usual invariants, albeit one using a different approach.

Consider the sequential method of the RTS smoother, in which the assumption is made that the KF has already been used, rigorously, over the finite interval 0 ≤ t ≤ T_dur, producing the estimates denoted x̃(t, −), x̃(t), with their corresponding uncertainty covariances P(t, −), P(t). At this stage, no further discussion of the data occurs: all information contained in the observations has been exploited by the KF and is encompassed in x̃(t) and its uncertainty P(t). What has not been exploited in an estimate x̃(t), P(t), is the information contained in data that were obtained afterwards, t + mΔt > t, but that information is present in any later estimates x̃(t + mΔt), P(t + mΔt).
The underlying idea is the same as that employed above: if two estimates of state or other variables are available, a better estimate is made by combining them, inversely weighted by their uncertainties. The resulting RTS algorithm is more complex in appearance than is the KF, because all of the later estimates have a finite correlation with the previous ones, and they cannot be simply combined without first removing that correlation. (The pure scalar state vector is readily analyzed without any matrix/vector algebra and is written out in Appendix B.) These new estimates become,

x̃(t, +) = x̃(t) + L(t) [x̃(t + Δt, +) − x̃(t + Δt, −)],
L(t) = P(t) A^T P(t + Δt, −)^(−1),
P(t, +) = P(t) + L(t) [P(t + Δt, +) − P(t + Δt, −)] L(t)^T,

involving the estimates x̃(t + Δt), P(t + Δt, −), P(t + Δt) at a formally future time, t + Δt. As before, the + in the argument is used to label the estimates of these variables as now having employed the formally future data. One can examine putative steady-state behavior of the smoothing equations, to the extent it is plausible. Note that an estimate of the control, ũ(t, +), is obtained from the difference between the old forecast estimate and the value obtained using future data.
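One backward step of the smoother can be sketched in the scalar case, in the spirit of the Appendix B example mentioned above; all numbers are illustrative:

```python
import numpy as np

# Sketch of one backward RTS step for a scalar system: combine the filtered
# estimate at time t with the smoothed estimate at t+dt, after removing
# their correlation through the smoother gain. Values are placeholders.
A = 0.9                                    # scalar transition "matrix"
x_filt, P_filt = 1.0, 0.4                  # KF estimate at time t
x_pred_next = A * x_filt                   # forecast to t+dt
P_pred_next = A**2 * P_filt + 0.1          # with assumed control variance 0.1
x_smooth_next, P_smooth_next = 1.1, 0.2    # smoothed values at t+dt (given)

L = P_filt * A / P_pred_next               # smoother gain
x_smooth = x_filt + L * (x_smooth_next - x_pred_next)
P_smooth = P_filt + L**2 * (P_smooth_next - P_pred_next)

# Formally future data can only reduce (or leave unchanged) the uncertainty:
print(P_smooth <= P_filt)  # True
```

Iterating this step backward from t = T_dur to t = 0 produces the full fixed-interval solution, including improved initial conditions.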

Example: Rossby Wave Normal Modes
A more recognizably geophysical example than that above is the flat-bottom, linearized β-plane Rossby wave system, whose governing equation is,

∂(∇²ψ1)/∂t + β ∂ψ1/∂x = q(x, y, t),   (28)

in a square beta-plane basin of horizontal dimension L. This problem is taken to be representative of those involving both space and time structures, including boundary conditions. (Spatial variables x, y should not be confused with the state vector or data variables.) Eq. (28) and other geophysically important equations are not self-adjoint, and the general discussion of quadratic invariants leads inevitably to adjoint operators (see Morse and Feshbach, 1953, or, for bounding problems, Sewell, 1987).
Introduce non-dimensional primed variables, t′ = ft, x = Lx′, q = q0 q′, ψ1 = (a²/f) ψ1′. The Coriolis parameter f and β are evaluated at 30°N. Letting a be the Earth radius, and β′ = βa/f = 1.7, the non-dimensional equation becomes, choosing further L = a, and then omitting the primes from here on except for β′,

∂(∇²ψ1)/∂t + β′ ∂ψ1/∂x = q.

Hairer et al. (2006) describe numerical solution methods that specifically conserve invariants, but these are not discussed here.
This system was used by Gaspar and Wunsch (1989) for a demonstration of sequential estimation using altimetric data. Here a 360 different state vector will be used.
The solution used is the sum over normal modes satisfying the boundary conditions, ψ1 = 0, written in the standard basin-mode form,

ψ1(x, y, t) = Σ_{n,m} c_nm sin(nπx) sin(mπy) exp(−iβ′x/(2σ_nm)) exp(iσ_nm t),

and obeying the non-dimensional dispersion relation,

σ_nm = β′ / (2π (n² + m²)^(1/2)),

where c_nm is a coefficient dependent upon initial conditions and any forcing present; see especially Pedlosky (1965).
The problem is now made a bit more interesting by addition to ψ1 of a steady component, the solution ψ_s(x, y) from Stommel (1948), whose governing equation in this non-dimensional form is,

R_a′ ∇²ψ_s + β′ ∂ψ_s/∂x = β′ sin πy,

where R_a′ is a Rayleigh friction, R_a′ = R_a/f, with the solution here written in the simple boundary-layer/interior approximation,

ψ_s = exp(−xβ′/R_a′) sin πy + (x − 1) sin πy,

which leads to a small error in the eastern boundary condition. The sin πy arises from Stommel's assumed time-independent wind-curl.
For the time-dependent components, the state vector is,

x2(t) = [c_p(t)], p = 1, ..., N × M,

where p is a linear ordering of the pairs (n, m), of total dimension N × M = N_state − 1, which is equal to the number of n values times the number of m values, and the state transition equation is,

x2(t) = A2 x2(t − Δt),

with a complex, diagonal state transition matrix, A2 = diag(exp(−iσ_p Δt)), square, of dimension N_state − 1. A small numerical dissipation is introduced, multiplying A2 by exp(−bΔt), b > 0, to accommodate loss of memory, e.g., as a conventional Rayleigh dissipation. Some special care in computing covariances must be taken when using complex state vectors and transition matrices (Schreier and Scharf, 2010).
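The diagonal complex transition step can be sketched directly; the modal frequencies below are placeholders, while Δt = 29 and b = 1.8 × 10⁻³ follow the values quoted later in the text:

```python
import numpy as np

# Sketch of the diagonal complex transition for modal amplitudes: each mode
# advances by a phase rotation exp(-i sigma_p dt) with a weak uniform decay
# exp(-b dt). The frequencies sigma are hypothetical placeholders.
dt, b = 29.0, 1.8e-3
sigma = np.array([0.011, 0.019, 0.027])        # hypothetical modal frequencies
A2 = np.exp(-b * dt) * np.diag(np.exp(-1j * sigma * dt))

c = np.array([1.0 + 0.0j, 0.5 + 0.5j, 0.2 - 0.1j])   # modal amplitudes
c_next = A2 @ c

# The step rotates each mode's phase and shrinks every amplitude by the
# same factor exp(-b*dt):
print(np.allclose(np.abs(c_next), np.exp(-b * dt) * np.abs(c)))  # True
```

Because the transition matrix is diagonal, the modes evolve independently; the observations alone couple them in the estimation problem.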
The time-independent flow is included by augmenting the state vector with a constant element multiplying ψ_s, and again,

x(t) = A x(t − Δt) + B q(t − Δt),   (34)

where complex A is the same as A2 except with an added zero row and column, and a single non-zero element, A(N_state, N_state) = 1. Eq. (34) here is taken to exactly describe the putative "truth"; q_{N_state}(t) = 0, because the Stommel solution results from a steady wind.
Consistent with the analysis in Pedlosky (1965), no westward intensification exists in the normal modes, which decay as a whole. Rayleigh friction of the time-dependent modes is permitted to differ from that in the time-independent mean flow, a physically acceptable situation. The value b = σ11/30 = 1.8 × 10⁻³ is used. No particular realism is intended here in the choices of numerical amplitudes, data properties, etc. They are chosen only to demonstrate the estimation issues.
Eq. (34) is here taken to be "truth" and to generate the correct fields. As would be necessary in practice, a "prediction" model is introduced as,

x̃(t, −) = A x̃(t − Δt) + B q_p(t − Δt),

with the only differences from the truth model being in the initial conditions and forcings.
If q(t) = 0 and with no dissipation, then Eq. (28) has several useful invariants: the quadratic invariants of the kinetic energy and of the "energy" in ψ, x^T(t)x(t) (complex-conjugate transpose); and the linear invariant of the vorticity, or circulation, when integrated over the entire basin domain. As above, estimates of the quadratic and linear invariants will depend explicitly on initial conditions, forces, distribution and accuracy of the data, and the covariances and bias errors assigned to all of them.
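The quadratic invariant is easy to verify numerically: with q(t) = 0 and b = 0 the diagonal transition matrix is unitary, and x^H x is conserved to rounding error. A hedged sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
sigma = rng.uniform(0.5, 2.0, N)             # illustrative modal frequencies
dt = 0.1

# No dissipation, no forcing: A is unitary (all eigenvalues on the unit circle)
A = np.diag(np.exp(-1j * sigma * dt))
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

E0 = np.vdot(x, x).real                      # quadratic invariant x^H x
for _ in range(500):
    x = A @ x
E1 = np.vdot(x, x).real                      # unchanged after 500 steps
```

With b > 0 or nonzero forcing q(t), this quantity is of course no longer conserved, which is the point of the estimation experiments that follow.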

System with Observations
A different problem is now posed: determining the transport of the western boundary current (WBC), here assumed to be a constant (invariant), in the presence of both physical noise (the normal modes) and the random noise of the observations y(τ_i). For determining the WBC transport, the presence of both natural noise (the time-dependent modes) and observational noise is analogous to the true physical circulation problem. Non-dimensional normal mode frequencies and periods for n = 3, 4, 5, m = 4, 5, ..., 9 are shown in Fig. 10. Δt = 29, 1/b = 553, R_a/β = 0.29. Initial modal amplitudes are taken to have a slightly "red" property. The field ψ(t = 167Δt) is shown in Fig. 11, keeping in mind that, apart from the time-mean ψ, the structure is the result of a particular set of random forcings.
Noisy observations, y(t), are taken at the positions in Fig. 11. The prediction model has the correct A, but the magnitudes of the initial conditions are 20% too large, and the forcing field magnitude of q_p is 50% too small (the forcing is complex white noise).

Aliasing
In isolation, the observations will time-alias the field if not taken at minimum intervals of 1/2 the shortest period present (here 4Δt). A spatial alias occurs if the separation is greater than 1/2 the shortest wavelength present (here Δy = 1/9). Both these phenomena are present in what follows, but their impact is minimized by the presence of the time-evolution model. The RTS algorithm assumes that a proper KF result has been computed and the results stored. P(0) = diag_{N_state}(1), the initial condition uncertainty, is uniform amongst the elements. Here, as shown in Fig. 12, a priori knowledge that higher frequencies have smaller initial values is not being used. Q(t) = diag(0.015), except for x_{N_state}(t), for which Q_{N_state,N_state}(t) = 0, and R(t) = diag_{N_obs}(0.6 × 10⁻³). The system is run with the knowledge that the time-mean wind is truly constant. Observations are available spatially as in Fig. 11, initially at intervals of 50Δt beginning at t = 166Δt (Fig. 12) and then more densely at 25Δt spacing, a crude mimicking of observations becoming more dense with time. Observations cease prior to T_dur, mimicking a pure prediction interval following the observed states.
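The temporal aliasing just described can be illustrated directly: a sinusoid sampled at an interval longer than half its period produces samples identical to those of a lower-frequency wave. A small sketch with assumed, illustrative frequencies:

```python
import numpy as np

f_true = 0.4                 # true frequency (cycles per time unit)
dt = 2.0                     # sampling interval; Nyquist frequency = 1/(2*dt) = 0.25 < f_true
fs = 1.0 / dt                # sampling rate
f_alias = abs(f_true - fs)   # 0.1: the frequency the samples appear to carry

t = np.arange(0.0, 100.0, dt)
samples_true = np.cos(2 * np.pi * f_true * t)    # undersampled true signal
samples_alias = np.cos(2 * np.pi * f_alias * t)  # lower-frequency impostor
# The two sample sequences are indistinguishable: the observations alone
# cannot separate f_true from f_alias without the time-evolution model.
```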

The energy, Φ(t) = Σ_{nm} |α_{nm}(t)|², in the time-dependent components is shown for the true value and the KF estimate in Fig. 12. It is a surrogate for the total system energy and is a quadratic variable. A slow increase is visible in the true value and in the prediction value, with a levelling off at around time-step 200, again a combination of the dissipation and the white-noise random-walk increase. Until the first observation time, the predicted energy is identical to that of the KF, Φ_pred(t) = Φ_kf(t); at the first observation the latter takes a jump towards the true value, but remains low. As additional observations accumulate, the estimate improves. This system can theoretically be over-determined by letting the number of observations at time t exceed the number of unknowns, should the null space of E(t) then vanish. As expected, with 14 covarying observations and 18 time-varying unknowns x_i(t), E(t) has a nullspace (rank 12), and thus energy in the true field is missed even if the observations were perfect.
As is conventional in inverse methods, the smaller eigenvalues and their corresponding eigenvectors are most susceptible to noise biases. The solution nullspace of this particular E(t) is found from the solution eigenvectors of the singular value decomposition, UΛV^T = E. The solution resolution matrix at rank K = 13, V_K V_K^T, where V_K contains the first K columns of V, is shown in Fig. 13. Thus the observations carry no information about modes (as ordered) 3, 6, 9, 12, 15, 18. In a real situation, if control over positioning of the observations were possible, this result could sensibly be modified and/or a strengthening of the weaker singular values could be achieved. Knowledge of the nullspace structure is very important in the interpretation of results.
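The resolution analysis can be sketched as follows. Here a random E is an illustrative stand-in for the paper's covarying observation matrix, so its rank is the full 14 rather than 12; V_K V_K^T is an orthogonal projector whose trace equals the rank, and the remaining N_state − K directions form the nullspace:

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_state = 14, 18
E = rng.standard_normal((n_obs, n_state))   # illustrative observation matrix

# SVD: E = U diag(lam) Vt; right singular vectors span the solution space
U, lam, Vt = np.linalg.svd(E, full_matrices=True)
K = int(np.linalg.matrix_rank(E))           # 14 for a generic random matrix
V_K = Vt[:K].T                              # first K columns of V

T_v = V_K @ V_K.T                           # solution resolution matrix
null_dim = n_state - K                      # unresolved (nullspace) directions
```

Directions in the nullspace (here 4 of them) receive no information from the observations, however accurate those observations are.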
A more general discussion of nullspaces involves that of the weighted P(τ,−)E^T appearing in the Kalman gain. If P(τ,−)^{1/2} is the Cholesky factor of P(τ,−) (W06, p. 56), then EP(τ,−)^{1/2} is the conventional column-weighting of E at time τ, and the resolution analysis would be applied to that combination. In the present special case, A is diagonal, and thus it does not distribute amongst the elements x_j(τ) any covariance information that would be carried in P(τ,−).

Figure 12. (a) The non-dimensional "energy", Φ(t) = Σ_{nm} |α_{nm}(t)|², for the true and prediction models of the barotropic system. In the prediction model, the initial conditions are 20% too large, and the forcing is 50% too small (but is otherwise identical to the true value). Time positions of the data, initially at every 50Δt and then at every 25Δt, are shown. Prior to the data onset, the energy is that given by the prediction model. After the data interval, power is also from the prediction model, but starting with the final KF analysis estimate. Jumps in the KF power at observation times are visible, especially at the time of the first observation. The smoothed solution carries too much energy prior to the first observations, as the system has no information about the growth of power before that time and the uncertainty assigned to the actual initial conditions is large. (b) The smoothed solution energy when the initial conditions are set to be essentially perfect, showing the estimated reduced power towards the origin, which does not occur when a finite uncertainty is assigned (as in (a)).
An important observational goal is determination of the north-south transport at each time-step from the velocity or stream function: at fixed latitude index j₀, it is the stream-function difference between a longitude pair, i, i₀. From the boundary condition, ψ(i = 1, j₀, t) = 0 identically. In the present context, five different values are relevant: (1) the true constant, invariant, value,

(2) the true apparent value including the oscillatory mode noise, (3) the estimated value from the prediction model, (4) the estimate from the KF, and (5) the estimate from the smoother. In the KF (Fig. 14), observations move the WBC values closer to the truth, but retain the normal-mode noise. Prior to the first observation, the values are indistinguishable from zero. Within the observation interval, the estimates are indistinguishable from the true value, but still have a wide uncertainty, with time scales present both from the natural variability and from the regular injection times of the data. Transport value uncertainties are derivable from the P of the state vector.

Turning now to the RTS smoother, one sees in Fig. 12 that the energy, Φ_smth(t), in the smoothed solution is continuous (up to the usual time-stepping changes), but sometimes exceeds the true energy prior to the appearance of the first observation. The only information available to the smoother prior to the observational interval lies in the initial conditions, which were provided only with a very large uncertainty; the unknown u(t) in this interval also has a large variance.
The smoother solution in the pre-data interval differs more widely from the true value than does the KF solution. That behavior is a consequence of the comparatively large uncertainty estimate assigned to the initial conditions. If the initial conditions are made near-perfect (Fig. 12), then they are reproduced in the smoother solution and the reduced energy in that interval is also the best estimate. An element of the control vector through time, and its standard error, are shown in Fig. 15. The complex result of the insertion of data is apparent. As with the KF, discussion of any systematic errors has to take place outside of the formalities leading to the smoothed solution.
Fig. 16 shows the behavior of the estimate of the WBC transport and its uncertainty when the smoothing algorithm is used with two different data densities. A test of the hypothesis that the transport is indistinguishable from constant would be based upon an analysis using the uncertainty (not shown here). The very large uncertainty prior to the onset of data, even with use of a smoothing algorithm, is a central reason that the ECCO estimate (e.g., Fukumori et al., 2018) is confined to the interval following 1992, when the data become far denser than before. Estimates prior to a dense data interval will depend greatly upon the time durations built into the system, which in the present case are limited by the longest normal-mode period. The real ocean does include some very long memory (Wunsch and Heimbach, 2008), but the skill will depend directly on the specific physical variables of concern, which in ECCO include the time-sensitive flow field. The gain matrix M(t) for computation of the control vector is displayed in Fig. 18. Here the dependence is directly upon the a priori known Q(t), the data distributions, and the determinants of P(t + Δt, −). The limiting cases discussed above for the state vector also provide insights here.

Spectra
Computation of spectral estimates of any state vector element, or combination of elements, is straightforward, and the z-transforms of Appendix C provide an analytic approach. What is not so straightforward is the interpretation of the result in this non-statistically-stationary system. Care must be taken to account for the non-stationarity, but results are not further described here.

Discussion
The behavior of dynamical-system invariants in sequential estimation processes, be they fundamental ones such as energy, circulation, or a scalar inventory, or derived ones such as a thermocline depth, depends upon a number of parameters. These include the time scales embedded in the dynamical system, the temporal distribution of the data relative to those embedded time scales, the accuracies of initial conditions, boundary conditions, and data, and the accuracy of the governing time-evolution model. In addition, the invariant estimates can depend directly upon the accuracies of the uncertainties, explicit or implicit, in all of the elements making up the estimation system. Because of their interplay, the only easy generalization is that the user must check the accuracies of all of these elements, including the often difficult appearance of systematic errors in any of them (e.g., Dee, 2005). When feasible, a strong clue to the presence of systematic errors, e.g., in energies, lies in determining the nullspace of the observation matrices coupled with the structure of the state evolution matrices, A. Analogous examples have been computed, e.g., for advection-diffusion systems (not shown, but see Marchal and Zhao, 2021), with analogous results concerning estimates of fixed total tracer inventories.
Physical insights into the system behavior are essential, as is an understanding of the structure of the imputed statistical relationships. As the considerable literature cited previously has made clear, the inference of trends in properties in the presence of time-evolving observation systems requires particular attention. At a minimum, one should test any such system against the behavior of a known result, e.g., treating a GCM as "truth" and then running the smoothing algorithms to test whether that truth is forthcoming.

Appendix A: Steady-State and Asymptotics
Time-sequence equations starting at t = 0 (however defined) undergo a general transient behavior. For simplifying purposes, and following much of the literature, assume that a statistical steady state has been reached, so that the innovation equation has no remaining time-dependence in A, B, or K. Time-dependence remains in x̃(t), but it can be statistically stationary (labelled "wide-sense" or "weakly" stationary, depending on the literature). The steady-state error covariance, P_∞, satisfies an algebraic Riccati equation and is also constant. Pitfalls lie in the accuracies of P_∞ and in R.
Within a steady state, the various moments can be computed. For example, from the innovation state equation, the mean of the state depends directly upon any bias errors in y(t) and E, and upon the accuracy of K_∞. It will be sensitive directly to the structure and rank of I − A.
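For the scalar version of the system (Appendix C), the steady-state (algebraic Riccati) covariance can be found by simply iterating the forecast/analysis covariance recursion to its fixed point. A minimal sketch with assumed illustrative values of a, E, Q, R:

```python
# Scalar steady-state Riccati covariance by fixed-point iteration.
# All numerical values are illustrative assumptions.
a, E, Q, R = 0.95, 1.0, 0.1, 0.5

P_minus = 1.0                     # initial forecast-error variance guess
for _ in range(500):
    # Analysis step: data reduce the forecast uncertainty
    P = P_minus * (1.0 - P_minus * E**2 / (E**2 * P_minus + R))
    # Forecast step: dynamics and process noise inflate it again
    P_minus = a**2 * P + Q

# At convergence, P_minus satisfies the scalar algebraic Riccati equation:
#   P_inf = a^2 * P_inf * R / (E^2 * P_inf + R) + Q
residual = P_minus - (a**2 * P_minus * R / (E**2 * P_minus + R) + Q)
```

For |a| < 1 this iteration is a contraction and converges from any non-negative starting value.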

Predictor-Corrector Methods
Rigorous Kalman filters are widely used in many applications. In climate systems they are almost never used, despite claims to the contrary, because of the computational cost of Eq. (A1). Instead, K(τ) is replaced by an ad hoc, often constant, matrix, K_pc, in which case Eq. (6) becomes a predictor-corrector system; K_pc would be substituted for K_∞ in the previous equations, whether derived from the formal solution Eq. (7) or not. As with the true KF, x̃(τ), once combined with data using K_pc, no longer satisfies the prediction-model equations, having undergone a jump in values at time τ. As with the true KF, the predictor-corrector system can be written in innovation form, showing the apparent forcing by data. Fig. A1 depicts the variation in some elements of the Kalman gain matrix for a set of observations at the places shown. Some elements do tend to become nearly constant at the data times, while others continue to show structure. Whether choosing K_pc from one particular time is adequate will be very much problem dependent.
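A constant-gain predictor-corrector for a scalar system can be sketched as follows (all numerical values, including the ad hoc gain K_pc, are illustrative assumptions). Even without the optimal Kalman gain, the corrector step keeps the estimation error well below the raw observational noise in this stable example:

```python
import numpy as np

rng = np.random.default_rng(3)
a, E = 0.9, 1.0                 # stable scalar dynamics, scalar observation
q_std, r_std = 0.1, 0.3         # process and observation noise std devs
K_pc = 0.4                      # ad hoc constant gain replacing the Kalman gain

x_true, x_hat = 1.0, 0.0
errs = []
for _ in range(2000):
    # Truth evolves with process noise; a noisy observation is taken
    x_true = a * x_true + q_std * rng.standard_normal()
    y = E * x_true + r_std * rng.standard_normal()
    # Predictor-corrector with the fixed gain
    x_pred = a * x_hat                          # predictor
    x_hat = x_pred + K_pc * (y - E * x_pred)    # corrector ("innovation" forcing)
    errs.append(x_hat - x_true)

rmse = float(np.sqrt(np.mean(np.square(errs))))
```

Whether such a fixed gain is adequate depends, as noted above, on how strongly the true Kalman gain varies over the observation times.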

The algebraic statement of the smoother is not easy to penetrate. Consider an even simpler system, that of a scalar obeying a time-evolution equation; the "prediction equation" in the above terminology is x(t + Δt) = a x(t) + b q(t) + Γ(t)u(t), where |a| < 1 is a constant and q(t) is a known forcing process. Γ(t)u(t) has zero mean and variance Q(t), perhaps constant in time, and represents any unknown element in q(t); the symbol Q is used because u(t) represents the uncertainty in q. b, Γ are known scale factors and might as well be taken as 1, but are useful markers. Let there be observations of x(τ), y(τ) = E x(τ) + ε(τ), where the observation noise, ε(t), is another zero-mean white-noise process of variance R. The system begins at t = 0, with an initial condition x̃(0) with a known uncertainty, P(0) = ⟨(x̃(0) − x(0))²⟩. No null space of E exists.

(Figure caption fragment: only x₅(t) = ξ̃₂ was observed; zero values interlace the observation times.)

Prediction is made using Eq. (C1) with the unknown u(t) set to zero and the estimated initial condition. At the previous time-step the uncertainty, P(t) = ⟨(x̃(t) − x(t))²⟩, is known, and the uncertainty of the prediction is then P(t + Δt, −) = a²P(t) + Γ²Q, with, as before, the minus sign indicating that no data at time t + Δt have been used. If no data are available, x̃(t + Δt) = x̃(t + Δt, −), and its uncertainty is P(t + Δt) = P(t + Δt, −), with Eq. (C4) becoming a simple difference equation which can be solved beginning at t = 0. If no data at all are available, taking the z-transform, z = exp(−2πis Δt), and denoting the result with a circumflex, P̂(z) = z Q̂(z)/(1 − z a²) = Q̂(z) z (1 + z a² + z² a⁴ + ...).
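The no-data recursion P(t + Δt) = a²P(t) + Γ²Q is easy to iterate directly; with |a| < 1 (and Γ = 1) it converges to the balance value Q/(1 − a²), the steady random-walk level. A minimal sketch with assumed values:

```python
# Prediction-only uncertainty recursion P(t+dt) = a^2 P(t) + Q (Gamma = 1).
# Values of a and Q are illustrative assumptions.
a, Q = 0.9, 0.05

P = 0.0
for _ in range(1000):
    P = a**2 * P + Q          # uncertainty grows with each unobserved step

# Geometric-series limit: dissipation (|a| < 1) balances the random-walk growth
P_limit = Q / (1.0 - a**2)
```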
If data are available at time τ, the weighted average of the two estimates can be written as x̃(τ) = x̃(τ, −) + K(τ)[y(τ) − E x̃(τ, −)], where the difference y(τ) − E x̃(τ, −) includes both the observational noise and the discrepancies of the predicted state from the true value. Evidently, any mis-specification of a, E, Q, or P(0) will lead to an error in the estimate and in its uncertainty.
The uncertainty following employment of the observation at t = τ is P(τ) = P(τ, −)[1 − P(τ, −)E²/(E²P(τ, −) + R)] ≤ P(τ, −); that is, the data reduce the uncertainty. Should R → 0, P(τ) vanishes, but would become finite again at the next predicted time-step. If R → ∞, P(τ) = P(τ, −). Now assuming that the KF has been run out to a duration 0 ≤ t ≤ T_dur, the Rauch-Tung-Striebel (RTS) algorithm can be used to improve the estimates using observations that were formally in the future of times τ in the KF. Let any such new estimate be denoted x̃(t, +), with a new uncertainty P(t, +). Then for this scalar system,

x̃(t, +) = x̃(t) + [P(t) a / P(t + Δt, −)][x̃(t + Δt, +) − x̃(t + Δt, −)],   (C11)

so that if there were no data future to t + Δt, x̃(t + Δt, +) = x̃(t + Δt, −), and no change is made in the previous value, x̃(t).
If, previously, x̃(t) were known perfectly, P(t) = 0, and again no change is made.
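The scalar filter-plus-smoother pair described above can be sketched end to end (illustrative values throughout; variable names are assumptions). The backward pass uses the gain P(t)a/P(t + Δt, −) of Eq. (C11), and the smoothed uncertainty never exceeds the filtered one:

```python
import numpy as np

rng = np.random.default_rng(4)
a, E, Q, R = 0.95, 1.0, 0.02, 0.2   # illustrative scalar system parameters
T = 80

# Simulate "truth" and noisy observations of it
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t-1] + np.sqrt(Q) * rng.standard_normal()
y = E * x + np.sqrt(R) * rng.standard_normal(T)

# Forward Kalman filter, storing the prior ("minus") quantities the RTS pass needs
xf = np.zeros(T); Pf = np.zeros(T)   # filtered estimate and uncertainty
xm = np.zeros(T); Pm = np.zeros(T)   # predicted (minus) estimate and uncertainty
xf[0], Pf[0] = 0.0, 1.0
for t in range(1, T):
    xm[t] = a * xf[t-1]
    Pm[t] = a**2 * Pf[t-1] + Q
    K = Pm[t] * E / (E**2 * Pm[t] + R)
    xf[t] = xm[t] + K * (y[t] - E * xm[t])
    Pf[t] = (1.0 - K * E) * Pm[t]

# Backward RTS smoother, scalar form of Eq. (C11)
xs = xf.copy(); Ps = Pf.copy()
for t in range(T - 2, -1, -1):
    L = Pf[t] * a / Pm[t+1]
    xs[t] = xf[t] + L * (xs[t+1] - xm[t+1])
    Ps[t] = Pf[t] + L**2 * (Ps[t+1] - Pm[t+1])
```

At the final time the smoothed and filtered estimates coincide, since no future data exist there.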

Define an innovation matrix, D_δ(t, j) = y(t) − E(t)x(t) = δ_{t,τ} δ_{ij}; that is, D_δ is a matrix of Kronecker deltas of the difference D_{ij}(τ) = δ_{t,τ} δ_{ij} = y_j(τ) − Σ_r E_{ir}(τ) x_r(τ). The solutions to the equation are the columns of the Green function matrix.

K, now fixed in time, is sought as an indication of the delta-impulse effect of observations on the prediction model at time τ.
Then the discrete Fourier transform of Eq. (C2) (the z-transform, a matrix polynomial in z) involves (I − zA)⁻¹, whose norm defines the "resolvent" of A in the full complex plane (see Trefethen and Embree, 2005); here only |z| = 1, the unit circle, is of direct interest. The full complex plane carries information about the behavior of A, including stability.
G can be obtained without the z−transform, but the frequency content of these results is of interest.
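For the diagonal, damped A used here, the resolvent norm on the unit circle can be sampled directly; it peaks at z values aligned with the modal frequencies and is bounded by 1/(1 − e^(−bΔt)). A sketch with illustrative frequencies:

```python
import numpy as np

# Illustrative modal frequencies and damping (assumptions, not the paper's values)
sigma = np.array([0.3, 0.7, 1.1])
dt, b = 1.0, 0.01
n = len(sigma)
A = np.diag(np.exp(-1j * sigma * dt) * np.exp(-b * dt))   # stable: |eigenvalues| < 1

# Sample the resolvent norm ||(I - zA)^{-1}|| on the unit circle |z| = 1
zs = np.exp(2j * np.pi * np.linspace(0.0, 1.0, 256, endpoint=False))
norms = [np.linalg.norm(np.linalg.inv(np.eye(n) - z * A), 2) for z in zs]
peak = max(norms)
# peak is large (near-resonance at the modal frequencies) but finite,
# bounded above by 1/(1 - exp(-b*dt)) because of the damping
```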

Green Function of Smoother Innovation
As with the innovation equation for filtering, Eq. (25) introduces a disturbance into the previous estimate,x (t) , in which the structure of L (t) will determine the magnitude and time scales of observational "disturbances" propagated backwards in time.

It provides direct insight into the extent to which later measurements influence earlier estimates. As one example, suppose that the KF has been run to time t = t_f, so that x̃(t_f, +) = x̃(t_f), and that t_f carries the only measurement. Let the innovation, x̃(t_f, +) − x̃(t_f, −), be a matrix of δ functions in separate columns; then a backwards-in-time matrix Green function results. The various time scales embedded in L depend upon those in A, P(t, −), and P(t); with many observations, they depend also upon the observation intervals and any structure in the observational noise. Similarly, the control modification will be determined by P(t + Δt, −)⁻¹ if Q(t) and Γ(t) are constant in time.
Author contributions. Sole author.