Least-squares fitting is a well-known statistical technique for estimating parameters in mathematical models. It amounts to solving the optimisation problem of finding the minimum of the function

$F(\theta) = \sum_{i = 1}^N \rho (f_i(\theta)^2),$

where

- $\theta = (\theta_1, \ldots, \theta_r)$ is the collection of parameters we want to estimate,
- $N$ is the number of available data points,
- $\rho$ is a loss function used to reduce the influence of outliers, and
- $f_i(\theta)$ is the $i$-th component of the vector of residuals.

Given a model function $m(t; \theta)$ and some data points $D = \{(t_i, d_i) \mid i = 1, \ldots, N\}$, one normally defines the vector of residuals as the difference between the model prediction and the data, that is:

$f_i(\theta) = m(t_i; \theta) - d_i.$

If the model is linear, i.e. $\theta = (a, b)$ and $m(t; a, b) = at + b$, then the problem reduces to the even better known linear least-squares fitting problem, in the sense that there exists an explicit formula for the optimal value of the parameters $\hat \theta$. In this report we consider more involved problems where the model may be nonlinear and finding the optimal value $\hat \theta$ must be done with iterative algorithms. We shall not go into the theoretical details of the algorithms, but rather explore the implementation of the `least_squares` function available in the `scipy.optimize` module of the SciPy Python package. In particular, we give examples of how to handle multi-dimensional and multi-variate functions so that they adhere to the `least_squares` interface.

## First example: a scalar function

The first example we will consider is a simple logistic function

$y(t) = \frac{K}{1 + e^{-r(t - t_0)}}.$

The three parameters in the function are:

- $K$, the supremum of $y$ (think of this as a maximum that is approached as $t \to \infty$),
- $r$, the logistic growth rate, or sharpness of the curve, and
- $t_0$, the value at which the midpoint $K/2$ is achieved.
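These interpretations are easy to check numerically. The sketch below uses a hypothetical helper `logistic` (not part of the report's code) that mirrors the formula above:

```python
import numpy as np

def logistic(t, K, r, t0):
    # hypothetical helper mirroring y(t) = K / (1 + exp(-r (t - t0)))
    return K / (1 + np.exp(-r * (t - t0)))

# at the midpoint t = t0 the function takes the value K / 2
print(logistic(0.5, 1.0, 10.0, 0.5))    # 0.5
# far to the right of t0 the value approaches the supremum K
print(logistic(100.0, 1.0, 10.0, 0.5))  # 1.0
```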

Optimising the parameters $\theta = (K, r, t_0)$ is straightforward in this case, since `least_squares` allows us to input the model without any special changes. To do this we first define the model as the Python function `y(theta, t)` and then generate some data in `ys` by evaluating the model over a sample `ts` of the independent variable $t$. When generating the test data, we pass an array of numbers `ts` directly to the model. We are able to do this because we defined the model `y` using NumPy's `array` object, which implements standard element-wise operations between arrays. We add some uniformly distributed noise, make an initial guess for the parameters `theta0` and define the vector of residuals `fun`. The vector of residuals must be a function of the optimisation variables $\theta$ so that `least_squares` can evaluate the fitness of its guesses.

```python
import numpy as np
from scipy.optimize import least_squares

def y(theta, t):
    return theta[0] / (1 + np.exp(-theta[1] * (t - theta[2])))

ts = np.linspace(0, 1)
K = 1; r = 10; t0 = 0.5; noise = 0.1
ys = y([K, r, t0], ts) + noise * np.random.rand(ts.shape[0])

def fun(theta):
    return y(theta, ts) - ys

theta0 = [1, 2, 3]
res1 = least_squares(fun, theta0)
```

We see that the estimated parameters are indeed very close to those of the data.

In some cases, we may want to optimise only some parameters while leaving others fixed. This can be done by removing them from `theta` and hard-coding them into the model.
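As a sketch of this idea (reusing the logistic model from above, and assuming for the sake of illustration that $K$ is known), one can fit only $r$ and $t_0$:

```python
import numpy as np
from scipy.optimize import least_squares

def y(theta, t):
    return theta[0] / (1 + np.exp(-theta[1] * (t - theta[2])))

ts = np.linspace(0, 1)
ys = y([1.0, 10.0, 0.5], ts)  # noiseless data, for the sketch only

K_fixed = 1.0  # suppose K is known and held fixed

def fun(theta):
    # theta = (r, t0); K_fixed is hard-coded into the residuals
    return y([K_fixed, theta[0], theta[1]], ts) - ys

res = least_squares(fun, [5.0, 0.3])
# res.x should recover r = 10 and t0 = 0.5
```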

## Second example: a parametrised circumference

The motivation behind this example is that sometimes we may have data that is described by two dependent quantities which we hypothesize to be a function of a single independent variable. The simplest example I could think of was that of a parametrised circumference:

$s(t) = (c_x + r \cos(t), c_y + r \sin(t)).$

Here, the parameters are

- the centre $C = (c_x, c_y)$, and
- the radius $r$.

Like before, we define the model `s` and generate some random data. Only this time, the model outputs two dependent variables $s(t) = (x(t), y(t))$. As expected, this requires us to define the model differently so that it returns two numbers instead of one, as well as adding noise to both outputs. When it comes to defining the vector of residuals, we must take care to match the shape expected by `least_squares`. Per the documentation, we must provide a vector of $N$ elements which `least_squares` will square before inputting the result into the loss function $\rho$. However, we want to compute the square of the distance between the model prediction and the test data. We would normally do this by calculating $(m_x - d_x)^2 + (m_y - d_y)^2$. But if we did this, the result would get squared again by `least_squares`. The solution is to return a vector of residuals of size $2N$, where components $1$ through $N$ correspond to the differences $m_x - d_x$ and components $N + 1$ through $2N$ correspond to the differences $m_y - d_y$. This is easily achieved by taking the difference between the prediction and the data as usual and then *flattening* the two arrays into a longer one. We are able to do this because `least_squares` never really sees the raw data, and the cost function presented at the beginning of the report is just there to provide a mathematical background. Also, this approach generalises easily to higher-dimensional model outputs.
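A quick sanity check of the shapes involved, with made-up arrays that stand in for the prediction and the data:

```python
import numpy as np

N = 50
pred = np.zeros((2, N))  # model output: x(t) in row 0, y(t) in row 1
data = np.ones((2, N))   # observed data with the same layout
residuals = (pred - data).flatten()

# least_squares expects a 1-D vector; flattening row-major gives 2N entries:
# the first N are the x-differences, the last N the y-differences
print(residuals.shape)  # (100,)
```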

```python
def s(theta, t):
    x = theta[0] + theta[2] * np.cos(t)
    y = theta[1] + theta[2] * np.sin(t)
    return np.array([x, y])

ts = np.linspace(0, 2 * np.pi)
cx = 1.5; cy = 1.3; r = 0.75; noise = 0.05
ss = s([cx, cy, r], ts)
ss[0] += noise * np.random.rand(ts.shape[0])
ss[1] += noise * np.random.rand(ts.shape[0])

def fun(theta):
    return (s(theta, ts) - ss).flatten()

theta0 = [0, 0, 0]
res2 = least_squares(fun, theta0)
```

## Third example: an elliptic paraboloid

Lastly, we focus on an elliptic paraboloid, since it can be parametrised as a function of two independent variables with

$h(x, y) = a(x - x_0)^2 + b(y - y_0)^2.$

Here the parameters are

- $a$ and $b$, which determine the eccentricity of the ellipses that form the level sets of the paraboloid ($a = b = 1$ gives circumferences), and
- $(x_0, y_0)$, which determines the projection of the vertical axis of the paraboloid onto the $xy$-plane.

In this instance we must also be careful with how we sample the domain of the independent variables. We choose to sample the square $[-1, 1] \times [-1, 1]$ with a $20 \times 20$ mesh grid, i.e. we evaluate the model at all pairs $(x_j, y_k)$ where $x_j$ and $y_k$ each run over $20$ equally spaced points in $[-1, 1]$. This sounds more complicated than it really is, since NumPy provides a method `numpy.meshgrid` that does this for us. As before, we add noise, taking care to do so for each of the $400 = 20 \times 20$ data points we generated. For the model definition, we do not need to do anything special since NumPy also implements binary element-wise operations between the components of a mesh grid.
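To illustrate how `numpy.meshgrid` builds the sample grid, here is a standalone sketch, separate from the fitting code, with made-up parameter values:

```python
import numpy as np

xs = np.linspace(-1, 1, 20)
ys = np.linspace(-1, 1, 20)
gx, gy = np.meshgrid(xs, ys)

# gx[j, k] and gy[j, k] hold the coordinates of one of the 400 grid points
print(gx.shape, gy.shape)  # (20, 20) (20, 20)

# element-wise operations act per grid point, so the paraboloid formula
# can be evaluated over the whole grid at once
h = 1.0 * (gx - 0.1)**2 + 2.0 * (gy + 0.15)**2
print(h.shape)  # (20, 20)
```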

```python
def h(theta, x, y):
    return theta[2] * (x - theta[0])**2 + theta[3] * (y - theta[1])**2

xs = np.linspace(-1, 1, 20)
ys = np.linspace(-1, 1, 20)
gridx, gridy = np.meshgrid(xs, ys)

x0 = 0.1; y0 = -0.15; a = 1; b = 2; noise = 0.1
hs = h([x0, y0, a, b], gridx, gridy)
hs += noise * np.random.default_rng().random(hs.shape)

def fun(theta):
    return (h(theta, gridx, gridy) - hs).flatten()

theta0 = [0, 0, 1, 2]
res3 = least_squares(fun, theta0)
```

## Goodness of fit and parameter distribution estimation

Once we have parameter estimates for our model, one question we may ask ourselves is how well the model fits the data. Since we are doing least-squares estimation, one of the easiest errors to calculate is the mean squared error [MSE]. In the context of the previous question, i.e. how well these parameter estimates fit the data, this error is just the mean of the squares of the residuals

$\text{MSE} = \frac{1}{N} \sum_{i = 1}^N f_i(\hat \theta)^2.$

Therefore it is a function of both the training data set and the parameters themselves. In the previous three cases the MSE can be calculated easily with Python, by using the result object returned by `least_squares`:

```python
mse1 = (res1.fun ** 2).mean()
mse2 = (res2.fun ** 2).mean()
mse3 = (res3.fun ** 2).mean()
# MSE in the logistic equation (ex1): 0.000806
# MSE in the parametrised circumference (ex2): 0.000226
# MSE in the elliptic paraboloid (ex3): 0.001717
```

Now suppose we are using the parameter estimates and model for prediction. For instance, think of the first logistic growth model as the cumulative number of cases during an epidemic. Let's say we are in the middle of the epidemic and only have data available for the first couple of weeks, but wish to determine when the increase of new cases will start to decline (at $t_0$) or what the total number of cases will be (approximately $K$). We estimate the parameters with the available data and make predictions. Then the epidemic finally stops and we wish to evaluate how well our model did.

In this context the MSE is called the mean square prediction error [MSPE] and is defined as

$\text{MSPE} = \frac{1}{Q} \sum_{i = N + 1}^{N + Q} (d_i - m(t_i; \hat \theta))^2,$

where we suppose we have made $Q$ predictions using the model at the values of the independent variable $t_{N + 1}, \ldots, t_{N + Q}$ using the estimated parameters $\hat \theta$.

```python
# generate data again
ts = np.linspace(0, 100, 100)
K = 1400; r = 0.1; t0 = 50; noise = 50
ys = y([K, r, t0], ts) + noise * (np.random.rand(ts.shape[0]) - 0.5)

# this time we only use the first 50% of the data
train_limit = 50  # out of 100 data points

def fun(theta):
    return y(theta, ts[:train_limit]) - ys[:train_limit]

# run the parameter estimation again
theta0 = [1000, 0.1, 30]
res4 = least_squares(fun, theta0)

# predict the values for the rest of the epidemic
predict = y(res4.x, ts[train_limit:])
mspe = ((ys[train_limit:] - predict) ** 2).mean()
# K = 1266.9713573318118, r = 0.10463262364967481, t_0 = 48.37440994315726
# MSPE for a prediction with 50.0% of the data: 9660.411804269846
```

The previous simulation is extremely sensitive to the amount of training data. In particular, if the training dataset ends much before $t_0$, the model can be horribly wrong, to the point where the prediction can be that the epidemic will still be in the initial exponential phase by day 100. In any case, this issue is beyond the scope of this report, as it has more to do with the high sensitivity to small changes in parameters that characterises exponential models.

However, the idea of measuring the accuracy of a prediction is in the right direction. Another question we might ask ourselves is how we can get error estimates for our predictions before being able to validate them against real data. In other words, can we predict how wrong we will be, in addition to predicting how many infected cases there will be? For general models and general techniques of parameter estimation, there are countless answers to this question. In what follows we focus on a very particular approach that, of course, has its flaws.

### Error estimates via residual resampling

One common technique for quantifying errors in parameter estimation is the use of confidence intervals. If the underlying distribution of the data is known, it can sometimes be used to derive confidence intervals via explicit formulas. When the underlying distribution is either unknown or too complex to treat analytically, one can try to estimate the distribution of the data itself. One possible way of doing this is to instead estimate the distribution of the residuals and generate new samples from the fitted values.

More formally, one does NLS fitting and retains the fitted values $\hat y_i$ and the residuals $f_i(\hat \theta)$ in addition to the estimated parameters $\hat \theta$. Recall that, because of the way we have defined $f$, the relation between these three is $f_i(\hat \theta) = \hat \varepsilon_i = \hat y_i - y_i$. Now we generate new synthetic data $y_i^* = \hat y_i + \hat \varepsilon_j$, where $\hat \varepsilon_j$ is randomly drawn from the collection of residuals $\hat \varepsilon_1, \ldots, \hat \varepsilon_N$. We use that synthetic data to refit the model and retain interesting quantities such as the values of the parameter estimates. We repeat this process many times to estimate the distribution of the interesting quantities we picked. We are now in a position to estimate the confidence intervals for the parameters.

```python
# real data
ts = np.linspace(0, 100, 100)
K = 1400; r = 0.1; t0 = 50; noise = 50
ys = y([K, r, t0], ts) + noise * (np.random.rand(ts.shape[0]) - 0.5)

# again, we only use the first 50% of the data
train_limit = 50  # out of 100 data points

def fun(theta):
    return y(theta, ts[:train_limit]) - ys[:train_limit]

# run the parameter estimation
theta0 = [1000, 0.1, 30]
res5 = least_squares(fun, theta0)

# use the residuals to estimate the error distribution
eps = res5.fun

# residual resampling
rng = np.random.default_rng()
M = 500  # number of resamples
theta_est = []
for _ in range(M):
    # generate a synthetic sample by reassigning the residuals
    fit_ys = y(res5.x, ts[:train_limit])
    rng.shuffle(eps)
    synthetic_ys = fit_ys + eps
    # fit the model again
    res = least_squares(lambda theta: y(theta, ts[:train_limit]) - synthetic_ys,
                        theta0)
    theta_est.append(res.x)

Ks, rs, t0s = np.array(theta_est).transpose()
```

Here we can see the estimated distributions of the model parameters. The distributions for $r$ and $t_0$ are more or less centred, while the distribution for $K$ presents a huge variance and is quite skewed. We choose to do the following: fix the parameters $\hat r = \bar r$ and $\hat t_0 = \bar t_0$, and plot the predictions for the model with them, together with the bounds of the 90% confidence interval for $K$ (the 5% and 95% quantiles), which we estimate from the distribution.

```python
import matplotlib.pyplot as plt

K05, K95 = np.quantile(Ks, [0.05, 0.95])
K = Ks.mean()
r = rs.mean()
t0 = t0s.mean()
y05 = y([K05, r, t0], ts)
y95 = y([K95, r, t0], ts)
print(f'K = {K:.0f} ({K05:.0f}, {K95:.0f}), r = {r:.3f}, t0 = {t0:.0f}')

fig = plt.figure(figsize=(9, 5))
ax = fig.add_subplot(111)
ax.plot(ts, ys, '.-', label='data')
ax.plot(ts, y05, '--', label='5%')
ax.plot(ts, y([K, r, t0], ts), label='fit')
ax.plot(ts, y95, '--', label='95%')
ax.fill_between(ts, y05, y95, alpha=0.2, color='gray')
ax.set_xlabel('days')
ax.set_ylabel('confirmed cases')
ax.set_title('Prediction with confidence intervals')
ax.legend()
fig.show()
# K = 1258 (1090, 1454), r = 0.105, t0 = 48
```