We should have a lot a great talks this semester, on a very interesting topic,

# Tag Archives: optimization

# Foundations of Machine Learning, part 5

*This post is the nineth (and probably last) one of our series on* *the history and foundations of econometric and machine learning models. The first fours were on econometrics techniques. Part 8 is online here.*

## Optimization and algorithmic aspects

In econometrics, (numerical) optimization became omnipresent as soon as we left the Gaussian model. We briefly mentioned it in the section on the exponential family, and the use of the Fisher score (gradient descent) to solve the first order condition \mathbf{X}^T W(\beta)^{-1})[y-\widehat{y}]=\mathbf{0}. In learning, optimization is the central tool. And it is necessary to have effective optimization algorithms, to solve problems (described previously) of the form: \widehat{\beta}\in\underset{\beta\in\mathbb{R}^p}{\text{argmin}}\left\lbrace\sum_{i=1}^n \ell(y_i,\beta_0+\mathbf{x}^T\beta)+\lambda\Vert\boldsymbol{\beta}\Vert\right\rbraceIn some cases, instead of global optimization, it is sufficient to consider optimization by coordinates (widely studied in Daubechies et al. (2004)). If f:\mathbb{R}^d\rightarrow\mathbf{R} is convex and differentiable, if \mathbf{x} satisfies f(\mathbf{x}+h\boldsymbol{e}_i)\geq f(\mathbf{x}) for any h>0 and i\in\{1,\cdots, d\}then f(\mathbf{x})=\min\{f\}, where \mathbf{e}=(\mathbf{e}_i) is the canonical basis of \mathbb{R}^d. However, this property is not true in the non-differentiable case. But if we assume that the non-differentiable part is separable (additively), it becomes true again. More specifically, iff(\mathbf{x})=g(\mathbf{x})+\sum_{i=1}^d h_i(x_i)with\left\lbrace\begin{array}{l}g: \mathbb{R}^d\rightarrow\mathbb{R}\text{ convex-differentiable}\\h_i: \mathbb{R}\rightarrow\mathbb{R}\text{ convex}\end{array}\right.This was the case for Lasso regression, \beta)\mapsto\| \mathbf{y}-\beta_0-\mathbf{X}\beta\|_{\ell_2 }+\lambda\|\beta\|_{\ell_1}, as shown by Tsen (2001). Getting back to our initial notations, we can use a coordinate descent algorithm: from an initial value \mathbf{x}^{(0)}, we consider (by iterating)x_j^{(k)}\in\text{argmin}\big\lbrace f(x_1^{(k)},\cdots,x_{k-1}^{(k)},x_k,x_{k+1}^{(k-1)},\cdots,x_n^{(k-1)})\big\rbrace for j=1,2,\cdots,nThese algorithmic problems and numerical issues may seem secondary to econometricians. However, they are essential in automatic learning: a technique is interesting if there is a stable and fast algorithm, which allows to obtain a solution. These optimization techniques can be transposed: for example, this coordinate descent technique can be used in the case of SVM methods (known as “vector support” methods) when the space is not linearly separable, and the classification error must be penalized (we will come back to this technique in the next section).

## In-sample, out-of-sample and cross-validation

These techniques seem intellectually interesting, but we have not yet discussed the choice of the penalty parameter \lambda. But this problem is actually more general, because comparing two parameters \widehat{\beta}_{\lambda_1} and \widehat{\beta}_{\lambda_2} is actually comparing two models. In particular, if we use a Lasso method, with different thresholds \lambda, we compare models that do not have the same dimension. Previously, we have addressed the problem of model comparison from an econometric perspective (by penalizing overly complex models). In the learning literature, judging the quality of a model on the data used to construct it does not make it possible to know how the model will behave on new data. This is the so-called “generalization” problem. The traditional approach then consists in separating the sample (size n) into two parts: a part that will be used to train the model (the training database, in-sample, size m) and a part that will be used to test the model (the testing database, out-of-sample, size n-m). The latter then makes it possible to measure a real predictive risk. Suppose that the data are generated by a linear model y_i=\mathbf{x}_i^T \beta_0+\varepsilon_i where \varepsilon_i are independent and centred law achievements. The empirical quadratic risk in-sample is here\frac{1}{m}\sum_{i=1}^m\mathbb{E}\big([\mathbf{x}_i^T \widehat{\beta}-\mathbf{x}_i^T \beta_0]^2\big)=\mathbb{E}\big([\mathbf{x}_i^T \widehat{\beta}-\mathbf{x}_i^T \beta_0]^2\big),for any observation i. Assuming the residuals \varepsilon Gaussian, then we can show that this risk is worth \sigma^2 \text{trace} (\Pi_X)/m is \sigma^2 p/m. On the other hand, the empirical out-of-sample quadratic risk is here \mathbb{E}\big([\mathbf{x}^T \widehat{\beta}-\mathbf{x}^T \beta_0]^2\big) where \mathbf{x} is a new observation, independent of the others. It can be noted that \mathbb{E}\big([\mathbf{x}^T \widehat{\beta}-\mathbf{x}^T \beta_0]^2\big\vert \mathbf{x}\big)=\text{Var}\big(\mathbf{x}^T \widehat{\beta}\big\vert \mathbf{x}\big)=\sigma^2\mathbf{x}^T(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x},and by integrating with respect to \mathbf{x}, \mathbb{E}\big([\mathbf{x}^T \widehat{\beta}-\mathbf{x}^T\beta_0]^2\big)=\sigma^2\text{trace}\big(\mathbb{E}[\mathbf{x}\mathbf{x}^T]\mathbb{E}\big[(\mathbf{x}^T\mathbf{x})^{-1}\big]\big).The expression is then different from that obtained in-sample, and using the Groves & Rothenberg (1969) increase, we can show that \mathbb{E}\big([\mathbf{x}^T \widehat{\beta}-\mathbf{x}^T \beta_0]^2\big) \geq \sigma^2\frac{p}{m}which is pretty intuitive, when we start thinking about it. Except in some simple cases, there is no simple (explicit) formula. Note, however, that if \mathbf{X}\sim\mathcal{N}(0,\sigma^2 \mathbb{I}), then \mathbf{x}^T \mathbf{x} follows a Wishart law, and it can be shown that \mathbb{E}\big([\mathbf{x}^T \widehat{\beta}-\mathbf{x}^T \beta_0]^2\big)=\sigma^2\frac{p}{m-p-1}.If we now look at the empirical version: if \widehat{\beta} is estimated on the first m observations,\widehat{\mathcal{R}}^{~\text{ IS}}=\sum_{i=1}^m [y_i-\boldsymbol{x}_i^T\widehat{\boldsymbol{\beta}}]^2\text{ and }\widehat{\mathcal{R}}^{\text{ OS}}=\sum_{i=m+1}^{n} [y_i-\boldsymbol{x}_i^T\widehat{\boldsymbol{\beta}}]^2and as Leeb (2008) noted, \widehat{\mathcal{R}}^{\text{IS}}-\widehat{\mathcal{R}}^{\text{OS}}\approx 2\cdot\nu where \nu represents the number of degrees of freedom, which is not unlike the penalty used in the Akaike test.

Figure 4 shows the respective evolution of \widehat{\mathcal{R}}^{\text{IS}} and \widehat{\mathcal{R}}^{\text{OS}} according to the complexity of the model (number of degrees in a polynomial regression, number of nodes in splines, etc). The more complex the model, the more \widehat{\mathcal{R}}^{\text{IS}} will decrease (this is the red curve, below). But that’s not what we’re interested in here: we want a model that predicts well on new data (i. e. out-of-sample). As Figure 4 shows, if the model is too simple, it does not predict well (as it does with in-sample data). But what we can see is that if the model is too complex, we are in a situation of “overlearning”: the model will start to model the noise. Of course, this figure should remind us of the one we’ve seen in our second post of that series

**Figure 4** : Generalization, under- and over-fitting

Instead of splitting the database in two, with some of the data that will be used to calibrate the model and some to study its performance, it is also possible to use cross-validation. To present the general idea, we can go back to the “jackknife”, introduced by Quenouille (1949) (and formalized by Quenouille (1956) and Tukey (1958)) relatively used in statistics to reduce bias. Indeed, if we assume that \{y_1,\cdots,y_n\} is a sample drawn according to a law F_\theta, and that we have an estimator T_n (\mathbf{y})=T_n (y_1,\cdots,y_n), but that this estimator is biased, with \mathbf{E}[T_n (\mathbf{Y})]=\theta+O(n^{-1}), it is possible to reduce the bias by considering \widetilde{T}_n(\mathbf{y})=\frac{1}{n}\sum_{i=1}^n T_{n-1}(\mathbf{y}_{(i)})\text{ where }\mathbf{y}_{(i)}=(y_1,\cdots,y_{i-1},y_{i+1},\cdots,y_n)It can then be shown that \mathbb{E}[\tilde{T}_n(Y)]=\theta+O(n^{-2})The idea of cross-validation is based on the idea of building an estimator by removing an observation. Since we want to build a predictive model, we will compare the forecast obtained with the estimated model, and the missing observation\widehat{\mathcal{R}}^{\text{ CV}}=\frac{1}{n}\sum_{i=1}^n \ell(y_i,\widehat{m}_{(i)}(\mathbf{x}_i))We will speak here of the “leave-one-out” (loocv) method.

This technique reminds us of the traditional method used to find the optimal parameter in exponential smoothing methods for time series. In simple smoothing, we will construct a forecast from a time series as {}_t\widehat{y}_{t+1} =\alpha\cdot{}_{t-1}\widehat{y}_t +(1-\alpha)\cdot y_t, where \alpha\in[0,1], and we will consider as “optimal” \alpha^\star = \underset{\alpha\in[0,1]}{\text{argmin}}\left\lbrace \sum_{t=2}^T \ell({}_{t-1}\widehat{y}_{t},y_{t}) \right\rbraceas described by Hyndman et al (2009).

The main problem with the leave-one-out method is that it requires calibration of n models, which can be problematic in large dimensions. An alternative method is cross validation by k-blocks (called “k-fold cross validation”) which consists in using a partition of \{1,\cdots,n\} in k groups (or blocks) of the same size, \mathcal{I}_1,\cdots,\mathcal{I}_k, and let us note \mathcal{I}_{\bar j}=\{1,\cdots,n\}\setminus \mathcal{I}_j. By noting \widehat{m}_{(j)} built on the sample \mathcal{I}_{\bar j}, we then set:\widehat{\mathcal{R}}^{k-\text{ CV}}=\frac{1}{k}\sum_{j=1}^k \mathcal{R}_j\text{ where }\mathcal{R}_j=\frac{k}{n}\sum_{i\in\mathcal{I}_{{j}}} \ell(y_i,\widehat{m}_{(j)}(\mathbf{x}_i))Standard cross-validation, where only one observation is removed each time (loocv), is a special case, with k=n. Using k=5 or 10 has a double advantage over k=n: (1) the number of estimates to be made is much smaller, 5 or 10 rather than n; (2) the samples used for estimation are less similar and therefore less correlated to each other, which tends to avoid excess variance, as recalled by James et al. (2013).

Another alternative is to use boosted samples. Let \mathcal{I}_b be a sample of size n obtained by drawing with replacement in \{1,\cdots,n\} to know which observations (y_i,\mathbf{x}_i) will be kept in the learning population (at each draw). Note \mathcal{I}_{\bar b}=\{1,\cdots,n\}\setminus\mathcal{I}_b. By noting \widehat{m}_{(b)} built on sample \mathcal{I}_b, we then set :\widehat{\mathcal{R}}^{\text{ B}}=\frac{1}{B}\sum_{b=1}^B \mathcal{R}_b\text{ where }\mathcal{R}_b=\frac{n_{\overline{b}}}{n}\sum_{i\in\mathcal{I}_{\overline{b}}} \ell(y_i,\widehat{m}_{(b)}(\mathbf{x}_i))where n_{\bar b} is the number of observations that have not been kept in \mathcal{I}_b. It should be noted that with this technique, on average e^{-1}\sim36.7\% of the observations do not appear in the boosted sample, and we find an order of magnitude of the proportions used when creating a calibration sample, and a test sample. In fact, as Stone (1977) had shown, the minimization of AIC is to be compared to the cross-validation criterion, and Shao (1997) showed that the minimization of BIC corresponds to k-fold cross-validation, with k=n/\log n.

All those techniques here are mentioned in the “machine learning” section since they rely on automatic, computational techniques, and no probabilistic foundations are necessary. In many cases we did use the notation m^\star (at least in the first posts on “machine learning” techniques) to highlight the fact that we want some sort of “optimal” model – and to make a distinction with estimators \widehat{m} considered earlier, when we had some probabilistic framework. But of course, it is possible (and necessary) to build bridges between those two cultures…

*References are online here. As explained in the introduction, it is some sort of online version of an introduction to our joint paper with Emmanuel Flachaire and Antoine Ly, Econometrics and Machine Learning (initially writen in French), that will actually appear soon in the journal Economics and Statistics (in English and in French).
*

# Classification from scratch, linear discrimination 8/8

Eighth post of our series on classification from scratch. The latest one was on the SVM, and today, I want to get back on very old stuff, with here also a linear separation of the space, using Fisher’s linear discriminent analysis.

## Bayes (naive) classifier

Consider the follwing naive classification rulem^\star(\mathbf{x})=\text{argmin}_y\{\mathbb{P}[Y=y\vert\mathbf{X}=\mathbf{x}]\}orm^\star(\mathbf{x})=\text{argmin}_y\left\{\frac{\mathbb{P}[\mathbf{X}=\mathbf{x}\vert Y=y]}{\mathbb{P}[\mathbf{X}=\mathbf{x}]}\right\}(where \mathbb{P}[\mathbf{X}=\mathbf{x}] is the density in the continuous case).

In the case where y takes two values, that will be standard \{0,1\} here, one can rewrite the later asm^\star(\mathbf{x})=\begin{cases}1\text{ if }\mathbb{E}(Y\vert \mathbf{X}=\mathbf{x})>\displaystyle{\frac{1}{2}}\\0\text{ otherwise}\end{cases}and the set\mathcal{D}_S =\left\{\mathbf{x},\mathbb{E}(Y\vert \mathbf{X}=\mathbf{x})=\frac{1}{2}\right\}is called the decision boundary.

Assume that\mathbf{X}\vert Y=0\sim\mathcal{N}(\mathbf{\mu}_0,\mathbf{\Sigma})and\mathbf{X}\vert Y=1\sim\mathcal{N}(\mathbf{\mu}_1,\mathbf{\Sigma})then explicit expressions can be derived.m^\star(\mathbf{x})=\begin{cases}1\text{ if }r_1^2< r_0^2+2\displaystyle{\log\frac{\mathbb{P}(Y=1)}{\mathbb{P}(Y=0)}+\log\frac{\vert\mathbf{\Sigma}_0\vert}{\vert\mathbf{\Sigma}_1\vert}}\\0\text{ otherwise}\end{cases}where r_y^2 is the Manalahobis distance, r_y^2 = [\mathbf{X}-\mathbf{\mu}_y]^{\text{{T}}}\mathbf{\Sigma}_y^{-1}[\mathbf{X}-\mathbf{\mu}_y]

Let \delta_ybe defined as\delta_y(\mathbf{x})=-\frac{1}{2}\log\vert\mathbf{\Sigma}_y\vert-\frac{1}{2}[{\color{blue}{\mathbf{x}}}-\mathbf{\mu}_y]^{\text{{T}}}\mathbf{\Sigma}_y^{-1}[{\color{blue}{\mathbf{x}}}-\mathbf{\mu}_y]+\log\mathbb{P}(Y=y)the decision boundary of this classifier is \{\mathbf{x}\text{ such that }\delta_0(\mathbf{x})=\delta_1(\mathbf{x})\}which is quadratic in {\color{blue}{\mathbf{x}}}. This is the quadratic discriminant analysis. This can be visualized bellow.

The decision boundary is here

But that can’t be the linear discriminant analysis, right? I mean, the frontier is not linear… Actually, in Fisher’s seminal paper, it was assumed that \mathbf{\Sigma}_0=\mathbf{\Sigma}_1.

In that case, actually, \delta_y(\mathbf{x})={\color{blue}{\mathbf{x}}}^{\text{T}}\mathbf{\Sigma}^{-1}\mathbf{\mu}_y-\frac{1}{2}\mathbf{\mu}_y^{\text{T}}\mathbf{\Sigma}^{-1}\mathbf{\mu}_y+\log\mathbb{P}(Y=y) and the decision frontier is now linear in {\color{blue}{\mathbf{x}}}. This is the linear discriminant analysis. This can be visualized bellow

Here the two samples have the same variance matrix and the frontier is

## Link with the logistic regression

Assume as previously that\mathbf{X}\vert Y=0\sim\mathcal{N}(\mathbf{\mu}_0,\mathbf{\Sigma})and\mathbf{X}\vert Y=1\sim\mathcal{N}(\mathbf{\mu}_1,\mathbf{\Sigma})then\log\frac{\mathbb{P}(Y=1\vert \mathbf{X}=\mathbf{x})}{\mathbb{P}(Y=0\vert \mathbf{X}=\mathbf{x})}is equal to \mathbf{x}^{\text{{T}}}\mathbf{\Sigma}^{-1}[\mathbf{\mu}_y]-\frac{1}{2}[\mathbf{\mu}_1-\mathbf{\mu}_0]^{\text{{T}}}\mathbf{\Sigma}^{-1}[\mathbf{\mu}_1-\mathbf{\mu}_0]+\log\frac{\mathbb{P}(Y=1)}{\mathbb{P}(Y=0)}which is linear in \mathbf{x}\log\frac{\mathbb{P}(Y=1\vert \mathbf{X}=\mathbf{x})}{\mathbb{P}(Y=0\vert \mathbf{X}=\mathbf{x})}=\mathbf{x}^{\text{{T}}}\mathbf{\beta}Hence, when each groups have Gaussian distributions with identical variance matrix, then LDA and the logistic regression lead to the same classification rule.

Observe furthermore that the slope is proportional to \mathbf{\Sigma}^{-1}[\mathbf{\mu}_1-\mathbf{\mu}_0], as stated in Fisher’s article. But to obtain such a relationship, he observe that the ratio of between and within variances (in the two groups) was\frac{\text{variance between}}{\text{variance within}}=\frac{[\mathbf{\omega}\mathbf{\mu}_1-\mathbf{\omega}\mathbf{\mu}_0]^2}{\mathbf{\omega}^{\text{T}}\mathbf{\Sigma}_1\mathbf{\omega}+\mathbf{\omega}^{\text{T}}\mathbf{\Sigma}_0\mathbf{\omega}}which is maximal when \mathbf{\omega} is proportional to \mathbf{\Sigma}^{-1}[\mathbf{\mu}_1-\mathbf{\mu}_0], when \mathbf{\Sigma}_0=\mathbf{\Sigma}_1.

## Homebrew linear discriminant analysis

To compute vector \mathbf{\omega}

m0 = apply(myocarde[myocarde$PRONO=="0",1:7],2,mean) m1 = apply(myocarde[myocarde$PRONO=="1",1:7],2,mean) Sigma = var(myocarde[,1:7]) omega = solve(Sigma)%*%(m1-m0) omega [,1] FRCAR -0.012909708542 INCAR 1.088582058796 INSYS -0.019390084344 PRDIA -0.025817110020 PAPUL 0.020441287970 PVENT -0.038298291091 REPUL -0.001371677757 |

For the constant – in the equation \omega^T\mathbf{x}+b=0 – if we have equiprobable probabilities, use

b = (t(m1)%*%solve(Sigma)%*%m1-t(m0)%*%solve(Sigma)%*%m0)/2 |

## Application (on the small dataset)

In order to visualize what’s going on, consider the small dataset, with only two covariates,

x = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85) y = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3) z = c(1,1,1,1,1,0,0,1,0,0) df = data.frame(x1=x,x2=y,y=as.factor(z)) m0 = apply(df[df$y=="0",1:2],2,mean) m1 = apply(df[df$y=="1",1:2],2,mean) Sigma = var(df[,1:2]) omega = solve(Sigma)%*%(m1-m0) omega [,1] x1 -2.640613174 x2 4.858705676 |

Using R regular function, we get

library(MASS) fit_lda = lda(y ~x1+x2 , data=df) fit_lda Coefficients of linear discriminants: LD1 x1 -2.588389554 x2 4.762614663 |

which is the same coefficient as the one we got with our own code. For the constant, use

b = (t(m1)%*%solve(Sigma)%*%m1-t(m0)%*%solve(Sigma)%*%m0)/2 |

If we plot it, we get the red straight line

plot(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")]) abline(a=b/omega[2],b=-omega[1]/omega[2],col="red") |

As we can see (with the blue points), our red line intersects the middle of the segment of the two barycenters

points(m0["x1"],m0["x2"],pch=4) points(m1["x1"],m1["x2"],pch=4) segments(m0["x1"],m0["x2"],m1["x1"],m1["x2"],col="blue") points(.5*m0["x1"]+.5*m1["x1"],.5*m0["x2"]+.5*m1["x2"],col="blue",pch=19) |

Of course, we can also use R function

predlda = function(x,y) predict(fit_lda, data.frame(x1=x,x2=y))$class==1 vv=outer(vu,vu,predlda) contour(vu,vu,vv,add=TRUE,lwd=2,levels = .5) |

One can also consider the quadratic discriminent analysis since it might be difficult to argue that \mathbf{\Sigma}_0=\mathbf{\Sigma}_1

fit_qda = qda(y ~x1+x2 , data=df) |

The separation curve is here

plot(df$x1,df$x2,pch=19, col=c("blue","red")[1+(df$y=="1")]) predqda=function(x,y) predict(fit_qda, data.frame(x1=x,x2=y))$class==1 vv=outer(vu,vu,predlda) contour(vu,vu,vv,add=TRUE,lwd=2,levels = .5) |

# Classification from scratch, SVM 7/8

Seventh post of our series on classification from scratch. The latest one was on the neural nets, and today, we will discuss SVM, support vector machines.

## A formal introduction

Here y takes values in \{-1,+1\}. Our model will be m(\mathbf{x})=\text{sign}[\mathbf{\omega}^T\mathbf{x}+b] Thus, the space is divided by a (linear) border\Delta:\lbrace\mathbf{x}\in\mathbb{R}^p:\mathbf{\omega}^T\mathbf{x}+b=0\rbrace

The distance from point \mathbf{x}_i to \Delta is d(\mathbf{x}_i,\Delta)=\frac{\mathbf{\omega}^T\mathbf{x}_i+b}{\|\mathbf{\omega}\|}If the space is linearly separable, the problem is ill posed (there is an infinite number of solutions). So consider

\max_{\mathbf{\omega},b}\left\lbrace\min_{i=1,\cdots,n}\left\lbrace\text{distance}(\mathbf{x}_i,\Delta)\right\rbrace\right\rbrace

The strategy is to maximize the margin. One can prove that we want to solve \max_{\mathbf{\omega},m}\left\lbrace\frac{m}{\|\mathbf{\omega}\|}\right\rbrace

subject to y_i\cdot(\mathbf{\omega}^T\mathbf{x}_i)=m, \forall i=1,\cdots,n. Again, the problem is ill posed (non identifiable), and we can consider m=1: \max_{\mathbf{\omega}}\left\lbrace\frac{1}{\|\mathbf{\omega}\|}\right\rbrace

subject to y_i\cdot(\mathbf{\omega}^T\mathbf{x}_i)=1, \forall i=1,\cdots,n. The optimization objective can be written\min_{\mathbf{\omega}}\left\lbrace\|\mathbf{\omega}\|^2\right\rbrace

## The primal problem

In the separable case, consider the following primal problem,\min_{\mathbf{w}\in\mathbb{R}^d,b\in\mathbb{R}}\left\lbrace\frac{1}{2}\|\mathbf{\omega}\|^2\right\rbracesubject to y_i\cdot (\mathbf{\omega}^T\mathbf{x}_i+b)\geq 1, \forall i=1,\cdots,n.

In the non-separable case, introduce slack (error) variables \mathbf{\xi} : if y_i\cdot (\mathbf{\omega}^T\mathbf{x}_i+b)\geq 1, there is no error \xi_i=0.

Let C denote the cost of misclassification. The optimization problem becomes\min_{\mathbf{w}\in\mathbb{R}^d,b\in\mathbb{R},{\color{red}{\mathbf{\xi}}}\in\mathbb{R}^n}\left\lbrace\frac{1}{2}\|\mathbf{\omega}\|^2 + C\sum_{i=1}^n\xi_i\right\rbracesubject to y_i\cdot (\mathbf{\omega}^T\mathbf{x}_i+b)\geq 1-{\color{red}{\xi_i}}, with {\color{red}{\xi_i}}\geq 0, \forall i=1,\cdots,n.

Let us try to code this optimization problem. The dataset is here

n = length(myocarde[,"PRONO"]) myocarde0 = myocarde myocarde0$PRONO = myocarde$PRONO*2-1 C = .5 |

and we have to set a value for the cost C. In the (linearly) constrained optimization function in R, we need to provide the objective function f(\mathbf{\theta}) and the gradient \nabla f(\mathbf{\theta}).

f = function(param){ w = param[1:7] b = param[8] xi = param[8+1:nrow(myocarde)] .5*sum(w^2) + C*sum(xi)} grad_f = function(param){ w = param[1:7] b = param[8] xi = param[8+1:nrow(myocarde)] c(2*w,0,rep(C,length(xi)))} |

and (linear) constraints are written as \mathbf{U}\mathbf{\theta}-\mathbf{c}\geq \mathbf{0}

U = rbind(cbind(myocarde0[,"PRONO"]*as.matrix(myocarde[,1:7]),diag(n),myocarde0[,"PRONO"]), cbind(matrix(0,n,7),diag(n,n),matrix(0,n,1))) C = c(rep(1,n),rep(0,n)) |

Then we use

constrOptim(theta=p_init, f, grad_f, ui = U,ci = C) |

Observe that something is missing here: we need a starting point for the algorithm, \mathbf{\theta}_0. Unfortunately, I could not think of a simple technique to get a valid starting point (that satisfies those linear constraints).

Let us try something else. Because those functions are quite simple: either linear or quadratic. Actually, one can recognize in the separable case, but also in the non-separable case, a classic quadratic program\min_{\mathbf{z}\in\mathbb{R}^d}\left\lbrace\frac{1}{2}\mathbf{z}^T\mathbf{D}\mathbf{z}-\mathbf{d}\mathbf{z}\right\rbracesubject to \mathbf{A}\mathbf{z}\geq\mathbf{b}.

library(quadprog) eps = 5e-4 y = myocarde[,"PRONO"]*2-1 X = as.matrix(cbind(1,myocarde[,1:7])) n = length(y) D = diag(n+7+1) diag(D)[8+0:n] = 0 d = matrix(c(rep(0,7),0,rep(C,n)), nrow=n+7+1) A = Ui b = Ci sol = solve.QP(D+eps*diag(n+7+1), d, t(A), b, meq=1, factorized=FALSE) qpsol = sol$solution (omega = qpsol[1:7]) [1] -0.106642005446 -0.002026198103 -0.022513312261 -0.018958578746 -0.023105767847 -0.018958578746 -1.080638988521 (b = qpsol[n+7+1]) [1] 997.6289927 |

Given an observation \mathbf{x}, the prediction is

y=\text{sign}[\mathbf{\omega}^T\mathbf{x}+b]

y_pred = 2*((as.matrix(myocarde0[,1:7])%*%omega+b)>0)-1 |

Observe that here, we do have a classifier, depending if the point lies on the left or on the right (above or below, etc) the separating line (or hyperplane). We do not have a probability, because there is no probabilistic model here. So far.

## The dual problem

The Lagrangian of the separable problem could be written introducing Lagrange multipliers \mathbf{\alpha}\in\mathbb{R}^n, \mathbf{\alpha}\geq \mathbf{0} as\mathcal{L}(\mathbf{\omega},b,\mathbf{\alpha})=\frac{1}{2}\|\mathbf{\omega}\|^2-\sum_{i=1}^n \alpha_i\big(y_i(\mathbf{\omega}^T\mathbf{x}_i+b)-1\big)Somehow, \alpha_i represents the influence of the observation (y_i,\mathbf{x}_i).

Consider the Dual Problem, with \mathbf{G}=[G_{ij}] and G_{ij}=y_iy_j\mathbf{x}_j^T\mathbf{x}_i

\min_{\mathbf{\alpha}\in\mathbb{R}^n}\left\lbrace\frac{1}{2}\mathbf{\alpha}^T\mathbf{G}\mathbf{\alpha}-\mathbf{1}^T\mathbf{\alpha}\right\rbrace

subject to \mathbf{y}^T\mathbf{\alpha}=\mathbf{0} and \mathbf{\alpha}\geq\mathbf{0}.

The Lagrangian of the non-separable problem could be written introducing Lagrange multipliers \mathbf{\alpha},{\color{red}{\mathbf{\beta}}}\in\mathbb{R}^n, \mathbf{\alpha},{\color{red}{\mathbf{\beta}}}\geq \mathbf{0}, and define the Lagrangian \mathcal{L}(\mathbf{\omega},b,{\color{red}{\mathbf{\xi}}},\mathbf{\alpha},{\color{red}{\mathbf{\beta}}}) as\frac{1}{2}\|\mathbf{\omega}\|^2+{\color{blue}{C}}\sum_{i=1}^n{\color{red}{\xi_i}}-\sum_{i=1}^n \alpha_i\big(y_i(\mathbf{\omega}^T\mathbf{x}_i+b)-1+{\color{red}{\xi_i}}\big)-\sum_{i=1}^n{\color{red}{\beta_i}}{\color{red}{\xi_i}}

Somehow, \alpha_i represents the influence of the observation (y_i,\mathbf{x}_i).

The Dual Problem become with \mathbf{G}=[G_{ij}] and G_{ij}=y_iy_j\mathbf{x}_j^T\mathbf{x}_i\min_{\mathbf{\alpha}\in\mathbb{R}^n}\left\lbrace\frac{1}{2}\mathbf{\alpha}^T\mathbf{G}\mathbf{\alpha}-\mathbf{1}^T\mathbf{\alpha}\right\rbrace

subject to \mathbf{y}^T\mathbf{\alpha}=\mathbf{0}, \mathbf{\alpha}\geq\mathbf{0} and \mathbf{\alpha}\leq {\color{blue}{C}}.

As previsouly, one can also use quadratic programming

library(quadprog) eps = 5e-4 y = myocarde[,"PRONO"]*2-1 X = as.matrix(cbind(1,myocarde[,1:7])) n = length(y) Q = sapply(1:n, function(i) y[i]*t(X)[,i]) D = t(Q)%*%Q d = matrix(1, nrow=n) A = rbind(y,diag(n),-diag(n)) C = .5 b = c(0,rep(0,n),rep(-C,n)) sol = solve.QP(D+eps*diag(n), d, t(A), b, meq=1, factorized=FALSE) qpsol = sol$solution |

The two problems are connected in the sense that for all \mathbf{x}\mathbf{\omega}^T\mathbf{x}+b = \sum_{i=1}^n \alpha_i y_i (\mathbf{x}^T\mathbf{x}_i)+b

To recover the solution of the primal problem,\mathbf{\omega}=\sum_{i=1}^n \alpha_iy_i \mathbf{x}_ithus

omega = apply(qpsol*y*X,2,sum) omega 1 FRCAR INCAR INSYS 0.0000000000000002439074265 0.0550138658687635215271960 -0.0920163239049630876653652 0.3609571899422952534486342 PRDIA PAPUL PVENT REPUL -0.1094017965288692356695677 -0.0485213403643276475207813 -0.0660058643191372279579454 0.0010093656567606212794835 |

while b=y-\mathbf{\omega}^T\mathbf{x} (but actually, one can add the constant vector in the matrix of explanatory variables).

More generally, consider the following function (to make sure that D is a definite-positive matrix, we use the nearPD function).

svm.fit = function(X, y, C=NULL) { n.samples = nrow(X) n.features = ncol(X) K = matrix(rep(0, n.samples*n.samples), nrow=n.samples) for (i in 1:n.samples){ for (j in 1:n.samples){ K[i,j] = X[i,] %*% X[j,] }} Dmat = outer(y,y) * K Dmat = as.matrix(nearPD(Dmat)$mat) dvec = rep(1, n.samples) Amat = rbind(y, diag(n.samples), -1*diag(n.samples)) bvec = c(0, rep(0, n.samples), rep(-C, n.samples)) res = solve.QP(Dmat,dvec,t(Amat),bvec=bvec, meq=1) a = res$solution bomega = apply(a*y*X,2,sum) return(bomega) } |

On our dataset, we obtain

M = as.matrix(myocarde[,1:7]) center = function(z) (z-mean(z))/sd(z) for(j in 1:7) M[,j] = center(M[,j]) bomega = svm.fit(cbind(1,M),myocarde$PRONO*2-1,C=.5) y_pred = 2*((cbind(1,M)%*%bomega)>0)-1 table(obs=myocarde0$PRONO,pred=y_pred) pred obs -1 1 -1 27 2 1 9 33 |

i.e. 11 misclassification, out of 71 points (which is also what we got with the logistic regression).

## Kernel Based Approach

In some cases, it might be difficult to “separate” by a linear separators the two sets of points, like below,

It might be difficult, here, because which want to find a straight line in the two dimensional space (x_1,x_2). But maybe, we can distort the space, possible by adding another dimension

That’s heuristically the idea. Because on the case above, in dimension 3, the set of points is now linearly separable. And the trick to do so is to use a kernel. The difficult task is to find the good one (if any).

A positive kernel on \mathcal{X} is a function K:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R} symmetric, and such that for any n, \forall\alpha_1,\cdots,\alpha_n and \forall\mathbf{x}_1,\cdots,\mathbf{x}_n,\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_j k(\mathbf{x}_i,\mathbf{x}_j)\geq 0.

For example, the linear kernel is k(\mathbf{x}_i,\mathbf{x}_j)=\mathbf{x}_i^T\mathbf{x}_j. That’s what we’ve been using here, so far. One can also define the product kernel k(\mathbf{x}_i,\mathbf{x}_j)=\kappa(\mathbf{x}_i)\cdot\kappa(\mathbf{x}_j) where \kappa is some function \mathcal{X}\rightarrow\mathbb{R}.

Finally, the Gaussian kernel is k(\mathbf{x}_i,\mathbf{x}_j)=\exp[-\|\mathbf{x}_i-\mathbf{x}_j\|^2].

Since it is a function of \|\mathbf{x}_i-\mathbf{x}_j\|, it is also called a radial kernel.

linear.kernel = function(x1, x2) { return (x1%*%x2) } svm.fit = function(X, y, FUN=linear.kernel, C=NULL) { n.samples = nrow(X) n.features = ncol(X) K = matrix(rep(0, n.samples*n.samples), nrow=n.samples) for (i in 1:n.samples){ for (j in 1:n.samples){ K[i,j] = FUN(X[i,], X[j,]) } } Dmat = outer(y,y) * K Dmat = as.matrix(nearPD(Dmat)$mat) dvec = rep(1, n.samples) Amat = rbind(y, diag(n.samples), -1*diag(n.samples)) bvec = c(0, rep(0, n.samples), rep(-C, n.samples)) res = solve.QP(Dmat,dvec,t(Amat),bvec=bvec, meq=1) a = res$solution bomega = apply(a*y*X,2,sum) return(bomega) } |

## Link to the regression

To relate this duality optimization problem to OLS, recall that y=\mathbf{x}^T\mathbf{\omega}+\varepsilon, so that \widehat{y}=\mathbf{x}^T\widehat{\mathbf{\omega}}, where \widehat{\mathbf{\omega}}=[\mathbf{X}^T\mathbf{X}]^{-1}\mathbf{X}^T\mathbf{y}

But one can also write y=\mathbf{x}^T\widehat{\mathbf{\omega}}=\sum_{i=1}^n \widehat{\alpha}_i\cdot \mathbf{x}^T\mathbf{x}_i

where \widehat{\mathbf{\alpha}}=\mathbf{X}[\mathbf{X}^T\mathbf{X}]^{-1}\widehat{\mathbf{\omega}}, or conversely, \widehat{\mathbf{\omega}}=\mathbf{X}^T\widehat{\mathbf{\alpha}}.

## Application (on our small dataset)

One can actually use a dedicated R package to run a SVM. To get the linear kernel, use

library(kernlab) df0 = df df0$y = 2*(df$y=="1")-1 SVM1 = ksvm(y ~ x1 + x2, data = df0, C=.5, kernel = "vanilladot" , type="C-svc") |

Since the dataset is not linearly separable, there will be some mistakes here

table(df0$y,predict(SVM1)) -1 1 -1 2 2 1 1 5 |

The problem with that function is that it cannot be used to get a prediction for other points than those in the sample (and I could neither extract \omega nor b from the 24 slots of that objet). But it’s possible by adding a small option in the function

SVM2 = ksvm(y ~ x1 + x2, data = df0, C=.5, kernel = "vanilladot" , prob.model=TRUE, type="C-svc") |

With that function, we convert the distance as some sort of probability. Someday, I will try to replicate the probabilistic version of SVM, I promise, but today, the goal is just to understand what is done when running the SVM algorithm. To visualize the prediction, use

pred_SVM2 = function(x,y){ return(predict(SVM2,newdata=data.frame(x1=x,x2=y), type="probabilities")[,2])} plot(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")], cex=1.5,xlab="", ylab="",xlim=c(0,1),ylim=c(0,1)) vu = seq(-.1,1.1,length=251) vv = outer(vu,vu,function(x,y) pred_SVM2(x,y)) contour(vu,vu,vv,add=TRUE,lwd=2,nlevels = .5,col="red") |

Here the cost is C=.5, but of course, we can change it

SVM2 = ksvm(y ~ x1 + x2, data = df0, C=2, kernel = "vanilladot" , prob.model=TRUE, type="C-svc") pred_SVM2 = function(x,y){ return(predict(SVM2,newdata=data.frame(x1=x,x2=y), type="probabilities")[,2])} plot(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")], cex=1.5,xlab="", ylab="",xlim=c(0,1),ylim=c(0,1)) vu = seq(-.1,1.1,length=251) vv = outer(vu,vu,function(x,y) pred_SVM2(x,y)) contour(vu,vu,vv,add=TRUE,lwd=2,levels = .5,col="red") |

As expected, we have a linear separator. But slightly different. Now, let us consider the “Radial Basis Gaussian kernel”

SVM3 = ksvm(y ~ x1 + x2, data = df0, C=2, kernel = "rbfdot" , prob.model=TRUE, type="C-svc") |

Observe that here, we’ve been able to separare the white and the black points

table(df0$y,predict(SVM3)) -1 1 -1 4 0 1 0 6 |

plot(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")], cex=1.5,xlab="", ylab="",xlim=c(0,1),ylim=c(0,1)) vu = seq(-.1,1.1,length=251) vv = outer(vu,vu,function(x,y) pred_SVM3(x,y)) contour(vu,vu,vv,add=TRUE,lwd=2,levels = .5,col="red") |

Now, to be completely honest, if I understand the theory of the algorithm used to compute \omega and b with linear kernel (using quadratic programming), I do not feel confortable with this R function. Especially if you run it several times… you can get (with exactly the same set of parameters)

or

(to be continued…)

# Traveling Salesman

In the second part of the course on graphs and networks, we will focus on economic applications, and flows. The first series of slides are on the traveling salesman problem. Slides are available online.

# Simple and heuristic optimization

This week, at the Rmetrics conference, there has been an interesting discussion about heuristic optimization. The starting point was simple: in complex optimization problems (here we mean with a lot of local maxima, for instance), we do not necessarily need extremely advanced algorithms that do converge extremly fast, if we cannot ensure that they reach the optimum. Converging extremely fast, with a great numerical precision to some point (that is not the point we’re looking for) is useless. And some algorithms might be much slower, but at least, it is much more likely to converge to the optimum. Wherever we start from.

We have experienced that with Mathieu, while we were looking for maximum likelihood of our MINAR process: genetic algorithm have performed extremely well. The idea is extremly simple, and natural. Let us consider as a starting point the following algorithm,

- Start from some
- At step , draw a point in a neighborhood of ,

- either then
- or then

This is simple (if you do not enter into details about what such a neighborhood should be). But using that kind of algorithm, you might get trapped and attracted to some local optima if the *neighborhood* is not large enough. An alternative to this technique is the following: it might be interesting to change a bit more, and instead of changing when we have a maximum, we change if we have almost a maximum. Namely at step ,

- either then
- or then

for some . To illustrate the idea, consider the following function

> x0=15 > MX=matrix(NA,501,2) > MX[1,]=runif(2,-x0,x0) > k=.5 > for(s in 2:501){ + bruit=rnorm(2) + X=MX[s-1,]+bruit*3 + if(X[1]>x0){X[1]=x0} + if(X[1]<(-x0)){X[1]=-x0} + if(X[2]>x0){X[2]=x0} + if(X[2]<(-x0)){X[2]=-x0} + if(f(X[1],X[2])+k>f(MX[s-1,1], + MX[s-1,2])){MX[s,]=X} + if(f(X[1],X[2])+k<=f(MX[s-1,1], + MX[s-1,2])){MX[s,]=MX[s-1,]} +}

It does not always converge towards the optimum,

and sometimes, we just missed it after being extremely unlucky

Note that if we run 10,000 scenarios (with different random noises and starting point), in 50% scenarios, we reach the maxima. Or at least, we are next to it, on top.

What if we compare with a standard optimization routine, like Nelder-Mead, or quasi gradient ?Since we look for the maxima on a restricted domain, we can use the following function,

> g=function(x) f(x[1],x[2]) > optim(X0, g,method="L-BFGS-B", + lower=-c(x0,x0),upper=c(x0,x0))$par

In that case, if we run the algorithm with 10,000 random starting point, this is where we end, below on the right (while the heuristic technique is on the left),

In only 15% of the scenarios, we have been able to reach the region where the maximum is.

So here, it looks like an heuristic method works extremelly well, if do not need to reach the maxima with a great precision. Which is usually the case actually.

# EM and mixture estimation

Following my previous post on optimization and mixtures (here), Nicolas told me that my idea was probably not the most clever one (there).

So, we get back to our simple mixture model,

In order to describe how EM algorithm works, assume first that both and are perfectly known, and the mixture parameter is the only one we care about.

- The simple model, with only one parameter that is unknown

Here, the likelihood is

so that we write the log likelihood as

which might not be simple to maximize. Recall that the mixture model can interpreted through a latent variate (that cannot be observed), taking value when is drawn from , and 0 if it is drawn from . More generally (especially in the case we want to extend our model to 3, 4, … mixtures), and .

With that notation, the likelihood becomes

and the log likelihood

the term on the right is useless since we only care about p, here. From here, consider the following iterative procedure,

Assume that the mixture probability is known, denoted . Then I can predict the value of (i.e. and ) for all observations,

So I can inject those values into my log likelihood, i.e. in

having maximum (no need to run numerical tools here)

that will be denoted . And I can iterate from here.

Formally, the first step is where we calculate an expected (E) value, where is the best predictor of given my observations (as well as my belief in ). Then comes a maximization (M) step, where using , I can estimate probability .

- A more general framework, all parameters are now unkown

So far, it was simple, since we assumed that and were perfectly known. Which is not reallistic. An there is not much to change to get a complete algorithm, to estimate . Recall that we had which was the expected value of Z_{1,i}, i.e. it is a probability that observation i has been drawn from .

If , instead of being in the segment was in , then we could have considered mean and standard deviations of observations such that =0, and similarly on the subset of observations such that =1.

But we can’t. So what can be done is to consider as the weight we should give to observation i when estimating parameters of , and similarly, 1-would be weights given to observation i when estimating parameters of .

So we set, as before

and then

and for the variance, well, it is a weighted mean again,

and this is it.

- Let us run the code on the same data as before

Here, the code is rather simple: let us start generating a sample

> X1 = rnorm(n,0,1)

> X20 = rnorm(n,0,1)

> Z = sample(c(1,2,2),size=n,replace=TRUE)

> X2=4+X20

> X = c(X1[Z==1],X2[Z==2])

then, given a vector of initial values (that I called and then before),

> s = c(0.5, mean(X)-1, var(X), mean(X)+1, var(X))

I define my function as,

> em = function(X0,s) {

+ Ep = s[1]*dnorm(X0, s[2], sqrt(s[4]))/(s[1]*dnorm(X0, s[2], sqrt(s[4])) +

+ (1-s[1])*dnorm(X0, s[3], sqrt(s[5])))

+ s[1] = mean(Ep)

+ s[2] = sum(Ep*X0) / sum(Ep)

+ s[3] = sum((1-Ep)*X0) / sum(1-Ep)

+ s[4] = sum(Ep*(X0-s[2])^2) / sum(Ep)

+ s[5] = sum((1-Ep)*(X0-s[3])^2) / sum(1-Ep)

+ return(s)

+ }

Then I get , or . So this is it ! We just need to iterate (here I stop after 200 iterations) since we can see that, actually, our algorithm converges quite fast,

> for(i in 2:200){

+ s=em(X,s)

+ }

Let us run the same procedure as before, i.e. I generate samples of size 200, where difference between means can be small (0) or large (4),

Ok, Nicolas, you were right, we’re doing much better ! Maybe we should also go for a Gibbs sampling procedure ?… next time, maybe….

# Optimization and mixture estimation

Recently, one of my students asked me about optimization routines in R. He told me he that R performed well on the estimation of a time series model with different regimes, while he had trouble with a (simple) GARCH process, and he was wondering if R was good in optimization routines. Actually, I always thought that mixtures (and regimes) was something difficult to estimate, so I was a bit surprised…

Indeed, it reminded me some trouble I experienced once, while I was talking about *maximum likelihooh estimation*, for non standard distribution, i.e. when optimization had to be done on the log likelihood function. And even when generating nice samples, giving appropriate initial values (actually the *true* value used in random generation), each time I tried to optimize my log likelihood, it failed. So I decided to play a little bit with standard optimization functions, to see which one performed better when trying to estimate mixture parameter (from a mixture based sample). Here, I generate a mixture of two gaussian distributions, and I would like to see how different the mean should be to have a high probability to estimate properly the parameters of the mixture.

The density is here proportional to

The true model is , and being a parameter that will change, from 0 to 4.

The log likelihood (actually, I add a minus since most of the optimization functions actually *minimize* functions) is

> logvraineg <- function(param, obs) {

+ p <- param[1]

+ m1 <- param[2]

+ sd1 <- param[3]

+ m2 <- param[4]

+ sd2 <- param[5]

+ -sum(log(p * dnorm(x = obs, mean = m1, sd = sd1) + (1 – p) *

+ dnorm(x = obs, mean = m2, sd = sd2)))

+ }

The code to generate my samples is the following,

>X1 = rnorm(n,0,1)

> X20 = rnorm(n,0,1)

> Z = sample(c(1,2,2),size=n,replace=TRUE)

> X2=m+X20

> X = c(X1[Z==1],X2[Z==2])

Then I use two functions to optimize my log likelihood, with identical intial values,

> O1=nlm(f = logvraineg, p = c(.5, mean(X)-sd(X)/5, sd(X), mean(X)+sd(X)/5, sd(X)), obs = X)

> logvrainegX <- function(param) {logvraineg(param,X)}

> O2=optim( par = c(.5, mean(X)-sd(X)/5, sd(X), mean(X)+sd(X)/5, sd(X)),

+ fn = logvrainegX)

Actually, since I might have identification problems, I take either or , depending whether or is the smallest parameter.

On the graph above, the x-axis is the difference between means of the mixture (as on the animated grap above). Then, the red point is the median of estimated parameter I have (here ), and I have included something that can be interpreted as a *confidence interval*, i.e. where I have been in 90% of my scenarios: the**black** vertical segments. Obviously, when the sample is not enough heterogeneous (i.e. and rather different), I cannot estimate properly my parameters, I might even have a probability that exceed 1 (I did not add any constraint). The blue plain horizontal line is the *true* value of the parameter, while the blue dotted horizontal line is the initial value of the parameter in the optimization algorithm (I started assuming that the mixture probability was around 0.2).

The graph below is based on the second optimization routine (with identical starting values, and of course on the same generated samples),

(just to be honest, in many cases, it did not converge, so the loop stopped, and I had to run it again… so finally, my study is based on a bit less than 500 samples (times 15 since I considered several values for the mean of my second underlying distribution), with 200 generated observations from a mixture).

The graph below compares the two (empty circles are the first algorithm, while plain circles the second one),

On average, it is not so bad…. but the probability to be far away from the tru value is not small at all… except when the difference between the two means exceeds 3…

If I change starting values for the optimization algorithm (previously, I assumed that the mixture probability was 1/5, here I start from 1/2), we have the following graph

which look like the previous one, except for small differences between the two underlying distributions (just as if initial values had not impact on the optimization, but it might come from the fact that the surface is nice, and we are not trapped in regions of local minimum).

Thus, I am far from being an expert in optimization routines in R (see here for further information), but so far, it looks like R is not doing so bad… and the two algorithm perform similarly (maybe the first one being a bit closer to the *true*parameter).