analistica/notes/sections/7.md

18 KiB
Raw Blame History

Exercise 7

Generating random points on the plane

Two sets of 2D points (x, y) --- signal and noise --- are to be generated according to two bivariate Gaussian distributions with parameters: \begin{align*} \text{signal}: \begin{cases} \mu = (0, 0) \ \sigma_x = \sigma_y = 0.3 \ \rho = 0.5 \end{cases} && \text{noise}: \begin{cases} \mu = (4, 4) \ \sigma_x = \sigma_y = 1 \ \rho = 0.4 \end{cases} \end{align*} where \mu stands for the mean, \sigma_x and \sigma_y for the standard deviations in x and y directions respectively and \rho is the bivariate correlation, namely:


  \sigma_{xy} = \rho\, \sigma_x \sigma_y

where \sigma_{xy} is the covariance of x and y.

In the programs, N_s = 800 points for the signal and N_n = 1000 points for the noise were chosen as default but can be customized from the command-line. Both samples were stored as n \times 2 matrices, where n is the number of points in the sample. The library gsl_matrix provided by GSL was employed for this purpose and the function gsl_ran_bivariate_gaussian() was used for generating the points.
An example of the two samples is shown in @fig:points.

Example of points sampled according to the two Gaussian distributions
with the given parameters.{#fig:points}

Assuming to not know how the points were generated, a model of classification, which assign to each point the 'most probably' class it belongs to, was implemented. Depending on the interpretation of 'most probable', many different models can be developed. Here, the Fisher linear discriminant and the perceptron were implemented and described in the following two sections. The results are compared in @sec:class-results.

Fisher linear discriminant

The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on dimensionality reduction. It does so by projecting the data onto hyperplanes that best divide the classes of points, consequently decreasing the dimension to n-1. In the 2D case the projection is onto a line, therefore the problem is reduced to simply selecting a threshold.

Consider the case of two classes (in this case signal and noise): the simplest representation of a linear discriminant is obtained by taking a linear function \tilde{x} of a sampled 2D point x so that:


  \tilde{x} = w^T x

where w is the so-called 'weight vector' and w^T stands for its transpose.

An input point x is commonly assigned to the first class if $\tilde{x} > w_{th}$ and to the second one otherwise, where w_{th} is a threshold value somehow defined. In general, the projection onto one dimension leads to a considerable loss of information and classes that are well separated in the original 2D space may become strongly overlapping in one dimension. However, by adjusting the components of the weight vector, a projection that maximizes the classes separation can be found [@bishop06].

To begin with, consider N_1 points of class C_1 and N_2 points of class C_2, so that the means \mu_1 and \mu_2 of the two classes are given by:


  \mu_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
  \et
  \mu_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n

The simplest measure of the separation of the classes is the separation of the projected class means. This suggests to choose w so as to maximize:


  \tilde{\mu}_2  \tilde{\mu}_1 = w^T (\mu_2  \mu_1)

This expression can be made arbitrarily large simply by increasing the magnitude of w but, fortunately, the problem is easily solved by requiring w to be normalised: | w^2 | = 1. Using a Lagrange multiplier to perform the constrained maximization, it can be found that w \propto (\mu_2 \mu_1), meaning that the line onto the points must be projected is the one joining the class means.
There is still a problem with this approach, however, as illustrated in @fig:overlap: the two classes are well separated in the original 2D space but have considerable overlap when projected onto the line joining their means which maximize their projections distance.

The plot on the left shows samples from two classes along with the
histograms resulting from the  projection onto the line joining the
class means: note the considerable overlap in the projected
space. The right plot shows the corresponding projection based on the
Fisher linear discriminant, showing the greatly improved classes
separation.{#fig:overlap}

The overlap of the projections can be reduced by maximising a function that gives, besides a large separation, small variance within each class. The within-class variance of the transformed data of each class k is given by:


  \tilde{s}_k^2 = \sum_{n \in c_k} (\tilde{x}_n - \tilde{\mu}_k)^2

The total within-class variance for the whole data set is simply defined as \tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2. The Fisher criterion is therefore defined to be the ratio of the between-class distance to the within-class variance and is given by:


  F(w) = \frac{(\tilde{\mu}_2 - \tilde{\mu}_1)^2}{\tilde{s}^2}

The dependence on w can be made explicit: \begin{align*} (\tilde{\mu}_2 - \tilde{\mu}1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \ &= [w^T (\mu_2 - \mu_1)]^2 \ &= [w^T (\mu_2 - \mu_1)][w^T (\mu_2 - \mu_1)] \ &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w] = w^T M w \end{align*} where M is the between-distance matrix. Similarly, as regards the denominator: \begin{align*} \tilde{s}^2 &= \tilde{s}1^2 + \tilde{s}2^2 = \ &= \sum{n \in c_1} (\tilde{x}n - \tilde{\mu}1)^2 + \sum{n \in c_2} (\tilde{x}n - \tilde{\mu}2)^2 = w^T \Sigma_w w \end{align*} where \Sigma_w is the total within-class covariance matrix: \begin{align*} \Sigma_w &= \sum{n \in c_1} (x_n \mu_1)(x_n \mu_1)^T + \sum{n \in c_2} (x_n \mu_2)(x_n \mu_2)^T \ &= \Sigma_1 + \Sigma_2 = \begin{pmatrix} \sigma_x^2 & \sigma{xy} \ \sigma{xy} & \sigma_y^2 \end{pmatrix}1 + \begin{pmatrix} \sigma_x^2 & \sigma{xy} \ \sigma{xy} & \sigma_y^2 \end{pmatrix}_2 \end{align*} Where \Sigma_1 and \Sigma_2 are the covariance matrix of the two samples. The Fisher criterion can therefore be rewritten in the form:


  F(w) = \frac{w^T M w}{w^T \Sigma_w w}

Differentiating with respect to w, it can be found that F(w) is maximized when:


  w = \Sigma_w^{-1} (\mu_2 - \mu_1)
$$ {#eq:fisher-weight}
This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then be
used to construct a discriminant by choosing a threshold for the
classification.

When implemented, the parameters given in @sec:sampling were used to compute
the covariance matrices and their sum $\Sigma_w$. Then $\Sigma_w$, being a
symmetrical and positive-definite matrix, was inverted with the Cholesky method,
already discussed in @sec:MLM. Lastly, the matrix-vector product was computed
with the `gsl_blas_dgemv()` function provided by GSL.


### The threshold

The threshold $t_{\text{cut}}$ was fixed by the conditional
probability $P(c_k | t_{\text{cut}})$ being the same for both classes $c_k$:

t_{\text{cut}} = x \quad \text{such that}\quad \frac{P(c_1 | x)}{P(c_2 | x)} = \frac{P(x | c_1) , P(c_1)}{P(x | c_2) , P(c_2)} = 1


where $P(x | c_k)$ is the probability for point $x$ along the Fisher projection
line of being sampled from $k$. If $\tilde{x} > t_\text{cut}$ then more likely
$x \in c_1$, otherwise $x \in c_2$.

If each class is a bivariate Gaussian distribution, as in the present case,
then $P(x | c_k)$ is simply given by its projected normal distribution with
mean $\tilde{m} = w^T m$ and variance $\tilde{s} = w^T S w$, being $S$ the
covariance matrix of the class.  
After some algebra, the threshold is found to be:

t_{\text{cut}} = \frac{b}{a} + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}


where:

  - $a = \tilde{s}_1^2 - \tilde{s}_2^2$
  - $b = \tilde{\mu}_2 \, \tilde{s}_1^2 - \tilde{\mu}_1 \, \tilde{s}_2^2$
  - $c = \tilde{\mu}_2^2 \, \tilde{s}_1^2 - \tilde{\mu}_1^2 \, \tilde{s}_2^2
       - 2 \, \tilde{s}_1^2 \, \tilde{s}_2^2 \, \ln(\alpha)$
  - $\alpha = P(c_1) / P(c_2)$

In a simulation, the ratio of the prior probabilities $\alpha$ can
simply be set to:

\alpha = \frac{N_s}{N_n}


The projection of the points was accomplished by the use of the function
`gsl_blas_ddot()`, which computes a fast dot product of two vectors.

Results obtained for the same samples in @fig:points are shown below in
@fig:fisher-proj. The weight vector and the threshold were found to be:

w = (0.707, 0.707) \et t_{\text{cut}} = 1.323



::: { id=fig:fisher-proj }
![Scatter plot of the samples.](images/7-fisher-plane.pdf)
![Histogram of the Fisher-projected samples.](images/7-fisher-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::


### A mathematical curiosity

This section is really a sidenote which grew too large to fit in a margin,
so it can be safely skipped.

It can be seen that the weight vector turned out to be parallel to the line
joining the means of the two classes (as a remainder: $(0, 0)$ and $(4, 4)$),
as if the within-class covariances were ignored. Weird!

Looking at @eq:fisher-weight, one can be mislead into thinking that the inverse
of the total covariance matrix, $\Sigma_w$ is (proportional to) the identity,
but that's not true. By a remarkable accident, the vector joining the means is
an eigenvector of the covariance matrix $\Sigma_w^{-1}$. In
fact: since $\sigma_x = \sigma_y$ for both signal and noise:

\Sigma_1 = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \ \sigma_{xy} & \sigma_x^2 \end{pmatrix}1 \et \Sigma_2 = \begin{pmatrix} \sigma_x^2 & \sigma{xy} \ \sigma_{xy} & \sigma_x^2 \end{pmatrix}_2


$\Sigma_w$ takes the symmetrical form

\Sigma_w = \begin{pmatrix} A & B \ B & A \end{pmatrix},


which can be easily inverted by Gaussian elimination:
\begin{align*}
  \begin{pmatrix}
  A & B & \vline & 1 & 0 \\
  B & A & \vline & 0 & 1 \\
  \end{pmatrix} &\longrightarrow
  \begin{pmatrix}
  A - B & 0     & \vline & 1 - B & - B   \\
  0     & A - B & \vline & - B   & 1 - B \\
  \end{pmatrix}  \\ &\longrightarrow
  \begin{pmatrix}
  1 & 0 & \vline & (1 - B)/(A - B) &     - B/(A - B) \\
  0 & 1 & \vline &    -  B/(A - B) & (1 - B)/(A - B) \\
  \end{pmatrix}
\end{align*}

Hence, the inverse has still the same form:

\Sigma_w^{-1} = \begin{pmatrix} \tilde{A} & \tilde{B} \ \tilde{B} & \tilde{A} \end{pmatrix}


For this reason, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors
$v_1$ and $v_2$:

v_1 = \begin{pmatrix} 1 \ -1 \end{pmatrix} \et v_2 = \begin{pmatrix} 1 \ 1 \end{pmatrix}


The vector joining the means is clearly a multiple of $v_2$, and so is $w$.


## Perceptron

In machine learning, the perceptron is an algorithm for supervised learning of
linear binary classifiers.

Supervised learning is the machine learning task of inferring a function $f$
that maps an input $x$ to an output $f(x)$ based on a set of training
input-output pairs, where each pair consists of an input object and an output
value. The inferred function can be used for mapping new examples: the algorithm
is generalized to correctly determine the class labels for unseen instances.

The aim of the perceptron algorithm is to determine the weight vector $w$ and
bias $b$ such that the so-called 'threshold function' $f(x)$ returns a binary
value: it is expected to return 1 for signal points and 0 for noise points:

f(x) = \theta(w^T x + b)

 {#eq:perc}
where $\theta$ is the Heaviside theta function. Note that the bias $b$ is
$-t_\text{cut}$, as defined in the previous section.

The training was performed using the generated sample as training set. From an
initial guess for $w$ and $b$ (which were set to be all null in the code), the
perceptron starts to improve their estimations. The training set is passed
point by point into a iterative procedure $N$ times: for every point, the
output of $f(x)$ is computed. Afterwards, the variable $\Delta$, which is
defined as:

\Delta = r [e - f(x)]


where:

  - $r \in [0, 1]$ is the learning rate of the perceptron: the larger $r$, the
    more volatile the weight changes. In the code it was arbitrarily set $r =
    0.8$;
  - $e$ is the expected output value, namely 1 if $x$ is signal and 0 if it is
    noise;

is used to update $b$ and $w$:

b \to b + \Delta \et w \to w + \Delta x


To see how it works, consider the four possible situations:

  - $e = 1 \quad \wedge \quad f(x) = 1 \quad \dot \vee \quad e = 0 \quad \wedge
    \quad f(x) = 0  \quad \Longrightarrow \quad \Delta = 0$  
    the current estimations work properly: $b$ and $w$ do not need to be updated;
  - $e = 1 \quad \wedge \quad f(x) = 0 \quad \Longrightarrow \quad
    \Delta \propto 1$  
    the current $b$ and $w$ underestimate the correct output: they must be
    increased;
  - $e = 0 \quad \wedge \quad f(x) = 1 \quad \Longrightarrow \quad
    \Delta \propto -1$  
    the current $b$ and $w$ overestimate the correct output: they must be
    decreased.

Whilst the $b$ updating is obvious, as regards $w$ the following consideration
may help clarify. Consider the case with $e = 0 \quad \wedge \quad f(x) = 1
\quad \Longrightarrow \quad \Delta = -r$:

w^T \cdot x \to (w^T + \Delta x^T) \cdot x = w^T \cdot x + \Delta |x|^2 = w^T \cdot x - r|x|^2 \leq w^T \cdot x


Similarly for the case with $e = 1$ and $f(x) = 0$.

![Weight vector and threshold value obtained with the perceptron method as a
  function of the number of iterations. Both level off at the third
  iteration.](images/7-iterations.pdf){#fig:percep-iterations}

As far as convergence is concerned, the perceptron will never get to the state
with all the input points classified correctly if the training set is not
linearly separable, meaning that the signal cannot be separated from the noise
by a line in the plane. In this case, no approximate solutions will be
gradually approached. On the other hand, if the training set is linearly
separable, it can be shown (see [@novikoff63]) that this method converges to
the coveted function.  As in the previous section, once found, the weight
vector is to be normalized.

Different values of the learning rate were tested, all giving the same result,
converging for a number $N = 3$ of iterations. In @fig:percep-iterations,
results are shown for $r = 0.8$: as can be seen, for $N = 3$, the values of $w$
and $t^{\text{cut}}$ level off.  
The following results were obtained:

w = (-0.654, -0.756) \et t_{\text{cut}} = 1.213


In this case, the projection line is not exactly parallel with the line joining
the means of the two samples. Plots in @fig:percep-proj.

::: { id=fig:percep-proj }
![Scatter plot of the samples.](images/7-percep-plane.pdf)
![Histogram of the projected samples.](images/7-percep-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::


## Efficiency test {#sec:class-results}

Using the same parameters of the training set, a number $N_t$ of test samples
was generated and the points were classified applying both methods. To avoid
storing large datasets in memory, at each iteration, false positives and
negatives were recorded using a running statistic method implemented in the
`gsl_rstat` library. For each sample, the numbers $N_{fn}$ and $N_{fp}$ of
false negative and false positive were obtained in this way: for every signal
point $x_s$, the threshold function $f(x_s)$ was computed, then:

  - if $f(x_s) = 1 \thus$ $N_{fn} \to N_{fn}$
  - if $f(x_s) = 0 \thus$ $N_{fn} \to N_{fn} + 1$

and similarly, for the noise points:

  - if $f(x_n) = 1 \thus$ $N_{fp} \to N_{fp} + 1$
  - if $f(x_n) = 0 \thus$ $N_{fp} \to N_{fp}$

Finally, the mean and standard deviation were computed from $N_{fn}$ and
$N_{fp}$ for every sample and used to estimate the significance $\alpha$
and false-positive rate $\beta$ of the classification:

\alpha = \frac{\text{mean}(N_{fn})}{N_s} \et \beta = \frac{\text{mean}(N_{fp})}{N_n}


Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the Fisher
discriminant gives a nearly perfect classification with a symmetric distribution
of true negatives and false positives, whereas the perceptron shows a little more
false positives than false negatives, being also more variable from dataset to
dataset.  
A possible explanation of this fact is that, for linearly separable and normally
distributed points, the Fisher linear discriminant is an exact analytical
solution, the most powerful one, according to the Neyman-Pearson lemma, whereas
the perceptron is only expected to converge to the solution and is therefore
more subject to random fluctuations.

-------------------------------------------------------
            $1-α$      $σ_{1-α}$  $1-β$      $σ_{1-β}$
----------- ---------- ---------- ---------- ----------
Fisher       0.9999     0.33       0.9999     0.33

Perceptron   0.9999     0.28       0.9995     0.64
-------------------------------------------------------

Table: Results for Fisher and perceptron method. $\sigma_{\alpha}$ and
       $\sigma_{\beta}$ stand for the standard deviation of the false
       negatives and false positives respectively. {#tbl:res_comp}