448 lines
17 KiB
Markdown
448 lines
17 KiB
Markdown
# Exercise 7
|
||
|
||
## Generating points according to Gaussian distributions {#sec:sampling}
|
||
|
||
Two sets of 2D points $(x, y)$ - signal and noise - is to be generated according
|
||
to two bivariate Gaussian distributions with parameters:
|
||
$$
|
||
\text{signal} \quad
|
||
\begin{cases}
|
||
\mu = (0, 0) \\
|
||
\sigma_x = \sigma_y = 0.3 \\
|
||
\rho = 0.5
|
||
\end{cases}
|
||
\et
|
||
\text{noise} \quad
|
||
\begin{cases}
|
||
\mu = (4, 4) \\
|
||
\sigma_x = \sigma_y = 1 \\
|
||
\rho = 0.4
|
||
\end{cases}
|
||
$$
|
||
|
||
where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ for the standard
|
||
deviations in $x$ and $y$ directions respectively and $\rho$ is the bivariate
|
||
correlation, namely:
|
||
$$
|
||
\sigma_{xy} = \rho \sigma_x \sigma_y
|
||
$$
|
||
|
||
where $\sigma_{xy}$ is the covariance of $x$ and $y$.
|
||
In the code, default settings are $N_s = 800$ points for the signal and $N_n =
|
||
1000$ points for the noise but can be customized from the input command-line.
|
||
Both samples were handled as matrices of dimension $n$ x 2, where $n$ is the
|
||
number of points in the sample. The library `gsl_matrix` provided by GSL was
|
||
employed for this purpose and the function `gsl_ran_bivariate_gaussian()` was
|
||
used for generating the points.
|
||
An example of the two samples is shown in @fig:points.
|
||
|
||
![Example of points sampled according to the two Gaussian distributions
|
||
with the given parameters.](images/7-points.pdf){#fig:points}
|
||
|
||
Assuming not to know how the points were generated, a model of classification
|
||
is then to be implemented in order to assign each point to the right class
|
||
(signal or noise) to which it 'most probably' belongs to. The point is how
|
||
'most probably' can be interpreted and implemented.
|
||
Here, the Fisher linear discriminant and the Perceptron were implemented and
|
||
described in the following two sections. The results are compared in
|
||
@sec:7_results.
|
||
|
||
|
||
## Fisher linear discriminant
|
||
|
||
|
||
### The projection direction
|
||
|
||
The Fisher linear discriminant (FLD) is a linear classification model based on
|
||
dimensionality reduction. It allows to reduce this 2D classification problem
|
||
into a one-dimensional decision surface.
|
||
|
||
Consider the case of two classes (in this case signal and noise): the simplest
|
||
representation of a linear discriminant is obtained by taking a linear function
|
||
$\hat{x}$ of a sampled 2D point $x$ so that:
|
||
$$
|
||
\hat{x} = w^T x
|
||
$$
|
||
|
||
where $w$ is the so-called 'weight vector' and $w^T$ stands for its transpose.
|
||
An input point $x$ is commonly assigned to the first class if $\hat{x} \geqslant
|
||
w_{th}$ and to the second one otherwise, where $w_{th}$ is a threshold value
|
||
somehow defined. In general, the projection onto one dimension leads to a
|
||
considerable loss of information and classes that are well separated in the
|
||
original 2D space may become strongly overlapping in one dimension. However, by
|
||
adjusting the components of the weight vector, a projection that maximizes the
|
||
classes separation can be selected [@bishop06].
|
||
To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
|
||
$C_2$, so that the means $\mu_1$ and $\mu_2$ of the two classes are given by:
|
||
$$
|
||
\mu_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
|
||
\et
|
||
\mu_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
|
||
$$
|
||
|
||
The simplest measure of the separation of the classes is the separation of the
|
||
projected class means. This suggests to choose $w$ so as to maximize:
|
||
$$
|
||
\hat{\mu}_2 − \hat{\mu}_1 = w^T (\mu_2 − \mu_1)
|
||
$$
|
||
|
||
This expression can be made arbitrarily large simply by increasing the magnitude
|
||
of $w$. To solve this problem, $w$ can be constrained to have unit length, so
|
||
that $| w^2 | = 1$. Using a Lagrange multiplier to perform the constrained
|
||
maximization, it can be found that $w \propto (\mu_2 − \mu_1)$, meaning that the
|
||
line onto the points must be projected is the one joining the class means.
|
||
There is still a problem with this approach, however, as illustrated in
|
||
@fig:overlap: the two classes are well separated in the original 2D space but
|
||
have considerable overlap when projected onto the line joining their means
|
||
which maximize their projections distance.
|
||
|
||
![The plot on the left shows samples from two classes along with the
|
||
histograms resulting fromthe projection onto the line joining the
|
||
class means: note that there is considerable overlap in the projected
|
||
space. The right plot shows the corresponding projection based on the
|
||
Fisher linear discriminant, showing the greatly improved classes
|
||
separation. Fifure from [@bishop06]](images/7-fisher.png){#fig:overlap}
|
||
|
||
The idea to solve it is to maximize a function that will give a large separation
|
||
between the projected classes means while also giving a small variance within
|
||
each class, thereby minimizing the class overlap.
|
||
The within-class variance of the transformed data of each class $k$ is given
|
||
by:
|
||
$$
|
||
\hat{s}_k^2 = \sum_{n \in c_k} (\hat{x}_n - \hat{\mu}_k)^2
|
||
$$
|
||
|
||
The total within-class variance for the whole data set is simply defined as
|
||
$\hat{s}^2 = \hat{s}_1^2 + \hat{s}_2^2$. The Fisher criterion is defined to
|
||
be the ratio of the between-class distance to the within-class variance and is
|
||
given by:
|
||
$$
|
||
F(w) = \frac{(\hat{\mu}_2 - \hat{\mu}_1)^2}{\hat{s}^2}
|
||
$$
|
||
|
||
The dependence on $w$ can be made explicit:
|
||
\begin{align*}
|
||
(\hat{\mu}_2 - \hat{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
|
||
&= [w^T (\mu_2 - \mu_1)]^2 \\
|
||
&= [w^T (\mu_2 - \mu_1)][w^T (\mu_2 - \mu_1)] \\
|
||
&= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w]
|
||
= w^T M w
|
||
\end{align*}
|
||
|
||
where $M$ is the between-distance matrix. Similarly, as regards the denominator:
|
||
\begin{align*}
|
||
\hat{s}^2 &= \hat{s}_1^2 + \hat{s}_2^2 = \\
|
||
&= \sum_{n \in c_1} (\hat{x}_n - \hat{\mu}_1)^2
|
||
+ \sum_{n \in c_2} (\hat{x}_n - \hat{\mu}_2)^2
|
||
= w^T \Sigma_w w
|
||
\end{align*}
|
||
|
||
where $\Sigma_w$ is the total within-class covariance matrix:
|
||
\begin{align*}
|
||
\Sigma_w &= \sum_{n \in c_1} (x_n − \mu_1)(x_n − \mu_1)^T
|
||
+ \sum_{n \in c_2} (x_n − \mu_2)(x_n − \mu_2)^T \\
|
||
&= \Sigma_1 + \Sigma_2
|
||
= \begin{pmatrix}
|
||
\sigma_x^2 & \sigma_{xy} \\
|
||
\sigma_{xy} & \sigma_y^2
|
||
\end{pmatrix}_1 +
|
||
\begin{pmatrix}
|
||
\sigma_x^2 & \sigma_{xy} \\
|
||
\sigma_{xy} & \sigma_y^2
|
||
\end{pmatrix}_2
|
||
\end{align*}
|
||
|
||
Where $\Sigma_1$ and $\Sigma_2$ are the covariance matrix of the two samples.
|
||
The Fisher criterion can therefore be rewritten in the form:
|
||
$$
|
||
F(w) = \frac{w^T M w}{w^T \Sigma_w w}
|
||
$$
|
||
|
||
Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
|
||
when:
|
||
$$
|
||
w = \Sigma_w^{-1} (\mu_2 - \mu_1)
|
||
$$
|
||
|
||
This is not truly a discriminant but rather a specific choice of the direction
|
||
for projection of the data down to one dimension: the projected data can then be
|
||
used to construct a discriminant by choosing a threshold for the
|
||
classification.
|
||
|
||
When implemented, the parameters given in @sec:sampling were used to compute
|
||
the covariance matrices and their sum $\Sigma_w$. Then $\Sigma_w$, being a
|
||
symmetrical and positive-definite matrix, was inverted with the Cholesky method,
|
||
already discussed in @sec:MLM. Lastly, the matrix-vector product was computed
|
||
with the `gsl_blas_dgemv()` function provided by GSL.
|
||
|
||
|
||
### The threshold
|
||
|
||
The threshold $t_{\text{cut}}$ was fixed by the condition of conditional
|
||
probability $P(c_k | t_{\text{cut}})$ being the same for both classes $c_k$:
|
||
$$
|
||
t_{\text{cut}} = x \, | \hspace{20pt}
|
||
\frac{P(c_1 | x)}{P(c_2 | x)} =
|
||
\frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
|
||
$$
|
||
|
||
where $P(x | c_k)$ is the probability for point $x$ along the Fisher projection
|
||
line of being sampled according to the class $k$. If each class is a bivariate
|
||
Gaussian, as in the present case, then $P(x | c_k)$ is simply given by its
|
||
projected normal distribution with mean $\hat{m} = w^T m$ and variance $\hat{s}
|
||
= w^T S w$, being $S$ the covariance matrix of the class.
|
||
With a bit of math, the following solution can be found:
|
||
$$
|
||
t_{\text{cut}} = \frac{b}{a}
|
||
+ \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
|
||
$$
|
||
|
||
where:
|
||
|
||
- $a = \hat{s}_1^2 - \hat{s}_2^2$
|
||
- $b = \hat{\mu}_2 \, \hat{s}_1^2 - \hat{\mu}_1 \, \hat{s}_2^2$
|
||
- $c = \hat{\mu}_2^2 \, \hat{s}_1^2 - \hat{\mu}_1^2 \, \hat{s}_2^2
|
||
- 2 \, \hat{s}_1^2 \, \hat{s}_2^2 \, \ln(\alpha)$
|
||
- $\alpha = P(c_1) / P(c_2)$
|
||
|
||
The ratio of the prior probabilities $\alpha$ is simply given by:
|
||
$$
|
||
\alpha = \frac{N_s}{N_n}
|
||
$$
|
||
|
||
The projection of the points was accomplished by the use of the function
|
||
`gsl_blas_ddot()`, which computes the element wise product between two vectors.
|
||
|
||
Results obtained for the same samples in @fig:points are shown in
|
||
@fig:fisher_proj. The weight vector and the treshold were found to be:
|
||
$$
|
||
w = (0.707, 0.707) \et
|
||
t_{\text{cut}} = 1.323
|
||
$$
|
||
|
||
<div id="fig:fisher_proj">
|
||
![View of the samples in the plane.](images/7-fisher-plane.pdf)
|
||
![View of the samples projections onto the projection
|
||
line.](images/7-fisher-proj.pdf)
|
||
|
||
Aerial and lateral views of the samples. Projection line in blu and cut in red.
|
||
</div>
|
||
|
||
Since the vector $w$ turned out to be parallel with the line joining the means
|
||
of the two classes (reminded to be $(0, 0)$ and $(4, 4)$), one can be mislead
|
||
and assume that the inverse of the total covariance matrix $\Sigma_w$ is
|
||
isotropic, namely proportional to the unit matrix.
|
||
That's not true. In this special sample, the vector joining the means turns out
|
||
to be an eigenvector of the covariance matrix $\Sigma_w^{-1}$. In fact: since
|
||
$\sigma_x = \sigma_y$ for both signal and noise:
|
||
$$
|
||
\Sigma_1 = \begin{pmatrix}
|
||
\sigma_x^2 & \sigma_{xy} \\
|
||
\sigma_{xy} & \sigma_x^2
|
||
\end{pmatrix}_1
|
||
\et
|
||
\Sigma_2 = \begin{pmatrix}
|
||
\sigma_x^2 & \sigma_{xy} \\
|
||
\sigma_{xy} & \sigma_x^2
|
||
\end{pmatrix}_2
|
||
$$
|
||
|
||
$\Sigma_w$ takes the form:
|
||
$$
|
||
\Sigma_w = \begin{pmatrix}
|
||
A & B \\
|
||
B & A
|
||
\end{pmatrix}
|
||
$$
|
||
|
||
Which can be easily inverted by Gaussian elimination:
|
||
\begin{align*}
|
||
\begin{pmatrix}
|
||
A & B & \vline & 1 & 0 \\
|
||
B & A & \vline & 0 & 1 \\
|
||
\end{pmatrix} &\longrightarrow
|
||
\begin{pmatrix}
|
||
A - B & 0 & \vline & 1 - B & - B \\
|
||
0 & A - B & \vline & - B & 1 - B \\
|
||
\end{pmatrix} \\ &\longrightarrow
|
||
\begin{pmatrix}
|
||
1 & 0 & \vline & (1 - B)/(A - B) & - B/(A - B) \\
|
||
0 & 1 & \vline & - B/(A - B) & (1 - B)/(A - B) \\
|
||
\end{pmatrix}
|
||
\end{align*}
|
||
|
||
Hence:
|
||
$$
|
||
\Sigma_w^{-1} = \begin{pmatrix}
|
||
\tilde{A} & \tilde{B} \\
|
||
\tilde{B} & \tilde{A}
|
||
\end{pmatrix}
|
||
$$
|
||
|
||
Thus, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors $v_1$ and
|
||
$v_2$:
|
||
$$
|
||
v_1 = \begin{pmatrix}
|
||
1 \\
|
||
-1
|
||
\end{pmatrix} \et
|
||
v_2 = \begin{pmatrix}
|
||
1 \\
|
||
1
|
||
\end{pmatrix}
|
||
$$
|
||
|
||
and the vector joining the means is clearly a multiple of $v_2$, causing $w$ to
|
||
be a multiple of it.
|
||
|
||
|
||
## Perceptron
|
||
|
||
In machine learning, the perceptron is an algorithm for supervised learning of
|
||
linear binary classifiers.
|
||
|
||
Supervised learning is the machine learning task of inferring a function $f$
|
||
that maps an input $x$ to an output $f(x)$ based on a set of training
|
||
input-output pairs, where each pair consists of an input object and an output
|
||
value. The inferred function can be used for mapping new examples: the algorithm
|
||
is generalized to correctly determine the class labels for unseen instances.
|
||
|
||
The aim of the perceptron algorithm is to determine the weight vector $w$ and
|
||
bias $b$ such that the so-called 'threshold function' $f(x)$ returns a binary
|
||
value: it is expected to return 1 for signal points and 0 for noise points:
|
||
$$
|
||
f(x) = \theta(w^T \cdot x + b)
|
||
$$ {#eq:perc}
|
||
|
||
where $\theta$ is the Heaviside theta function.
|
||
The training was performed using the generated sample as training set. From an
|
||
initial guess for $w$ and $b$ (which were set to be all null in the code), the
|
||
perceptron starts to improve their estimations. The training set is passed point
|
||
by point into a iterative procedure a customizable number $N$ of times: for
|
||
every point, the output of $f(x)$ is computed. Afterwards, the variable
|
||
$\Delta$, which is defined as:
|
||
$$
|
||
\Delta = r [e - f(x)]
|
||
$$
|
||
|
||
where:
|
||
|
||
- $r \in [0, 1]$ is the learning rate of the perceptron: the larger $r$, the
|
||
more volatile the weight changes. In the code it was arbitrarily set $r =
|
||
0.8$;
|
||
- $e$ is the expected output value, namely 1 if $x$ is signal and 0 if it is
|
||
noise;
|
||
|
||
is used to update $b$ and $w$:
|
||
|
||
$$
|
||
b \to b + \Delta
|
||
\et
|
||
w \to w + \Delta x
|
||
$$
|
||
|
||
To see how it works, consider the four possible situations:
|
||
|
||
- $e = 1 \quad \wedge \quad f(x) = 1 \quad \dot \vee \quad e = 0 \quad \wedge
|
||
\quad f(x) = 0 \quad \Longrightarrow \quad \Delta = 0$
|
||
the current estimations work properly: $b$ and $w$ do not need to be updated;
|
||
- $e = 1 \quad \wedge \quad f(x) = 0 \quad \Longrightarrow \quad
|
||
\Delta = 1$
|
||
the current $b$ and $w$ underestimate the correct output: they must be
|
||
increased;
|
||
- $e = 0 \quad \wedge \quad f(x) = 1 \quad \Longrightarrow \quad
|
||
\Delta = -1$
|
||
the current $b$ and $w$ overestimate the correct output: they must be
|
||
decreased.
|
||
|
||
Whilst the $b$ updating is obvious, as regarsd $w$ the following consideration
|
||
may help clarify. Consider the case with $e = 0 \quad \wedge \quad f(x) = 1
|
||
\quad \Longrightarrow \quad \Delta = -1$:
|
||
$$
|
||
w^T \cdot x \to (w^T + \Delta x^T) \cdot x
|
||
= w^T \cdot x + \Delta |x|^2
|
||
= w^T \cdot x - |x|^2 \leq w^T \cdot x
|
||
$$
|
||
|
||
Similarly for the case with $e = 1$ and $f(x) = 0$.
|
||
|
||
As far as convergence is concerned, the perceptron will never get to the state
|
||
with all the input points classified correctly if the training set is not
|
||
linearly separable, meaning that the signal cannot be separated from the noise
|
||
by a line in the plane. In this case, no approximate solutions will be gradually
|
||
approached. On the other hand, if the training set is linearly separable, it can
|
||
be shown that this method converges to the coveted function [@novikoff63].
|
||
As in the previous section, once found, the weight vector is to be normalized.
|
||
|
||
With $N = 5$ iterations, the values of $w$ and $t_{\text{cut}}$ level off up to the third
|
||
digit. The following results were obtained:
|
||
|
||
Different values of the learning rate were tested, all giving the same result,
|
||
converging for a number $N = 3$ of iterations. In @fig:iterations, results are
|
||
shown for $r = 0.8$: as can be seen, for $N = 3$, the values of $w$ and
|
||
$t^{\text{cut}}$ level off.
|
||
The following results were obtained:
|
||
$$
|
||
w = (0.654, 0.756) \et t_{\text{cut}} = 1.213
|
||
$$
|
||
|
||
In this case, the projection line is not parallel with the line joining the
|
||
means of the two samples. Plots in @fig:percep_proj.
|
||
|
||
<div id="fig:percep_proj">
|
||
![View from above of the samples.](images/7-percep-plane.pdf)
|
||
![Gaussian of the samples on the projection
|
||
line.](images/7-percep-proj.pdf)
|
||
|
||
Aerial and lateral views of the projection direction, in blue, and the cut, in
|
||
red.
|
||
</div>
|
||
|
||
|
||
## Efficiency test {#sec:7_results}
|
||
|
||
Using the same parameters of the training set, a number $N_t$ of test
|
||
samples was generated and the points were divided into noise and signal
|
||
applying both methods. To avoid storing large datasets in memory, at each
|
||
iteration, false positives and negatives were recorded using a running
|
||
statistics method implemented in the `gsl_rstat` library. For each sample, the
|
||
numbers $N_{fn}$ and $N_{fp}$ of false positive and false negative were obtained
|
||
this way: for every noise point $x_n$, the threshold function $f(x_n)$ was
|
||
computed, then:
|
||
|
||
- if $f(x) = 0 \thus$ $N_{fn} \to N_{fn}$
|
||
- if $f(x) \neq 0 \thus$ $N_{fn} \to N_{fn} + 1$
|
||
|
||
and similarly for the positive points.
|
||
Finally, the mean and standard deviation were computed from $N_{fn}$ and
|
||
$N_{fp}$ for every sample and used to estimate the purity $\alpha$ and
|
||
efficiency $\beta$ of the classification:
|
||
|
||
$$
|
||
\alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s} \et
|
||
\beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
|
||
$$
|
||
|
||
Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the Fisher
|
||
discriminant gives a nearly perfect classification with a symmetric distribution
|
||
of false negative and false positive, whereas the perceptron shows a little more
|
||
false-positive than false-negative, being also more variable from dataset to
|
||
dataset.
|
||
A possible explanation of this fact is that, for linearly separable and normally
|
||
distributed points, the Fisher linear discriminant is an exact analytical
|
||
solution, whereas the perceptron is only expected to converge to the solution
|
||
and is therefore more subject to random fluctuations.
|
||
|
||
|
||
-------------------------------------------------------------------------------------------
|
||
$\alpha$ $\sigma_{\alpha}$ $\beta$ $\sigma_{\beta}$
|
||
----------- ------------------- ------------------- ------------------- -------------------
|
||
Fisher 0.9999 0.33 0.9999 0.33
|
||
|
||
Perceptron 0.9999 0.28 0.9995 0.64
|
||
-------------------------------------------------------------------------------------------
|
||
|
||
Table: Results for Fisher and perceptron method. $\sigma_{\alpha}$ and
|
||
$\sigma_{\beta}$ stand for the standard deviation of the false
|
||
negatives and false positives respectively. {#tbl:res_comp}
|