# Exercise 7

## Generating random points on the plane {#sec:sampling}

Two sets of 2D points $(x, y)$ --- signal and noise --- are to be generated
according to two bivariate Gaussian distributions with parameters:
\begin{align*}
\text{signal}\:
\begin{cases}
\mu = (0, 0) \\
\sigma_x = \sigma_y = 0.3 \\
\rho = 0.5
\end{cases}
&&
\text{noise}\:
\begin{cases}
\mu = (4, 4) \\
\sigma_x = \sigma_y = 1 \\
\rho = 0.4
\end{cases}
\end{align*}
where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ for the standard
deviations in the $x$ and $y$ directions respectively, and $\rho$ is the bivariate
correlation, namely:
$$
\sigma_{xy} = \rho\, \sigma_x \sigma_y
$$
where $\sigma_{xy}$ is the covariance of $x$ and $y$.

In the programs, $N_s = 800$ points for the signal and $N_n = 1000$ points for
the noise were chosen by default, but both can be customized from the command line.
Both samples were stored as $n \times 2$ matrices, where $n$ is the number of
points in the sample. The `gsl_matrix` library provided by GSL was employed for
this purpose and the function `gsl_ran_bivariate_gaussian()` was used for
generating the points.
An example of the two samples is shown in @fig:points.
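
The sampling step can be sketched as follows (a minimal example: the helper
name `sample_gaussian` and the shifting of the zero-mean GSL output by the
desired mean are illustrative, not taken verbatim from the actual program):

```c
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_rng.h>

/* Fill an n x 2 matrix with points drawn from a bivariate Gaussian with
 * mean (mx, my), standard deviations sx and sy and correlation rho. */
gsl_matrix *sample_gaussian(const gsl_rng *r, size_t n,
                            double mx, double my,
                            double sx, double sy, double rho) {
  gsl_matrix *data = gsl_matrix_alloc(n, 2);
  for (size_t i = 0; i < n; i++) {
    double dx, dy;
    /* gsl_ran_bivariate_gaussian() samples around the origin,
     * so the desired mean is added afterwards. */
    gsl_ran_bivariate_gaussian(r, sx, sy, rho, &dx, &dy);
    gsl_matrix_set(data, i, 0, mx + dx);
    gsl_matrix_set(data, i, 1, my + dy);
  }
  return data;
}

/* usage sketch:
 *   gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
 *   gsl_matrix *signal = sample_gaussian(r, 800,  0, 0, 0.3, 0.3, 0.5);
 *   gsl_matrix *noise  = sample_gaussian(r, 1000, 4, 4, 1.0, 1.0, 0.4);
 */
```
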
![Example of points sampled according to the two Gaussian distributions
with the given parameters.](images/7-points.pdf){#fig:points}

Assuming no knowledge of how the points were generated, a classification
model, which assigns to each point the 'most probable' class it belongs to,
is implemented. Depending on the interpretation of 'most probable', many
different models can be developed.
Here, the Fisher linear discriminant and the perceptron were implemented; they
are described in the following two sections and their results are compared in
@sec:class-results.


## Fisher linear discriminant


### The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It does so by projecting the data onto the hyperplane
that best separates the classes of points, consequently decreasing the dimension
to $n-1$. In the 2D case the projection is onto a line, therefore the problem
is reduced to simply selecting a threshold.

Consider the case of two classes (in this case signal and noise): the simplest
representation of a linear discriminant is obtained by taking a linear function
$\tilde{x}$ of a sampled 2D point $x$ so that:
$$
\tilde{x} = w^T x
$$
where $w$ is the so-called 'weight vector' and $w^T$ stands for its transpose.

An input point $x$ is commonly assigned to the first class if $\tilde{x} >
w_{th}$ and to the second one otherwise, where $w_{th}$ is a threshold value
to be defined. In general, the projection onto one dimension leads to a
considerable loss of information and classes that are well separated in the
original 2D space may become strongly overlapping in one dimension. However, by
adjusting the components of the weight vector, a projection that maximizes the
class separation can be found [@bishop06].

To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
$C_2$, so that the means $\mu_1$ and $\mu_2$ of the two classes are given by:
$$
\mu_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
\et
\mu_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$

The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
\tilde{\mu}_2 - \tilde{\mu}_1 = w^T (\mu_2 - \mu_1)
$$

This expression can be made arbitrarily large simply by increasing the
magnitude of $w$; fortunately, the problem is easily solved by requiring $w$
to be normalised: $\|w\| = 1$. Using a Lagrange multiplier to perform the
constrained maximization, it can be found that $w \propto (\mu_2 - \mu_1)$,
meaning that the line onto which the points must be projected is the one
joining the class means.
There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means,
even though this line maximizes the distance between the projected means.

![The plot on the left shows samples from two classes along with the
histograms resulting from the projection onto the line joining the
class means: note the considerable overlap in the projected
space. The right plot shows the corresponding projection based on the
Fisher linear discriminant, showing the greatly improved class
separation. Figure taken from [@bishop06].](images/7-fisher.png){#fig:overlap}

The overlap of the projections can be reduced by maximising a function that
gives, besides a large separation, a small variance within each class. The
within-class variance of the transformed data of each class $k$ is given by:
$$
\tilde{s}_k^2 = \sum_{n \in c_k} (\tilde{x}_n - \tilde{\mu}_k)^2
$$

The total within-class variance for the whole data set is simply defined as
$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2$. The Fisher criterion is therefore
defined as the ratio of the squared distance between the projected class means
to the total within-class variance, and is given by:
$$
F(w) = \frac{(\tilde{\mu}_2 - \tilde{\mu}_1)^2}{\tilde{s}^2}
$$

The dependence on $w$ can be made explicit:
\begin{align*}
(\tilde{\mu}_2 - \tilde{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
&= [w^T (\mu_2 - \mu_1)]^2 \\
&= [w^T (\mu_2 - \mu_1)][w^T (\mu_2 - \mu_1)] \\
&= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w]
= w^T M w
\end{align*}

where $M$ is the between-class scatter matrix. Similarly, as regards the denominator:
\begin{align*}
\tilde{s}^2 &= \tilde{s}_1^2 + \tilde{s}_2^2 \\
&= \sum_{n \in c_1} (\tilde{x}_n - \tilde{\mu}_1)^2
+ \sum_{n \in c_2} (\tilde{x}_n - \tilde{\mu}_2)^2
= w^T \Sigma_w w
\end{align*}

where $\Sigma_w$ is the total within-class covariance matrix:
\begin{align*}
\Sigma_w &= \sum_{n \in c_1} (x_n - \mu_1)(x_n - \mu_1)^T
+ \sum_{n \in c_2} (x_n - \mu_2)(x_n - \mu_2)^T \\
&= \Sigma_1 + \Sigma_2
= \begin{pmatrix}
\sigma_x^2 & \sigma_{xy} \\
\sigma_{xy} & \sigma_y^2
\end{pmatrix}_1 +
\begin{pmatrix}
\sigma_x^2 & \sigma_{xy} \\
\sigma_{xy} & \sigma_y^2
\end{pmatrix}_2
\end{align*}

where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the two samples.
The Fisher criterion can therefore be rewritten in the form:
$$
F(w) = \frac{w^T M w}{w^T \Sigma_w w}
$$

Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
when:
$$
w \propto \Sigma_w^{-1} (\mu_2 - \mu_1)
$$ {#eq:fisher-weight}

This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then be
used to construct a discriminant by choosing a threshold for the
classification.

When implemented, the parameters given in @sec:sampling were used to compute
the covariance matrices and their sum $\Sigma_w$. Then $\Sigma_w$, being a
symmetric and positive-definite matrix, was inverted with the Cholesky method,
already discussed in @sec:MLM. Lastly, the matrix-vector product was computed
with the `gsl_blas_dgemv()` function provided by GSL.
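
A minimal sketch of this computation, assuming GSL >= 2.2 for
`gsl_linalg_cholesky_decomp1()` (older releases provide
`gsl_linalg_cholesky_decomp()`); the helper name and the final normalisation
of $w$ are illustrative:

```c
#include <gsl/gsl_blas.h>
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>

/* Compute w ~ Sigma_w^-1 (mu2 - mu1) for the parameters of @sec:sampling.
 * `w` must be an already-allocated 2-component vector. */
void fisher_direction(gsl_vector *w) {
  /* Sigma_w = Sigma_signal + Sigma_noise, from the known sigma and rho. */
  double cov[] = { 0.3*0.3 + 1.0*1.0,         0.5*0.3*0.3 + 0.4*1.0*1.0,
                   0.5*0.3*0.3 + 0.4*1.0*1.0, 0.3*0.3 + 1.0*1.0 };
  gsl_matrix_view sigma = gsl_matrix_view_array(cov, 2, 2);

  /* mu2 - mu1 = (4, 4) - (0, 0) */
  double dmu[] = { 4.0, 4.0 };
  gsl_vector_view diff = gsl_vector_view_array(dmu, 2);

  /* Cholesky decomposition and in-place inversion of Sigma_w. */
  gsl_linalg_cholesky_decomp1(&sigma.matrix);
  gsl_linalg_cholesky_invert(&sigma.matrix);

  /* w = Sigma_w^-1 (mu2 - mu1), then normalise to unit length. */
  gsl_blas_dgemv(CblasNoTrans, 1.0, &sigma.matrix, &diff.vector, 0.0, w);
  gsl_vector_scale(w, 1.0 / gsl_blas_dnrm2(w));
}
```
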

### The threshold

The threshold $t_{\text{cut}}$ was fixed by requiring the conditional
probability $P(c_k | t_{\text{cut}})$ to be the same for both classes $c_k$:
$$
t_{\text{cut}} = x \text{ such that}\quad
\frac{P(c_1 | x)}{P(c_2 | x)} =
\frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$

where $P(x | c_k)$ is the probability for point $x$ along the Fisher projection
line of having been sampled from class $c_k$. If $\tilde{x} > t_\text{cut}$, then
$x$ most likely belongs to $c_1$, otherwise to $c_2$.

If each class is a bivariate Gaussian distribution, as in the present case,
then $P(x | c_k)$ is simply given by its projected normal distribution, with
mean $\tilde{m} = w^T m$ and variance $\tilde{s}^2 = w^T S w$, where $S$ is the
covariance matrix of the class.
After some algebra, the threshold is found to be:
$$
t_{\text{cut}} = \frac{b}{a}
+ \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$
where:

- $a = \tilde{s}_1^2 - \tilde{s}_2^2$
- $b = \tilde{\mu}_2 \, \tilde{s}_1^2 - \tilde{\mu}_1 \, \tilde{s}_2^2$
- $c = \tilde{\mu}_2^2 \, \tilde{s}_1^2 - \tilde{\mu}_1^2 \, \tilde{s}_2^2 - 2 \, \tilde{s}_1^2 \, \tilde{s}_2^2 \, \ln(\alpha)$
- $\alpha = P(c_1) / P(c_2)$

In a simulation, the ratio of the prior probabilities $\alpha$ can
simply be set to:
$$
\alpha = \frac{N_s}{N_n}
$$

The projection of the points was accomplished with the function
`gsl_blas_ddot()`, which computes the dot product of two vectors.
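
In code, the projection and the threshold computation might look like the
following sketch (the helper names `project` and `t_cut` are illustrative; the
projected means `m1`, `m2` and variances `v1`, `v2` are assumed to have been
obtained beforehand as $\tilde{m} = w^T m$ and $\tilde{s}^2 = w^T S w$):

```c
#include <math.h>
#include <gsl/gsl_blas.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>

/* Project a point (row i of `data`) onto w: x~ = w^T x. */
double project(const gsl_vector *w, const gsl_matrix *data, size_t i) {
  gsl_vector_const_view x = gsl_matrix_const_row(data, i);
  double xp;
  gsl_blas_ddot(w, &x.vector, &xp);
  return xp;
}

/* Threshold from the projected means m1, m2, projected variances v1, v2
 * and prior ratio alpha = N_s / N_n, following the formula above. */
double t_cut(double m1, double m2, double v1, double v2, double alpha) {
  double a = v1 - v2;
  double b = m2 * v1 - m1 * v2;
  double c = m2*m2 * v1 - m1*m1 * v2 - 2.0 * v1 * v2 * log(alpha);
  return b/a + sqrt((b/a)*(b/a) - c/a);
}
```
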
Results obtained for the same samples as in @fig:points are shown in
@fig:fisher-proj. The weight vector and the threshold were found to be:
$$
w = (0.707, 0.707) \et
t_{\text{cut}} = 1.323
$$

::: { id=fig:fisher-proj }
![Scatter plot of the samples.](images/7-fisher-plane.pdf)
![Histogram of the Fisher-projected samples.](images/7-fisher-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::


### A mathematical curiosity

This section is really a sidenote which grew too large to fit in a margin,
so it can be safely skipped.

It can be seen that the weight vector turned out to be parallel to the line
joining the means of the two classes (as a reminder: $(0, 0)$ and $(4, 4)$),
as if the within-class covariances were ignored. Strange!

Looking at @eq:fisher-weight, one could be misled into thinking that the inverse
of the total covariance matrix, $\Sigma_w$, is (proportional to) the identity,
but that's not true. By a remarkable accident, the vector joining the means is
an eigenvector of the matrix $\Sigma_w^{-1}$. In
fact, since $\sigma_x = \sigma_y$ for both signal and noise:
$$
\Sigma_1 = \begin{pmatrix}
\sigma_x^2 & \sigma_{xy} \\
\sigma_{xy} & \sigma_x^2
\end{pmatrix}_1
\et
\Sigma_2 = \begin{pmatrix}
\sigma_x^2 & \sigma_{xy} \\
\sigma_{xy} & \sigma_x^2
\end{pmatrix}_2
$$

$\Sigma_w$ takes the symmetrical form
$$
\Sigma_w = \begin{pmatrix}
A & B \\
B & A
\end{pmatrix},
$$
which can be easily inverted by Gaussian elimination:
\begin{align*}
\begin{pmatrix}
A & B & \vline & 1 & 0 \\
B & A & \vline & 0 & 1 \\
\end{pmatrix} &\longrightarrow
\begin{pmatrix}
A + B & A + B & \vline & 1 & 1 \\
B - A & A - B & \vline & -1 & 1 \\
\end{pmatrix} \\ &\longrightarrow
\begin{pmatrix}
1 & 0 & \vline & A/(A^2 - B^2) & -B/(A^2 - B^2) \\
0 & 1 & \vline & -B/(A^2 - B^2) & A/(A^2 - B^2) \\
\end{pmatrix}
\end{align*}

where the first step takes the sum and the difference of the two rows and the
second normalises and recombines them. Hence, the inverse still has the same
form:
$$
\Sigma_w^{-1} = \begin{pmatrix}
\tilde{A} & \tilde{B} \\
\tilde{B} & \tilde{A}
\end{pmatrix}
$$

For this reason, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors
$v_1$ and $v_2$:
$$
v_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
\et
v_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$

The vector joining the means is clearly a multiple of $v_2$, so
$w \propto \Sigma_w^{-1} (\mu_2 - \mu_1)$ is a multiple of $v_2$ as well.
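
With the parameters given in @sec:sampling this can also be checked
numerically (values rounded to three digits):
$$
\Sigma_w = \begin{pmatrix} 1.09 & 0.445 \\ 0.445 & 1.09 \end{pmatrix}
\thus
\Sigma_w \, v_2 = 1.535 \, v_2
\thus
w \propto \Sigma_w^{-1} \begin{pmatrix} 4 \\ 4 \end{pmatrix}
= \frac{1}{1.535} \begin{pmatrix} 4 \\ 4 \end{pmatrix}
\propto \begin{pmatrix} 0.707 \\ 0.707 \end{pmatrix}
$$
which is exactly the weight vector reported above.
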

## Perceptron

In machine learning, the perceptron is an algorithm for supervised learning of
linear binary classifiers.

Supervised learning is the machine learning task of inferring a function $f$
that maps an input $x$ to an output $f(x)$ based on a set of training
input-output pairs, where each pair consists of an input object and an output
value. The inferred function can then be used to map new examples: the algorithm
should generalize so as to correctly determine the class labels of unseen
instances.

The aim of the perceptron algorithm is to determine the weight vector $w$ and
bias $b$ such that the so-called 'threshold function' $f(x)$ returns a binary
value: it is expected to return 1 for signal points and 0 for noise points:
$$
f(x) = \theta(w^T x + b)
$$ {#eq:perc}

where $\theta$ is the Heaviside theta function. Note that the bias $b$ plays
the role of $-t_\text{cut}$, as defined in the previous section.

The training was performed using the generated sample as the training set. From
an initial guess for $w$ and $b$ (which were set to all zeros in the code), the
perceptron iteratively improves the estimates. The training set is passed
point by point into an iterative procedure $N$ times: for every point, the
output of $f(x)$ is computed. Afterwards, the variable $\Delta$, which is
defined as:
$$
\Delta = r [e - f(x)]
$$

where:

- $r \in [0, 1]$ is the learning rate of the perceptron: the larger $r$, the
  more volatile the weight changes. In the code it was arbitrarily set to
  $r = 0.8$;
- $e$ is the expected output value, namely 1 if $x$ is signal and 0 if it is
  noise;

is used to update $b$ and $w$:

$$
b \to b + \Delta
\et
w \to w + \Delta x
$$

To see how it works, consider the four possible situations:

- $e = 1 \quad \wedge \quad f(x) = 1 \quad \dot \vee \quad e = 0 \quad \wedge
\quad f(x) = 0 \quad \Longrightarrow \quad \Delta = 0$
  the current estimates work properly: $b$ and $w$ do not need to be updated;
- $e = 1 \quad \wedge \quad f(x) = 0 \quad \Longrightarrow \quad
  \Delta = r$
  the current $b$ and $w$ underestimate the correct output: they must be
  increased;
- $e = 0 \quad \wedge \quad f(x) = 1 \quad \Longrightarrow \quad
  \Delta = -r$
  the current $b$ and $w$ overestimate the correct output: they must be
  decreased.

Whilst the update of $b$ is obvious, as regards $w$ the following consideration
may help clarify it. Consider the case with $e = 0 \quad \wedge \quad f(x) = 1
\quad \Longrightarrow \quad \Delta = -r$:
$$
w^T \cdot x \to (w^T + \Delta x^T) \cdot x
= w^T \cdot x + \Delta |x|^2
= w^T \cdot x - r |x|^2 \leq w^T \cdot x
$$

Similarly for the case with $e = 1$ and $f(x) = 0$.
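
A single pass of this update rule over one of the samples might be sketched as
follows (the function name `perceptron_pass` is illustrative; the full training
repeats such passes over both samples $N$ times and then normalises $w$ as
described above):

```c
#include <gsl/gsl_blas.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>

/* One pass of the perceptron rule over a sample whose expected output is
 * `expected` (1 for signal, 0 for noise); `r` is the learning rate. */
void perceptron_pass(gsl_vector *w, double *b,
                     const gsl_matrix *data, double expected, double r) {
  for (size_t i = 0; i < data->size1; i++) {
    gsl_vector_const_view x = gsl_matrix_const_row(data, i);
    double proj;
    gsl_blas_ddot(w, &x.vector, &proj);
    double f = (proj + *b > 0) ? 1.0 : 0.0;  /* f(x) = theta(w^T x + b) */
    double delta = r * (expected - f);       /* Delta = r [e - f(x)]    */
    *b += delta;                             /* b -> b + Delta          */
    gsl_blas_daxpy(delta, &x.vector, w);     /* w -> w + Delta x        */
  }
}
```
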
![Weight vector and threshold value obtained with the perceptron method as a
function of the number of iterations. Both level off at the third
iteration.](images/7-iterations.pdf){#fig:percep-iterations}

As far as convergence is concerned, the perceptron will never reach the state
in which all the input points are classified correctly if the training set is
not linearly separable, meaning that the signal cannot be separated from the
noise by a line in the plane. In this case, no approximate solution will be
gradually approached either. On the other hand, if the training set is linearly
separable, it can be shown (see [@novikoff63]) that this method converges to
the coveted function. As in the previous section, once found, the weight
vector is to be normalized.

Different values of the learning rate were tested, all giving the same result.
In @fig:percep-iterations, the evolution of the estimates is shown for
$r = 0.8$: as can be seen, the values of $w$ and $t_{\text{cut}}$ level off
(up to the third digit) at the third iteration, so $N = 5$ iterations were more
than sufficient. The following results were obtained:
$$
w = (-0.654, -0.756) \et t_{\text{cut}} = 1.213
$$

In this case, the projection line is not exactly parallel to the line joining
the means of the two samples. The resulting plots are shown in @fig:percep-proj.

::: { id=fig:percep-proj }
![Scatter plot of the samples.](images/7-percep-plane.pdf)
![Histogram of the projected samples.](images/7-percep-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::


## Efficiency test {#sec:class-results}

Using the same parameters as for the training set, a number $N_t$ of test samples
was generated and the points were classified by applying both methods. To avoid
storing large datasets in memory, at each iteration false positives and
negatives were recorded using a running-statistics method implemented in the
`gsl_rstat` library. For each sample, the numbers $N_{fn}$ and $N_{fp}$ of
false negatives and false positives were obtained as follows: for every noise
point $x_n$, the threshold function $f(x_n)$ was computed, then:

- if $f(x_n) = 0 \thus$ $N_{fp} \to N_{fp}$
- if $f(x_n) \neq 0 \thus$ $N_{fp} \to N_{fp} + 1$

and similarly for the signal points, this time counting $N_{fn}$.
Finally, the mean and standard deviation of $N_{fn}$ and $N_{fp}$ over the test
samples were computed and used to estimate the purity $\alpha$ and
efficiency $\beta$ of the classification:

$$
\alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s} \et
\beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$
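
The bookkeeping with `gsl_rstat` can be sketched as follows (the callback
`count_misclassified` is a hypothetical stand-in for the routine that generates
one test sample and counts its misclassified points):

```c
#include <stddef.h>
#include <gsl/gsl_rstat.h>

/* Accumulate per-sample misclassification counts with running statistics,
 * so that no per-sample data has to be kept in memory. */
void misclassification_stats(size_t n_samples,
                             size_t (*count_misclassified)(size_t),
                             double *mean, double *sd) {
  gsl_rstat_workspace *acc = gsl_rstat_alloc();
  for (size_t i = 0; i < n_samples; i++)
    gsl_rstat_add((double)count_misclassified(i), acc);
  *mean = gsl_rstat_mean(acc);
  *sd   = gsl_rstat_sd(acc);
  gsl_rstat_free(acc);
}
```
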
Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the Fisher
discriminant gives a nearly perfect classification with a symmetric distribution
of false negatives and false positives, whereas the perceptron shows slightly
more false positives than false negatives and is also more variable from dataset
to dataset.
A possible explanation of this fact is that, for linearly separable and normally
distributed points, the Fisher linear discriminant is an exact analytical
solution, the most powerful one according to the Neyman-Pearson lemma, whereas
the perceptron is only expected to converge to the solution and is therefore
more subject to random fluctuations.

------------------------------------------------------
             $α$        $σ_α$      $β$        $σ_β$
----------- ---------- ---------- ---------- ---------
Fisher       0.9999     0.33       0.9999     0.33

Perceptron   0.9999     0.28       0.9995     0.64
------------------------------------------------------

Table: Results for the Fisher and perceptron methods. $\sigma_{\alpha}$ and
$\sigma_{\beta}$ stand for the standard deviation of the false
negatives and false positives respectively. {#tbl:res_comp}