# Exercise 7

## Generating points according to Gaussian distributions {#sec:sampling}

The first task of exercise 7 is to generate two sets of 2D points $(x, y)$
according to two bivariate Gaussian distributions with parameters:

$$
\text{signal} \quad
\begin{cases}
\mu = (0, 0) \\
\sigma_x = \sigma_y = 0.3 \\
\rho = 0.5
\end{cases}
\et
\text{noise} \quad
\begin{cases}
\mu = (4, 4) \\
\sigma_x = \sigma_y = 1 \\
\rho = 0.4
\end{cases}
$$

where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ are the standard
deviations in $x$ and $y$ directions respectively and $\rho$ is the bivariate
correlation, hence:

$$
\sigma_{xy} = \rho \sigma_x \sigma_y
$$

where $\sigma_{xy}$ is the covariance of $x$ and $y$.

In the code, the default settings are $N_s = 800$ points for the signal and
$N_n = 1000$ points for the noise, but both can be changed from the
command line. Both samples were handled as matrices of dimension
$n \times 2$, where $n$ is the number of points in the sample. The
`gsl_matrix` type provided by GSL was employed for this purpose and the
function `gsl_ran_bivariate_gaussian()` was used for generating the points.
An example of the two samples is shown in @fig:points.

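The sampling step can be sketched as follows. This is a plain Python
illustration, not the GSL-based C code of the exercise: the helper name
`sample_bivariate_gaussian` and the fixed seed are assumptions made here.
Correlated pairs are built from two independent standard normal deviates, which
is one common way of drawing from a bivariate Gaussian.

```python
import math
import random


def sample_bivariate_gaussian(n, mean, sigma_x, sigma_y, rho, rng):
    """Draw n points (x, y) from a bivariate Gaussian.

    Hypothetical helper: two independent standard normals u, v are
    mixed so that the correlation between x and y is rho.
    """
    points = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        x = mean[0] + sigma_x * u
        y = mean[1] + sigma_y * (rho * u + math.sqrt(1.0 - rho**2) * v)
        points.append((x, y))
    return points


rng = random.Random(42)  # fixed seed, chosen only for reproducibility
signal = sample_bivariate_gaussian(800, (0.0, 0.0), 0.3, 0.3, 0.5, rng)
noise = sample_bivariate_gaussian(1000, (4.0, 4.0), 1.0, 1.0, 0.4, rng)

# Each sample is an n x 2 collection of points, as in the text.
print(len(signal), len(noise))
```
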
![Example of points sampled according to the two Gaussian distributions with
the given parameters. Noise points in pink and signal points
in yellow.](images/7-points.pdf){#fig:points}

Assuming that the way the points were generated is unknown, a classification
model must then be implemented in order to assign each point to the class
(signal or noise) to which it 'most probably' belongs. The question is how
'most probably' can be interpreted and implemented.

## Fisher linear discriminant

### The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It reduces this 2D classification problem
to a one-dimensional one, where the decision is taken with a single threshold.

Consider the case of two classes (in this case the signal and the noise): the
simplest representation of a linear discriminant is obtained by taking a linear
function of a sampled 2D point $x$ so that:

$$
\hat{x} = w^T x
$$

where $w$ is the so-called 'weight vector'. An input point $x$ is commonly
assigned to the first class if $\hat{x} \geqslant w_{th}$ and to the second one
otherwise, where $w_{th}$ is a threshold value to be defined.
In general, the projection onto one dimension leads to a considerable loss of
information and classes that are well separated in the original 2D space may
become strongly overlapping in one dimension. However, by adjusting the
components of the weight vector, a projection that maximizes the class
separation can be selected.

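As a minimal sketch of the decision rule just described (plain Python; the
weight vector and the threshold below are arbitrary illustrative values, not
those obtained in the exercise):

```python
def project(w, x):
    """Return the scalar projection w^T x of a 2D point x."""
    return w[0] * x[0] + w[1] * x[1]


def classify(w, w_th, x):
    """Assign x to the first class if its projection reaches the threshold."""
    return 1 if project(w, x) >= w_th else 2


w = (0.6, 0.8)   # illustrative unit-length weight vector
w_th = 1.0       # illustrative threshold

print(classify(w, w_th, (2.0, 1.0)))   # projection about 2.0, above the threshold
print(classify(w, w_th, (0.5, 0.5)))   # projection about 0.7, below the threshold
```
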
To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
$C_2$, so that the means $m_1$ and $m_2$ of the two classes are given by:

$$
m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
\et
m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$

The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:

$$
\hat{m}_2 - \hat{m}_1 = w^T (m_2 - m_1)
$$

This expression can be made arbitrarily large simply by increasing the magnitude
of $w$. To solve this problem, $w$ can be constrained to have unit length, so
that $| w |^2 = 1$. Using a Lagrange multiplier to perform the constrained
maximization, it can be found that $w \propto (m_2 - m_1)$.

![The plot on the left shows samples from two classes along with the histograms
resulting from projection onto the line joining the class means: note that
there is considerable overlap in the projected space. The right plot shows the
corresponding projection based on the Fisher linear discriminant, showing the
greatly improved class separation.](images/7-fisher.png){#fig:overlap}

There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means.
The idea to solve it is to maximize a function that gives a large separation
between the projected class means while also giving a small variance within
each class, thereby minimizing the class overlap.

The within-class variance of the transformed data of each class $k$ is given
by:

$$
s_k^2 = \sum_{n \in C_k} (\hat{x}_n - \hat{m}_k)^2
$$

The total within-class variance for the whole data set can be simply defined
as $s^2 = s_1^2 + s_2^2$. The Fisher criterion is therefore defined to be the
ratio of the squared between-class distance to the within-class variance and is
given by:

$$
J(w) = \frac{(\hat{m}_2 - \hat{m}_1)^2}{s^2}
$$

Differentiating $J(w)$ with respect to $w$, it can be found that it is
maximized when:

$$
w \propto S^{-1} (m_2 - m_1)
$$

where $S$ is the total within-class covariance matrix, given by:

$$
S = S_1 + S_2
$$

where $S_1$ and $S_2$ are the covariance matrices of the two classes, each of
the form:

$$
\begin{pmatrix}
\sigma_x^2 & \sigma_{xy} \\
\sigma_{xy} & \sigma_y^2
\end{pmatrix}
$$

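The differentiation step can be sketched as follows, under the assumption that
$S_k$ denotes the scatter matrix of class $k$ (its covariance matrix up to a
factor $N_k$), so that $s_k^2 = w^T S_k w$. Writing $S = S_1 + S_2$, the
criterion reads:

$$
J(w) = \frac{\left[ w^T (m_2 - m_1) \right]^2}{w^T S w}
$$

and setting its gradient with respect to $w$ to zero gives:

$$
\left( w^T S w \right) (m_2 - m_1) = \left[ w^T (m_2 - m_1) \right] S w
$$

Both bracketed factors are scalars, hence $S w \propto (m_2 - m_1)$ and the
direction of $w$ is that of $S^{-1} (m_2 - m_1)$; its length does not affect
$J(w)$.
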
This is not truly a discriminant but rather a specific choice of direction for
projection of the data down to one dimension: the projected data can then be
used to construct a discriminant by choosing a threshold for the
classification.

In the implementation, the parameters given in @sec:sampling were used to
compute the covariance matrices $S_1$ and $S_2$ of the two classes and their
sum $S$. Then $S$, being a symmetric and positive-definite matrix, was inverted
with the Cholesky method, already discussed in @sec:MLM.
Lastly, the matrix-vector product was computed with the `gsl_blas_dgemv()`
function provided by GSL.

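The computation of the projection direction can be sketched as follows. The
snippet is plain Python and replaces the Cholesky inversion and the
`gsl_blas_dgemv()` call with the explicit inverse of a 2x2 symmetric matrix,
which is an assumption made here for brevity; the function names are
hypothetical.

```python
import math


def covariance(sigma_x, sigma_y, rho):
    """2x2 covariance matrix from standard deviations and correlation."""
    s_xy = rho * sigma_x * sigma_y
    return [[sigma_x**2, s_xy], [s_xy, sigma_y**2]]


def fisher_direction(m1, m2, s1, s2):
    """Unit vector along S^{-1} (m2 - m1), with S = S1 + S2."""
    a = s1[0][0] + s2[0][0]
    b = s1[0][1] + s2[0][1]      # S is symmetric, so S[1][0] = b as well
    d = s1[1][1] + s2[1][1]
    det = a * d - b * b
    dx, dy = m2[0] - m1[0], m2[1] - m1[1]
    # Explicit 2x2 inverse applied to the difference of the means.
    wx = (d * dx - b * dy) / det
    wy = (a * dy - b * dx) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)


s_signal = covariance(0.3, 0.3, 0.5)
s_noise = covariance(1.0, 1.0, 0.4)
w = fisher_direction((0.0, 0.0), (4.0, 4.0), s_signal, s_noise)
print(w)
```
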
### The threshold

The cut was fixed by requiring the posterior probabilities of the two classes
to be equal:

$$
t_{\text{cut}} = x \, | \hspace{20pt}
\frac{P(c_1 | x)}{P(c_2 | x)} =
\frac{p(x | c_1) \, p(c_1)}{p(x | c_2) \, p(c_2)} = 1
$$

where $p(x | c_k)$ is the probability density of the point $x$ along the Fisher
projection line, given that it belongs to the class $k$. If the classes are
bivariate Gaussian, as in the present case, then $p(x | c_k)$ is simply given by
the projected normal distribution $\mathscr{G} (\hat{m}_k, \hat{S}_k)$, where
$\hat{m}_k$ and $\hat{S}_k$ are the projected mean and standard deviation of
the class $k$. With a bit of math, the solution is then:

$$
t_{\text{cut}} = \frac{b}{a} + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$

where:

- $a = \hat{S}_1^2 - \hat{S}_2^2$
- $b = \hat{m}_2 \, \hat{S}_1^2 - \hat{m}_1 \, \hat{S}_2^2$
- $c = \hat{m}_2^2 \, \hat{S}_1^2 - \hat{m}_1^2 \, \hat{S}_2^2 + 2 \, \hat{S}_1^2 \, \hat{S}_2^2 \, \ln(\alpha \, \hat{S}_2 / \hat{S}_1)$
- $\alpha = p(c_1) / p(c_2)$

The ratio of the prior probabilities $\alpha$ was computed as:

$$
\alpha = \frac{N_s}{N_n}
$$

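The equal-posterior condition can be checked numerically with a short sketch.
The snippet is plain Python with hypothetical function names; it assumes
normalized one-dimensional Gaussians for the projected classes, so the constant
term of the quadratic is written out directly from that condition and carries
both the prior ratio and the ratio of the two widths. The projected parameters
are those obtained by projecting the distributions of @sec:sampling onto the
direction $(1, 1) / \sqrt{2}$.

```python
import math


def gaussian(x, m, s):
    """Normalized 1D Gaussian density with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def threshold(m1, s1, m2, s2, alpha):
    """Solve alpha * G(t; m1, s1) = G(t; m2, s2) for t (requires s1 != s2)."""
    a = s1**2 - s2**2
    b = m2 * s1**2 - m1 * s2**2
    # The log term carries the prior ratio and the Gaussian normalizations.
    c = (m2**2 * s1**2 - m1**2 * s2**2
         + 2.0 * s1**2 * s2**2 * math.log(alpha * s2 / s1))
    return b / a + math.sqrt((b / a) ** 2 - c / a)


# Projected variance is sigma^2 * (1 + rho) when sigma_x = sigma_y = sigma.
m1, s1 = 0.0, math.sqrt(0.3**2 * (1.0 + 0.5))
m2, s2 = 4.0 * math.sqrt(2.0), math.sqrt(1.0**2 * (1.0 + 0.4))
alpha = 800 / 1000

t = threshold(m1, s1, m2, s2, alpha)
# The two weighted densities should coincide at the cut.
print(t, alpha * gaussian(t, m1, s1), gaussian(t, m2, s2))
```

The chosen root is the one lying between the two projected means, which is the
relevant one when the narrower class has the smaller mean.
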
The projection of the points was accomplished with the function
`gsl_blas_ddot()`, which computes the dot product of two vectors, in
this case the weight vector and the position of the point to be projected.

<div id="fig:fisher_proj">
![View from above of the samples.](images/7-fisher-plane.pdf){height=5.7cm}
![Gaussian of the samples on the projection
line.](images/7-fisher-proj.pdf){height=5.7cm}

Aerial and lateral views of the projection direction, in blue, and the cut, in
red.
</div>

The results obtained for the same sample of @fig:points are shown in
@fig:fisher_proj. The weight vector $w$ was found to be:

$$
w = (0.707, 0.707)
$$

and $t_{\text{cut}}$ lies at a distance of 1.323 from the origin of the axes.
Hence, as can be seen, the vector $w$ turned out to be parallel to the line
joining the means of the two classes (recall that they are $(0, 0)$ and
$(4, 4)$). This happens because $\sigma_x = \sigma_y$ in both classes, so the
total covariance matrix $S$ has equal diagonal entries and
$m_2 - m_1 \propto (1, 1)$ is one of its eigenvectors: multiplying it by
$S^{-1}$ only rescales it.

## Perceptron

In machine learning, the perceptron is an algorithm for supervised learning of
linear binary classifiers.
Supervised learning is the machine learning task of inferring a function $f$
that maps an input $x$ to an output $f(x)$ based on a set of training
input-output pairs. Each example is a pair consisting of an input object and an
output value. The inferred function can then be used for mapping new examples:
the goal is for the algorithm to generalize, so as to correctly determine the
class labels of unseen instances.

The aim is to determine the weight vector $w$ and the bias $b$ such that the
threshold function $f(x)$ satisfies:

$$
f(x) = x \cdot w + b \hspace{20pt}
\begin{cases}
\geqslant 0 \incase x \in \text{signal} \\
< 0 \incase x \in \text{noise}
\end{cases}
$$ {#eq:perc}

The training was performed as follows. Initial values were set as $w = (0,0)$
and $b = 0$. From these, the perceptron starts to improve its estimates. The
sample was passed point by point into an iterative procedure a total of
$N_c$ times: each time, the projection $w \cdot x$ of the point was computed
and then the variable $\Delta$ was defined as:

$$
\Delta = r \, [e - \theta (f(x))]
$$

where:

- $r$ is the learning rate of the perceptron: it is between 0 and 1. The
  larger $r$, the more volatile the weight changes. In the code, it was set
  to $r = 0.8$;
- $e$ is the expected value, namely 0 if $x$ is noise and 1 if it is signal;
- $\theta$ is the Heaviside step function;
- $\theta (f(x))$ is the observed value, with $f(x)$ defined in @eq:perc.

Then $b$ and $w$ are updated as:

$$
b \to b + \Delta
\et
w \to w + x \Delta
$$

<div id="fig:percep_proj">
![View from above of the samples.](images/7-percep-plane.pdf){height=5.7cm}
![Gaussian of the samples on the projection
line.](images/7-percep-proj.pdf){height=5.7cm}

Aerial and lateral views of the projection direction, in blue, and the cut, in
red.
</div>

It can be shown that this method converges to the desired function when the two
classes are linearly separable.
As stated in the previous section, the weight vector must finally be normalized.

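The update rule and the final normalization can be sketched as follows. The
snippet is plain Python rather than the C code of the exercise; the function
names and the small hand-made toy sample are assumptions chosen so that the
example is deterministic, and the bias is rescaled together with $w$ so that
the decision boundary is unchanged by the normalization.

```python
import math


def heaviside(z):
    """Step function: 1 for z >= 0, else 0."""
    return 1.0 if z >= 0.0 else 0.0


def train_perceptron(points, labels, rate=0.8, n_calls=5):
    """Perceptron training; labels are 1 (signal) or 0 (noise)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(n_calls):
        for (x, y), expected in zip(points, labels):
            f = w[0] * x + w[1] * y + b
            delta = rate * (expected - heaviside(f))
            b += delta
            w[0] += x * delta
            w[1] += y * delta
    # Normalize w to unit length (assumes at least one update occurred).
    norm = math.hypot(w[0], w[1])
    return (w[0] / norm, w[1] / norm), b / norm


# Toy sample: a tight cluster near the origin and a distant one.
signal = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (-0.3, -0.3)]
noise = [(4.0, 4.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0), (3.0, 3.0)]
points = signal + noise
labels = [1] * len(signal) + [0] * len(noise)

w, b = train_perceptron(points, labels)
predicted = [int(heaviside(w[0] * x + w[1] * y + b)) for x, y in points]
print(w, b, predicted == labels)
```
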
With $N_c = 5$, the values of $w$ and $t_{\text{cut}}$ are stable to the third
digit. The following results were obtained:

$$
w = (0.654, 0.756) \et t_{\text{cut}} = 1.213
$$

where, once again, $t_{\text{cut}}$ is measured from the origin of the axes. In
this case, the projection line does not lie along the line joining the means of
the two samples. The plots are shown in @fig:percep_proj.

## Efficiency test

A program was implemented to check the validity of the two
classification methods.
A number $N_t$ of test samples, with the same parameters as the training set,
is generated using an RNG and their points are divided into noise and signal by
both methods. At each iteration, false positives and negatives are recorded
using a running statistics method implemented in the `gsl_rstat` library, to
avoid storing large datasets in memory.
In each sample, the numbers $N_{fn}$ and $N_{fp}$ of false negatives and false
positives are obtained in this way: for every noise point $x_n$, compute the
activation function $f(x_n)$ with the weight vector $w$ and the
$t_{\text{cut}}$, then:

- if $f(x_n) < 0 \thus$ $N_{fn} \to N_{fn}$
- if $f(x_n) > 0 \thus$ $N_{fn} \to N_{fn} + 1$

and similarly for the positive points.

Finally, the mean and standard deviation are computed from $N_{fn}$ and
$N_{fp}$ of every sample and used to estimate purity $\alpha$
and efficiency $\beta$ of the classification:

$$
\alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s} \et
\beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$

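The structure of such a test can be sketched as follows. The snippet is plain
Python, with a Welford-style accumulator standing in for `gsl_rstat`; the class
and function names are hypothetical, the number of test samples is reduced to
keep the run short, and the convention that signal points lie below the cut
along $w$ is an assumption of this sketch. The values of $w$ and
$t_{\text{cut}}$ are the Fisher results quoted in the text.

```python
import math
import random


class RunningStats:
    """Welford running mean and variance, so counts need not be stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def add(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def sample(n, mean, sigma, rho, rng):
    """n points from a bivariate Gaussian with sigma_x = sigma_y = sigma."""
    pts = []
    for _ in range(n):
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pts.append((mean[0] + sigma * u,
                    mean[1] + sigma * (rho * u + math.sqrt(1 - rho**2) * v)))
    return pts


def count_misclassified(points, w, t_cut, is_signal):
    """Count points on the wrong side of the cut (signal assumed below it)."""
    wrong = 0
    for x, y in points:
        below = w[0] * x + w[1] * y < t_cut
        if below != is_signal:
            wrong += 1
    return wrong


rng = random.Random(1)
w, t_cut = (math.sqrt(0.5), math.sqrt(0.5)), 1.323  # Fisher values from the text
n_s, n_n, n_t = 800, 1000, 50                       # n_t reduced for a short run

lost_signal, leaked_noise = RunningStats(), RunningStats()
for _ in range(n_t):
    sig = sample(n_s, (0.0, 0.0), 0.3, 0.5, rng)
    noi = sample(n_n, (4.0, 4.0), 1.0, 0.4, rng)
    lost_signal.add(count_misclassified(sig, w, t_cut, True))
    leaked_noise.add(count_misclassified(noi, w, t_cut, False))

print(1 - lost_signal.mean / n_s, lost_signal.std())
print(1 - leaked_noise.mean / n_n, leaked_noise.std())
```
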
Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the
Fisher discriminant gives a nearly perfect classification
with a symmetric distribution of false negatives and false positives,
whereas the perceptron shows slightly more false positives than
false negatives and is also more variable from dataset to dataset.
A possible explanation of this fact is that, for linearly separable and
normally distributed points, the Fisher linear discriminant is an exact
analytical solution, whereas the perceptron is only expected to converge to the
solution and is thus more subject to random fluctuations.

-------------------------------------------------------------------------------------------
            $\alpha$            $\sigma_{\alpha}$   $\beta$             $\sigma_{\beta}$
----------- ------------------- ------------------- ------------------- -------------------
Fisher      0.9999              0.33                0.9999              0.33

Perceptron  0.9999              0.28                0.9995              0.64
-------------------------------------------------------------------------------------------

Table: Results for Fisher and perceptron method. $\sigma_{\alpha}$ and
$\sigma_{\beta}$ stand for the standard deviation of the false
negative and false positive respectively. {#tbl:res_comp}