# Exercise 7

## Generating points according to Gaussian distributions {#sec:sampling}
Two sets of 2D points $(x, y)$, signal and noise, are to be generated according
to two bivariate Gaussian distributions with parameters:
$$
  \text{signal} \quad
  \begin{cases}
    \mu = (0, 0) \\
    \sigma_x = \sigma_y = 0.3 \\
    \rho = 0.5
  \end{cases}
  \et
  \text{noise} \quad
  \begin{cases}
    \mu = (4, 4) \\
    \sigma_x = \sigma_y = 1 \\
    \rho = 0.4
  \end{cases}
$$
where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ for the standard
deviations in the $x$ and $y$ directions respectively, and $\rho$ is the
bivariate correlation, namely:
$$
  \sigma_{xy} = \rho \sigma_x \sigma_y
$$
where $\sigma_{xy}$ is the covariance of $x$ and $y$.
In the code, the default settings are $N_s = 800$ points for the signal and
$N_n = 1000$ points for the noise, but both can be customized from the command
line. Both samples were handled as matrices of dimension $n \times 2$, where
$n$ is the number of points in the sample. The `gsl_matrix` library provided by
GSL was employed for this purpose and the function
`gsl_ran_bivariate_gaussian()` was used for generating the points.
An example of the two samples is shown in @fig:points.
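The sampling step can be sketched in a few lines. The snippet below is a minimal Python illustration using only the standard library, not the GSL-based code of the exercise: the helper names `bivariate_gaussian` and `make_sample` and the seed are assumptions made for this example. Two independent standard normals are mixed to obtain the requested correlation.

```python
import math
import random


def bivariate_gaussian(rng, mu, sigma_x, sigma_y, rho):
    """Draw one (x, y) point from a bivariate Gaussian.

    Two independent standard normals u, v are mixed so that the result has
    standard deviations sigma_x, sigma_y and correlation rho.
    """
    u = rng.gauss(0.0, 1.0)
    v = rng.gauss(0.0, 1.0)
    x = mu[0] + sigma_x * u
    y = mu[1] + sigma_y * (rho * u + math.sqrt(1.0 - rho**2) * v)
    return (x, y)


def make_sample(rng, n, mu, sigma, rho):
    """Return a list of n points, one (x, y) tuple per row."""
    return [bivariate_gaussian(rng, mu, sigma, sigma, rho) for _ in range(n)]


rng = random.Random(42)  # fixed seed, chosen arbitrarily for reproducibility
signal = make_sample(rng, 800, (0.0, 0.0), 0.3, 0.5)
noise = make_sample(rng, 1000, (4.0, 4.0), 1.0, 0.4)
```

Each sample is a list of rows, mirroring the $n \times 2$ matrix layout described above.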
Assuming that the generating distributions are unknown, a classification model
must then be implemented in order to assign each point to the class (signal or
noise) to which it 'most probably' belongs. The question is how 'most probably'
should be interpreted and implemented.
Here, the Fisher linear discriminant and the perceptron were implemented; they
are described in the following two sections and their results are compared in
@sec:7_results.
## Fisher linear discriminant

### The projection direction
The Fisher linear discriminant (FLD) is a linear classification model based on dimensionality reduction: it reduces this 2D classification problem to a one-dimensional one.

Consider the case of two classes (in this case signal and noise): the simplest
representation of a linear discriminant is obtained by taking a linear function
$\hat{x}$ of a sampled 2D point $x$ so that:
$$
  \hat{x} = w^T x
$$
where $w$ is the so-called 'weight vector' and $w^T$ stands for its transpose.

An input point $x$ is commonly assigned to the first class if $\hat{x} \geqslant
w_{th}$ and to the second one otherwise, where $w_{th}$ is a suitably chosen
threshold value. In general, the projection onto one dimension leads to a
considerable loss of information and classes that are well separated in the
original 2D space may become strongly overlapping in one dimension. However, by
adjusting the components of the weight vector, a projection that maximizes the
separation of the classes can be selected [@bishop06].
To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
$C_2$, so that the means $\mu_1$ and $\mu_2$ of the two classes are given by:
$$
  \mu_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
  \et
  \mu_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$
The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
  \hat{\mu}_2 - \hat{\mu}_1 = w^T (\mu_2 - \mu_1)
$$
This expression can be made arbitrarily large simply by increasing the magnitude
of $w$. To solve this problem, $w$ can be constrained to have unit length, so
that $|w|^2 = 1$. Using a Lagrange multiplier to perform the constrained
maximization, it can be found that $w \propto (\mu_2 - \mu_1)$, meaning that the
line onto which the points must be projected is the one joining the class means.
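For reference, the constrained maximization can be written out explicitly. This is a short sketch in which the multiplier $\lambda$ is a symbol introduced here for illustration:
\begin{align*}
  L(w, \lambda) &= w^T (\mu_2 - \mu_1) + \lambda \, (1 - w^T w) \\
  \frac{\partial L}{\partial w} &= (\mu_2 - \mu_1) - 2 \lambda w = 0
  \quad \Longrightarrow \quad
  w = \frac{1}{2 \lambda} (\mu_2 - \mu_1)
\end{align*}
The value of $\lambda$ only fixes the length of $w$ through the unit-norm constraint, so the direction is that of $\mu_2 - \mu_1$.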
There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means,
which maximizes the distance between their projected means.
The idea to solve it is to maximize a function that gives a large separation
between the projected class means while also giving a small variance within
each class, thereby minimizing the class overlap.
The within-class variance of the transformed data of each class $k$ is given
by:
$$
  \hat{s}_k^2 = \sum_{n \in c_k} (\hat{x}_n - \hat{\mu}_k)^2
$$
The total within-class variance for the whole data set is simply defined as
$\hat{s}^2 = \hat{s}_1^2 + \hat{s}_2^2$. The Fisher criterion is defined to
be the ratio of the between-class distance to the within-class variance and is
given by:
$$
  F(w) = \frac{(\hat{\mu}_2 - \hat{\mu}_1)^2}{\hat{s}^2}
$$
The dependence on $w$ can be made explicit:
\begin{align*}
  (\hat{\mu}_2 - \hat{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
  &= [w^T (\mu_2 - \mu_1)]^2 \\
  &= [w^T (\mu_2 - \mu_1)][w^T (\mu_2 - \mu_1)] \\
  &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w]
  = w^T M w
\end{align*}
where $M = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$ is the between-distance matrix.
Similarly, as regards the denominator:
\begin{align*}
  \hat{s}^2 &= \hat{s}_1^2 + \hat{s}_2^2 \\
  &= \sum_{n \in c_1} (\hat{x}_n - \hat{\mu}_1)^2
  + \sum_{n \in c_2} (\hat{x}_n - \hat{\mu}_2)^2
  = w^T \Sigma_w w
\end{align*}
where $\Sigma_w$ is the total within-class covariance matrix:
\begin{align*}
  \Sigma_w &= \sum_{n \in c_1} (x_n - \mu_1)(x_n - \mu_1)^T
  + \sum_{n \in c_2} (x_n - \mu_2)(x_n - \mu_2)^T \\
  &= \Sigma_1 + \Sigma_2
  = \begin{pmatrix}
    \sigma_x^2 & \sigma_{xy} \\
    \sigma_{xy} & \sigma_y^2
  \end{pmatrix}_1 +
  \begin{pmatrix}
    \sigma_x^2 & \sigma_{xy} \\
    \sigma_{xy} & \sigma_y^2
  \end{pmatrix}_2
\end{align*}
where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the two samples.
The Fisher criterion can therefore be rewritten in the form:
$$
  F(w) = \frac{w^T M w}{w^T \Sigma_w w}
$$
Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
when:
$$
  w = \Sigma_w^{-1} (\mu_2 - \mu_1)
$$
This is not truly a discriminant but rather a specific choice of the direction for projection of the data down to one dimension: the projected data can then be used to construct a discriminant by choosing a threshold for the classification.
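The differentiation step can be sketched as follows, writing $M = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$ as implied by the derivation of the numerator. Setting the gradient of $F(w)$ to zero gives:
\begin{align*}
  (w^T \Sigma_w w) \, M w &= (w^T M w) \, \Sigma_w w \\
  M w &= (\mu_2 - \mu_1) \, [(\mu_2 - \mu_1)^T w] \propto (\mu_2 - \mu_1) \\
  \Longrightarrow \quad \Sigma_w w &\propto (\mu_2 - \mu_1)
\end{align*}
Since $F(w)$ does not depend on the length of $w$, the scalar factors $w^T \Sigma_w w$, $w^T M w$ and $(\mu_2 - \mu_1)^T w$ can be dropped, and multiplying by $\Sigma_w^{-1}$ yields the direction quoted above.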
When implemented, the parameters given in @sec:sampling were used to compute
the covariance matrices and their sum $\Sigma_w$. Then $\Sigma_w$, being a
symmetric and positive-definite matrix, was inverted with the Cholesky method,
already discussed in @sec:MLM. Lastly, the matrix-vector product was computed
with the `gsl_blas_dgemv()` function provided by GSL.
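As an illustration, the direction can be computed directly from the parameters of @sec:sampling. The Python sketch below is not the GSL implementation: it replaces the Cholesky inversion with the closed-form inverse of a $2 \times 2$ matrix, and the function names `covariance` and `fisher_direction` are chosen for this example only.

```python
import math


def covariance(sigma_x, sigma_y, rho):
    """2x2 covariance matrix of a bivariate Gaussian, as nested tuples."""
    s_xy = rho * sigma_x * sigma_y
    return ((sigma_x**2, s_xy), (s_xy, sigma_y**2))


def fisher_direction(mu1, cov1, mu2, cov2):
    """Unit vector along Sigma_w^{-1} (mu2 - mu1) for two 2D classes."""
    # Total within-class covariance: element-wise sum of the two matrices.
    a = cov1[0][0] + cov2[0][0]
    b = cov1[0][1] + cov2[0][1]
    c = cov1[1][0] + cov2[1][0]
    d = cov1[1][1] + cov2[1][1]
    det = a * d - b * c
    dx, dy = mu2[0] - mu1[0], mu2[1] - mu1[1]
    # Closed-form 2x2 inverse applied to the difference of the means
    # (used here in place of the Cholesky inversion).
    wx = (d * dx - b * dy) / det
    wy = (-c * dx + a * dy) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)


w = fisher_direction((0.0, 0.0), covariance(0.3, 0.3, 0.5),
                     (4.0, 4.0), covariance(1.0, 1.0, 0.4))
print(w)
```

With these inputs both components of `w` are equal, so the normalized vector lies along the diagonal of the plane.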
### The threshold
The threshold $t_{\text{cut}}$ was fixed by requiring the conditional
probability $P(c_k | t_{\text{cut}})$ to be the same for both classes $c_k$:
$$
  t_{\text{cut}} = x \, | \hspace{20pt}
  \frac{P(c_1 | x)}{P(c_2 | x)} =
  \frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$
where $P(x | c_k)$ is the probability for a point $x$ along the Fisher projection
line of being sampled according to the class $k$. If each class is a bivariate
Gaussian, as in the present case, then $P(x | c_k)$ is simply given by its
projected normal distribution with mean $\hat{\mu} = w^T \mu$ and variance
$\hat{s}^2 = w^T S w$, where $S$ is the covariance matrix of the class.
With a bit of math, the following solution can be found:
$$
  t_{\text{cut}} = \frac{b}{a}
    + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$
where:

- $a = \hat{s}_1^2 - \hat{s}_2^2$
- $b = \hat{\mu}_2 \, \hat{s}_1^2 - \hat{\mu}_1 \, \hat{s}_2^2$
- $c = \hat{\mu}_2^2 \, \hat{s}_1^2 - \hat{\mu}_1^2 \, \hat{s}_2^2
  + 2 \, \hat{s}_1^2 \, \hat{s}_2^2 \, \ln(\alpha)$
- $\alpha = P(c_1) / P(c_2)$

The ratio of the prior probabilities $\alpha$ is simply given by:
$$
  \alpha = \frac{N_s}{N_n}
$$
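A numerical evaluation of this closed form may help. The Python sketch below uses the parameters of @sec:sampling and the default sample sizes; the helper names `projected_moments` and `threshold` are hypothetical. One assumption should be noted: the logarithmic term of $c$ is taken with a plus sign, which is the choice that reproduces the value $t_{\text{cut}} = 1.323$ quoted below for $\alpha = 0.8$.

```python
import math


def projected_moments(w, mu, cov):
    """Mean and variance of a 2D Gaussian projected onto the unit vector w."""
    mean = w[0] * mu[0] + w[1] * mu[1]
    var = (w[0] * (cov[0][0] * w[0] + cov[0][1] * w[1])
           + w[1] * (cov[1][0] * w[0] + cov[1][1] * w[1]))
    return mean, var


def threshold(m1, v1, m2, v2, alpha):
    """Evaluate t_cut = b/a + sqrt((b/a)^2 - c/a) from the projected moments.

    Requires v1 != v2, otherwise a vanishes and the formula does not apply.
    """
    a = v1 - v2
    b = m2 * v1 - m1 * v2
    # Assumption: the log term enters with a plus sign.
    c = m2**2 * v1 - m1**2 * v2 + 2.0 * v1 * v2 * math.log(alpha)
    return b / a + math.sqrt((b / a) ** 2 - c / a)


w = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
m1, v1 = projected_moments(w, (0.0, 0.0), ((0.09, 0.045), (0.045, 0.09)))
m2, v2 = projected_moments(w, (4.0, 4.0), ((1.0, 0.4), (0.4, 1.0)))
t_cut = threshold(m1, v1, m2, v2, alpha=800 / 1000)
```

Here class 1 is the signal and class 2 the noise, so the projected variances are $\hat{s}_1^2 = 0.135$ and $\hat{s}_2^2 = 1.4$.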
The projection of the points was accomplished by the use of the function
`gsl_blas_ddot()`, which computes the scalar product of two vectors.
Results obtained for the same samples in @fig:points are shown in @fig:fisher_proj. The weight vector and the threshold were found to be:
$$
  w = (0.707, 0.707) \et
  t_{\text{cut}} = 1.323
$$
Aerial and lateral views of the samples. Projection line in blue and cut in red.
Since the vector $w$ turned out to be parallel to the line joining the means
of the two classes (recall that they are $(0, 0)$ and $(4, 4)$), one can be
misled into assuming that the inverse of the total covariance matrix $\Sigma_w$
is isotropic, namely proportional to the unit matrix.
That is not true. In this particular sample, the vector joining the means turns
out to be an eigenvector of the matrix $\Sigma_w^{-1}$. In fact, since
$\sigma_x = \sigma_y$ for both signal and noise:
$$
  \Sigma_1 = \begin{pmatrix}
    \sigma_x^2 & \sigma_{xy} \\
    \sigma_{xy} & \sigma_x^2
  \end{pmatrix}_1
  \et
  \Sigma_2 = \begin{pmatrix}
    \sigma_x^2 & \sigma_{xy} \\
    \sigma_{xy} & \sigma_x^2
  \end{pmatrix}_2
$$
$\Sigma_w$ takes the form:
$$
  \Sigma_w = \begin{pmatrix}
    A & B \\
    B & A
  \end{pmatrix}
$$
which can be easily inverted by Gaussian elimination:
\begin{align*}
  \begin{pmatrix}
    A & B & \vline & 1 & 0 \\
    B & A & \vline & 0 & 1 \\
  \end{pmatrix} &\longrightarrow
  \begin{pmatrix}
    A & B & \vline & 1 & 0 \\
    0 & A^2 - B^2 & \vline & -B & A \\
  \end{pmatrix} \\ &\longrightarrow
  \begin{pmatrix}
    1 & 0 & \vline & A/(A^2 - B^2) & -B/(A^2 - B^2) \\
    0 & 1 & \vline & -B/(A^2 - B^2) & A/(A^2 - B^2) \\
  \end{pmatrix}
\end{align*}
Hence, $\Sigma_w^{-1}$ has the same structure as $\Sigma_w$:
$$
  \Sigma_w^{-1} = \begin{pmatrix}
    \tilde{A} & \tilde{B} \\
    \tilde{B} & \tilde{A}
  \end{pmatrix}
$$
where $\tilde{A} = A/(A^2 - B^2)$ and $\tilde{B} = -B/(A^2 - B^2)$.
Thus, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors $v_1$ and
$v_2$:
$$
  v_1 = \begin{pmatrix}
    1 \\
    -1
  \end{pmatrix} \et
  v_2 = \begin{pmatrix}
    1 \\
    1
  \end{pmatrix}
$$
and the vector joining the means is clearly a multiple of $v_2$, causing $w$ to
be a multiple of it.
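The last statement can be checked explicitly with the means $(0, 0)$ and $(4, 4)$ of the two classes:
$$
  \Sigma_w^{-1} (\mu_2 - \mu_1) =
  \begin{pmatrix}
    \tilde{A} & \tilde{B} \\
    \tilde{B} & \tilde{A}
  \end{pmatrix}
  \begin{pmatrix}
    4 \\
    4
  \end{pmatrix}
  = 4 \, (\tilde{A} + \tilde{B})
  \begin{pmatrix}
    1 \\
    1
  \end{pmatrix}
  = 4 \, (\tilde{A} + \tilde{B}) \, v_2
$$
so that, after normalization, $w = (1, 1)/\sqrt{2} \approx (0.707, 0.707)$ whatever the values of $\tilde{A}$ and $\tilde{B}$, provided that $\tilde{A} + \tilde{B} > 0$.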
## Perceptron
In machine learning, the perceptron is an algorithm for supervised learning of linear binary classifiers.

Supervised learning is the machine learning task of inferring a function $f$
that maps an input $x$ to an output $f(x)$ based on a set of training
input-output pairs, where each pair consists of an input object and an output
value. The inferred function can be used for mapping new examples: the algorithm
is generalized to correctly determine the class labels for unseen instances.

The aim of the perceptron algorithm is to determine the weight vector $w$ and
bias $b$ such that the so-called 'threshold function' $f(x)$ returns a binary
value: it is expected to return 1 for signal points and 0 for noise points:
$$
  f(x) = \theta(w^T \cdot x + b)
$$ {#eq:perc}
where $\theta$ is the Heaviside theta function.
The training was performed using the generated sample as training set. From an
initial guess for $w$ and $b$ (which were all set to zero in the code), the
perceptron starts to improve their estimates. The training set is passed point
by point into an iterative procedure a customizable number $N$ of times: for
every point, the output of $f(x)$ is computed. Afterwards, the variable
$\Delta$, which is defined as:
$$
  \Delta = r [e - f(x)]
$$
where:

- $r \in [0, 1]$ is the learning rate of the perceptron: the larger $r$, the
  more volatile the weight changes. In the code it was arbitrarily set to $r =
  0.8$;
- $e$ is the expected output value, namely 1 if $x$ is signal and 0 if it is
  noise;

is used to update $b$ and $w$:
$$
  b \to b + \Delta \et w \to w + \Delta x
$$
To see how it works, consider the four possible situations:

- $e = 1 \quad \wedge \quad f(x) = 1 \quad \dot \vee \quad e = 0 \quad \wedge
  \quad f(x) = 0 \quad \Longrightarrow \quad \Delta = 0$
  the current estimates work properly: $b$ and $w$ do not need to be updated;
- $e = 1 \quad \wedge \quad f(x) = 0 \quad \Longrightarrow \quad
  \Delta = r$
  the current $b$ and $w$ underestimate the correct output: they must be
  increased;
- $e = 0 \quad \wedge \quad f(x) = 1 \quad \Longrightarrow \quad
  \Delta = -r$
  the current $b$ and $w$ overestimate the correct output: they must be
  decreased.

Whilst the update of $b$ is obvious, as regards $w$ the following consideration
may help clarify. Consider the case with $e = 0 \quad \wedge \quad f(x) = 1
\quad \Longrightarrow \quad \Delta = -r$:
$$
  w^T \cdot x \to (w^T + \Delta x^T) \cdot x
  = w^T \cdot x + \Delta |x|^2
  = w^T \cdot x - r |x|^2 \leq w^T \cdot x
$$
Similarly for the case with $e = 1$ and $f(x) = 0$.
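The update rule can be condensed into a short training loop. The Python sketch below is an illustration rather than the code of the exercise: the function names, the convention $\theta(0) = 1$ and the six toy points are assumptions made for this example.

```python
def heaviside(z):
    """Threshold function: 1 for z >= 0, 0 otherwise (value at 0 is assumed)."""
    return 1 if z >= 0 else 0


def train_perceptron(points, labels, rate=0.8, iterations=3):
    """Run the update rule over the whole training set `iterations` times."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(iterations):
        for (x, y), expected in zip(points, labels):
            output = heaviside(w[0] * x + w[1] * y + b)
            delta = rate * (expected - output)
            # delta is 0 when the point is already classified correctly.
            b += delta
            w[0] += delta * x
            w[1] += delta * y
    return w, b


# Tiny, linearly separable toy set (hypothetical data, not the report's samples).
points = [(4.0, 4.0), (5.0, 3.5), (3.5, 5.0),
          (0.0, 0.0), (0.3, -0.2), (-0.4, 0.1)]
labels = [0, 0, 0, 1, 1, 1]  # 1 = signal, 0 = noise
w, b = train_perceptron(points, labels, iterations=10)
predictions = [heaviside(w[0] * x + w[1] * y + b) for x, y in points]
```

On this toy set the weights stop changing after the first pass, since every point is then classified correctly and $\Delta = 0$ for all of them.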
As far as convergence is concerned, the perceptron will never reach the state
with all the input points classified correctly if the training set is not
linearly separable, meaning that the signal cannot be separated from the noise
by a line in the plane. In this case, no approximate solution is gradually
approached. On the other hand, if the training set is linearly separable, it can
be shown that this method converges to the desired function [@novikoff63].
As in the previous section, once found, the weight vector is to be normalized.
Different values of the learning rate were tested, all giving the same result
and converging within $N = 3$ iterations. In @fig:iterations, results are
shown for $r = 0.8$: as can be seen, for $N = 3$ the values of $w$ and
$t_{\text{cut}}$ level off up to the third digit.
The following results were obtained:
$$
  w = (0.654, 0.756) \et t_{\text{cut}} = 1.213
$$
In this case, the projection line is not parallel to the line joining the
means of the two samples. The plots are shown in @fig:percep_proj.
<div id="fig:percep_proj">
![View from above of the samples.](images/7-percep-plane.pdf){height=5.7cm}
![Gaussian of the samples on the projection
line.](images/7-percep-proj.pdf){height=5.7cm}
Aerial and lateral views of the projection direction, in blue, and the cut, in
red.
</div>
## Efficiency test {#sec:7_results}
Using the same parameters as the training set, a number $N_t$ of test
samples was generated and the points were divided into noise and signal
applying both methods. To avoid storing large datasets in memory, at each
iteration, false positives and negatives were recorded using a running
statistics method implemented in the `gsl_rstat` library. For each sample, the
numbers $N_{fn}$ and $N_{fp}$ of false negatives and false positives were
obtained this way: for every noise point $x_n$, the threshold function
$f(x_n)$ was computed, then:

- if $f(x_n) = 0 \thus$ $N_{fp} \to N_{fp}$
- if $f(x_n) \neq 0 \thus$ $N_{fp} \to N_{fp} + 1$

and similarly for the signal points, which increase $N_{fn}$ whenever they are
classified as noise.
Finally, the mean and standard deviation were computed from $N_{fn}$ and
$N_{fp}$ for every sample and used to estimate the purity $\alpha$ and
efficiency $\beta$ of the classification:
$$
  \alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s} \et
  \beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$
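The bookkeeping can be sketched as follows. This Python example is only illustrative: it keeps the per-sample counts in a list instead of using the running statistics of `gsl_rstat`, it assumes the usual convention that a noise point classified as signal is a false positive and a signal point classified as noise is a false negative, and the one-dimensional classifier and data are hypothetical.

```python
def count_errors(classify, signal, noise):
    """Return (false negatives, false positives) for one test sample.

    `classify` maps a point to 1 (signal) or 0 (noise).
    """
    n_fn = sum(1 for p in signal if classify(p) == 0)  # signal taken as noise
    n_fp = sum(1 for p in noise if classify(p) != 0)   # noise taken as signal
    return n_fn, n_fp


def purity_efficiency(counts, n_signal, n_noise):
    """Average the per-sample counts and turn them into alpha and beta."""
    mean_fn = sum(fn for fn, _ in counts) / len(counts)
    mean_fp = sum(fp for _, fp in counts) / len(counts)
    return 1.0 - mean_fn / n_signal, 1.0 - mean_fp / n_noise


def classify(p):
    """Hypothetical 1D classifier: points below the cut are called signal."""
    return 1 if p < 1.0 else 0


# Two toy test samples, each a (signal points, noise points) pair.
samples = [([0.1, 0.5, 1.2, 0.3], [3.0, 4.1, 0.9, 5.0]),
           ([0.2, 0.4, 0.6, 0.8], [2.0, 3.0, 4.0, 5.0])]
counts = [count_errors(classify, s, n) for s, n in samples]
alpha, beta = purity_efficiency(counts, n_signal=4, n_noise=4)
```

Storing the counts is affordable here because only two integers per sample are kept, not the points themselves.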
Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the Fisher
discriminant gives a nearly perfect classification with a symmetric distribution
of false negatives and false positives, whereas the perceptron shows slightly
more false positives than false negatives and is also more variable from dataset
to dataset.
A possible explanation of this fact is that, for linearly separable and normally
distributed points, the Fisher linear discriminant is an exact analytical
solution, whereas the perceptron is only expected to converge to the solution
and is therefore more subject to random fluctuations.
-------------------------------------------------------------------------------------------
$\alpha$ $\sigma_{\alpha}$ $\beta$ $\sigma_{\beta}$
----------- ------------------- ------------------- ------------------- -------------------
Fisher 0.9999 0.33 0.9999 0.33
Perceptron 0.9999 0.28 0.9995 0.64
-------------------------------------------------------------------------------------------
Table: Results for Fisher and perceptron method. $\sigma_{\alpha}$ and
$\sigma_{\beta}$ stand for the standard deviation of the false
negatives and false positives respectively. {#tbl:res_comp}