# Exercise 7
## Generating points according to Gaussian distributions
The first task of exercise 7 is to generate two sets of 2D points $(x, y)$ according to two bivariate Gaussian distributions with parameters:
$$
  \text{signal} \quad
  \begin{cases}
    \mu = (0, 0) \\
    \sigma_x = \sigma_y = 0.3 \\
    \rho = 0.5
  \end{cases}
  \et
  \text{noise} \quad
  \begin{cases}
    \mu = (4, 4) \\
    \sigma_x = \sigma_y = 1 \\
    \rho = 0.4
  \end{cases}
$$
where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ are the standard
deviations in the $x$ and $y$ directions respectively and $\rho$ is the
bivariate correlation, hence:

$$
  \sigma_{xy} = \rho \sigma_x \sigma_y
$$

where $\sigma_{xy}$ is the covariance of $x$ and $y$.
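
As a worked example, the relation above can be evaluated for the two sets of parameters. The symbol $\Sigma$ is introduced here only for illustration and is not part of the original notation:

$$
  \Sigma_{\text{signal}} =
  \begin{pmatrix}
    0.09  & 0.045 \\
    0.045 & 0.09
  \end{pmatrix}
  \qquad
  \Sigma_{\text{noise}} =
  \begin{pmatrix}
    1   & 0.4 \\
    0.4 & 1
  \end{pmatrix}
$$

since $\sigma_{xy} = 0.5 \cdot 0.3 \cdot 0.3 = 0.045$ for the signal and $\sigma_{xy} = 0.4 \cdot 1 \cdot 1 = 0.4$ for the noise.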
In the code, default settings are $N_s = 800$ points for the signal and
$N_n = 1000$ points for the noise, but these can be changed from the command
line. Both samples were handled as matrices of dimension $n \times 2$, where
$n$ is the number of points in the sample. The library `gsl_matrix` provided by
GSL was employed for this purpose and the function
`gsl_ran_bivariate_gaussian()` was used for generating the points.
An example of the two samples is shown in @fig:points.
Assuming that the way the points were generated is unknown, a classification model must then be implemented in order to assign each point to the class (signal or noise) to which it 'most probably' belongs. The question is how 'most probably' can be interpreted and implemented.
## Fisher linear discriminant
### The projection direction
The Fisher linear discriminant (FLD) is a linear classification model based on dimensionality reduction. It reduces this 2D classification problem to a one-dimensional one, in which the decision surface is a single threshold value.
Consider the case of two classes (in this case the signal and the noise): the
simplest representation of a linear discriminant is obtained by taking a linear
function of a sampled 2D point $x$ so that:

$$
  \hat{x} = w^T x
$$

where $w$ is the so-called 'weight vector'. An input point $x$ is commonly
assigned to the first class if $\hat{x} \geqslant w_{th}$ and to the second one
otherwise, where $w_{th}$ is a threshold value that remains to be defined.
In general, the projection onto one dimension leads to a considerable loss of
information and classes that are well separated in the original 2D space may
become strongly overlapping in one dimension. However, by adjusting the
components of the weight vector, a projection that maximizes the class
separation can be selected.
To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
$C_2$, so that the means $m_1$ and $m_2$ of the two classes are given by:

$$
  m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
  \et
  m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$
The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:

$$
  \hat{m}_2 - \hat{m}_1 = w^T (m_2 - m_1)
$$
This expression can be made arbitrarily large simply by increasing the magnitude
of $w$. To solve this problem, $w$ can be constrained to have unit length, so
that $\| w \|^2 = 1$. Using a Lagrange multiplier to perform the constrained
maximization, it can be found that $w \propto (m_2 - m_1)$.
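
The constrained maximization can be written out explicitly. A short sketch, where $\lambda$ denotes the Lagrange multiplier and $L$ the Lagrangian (both symbols are introduced here for illustration):

$$
  L(w, \lambda) = w^T (m_2 - m_1) + \lambda \left( 1 - w^T w \right)
$$

$$
  \frac{\partial L}{\partial w} = (m_2 - m_1) - 2 \lambda w = 0
  \quad \Longrightarrow \quad
  w = \frac{1}{2 \lambda} (m_2 - m_1)
$$

so $w$ is parallel to $m_2 - m_1$, and the unit-length condition fixes $2 \lambda = \| m_2 - m_1 \|$.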
There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means.
The solution is to maximize a function that gives a large separation
between the projected class means while also giving a small variance within
each class, thereby minimizing the class overlap.
The within-class variance of the transformed data of each class $k$ is given
by:

$$
  s_k^2 = \sum_{n \in C_k} (\hat{x}_n - \hat{m}_k)^2
$$

The total within-class variance for the whole data set can be simply defined
as $s^2 = s_1^2 + s_2^2$. The Fisher criterion is therefore defined to be the
ratio of the between-class distance to the within-class variance and is
given by:

$$
  J(w) = \frac{(\hat{m}_2 - \hat{m}_1)^2}{s^2}
$$
Differentiating $J(w)$ with respect to $w$, it can be found that it is
maximized when:

$$
  w \propto S^{-1} (m_2 - m_1)
$$

where $S$ is the total within-class covariance matrix, given by:

$$
  S = S_1 + S_2
$$

where $S_1$ and $S_2$ are the covariance matrices of the two classes, each of
the form:

$$
  \begin{pmatrix}
    \sigma_x^2 & \sigma_{xy} \\
    \sigma_{xy} & \sigma_y^2
  \end{pmatrix}
$$
This is not truly a discriminant but rather a specific choice of direction for projection of the data down to one dimension: the projected data can then be used to construct a discriminant by choosing a threshold for the classification.
When implemented, the parameters given in @sec:sampling were used to compute
the covariance matrices $S_1$ and $S_2$ of the two classes and their sum $S$.
Then $S$, being a symmetric and positive-definite matrix, was inverted with
the Cholesky method, already discussed in @sec:MLM.
Lastly, the matrix-vector product was computed with the `gsl_blas_dgemv()`
function provided by GSL.
### The threshold
The cut was fixed by requiring the posterior probability to be the same for both classes:

$$
  t_{\text{cut}} = x \, | \hspace{20pt}
  \frac{P(c_1 | x)}{P(c_2 | x)} =
  \frac{p(x | c_1) \, p(c_1)}{p(x | c_2) \, p(c_2)} = 1
$$
where $p(x | c_k)$ is the probability density of the projected point $x$ along
the Fisher projection line, given that it belongs to class $k$, and $p(c_k)$ is
the prior probability of class $k$. If the classes are bivariate Gaussian, as
in the present case, then $p(x | c_k)$ is simply given by the projected normal
distribution $\mathscr{G} (\hat{m}_k, \hat{S}_k)$, with mean $\hat{m}_k$ and
standard deviation $\hat{S}_k$. Taking the logarithm of the condition gives the
quadratic equation $a t^2 - 2 b t + c = 0$, whose solution is:

$$
  t = \frac{b}{a} + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$

where:

  - $a = \hat{S}_1^2 - \hat{S}_2^2$
  - $b = \hat{m}_2 \, \hat{S}_1^2 - \hat{m}_1 \, \hat{S}_2^2$
  - $c = \hat{m}_2^2 \, \hat{S}_1^2 - \hat{m}_1^2 \, \hat{S}_2^2
    + 2 \, \hat{S}_1^2 \, \hat{S}_2^2 \, \ln(\alpha \, \hat{S}_2 / \hat{S}_1)$
  - $\alpha = p(c_1) / p(c_2)$

The ratio of the prior probabilities $\alpha$ was computed as:

$$
  \alpha = \frac{N_s}{N_n}
$$
The projection of the points was accomplished with the function
`gsl_blas_ddot()`, which computes the dot product of two vectors, in this case
the weight vector and the position of the point to be projected.
Aerial and lateral views of the projection direction, in blue, and the cut, in red.
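Results obtained for the same sample shown in @fig:points are reported in
@fig:fisher_proj. The weight vector $w$ was found to be:

$$
  w = (0.707, 0.707)
$$

and $t_{\text{cut}}$ lies at a distance of 1.323 from the origin of the axes.
As can be seen, the vector $w$ turned out to be parallel to the line joining
the means of the two classes (recall that they are $(0, 0)$ and $(4, 4)$),
which means that $m_2 - m_1$ is an eigenvector of the total covariance matrix
$S$. This follows from $\sigma_x = \sigma_y$ holding in both classes: $S$ then
has equal diagonal entries, so $(1, 1)$ is one of its eigenvectors, even though
$S$ itself is not proportional to the unit matrix because $\rho \neq 0$.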
## Perceptron
In machine learning, the perceptron is an algorithm for supervised learning of
linear binary classifiers.

Supervised learning is the machine learning task of inferring a function $f$
that maps an input $x$ to an output $f(x)$ based on a set of training
input-output pairs. Each example is a pair consisting of an input object and an
output value. The inferred function can then be used for mapping new examples:
ideally, the algorithm generalizes well enough to correctly determine the class
labels of unseen instances.
The aim is to determine the weight vector $w$ and the bias $b$ such that the
threshold function $f(x)$ satisfies:

$$
  f(x) = x \cdot w + b \hspace{20pt}
  \begin{cases}
    \geqslant 0 \incase x \in \text{signal} \\
    < 0 \incase x \in \text{noise}
  \end{cases}
$$ {#eq:perc}
The training was performed as follows. Initial values were set as $w = (0,0)$ and
$b = 0$. From these, the perceptron starts to improve its estimates. The
sample was passed point by point through an iterative procedure, repeated a
total of $N_c$ times: each time, the projection $w \cdot x$ of the point was
computed and then the variable $\Delta$ was defined as:

$$
  \Delta = r \, [e - \theta (o)]
$$

where:

  - $r$ is the learning rate of the perceptron: it is between 0 and 1. The
    larger $r$, the more volatile the weight changes. In the code, it was set
    to $r = 0.8$;
  - $e$ is the expected value, namely 0 if $x$ is noise and 1 if it is signal;
  - $\theta$ is the Heaviside step function;
  - $o$ is the observed value of $f(x)$ defined in @eq:perc.

Then $b$ and $w$ must be updated as:

$$
  b \to b + \Delta \et w \to w + x \Delta
$$
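
The update rule can be sketched in a few lines of C. This is an illustration of the procedure described above, not the code of the exercise: the type `perceptron` and the functions `activation` and `train_pass` are hypothetical names, and the convention $\theta(0) = 1$ is an assumption chosen to match $f(x) \geqslant 0$ for the signal.

```c
#include <stddef.h>

/* Weight vector and bias of a two-input perceptron. */
typedef struct { double w[2]; double b; } perceptron;

/* Heaviside step with theta(0) = 1 (an assumed convention). */
static int heaviside(double v) { return v >= 0.0 ? 1 : 0; }

/* Activation f(x) = w . x + b */
double activation(const perceptron *p, const double x[2]) {
  return p->w[0] * x[0] + p->w[1] * x[1] + p->b;
}

/* One pass over n points stored row-major in pts (n x 2).
 * expected[i] is 1 for signal and 0 for noise; r is the learning rate.
 * Returns the number of points that triggered an update. */
size_t train_pass(perceptron *p, const double *pts, const int *expected,
                  size_t n, double r) {
  size_t updates = 0;
  for (size_t i = 0; i < n; i++) {
    const double *x = &pts[2 * i];
    double delta = r * (expected[i] - heaviside(activation(p, x)));
    if (delta != 0.0) {
      p->b    += delta;          /* b -> b + delta   */
      p->w[0] += delta * x[0];   /* w -> w + x delta */
      p->w[1] += delta * x[1];
      updates++;
    }
  }
  return updates;
}
```

Starting from `perceptron p = {{0.0, 0.0}, 0.0}`, the function would be called $N_c$ times on the training sample with `r = 0.8`. A return value of zero means that every training point is already classified correctly.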
<div id="fig:percep_proj">
![View from above of the samples.](images/7-percep-plane.pdf){height=5.7cm}
![Gaussian of the samples on the projection
line.](images/7-percep-proj.pdf){height=5.7cm}
Aerial and lateral views of the projection direction, in blue, and the cut, in
red.
</div>
It can be shown that, for linearly separable classes, this method converges to
the desired function.
As stated in the previous section, the weight vector must finally be normalized.

With $N_c = 5$, the values of $w$ and $t_{\text{cut}}$ are stable to the third
digit. The following results were obtained:

$$
  w = (0.654, 0.756) \et t_{\text{cut}} = 1.213
$$

where, once again, $t_{\text{cut}}$ is measured from the origin of the axes. In
this case, the projection line does not lie along the line joining the means of
the two samples. The plots are shown in @fig:percep_proj.
## Efficiency test
A program was implemented to check the validity of the two
classification methods.
A number $N_t$ of test samples, with the same parameters as the training set,
is generated using an RNG and their points are classified as noise or signal by
both methods. For each sample, false positives and negatives are recorded
using a running statistics method implemented in the `gsl_rstat` library, to
avoid storing large datasets in memory.
In each sample, the numbers $N_{fn}$ and $N_{fp}$ of false negatives and false
positives are obtained in this way: for every noise point $x_n$, compute the
activation function $f(x_n)$ with the weight vector $w$ and the
$t_{\text{cut}}$, then:

  - if $f(x_n) < 0 \thus$ $N_{fn} \to N_{fn}$
  - if $f(x_n) > 0 \thus$ $N_{fn} \to N_{fn} + 1$

and similarly for the positive points.

Finally, the mean and standard deviation are computed from the $N_{fn}$ and
$N_{fp}$ of every sample and used to estimate the purity $\alpha$
and efficiency $\beta$ of the classification:

$$
  \alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s} \et
  \beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$
Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the
Fisher discriminant gives a nearly perfect classification
with a symmetric distribution of false negatives and false positives,
whereas the perceptron shows slightly more false positives than
false negatives and is also more variable from dataset to dataset.
A possible explanation is that, for linearly separable and
normally distributed points, the Fisher linear discriminant is an exact
analytical solution, whereas the perceptron is only expected to converge to the
solution and is thus more subject to random fluctuations.

-------------------------------------------------------------------------------------------
            $\alpha$            $\sigma_{\alpha}$   $\beta$             $\sigma_{\beta}$
----------- ------------------- ------------------- ------------------- -------------------
Fisher      0.9999              0.33                0.9999              0.33

Perceptron  0.9999              0.28                0.9995              0.64
-------------------------------------------------------------------------------------------

Table: Results for the Fisher and perceptron methods. $\sigma_{\alpha}$ and
$\sigma_{\beta}$ stand for the standard deviations of the false
negatives and false positives respectively. {#tbl:res_comp}