# Exercise 7

## Generating points according to Gaussian distributions {#sec:sampling}

The first task of exercise 7 is to generate two sets of 2D points $(x, y)$
according to two bivariate Gaussian distributions with parameters:

$$
  \text{signal} \quad
  \begin{cases}
    \mu = (0, 0) \\
    \sigma_x = \sigma_y = 0.3 \\
    \rho = 0.5
  \end{cases}
  \et
  \text{noise} \quad
  \begin{cases}
    \mu = (4, 4) \\
    \sigma_x = \sigma_y = 1 \\
    \rho = 0.4
  \end{cases}
$$

where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ are the standard
deviations in the $x$ and $y$ directions respectively and $\rho$ is the
bivariate correlation, hence:

$$
  \sigma_{xy} = \rho \sigma_x \sigma_y
$$

where $\sigma_{xy}$ is the covariance of $x$ and $y$.
In the code, the default settings are $N_s = 800$ points for the signal and
$N_n = 1000$ points for the noise, but both can be changed from the
command line. Both samples were handled as matrices of dimension $n \times 2$,
where $n$ is the number of points in the sample. The `gsl_matrix` type provided
by GSL was employed for this purpose and the function
`gsl_ran_bivariate_gaussian()` was used for generating the points.
An example of the two samples is shown in @fig:points.

![Example of points sampled according to the two Gaussian distributions with
  the given parameters. Noise points are in pink and signal points
  in yellow.](images/points.pdf){#fig:points}

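As an illustration of the sampling step, here is a minimal Python sketch of one
standard way to draw correlated Gaussian pairs from two independent standard
normal variates. It is not the C/GSL code used for the exercise: the function
names `correlated_pair` and `sample` are hypothetical, the seed is arbitrary
and only the Python standard library is assumed.

```python
import math
import random


def correlated_pair(z1, z2, mu, sigma_x, sigma_y, rho):
    """Map two independent standard normal variates (z1, z2) to a point
    (x, y) with mean mu, standard deviations sigma_x, sigma_y and
    correlation rho."""
    x = mu[0] + sigma_x * z1
    y = mu[1] + sigma_y * (rho * z1 + math.sqrt(1.0 - rho**2) * z2)
    return (x, y)


def sample(n, mu, sigma_x, sigma_y, rho, rng):
    """Return n points as a list of (x, y) tuples, i.e. an n x 2 matrix."""
    return [
        correlated_pair(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0),
                        mu, sigma_x, sigma_y, rho)
        for _ in range(n)
    ]


rng = random.Random(42)  # fixed seed, chosen arbitrarily for reproducibility
signal = sample(800, (0.0, 0.0), 0.3, 0.3, 0.5, rng)
noise = sample(1000, (4.0, 4.0), 1.0, 1.0, 0.4, rng)
```

With this transformation the variance of $y$ is $\sigma_y^2 (\rho^2 + 1 -
\rho^2) = \sigma_y^2$ and the covariance of $x$ and $y$ is $\rho \sigma_x
\sigma_y$, as required.
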
Assuming that the way the points were generated is unknown, a classification
model must then be implemented in order to assign each point to the class
(signal or noise) to which it 'most probably' belongs. The question is how
'most probably' can be interpreted and implemented.

## Fisher linear discriminant

### The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It reduces this 2D classification problem to a
one-dimensional one, in which the decision surface is a single threshold value.

Consider the case of two classes (in this case the signal and the noise): the
simplest representation of a linear discriminant is obtained by taking a linear
function of a sampled 2D point $x$ so that:

$$
  \hat{x} = w^T x
$$

where $w$ is the so-called 'weight vector'. An input point $x$ is commonly
assigned to the first class if $\hat{x} \geqslant w_{th}$ and to the second one
otherwise, where $w_{th}$ is a threshold value to be defined.
In general, the projection onto one dimension leads to a considerable loss of
information and classes that are well separated in the original 2D space may
become strongly overlapping in one dimension. However, by adjusting the
components of the weight vector, a projection that maximizes the class
separation can be selected.
To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
$C_2$, so that the means $m_1$ and $m_2$ of the two classes are given by:

$$
  m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
  \et
  m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$

The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:

$$
  \hat{m}_2 - \hat{m}_1 = w^T (m_2 - m_1)
$$

This expression can be made arbitrarily large simply by increasing the magnitude
of $w$. To solve this problem, $w$ can be constrained to have unit length, so
that $\| w \|^2 = 1$. Using a Lagrange multiplier to perform the constrained
maximization, it can be found that $w \propto (m_2 - m_1)$.

![The plot on the left shows samples from two classes along with the histograms
  resulting from projection onto the line joining the class means: note that
  there is considerable overlap in the projected space. The right plot shows the
  corresponding projection based on the Fisher linear discriminant, showing the
  greatly improved class separation.](images/fisher.png){#fig:overlap}

There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means.
The idea is to maximize a function that gives a large separation between the
projected class means while also giving a small variance within each class,
thereby minimizing the class overlap.
The within-class variance of the transformed data of each class $k$ is given
by:

$$
  s_k^2 = \sum_{n \in C_k} (\hat{x}_n - \hat{m}_k)^2
$$

The total within-class variance for the whole data set can be simply defined
as $s^2 = s_1^2 + s_2^2$. The Fisher criterion is therefore defined to be the
ratio of the squared distance between the projected class means to the
within-class variance and is given by:

$$
  J(w) = \frac{(\hat{m}_2 - \hat{m}_1)^2}{s^2}
$$

Differentiating $J(w)$ with respect to $w$, it can be found that it is
maximized when:

$$
  w \propto S^{-1} (m_2 - m_1)
$$

where $S$ is the total within-class covariance matrix, given by:

$$
  S = S_1 + S_2
$$

where $S_1$ and $S_2$ are the covariance matrices of the two classes, each of
the form:

$$
  \begin{pmatrix}
    \sigma_x^2 & \sigma_{xy} \\
    \sigma_{xy} & \sigma_y^2
  \end{pmatrix}
$$

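As a worked example, inserting the parameters of @sec:sampling into this form,
with $\sigma_{xy} = \rho \sigma_x \sigma_y$, gives the two matrices below. The
subscripts 'signal' and 'noise' are used here only for illustration:

$$
  S_{\text{signal}} =
  \begin{pmatrix}
    0.09 & 0.045 \\
    0.045 & 0.09
  \end{pmatrix}
  \et
  S_{\text{noise}} =
  \begin{pmatrix}
    1 & 0.4 \\
    0.4 & 1
  \end{pmatrix}
$$

so that their sum is:

$$
  S_{\text{signal}} + S_{\text{noise}} =
  \begin{pmatrix}
    1.09 & 0.445 \\
    0.445 & 1.09
  \end{pmatrix}
$$
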
This is not truly a discriminant but rather a specific choice of direction for
the projection of the data down to one dimension: the projected data can then
be used to construct a discriminant by choosing a threshold for the
classification.

In the implementation, the parameters given in @sec:sampling were used to
compute the covariance matrices $S_1$ and $S_2$ of the two classes and their
sum $S$. Then $S$, being a symmetric and positive-definite matrix, was inverted
with the Cholesky method, already discussed in @sec:MLM.
Lastly, the matrix-vector product was computed with the `gsl_blas_dgemv()`
function provided by GSL.

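The computation just described can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not the C/GSL implementation:
the $2 \times 2$ matrix is inverted explicitly instead of using a Cholesky
decomposition, the signal is taken as class 1 and the noise as class 2, and the
helper names `covariance` and `fisher_direction` are hypothetical.

```python
import math


def covariance(sigma_x, sigma_y, rho):
    """Covariance matrix of a bivariate Gaussian as nested tuples."""
    sxy = rho * sigma_x * sigma_y
    return ((sigma_x**2, sxy), (sxy, sigma_y**2))


def fisher_direction(m1, m2, s1, s2):
    """Unit vector parallel to S^-1 (m2 - m1), with S = s1 + s2."""
    # Sum of the two class covariance matrices.
    a = s1[0][0] + s2[0][0]
    b = s1[0][1] + s2[0][1]
    c = s1[1][0] + s2[1][0]
    d = s1[1][1] + s2[1][1]
    det = a * d - b * c
    dx, dy = m2[0] - m1[0], m2[1] - m1[1]
    # Explicit inverse of a 2x2 matrix applied to the mean difference.
    wx = (d * dx - b * dy) / det
    wy = (-c * dx + a * dy) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)


s_signal = covariance(0.3, 0.3, 0.5)
s_noise = covariance(1.0, 1.0, 0.4)
w = fisher_direction((0.0, 0.0), (4.0, 4.0), s_signal, s_noise)
print(w)
```

With the nominal parameters of @sec:sampling both components of `w` equal
$1 / \sqrt{2} \approx 0.707$, since $(4, 4)$ is an eigenvector of the summed
matrix.
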
### The threshold

The cut was fixed by requiring the posterior probability to be the same for
the two classes:

$$
  t_{\text{cut}} = x \, | \hspace{20pt}
  \frac{P(c_1 | x)}{P(c_2 | x)} =
  \frac{p(x | c_1) \, p(c_1)}{p(x | c_2) \, p(c_2)} = 1
$$

where $p(x | c_k)$ is the probability density of finding a point of class $k$
at position $x$ along the Fisher projection line and $p(c_k)$ is the prior
probability of class $k$. If the classes are bivariate Gaussian, as
in the present case, then $p(x | c_k)$ is simply given by the projected normal
distribution $\mathscr{G} (\hat{m}_k, \hat{S}_k)$, with mean $\hat{m}_k$ and
standard deviation $\hat{S}_k$. With a bit of math, the solution is then:

$$
  t_{\text{cut}} = \frac{b}{a} + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$

where:

- $a = \hat{S}_1^2 - \hat{S}_2^2$
- $b = \hat{m}_2 \, \hat{S}_1^2 - \hat{m}_1 \, \hat{S}_2^2$
- $c = \hat{m}_2^2 \, \hat{S}_1^2 - \hat{m}_1^2 \, \hat{S}_2^2 -
  2 \, \hat{S}_1^2 \, \hat{S}_2^2 \, \ln(\alpha)$
- $\alpha = p(c_1) / p(c_2)$

The ratio of the prior probabilities $\alpha$ was computed as:

$$
  \alpha = \frac{N_s}{N_n}
$$

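As a sketch of how the closed form for the cut can be evaluated, the Python
function below transcribes the expressions for $a$, $b$ and $c$ exactly as they
are written above. It is not the code used for the exercise: the function name
`threshold`, its argument names and the input values are hypothetical.
Algebraically, $b/a \pm \sqrt{(b/a)^2 - c/a}$ are the roots of
$a t^2 - 2 b t + c = 0$, which provides a simple consistency check.

```python
import math


def threshold(m1, m2, s1, s2, alpha):
    """Cut along the projection line.

    m1, m2: projected means of the two classes
    s1, s2: projected standard deviations of the two classes
    alpha:  ratio of the prior probabilities p(c1) / p(c2)
    """
    a = s1**2 - s2**2
    b = m2 * s1**2 - m1 * s2**2
    c = m2**2 * s1**2 - m1**2 * s2**2 - 2 * s1**2 * s2**2 * math.log(alpha)
    return b / a + math.sqrt((b / a) ** 2 - c / a)


# Arbitrary illustrative values, not taken from the exercise.
t = threshold(5.0, 0.0, 1.2, 0.4, 0.8)
print(t)
```

Note that the formula needs $a \neq 0$, that is, two classes with different
projected widths, and a non-negative argument of the square root.
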
The projection of the points was accomplished with the function
`gsl_blas_ddot()`, which computes the dot product of two vectors, in
this case the weight vector and the position of the point to be projected.

<div id="fig:fisher_proj">
![View from above of the samples.](images/fisher-plane.pdf){height=5.7cm}
![Gaussian of the samples on the projection
  line.](images/fisher-proj.pdf){height=5.7cm}

Aerial and lateral views of the projection direction, in blue, and the cut, in
red.
</div>

Results obtained for the same sample of @fig:points are shown in
@fig:fisher_proj. The weight vector $w$ was found to be:

$$
  w = (0.707, 0.707)
$$

and $t_{\text{cut}}$ lies at a distance of 1.323 from the origin of the axes.
Hence, as can be seen, the vector $w$ turned out to be parallel to the line
joining the means of the two classes (recall that they are $(0, 0)$ and
$(4, 4)$). This happens because $m_2 - m_1$ is an eigenvector of the total
covariance matrix $S$: since $\sigma_x = \sigma_y$ in both classes, $S$ has
equal diagonal elements and its eigenvectors lie along $(1, 1)$ and $(1, -1)$,
even though $S$ itself is not proportional to the unit matrix.

## Perceptron

In machine learning, the perceptron is an algorithm for supervised learning of
linear binary classifiers.
Supervised learning is the machine learning task of inferring a function $f$
that maps an input $x$ to an output $f(x)$ based on a set of training
input-output pairs. Each example is a pair consisting of an input object and an
output value. The inferred function can then be used for mapping new examples:
ideally, the algorithm generalizes so as to correctly determine the class
labels of unseen instances.

The aim is to determine the weight vector $w$ and the bias $b$ such that the
threshold function $f(x)$ satisfies:

$$
  f(x) = x \cdot w + b \hspace{20pt}
  \begin{cases}
    \geqslant 0 \incase x \in \text{signal} \\
    < 0 \incase x \in \text{noise}
  \end{cases}
$$ {#eq:perc}

The training was performed as follows. Initial values were set to $w = (0,0)$
and $b = 0$ and, starting from these, the perceptron progressively improves
their estimates. The sample was passed point by point through an iterative
procedure, which was repeated a total of $N_c$ times: each time, the projection
$w \cdot x$ of the point was computed and then the variable $\Delta$ was
defined as:

$$
  \Delta = r \, [e - \theta (f(x))]
$$

where:

- $r$ is the learning rate of the perceptron: it lies between 0 and 1. The
  larger $r$, the more volatile the weight changes. In the code, it was set to
  $r = 0.8$;
- $e$ is the expected value, namely 0 if $x$ is noise and 1 if it is signal;
- $\theta$ is the Heaviside theta function;
- $\theta (f(x))$ is the observed value, namely the class currently assigned
  to $x$ by the function $f(x)$ defined in @eq:perc.

Then $b$ and $w$ must be updated as:

$$
  b \to b + \Delta
  \et
  w \to w + x \Delta
$$

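The update rule above can be sketched as follows. This is a minimal Python
illustration, not the code used for the exercise: the names `heaviside`,
`classify` and `train_perceptron` are hypothetical, $\theta(0) = 1$ is assumed
(consistently with the $\geqslant$ in @eq:perc), $N_c$ is interpreted as the
number of passes over the whole sample, and the four training points are made
up.

```python
def heaviside(value):
    """Heaviside step function, with theta(0) = 1 (an assumption)."""
    return 1 if value >= 0 else 0


def classify(x, w, b):
    """Return 1 (signal) if f(x) = x.w + b >= 0, else 0 (noise)."""
    return heaviside(x[0] * w[0] + x[1] * w[1] + b)


def train_perceptron(data, rate=0.8, passes=5):
    """data is a list of (point, expected) pairs, expected being 1 or 0."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(passes):
        for x, expected in data:
            # Delta is zero when the point is already classified correctly.
            delta = rate * (expected - classify(x, w, b))
            b += delta
            w[0] += x[0] * delta
            w[1] += x[1] * delta
    return w, b


# Made-up, linearly separable toy sample: 1 = signal, 0 = noise.
data = [((0.0, 0.0), 1), ((0.5, 0.0), 1), ((4.0, 4.0), 0), ((3.0, 4.0), 0)]
w, b = train_perceptron(data)
print(w, b)
```

Since the decision line is $f(x) = 0$, it lies at a distance $|b| / \| w \|$
from the origin, measured along the direction of $w$.
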
<div id="fig:percep_proj">
![View from above of the samples.](images/percep-plane.pdf){height=5.7cm}
![Gaussian of the samples on the projection
  line.](images/percep-proj.pdf){height=5.7cm}

Aerial and lateral views of the projection direction, in blue, and the cut, in
red.
</div>

It can be shown that this method converges to the desired function, provided
that the two classes are linearly separable.
As stated in the previous section, the weight vector must finally be normalized.

With $N_c = 5$, the values of $w$ and $t_{\text{cut}}$ are stable up to the
third digit. The following results were obtained:

$$
  w = (0.654, 0.756) \et t_{\text{cut}} = 1.213
$$

where, once again, $t_{\text{cut}}$ is computed from the origin of the axes. In
this case, the projection line does not lie along the line joining the means of
the two samples. The plots are shown in @fig:percep_proj.

## Efficiency test

A program was implemented in order to check the validity of the two
aforementioned methods.
A number $N_t$ of test samples was generated and the
points were divided into the two classes according to the selected method.
At each iteration, false positives and negatives were recorded using a running
statistics method implemented in the `gsl_rstat` library, which is suitable for
handling large datasets that are inconvenient to store in memory all at
once.
For each sample, the numbers $N_{fn}$ and $N_{fp}$ of false negatives and false
positives were computed in the following way: for every noise point $x_n$, the
function $f(x_n)$ was computed with the weight vector $w$
and the $t_{\text{cut}}$ given by the employed method, then:

- if $f(x) < 0 \thus$ $N_{fn} \to N_{fn}$
- if $f(x) > 0 \thus$ $N_{fn} \to N_{fn} + 1$

The signal points were handled similarly.
Finally, the mean and the standard deviation of $N_{fn}$ and $N_{fp}$ over
all the samples were computed in order to get the mean purity $\alpha$
and efficiency $\beta$ of the employed method:

$$
  \alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s} \et
  \beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$

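To illustrate the idea of running statistics, here is a minimal Python sketch
based on Welford's online algorithm, which updates the mean and the variance
one value at a time without storing the data. Whether `gsl_rstat` uses this
exact recurrence is not claimed here, and the class name `RunningStat` and the
counts fed to it are made up for the example.

```python
import math


class RunningStat:
    """Accumulate mean and standard deviation one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def sd(self):
        """Sample standard deviation (n - 1 in the denominator)."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


# Hypothetical numbers of false negatives found in four test samples.
false_negatives = RunningStat()
for count in [0, 1, 0, 1]:
    false_negatives.add(count)

n_signal = 800
purity = 1 - false_negatives.mean / n_signal
print(false_negatives.mean, false_negatives.sd(), purity)
```

Only three numbers are kept in memory regardless of how many samples are
processed, which is the property that makes this kind of accumulator convenient
for large $N_t$.
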
Results for $N_t = 500$ are shown in @tbl:res_comp. As can be observed, the
Fisher method gives a nearly perfect assignment of the points to the class they
belong to, with a symmetric distribution of false negatives and false positives,
whereas the points classified by the perceptron show slightly more false
positives than false negatives and also vary more from dataset to dataset.
The reason for this lies in the fact that the Fisher linear
discriminant is an exact analytical result, whereas the perceptron is based on
an iterative procedure that only approaches its limit and never reaches it
exactly.

-------------------------------------------------------------------------------------------
            $\alpha$            $\sigma_{\alpha}$   $\beta$             $\sigma_{\beta}$
----------- ------------------- ------------------- ------------------- -------------------
Fisher      0.9999              0.33                0.9999              0.33

Perceptron  0.9999              0.28                0.9995              0.64
-------------------------------------------------------------------------------------------

Table: Results for Fisher and perceptron method. $\sigma_{\alpha}$ and
       $\sigma_{\beta}$ stand for the standard deviation of the false
       negative and false positive respectively. {#tbl:res_comp}