# Exercise 7

## Generating points according to Gaussian distributions {#sec:sampling}

Two sets of 2D points $(x, y)$, signal and noise, are to be generated according
to two bivariate Gaussian distributions with parameters:
$$
\text{signal} \quad
\begin{cases}
\mu = (0, 0) \\
\sigma_x = \sigma_y = 0.3 \\
\rho = 0.5
\end{cases}
\et
\text{noise} \quad
\begin{cases}
\mu = (4, 4) \\
\sigma_x = \sigma_y = 1 \\
\rho = 0.4
\end{cases}
$$

where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ for the standard
deviations in the $x$ and $y$ directions respectively and $\rho$ is the bivariate
correlation, namely:
$$
  \sigma_{xy} = \rho \sigma_x \sigma_y
$$

where $\sigma_{xy}$ is the covariance of $x$ and $y$.
In the code, the default settings are $N_s = 800$ points for the signal and $N_n =
1000$ points for the noise, but both can be customized from the command line.
Both samples were handled as matrices of dimension $n \times 2$, where $n$ is the
number of points in the sample. The `gsl_matrix` type provided by GSL was
employed for this purpose and the function `gsl_ran_bivariate_gaussian()` was
used for generating the points.
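
As an illustration of what the sampling step amounts to, the following is a minimal Python sketch and not the GSL-based code used for this exercise. It assumes the standard construction of a correlated pair from two independent standard normal deviates; the function name `sample_bivariate_gaussian` and the seed are hypothetical choices made only for this example.

```python
import math
import random


def sample_bivariate_gaussian(rng, mu, sigma_x, sigma_y, rho, n):
    """Draw n points from a bivariate Gaussian with correlation rho."""
    points = []
    for _ in range(n):
        # Two independent standard normal deviates.
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        # Mix them so that cov(x, y) = rho * sigma_x * sigma_y.
        x = mu[0] + sigma_x * u
        y = mu[1] + sigma_y * (rho * u + math.sqrt(1.0 - rho * rho) * v)
        points.append((x, y))
    return points


rng = random.Random(42)  # fixed seed, chosen only for reproducibility
signal = sample_bivariate_gaussian(rng, (0.0, 0.0), 0.3, 0.3, 0.5, 800)
noise = sample_bivariate_gaussian(rng, (4.0, 4.0), 1.0, 1.0, 0.4, 1000)
```

Each sample is a list of $n$ pairs, which plays the role of the $n \times 2$ matrix described above.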

An example of the two samples is shown in @fig:points.

![Example of points sampled according to the two Gaussian distributions
with the given parameters.](images/7-points.pdf){#fig:points}

Assuming that the way the points were generated is not known, a classification
model is to be implemented in order to assign each point to the class
(signal or noise) to which it 'most probably' belongs. The question is how
'most probably' can be interpreted and implemented.
Here, the Fisher linear discriminant and the perceptron were implemented; they are
described in the following two sections. The results are compared in
@sec:7_results.

## Fisher linear discriminant

### The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It reduces this 2D classification problem
to a one-dimensional decision problem.

Consider the case of two classes (in this case signal and noise): the simplest
representation of a linear discriminant is obtained by taking a linear function
$\hat{x}$ of a sampled 2D point $x$ so that:
$$
  \hat{x} = w^T x
$$

where $w$ is the so-called 'weight vector' and $w^T$ stands for its transpose.
An input point $x$ is commonly assigned to the first class if $\hat{x} \geqslant
w_{th}$ and to the second one otherwise, where $w_{th}$ is a threshold value
still to be defined. In general, the projection onto one dimension leads to a
considerable loss of information and classes that are well separated in the
original 2D space may become strongly overlapping in one dimension. However, by
adjusting the components of the weight vector, a projection that maximizes the
class separation can be selected [@bishop06].
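
The decision rule just described can be sketched in a few lines of Python. This is only an illustration: the weight vector, the threshold and the two test points below are hypothetical values, not those obtained in this exercise.

```python
def project(w, x):
    """Return the projection w^T x of a 2D point x onto the direction w."""
    return w[0] * x[0] + w[1] * x[1]


def classify(w, x, threshold):
    """Assign x to the first class if its projection reaches the threshold."""
    return 1 if project(w, x) >= threshold else 2


w = (0.6, 0.8)    # hypothetical unit weight vector
threshold = 1.0   # hypothetical threshold value
labels = [classify(w, p, threshold) for p in [(2.0, 1.0), (0.0, 0.5)]]
```

The first point projects well above the threshold and is assigned to the first class, the second one falls below it and is assigned to the second class.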

To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
$C_2$, so that the means $\mu_1$ and $\mu_2$ of the two classes are given by:
$$
  \mu_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
  \et
  \mu_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$

The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
  \hat{\mu}_2 - \hat{\mu}_1 = w^T (\mu_2 - \mu_1)
$$

This expression can be made arbitrarily large simply by increasing the magnitude
of $w$. To solve this problem, $w$ can be constrained to have unit length, so
that $|w|^2 = 1$. Using a Lagrange multiplier to perform the constrained
maximization, it can be found that $w \propto (\mu_2 - \mu_1)$, meaning that the
line onto which the points must be projected is the one joining the class means.
There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means,
which maximizes the distance between their projected means.

![The plot on the left shows samples from two classes along with the
histograms resulting from the projection onto the line joining the
class means: note that there is considerable overlap in the projected
space. The right plot shows the corresponding projection based on the
Fisher linear discriminant, showing the greatly improved class
separation. Figure from [@bishop06]](images/7-fisher.png){#fig:overlap}

The idea to solve this is to maximize a function that gives a large separation
between the projected class means while also giving a small variance within
each class, thereby minimizing the class overlap.
The within-class variance of the transformed data of each class $k$ is given
by:
$$
  \hat{s}_k^2 = \sum_{n \in c_k} (\hat{x}_n - \hat{\mu}_k)^2
$$

The total within-class variance for the whole data set is simply defined as
$\hat{s}^2 = \hat{s}_1^2 + \hat{s}_2^2$. The Fisher criterion is defined to
be the ratio of the squared distance between the projected class means to the
within-class variance and is given by:
$$
  F(w) = \frac{(\hat{\mu}_2 - \hat{\mu}_1)^2}{\hat{s}^2}
$$

The dependence on $w$ can be made explicit:
\begin{align*}
  (\hat{\mu}_2 - \hat{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
  &= [w^T (\mu_2 - \mu_1)]^2 \\
  &= [w^T (\mu_2 - \mu_1)][w^T (\mu_2 - \mu_1)] \\
  &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w]
  = w^T M w
\end{align*}

where $M = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$ is the between-class covariance
matrix. Similarly, as regards the denominator:
\begin{align*}
  \hat{s}^2 &= \hat{s}_1^2 + \hat{s}_2^2 = \\
  &= \sum_{n \in c_1} (\hat{x}_n - \hat{\mu}_1)^2
  + \sum_{n \in c_2} (\hat{x}_n - \hat{\mu}_2)^2
  = w^T \Sigma_w w
\end{align*}

where $\Sigma_w$ is the total within-class covariance matrix:
\begin{align*}
  \Sigma_w &= \sum_{n \in c_1} (x_n - \mu_1)(x_n - \mu_1)^T
  + \sum_{n \in c_2} (x_n - \mu_2)(x_n - \mu_2)^T \\
  &= \Sigma_1 + \Sigma_2
  = \begin{pmatrix}
    \sigma_x^2 & \sigma_{xy} \\
    \sigma_{xy} & \sigma_y^2
  \end{pmatrix}_1 +
  \begin{pmatrix}
    \sigma_x^2 & \sigma_{xy} \\
    \sigma_{xy} & \sigma_y^2
  \end{pmatrix}_2
\end{align*}

where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the two samples
(up to the normalization by the number of points).
The Fisher criterion can therefore be rewritten in the form:
$$
  F(w) = \frac{w^T M w}{w^T \Sigma_w w}
$$
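
To get a feeling for this expression, the criterion can be evaluated numerically on a grid of directions. The Python sketch below is an illustration and not the code used for the exercise: it assumes that $\Sigma_w$ is the sum of the nominal covariance matrices of @sec:sampling, and the helper names and the 15° angular grid are hypothetical.

```python
import math

# Nominal class means (from the sampling section).
mu_1, mu_2 = (0.0, 0.0), (4.0, 4.0)


def covariance(sigma_x, sigma_y, rho):
    """Covariance matrix of a bivariate Gaussian as nested tuples."""
    s_xy = rho * sigma_x * sigma_y
    return ((sigma_x ** 2, s_xy), (s_xy, sigma_y ** 2))


def quadratic_form(w, m):
    """Return w^T m w for a 2x2 matrix m."""
    return sum(w[i] * m[i][j] * w[j] for i in range(2) for j in range(2))


cov_1 = covariance(0.3, 0.3, 0.5)
cov_2 = covariance(1.0, 1.0, 0.4)
# Assumption: Sigma_w is the sum of the nominal covariance matrices.
sigma_w = tuple(
    tuple(cov_1[i][j] + cov_2[i][j] for j in range(2)) for i in range(2)
)
d = (mu_2[0] - mu_1[0], mu_2[1] - mu_1[1])
# Matrix M = (mu_2 - mu_1)(mu_2 - mu_1)^T.
between = tuple(tuple(d[i] * d[j] for j in range(2)) for i in range(2))


def fisher_criterion(angle_deg):
    """F(w) for the unit vector w at the given angle from the x axis."""
    a = math.radians(angle_deg)
    w = (math.cos(a), math.sin(a))
    return quadratic_form(w, between) / quadratic_form(w, sigma_w)


scan = {angle: fisher_criterion(angle) for angle in range(0, 180, 15)}
best_angle = max(scan, key=scan.get)
```

On this grid the largest value of $F$ is found at 45°, namely along the direction $(1, 1)$.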

Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
when:
$$
  w \propto \Sigma_w^{-1} (\mu_2 - \mu_1)
$$
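
The differentiation step can be spelled out as follows. This is a sketch of the standard argument (see [@bishop06]), using the symmetry of $M$ and $\Sigma_w$ and the explicit form $M = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$ that appears in the derivation above. Setting the gradient of $F(w)$ to zero gives:

\begin{align*}
  \nabla_w F(w) &= \frac{2 \, (w^T \Sigma_w w) \, M w - 2 \, (w^T M w) \, \Sigma_w w}
                       {(w^T \Sigma_w w)^2} = 0 \\
  &\Longrightarrow \quad (w^T \Sigma_w w) \, M w = (w^T M w) \, \Sigma_w w
\end{align*}

Since $M w = (\mu_2 - \mu_1) \, [(\mu_2 - \mu_1)^T w]$ is always parallel to $\mu_2 - \mu_1$, and the scalar factors only affect the magnitude of $w$, which leaves $F(w)$ unchanged, it follows that:

$$
  \Sigma_w w \propto (\mu_2 - \mu_1)
  \quad \Longrightarrow \quad
  w \propto \Sigma_w^{-1} (\mu_2 - \mu_1)
$$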

This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then be
used to construct a discriminant by choosing a threshold for the
classification.

When implemented, the parameters given in @sec:sampling were used to compute
the covariance matrices and their sum $\Sigma_w$. Then $\Sigma_w$, being a
symmetric and positive-definite matrix, was inverted with the Cholesky method,
already discussed in @sec:MLM. Lastly, the matrix-vector product was computed
with the `gsl_blas_dgemv()` function provided by GSL.
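
The same computation can be sketched in Python for illustration. Note that this is not the GSL implementation: the Cholesky inversion is replaced here by the closed-form inverse of a $2 \times 2$ matrix, and the function names are hypothetical.

```python
import math


def covariance(sigma_x, sigma_y, rho):
    """Covariance matrix of a bivariate Gaussian as nested tuples."""
    s_xy = rho * sigma_x * sigma_y
    return ((sigma_x ** 2, s_xy), (s_xy, sigma_y ** 2))


def invert_2x2(m):
    """Invert a 2x2 matrix with the closed-form adjugate formula."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return ((m[1][1] / det, -m[0][1] / det),
            (-m[1][0] / det, m[0][0] / det))


def fisher_direction(mu_1, cov_1, mu_2, cov_2):
    """Unit vector along Sigma_w^{-1} (mu_2 - mu_1)."""
    sigma_w = tuple(
        tuple(cov_1[i][j] + cov_2[i][j] for j in range(2)) for i in range(2)
    )
    inv = invert_2x2(sigma_w)
    d = (mu_2[0] - mu_1[0], mu_2[1] - mu_1[1])
    # Matrix-vector product Sigma_w^{-1} d.
    w = (inv[0][0] * d[0] + inv[0][1] * d[1],
         inv[1][0] * d[0] + inv[1][1] * d[1])
    norm = math.hypot(w[0], w[1])
    return (w[0] / norm, w[1] / norm)


w = fisher_direction((0.0, 0.0), covariance(0.3, 0.3, 0.5),
                     (4.0, 4.0), covariance(1.0, 1.0, 0.4))
```

With the nominal parameters of @sec:sampling, both components of the normalized `w` come out equal to $1/\sqrt{2} \approx 0.707$.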

### The threshold

The threshold $t_{\text{cut}}$ was fixed by requiring the conditional
probability $P(c_k | t_{\text{cut}})$ to be the same for both classes $c_k$:
$$
  t_{\text{cut}} = x \, | \hspace{20pt}
  \frac{P(c_1 | x)}{P(c_2 | x)} =
  \frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$

where $P(x | c_k)$ is the probability for a point $x$ along the Fisher projection
line of being sampled according to the class $k$. If each class is a bivariate
Gaussian, as in the present case, then $P(x | c_k)$ is simply given by its
projected normal distribution with mean $\hat{\mu} = w^T \mu$ and variance
$\hat{s}^2 = w^T S w$, $S$ being the covariance matrix of the class.
With a bit of math, and neglecting the ratio of the normalization factors of the
two Gaussians, the following solution can be found:
$$
  t_{\text{cut}} = \frac{b}{a}
    + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$

where:

  - $a = \hat{s}_1^2 - \hat{s}_2^2$
  - $b = \hat{\mu}_2 \, \hat{s}_1^2 - \hat{\mu}_1 \, \hat{s}_2^2$
  - $c = \hat{\mu}_2^2 \, \hat{s}_1^2 - \hat{\mu}_1^2 \, \hat{s}_2^2
    + 2 \, \hat{s}_1^2 \, \hat{s}_2^2 \, \ln(\alpha)$
  - $\alpha = P(c_1) / P(c_2)$

The ratio of the prior probabilities $\alpha$ is simply given by:
$$
  \alpha = \frac{N_s}{N_n}
$$
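
A possible way to evaluate the threshold numerically is sketched below in Python. It is an illustration under explicit assumptions, not the code used for the exercise: the coefficients are re-derived inside the function from the condition $P(x | c_1) \, P(c_1) = P(x | c_2) \, P(c_2)$ with the $1/\hat{s}_k$ prefactors of the two Gaussians dropped, class 1 is taken to be the signal, and the function name is hypothetical.

```python
import math


def threshold(mu_1, var_1, mu_2, var_2, alpha):
    """Solve alpha * exp(-(t - mu_1)^2 / (2 var_1)) = exp(-(t - mu_2)^2 / (2 var_2)).

    Taking the logarithm gives the quadratic a t^2 - 2 b t + c = 0.
    Assumptions: the Gaussian prefactors are neglected and var_1 != var_2.
    """
    a = var_1 - var_2
    b = mu_2 * var_1 - mu_1 * var_2
    c = (mu_2 ** 2 * var_1 - mu_1 ** 2 * var_2
         + 2.0 * var_1 * var_2 * math.log(alpha))
    return b / a + math.sqrt((b / a) ** 2 - c / a)


# Projected means and variances along w = (1, 1) / sqrt(2),
# computed from the nominal parameters of the two classes.
w = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
mu_hat_1 = 0.0
mu_hat_2 = w[0] * 4.0 + w[1] * 4.0
var_1 = 0.3 ** 2 * (1.0 + 0.5)  # w^T Sigma_1 w for this w
var_2 = 1.0 ** 2 * (1.0 + 0.4)  # w^T Sigma_2 w for this w
t_cut = threshold(mu_hat_1, var_1, mu_hat_2, var_2, 800 / 1000)
```

With these nominal values and $\alpha = 800/1000$, the root lies between the two projected means and evaluates to roughly 1.323.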

The projection of the points was accomplished by the use of the function
`gsl_blas_ddot()`, which computes the scalar product of two vectors.

Results obtained for the same samples in @fig:points are shown in
@fig:fisher_proj. The weight vector and the threshold were found to be:
$$
  w = (0.707, 0.707) \et
  t_{\text{cut}} = 1.323
$$

<div id="fig:fisher_proj">
![View of the samples in the plane.](images/7-fisher-plane.pdf)
![View of the samples projections onto the projection
line.](images/7-fisher-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
</div>

Since the vector $w$ turned out to be parallel to the line joining the means
of the two classes (recall that they are $(0, 0)$ and $(4, 4)$), one could be
misled into assuming that the inverse of the total covariance matrix $\Sigma_w$ is
isotropic, namely proportional to the unit matrix.
That is not the case. In this particular setup, the vector joining the means turns
out to be an eigenvector of the inverse covariance matrix $\Sigma_w^{-1}$. In fact,
since $\sigma_x = \sigma_y$ for both signal and noise:
$$
\Sigma_1 = \begin{pmatrix}
  \sigma_x^2 & \sigma_{xy} \\
  \sigma_{xy} & \sigma_x^2
\end{pmatrix}_1
\et
\Sigma_2 = \begin{pmatrix}
  \sigma_x^2 & \sigma_{xy} \\
  \sigma_{xy} & \sigma_x^2
\end{pmatrix}_2
$$

$\Sigma_w$ takes the form:
$$
\Sigma_w = \begin{pmatrix}
  A & B \\
  B & A
\end{pmatrix}
$$

which can be easily inverted by Gaussian elimination:
\begin{align*}
  \begin{pmatrix}
  A & B & \vline & 1 & 0 \\
  B & A & \vline & 0 & 1 \\
  \end{pmatrix} &\longrightarrow
  \begin{pmatrix}
  A^2 - B^2 & 0 & \vline & A & - B \\
  0 & A^2 - B^2 & \vline & - B & A \\
  \end{pmatrix} \\ &\longrightarrow
  \begin{pmatrix}
  1 & 0 & \vline & A/(A^2 - B^2) & - B/(A^2 - B^2) \\
  0 & 1 & \vline & - B/(A^2 - B^2) & A/(A^2 - B^2) \\
  \end{pmatrix}
\end{align*}

Hence:
$$
\Sigma_w^{-1} = \begin{pmatrix}
  \tilde{A} & \tilde{B} \\
  \tilde{B} & \tilde{A}
\end{pmatrix}
$$

with $\tilde{A} = A/(A^2 - B^2)$ and $\tilde{B} = -B/(A^2 - B^2)$.

Thus, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors $v_1$ and
$v_2$:
$$
v_1 = \begin{pmatrix}
  1 \\
  -1
\end{pmatrix} \et
v_2 = \begin{pmatrix}
  1 \\
  1
\end{pmatrix}
$$

and the vector joining the means is clearly a multiple of $v_2$, causing $w$ to
be a multiple of it.
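
As a numerical check, assuming that $\Sigma_w$ is built from the nominal parameters of @sec:sampling, the entries of the matrix are:

$$
  A = 0.3^2 + 1^2 = 1.09
  \et
  B = 0.5 \cdot 0.3^2 + 0.4 \cdot 1^2 = 0.445
$$

so that:

$$
  \Sigma_w v_2 = (A + B) \, v_2 = 1.535 \, v_2
  \quad \Longrightarrow \quad
  \Sigma_w^{-1} (\mu_2 - \mu_1) = \frac{4}{A + B} \, v_2 \approx 2.606 \, v_2
$$

which, once normalized, gives $w = v_2 / \sqrt{2} \approx (0.707, 0.707)$.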

## Perceptron

In machine learning, the perceptron is an algorithm for supervised learning of
linear binary classifiers.

Supervised learning is the machine learning task of inferring a function $f$
that maps an input $x$ to an output $f(x)$ based on a set of training
input-output pairs, where each pair consists of an input object and an output
value. The inferred function can be used for mapping new examples: the algorithm
is generalized to correctly determine the class labels for unseen instances.

The aim of the perceptron algorithm is to determine the weight vector $w$ and
bias $b$ such that the so-called 'threshold function' $f(x)$ returns a binary
value: it is expected to return 1 for signal points and 0 for noise points:
$$
  f(x) = \theta(w^T \cdot x + b)
$$ {#eq:perc}

where $\theta$ is the Heaviside theta function.
The training was performed using the generated sample as training set. From an
initial guess for $w$ and $b$ (which were all set to zero in the code), the
perceptron starts to improve their estimations. The training set is passed point
by point into an iterative procedure a customizable number $N$ of times: for
every point, the output of $f(x)$ is computed. Afterwards, the variable
$\Delta$, which is defined as:
$$
  \Delta = r [e - f(x)]
$$

where:

  - $r \in [0, 1]$ is the learning rate of the perceptron: the larger $r$, the
    more volatile the weight changes. In the code it was arbitrarily set to $r =
    0.8$;
  - $e$ is the expected output value, namely 1 if $x$ is signal and 0 if it is
    noise;

is used to update $b$ and $w$:
$$
  b \to b + \Delta
  \et
  w \to w + \Delta x
$$

To see how it works, consider the four possible situations:

  - $e = 1 \quad \wedge \quad f(x) = 1 \quad \dot \vee \quad e = 0 \quad \wedge
    \quad f(x) = 0 \quad \Longrightarrow \quad \Delta = 0$
    the current estimations work properly: $b$ and $w$ do not need to be updated;
  - $e = 1 \quad \wedge \quad f(x) = 0 \quad \Longrightarrow \quad
    \Delta = r$
    the current $b$ and $w$ underestimate the correct output: they must be
    increased;
  - $e = 0 \quad \wedge \quad f(x) = 1 \quad \Longrightarrow \quad
    \Delta = -r$
    the current $b$ and $w$ overestimate the correct output: they must be
    decreased.

Whilst the update of $b$ is obvious, as regards $w$ the following consideration
may help clarify. Consider the case with $e = 0 \quad \wedge \quad f(x) = 1
\quad \Longrightarrow \quad \Delta = -r$:
$$
  w^T \cdot x \to (w^T + \Delta x^T) \cdot x
  = w^T \cdot x + \Delta |x|^2
  = w^T \cdot x - r |x|^2 \leq w^T \cdot x
$$

Similarly for the case with $e = 1$ and $f(x) = 0$.
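
The update rule can be sketched in Python as follows. This is an illustration and not the code used for the exercise: the convention $\theta(0) = 0$, the function names and the six-point toy training set are assumptions made only for this example.

```python
def heaviside(z):
    """Heaviside step function, with the convention theta(0) = 0 (an assumption)."""
    return 1 if z > 0 else 0


def predict(w, b, x):
    """Threshold function f(x) = theta(w^T x + b)."""
    return heaviside(w[0] * x[0] + w[1] * x[1] + b)


def train_perceptron(points, expected, rate=0.8, iterations=5):
    """Pass the training set through the update rule a fixed number of times."""
    w = [0.0, 0.0]  # initial guess: all null
    b = 0.0
    for _ in range(iterations):
        for x, e in zip(points, expected):
            delta = rate * (e - predict(w, b, x))
            # Nothing changes when the point is already classified correctly.
            b += delta
            w[0] += delta * x[0]
            w[1] += delta * x[1]
    return w, b


# Hypothetical toy training set: 1 = signal (near the origin), 0 = noise.
points = [(0.1, -0.2), (-0.3, 0.2), (0.2, 0.3),
          (3.5, 4.2), (4.4, 3.8), (4.0, 4.5)]
expected = [1, 1, 1, 0, 0, 0]
w, b = train_perceptron(points, expected, rate=0.8, iterations=10)
```

Since this toy set is linearly separable, the updates stop after a few passes and every training point ends up correctly classified.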

![Weight vector and threshold value obtained with the perceptron method as a
function of the number of iterations. Both level off at the third
iteration.](images/7-iterations.pdf){#fig:iterations}

As far as convergence is concerned, the perceptron will never reach a state
with all the input points classified correctly if the training set is not
linearly separable, meaning that the signal cannot be separated from the noise
by a line in the plane. In this case, no approximate solution will be gradually
approached. On the other hand, if the training set is linearly separable, it can
be shown that this method converges to the desired function [@novikoff63].
As in the previous section, once found, the weight vector is to be normalized.

With $N = 5$ iterations, the values of $w$ and $t_{\text{cut}}$ level off up to
the third digit. Different values of the learning rate were tested, all giving the
same result and converging after $N = 3$ iterations. In @fig:iterations, results
are shown for $r = 0.8$: as can be seen, from $N = 3$ onwards the values of $w$ and
$t_{\text{cut}}$ level off.
The following results were obtained:
$$
  w = (0.654, 0.756) \et t_{\text{cut}} = 1.213
$$

In this case, the projection line is not parallel to the line joining the
means of the two samples. The corresponding plots are shown in @fig:percep_proj.

<div id="fig:percep_proj">
![View from above of the samples.](images/7-percep-plane.pdf)
![Gaussian of the samples on the projection
line.](images/7-percep-proj.pdf)

Aerial and lateral views of the projection direction, in blue, and the cut, in
red.
</div>

## Efficiency test {#sec:7_results}

Using the same parameters as for the training set, a number $N_t$ of test
samples was generated and the points were divided into noise and signal
applying both methods. To avoid storing large datasets in memory, at each
iteration, false positives and negatives were recorded using a running
statistics method implemented in the `gsl_rstat` library. For each sample, the
numbers $N_{fn}$ and $N_{fp}$ of false negatives and false positives were obtained
this way: for every noise point $x_n$, the threshold function $f(x_n)$ was
computed, then:

  - if $f(x_n) = 0 \thus$ $N_{fp} \to N_{fp}$
  - if $f(x_n) \neq 0 \thus$ $N_{fp} \to N_{fp} + 1$

and similarly for the signal points, counting the false negatives $N_{fn}$.
Finally, the mean and standard deviation of $N_{fn}$ and
$N_{fp}$ were computed over all the samples and used to estimate the purity
$\alpha$ and efficiency $\beta$ of the classification:
$$
  \alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s} \et
  \beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$
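
The bookkeeping can be illustrated with a small Python sketch. It is not the `gsl_rstat` implementation: it assumes Welford's running algorithm and the sample standard deviation (with $n - 1$ in the denominator), and the class name and the per-sample counts are hypothetical values invented for the example.

```python
import math


class RunningStats:
    """Accumulate mean and standard deviation without storing the data."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value):
        """Update the accumulators with one new value (Welford's algorithm)."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    def sd(self):
        """Sample standard deviation (n - 1 in the denominator)."""
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


false_negatives = RunningStats()
false_positives = RunningStats()
# Hypothetical counts (N_fn, N_fp) for four test samples.
for n_fn, n_fp in [(0, 1), (1, 0), (0, 0), (0, 1)]:
    false_negatives.add(n_fn)
    false_positives.add(n_fp)

n_signal, n_noise = 800, 1000
alpha = 1.0 - false_negatives.mean / n_signal
beta = 1.0 - false_positives.mean / n_noise
```

Only three numbers per counter are kept in memory, regardless of how many test samples are processed.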

Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the Fisher
discriminant gives a nearly perfect classification with a symmetric distribution
of false negatives and false positives, whereas the perceptron shows slightly more
false positives than false negatives and is also more variable from dataset to
dataset.
A possible explanation of this fact is that, for linearly separable and normally
distributed points, the Fisher linear discriminant is an exact analytical
solution, the most powerful one, according to the Neyman-Pearson lemma, whereas
the perceptron is only expected to converge to the solution and is therefore
more subject to random fluctuations.

-------------------------------------------------------------------------------------------
            $\alpha$            $\sigma_{\alpha}$   $\beta$             $\sigma_{\beta}$
----------- ------------------- ------------------- ------------------- -------------------
Fisher      0.9999              0.33                0.9999              0.33

Perceptron  0.9999              0.28                0.9995              0.64
-------------------------------------------------------------------------------------------

Table: Results for Fisher and perceptron method. $\sigma_{\alpha}$ and
$\sigma_{\beta}$ stand for the standard deviation of the false
negatives and false positives respectively. {#tbl:res_comp}