ex-7: review

parent ed73ea5c8b
commit e7a5185976

@@ -1,51 +1,50 @@

# Exercise 7

## Generating random points on the plane {#sec:sampling}

Two sets of 2D points $(x, y)$ --- signal and noise --- are to be generated
according to two bivariate Gaussian distributions with parameters:
\begin{align*}
\text{signal}\:
\begin{cases}
\mu = (0, 0) \\
\sigma_x = \sigma_y = 0.3 \\
\rho = 0.5
\end{cases}
&&
\text{noise}\:
\begin{cases}
\mu = (4, 4) \\
\sigma_x = \sigma_y = 1 \\
\rho = 0.4
\end{cases}
\end{align*}
where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ for the standard
deviations in the $x$ and $y$ directions respectively, and $\rho$ is the
bivariate correlation, namely:
$$
\sigma_{xy} = \rho\, \sigma_x \sigma_y
$$

where $\sigma_{xy}$ is the covariance of $x$ and $y$.
In the programs, $N_s = 800$ points for the signal and $N_n = 1000$ points for
the noise were chosen by default, but both can be customized from the command
line. Both samples were stored as $n \times 2$ matrices, where $n$ is the
number of points in the sample. The library `gsl_matrix` provided by GSL was
employed for this purpose and the function `gsl_ran_bivariate_gaussian()` was
used for generating the points.
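
As a rough sketch (not the actual source of the exercise), a sample could be
generated along these lines, where `generate_sample()` is a hypothetical
helper:

```c
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_matrix.h>

/* Fill an n x 2 matrix with points drawn from a bivariate Gaussian with
 * means (mux, muy), standard deviations sx, sy and correlation rho.
 * gsl_ran_bivariate_gaussian() returns zero-mean deviates, so the means
 * are added afterwards. */
gsl_matrix *generate_sample(gsl_rng *r, size_t n,
                            double mux, double muy,
                            double sx, double sy, double rho) {
  gsl_matrix *sample = gsl_matrix_alloc(n, 2);
  for (size_t i = 0; i < n; i++) {
    double dx, dy;
    gsl_ran_bivariate_gaussian(r, sx, sy, rho, &dx, &dy);
    gsl_matrix_set(sample, i, 0, mux + dx);
    gsl_matrix_set(sample, i, 1, muy + dy);
  }
  return sample;
}
```

With the default settings this would be called as
`generate_sample(r, 800, 0, 0, 0.3, 0.3, 0.5)` for the signal and
`generate_sample(r, 1000, 4, 4, 1, 1, 0.4)` for the noise.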

An example of the two samples is shown in @fig:points.

![Example of points sampled according to the two Gaussian distributions
with the given parameters.](images/7-points.pdf){#fig:points}

Assuming not to know how the points were generated, a classification model,
which assigns to each point the class it 'most probably' belongs to, is then
implemented. Depending on the interpretation of 'most probable', many different
models can be developed.
Here, the Fisher linear discriminant and the perceptron were implemented and
described in the following two sections. The results are compared in
@sec:class-results.


## Fisher linear discriminant

### The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It does so by projecting the data onto hyperplanes
that best divide the classes of points, consequently decreasing the dimension
to $n-1$. In the 2D case the projection is onto a line, therefore the problem
is reduced to simply selecting a threshold.

Consider the case of two classes (in this case signal and noise): the simplest
representation of a linear discriminant is obtained by taking a linear function
$\tilde{x}$ of a sampled 2D point $x$ so that:
$$
\tilde{x} = w^T x
$$

where $w$ is the so-called 'weight vector' and $w^T$ stands for its transpose.
An input point $x$ is commonly assigned to the first class if $\tilde{x} >
w_{th}$ and to the second one otherwise, where $w_{th}$ is a threshold value
somehow defined. In general, the projection onto one dimension leads to a
considerable loss of information and classes that are well separated in the
original 2D space may become strongly overlapping in one dimension. However, by
adjusting the components of the weight vector, a projection that maximizes the
class separation can be found [@bishop06].

To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
$C_2$, so that the means $\mu_1$ and $\mu_2$ of the two classes are given by:
$$
\mu_1 = \frac{1}{N_1} \sum_{n \in c_1} x_n
\et
\mu_2 = \frac{1}{N_2} \sum_{n \in c_2} x_n
$$

The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
\tilde{\mu}_2 - \tilde{\mu}_1 = w^T (\mu_2 - \mu_1)
$$

This expression can be made arbitrarily large simply by increasing the
magnitude of $w$; fortunately, the problem is easily solved by requiring $w$
to be normalised: $\| w \|^2 = 1$. Using a Lagrange multiplier to perform the
constrained maximization, it can be found that $w \propto (\mu_2 - \mu_1)$,
meaning that the line onto which the points must be projected is the one
joining the class means.

There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means,
which maximizes the distance between the projected means.

![The plot on the left shows samples from two classes along with the
histograms resulting from the projection onto the line joining the
class means: note the considerable overlap in the projected
space. The right plot shows the corresponding projection based on the
Fisher linear discriminant, showing the greatly improved class
separation. Figure taken from [@bishop06].](images/7-fisher.png){#fig:overlap}

The overlap of the projections can be reduced by maximising a function that
gives, besides a large separation, a small variance within each class. The
within-class variance of the transformed data of each class $k$ is given by:
$$
\tilde{s}_k^2 = \sum_{n \in c_k} (\tilde{x}_n - \tilde{\mu}_k)^2
$$

The total within-class variance for the whole data set is simply defined as
$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2$. The Fisher criterion is therefore
defined to be the ratio of the between-class distance to the within-class
variance and is given by:
$$
F(w) = \frac{(\tilde{\mu}_2 - \tilde{\mu}_1)^2}{\tilde{s}^2}
$$

The dependence on $w$ can be made explicit:
\begin{align*}
(\tilde{\mu}_2 - \tilde{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
  &= [w^T (\mu_2 - \mu_1)]^2 \\
  &= [w^T (\mu_2 - \mu_1)][w^T (\mu_2 - \mu_1)] \\
  &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w]
   = w^T M w
\end{align*}

where $M$ is the between-distance matrix. Similarly, as regards the denominator:
\begin{align*}
\tilde{s}^2 &= \tilde{s}_1^2 + \tilde{s}_2^2 \\
  &= \sum_{n \in c_1} (\tilde{x}_n - \tilde{\mu}_1)^2
   + \sum_{n \in c_2} (\tilde{x}_n - \tilde{\mu}_2)^2
   = w^T \Sigma_w w
\end{align*}

@@ -162,7 +163,7 @@ Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
when:
$$
w = \Sigma_w^{-1} (\mu_2 - \mu_1)
$$ {#eq:fisher-weight}

This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then be

@@ -178,63 +179,73 @@ with the `gsl_blas_dgemv()` function provided by GSL.
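
As an illustration (not the exercise's actual code), computing the direction
in @eq:fisher-weight with `gsl_blas_dgemv()` could reduce to a few BLAS calls;
the helper below assumes a $2 \times 2$ matrix `sigma_inv` already holding
$\Sigma_w^{-1}$, and all names are made up:

```c
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>

/* w = Sigma_w^-1 (mu2 - mu1), normalised to unit length.
 * sigma_inv must already contain the inverted within-class
 * covariance matrix. */
void fisher_direction(const gsl_matrix *sigma_inv,
                      const gsl_vector *mu1, const gsl_vector *mu2,
                      gsl_vector *w) {
  gsl_vector *diff = gsl_vector_alloc(2);
  gsl_vector_memcpy(diff, mu2);
  gsl_vector_sub(diff, mu1);                     /* diff = mu2 - mu1    */
  gsl_blas_dgemv(CblasNoTrans, 1.0, sigma_inv,
                 diff, 0.0, w);                  /* w = Sigma_w^-1 diff */
  gsl_vector_scale(w, 1.0 / gsl_blas_dnrm2(w));  /* |w| = 1             */
  gsl_vector_free(diff);
}
```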

### The threshold

The threshold $t_{\text{cut}}$ was fixed by requiring the conditional
probability $P(c_k | t_{\text{cut}})$ to be the same for both classes $c_k$:
$$
t_{\text{cut}} = x \text{ such that}\quad
\frac{P(c_1 | x)}{P(c_2 | x)} =
\frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$

where $P(x | c_k)$ is the probability for point $x$ along the Fisher projection
line of being sampled from class $c_k$. If $\tilde{x} > t_\text{cut}$, then $x$
more likely belongs to $c_1$, otherwise to $c_2$.

If each class is a bivariate Gaussian distribution, as in the present case,
then $P(x | c_k)$ is simply given by its projected normal distribution with
mean $\tilde{m} = w^T m$ and variance $\tilde{s}^2 = w^T S w$, where $S$ is the
covariance matrix of the class.
After some algebra, the threshold is found to be:
$$
t_{\text{cut}} = \frac{b}{a}
               + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$

where:

- $a = \tilde{s}_1^2 - \tilde{s}_2^2$
- $b = \tilde{\mu}_2 \, \tilde{s}_1^2 - \tilde{\mu}_1 \, \tilde{s}_2^2$
- $c = \tilde{\mu}_2^2 \, \tilde{s}_1^2 - \tilde{\mu}_1^2 \, \tilde{s}_2^2 - 2 \, \tilde{s}_1^2 \, \tilde{s}_2^2 \, \ln(\alpha)$
- $\alpha = P(c_1) / P(c_2)$

In a simulation, the ratio of the prior probabilities $\alpha$ can
simply be set to:
$$
\alpha = \frac{N_s}{N_n}
$$

The projection of the points was accomplished by the use of the function
`gsl_blas_ddot()`, which computes the dot product of two vectors.
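
Put together, the projection and the threshold formula might translate into
GSL/C code along the following lines (the function names, and the use of the
projected means and variances of the two classes as inputs, are assumptions
about the implementation):

```c
#include <math.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>

/* Projection of a 2D point onto the Fisher direction: x_tilde = w^T x. */
double project(const gsl_vector *w, const gsl_vector *x) {
  double x_tilde;
  gsl_blas_ddot(w, x, &x_tilde);
  return x_tilde;
}

/* Threshold t_cut from the projected means mu1, mu2 and variances
 * var1, var2 of the two classes, with alpha = N_s/N_n, following the
 * formula above. Assumes var1 != var2, as is the case here. */
double fisher_threshold(double mu1, double var1,
                        double mu2, double var2, double alpha) {
  double a = var1 - var2;
  double b = mu2 * var1 - mu1 * var2;
  double c = mu2 * mu2 * var1 - mu1 * mu1 * var2
           - 2.0 * var1 * var2 * log(alpha);
  return b / a + sqrt((b / a) * (b / a) - c / a);
}
```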

Results obtained for the same samples in @fig:points are shown in
@fig:fisher-proj. The weight vector and the threshold were found to be:
$$
w = (0.707, 0.707) \et
t_{\text{cut}} = 1.323
$$

::: { id=fig:fisher-proj }
![Scatter plot of the samples.](images/7-fisher-plane.pdf)
![Histogram of the Fisher-projected samples.](images/7-fisher-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::

### A mathematical curiosity

This section is really a sidenote which grew too large to fit in a margin,
so it can be safely skipped.

It can be seen that the weight vector turned out to be parallel to the line
joining the means of the two classes (as a reminder: $(0, 0)$ and $(4, 4)$),
as if the within-class covariances were ignored. Strange!

Looking at @eq:fisher-weight, one can be misled into thinking that the inverse
of the total covariance matrix $\Sigma_w$ is (proportional to) the identity,
but that's not true. By a remarkable accident, the vector joining the means is
an eigenvector of the covariance matrix $\Sigma_w^{-1}$. In fact, since
$\sigma_x = \sigma_y$ for both signal and noise:
$$
\Sigma_1 = \begin{pmatrix}
\sigma_x^2 & \sigma_{xy} \\
\sigma_{xy} & \sigma_x^2
\end{pmatrix}_1
\et
\Sigma_2 = \begin{pmatrix}
\sigma_x^2 & \sigma_{xy} \\
\sigma_{xy} & \sigma_x^2
\end{pmatrix}_2
$$

$\Sigma_w$ takes the symmetrical form
$$
\Sigma_w = \begin{pmatrix}
A & B \\
B & A
\end{pmatrix},
$$
which can be easily inverted by Gaussian elimination:
\begin{align*}
\begin{pmatrix}
A & B & \vline & 1 & 0 \\
B & A & \vline & 0 & 1
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
1 & 0 & \vline & \frac{A}{A^2 - B^2} & -\frac{B}{A^2 - B^2} \\
0 & 1 & \vline & -\frac{B}{A^2 - B^2} & \frac{A}{A^2 - B^2}
\end{pmatrix}
\end{align*}

Hence, the inverse still has the same form:
$$
\Sigma_w^{-1} = \begin{pmatrix}
\tilde{A} & \tilde{B} \\
\tilde{B} & \tilde{A}
\end{pmatrix}
$$

For this reason, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors
$v_1$ and $v_2$:
$$
v_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
\et
v_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$

The vector joining the means is clearly a multiple of $v_2$, and so is $w$.
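
Indeed, acting with $\Sigma_w^{-1}$ on the vector joining the means gives back
a multiple of that same vector:
$$
\Sigma_w^{-1} (\mu_2 - \mu_1) =
\begin{pmatrix} \tilde{A} & \tilde{B} \\ \tilde{B} & \tilde{A} \end{pmatrix}
\begin{pmatrix} 4 \\ 4 \end{pmatrix}
= 4 \, (\tilde{A} + \tilde{B}) \begin{pmatrix} 1 \\ 1 \end{pmatrix}
\propto \mu_2 - \mu_1
$$
which is why $w$ comes out parallel to the line joining the means in this
particular sample.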


## Perceptron

@@ -311,16 +315,18 @@ The aim of the perceptron algorithm is to determine the weight vector $w$ and
bias $b$ such that the so-called 'threshold function' $f(x)$ returns a binary
value: it is expected to return 1 for signal points and 0 for noise points:
$$
f(x) = \theta(w^T x + b)
$$ {#eq:perc}

where $\theta$ is the Heaviside theta function. Note that the bias $b$ is
$-t_\text{cut}$, as defined in the previous section.

The training was performed using the generated sample as training set. From an
initial guess for $w$ and $b$ (which were set to be all null in the code), the
perceptron starts improving their estimates. The training set is passed
point by point into an iterative procedure $N$ times: for every point, the
output of $f(x)$ is computed. Afterwards, the variable $\Delta$, which is
defined as:
$$
\Delta = r [e - f(x)]
$$

@@ -355,7 +361,7 @@ To see how it works, consider the four possible situations:
the current $b$ and $w$ overestimate the correct output: they must be
decreased.

Whilst the $b$ updating is obvious, as regards $w$ the following consideration
may help clarify. Consider the case with $e = 0 \quad \wedge \quad f(x) = 1
\quad \Longrightarrow \quad \Delta = -1$:
$$

@@ -366,53 +372,51 @@

Similarly for the case with $e = 1$ and $f(x) = 0$.
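
A compact sketch of one training pass implementing this rule (the data layout
and the names below are hypothetical, not the program's actual interface):

```c
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>

/* One pass of the perceptron rule over the training set, as described
 * above. points[i] is a 2D point, expected[i] is 1 for signal and 0
 * for noise, r is the learning rate. */
void perceptron_pass(gsl_vector *w, double *b, double r,
                     gsl_vector **points, const int *expected, size_t n) {
  for (size_t i = 0; i < n; i++) {
    double proj;
    gsl_blas_ddot(w, points[i], &proj);       /* w^T x                */
    int f = (proj + *b > 0.0) ? 1 : 0;        /* Heaviside theta      */
    double delta = r * (expected[i] - f);     /* Delta = r [e - f(x)] */
    gsl_blas_daxpy(delta, points[i], w);      /* w -> w + Delta x     */
    *b += delta;                              /* b -> b + Delta       */
  }
}
```

Repeating such a pass $N$ times and normalizing $w$ afterwards would
correspond to the procedure described above.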

![Weight vector and threshold value obtained with the perceptron method as a
function of the number of iterations. Both level off at the third
iteration.](images/7-iterations.pdf){#fig:percep-iterations}

As far as convergence is concerned, the perceptron will never get to the state
with all the input points classified correctly if the training set is not
linearly separable, meaning that the signal cannot be separated from the noise
by a line in the plane. In this case, no approximate solutions will be
gradually approached. On the other hand, if the training set is linearly
separable, it can be shown (see [@novikoff63]) that this method converges to
the coveted function. As in the previous section, once found, the weight
vector is to be normalized.

With $N = 5$ iterations, the values of $w$ and $t_{\text{cut}}$ level off up to
the third digit. Different values of the learning rate were tested, all giving
the same result and converging by $N = 3$ iterations. In @fig:percep-iterations,
results are shown for $r = 0.8$: as can be seen, for $N = 3$, the values of $w$
and $t_{\text{cut}}$ level off. The following results were obtained:
$$
w = (-0.654, -0.756) \et t_{\text{cut}} = 1.213
$$

In this case, the projection line is not exactly parallel with the line joining
the means of the two samples. Plots are shown in @fig:percep-proj.

::: { id=fig:percep-proj }
![Scatter plot of the samples.](images/7-percep-plane.pdf)
![Histogram of the projected samples.](images/7-percep-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::


## Efficiency test {#sec:class-results}

Using the same parameters as the training set, a number $N_t$ of test samples
was generated and the points were classified applying both methods. To avoid
storing large datasets in memory, at each iteration, false positives and
negatives were recorded using a running statistics method implemented in the
`gsl_rstat` library (sketched below). For each sample, the numbers $N_{fn}$ and
$N_{fp}$ of false negatives and false positives were obtained this way: for
every noise point $x_n$, the threshold function $f(x_n)$ was computed, then:

- if $f(x) = 0 \thus$ $N_{fn} \to N_{fn}$
- if $f(x) \neq 0 \thus$ $N_{fn} \to N_{fn} + 1$
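
As a minimal, hypothetical illustration of the `gsl_rstat` bookkeeping (the
values pushed into the accumulator are placeholders standing in for one
per-sample result each):

```c
#include <stdio.h>
#include <gsl/gsl_rstat.h>

/* Running statistics: values are accumulated one at a time, so the
 * per-sample results never need to be stored in memory. */
int main(void) {
  gsl_rstat_workspace *acc = gsl_rstat_alloc();

  /* placeholder values, one per test sample */
  double fractions[] = { 0.001, 0.000, 0.002 };
  for (int i = 0; i < 3; i++)
    gsl_rstat_add(fractions[i], acc);

  printf("mean = %g, sd = %g\n", gsl_rstat_mean(acc), gsl_rstat_sd(acc));
  gsl_rstat_free(acc);
  return 0;
}
```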

@@ -438,13 +442,13 @@ solution, the most powerful one, according to the Neyman-Pearson lemma, whereas
the perceptron is only expected to converge to the solution and is therefore
more subject to random fluctuations.

----------------------------------------------------
             $α$       $σ_α$     $β$       $σ_β$
------------ --------- --------- --------- ---------
Fisher       0.9999    0.33      0.9999    0.33

Perceptron   0.9999    0.28      0.9995    0.64
----------------------------------------------------

Table: Results for Fisher and perceptron method. $\sigma_{\alpha}$ and
$\sigma_{\beta}$ stand for the standard deviation of the false