# Exercise 7

## Generating random points on the plane {#sec:sampling}

Two sets of 2D points $(x, y)$ --- signal and noise --- are to be generated
according to two bivariate Gaussian distributions with parameters:
\begin{align*}
  \text{signal}\:
  \begin{cases}
    \mu = (0, 0) \\
    \sigma_x = \sigma_y = 0.3 \\
    \rho = 0.5
  \end{cases}
  &&
  \text{noise}\:
  \begin{cases}
    \mu = (4, 4) \\
    \sigma_x = \sigma_y = 1 \\
    \rho = 0.4
  \end{cases}
\end{align*}
where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ for the standard
deviations in the $x$ and $y$ directions respectively, and $\rho$ is the
bivariate correlation, namely:
$$
  \sigma_{xy} = \rho\, \sigma_x \sigma_y
$$

where $\sigma_{xy}$ is the covariance of $x$ and $y$.

In the programs, $N_s = 800$ points for the signal and $N_n = 1000$ points for
the noise were chosen as defaults, but both can be customized from the
command line. Each sample was stored as an $n \times 2$ matrix, where $n$ is
the number of points in the sample. The `gsl_matrix` structure provided by GSL
was employed for this purpose and the function `gsl_ran_bivariate_gaussian()`
was used for generating the points.
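
A minimal sketch of how such a sample might be generated and stored (function
and variable names here are illustrative, not taken from the actual programs):

```c
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_matrix.h>

/* Fill an n x 2 matrix with points drawn from a bivariate Gaussian
 * with mean (mx, my), standard deviations sx, sy and correlation rho. */
gsl_matrix *sample_points(gsl_rng *r, size_t n,
                          double mx, double my,
                          double sx, double sy, double rho)
{
  gsl_matrix *m = gsl_matrix_alloc(n, 2);
  for (size_t i = 0; i < n; i++) {
    double dx, dy;
    /* gsl_ran_bivariate_gaussian() returns a zero-mean correlated pair;
     * the mean is added afterwards. */
    gsl_ran_bivariate_gaussian(r, sx, sy, rho, &dx, &dy);
    gsl_matrix_set(m, i, 0, mx + dx);
    gsl_matrix_set(m, i, 1, my + dy);
  }
  return m;
}

int main(void)
{
  gsl_rng_env_setup();
  gsl_rng *r = gsl_rng_alloc(gsl_rng_default);

  gsl_matrix *signal = sample_points(r, 800,  0, 0, 0.3, 0.3, 0.5);
  gsl_matrix *noise  = sample_points(r, 1000, 4, 4, 1.0, 1.0, 0.4);

  /* ... classification ... */

  gsl_matrix_free(signal);
  gsl_matrix_free(noise);
  gsl_rng_free(r);
  return 0;
}
```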

An example of the two samples is shown in @fig:points.

![Example of points sampled according to the two Gaussian distributions
with the given parameters.](images/7-points.pdf){#fig:points}

Assuming the way the points were generated is not known, a classification
model, which assigns each point to the most probable class it belongs to, is
then to be implemented. Depending on how 'most probable' is interpreted, many
different models can be developed. Here, the Fisher linear discriminant and
the perceptron were implemented and are described in the following two
sections. The results are compared in @sec:class-results.


## Fisher linear discriminant

### The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It does so by projecting the data onto the
hyperplane that best divides the classes of points, consequently decreasing
the dimension to $n - 1$. In the 2D case the projection is onto a line,
therefore the problem is reduced to simply selecting a threshold.

Consider the case of two classes (in this case signal and noise): the simplest
representation of a linear discriminant is obtained by taking a linear function
$\tilde{x}$ of a sampled 2D point $x$ so that:
$$
  \tilde{x} = w^T x
$$

where $w$ is the so-called 'weight vector' and $w^T$ stands for its transpose.

An input point $x$ is commonly assigned to the first class if $\tilde{x} >
w_{th}$ and to the second one otherwise, where $w_{th}$ is a threshold value
somehow defined. In general, the projection onto one dimension leads to a
considerable loss of information and classes that are well separated in the
original 2D space may become strongly overlapping in one dimension. However, by
adjusting the components of the weight vector, a projection that maximizes the
class separation can be found [@bishop06].

To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
$C_2$, so that the means $\mu_1$ and $\mu_2$ of the two classes are given by:
$$
  \mu_k = \frac{1}{N_k} \sum_{n \in c_k} x_n
$$

The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
  \tilde{\mu}_2 - \tilde{\mu}_1 = w^T (\mu_2 - \mu_1)
$$

This expression can be made arbitrarily large simply by increasing the
magnitude of $w$; fortunately, the problem is easily solved by requiring $w$
to be normalised: $|w|^2 = 1$. Using a Lagrange multiplier to perform the
constrained maximization, it can be found that $w \propto (\mu_2 - \mu_1)$,
meaning that the line onto which the points must be projected is the one
joining the class means.
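
Writing the constrained maximization out explicitly, with a Lagrange
multiplier $\lambda$:
$$
  \frac{\partial}{\partial w}
  \left[ w^T (\mu_2 - \mu_1) + \lambda \, (1 - |w|^2) \right]
  = (\mu_2 - \mu_1) - 2 \lambda w = 0
  \quad \Longrightarrow \quad
  w = \frac{\mu_2 - \mu_1}{2 \lambda} \propto (\mu_2 - \mu_1)
$$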

There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means,
even though that line maximizes the distance between the projected means.

![The plot on the left shows samples from two classes along with the
histograms resulting from the projection onto the line joining the
class means: note the considerable overlap in the projected
space. The right plot shows the corresponding projection based on the
Fisher linear discriminant, showing the greatly improved class
separation. Figure taken from [@bishop06].](images/7-fisher.png){#fig:overlap}

The overlap of the projections can be reduced by maximising a function that
gives, besides a large separation between the projected means, a small
variance within each class. The within-class variance of the transformed data
of each class $k$ is given by:
$$
  \tilde{s}_k^2 = \sum_{n \in c_k} (\tilde{x}_n - \tilde{\mu}_k)^2
$$

The total within-class variance for the whole data set is simply defined as
$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2$. The Fisher criterion is
therefore defined as the ratio of the between-class distance to the
within-class variance:
$$
  F(w) = \frac{(\tilde{\mu}_2 - \tilde{\mu}_1)^2}{\tilde{s}^2}
$$

The dependence on $w$ can be made explicit:
\begin{align*}
  (\tilde{\mu}_2 - \tilde{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
  &= [w^T (\mu_2 - \mu_1)]^2 \\
  &= [w^T (\mu_2 - \mu_1)][w^T (\mu_2 - \mu_1)] \\
  &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w] \\
  &= w^T (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T w = w^T M w
\end{align*}

where $M$ is the between-distance matrix. Similarly, as regards the
denominator:
\begin{align*}
  \tilde{s}^2 &= \tilde{s}_1^2 + \tilde{s}_2^2 \\
  &= \sum_{n \in c_1} (\tilde{x}_n - \tilde{\mu}_1)^2
   + \sum_{n \in c_2} (\tilde{x}_n - \tilde{\mu}_2)^2
  = w^T \Sigma_w w
\end{align*}

where $\Sigma_w$ is the total within-class covariance matrix, so that the
Fisher criterion can be written as:
$$
  F(w) = \frac{w^T M w}{w^T \Sigma_w w}
$$

Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
when:
$$
  w = \Sigma_w^{-1} (\mu_2 - \mu_1)
$$ {#eq:fisher-weight}

This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then
be used to build a discriminant by choosing a threshold. In the code, the
matrix-vector product in @eq:fisher-weight was carried out with the
`gsl_blas_dgemv()` function provided by GSL.
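
A possible sketch of this computation (the helper name and the use of an LU
decomposition to invert $\Sigma_w$ are assumptions, not necessarily what the
actual program does):

```c
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_permutation.h>
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_blas.h>

/* w = Sigma_w^{-1} (mu2 - mu1) for a 2x2 Sigma_w and 2D means,
 * normalized to unit length. */
void fisher_weight(const gsl_matrix *sigma_w,
                   const gsl_vector *mu1, const gsl_vector *mu2,
                   gsl_vector *w)
{
  gsl_matrix *lu   = gsl_matrix_alloc(2, 2);
  gsl_matrix *inv  = gsl_matrix_alloc(2, 2);
  gsl_vector *diff = gsl_vector_alloc(2);
  gsl_permutation *p = gsl_permutation_alloc(2);
  int signum;

  /* diff = mu2 - mu1 */
  gsl_vector_memcpy(diff, mu2);
  gsl_vector_sub(diff, mu1);

  /* invert Sigma_w via an LU decomposition */
  gsl_matrix_memcpy(lu, sigma_w);
  gsl_linalg_LU_decomp(lu, p, &signum);
  gsl_linalg_LU_invert(lu, p, inv);

  /* w = Sigma_w^{-1} (mu2 - mu1): matrix-vector product via dgemv */
  gsl_blas_dgemv(CblasNoTrans, 1.0, inv, diff, 0.0, w);

  /* normalize the weight vector */
  gsl_vector_scale(w, 1.0 / gsl_blas_dnrm2(w));

  gsl_matrix_free(lu);
  gsl_matrix_free(inv);
  gsl_vector_free(diff);
  gsl_permutation_free(p);
}
```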

### The threshold

The threshold $t_{\text{cut}}$ was fixed by requiring the conditional
probability $P(c_k | t_{\text{cut}})$ to be the same for both classes $c_k$:
$$
  t_{\text{cut}} = x \text{ such that} \quad
  \frac{P(c_1 | x)}{P(c_2 | x)} =
  \frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$

where $P(x | c_k)$ is the probability for a point $x$ along the Fisher
projection line of having been sampled from class $k$. If $\tilde{x} >
t_\text{cut}$, then $x$ more likely belongs to $c_1$, otherwise to $c_2$.

If each class is a bivariate Gaussian distribution, as in the present case,
then $P(x | c_k)$ is simply given by its projected normal distribution with
mean $\tilde{m} = w^T m$ and variance $\tilde{s}^2 = w^T S w$, where $S$ is
the covariance matrix of the class. After some algebra, the threshold is found
to be:
$$
  t_{\text{cut}} = \frac{b}{a}
                 + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$

where:

- $a = \tilde{s}_1^2 - \tilde{s}_2^2$
- $b = \tilde{\mu}_2 \, \tilde{s}_1^2 - \tilde{\mu}_1 \, \tilde{s}_2^2$
- $c = \tilde{\mu}_2^2 \, \tilde{s}_1^2 - \tilde{\mu}_1^2 \, \tilde{s}_2^2
  - 2 \, \tilde{s}_1^2 \, \tilde{s}_2^2 \, \ln(\alpha)$
- $\alpha = P(c_1) / P(c_2)$

In a simulation, the ratio of the prior probabilities $\alpha$ can simply be
set to:
$$
  \alpha = \frac{N_s}{N_n}
$$
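
A sketch of how the threshold could then be computed from the projected
statistics (names are illustrative):

```c
#include <math.h>

/* Threshold from the projected means m1, m2, the projected standard
 * deviations s1, s2 and the prior ratio alpha = N_s / N_n, following
 * the quadratic solution above. Assumes s1 != s2. */
double fisher_threshold(double m1, double m2,
                        double s1, double s2, double alpha)
{
  double a = s1 * s1 - s2 * s2;
  double b = m2 * s1 * s1 - m1 * s2 * s2;
  double c = m2 * m2 * s1 * s1 - m1 * m1 * s2 * s2
           - 2.0 * s1 * s1 * s2 * s2 * log(alpha);

  return b / a + sqrt((b / a) * (b / a) - c / a);
}
```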

The projection of the points was accomplished with the function
`gsl_blas_ddot()`, which computes the dot product of two vectors.
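
For instance, projecting every sample point onto $w$ might look like the
following (a sketch, not the actual code):

```c
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>

/* Project every row of the n x 2 sample matrix onto w and store the
 * projections w^T x_i in proj (a vector of length n). */
void project_sample(const gsl_matrix *sample, const gsl_vector *w,
                    gsl_vector *proj)
{
  for (size_t i = 0; i < sample->size1; i++) {
    gsl_vector_const_view x = gsl_matrix_const_row(sample, i);
    double t;
    gsl_blas_ddot(&x.vector, w, &t);
    gsl_vector_set(proj, i, t);
  }
}
```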

Results obtained for the same samples in @fig:points are shown in
@fig:fisher-proj. The weight vector and the threshold were found to be:
$$
  w = (0.707, 0.707) \et t_{\text{cut}} = 1.323
$$

::: { id=fig:fisher-proj }
![Scatter plot of the samples.](images/7-fisher-plane.pdf)
![Histogram of the Fisher-projected samples.](images/7-fisher-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in
red.
:::

### A mathematical curiosity

This section is really a sidenote which grew too large to fit in a margin, so
it can be safely skipped.

It can be seen that the weight vector turned out to be parallel to the line
joining the means of the two classes (as a reminder: $(0, 0)$ and $(4, 4)$),
as if the within-class covariances were ignored. Strange!

Looking at @eq:fisher-weight, one can be misled into thinking that the inverse
of the total covariance matrix $\Sigma_w$ is (proportional to) the identity,
but that's not true. By a remarkable accident, the vector joining the means is
an eigenvector of $\Sigma_w^{-1}$. In fact, since $\sigma_x = \sigma_y$ for
both signal and noise:
$$
  \Sigma_1 = \begin{pmatrix}
    \sigma_x^2 & \sigma_{xy} \\
    \sigma_{xy} & \sigma_y^2
  \end{pmatrix}_1
  \et
  \Sigma_2 = \begin{pmatrix}
    \sigma_x^2 & \sigma_{xy} \\
    \sigma_{xy} & \sigma_y^2
  \end{pmatrix}_2
$$

$\Sigma_w$ takes the symmetrical form
$$
  \Sigma_w = \begin{pmatrix}
    A & B \\
    B & A
  \end{pmatrix},
$$

which can be easily inverted by Gaussian elimination:
\begin{align*}
  \begin{pmatrix}
    A & B & \vline & 1 & 0 \\
    B & A & \vline & 0 & 1
  \end{pmatrix}
\end{align*}

Hence, the inverse still has the same form:
$$
  \Sigma_w^{-1} = \begin{pmatrix}
    \tilde{A} & \tilde{B} \\
    \tilde{B} & \tilde{A}
  \end{pmatrix}
$$
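
Carrying the elimination through (for $A^2 \neq B^2$), the entries are
explicitly:
$$
  \tilde{A} = \frac{A}{A^2 - B^2} \et \tilde{B} = -\frac{B}{A^2 - B^2}
$$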

For this reason, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors
$v_1$ and $v_2$:
$$
  v_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
  \et
  v_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$
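
Indeed, a direct multiplication shows that any matrix of this symmetric form
has these two eigenvectors:
$$
  \begin{pmatrix} A & B \\ B & A \end{pmatrix}
  \begin{pmatrix} 1 \\ -1 \end{pmatrix}
  = (A - B) \begin{pmatrix} 1 \\ -1 \end{pmatrix}
  \et
  \begin{pmatrix} A & B \\ B & A \end{pmatrix}
  \begin{pmatrix} 1 \\ 1 \end{pmatrix}
  = (A + B) \begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$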

The vector joining the means is clearly a multiple of $v_2$, and so is $w$.

## Perceptron

The aim of the perceptron algorithm is to determine the weight vector $w$ and
the bias $b$ such that the so-called 'threshold function' $f(x)$ returns a
binary value: it is expected to return 1 for signal points and 0 for noise
points:
$$
  f(x) = \theta(w^T x + b)
$$ {#eq:perc}

where $\theta$ is the Heaviside theta function. Note that the bias $b$ is
$-t_\text{cut}$, as defined in the previous section.

The training was performed using the generated sample as the training set.
From an initial guess for $w$ and $b$ (both set to zero in the code), the
perceptron iteratively improves its estimates. The training set is passed
point by point through an iterative procedure $N$ times: for every point, the
output of $f(x)$ is computed. Afterwards, the variable $\Delta$, defined as:
$$
  \Delta = r [e - f(x)]
$$

where $r$ is the learning rate and $e$ is the expected output, is used to
update the estimates: $w \to w + \Delta x$ and $b \to b + \Delta$.

To see how it works, consider the four possible situations: for example, when
$e = 0$ and $f(x) = 1$, the current $b$ and $w$ overestimate the correct
output and must therefore be decreased.

Whilst the $b$ update is obvious, as regards $w$ the following consideration
may help clarify. Consider the case with $e = 0 \quad \wedge \quad f(x) = 1
\quad \Longrightarrow \quad \Delta = -r$:
$$
  w \to w + \Delta x = w - r\, x
  \quad \Longrightarrow \quad
  w^T x \to w^T x - r\, |x|^2 < w^T x
$$

which lowers the argument of $\theta$, as desired. Similarly for the case with
$e = 1$ and $f(x) = 0$.
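
A minimal sketch of the training loop described above (function and variable
names are illustrative; the update rule $w \to w + \Delta x$,
$b \to b + \Delta$ is the standard perceptron rule assumed here):

```c
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>

/* Threshold function f(x) = theta(w^T x + b). */
static int threshold(const gsl_vector *w, double b, const gsl_vector *x)
{
  double t;
  gsl_blas_ddot(w, x, &t);
  return (t + b > 0) ? 1 : 0;
}

/* One pass over a sample whose expected output is e (1 for signal,
 * 0 for noise), with learning rate r. */
static void perceptron_pass(gsl_vector *w, double *b, double r,
                            const gsl_matrix *sample, int e)
{
  for (size_t i = 0; i < sample->size1; i++) {
    gsl_vector_const_view x = gsl_matrix_const_row(sample, i);
    double delta = r * (e - threshold(w, *b, &x.vector));
    gsl_blas_daxpy(delta, &x.vector, w);  /* w <- w + delta * x */
    *b += delta;                          /* b <- b + delta     */
  }
}
```

Iterating this pass $N$ times over the signal (with $e = 1$) and noise (with
$e = 0$) samples, and normalizing $w$ at the end, gives the training procedure
described above.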

![Weight vector and threshold value obtained with the perceptron method as a
function of the number of iterations. Both level off at the third
iteration.](images/7-iterations.pdf){#fig:percep-iterations}

As far as convergence is concerned, the perceptron will never reach the state
in which all the input points are classified correctly if the training set is
not linearly separable, meaning that the signal cannot be separated from the
noise by a line in the plane. In this case, no approximate solution will be
gradually approached either. On the other hand, if the training set is
linearly separable, it can be shown (see [@novikoff63]) that this method
converges to the coveted function. As in the previous section, once found, the
weight vector is to be normalized.

Different values of the learning rate were tested, all giving the same result
and converging after $N = 3$ iterations. In @fig:percep-iterations, results
are shown for $r = 0.8$: as can be seen, for $N = 3$ the values of $w$ and
$t_{\text{cut}}$ level off. The following results were obtained:
$$
  w = (-0.654, -0.756) \et t_{\text{cut}} = 1.213
$$

In this case, the projection line is not exactly parallel with the line
joining the means of the two samples. The corresponding plots are shown in
@fig:percep-proj.

::: { id=fig:percep-proj }
![Scatter plot of the samples.](images/7-percep-plane.pdf)
![Histogram of the projected samples.](images/7-percep-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in
red.
:::

## Efficiency test {#sec:class-results}

Using the same parameters as the training set, a number $N_t$ of test samples
was generated and their points were classified with both methods. To avoid
storing large datasets in memory, at each iteration false positives and
negatives were recorded using the running-statistics routines implemented in
the `gsl_rstat` library. For each sample, the numbers $N_{fn}$ and $N_{fp}$ of
false negatives and false positives were obtained this way: for every noise
point $x_n$, the threshold function $f(x_n)$ was computed, then:

- if $f(x) = 0 \thus$ $N_{fn} \to N_{fn}$
- if $f(x) \neq 0 \thus$ $N_{fn} \to N_{fn} + 1$

According to the Neyman-Pearson lemma, the Fisher discriminant gives the most
powerful solution, whereas the perceptron is only expected to converge to the
solution and is therefore more subject to random fluctuations.
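
A sketch of how the per-sample statistics might be accumulated with the
`gsl_rstat` running-statistics routines (the helper and the quantity being
averaged are illustrative placeholders):

```c
#include <gsl/gsl_rstat.h>

/* Hypothetical helper: generates one test sample, classifies it and
 * returns the fraction of misclassified points (body omitted here). */
static double run_one_test_sample(void)
{
  /* ... generation and classification as described in the text ... */
  return 0.0;
}

int main(void)
{
  const int N_t = 1000;                 /* illustrative number of test samples */
  gsl_rstat_workspace *stat = gsl_rstat_alloc();

  /* One value per sample is accumulated: nothing else is kept in memory. */
  for (int i = 0; i < N_t; i++)
    gsl_rstat_add(run_one_test_sample(), stat);

  double mean = gsl_rstat_mean(stat);   /* running mean               */
  double sd   = gsl_rstat_sd(stat);     /* running standard deviation */

  gsl_rstat_free(stat);
  (void)mean;
  (void)sd;
  return 0;
}
```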

------------------------------------------------------
             $α$        $σ_α$      $β$        $σ_β$
----------- ---------- ---------- ---------- ---------
Fisher       0.9999     0.33       0.9999     0.33

Perceptron   0.9999     0.28       0.9995     0.64
------------------------------------------------------

Table: Results for the Fisher and perceptron methods. $\sigma_{\alpha}$ and
$\sigma_{\beta}$ stand for the standard deviation of the false