ex-7: review

Giù Marcer 2020-06-03 22:46:17 +02:00 committed by rnhmjoj
parent 3b774fb747
commit 7c830a7ba0


@ -25,7 +25,7 @@ correlation, namely:
$$
\sigma_{xy} = \rho\, \sigma_x \sigma_y
$$
where $\sigma_{xy}$ is the covariance of $x$ and $y$.
In the programs, $N_s = 800$ points for the signal and $N_n = 1000$ points for
the noise were chosen as default but can be customized from the command-line.
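As a rough illustration, the sampling can be done with GSL's
`gsl_ran_bivariate_gaussian()`. The following is only a sketch: the widths, the
correlation and the pairing of sample sizes with centres are placeholder
assumptions, not necessarily the program defaults.

```c
/* Sketch: drawing two correlated bivariate Gaussian samples with GSL.
 * Widths, correlation and the size/centre pairing are assumptions. */
#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main(void) {
    const size_t N_s = 800, N_n = 1000;       /* default sample sizes         */
    const double sx = 1, sy = 1, rho = 0.5;   /* assumed sigma_x, sigma_y, rho */

    gsl_rng *r = gsl_rng_alloc(gsl_rng_taus2);

    for (size_t i = 0; i < N_s; i++) {        /* sample centred in (0, 0)     */
        double dx, dy;
        gsl_ran_bivariate_gaussian(r, sx, sy, rho, &dx, &dy);
        printf("1 %g %g\n", dx, dy);
    }
    for (size_t i = 0; i < N_n; i++) {        /* sample centred in (4, 4)     */
        double dx, dy;
        gsl_ran_bivariate_gaussian(r, sx, sy, rho, &dx, &dy);
        printf("2 %g %g\n", 4 + dx, 4 + dy);
    }

    gsl_rng_free(r);
    return 0;
}
```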
@ -40,11 +40,10 @@ with the given parameters.](images/7-points.pdf){#fig:points}
Assuming no knowledge of how the points were generated, a classification model,
which assigns to each point the 'most probable' class it belongs to,
was implemented. Depending on the interpretation of 'most probable', many
different models can be developed. Here, the Fisher linear discriminant and the
perceptron were implemented and described in the following two sections. The
results are compared in @sec:class-results.
## Fisher linear discriminant
@ -81,13 +80,11 @@ $$
$$
\mu_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
\et
\mu_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$
The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
\tilde{\mu}_2 - \tilde{\mu}_1 = w^T (\mu_2 - \mu_1)
$$
This expression can be made arbitrarily large simply by increasing the
magnitude of $w$; fortunately, the problem is easily solved by requiring $w$
to be normalised: $|w|^2 = 1$. Using a Lagrange multiplier to perform the
@ -112,15 +109,13 @@ within-class variance of the transformed data of each class $k$ is given by:
$$
\tilde{s}_k^2 = \sum_{n \in c_k} (\tilde{x}_n - \tilde{\mu}_k)^2
$$
The total within-class variance for the whole data set is simply defined as
$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2$. The Fisher criterion is therefore
defined to be the ratio of the between-class distance to the within-class
variance and is given by:
$$
F(w) = \frac{(\tilde{\mu}_2 - \tilde{\mu}_1)^2}{\tilde{s}^2}
$$
The dependence on $w$ can be made explicit:
\begin{align*}
(\tilde{\mu}_2 - \tilde{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
@ -129,7 +124,6 @@ The dependence on $w$ can be made explicit:
&= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w]
= w^T M w
\end{align*}
where $M$ is the between-distance matrix. Similarly, as regards the denominator:
\begin{align*}
\tilde{s}^2 &= \tilde{s}_1^2 + \tilde{s}_2^2 = \\
@ -137,7 +131,6 @@ where $M$ is the between-distance matrix. Similarly, as regards the denominator:
+ \sum_{n \in c_2} (\tilde{x}_n - \tilde{\mu}_2)^2
= w^T \Sigma_w w
\end{align*}
where $\Sigma_w$ is the total within-class covariance matrix:
\begin{align*}
\Sigma_w &= \sum_{n \in c_1} (x_n - \mu_1)(x_n - \mu_1)^T
@ -152,19 +145,16 @@ where $\Sigma_w$ is the total within-class covariance matrix:
\sigma_{xy} & \sigma_y^2
\end{pmatrix}_2
\end{align*}
where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the two samples.
The Fisher criterion can therefore be rewritten in the form:
$$
F(w) = \frac{w^T M w}{w^T \Sigma_w w}
$$
Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
when:
$$
w = \Sigma_w^{-1} (\mu_2 - \mu_1)
$$ {#eq:fisher-weight}
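A minimal sketch of how @eq:fisher-weight could be evaluated with GSL in the 2D
case follows; the function name and the LU-based solve are illustrative choices,
not taken from the actual program.

```c
/* Sketch: solve Sigma_w w = (mu2 - mu1) for the Fisher weight (2D case).
 * Hypothetical helper; the report's program may be organised differently. */
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_permutation.h>
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_blas.h>

void fisher_weight(const gsl_matrix *sigma_w,  /* 2x2 within-class covariance */
                   const gsl_vector *mu_diff,  /* mu2 - mu1                   */
                   gsl_vector *w) {            /* output: weight vector       */
    gsl_matrix *lu = gsl_matrix_alloc(2, 2);
    gsl_permutation *p = gsl_permutation_alloc(2);
    int signum;

    gsl_matrix_memcpy(lu, sigma_w);
    gsl_linalg_LU_decomp(lu, p, &signum);
    gsl_linalg_LU_solve(lu, p, mu_diff, w);    /* w = Sigma_w^-1 (mu2 - mu1) */

    /* normalise the weight vector, as required above */
    gsl_vector_scale(w, 1.0 / gsl_blas_dnrm2(w));

    gsl_permutation_free(p);
    gsl_matrix_free(lu);
}
```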
This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then be
used to construct a discriminant by choosing a threshold for the
@ -182,11 +172,10 @@ with the `gsl_blas_dgemv()` function provided by GSL.
The threshold $t_{\text{cut}}$ was fixed by requiring the conditional
probability $P(c_k | t_{\text{cut}})$ to be the same for both classes $c_k$:
$$
t_{\text{cut}} = x \quad \text{such that}\quad
\frac{P(c_1 | x)}{P(c_2 | x)} =
\frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$
where $P(x | c_k)$ is the probability for a point $x$ along the Fisher
projection line to have been sampled from class $c_k$. If $\tilde{x} >
t_\text{cut}$, then $x$ more likely belongs to $c_1$; otherwise, to $c_2$.
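For instance, if the projected points of each class are modelled as Gaussian
with means $\tilde{\mu}_k$ and standard deviations $\tilde{\sigma}_k$ (an
assumption made here only for illustration), the condition above becomes:
$$
\frac{P(c_1)}{\tilde{\sigma}_1}
  \exp\left(-\frac{(t_\text{cut} - \tilde{\mu}_1)^2}{2\tilde{\sigma}_1^2}\right)
=
\frac{P(c_2)}{\tilde{\sigma}_2}
  \exp\left(-\frac{(t_\text{cut} - \tilde{\mu}_2)^2}{2\tilde{\sigma}_2^2}\right)
$$
which, once logarithms are taken, is a quadratic equation in $t_\text{cut}$.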
@ -213,7 +202,6 @@ simply be set to:
$$
\alpha = \frac{N_s}{N_n}
$$
The projection of the points was carried out with the function
`gsl_blas_ddot()`, which computes a fast dot product of two vectors.
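For a single point this boils down to a dot product and a comparison, as in the
sketch below (hypothetical helper; the sign convention for the two classes is
an assumption).

```c
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>

/* Sketch: project a 2D point onto the Fisher axis and apply the cut.
 * Returns 1 if the projection falls above t_cut, 0 otherwise. */
int classify(const gsl_vector *x, const gsl_vector *w, double t_cut) {
    double proj;
    gsl_blas_ddot(w, x, &proj);   /* proj = w^T x */
    return proj > t_cut;
}
```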
@ -228,7 +216,7 @@ $$
![Scatter plot of the samples.](images/7-fisher-plane.pdf)
![Histogram of the Fisher-projected samples.](images/7-fisher-proj.pdf)
Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::
@ -237,9 +225,9 @@ Aerial and lateral views of the samples. Projection line in blu and cut in red.
This section is really a sidenote which grew too large to fit in a margin,
so it can be safely skipped.
It can be seen that the weight vector turned out to be parallel to the line
joining the means of the two classes (as a reminder: $(0, 0)$ and $(4, 4)$),
as if the within-class covariances were ignored. Weird!
Looking at @eq:fisher-weight, one can be misled into thinking that the inverse
of the total covariance matrix $\Sigma_w$ is (proportional to) the identity,
@ -257,7 +245,6 @@ $$
\sigma_{xy} & \sigma_x^2
\end{pmatrix}_2
$$
$\Sigma_w$ takes the symmetrical form
$$
\Sigma_w = \begin{pmatrix}
@ -288,7 +275,6 @@ $$
\tilde{B} & \tilde{A}
\end{pmatrix}
$$
For this reason, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors
$v_1$ and $v_2$:
$$
@ -296,7 +282,6 @@ $$
v_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
\et
v_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$
The vector joining the means is clearly a multiple of $v_2$, and so is $w$.
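Spelled out, since both rows of $\Sigma_w^{-1}$ contain the same entries
$\tilde{A}$ and $\tilde{B}$:
$$
\Sigma_w^{-1}
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
=
\begin{pmatrix}
  \tilde{A} & \tilde{B} \\
  \tilde{B} & \tilde{A}
\end{pmatrix}
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
= (\tilde{A} + \tilde{B})
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$
hence $w = \Sigma_w^{-1} (\mu_2 - \mu_1) \propto \Sigma_w^{-1} v_2 \propto v_2$,
which is exactly what was observed.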
@ -317,7 +302,6 @@ value: it is expected to return 1 for signal points and 0 for noise points:
$$
f(x) = \theta(w^T x + b)
$$ {#eq:perc}
where $\theta$ is the Heaviside theta function. Note that the bias $b$ is
$-t_\text{cut}$, as defined in the previous section.
@ -330,7 +314,6 @@ defined as:
$$
\Delta = r [e - f(x)]
$$
where:
- $r \in [0, 1]$ is the learning rate of the perceptron: the larger $r$, the
@ -340,13 +323,11 @@ where:
noise;
is used to update $b$ and $w$:
$$
b \to b + \Delta
\et
w \to w + \Delta x
$$
To see how it works, consider the four possible situations:
- $e = 1 \quad \wedge \quad f(x) = 1 \quad \dot \vee \quad e = 0 \quad \wedge
@ -369,7 +350,6 @@ $$
= w^T \cdot x + \Delta |x|^2
= w^T \cdot x - r |x|^2 \leq w^T \cdot x
$$
Similarly for the case with $e = 1$ and $f(x) = 0$.
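Putting @eq:perc and the update rule together, one training epoch amounts to
the sketch below (the plain-array layout and names are assumptions, not the
report's actual code):

```c
#include <stddef.h>

/* Sketch: one perceptron training epoch over n points.
 * x[i] are the 2D points, e[i] their expected outputs (1 signal, 0 noise). */
void perceptron_epoch(const double (*x)[2], const int *e, size_t n,
                      double w[2], double *b, double r) {
    for (size_t i = 0; i < n; i++) {
        double act = w[0]*x[i][0] + w[1]*x[i][1] + *b; /* w^T x + b         */
        int f = act > 0;                               /* f(x) = theta(act) */
        double delta = r * (e[i] - f);                 /* Delta = r [e - f] */
        *b   += delta;                                 /* b -> b + Delta    */
        w[0] += delta * x[i][0];                       /* w -> w + Delta x  */
        w[1] += delta * x[i][1];
    }
}
```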
![Weight vector and threshold value obtained with the perceptron method as a
@ -385,18 +365,14 @@ separable, it can be shown (see [@novikoff63]) that this method converges to
the coveted function. As in the previous section, once found, the weight
vector is to be normalized.
Different values of the learning rate were tested, all giving the same result
and converging after $N = 3$ iterations. In @fig:percep-iterations, results are
shown for $r = 0.8$: as can be seen, for $N = 3$, the values of $w$ and
$t_{\text{cut}}$ level off.
The following results were obtained:
$$
w = (-0.654, -0.756) \et t_{\text{cut}} = 1.213
$$
In this case, the projection line is not exactly parallel to the line joining
the means of the two samples. Plots are shown in @fig:percep-proj.
@ -404,7 +380,7 @@ the means of the two samples. Plots in @fig:percep-proj.
![Scatter plot of the samples.](images/7-percep-plane.pdf)
![Histogram of the projected samples.](images/7-percep-proj.pdf)
Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::
@ -419,18 +395,16 @@ false negative and false positive were obtained this way: for every noise point
$x_n$, the threshold function $f(x_n)$ was computed, then:
- if $f(x_n) = 0 \thus$ $N_{fn} \to N_{fn}$
- if $f(x_n) \neq 0 \thus$ $N_{fn} \to N_{fn} + 1$
and similarly for the positive points.
Finally, the mean and standard deviation were computed from $N_{fn}$ and
$N_{fp}$ for every sample and used to estimate the purity $\alpha$ and
efficiency $\beta$ of the classification:
$$
\alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s} \et
\beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$
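In code, these estimates reduce to two averages, as in the sketch below
(hypothetical names; the false counts of the $N_t$ samples are assumed to be
stored in plain arrays):

```c
#include <stddef.h>
#include <gsl/gsl_statistics_double.h>

/* Sketch: purity and efficiency from the false-classification counts. */
void purity_efficiency(const double *n_fn, const double *n_fp, size_t n_t,
                       double n_s, double n_n,
                       double *alpha, double *beta) {
    *alpha = 1 - gsl_stats_mean(n_fn, 1, n_t) / n_s;  /* 1 - mean(N_fn)/N_s */
    *beta  = 1 - gsl_stats_mean(n_fp, 1, n_t) / n_n;  /* 1 - mean(N_fp)/N_n */
}
```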
Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the Fisher
discriminant gives a nearly perfect classification with a symmetric distribution
of false negatives and false positives, whereas the perceptron shows a little more