correlation, namely:
$$
\sigma_{xy} = \rho\, \sigma_x \sigma_y
$$

where $\sigma_{xy}$ is the covariance of $x$ and $y$.

In the programs, $N_s = 800$ points for the signal and $N_n = 1000$ points for
the noise were chosen as default but can be customized from the command line.
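As an illustration, a minimal sketch of how such correlated samples can be
drawn with GSL follows. Only the class means $(0, 0)$ and $(4, 4)$ and the
default counts match the samples described here; the widths, the correlation
value and the helper `sample_class()` are illustrative assumptions, not the
program's actual code.

```c
#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

/* Draw N points from a bivariate Gaussian centred in (cx, cy) and print
 * them. sigma_x, sigma_y and rho are illustrative values: recall that
 * sigma_xy = rho * sigma_x * sigma_y.                                   */
void sample_class(gsl_rng *r, size_t N, double cx, double cy,
                  double sigma_x, double sigma_y, double rho) {
    for (size_t n = 0; n < N; n++) {
        double dx, dy;
        gsl_ran_bivariate_gaussian(r, sigma_x, sigma_y, rho, &dx, &dy);
        printf("%g %g\n", cx + dx, cy + dy);
    }
}

int main(void) {
    gsl_rng_env_setup();
    gsl_rng *r = gsl_rng_alloc(gsl_rng_default);
    sample_class(r,  800, 0, 0, 1, 1, 0.5);    /* N_s = 800 points  */
    sample_class(r, 1000, 4, 4, 1, 1, 0.5);    /* N_n = 1000 points */
    gsl_rng_free(r);
    return 0;
}
```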

with the given parameters.](images/7-points.pdf){#fig:points}

Assuming no knowledge of how the points were generated, a classification model,
which assigns to each point the 'most probable' class it belongs to, was
implemented. Depending on the interpretation of 'most probable', many different
models can be developed. Here, the Fisher linear discriminant and the
perceptron were implemented and described in the following two sections. The
results are compared in @sec:class-results.

## Fisher linear discriminant

$$
\mu_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
\et
\mu_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$

The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
\tilde{\mu}_2 - \tilde{\mu}_1 = w^T (\mu_2 - \mu_1)
$$

This expression can be made arbitrarily large simply by increasing the
magnitude of $w$; fortunately, the problem is easily solved by requiring $w$
to be normalised: $|w|^2 = 1$. Using a Lagrange multiplier to perform the

within-class variance of the transformed data of each class $k$ is given by:
$$
\tilde{s}_k^2 = \sum_{n \in c_k} (\tilde{x}_n - \tilde{\mu}_k)^2
$$

The total within-class variance for the whole data set is simply defined as
$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2$. The Fisher criterion is therefore
defined to be the ratio of the between-class distance to the within-class
variance and is given by:
$$
F(w) = \frac{(\tilde{\mu}_2 - \tilde{\mu}_1)^2}{\tilde{s}^2}
$$

The dependence on $w$ can be made explicit:
\begin{align*}
  (\tilde{\mu}_2 - \tilde{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
  &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w]
  = w^T M w
\end{align*}

where $M$ is the between-class distance matrix. Similarly, for the denominator:
\begin{align*}
  \tilde{s}^2 &= \tilde{s}_1^2 + \tilde{s}_2^2 \\
              &= \sum_{n \in c_1} (\tilde{x}_n - \tilde{\mu}_1)^2
               + \sum_{n \in c_2} (\tilde{x}_n - \tilde{\mu}_2)^2
              = w^T \Sigma_w w
\end{align*}

where $\Sigma_w$ is the total within-class covariance matrix:
\begin{align*}
  \Sigma_w &= \sum_{n \in c_1} (x_n - \mu_1)(x_n - \mu_1)^T
            + \sum_{n \in c_2} (x_n - \mu_2)(x_n - \mu_2)^T \\
           &= \begin{pmatrix}
                \sigma_x^2  & \sigma_{xy} \\
                \sigma_{xy} & \sigma_y^2
              \end{pmatrix}_1
            + \begin{pmatrix}
                \sigma_x^2  & \sigma_{xy} \\
                \sigma_{xy} & \sigma_y^2
              \end{pmatrix}_2
\end{align*}

where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the two samples.
The Fisher criterion can therefore be rewritten in the form:
$$
F(w) = \frac{w^T M w}{w^T \Sigma_w w}
$$

Differentiating with respect to $w$ shows that $F(w)$ is maximized when:
$$
w = \Sigma_w^{-1} (\mu_2 - \mu_1)
$$ {#eq:fisher-weight}

This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then be
used to construct a discriminant by choosing a threshold for the

with the `gsl_blas_dgemv()` function provided by GSL.

The threshold $t_{\text{cut}}$ was fixed by requiring the conditional
probability $P(c_k | t_{\text{cut}})$ to be the same for both classes $c_k$:
$$
t_{\text{cut}} = x \quad \text{such that} \quad
\frac{P(c_1 | x)}{P(c_2 | x)} =
\frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$

where $P(x | c_k)$ is the probability that a point $x$ along the Fisher
projection line was sampled from class $c_k$. If $\tilde{x} > t_\text{cut}$,
then $x$ most likely belongs to $c_1$; otherwise, to $c_2$.
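As a sketch of how @eq:fisher-weight can be computed in practice, the small
linear system below is solved with GSL's LU routines instead of inverting
$\Sigma_w$ explicitly; `fisher_weight()` and its signature are illustrative
assumptions, not the actual implementation.

```c
#include <gsl/gsl_blas.h>
#include <gsl/gsl_linalg.h>

/* Solve Sigma_w * w = mu_2 - mu_1 (i.e. w = Sigma_w^{-1} (mu_2 - mu_1))
 * and normalise w to unit length.                                       */
void fisher_weight(const gsl_matrix *sigma_w, const gsl_vector *mu1,
                   const gsl_vector *mu2, gsl_vector *w) {
    gsl_matrix *lu = gsl_matrix_alloc(2, 2);
    gsl_permutation *p = gsl_permutation_alloc(2);
    gsl_vector *diff = gsl_vector_alloc(2);
    int signum;

    gsl_vector_memcpy(diff, mu2);
    gsl_vector_sub(diff, mu1);               /* diff = mu_2 - mu_1    */
    gsl_matrix_memcpy(lu, sigma_w);
    gsl_linalg_LU_decomp(lu, p, &signum);
    gsl_linalg_LU_solve(lu, p, diff, w);     /* w = Sigma_w^{-1} diff */
    gsl_vector_scale(w, 1.0 / gsl_blas_dnrm2(w));

    gsl_vector_free(diff);
    gsl_permutation_free(p);
    gsl_matrix_free(lu);
}
```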

simply be set to:
$$
\alpha = \frac{N_s}{N_n}
$$
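One way the equal-posterior condition could be solved numerically is to look
for the zero of the posterior difference between the two projected means; a
minimal sketch, assuming the projected classes are modelled as 1D Gaussians,
with illustrative names:

```c
#include <gsl/gsl_randist.h>

/* Means and widths of the two projected classes plus the prior ratio. */
typedef struct { double m1, s1, m2, s2, alpha; } cut_params;

/* P(x|c_1) P(c_1) - P(x|c_2) P(c_2), up to a common normalisation:
 * its zero between the projected means is t_cut.                      */
double posterior_diff(double x, void *params) {
    cut_params *p = (cut_params *)params;
    return p->alpha * gsl_ran_gaussian_pdf(x - p->m1, p->s1)
                    - gsl_ran_gaussian_pdf(x - p->m2, p->s2);
}
```

Wrapped in a `gsl_function`, `posterior_diff()` can be bracketed between the
projected means and solved with, for example, `gsl_root_fsolver_brent`.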

The projection of the points was accomplished with the function
`gsl_blas_ddot()`, which computes the dot product of two vectors.
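For a single point, the projection and cut then reduce to the following sketch
(the storage layout and the names are illustrative assumptions):

```c
#include <gsl/gsl_blas.h>
#include <gsl/gsl_matrix.h>

/* Project the i-th point (a row of `points`) onto w and apply the cut:
 * returns 1 when the projection falls on the c_1 side of t_cut.        */
int classify(const gsl_matrix *points, size_t i,
             const gsl_vector *w, double t_cut) {
    double proj;
    gsl_vector_const_view x = gsl_matrix_const_row(points, i);
    gsl_blas_ddot(w, &x.vector, &proj);      /* proj = w^T x */
    return proj > t_cut;
}
```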

![Scatter plot of the samples.](images/7-fisher-plane.pdf)
![Histogram of the Fisher-projected samples.](images/7-fisher-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::

This section is really a sidenote which grew too large to fit in a margin,
so it can be safely skipped.

It can be seen that the weight vector turned out to be parallel to the line
joining the means of the two classes (as a reminder: $(0, 0)$ and $(4, 4)$),
as if the within-class covariances were ignored. Weird!

Looking at @eq:fisher-weight, one can be misled into thinking that the inverse
of the total covariance matrix, $\Sigma_w$, is (proportional to) the identity,
$$
\begin{pmatrix}
  \sigma_y^2  & \sigma_{xy} \\
  \sigma_{xy} & \sigma_x^2
\end{pmatrix}_2
$$

$\Sigma_w$ takes the symmetric form
$$
\Sigma_w = \begin{pmatrix}
  A & B \\
  B & A
\end{pmatrix}
$$

and the same structure is shared by its inverse (by the standard $2 \times 2$
inversion formula, $\tilde{A} = A/(A^2 - B^2)$ and $\tilde{B} = -B/(A^2 - B^2)$):
$$
\Sigma_w^{-1} = \begin{pmatrix}
  \tilde{A} & \tilde{B} \\
  \tilde{B} & \tilde{A}
\end{pmatrix}
$$

For this reason, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors
$v_1$ and $v_2$:
$$
v_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
\et
v_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$
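In fact, writing $a$ for the common diagonal entry and $b$ for the off-diagonal
one, any matrix with this structure satisfies:
$$
\begin{pmatrix} a & b \\ b & a \end{pmatrix}
\begin{pmatrix} 1 \\ \pm 1 \end{pmatrix}
= (a \pm b) \begin{pmatrix} 1 \\ \pm 1 \end{pmatrix}
$$

so $v_1$ and $v_2$ are eigenvectors whatever the actual entries are.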

The vector joining the means is clearly a multiple of $v_2$, and so is $w$.

value: it is expected to return 1 for signal points and 0 for noise points:
$$
f(x) = \theta(w^T x + b)
$$ {#eq:perc}

where $\theta$ is the Heaviside theta function. Note that the bias $b$ is
$-t_\text{cut}$, as defined in the previous section.

defined as:
$$
\Delta = r [e - f(x)]
$$

where:

- $r \in [0, 1]$ is the learning rate of the perceptron: the larger $r$, the
  faster the learning;
- $e$ is the expected output of the point: 1 for the signal and 0 for the
  noise;

is used to update $b$ and $w$:

$$
b \to b + \Delta
\et
w \to w + \Delta x
$$

To see how it works, consider the four possible situations:

- $e = 1 \quad \wedge \quad f(x) = 1 \quad \dot \vee \quad e = 0 \quad \wedge
$$
= w^T x + \Delta |x|^2
= w^T x - |x|^2 \leq w^T x
$$

Similarly for the case with $e = 1$ and $f(x) = 0$.

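A minimal sketch of one training pass implementing these updates follows; the
data layout and the names are illustrative assumptions, not the actual
implementation.

```c
#include <stddef.h>

typedef struct { double x, y; } point_t;

/* One pass of the perceptron rule: for every point compute
 * f(x) = theta(w^T x + b), then apply b -> b + Delta and w -> w + Delta x
 * with Delta = r * (e - f(x)). Correctly classified points leave w and b
 * untouched, since Delta = 0 for them.                                   */
void percep_pass(const point_t *pts, const int *e, size_t N,
                 double w[2], double *b, double r) {
    for (size_t i = 0; i < N; i++) {
        int f = (w[0] * pts[i].x + w[1] * pts[i].y + *b) > 0;
        double delta = r * (e[i] - f);
        *b   += delta;
        w[0] += delta * pts[i].x;
        w[1] += delta * pts[i].y;
    }
}
```
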
![Weight vector and threshold value obtained with the perceptron method as a
separable, it can be shown (see [@novikoff63]) that this method converges to
the coveted function. As in the previous section, once found, the weight
vector is to be normalized.

Different values of the learning rate were tested, all giving the same result
and converging after $N = 3$ iterations. In @fig:percep-iterations,
results are shown for $r = 0.8$: as can be seen, for $N = 3$, the values of $w$
and $t_{\text{cut}}$ level off.
The following results were obtained:
$$
w = (-0.654, -0.756) \et t_{\text{cut}} = 1.213
$$

In this case, the projection line is not exactly parallel with the line joining
the means of the two samples. Plots are shown in @fig:percep-proj.


![Scatter plot of the samples.](images/7-percep-plane.pdf)
![Histogram of the projected samples.](images/7-percep-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::

false negative and false positive were obtained this way: for every noise point
$x_n$, the threshold function $f(x_n)$ was computed, then:

- if $f(x_n) = 0 \thus$ $N_{fp} \to N_{fp}$
- if $f(x_n) \neq 0 \thus$ $N_{fp} \to N_{fp} + 1$

and similarly for the positive points.
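A sketch of this counting step, assuming the projections of one sample were
already computed (names are illustrative):

```c
#include <stddef.h>

/* Count the points of one class falling on the wrong side of the cut;
 * `expected` is 1 for signal points and 0 for noise points.            */
size_t count_misclassified(const double proj[], size_t N,
                           double t_cut, int expected) {
    size_t n = 0;
    for (size_t i = 0; i < N; i++)
        if ((proj[i] > t_cut) != expected)   /* f(x_n) != expected value */
            n++;
    return n;
}
```

Called with the noise projections it yields $N_{fp}$; with the signal
projections, $N_{fn}$.
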
Finally, the mean and standard deviation were computed from $N_{fn}$ and
$N_{fp}$ for every sample and used to estimate the purity $\alpha$ and
efficiency $\beta$ of the classification:

$$
\alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s} \et
\beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$

Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the Fisher
discriminant gives a nearly perfect classification with a symmetric
distribution of false negatives and false positives, whereas the perceptron
shows a little more