
# Exercise 7

## Generating points according to Gaussian distributions {#sec:sampling}

Two sets of 2D points $(x, y)$ - signal and noise - are to be generated according
to two bivariate Gaussian distributions with parameters:
$$
\text{signal} \quad
\begin{cases}
\mu = (0, 0)              \\
\sigma_x = \sigma_y = 0.3 \\
\rho = 0.5
\end{cases}
\et
\text{noise} \quad
\begin{cases}
\mu = (4, 4)              \\
\sigma_x = \sigma_y = 1   \\
\rho = 0.4
\end{cases}
$$

where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ for the standard
deviations in $x$ and $y$ directions respectively and $\rho$ is the bivariate
correlation, namely:
$$
  \sigma_{xy} = \rho \sigma_x \sigma_y
$$

where $\sigma_{xy}$ is the covariance of $x$ and $y$.  
In the code, the default settings are $N_s = 800$ points for the signal and $N_n =
1000$ points for the noise, but both can be customized from the command line.
Both samples were handled as matrices of dimension $n \times 2$, where $n$ is the
number of points in the sample. The `gsl_matrix` type provided by GSL was
employed for this purpose and the function `gsl_ran_bivariate_gaussian()` was
used for generating the points.  
An example of the two samples is shown in @fig:points.

![Example of points sampled according to the two Gaussian distributions
with the given parameters.](images/7-points.pdf){#fig:points}
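
The sampling step can be sketched in a few lines. The following is a minimal
Python illustration, not the GSL-based C code used for the exercise: the
function name `sample_bivariate_gaussian` and the seed are assumptions, and the
correlation is obtained by mixing two independent standard normal variates.

```python
import math
import random


def sample_bivariate_gaussian(n, mu, sigma_x, sigma_y, rho, rng):
    """Return n points (x, y) drawn from a correlated bivariate Gaussian."""
    points = []
    for _ in range(n):
        # Two independent standard normal variates.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Mix them so that corr(x, y) = rho.
        x = mu[0] + sigma_x * z1
        y = mu[1] + sigma_y * (rho * z1 + math.sqrt(1.0 - rho**2) * z2)
        points.append((x, y))
    return points


rng = random.Random(42)  # fixed seed, chosen arbitrarily
signal = sample_bivariate_gaussian(800, (0.0, 0.0), 0.3, 0.3, 0.5, rng)
noise = sample_bivariate_gaussian(1000, (4.0, 4.0), 1.0, 1.0, 0.4, rng)
```

Each sample is a plain list of $(x, y)$ tuples, playing the role of the
$n \times 2$ matrix described above.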

Assuming that the way the points were generated is unknown, a classification
model is then to be implemented in order to assign each point to the class
(signal or noise) to which it 'most probably' belongs. The question is how
'most probably' can be interpreted and implemented.  
Here, the Fisher linear discriminant and the perceptron were implemented; they
are described in the following two sections. The results are compared in
@sec:7_results.


## Fisher linear discriminant


### The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It reduces this 2D classification problem to a
one-dimensional one, in which the decision surface is a single threshold.

Consider the case of two classes (in this case signal and noise): the simplest
representation of a linear discriminant is obtained by taking a linear function
$\hat{x}$ of a sampled 2D point $x$ so that:
$$
  \hat{x} = w^T x
$$

where $w$ is the so-called 'weight vector' and $w^T$ stands for its transpose.
An input point $x$ is commonly assigned to the first class if $\hat{x} \geqslant
w_{th}$ and to the second one otherwise, where $w_{th}$ is a threshold value
yet to be defined. In general, the projection onto one dimension leads to a
considerable loss of information, and classes that are well separated in the
original 2D space may become strongly overlapping in one dimension. However, by
adjusting the components of the weight vector, a projection that maximizes the
class separation can be selected [@bishop06].  
To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
$C_2$, so that the means $\mu_1$ and $\mu_2$ of the two classes are given by:
$$
  \mu_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
  \et
  \mu_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$

The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
  \hat{\mu}_2 - \hat{\mu}_1 = w^T (\mu_2 - \mu_1)
$$

This expression can be made arbitrarily large simply by increasing the magnitude
of $w$. To solve this problem, $w$ can be constrained to have unit length, so
that $|w|^2 = 1$. Using a Lagrange multiplier to perform the constrained
maximization, it can be found that $w \propto (\mu_2 - \mu_1)$, meaning that the
line onto which the points must be projected is the one joining the class means.  
There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means,
even though this line maximizes the distance between the projected means.

![The plot on the left shows samples from two classes along with the
histograms resulting from the projection onto the line joining the
class means: note that there is considerable overlap in the projected
space. The right plot shows the corresponding projection based on the
Fisher linear discriminant, showing the greatly improved class
separation. Figure from [@bishop06]](images/7-fisher.png){#fig:overlap}

The solution is to maximize a function that gives a large separation
between the projected class means while also giving a small variance within
each class, thereby minimizing the class overlap.  
The within-class variance of the transformed data of each class $k$ is given
by:
$$
  \hat{s}_k^2 = \sum_{n \in c_k} (\hat{x}_n - \hat{\mu}_k)^2
$$

The total within-class variance for the whole data set is simply defined as
$\hat{s}^2 = \hat{s}_1^2 + \hat{s}_2^2$. The Fisher criterion is defined to
be the ratio of the between-class distance to the within-class variance and is
given by:
$$
  F(w) = \frac{(\hat{\mu}_2 - \hat{\mu}_1)^2}{\hat{s}^2}
$$

The dependence on $w$ can be made explicit:
\begin{align*}
  (\hat{\mu}_2 - \hat{\mu}_1)^2 &= (w^T \mu_2 -  w^T \mu_1)^2             \\
                            &= [w^T (\mu_2 - \mu_1)]^2                    \\
                            &= [w^T (\mu_2 - \mu_1)][w^T (\mu_2 - \mu_1)] \\
                            &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w]
                             = w^T M w
\end{align*}

where $M = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$ is the between-class covariance
matrix. Similarly, as regards the denominator:
\begin{align*}
  \hat{s}^2 &= \hat{s}_1^2 + \hat{s}_2^2 \\
            &= \sum_{n \in c_1} (\hat{x}_n - \hat{\mu}_1)^2
             + \sum_{n \in c_2} (\hat{x}_n - \hat{\mu}_2)^2
             = w^T \Sigma_w w
\end{align*}

where $\Sigma_w$ is the total within-class covariance matrix:
\begin{align*}
  \Sigma_w &= \sum_{n \in c_1} (x_n - \mu_1)(x_n - \mu_1)^T
            + \sum_{n \in c_2} (x_n - \mu_2)(x_n - \mu_2)^T \\
           &= \Sigma_1 + \Sigma_2
            = \begin{pmatrix}
              \sigma_x^2  & \sigma_{xy} \\
              \sigma_{xy} & \sigma_y^2
              \end{pmatrix}_1 +
              \begin{pmatrix}
              \sigma_x^2  & \sigma_{xy} \\
              \sigma_{xy} & \sigma_y^2
              \end{pmatrix}_2
\end{align*}

where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the two samples
(up to the normalization by the number of points in each class).
The Fisher criterion can therefore be rewritten in the form:
$$
  F(w) = \frac{w^T M w}{w^T \Sigma_w w}
$$

Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
when:
$$
  w = \Sigma_w^{-1} (\mu_2 - \mu_1)
$$

This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then be
used to construct a discriminant by choosing a threshold for the
classification.
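
The differentiation step can be sketched as follows. This follows the standard
argument, with $\nabla_w$ denoting the gradient with respect to $w$ and
$M = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$ as implied by the expansion of the
numerator above:
$$
  \nabla_w F = 0
  \quad \Longrightarrow \quad
  (w^T \Sigma_w w) \, M w = (w^T M w) \, \Sigma_w w
$$

Since $M w = (\mu_2 - \mu_1) \, [(\mu_2 - \mu_1)^T w]$ always points along
$\mu_2 - \mu_1$ and the scalar factors do not affect the direction, it follows
that:
$$
  \Sigma_w w \propto \mu_2 - \mu_1
  \quad \Longrightarrow \quad
  w \propto \Sigma_w^{-1} (\mu_2 - \mu_1)
$$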

When implemented, the parameters given in @sec:sampling were used to compute
the covariance matrices and their sum $\Sigma_w$. Then $\Sigma_w$, being a
symmetrical and positive-definite matrix, was inverted with the Cholesky method,
already discussed in @sec:MLM. Lastly, the matrix-vector product was computed
with the `gsl_blas_dgemv()` function provided by GSL.
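
As an illustration, the direction can be computed for the parameters of
@sec:sampling with a short Python sketch. This is not the C implementation
described above: it replaces the Cholesky inversion with the explicit inverse
of a $2 \times 2$ matrix, and the helper names `covariance` and
`fisher_direction` are assumptions made for the example.

```python
import math


def covariance(sigma_x, sigma_y, rho):
    """2x2 covariance matrix from standard deviations and correlation."""
    s_xy = rho * sigma_x * sigma_y
    return [[sigma_x**2, s_xy], [s_xy, sigma_y**2]]


def fisher_direction(mu1, mu2, cov1, cov2):
    """Unit vector along Sigma_w^{-1} (mu2 - mu1) for the 2D case."""
    # Total within-class covariance Sigma_w = Sigma_1 + Sigma_2.
    a = cov1[0][0] + cov2[0][0]
    b = cov1[0][1] + cov2[0][1]
    c = cov1[1][0] + cov2[1][0]
    d = cov1[1][1] + cov2[1][1]
    det = a * d - b * c
    dx, dy = mu2[0] - mu1[0], mu2[1] - mu1[1]
    # Explicit 2x2 inverse applied to the difference of the means.
    wx = (d * dx - b * dy) / det
    wy = (-c * dx + a * dy) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)


w = fisher_direction((0.0, 0.0), (4.0, 4.0),
                     covariance(0.3, 0.3, 0.5), covariance(1.0, 1.0, 0.4))
print(w)  # both components equal 1/sqrt(2), about 0.707
```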


### The threshold

The threshold $t_{\text{cut}}$ was fixed by requiring the posterior
probability $P(c_k | t_{\text{cut}})$ to be the same for both classes $c_k$:
$$
  t_{\text{cut}} = x \, | \hspace{20pt}
 \frac{P(c_1 | x)}{P(c_2 | x)} = 
 \frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$

where $P(x | c_k)$ is the probability density for a point $x$ along the Fisher
projection line of being sampled according to the class $k$. If each class is a
bivariate Gaussian, as in the present case, then $P(x | c_k)$ is simply given
by its projected normal distribution with mean $\hat{\mu}_k = w^T \mu_k$ and
variance $\hat{s}_k^2 = w^T \Sigma_k w$, where $\Sigma_k$ is the covariance
matrix of the class.  
With a bit of math, the following solution can be found:
$$
  t_{\text{cut}} = \frac{b}{a}
                 + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$

where:

  - $a = \hat{s}_1^2 - \hat{s}_2^2$
  - $b = \hat{\mu}_2 \, \hat{s}_1^2 - \hat{\mu}_1 \, \hat{s}_2^2$
  - $c = \hat{\mu}_2^2 \, \hat{s}_1^2 - \hat{\mu}_1^2 \, \hat{s}_2^2
       - 2 \, \hat{s}_1^2 \, \hat{s}_2^2 \, \ln(\alpha)$
  - $\alpha = P(c_1) / P(c_2)$

The ratio of the prior probabilities $\alpha$ is simply given by:
$$
  \alpha = \frac{N_s}{N_n}
$$

The projection of the points was accomplished by the use of the function
`gsl_blas_ddot()`, which computes the scalar product of two vectors.
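
The projection and the parameters of the projected Gaussians can be sketched
in Python as follows. This is an illustration only, with hypothetical helper
names `dot` and `projected_gaussian`; it uses the generating parameters of
@sec:sampling and the normalized direction $w = (1, 1)/\sqrt{2}$.

```python
def dot(u, v):
    """Scalar product of two vectors of equal length."""
    return sum(ui * vi for ui, vi in zip(u, v))


def projected_gaussian(w, mu, cov):
    """Mean and variance of w^T x when x is Gaussian with (mu, cov)."""
    mean = dot(w, mu)
    # w^T cov w, written as w . (cov w).
    cov_w = [dot(row, w) for row in cov]
    var = dot(w, cov_w)
    return mean, var


w = (0.5**0.5, 0.5**0.5)
signal = projected_gaussian(w, (0.0, 0.0), [[0.09, 0.045], [0.045, 0.09]])
noise = projected_gaussian(w, (4.0, 4.0), [[1.0, 0.4], [0.4, 1.0]])
print(signal)  # mean 0, variance about 0.135
print(noise)   # mean about 5.657, variance about 1.4
```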

Results obtained for the same samples in @fig:points are shown in
@fig:fisher_proj. The weight vector and the threshold were found to be:
$$
  w = (0.707, 0.707) \et
  t_{\text{cut}} = 1.323
$$

<div id="fig:fisher_proj">
![View of the samples in the plane.](images/7-fisher-plane.pdf)
![View of the samples projections onto the projection
  line.](images/7-fisher-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
</div>

Since the vector $w$ turned out to be parallel to the line joining the means
of the two classes (recall that they are $(0, 0)$ and $(4, 4)$), one can be
misled into assuming that the inverse of the total covariance matrix $\Sigma_w$
is isotropic, namely proportional to the unit matrix.  
That is not the case. With these particular parameters, the vector joining the
means turns out to be an eigenvector of the inverse covariance matrix
$\Sigma_w^{-1}$. In fact, since $\sigma_x = \sigma_y$ for both signal and noise:
$$
  \Sigma_1 = \begin{pmatrix}
             \sigma_x^2  & \sigma_{xy} \\
             \sigma_{xy} & \sigma_x^2
             \end{pmatrix}_1
  \et
  \Sigma_2 = \begin{pmatrix}
             \sigma_x^2  & \sigma_{xy} \\
             \sigma_{xy} & \sigma_x^2
             \end{pmatrix}_2
$$

$\Sigma_w$ takes the form:
$$
  \Sigma_w = \begin{pmatrix}
             A & B \\
             B & A
             \end{pmatrix}
$$

which can be easily inverted by Gaussian elimination:
\begin{align*}
  \begin{pmatrix}
  A & B & \vline & 1 & 0 \\
  B & A & \vline & 0 & 1 \\
  \end{pmatrix} &\longrightarrow
  \begin{pmatrix}
  A^2 - B^2 & 0         & \vline & A   & - B \\
  0         & A^2 - B^2 & \vline & - B & A   \\
  \end{pmatrix}  \\ &\longrightarrow
  \begin{pmatrix}
  1 & 0 & \vline & A/(A^2 - B^2)   & - B/(A^2 - B^2) \\
  0 & 1 & \vline & - B/(A^2 - B^2) & A/(A^2 - B^2)   \\
  \end{pmatrix}
\end{align*}

Hence:
$$
  \Sigma_w^{-1} = \begin{pmatrix}
                   \tilde{A} & \tilde{B} \\
                   \tilde{B} & \tilde{A}
                   \end{pmatrix}
$$

with $\tilde{A} = A/(A^2 - B^2)$ and $\tilde{B} = -B/(A^2 - B^2)$.

Thus, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors $v_1$ and
$v_2$:
$$
  v_1 = \begin{pmatrix}
        1 \\
        -1
        \end{pmatrix} \et
  v_2 = \begin{pmatrix}
        1 \\
        1
        \end{pmatrix}
$$

and the vector joining the means is clearly a multiple of $v_2$, causing $w$ to
be a multiple of it.
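
As a quick check, the action of $\Sigma_w$ on $v_2$ can be written out
explicitly. The numerical values below are computed from the parameters in
@sec:sampling, namely $A = 0.09 + 1 = 1.09$ and $B = 0.045 + 0.4 = 0.445$:
\begin{align*}
  \Sigma_w v_2 &= \begin{pmatrix}
                  A & B \\
                  B & A
                  \end{pmatrix}
                  \begin{pmatrix}
                  1 \\
                  1
                  \end{pmatrix}
                = (A + B) \, v_2
  \quad \Longrightarrow \quad
  \Sigma_w^{-1} v_2 = \frac{1}{A + B} \, v_2 \\
  w &= \Sigma_w^{-1} (\mu_2 - \mu_1)
     = 4 \, \Sigma_w^{-1} v_2
     = \frac{4}{A + B} \, v_2
     \approx 2.606 \, v_2
\end{align*}

Once normalized, this gives $w = (0.707, 0.707)$, in agreement with the result
reported above.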


## Perceptron

In machine learning, the perceptron is an algorithm for supervised learning of
linear binary classifiers.

Supervised learning is the machine learning task of inferring a function $f$
that maps an input $x$ to an output $f(x)$ based on a set of training
input-output pairs, where each pair consists of an input object and an output
value. The inferred function can be used for mapping new examples: the algorithm
is generalized to correctly determine the class labels for unseen instances.

The aim of the perceptron algorithm is to determine the weight vector $w$ and
bias $b$ such that the so-called 'threshold function' $f(x)$ returns a binary
value: it is expected to return 1 for signal points and 0 for noise points:
$$
  f(x) = \theta(w^T \cdot x + b)
$$ {#eq:perc}

where $\theta$ is the Heaviside theta function.  
The training was performed using the generated sample as training set. From an
initial guess for $w$ and $b$ (which were all set to zero in the code), the
perceptron iteratively improves their estimates. The training set is passed
point by point into an iterative procedure a customizable number $N$ of times:
for every point, the output of $f(x)$ is computed. Afterwards, the variable
$\Delta$, which is defined as:
$$
  \Delta = r [e - f(x)]
$$

where:

  - $r \in [0, 1]$ is the learning rate of the perceptron: the larger $r$, the
    more volatile the weight changes. In the code it was arbitrarily set $r =
    0.8$;
  - $e$ is the expected output value, namely 1 if $x$ is signal and 0 if it is
    noise;

is used to update $b$ and $w$:

$$
  b \to b + \Delta
  \et
  w \to w + \Delta x
$$

To see how it works, consider the four possible situations:

  - $e = 1 \quad \wedge \quad f(x) = 1 \quad \dot \vee \quad e = 0 \quad \wedge
    \quad f(x) = 0  \quad \Longrightarrow \quad \Delta = 0$  
    the current estimates work properly: $b$ and $w$ do not need to be updated;
  - $e = 1 \quad \wedge \quad f(x) = 0 \quad \Longrightarrow \quad
    \Delta = r$  
    the current $b$ and $w$ underestimate the correct output: they must be
    increased;
  - $e = 0 \quad \wedge \quad f(x) = 1 \quad \Longrightarrow \quad
    \Delta = -r$  
    the current $b$ and $w$ overestimate the correct output: they must be
    decreased.

Whilst the update of $b$ is obvious, as regards $w$ the following consideration
may help clarify. Consider the case with $e = 0 \quad \wedge \quad f(x) = 1
\quad \Longrightarrow \quad \Delta = -r$:
$$
  w^T \cdot x \to (w^T + \Delta x^T) \cdot x
              = w^T \cdot x + \Delta |x|^2
              = w^T \cdot x - r |x|^2 \leq w^T \cdot x
$$

Similarly for the case with $e = 1$ and $f(x) = 0$.
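
The update rule can be sketched in Python as follows. This is an illustration
rather than the C code of the exercise: the function names, the convention
$\theta(0) = 0$ and the toy training set are all assumptions made for the
example.

```python
def heaviside(z):
    """Step function; the value at z = 0 is a convention (here 0)."""
    return 1 if z > 0 else 0


def train_perceptron(points, expected, rate=0.8, n_iter=5):
    """Return (w, b) after n_iter passes over the training set."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(n_iter):
        for x, e in zip(points, expected):
            f = heaviside(w[0] * x[0] + w[1] * x[1] + b)
            delta = rate * (e - f)
            # delta is zero when the point is already classified correctly,
            # so the updates below leave w and b unchanged in that case.
            b += delta
            w[0] += delta * x[0]
            w[1] += delta * x[1]
    return w, b


# Toy, linearly separable training set (made up for illustration):
# three "signal" points near the origin and three "noise" points near (4, 4).
points = [(0.0, 0.0), (0.2, -0.1), (-0.3, 0.2),
          (4.0, 4.0), (3.5, 4.2), (4.5, 3.0)]
expected = [1, 1, 1, 0, 0, 0]
w, b = train_perceptron(points, expected, rate=0.8, n_iter=10)
```

On this toy set the weights stop changing after a few passes, after which
every point is classified correctly.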

As far as convergence is concerned, the perceptron will never reach a state
in which all the input points are classified correctly if the training set is
not linearly separable, meaning that the signal cannot be separated from the
noise by a line in the plane. In this case, the procedure does not even
gradually approach an approximate solution. On the other hand, if the training
set is linearly separable, it can be shown that this method converges to the
desired function [@novikoff63].  
As in the previous section, once found, the weight vector is to be normalized.

With $N = 5$ iterations, the values of $w$ and $t_{\text{cut}}$ are stable up to
the third digit. The following results were obtained:

$$
  w = (0.654, 0.756) \et t_{\text{cut}} = 1.213
$$

where, once again, $t_{\text{cut}}$ is computed from the origin of the axes. In
this case, the projection line does not lie along the line joining the means of
the two samples. The plots are shown in @fig:percep_proj.

<div id="fig:percep_proj">
![View from above of the samples.](images/7-percep-plane.pdf){height=5.7cm}
![Gaussian of the samples on the projection
  line.](images/7-percep-proj.pdf){height=5.7cm}

Aerial and lateral views of the projection direction, in blue, and the cut, in
red.
</div>

## Efficiency test {#sec:7_results}

A program was implemented to check the validity of the two
classification methods.  
A number $N_t$ of test samples, with the same parameters as the training set,
is generated using an RNG and their points are classified as noise or signal by
both methods. At each iteration, false positives and negatives are recorded
using a running statistics method implemented in the `gsl_rstat` library, to
avoid storing large datasets in memory.  
In each sample, the numbers $N_{fn}$ and $N_{fp}$ of false negatives and false
positives are obtained in this way: for every noise point $x_n$ compute the
activation function $f(x_n)$ with the weight vector $w$ and the
$t_{\text{cut}}$, then:

  - if $f(x) < 0 \thus$ $N_{fn} \to N_{fn}$
  - if $f(x) > 0 \thus$ $N_{fn} \to N_{fn} + 1$ 

and similarly for the positive points.  
Finally, the mean and standard deviation are computed from $N_{fn}$ and
$N_{fp}$ of every sample and used to estimate purity $\alpha$
and efficiency $\beta$ of the classification:

$$
  \alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s} \et
  \beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$

Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the
Fisher discriminant gives a nearly perfect classification
with a symmetric distribution of false negatives and false positives,
whereas the perceptron shows slightly more false positives than
false negatives and is also more variable from dataset to dataset.  
A possible explanation of this fact is that, for linearly separable and
normally distributed points, the Fisher linear discriminant is an exact
analytical solution, whereas the perceptron is only expected to converge to the
solution and is thus more subject to random fluctuations.


-------------------------------------------------------------------------------------------
                  $\alpha$       $\sigma_{\alpha}$        $\beta$        $\sigma_{\beta}$
----------- ------------------- ------------------- ------------------- -------------------
Fisher       0.9999              0.33                0.9999              0.33

Perceptron   0.9999              0.28                0.9995              0.64
-------------------------------------------------------------------------------------------

Table: Results for the Fisher and perceptron methods. $\sigma_{\alpha}$ and
       $\sigma_{\beta}$ stand for the standard deviations of the numbers of
       false negatives and false positives respectively. {#tbl:res_comp}
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								# Exercise 7
-												ex-7: went on writing the FLD

											
										
										
											2020-04-02 23:35:36 +02:00
+								## Generating points according to Gaussian distributions {#sec:sampling}
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								Two sets of 2D points $(x, y)$ - signal and noise - is to be generated according
 								to two bivariate Gaussian distributions with parameters:
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								$$
 								\text{signal} \quad
 								\begin{cases}
 								\mu = (0, 0)              \\
 								\sigma_x = \sigma_y = 0.3 \\
 								\rho = 0.5
 								\end{cases}
 								\et
 								\text{noise} \quad
 								\begin{cases}
 								\mu = (4, 4)              \\
 								\sigma_x = \sigma_y = 1   \\
 								\rho = 0.4
 								\end{cases}
 								$$
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ for the standard
-												ex-7: went on writing the FLD

											
										
										
											2020-04-01 23:39:19 +02:00
+								deviations in $x$ and $y$ directions respectively and $\rho$ is the bivariate
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								correlation, namely:
-												ex-7: went on writing the FLD

											
										
										
											2020-04-01 23:39:19 +02:00
+								$$
 								  \sigma_{xy} = \rho \sigma_x \sigma_y
 								$$
 								where $\sigma_{xy}$ is the covariance of $x$ and $y$.
 								In the code, default settings are $N_s = 800$ points for the signal and $N_n =
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+$ points for the noise but can be customized from the input command-line.
 								Both samples were handled as matrices of dimension $n$ x 2, where $n$ is the
 								number of points in the sample. The library `gsl_matrix` provided by GSL was
 								employed for this purpose and the function `gsl_ran_bivariate_gaussian()` was
 								used for generating the points.
-												ex-7: Finished writing about perceptron

											
										
										
											2020-04-06 23:16:56 +02:00
+								An example of the two samples is shown in @fig:points.
-												ex-7: FLD terminated

											
										
										
											2020-04-03 23:28:29 +02:00
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								![Example of points sampled according to the two Gaussian distributions
 								with the given parameters.](images/7-points.pdf){#fig:points}
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
-												ex-7: went on writing the FLD

											
										
										
											2020-04-01 23:39:19 +02:00
+								Assuming not to know how the points were generated, a model of classification
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								is then to be implemented in order to assign each point to the right class
-												ex-7: went on writing the FLD

											
										
										
											2020-04-01 23:39:19 +02:00
+								(signal or noise) to which it 'most probably' belongs to. The point is how
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								'most probably' can be interpreted and implemented.
 								Here, the Fisher linear discriminant and the Perceptron were implemented and
 								described in the following two sections. The results are compared in
 								@sec:7_results.
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
 								## Fisher linear discriminant
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
-												ex-7: went on writing the FLD

											
										
										
											2020-04-02 23:35:36 +02:00
+								### The projection direction
-												ex-7: went on writing the FLD

											
										
										
											2020-04-01 23:39:19 +02:00
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								The Fisher linear discriminant (FLD) is a linear classification model based on
 								dimensionality reduction. It allows to reduce this 2D classification problem
 								into a one-dimensional decision surface.
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								Consider the case of two classes (in this case signal and noise): the simplest
 								representation of a linear discriminant is obtained by taking a linear function
 								$\hat{x}$ of a sampled 2D point $x$ so that:
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								$$
-												ex-7: went on writing the FLD

											
										
										
											2020-04-01 23:39:19 +02:00
+								  \hat{x} = w^T x
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								$$
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								where $w$ is the so-called 'weight vector' and $w^T$ stands for its transpose.
 								An input point $x$ is commonly assigned to the first class if $\hat{x} \geqslant
 								w_{th}$ and to the second one otherwise, where $w_{th}$ is a threshold value
 								somehow defined.   In general, the projection onto one dimension leads to a
 								considerable loss of information and classes that are well separated in the
 								original 2D space may become strongly overlapping in one dimension. However, by
 								adjusting the components of the weight vector, a projection that maximizes the
 								classes separation can be selected [@bishop06].
-												ex-7: went on writing the FLD

											
										
										
											2020-04-02 23:35:36 +02:00
+								To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								$C_2$, so that the means $\mu_1$ and $\mu_2$ of the two classes are given by:
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								$$
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								  \mu_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								  \et
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								  \mu_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								$$
 								The simplest measure of the separation of the classes is the separation of the
-												ex-7: went on writing the FLD

											
										
										
											2020-04-02 23:35:36 +02:00
+								projected class means. This suggests to choose $w$ so as to maximize:
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								$$
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								  \hat{\mu}_2 − \hat{\mu}_1 = w^T (\mu_2 − \mu_1)
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								$$
-												ex-7: went on writing the FLD

											
										
										
											2020-04-02 23:35:36 +02:00
+								This expression can be made arbitrarily large simply by increasing the magnitude
 								of $w$. To solve this problem, $w$ can be constrained to have unit length, so
 								that $| w^2 | = 1$. Using a Lagrange multiplier to perform the constrained
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								maximization, it can be found that $w \propto (\mu_2 − \mu_1)$, meaning that the
 								line onto the points must be projected is the one joining the class means.
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								There is still a problem with this approach, however, as illustrated in
 								@fig:overlap: the two classes are well separated in the original 2D space but
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								have considerable overlap when projected onto the line joining their means
 								which maximize their projections distance.
 								![The plot on the left shows samples from two classes along with the
 								histograms resulting fromthe  projection onto the line joining the
 								class means: note that there is considerable overlap in the projected
 								space. The right plot shows the corresponding projection based on the
 								Fisher linear discriminant, showing the greatly improved classes
 								separation. Fifure from [@bishop06]](images/7-fisher.png){#fig:overlap}
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								The idea to solve it is to maximize a function that will give a large separation
 								between the projected classes means while also giving a small variance within
 								each class, thereby minimizing the class overlap.
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								The within-class variance of the transformed data of each class $k$ is given
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								by:
 								$$
-												ex-7: revised and typo-fixed

In addition, the folder ex-7/iters was created in order to plot the results
of the Perceptron method as a function of the iterations parameter.

											
										
										
											2020-05-24 12:01:36 +02:00
+								  \hat{s}_k^2 = \sum_{n \in c_k} (\hat{x}_n - \hat{\mu}_k)^2
-												ex-7: started writing about the Fisher discriminant

											
										
										
											2020-03-31 23:37:49 +02:00
+								$$

The total within-class variance for the whole data set is simply defined as
$\hat{s}^2 = \hat{s}_1^2 + \hat{s}_2^2$. The Fisher criterion is defined to
be the ratio of the between-class distance to the within-class variance and is
given by:
$$
  F(w) = \frac{(\hat{\mu}_2 - \hat{\mu}_1)^2}{\hat{s}^2}
$$

The dependence on $w$ can be made explicit:
\begin{align*}
  (\hat{\mu}_2 - \hat{\mu}_1)^2 &= (w^T \mu_2 -  w^T \mu_1)^2             \\
                            &= [w^T (\mu_2 - \mu_1)]^2                    \\
                            &= [w^T (\mu_2 - \mu_1)][w^T (\mu_2 - \mu_1)] \\
                            &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w]
                             = w^T M w
\end{align*}

where $M = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$ is the between-class matrix.
Similarly, as regards the denominator:
\begin{align*}
  \hat{s}^2 &= \hat{s}_1^2 + \hat{s}_2^2                        \\
            &= \sum_{n \in c_1} (\hat{x}_n - \hat{\mu}_1)^2
             + \sum_{n \in c_2} (\hat{x}_n - \hat{\mu}_2)^2
             = w^T \Sigma_w w
\end{align*}

where $\Sigma_w$ is the total within-class covariance matrix:
\begin{align*}
  \Sigma_w &= \sum_{n \in c_1} (x_n - \mu_1)(x_n - \mu_1)^T
            + \sum_{n \in c_2} (x_n - \mu_2)(x_n - \mu_2)^T \\
           &= \Sigma_1 + \Sigma_2
            = \begin{pmatrix}
              \sigma_x^2  & \sigma_{xy} \\
              \sigma_{xy} & \sigma_y^2
              \end{pmatrix}_1 +
              \begin{pmatrix}
              \sigma_x^2  & \sigma_{xy} \\
              \sigma_{xy} & \sigma_y^2
              \end{pmatrix}_2
\end{align*}

where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the two samples.
The Fisher criterion can therefore be rewritten in the form:
$$
  F(w) = \frac{w^T M w}{w^T \Sigma_w w}
$$

Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
when:
$$
  w = \Sigma_w^{-1} (\mu_2 - \mu_1)
$$

This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then be
used to construct a discriminant by choosing a threshold for the
classification.
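
The direction derived above can be checked numerically. What follows is a minimal sketch in Python, not the C/GSL code of the exercise: it assumes the covariance matrices are built directly from the nominal parameters of @sec:sampling, it inverts the 2x2 matrix in closed form instead of using the Cholesky method, and all function names are hypothetical.

```python
import math

def covariance(sigma_x, sigma_y, rho):
    """2x2 covariance matrix of a bivariate Gaussian."""
    s_xy = rho * sigma_x * sigma_y
    return [[sigma_x**2, s_xy], [s_xy, sigma_y**2]]

def mat_add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def mat_vec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def inverse_2x2(m):
    # Closed-form inverse, used here in place of a Cholesky decomposition.
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[ m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det,  m[0][0] / det]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def fisher_criterion(w, delta_mu, sigma_w):
    """F(w) = [w^T (mu_2 - mu_1)]^2 / (w^T Sigma_w w)."""
    return dot(w, delta_mu)**2 / dot(w, mat_vec(sigma_w, w))

# Class 1 is the signal, class 2 is the noise (nominal parameters).
mu_1, mu_2 = (0.0, 0.0), (4.0, 4.0)
sigma_w = mat_add(covariance(0.3, 0.3, 0.5), covariance(1.0, 1.0, 0.4))
delta_mu = [mu_2[0] - mu_1[0], mu_2[1] - mu_1[1]]

# w = Sigma_w^{-1} (mu_2 - mu_1), then normalized to unit length.
w = mat_vec(inverse_2x2(sigma_w), delta_mu)
norm = math.hypot(w[0], w[1])
w = [w[0] / norm, w[1] / norm]
print(w, fisher_criterion(w, delta_mu, sigma_w))
```

Since $F(w)$ does not depend on the length of $w$, the normalization does not affect the value of the criterion; evaluating `fisher_criterion` on any other direction should give a smaller or equal value.
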

When implemented, the parameters given in @sec:sampling were used to compute
the covariance matrices and their sum $\Sigma_w$. Then $\Sigma_w$, being a
symmetric and positive-definite matrix, was inverted with the Cholesky method,
already discussed in @sec:MLM. Lastly, the matrix-vector product was computed
with the `gsl_blas_dgemv()` function provided by GSL.

### The threshold

The threshold $t_{\text{cut}}$ was fixed by requiring the conditional
probability $P(c_k | t_{\text{cut}})$ to be the same for both classes $c_k$:
$$
  t_{\text{cut}} = x \, | \hspace{20pt}
  \frac{P(c_1 | x)}{P(c_2 | x)} =
  \frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$

where $P(x | c_k)$ is the probability for point $x$ along the Fisher projection
line of being sampled according to the class $k$. If each class is a bivariate
Gaussian, as in the present case, then $P(x | c_k)$ is simply given by its
projected normal distribution with mean $\hat{m} = w^T m$ and variance
$\hat{s}^2 = w^T S w$, where $S$ is the covariance matrix of the class.
With a bit of math, the following solution can be found:
$$
  t_{\text{cut}} = \frac{b}{a}
                 + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$

where:

  - $a = \hat{s}_1^2 - \hat{s}_2^2$
  - $b = \hat{\mu}_2 \, \hat{s}_1^2 - \hat{\mu}_1 \, \hat{s}_2^2$
  - $c = \hat{\mu}_2^2 \, \hat{s}_1^2 - \hat{\mu}_1^2 \, \hat{s}_2^2
       - 2 \, \hat{s}_1^2 \, \hat{s}_2^2 \, \ln(\alpha)$
  - $\alpha = P(c_1) / P(c_2)$

The ratio of the prior probabilities $\alpha$ is simply given by:
$$
  \alpha = \frac{N_s}{N_n}
$$

The projection of the points was accomplished by the use of the function
`gsl_blas_ddot()`, which computes the scalar product of two vectors.
Results obtained for the same samples in @fig:points are shown in
@fig:fisher_proj. The weight vector and the threshold were found to be:
$$
  w = (0.707, 0.707) \et
  t_{\text{cut}} = 1.323
$$

<div id="fig:fisher_proj">
![View of the samples in the plane.](images/7-fisher-plane.pdf)
![View of the samples projections onto the projection
  line.](images/7-fisher-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
</div>
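
The projection and the cut can be sketched as follows. This is a minimal Python illustration rather than the GSL implementation: the helper names are hypothetical, the values of $w$ and $t_{\text{cut}}$ are the ones quoted in the text, and the choice of labelling as signal the points below the cut is an assumption based on the signal mean projecting onto 0 and the noise mean onto about 5.66.

```python
def project(point, w):
    """Scalar product w . x: coordinate of the point along the projection line."""
    return w[0] * point[0] + w[1] * point[1]

def classify(point, w, t_cut):
    """Label a point as signal if its projection falls below the cut (assumed side)."""
    return "signal" if project(point, w) < t_cut else "noise"

w = (0.707, 0.707)  # weight vector quoted in the text
t_cut = 1.323       # threshold quoted in the text

print(classify((0.0, 0.0), w, t_cut))  # nominal signal mean
print(classify((4.0, 4.0), w, t_cut))  # nominal noise mean
```
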

Since the vector $w$ turned out to be parallel to the line joining the means
of the two classes (recall that they are $(0, 0)$ and $(4, 4)$), one could be
misled into assuming that the inverse of the total covariance matrix
$\Sigma_w$ is isotropic, namely proportional to the unit matrix.
That is not the case. In this particular configuration, the vector joining the
means turns out to be an eigenvector of the matrix $\Sigma_w^{-1}$. In fact,
since $\sigma_x = \sigma_y$ for both signal and noise:
$$
  \Sigma_1 = \begin{pmatrix}
             \sigma_x^2  & \sigma_{xy} \\
             \sigma_{xy} & \sigma_x^2
             \end{pmatrix}_1
  \et
  \Sigma_2 = \begin{pmatrix}
             \sigma_x^2  & \sigma_{xy} \\
             \sigma_{xy} & \sigma_x^2
             \end{pmatrix}_2
$$

$\Sigma_w$ takes the form:
$$
  \Sigma_w = \begin{pmatrix}
             A & B \\
             B & A
             \end{pmatrix}
$$

which can be easily inverted by Gaussian elimination:
\begin{align*}
  \begin{pmatrix}
  A & B & \vline & 1 & 0 \\
  B & A & \vline & 0 & 1 \\
  \end{pmatrix} &\longrightarrow
  \begin{pmatrix}
  A & B         & \vline &  1 & 0 \\
  0 & A^2 - B^2 & \vline & -B & A \\
  \end{pmatrix}  \\ &\longrightarrow
  \begin{pmatrix}
  1 & 0 & \vline &  A/(A^2 - B^2) & -B/(A^2 - B^2) \\
  0 & 1 & \vline & -B/(A^2 - B^2) &  A/(A^2 - B^2) \\
  \end{pmatrix}
\end{align*}
Hence, with $\tilde{A} = A/(A^2 - B^2)$ and $\tilde{B} = -B/(A^2 - B^2)$:
$$
  \Sigma_w^{-1} = \begin{pmatrix}
                   \tilde{A} & \tilde{B} \\
                   \tilde{B} & \tilde{A}
                   \end{pmatrix}
$$
Thus, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors $v_1$ and
$v_2$:
$$
  v_1 = \begin{pmatrix}
        1 \\
        -1
        \end{pmatrix} \et
  v_2 = \begin{pmatrix}
        1 \\
        1
        \end{pmatrix}
$$
and the vector joining the means is clearly a multiple of $v_2$, causing $w$ to
be a multiple of it.
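
As a numerical check, assuming that $\Sigma_1$ and $\Sigma_2$ are built directly from the nominal parameters of @sec:sampling (rather than estimated from the sampled points), the argument can be worked out explicitly:
\begin{align*}
  A &= 0.3^2 + 1^2 = 1.09                                              \\
  B &= 0.5 \cdot 0.3^2 + 0.4 \cdot 1^2 = 0.445                         \\
  \Sigma_w v_2 &= (A + B) \, v_2 = 1.535 \, v_2                        \\
  w &= \Sigma_w^{-1} (\mu_2 - \mu_1) = \Sigma_w^{-1} \, 4 \, v_2
     = \frac{4}{1.535} \, v_2 \approx 2.606 \, v_2
\end{align*}
which, once normalized, gives the $w = (0.707, 0.707)$ quoted above.
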

## Perceptron

In machine learning, the perceptron is an algorithm for supervised learning of
linear binary classifiers.

Supervised learning is the machine learning task of inferring a function $f$
that maps an input $x$ to an output $f(x)$ based on a set of training
input-output pairs, where each pair consists of an input object and an output
value. The inferred function can be used for mapping new examples: the algorithm
is generalized to correctly determine the class labels for unseen instances.

The aim of the perceptron algorithm is to determine the weight vector $w$ and
bias $b$ such that the so-called 'threshold function' $f(x)$ returns a binary
value: it is expected to return 1 for signal points and 0 for noise points:
$$
  f(x) = \theta(w^T \cdot x + b)
$$ {#eq:perc}

where $\theta$ is the Heaviside theta function.
The training was performed using the generated sample as training set. From an
initial guess for $w$ and $b$ (both set to zero in the code), the
perceptron starts to improve their estimations. The training set is passed point
by point into an iterative procedure a customizable number $N$ of times: for
every point, the output of $f(x)$ is computed. Afterwards, the variable
$\Delta$, which is defined as:
$$
  \Delta = r [e - f(x)]
$$

where:

  - $r \in [0, 1]$ is the learning rate of the perceptron: the larger $r$, the
    more volatile the weight changes. In the code it was arbitrarily set $r =
    0.8$;
  - $e$ is the expected output value, namely 1 if $x$ is signal and 0 if it is
    noise;

is used to update $b$ and $w$:
$$
  b \to b + \Delta
  \et
  w \to w + \Delta x
$$

To see how it works, consider the four possible situations:

  - $e = 1 \quad \wedge \quad f(x) = 1 \quad \dot \vee \quad e = 0 \quad \wedge
    \quad f(x) = 0  \quad \Longrightarrow \quad \Delta = 0$
    the current estimations work properly: $b$ and $w$ do not need to be updated;
  - $e = 1 \quad \wedge \quad f(x) = 0 \quad \Longrightarrow \quad
    \Delta = r$
    the current $b$ and $w$ underestimate the correct output: they must be
    increased;
  - $e = 0 \quad \wedge \quad f(x) = 1 \quad \Longrightarrow \quad
    \Delta = -r$
    the current $b$ and $w$ overestimate the correct output: they must be
    decreased.

Whilst the $b$ updating is obvious, as regards $w$ the following consideration
may help clarify. Consider the case with $e = 0 \quad \wedge \quad f(x) = 1
\quad \Longrightarrow \quad \Delta = -r$:
$$
  w^T \cdot x \to (w^T + \Delta x^T) \cdot x
              = w^T \cdot x + \Delta |x|^2
              = w^T \cdot x - r |x|^2 \leq w^T \cdot x
$$

Similarly for the case with $e = 1$ and $f(x) = 0$.
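
The update rule described above can be sketched as follows. This is a minimal Python illustration, not the C code of the exercise: the function names and the tiny training set are made up for the example, and the value of the Heaviside function at zero is assumed to be 0.

```python
def heaviside(z):
    """Heaviside step; its value at z = 0 is taken to be 0 (an assumption)."""
    return 1 if z > 0 else 0

def train_perceptron(points, expected, rate=0.8, n_iter=5):
    """Pass the training set n_iter times, updating w and b point by point."""
    w = [0.0, 0.0]  # initial guess: all zero
    b = 0.0
    for _ in range(n_iter):
        for (x, y), e in zip(points, expected):
            f = heaviside(w[0] * x + w[1] * y + b)
            delta = rate * (e - f)  # 0 if correct, +rate or -rate otherwise
            b += delta
            w[0] += delta * x
            w[1] += delta * y
    return w, b

# Made-up, linearly separable training set: 1 = signal, 0 = noise.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, -0.5),
          (4.0, 4.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
expected = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = train_perceptron(points, expected, n_iter=100)
predicted = [heaviside(w[0] * x + w[1] * y + b) for x, y in points]
print(w, b, predicted == expected)
```

Once every point is classified correctly, $\Delta$ vanishes for all of them and further iterations leave $w$ and $b$ unchanged, so a large `n_iter` is harmless on a separable set.
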

As far as convergence is concerned, the perceptron will never get to the state
with all the input points classified correctly if the training set is not
linearly separable, meaning that the signal cannot be separated from the noise
by a line in the plane. In this case, no approximate solutions will be gradually
approached. On the other hand, if the training set is linearly separable, it can
be shown that this method converges to the desired function [@novikoff63].
As in the previous section, once found, the weight vector is to be normalized.

With $N = 5$ iterations, the values of $w$ and $t_{\text{cut}}$ level off up to
the third digit. The following results were obtained:
$$
  w = (0.654, 0.756) \et t_{\text{cut}} = 1.213
$$

where, once again, $t_{\text{cut}}$ is computed from the origin of the axes. In
this case, the projection line does not lie along the line joining the means of
the two samples. The plots are shown in @fig:percep_proj.

<div id="fig:percep_proj">
![View from above of the samples.](images/7-percep-plane.pdf){height=5.7cm}
![Gaussian of the samples on the projection
  line.](images/7-percep-proj.pdf){height=5.7cm}

Aerial and lateral views of the projection direction, in blue, and the cut, in
red.
</div>

## Efficiency test {#sec:7_results}

A program was implemented to check the validity of the two
classification methods.
A number $N_t$ of test samples, with the same parameters as the training set,
is generated using an RNG and their points are divided into noise/signal by
both methods. At each iteration, false positives and negatives are recorded
using a running statistics method implemented in the `gsl_rstat` library, to
avoid storing large datasets in memory.
In each sample, the numbers $N_{fn}$ and $N_{fp}$ of false negatives and false
positives are obtained in this way: for every noise point $x_n$ compute the
activation function $f(x_n)$ with the weight vector $w$ and the
$t_{\text{cut}}$, then:

  - if $f(x) < 0 \thus$ $N_{fn} \to N_{fn}$
  - if $f(x) > 0 \thus$ $N_{fn} \to N_{fn} + 1$

and similarly for the signal points.
Finally, the mean and standard deviation are computed from $N_{fn}$ and
$N_{fp}$ of every sample and used to estimate purity $\alpha$
and efficiency $\beta$ of the classification:
$$
  \alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s} \et
  \beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$
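
The bookkeeping can be illustrated with a running mean and standard deviation that never store the individual counts. The sketch below uses Welford's algorithm in Python as a stand-in for this kind of running statistics; it is an assumption, not a description of how `gsl_rstat` works internally, and the class name, the helper `rate` and the list of counts are made up for the example.

```python
import math

class RunningStats:
    """Running mean and sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    def sd(self):
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

def rate(stats, n_points):
    """1 - mean(misclassified) / n_points, as in the formulas above."""
    return 1.0 - stats.mean / n_points

false_neg = RunningStats()
for count in [0, 1, 0, 0, 1]:  # made-up N_fn, one value per test sample
    false_neg.add(count)

alpha = rate(false_neg, 800)   # N_s = 800 is the default quoted in the text
print(false_neg.mean, false_neg.sd(), alpha)
```
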

Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the
Fisher discriminant gives a nearly perfect classification
with a symmetric distribution of false negatives and false positives,
whereas the perceptron shows slightly more false positives than
false negatives, being also more variable from dataset to dataset.
A possible explanation of this fact is that, for linearly separable and
normally distributed points, the Fisher linear discriminant is an exact
analytical solution, whereas the perceptron is only expected to converge to the
solution and is thus more subject to random fluctuations.

-------------------------------------------------------------------------------------------
                  $\alpha$       $\sigma_{\alpha}$        $\beta$        $\sigma_{\beta}$
----------- ------------------- ------------------- ------------------- -------------------
Fisher       0.9999              0.33                0.9999              0.33

Perceptron   0.9999              0.28                0.9995              0.64
-------------------------------------------------------------------------------------------

Table: Results for Fisher and perceptron method. $\sigma_{\alpha}$ and
       $\sigma_{\beta}$ stand for the standard deviation of the false
       negative and false positive respectively. {#tbl:res_comp}