ex-7: review

parent ed73ea5c8b
commit e7a5185976

@@ -1,51 +1,50 @@

# Exercise 7

## Generating random points on the plane {#sec:sampling}

Two sets of 2D points $(x, y)$ --- signal and noise --- are to be generated
according to two bivariate Gaussian distributions with parameters:
\begin{align*}
\text{signal}\:
\begin{cases}
\mu = (0, 0) \\
\sigma_x = \sigma_y = 0.3 \\
\rho = 0.5
\end{cases}
&&
\text{noise}\:
\begin{cases}
\mu = (4, 4) \\
\sigma_x = \sigma_y = 1 \\
\rho = 0.4
\end{cases}
\end{align*}
where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ for the standard
deviations in the $x$ and $y$ directions respectively, and $\rho$ is the
bivariate correlation, namely:
$$
\sigma_{xy} = \rho\, \sigma_x \sigma_y
$$

where $\sigma_{xy}$ is the covariance of $x$ and $y$.
In the programs, $N_s = 800$ points for the signal and $N_n = 1000$ points for
the noise were chosen by default, but both can be customized from the command
line. Both samples were stored as $n \times 2$ matrices, where $n$ is the
number of points in the sample. The library `gsl_matrix` provided by GSL was
employed for this purpose and the function `gsl_ran_bivariate_gaussian()` was
used for generating the points.
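
As a rough sketch (not the actual source of the exercise), a sample could be
generated along these lines, where `generate_sample()` is a hypothetical
helper:

```c
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_matrix.h>

/* Fill an n x 2 matrix with points drawn from a bivariate Gaussian with
 * means (mux, muy), standard deviations sx, sy and correlation rho.
 * gsl_ran_bivariate_gaussian() returns zero-mean deviates, so the means
 * are added afterwards. */
gsl_matrix *generate_sample(gsl_rng *r, size_t n,
                            double mux, double muy,
                            double sx, double sy, double rho) {
  gsl_matrix *sample = gsl_matrix_alloc(n, 2);
  for (size_t i = 0; i < n; i++) {
    double dx, dy;
    gsl_ran_bivariate_gaussian(r, sx, sy, rho, &dx, &dy);
    gsl_matrix_set(sample, i, 0, mux + dx);
    gsl_matrix_set(sample, i, 1, muy + dy);
  }
  return sample;
}
```

With the default settings this would be called as
`generate_sample(r, 800, 0, 0, 0.3, 0.3, 0.5)` for the signal and
`generate_sample(r, 1000, 4, 4, 1, 1, 0.4)` for the noise.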

An example of the two samples is shown in @fig:points.

![Example of points sampled according to the two Gaussian distributions
with the given parameters.](images/7-points.pdf){#fig:points}

Assuming not to know how the points were generated, a classification model,
which assigns to each point the class it 'most probably' belongs to, is then
implemented. Depending on the interpretation of 'most probable', many different
models can be developed.
Here, the Fisher linear discriminant and the perceptron were implemented and
described in the following two sections. The results are compared in
@sec:class-results.


## Fisher linear discriminant

### The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It does so by projecting the data onto hyperplanes
that best divide the classes of points, consequently decreasing the dimension
to $n-1$. In the 2D case the projection is onto a line, therefore the problem
is reduced to simply selecting a threshold.

Consider the case of two classes (in this case signal and noise): the simplest
representation of a linear discriminant is obtained by taking a linear function
$\tilde{x}$ of a sampled 2D point $x$ so that:
$$
\tilde{x} = w^T x
$$

where $w$ is the so-called 'weight vector' and $w^T$ stands for its transpose.
An input point $x$ is commonly assigned to the first class if $\tilde{x} >
w_{th}$ and to the second one otherwise, where $w_{th}$ is a threshold value
somehow defined. In general, the projection onto one dimension leads to a
considerable loss of information and classes that are well separated in the
original 2D space may become strongly overlapping in one dimension. However, by
adjusting the components of the weight vector, a projection that maximizes the
class separation can be found [@bishop06].

To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
$C_2$, so that the means $\mu_1$ and $\mu_2$ of the two classes are given by:
$$
\mu_1 = \frac{1}{N_1} \sum_{n \in c_1} x_n
\et
\mu_2 = \frac{1}{N_2} \sum_{n \in c_2} x_n
$$

The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
\tilde{\mu}_2 - \tilde{\mu}_1 = w^T (\mu_2 - \mu_1)
$$

This expression can be made arbitrarily large simply by increasing the
magnitude of $w$; fortunately, the problem is easily solved by requiring $w$
to be normalised: $\| w \|^2 = 1$. Using a Lagrange multiplier to perform the
constrained maximization, it can be found that $w \propto (\mu_2 - \mu_1)$,
meaning that the line onto which the points must be projected is the one
joining the class means.

There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means,
which maximizes the distance between the projected means.

![The plot on the left shows samples from two classes along with the
histograms resulting from the projection onto the line joining the
class means: note the considerable overlap in the projected
space. The right plot shows the corresponding projection based on the
Fisher linear discriminant, showing the greatly improved class
separation. Figure taken from [@bishop06].](images/7-fisher.png){#fig:overlap}

The overlap of the projections can be reduced by maximising a function that
gives, besides a large separation, a small variance within each class. The
within-class variance of the transformed data of each class $k$ is given by:
$$
\tilde{s}_k^2 = \sum_{n \in c_k} (\tilde{x}_n - \tilde{\mu}_k)^2
$$

The total within-class variance for the whole data set is simply defined as
$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2$. The Fisher criterion is therefore
defined to be the ratio of the between-class distance to the within-class
variance and is given by:
$$
F(w) = \frac{(\tilde{\mu}_2 - \tilde{\mu}_1)^2}{\tilde{s}^2}
$$

The dependence on $w$ can be made explicit:
\begin{align*}
(\tilde{\mu}_2 - \tilde{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
  &= [w^T (\mu_2 - \mu_1)]^2 \\
  &= [w^T (\mu_2 - \mu_1)][w^T (\mu_2 - \mu_1)] \\
  &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w]
   = w^T M w
\end{align*}

where $M$ is the between-distance matrix. Similarly, as regards the denominator:
\begin{align*}
\tilde{s}^2 &= \tilde{s}_1^2 + \tilde{s}_2^2 \\
  &= \sum_{n \in c_1} (\tilde{x}_n - \tilde{\mu}_1)^2
   + \sum_{n \in c_2} (\tilde{x}_n - \tilde{\mu}_2)^2
   = w^T \Sigma_w w
\end{align*}

@@ -162,7 +163,7 @@ Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
when:
$$
w = \Sigma_w^{-1} (\mu_2 - \mu_1)
$$ {#eq:fisher-weight}

This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then be

@@ -178,63 +179,73 @@ with the `gsl_blas_dgemv()` function provided by GSL.
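
As an illustration (not the exercise's actual code), computing the direction
in @eq:fisher-weight with `gsl_blas_dgemv()` could reduce to a few BLAS calls;
the helper below assumes a $2 \times 2$ matrix `sigma_inv` already holding
$\Sigma_w^{-1}$, and all names are made up:

```c
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>

/* w = Sigma_w^-1 (mu2 - mu1), normalised to unit length.
 * sigma_inv must already contain the inverted within-class
 * covariance matrix. */
void fisher_direction(const gsl_matrix *sigma_inv,
                      const gsl_vector *mu1, const gsl_vector *mu2,
                      gsl_vector *w) {
  gsl_vector *diff = gsl_vector_alloc(2);
  gsl_vector_memcpy(diff, mu2);
  gsl_vector_sub(diff, mu1);                     /* diff = mu2 - mu1    */
  gsl_blas_dgemv(CblasNoTrans, 1.0, sigma_inv,
                 diff, 0.0, w);                  /* w = Sigma_w^-1 diff */
  gsl_vector_scale(w, 1.0 / gsl_blas_dnrm2(w));  /* |w| = 1             */
  gsl_vector_free(diff);
}
```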

### The threshold

The threshold $t_{\text{cut}}$ was fixed by requiring the conditional
probability $P(c_k | t_{\text{cut}})$ to be the same for both classes $c_k$:
$$
t_{\text{cut}} = x \text{ such that}\quad
\frac{P(c_1 | x)}{P(c_2 | x)} =
\frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$

where $P(x | c_k)$ is the probability for point $x$ along the Fisher projection
line of being sampled from class $c_k$. If $\tilde{x} > t_\text{cut}$, then $x$
more likely belongs to $c_1$, otherwise to $c_2$.

If each class is a bivariate Gaussian distribution, as in the present case,
then $P(x | c_k)$ is simply given by its projected normal distribution with
mean $\tilde{m} = w^T m$ and variance $\tilde{s}^2 = w^T S w$, where $S$ is the
covariance matrix of the class.
After some algebra, the threshold is found to be:
$$
t_{\text{cut}} = \frac{b}{a}
               + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$

where:

- $a = \tilde{s}_1^2 - \tilde{s}_2^2$
- $b = \tilde{\mu}_2 \, \tilde{s}_1^2 - \tilde{\mu}_1 \, \tilde{s}_2^2$
- $c = \tilde{\mu}_2^2 \, \tilde{s}_1^2 - \tilde{\mu}_1^2 \, \tilde{s}_2^2 - 2 \, \tilde{s}_1^2 \, \tilde{s}_2^2 \, \ln(\alpha)$
- $\alpha = P(c_1) / P(c_2)$

In a simulation, the ratio of the prior probabilities $\alpha$ can
simply be set to:
$$
\alpha = \frac{N_s}{N_n}
$$

The projection of the points was accomplished by the use of the function
`gsl_blas_ddot()`, which computes the dot product of two vectors.
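
Put together, the projection and the threshold formula might translate into
GSL/C code along the following lines (the function names, and the use of the
projected means and variances of the two classes as inputs, are assumptions
about the implementation):

```c
#include <math.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>

/* Projection of a 2D point onto the Fisher direction: x_tilde = w^T x. */
double project(const gsl_vector *w, const gsl_vector *x) {
  double x_tilde;
  gsl_blas_ddot(w, x, &x_tilde);
  return x_tilde;
}

/* Threshold t_cut from the projected means mu1, mu2 and variances
 * var1, var2 of the two classes, with alpha = N_s/N_n, following the
 * formula above. Assumes var1 != var2, as is the case here. */
double fisher_threshold(double mu1, double var1,
                        double mu2, double var2, double alpha) {
  double a = var1 - var2;
  double b = mu2 * var1 - mu1 * var2;
  double c = mu2 * mu2 * var1 - mu1 * mu1 * var2
           - 2.0 * var1 * var2 * log(alpha);
  return b / a + sqrt((b / a) * (b / a) - c / a);
}
```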

Results obtained for the same samples in @fig:points are shown in
@fig:fisher-proj. The weight vector and the threshold were found to be:
$$
w = (0.707, 0.707) \et
t_{\text{cut}} = 1.323
$$

::: { id=fig:fisher-proj }
![Scatter plot of the samples.](images/7-fisher-plane.pdf)
![Histogram of the Fisher-projected samples.](images/7-fisher-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::

### A mathematical curiosity

This section is really a sidenote which grew too large to fit in a margin,
so it can be safely skipped.

It can be seen that the weight vector turned out to be parallel to the line
joining the means of the two classes (as a reminder: $(0, 0)$ and $(4, 4)$),
as if the within-class covariances were ignored. Strange!

Looking at @eq:fisher-weight, one can be misled into thinking that the inverse
of the total covariance matrix $\Sigma_w$ is (proportional to) the identity,
but that's not true. By a remarkable accident, the vector joining the means is
an eigenvector of the covariance matrix $\Sigma_w^{-1}$. In fact, since
$\sigma_x = \sigma_y$ for both signal and noise:
$$
\Sigma_1 = \begin{pmatrix}
\sigma_x^2 & \sigma_{xy} \\
\sigma_{xy} & \sigma_x^2
\end{pmatrix}_1
\et
\Sigma_2 = \begin{pmatrix}
\sigma_x^2 & \sigma_{xy} \\
\sigma_{xy} & \sigma_x^2
\end{pmatrix}_2
$$

$\Sigma_w$ takes the symmetrical form
$$
\Sigma_w = \begin{pmatrix}
A & B \\
B & A
\end{pmatrix},
$$
which can be easily inverted by Gaussian elimination:
\begin{align*}
\begin{pmatrix}
A & B & \vline & 1 & 0 \\
B & A & \vline & 0 & 1
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
1 & 0 & \vline & \frac{A}{A^2 - B^2} & -\frac{B}{A^2 - B^2} \\
0 & 1 & \vline & -\frac{B}{A^2 - B^2} & \frac{A}{A^2 - B^2}
\end{pmatrix}
\end{align*}

Hence, the inverse still has the same form:
$$
\Sigma_w^{-1} = \begin{pmatrix}
\tilde{A} & \tilde{B} \\
\tilde{B} & \tilde{A}
\end{pmatrix}
$$

For this reason, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors
$v_1$ and $v_2$:
$$
v_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
\et
v_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$

The vector joining the means is clearly a multiple of $v_2$, and so is $w$.
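
Indeed, acting with $\Sigma_w^{-1}$ on the vector joining the means gives back
a multiple of that same vector:
$$
\Sigma_w^{-1} (\mu_2 - \mu_1) =
\begin{pmatrix} \tilde{A} & \tilde{B} \\ \tilde{B} & \tilde{A} \end{pmatrix}
\begin{pmatrix} 4 \\ 4 \end{pmatrix}
= 4 \, (\tilde{A} + \tilde{B}) \begin{pmatrix} 1 \\ 1 \end{pmatrix}
\propto \mu_2 - \mu_1
$$
which is why $w$ comes out parallel to the line joining the means in this
particular sample.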


## Perceptron

@@ -311,16 +315,18 @@ The aim of the perceptron algorithm is to determine the weight vector $w$ and
bias $b$ such that the so-called 'threshold function' $f(x)$ returns a binary
value: it is expected to return 1 for signal points and 0 for noise points:
$$
f(x) = \theta(w^T x + b)
$$ {#eq:perc}

where $\theta$ is the Heaviside theta function. Note that the bias $b$ is
$-t_\text{cut}$, as defined in the previous section.

The training was performed using the generated sample as training set. From an
initial guess for $w$ and $b$ (which were set to be all null in the code), the
perceptron starts improving their estimates. The training set is passed
point by point into an iterative procedure $N$ times: for every point, the
output of $f(x)$ is computed. Afterwards, the variable $\Delta$, which is
defined as:
$$
\Delta = r [e - f(x)]
$$

@@ -355,7 +361,7 @@ To see how it works, consider the four possible situations:
the current $b$ and $w$ overestimate the correct output: they must be
decreased.

Whilst the $b$ updating is obvious, as regards $w$ the following consideration
may help clarify. Consider the case with $e = 0 \quad \wedge \quad f(x) = 1
\quad \Longrightarrow \quad \Delta = -1$:
$$

@@ -366,53 +372,51 @@

Similarly for the case with $e = 1$ and $f(x) = 0$.
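
A compact sketch of one training pass implementing this rule (the data layout
and the names below are hypothetical, not the program's actual interface):

```c
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>

/* One pass of the perceptron rule over the training set, as described
 * above. points[i] is a 2D point, expected[i] is 1 for signal and 0
 * for noise, r is the learning rate. */
void perceptron_pass(gsl_vector *w, double *b, double r,
                     gsl_vector **points, const int *expected, size_t n) {
  for (size_t i = 0; i < n; i++) {
    double proj;
    gsl_blas_ddot(w, points[i], &proj);       /* w^T x                */
    int f = (proj + *b > 0.0) ? 1 : 0;        /* Heaviside theta      */
    double delta = r * (expected[i] - f);     /* Delta = r [e - f(x)] */
    gsl_blas_daxpy(delta, points[i], w);      /* w -> w + Delta x     */
    *b += delta;                              /* b -> b + Delta       */
  }
}
```

Repeating such a pass $N$ times and normalizing $w$ afterwards would
correspond to the procedure described above.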

![Weight vector and threshold value obtained with the perceptron method as a
function of the number of iterations. Both level off at the third
iteration.](images/7-iterations.pdf){#fig:percep-iterations}

As far as convergence is concerned, the perceptron will never get to the state
with all the input points classified correctly if the training set is not
linearly separable, meaning that the signal cannot be separated from the noise
by a line in the plane. In this case, no approximate solutions will be
gradually approached. On the other hand, if the training set is linearly
separable, it can be shown (see [@novikoff63]) that this method converges to
the coveted function. As in the previous section, once found, the weight
vector is to be normalized.

With $N = 5$ iterations, the values of $w$ and $t_{\text{cut}}$ level off up to
the third digit. Different values of the learning rate were tested, all giving
the same result and converging by $N = 3$ iterations. In @fig:percep-iterations,
results are shown for $r = 0.8$: as can be seen, for $N = 3$, the values of $w$
and $t_{\text{cut}}$ level off. The following results were obtained:
$$
w = (-0.654, -0.756) \et t_{\text{cut}} = 1.213
$$

In this case, the projection line is not exactly parallel with the line joining
the means of the two samples. Plots are shown in @fig:percep-proj.

::: { id=fig:percep-proj }
![Scatter plot of the samples.](images/7-percep-plane.pdf)
![Histogram of the projected samples.](images/7-percep-proj.pdf)

Aerial and lateral views of the samples. Projection line in blue and cut in red.
:::


## Efficiency test {#sec:class-results}

Using the same parameters as the training set, a number $N_t$ of test samples
was generated and the points were classified applying both methods. To avoid
storing large datasets in memory, at each iteration, false positives and
negatives were recorded using a running statistics method implemented in the
`gsl_rstat` library (sketched below). For each sample, the numbers $N_{fn}$ and
$N_{fp}$ of false negatives and false positives were obtained this way: for
every noise point $x_n$, the threshold function $f(x_n)$ was computed, then:

- if $f(x) = 0 \thus$ $N_{fn} \to N_{fn}$
- if $f(x) \neq 0 \thus$ $N_{fn} \to N_{fn} + 1$
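
As a minimal, hypothetical illustration of the `gsl_rstat` bookkeeping (the
values pushed into the accumulator are placeholders standing in for one
per-sample result each):

```c
#include <stdio.h>
#include <gsl/gsl_rstat.h>

/* Running statistics: values are accumulated one at a time, so the
 * per-sample results never need to be stored in memory. */
int main(void) {
  gsl_rstat_workspace *acc = gsl_rstat_alloc();

  /* placeholder values, one per test sample */
  double fractions[] = { 0.001, 0.000, 0.002 };
  for (int i = 0; i < 3; i++)
    gsl_rstat_add(fractions[i], acc);

  printf("mean = %g, sd = %g\n", gsl_rstat_mean(acc), gsl_rstat_sd(acc));
  gsl_rstat_free(acc);
  return 0;
}
```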

@@ -438,13 +442,13 @@ solution, the most powerful one, according to the Neyman-Pearson lemma, whereas
the perceptron is only expected to converge to the solution and is therefore
more subject to random fluctuations.

----------------------------------------------------
             $α$       $σ_α$     $β$       $σ_β$
------------ --------- --------- --------- ---------
Fisher       0.9999    0.33      0.9999    0.33

Perceptron   0.9999    0.28      0.9995    0.64
----------------------------------------------------

Table: Results for Fisher and perceptron method. $\sigma_{\alpha}$ and
$\sigma_{\beta}$ stand for the standard deviation of the false