From 7c830a7ba02407c54955445cac8ca72238777b12 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Gi=C3=B9=20Marcer?=
Date: Wed, 3 Jun 2020 22:46:17 +0200
Subject: [PATCH] ex-7: review

---
 notes/sections/7.md | 58 +++++++++++++--------------------------------
 1 file changed, 16 insertions(+), 42 deletions(-)

diff --git a/notes/sections/7.md b/notes/sections/7.md
index 847b0f0..6ad4e9d 100644
--- a/notes/sections/7.md
+++ b/notes/sections/7.md
@@ -25,7 +25,7 @@ correlation, namely:
$$
  \sigma_{xy} = \rho\, \sigma_x \sigma_y
$$
-where $\sigma_{xy}$ is the covariance of $x$ and $y$. 
+where $\sigma_{xy}$ is the covariance of $x$ and $y$.
In the programs, $N_s = 800$ points for the signal and $N_n = 1000$ points for
the noise were chosen as default but can be customized from the command-line.
@@ -40,11 +40,10 @@ with the given parameters.](images/7-points.pdf){#fig:points}

Assuming not to know how the points were generated, a model of classification,
which assigns to each point the 'most probable' class it belongs to,
-is implemented. Depending on the interpretation of 'most probable' many
-different models can be developed.
-Here, the Fisher linear discriminant and the perceptron were implemented and
-described in the following two sections. The results are compared in
-@sec:class-results.
+was implemented. Depending on the interpretation of 'most probable', many
+different models can be developed. Here, the Fisher linear discriminant and the
+perceptron were implemented and described in the following two sections. The
+results are compared in @sec:class-results.

## Fisher linear discriminant

@@ -81,13 +80,11 @@ $$
  \et
  \mu_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$
-
The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
  \tilde{\mu}_2 − \tilde{\mu}_1 = w^T (\mu_2 − \mu_1)
$$
-
This expression can be made arbitrarily large simply by increasing the
magnitude of $w$; fortunately, the problem is easily solved by requiring $w$
to be normalised: $|w|^2 = 1$. Using a Lagrange multiplier to perform the
@@ -112,15 +109,13 @@ within-class variance of the transformed data of each class $k$ is given by:
$$
  \tilde{s}_k^2 = \sum_{n \in c_k} (\tilde{x}_n - \tilde{\mu}_k)^2
$$
-
The total within-class variance for the whole data set is simply defined as
-$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2$. The Fisher criterion is therfore
+$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2$. The Fisher criterion is therefore
defined to be the ratio of the between-class distance to the within-class
variance and is given by:
$$
  F(w) = \frac{(\tilde{\mu}_2 - \tilde{\mu}_1)^2}{\tilde{s}^2}
$$
-
The dependence on $w$ can be made explicit:
\begin{align*}
  (\tilde{\mu}_2 - \tilde{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
@@ -129,7 +124,6 @@ The dependence on $w$ can be made explicit:
  &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w] = w^T M w
\end{align*}
-
where $M$ is the between-distance matrix. Similarly, as regards the denominator:
\begin{align*}
  \tilde{s}^2 &= \tilde{s}_1^2 + \tilde{s}_2^2 = \\
@@ -137,7 +131,6 @@ where $M$ is the between-distance matrix.
Similarly, as regards the denominator:
   + \sum_{n \in c_2} (\tilde{x}_n - \tilde{\mu}_2)^2 = w^T \Sigma_w w
\end{align*}
-
where $\Sigma_w$ is the total within-class covariance matrix:
\begin{align*}
  \Sigma_w &= \sum_{n \in c_1} (x_n − \mu_1)(x_n − \mu_1)^T
@@ -152,19 +145,16 @@ where $\Sigma_w$ is the total within-class covariance matrix:
    \sigma_{xy} & \sigma_y^2
  \end{pmatrix}_2
\end{align*}
-
where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the two samples.
The Fisher criterion can therefore be rewritten in the form:
$$
  F(w) = \frac{w^T M w}{w^T \Sigma_w w}
$$
-
Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
when:
$$
  w = \Sigma_w^{-1} (\mu_2 - \mu_1)
$$ {#eq:fisher-weight}
-
This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then
be used to construct a discriminant by choosing a threshold for the
@@ -182,11 +172,10 @@ with the `gsl_blas_dgemv()` function provided by GSL.
The threshold $t_{\text{cut}}$ was fixed by the conditional probability
$P(c_k | t_{\text{cut}})$ being the same for both classes $c_k$:
$$
-  t_{\text{cut}} = x \text{ such that}\quad
-  \frac{P(c_1 | x)}{P(c_2 | x)} = 
+  t_{\text{cut}} = x \quad \text{such that}\quad
+  \frac{P(c_1 | x)}{P(c_2 | x)} =
  \frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$
-
where $P(x | c_k)$ is the probability for point $x$ along the Fisher projection
line of being sampled from $c_k$. If $\tilde{x} > t_\text{cut}$, then $x$ more
likely belongs to $c_1$; otherwise, to $c_2$.
@@ -213,7 +202,6 @@ simply be set to:
$$
  \alpha = \frac{N_s}{N_n}
$$
-
The projection of the points was accomplished by the use of the function
`gsl_blas_ddot()`, which computes a fast dot product of two vectors.
@@ -228,7 +216,7 @@ $$
![Scatter plot of the samples.](images/7-fisher-plane.pdf)
![Histogram of the Fisher-projected samples.](images/7-fisher-proj.pdf)

-Aerial and lateral views of the samples. Projection line in blu and cut in red.
+Aerial and lateral views of the samples. Projection line in blue and cut in red.

:::

@@ -237,9 +225,9 @@ Aerial and lateral views of the samples. Projection line in blu and cut in red.

This section is really a sidenote which grew too large to fit in a margin, so
it can be safely skipped.

-It can be seen that the weight vector turned out to parallel to the line
+It can be seen that the weight vector turned out to be parallel to the line
joining the means of the two classes (as a reminder: $(0, 0)$ and $(4, 4)$),
-as if the within-class covariances were ignored. Strange!
+as if the within-class covariances were ignored. Weird!

Looking at @eq:fisher-weight, one can be misled into thinking that the inverse
of the total covariance matrix $\Sigma_w$ is (proportional to) the identity,
@@ -257,7 +245,6 @@ $$
    \sigma_{xy} & \sigma_x^2
  \end{pmatrix}_2
$$
-
$\Sigma_w$ takes the symmetrical form
$$
  \Sigma_w = \begin{pmatrix}
@@ -288,7 +275,6 @@ $$
    \tilde{B} & \tilde{A}
  \end{pmatrix}
$$
-
For this reason, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors
$v_1$ and $v_2$:
$$
  v_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
  \et
  v_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$
-
The vector joining the means is clearly a multiple of $v_2$, and so is $w$.

@@ -317,7 +302,6 @@ value: it is expected to return 1 for signal points and 0 for noise points:
$$
  f(x) = \theta(w^T x + b)
$$ {#eq:perc}
-
where $\theta$ is the Heaviside theta function. Note that the bias $b$ is
$-t_\text{cut}$, as defined in the previous section.
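For concreteness, a minimal plain-C sketch of @eq:perc together with the update rule discussed right below could read as follows; every name and number in it is illustrative only and is not taken from the actual exercise code:

```c
#include <stdio.h>

/* Sketch of f(x) = theta(w^T x + b) for 2D points:
 * returns 1 ("signal") if w.x + b > 0, otherwise 0 ("noise"). */
int activation(const double w[2], double b, const double x[2]) {
    return (w[0] * x[0] + w[1] * x[1] + b > 0) ? 1 : 0;
}

/* One update step: delta = r * (e - f(x)), then b -> b + delta
 * and w -> w + delta * x, as in the update rule described below. */
void update(double w[2], double *b, const double x[2], int e, double r) {
    double delta = r * (e - activation(w, *b, x));
    *b   += delta;
    w[0] += delta * x[0];
    w[1] += delta * x[1];
}

int main(void) {
    double w[2] = {0.0, 0.0}, b = 0.0;   /* start from zero weights */
    const double x[2] = {4.1, 3.9};      /* a made-up signal-like point */
    update(w, &b, x, 1, 0.8);            /* expected class e = 1, rate r = 0.8 */
    printf("w = (%g, %g), b = %g\n", w[0], w[1], b);
    return 0;
}
```

Repeating such updates over all the points, for a few iterations, is all the training described in the next paragraphs amounts to.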
@@ -330,7 +314,6 @@ defined as:
$$
  \Delta = r [e - f(x)]
$$
-
where:
- $r \in [0, 1]$ is the learning rate of the perceptron: the larger $r$, the
@@ -340,13 +323,11 @@ where:
    noise;
is used to update $b$ and $w$:
-
$$
  b \to b + \Delta
  \et
  w \to w + \Delta x
$$
-
To see how it works, consider the four possible situations:
- $e = 1 \quad \wedge \quad f(x) = 1 \quad \dot \vee \quad e = 0 \quad \wedge
@@ -369,7 +350,6 @@ $$
  = w^T \cdot x + \Delta |x|^2 = w^T \cdot x - |x|^2 \leq w^T \cdot x
$$
-
Similarly for the case with $e = 1$ and $f(x) = 0$.

![Weight vector and threshold value obtained with the perceptron method as a
@@ -385,18 +365,14 @@ separable, it can be shown (see [@novikoff63]) that this method converges
to the coveted function.
As in the previous section, once found, the weight vector is to be normalized.

-With $N = 5$ iterations, the values of $w$ and $t_{\text{cut}}$ level off up to the third
-digit. The following results were obtained:
-
Different values of the learning rate were tested, all giving the same result,
-converging for a number $N = 3$ of iterations. In @fig:percep-iterations, results are
-shown for $r = 0.8$: as can be seen, for $N = 3$, the values of $w$ and
-$t^{\text{cut}}$ level off.
+converging after $N = 3$ iterations. In @fig:percep-iterations,
+results are shown for $r = 0.8$: as can be seen, for $N = 3$, the values of $w$
+and $t_{\text{cut}}$ level off. The following results were obtained:
$$
  w = (-0.654, -0.756) \et t_{\text{cut}} = 1.213
$$
-
In this case, the projection line is not exactly parallel to the line joining
the means of the two samples. Plots in @fig:percep-proj.
@@ -404,7 +380,7 @@ the means of the two samples. Plots in @fig:percep-proj.
![Scatter plot of the samples.](images/7-percep-plane.pdf)
![Histogram of the projected samples.](images/7-percep-proj.pdf)

-Aerial and lateral views of the samples. Projection line in blu and cut in red.
+Aerial and lateral views of the samples. Projection line in blue and cut in red.

:::

@@ -419,18 +395,16 @@ false negative and false positive were obtained this way: for every noise
point $x_n$, the threshold function $f(x_n)$ was computed, then:
  - if $f(x) = 0 \thus$ $N_{fn} \to N_{fn}$
-  - if $f(x) \neq 0 \thus$ $N_{fn} \to N_{fn} + 1$ 
+  - if $f(x) \neq 0 \thus$ $N_{fn} \to N_{fn} + 1$
and similarly for the positive points. Finally, the mean and standard deviation
were computed from $N_{fn}$ and $N_{fp}$ for every sample and used to estimate
the purity $\alpha$ and efficiency $\beta$ of the classification:
-
$$
  \alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s}
  \et
  \beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$
-
Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the Fisher
discriminant gives a nearly perfect classification with a symmetric
distribution of false negatives and false positives, whereas the perceptron
shows a little more
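For a single sample, the two estimates above can be sketched in plain C as follows; the sizes, the weights and the convention that $N_{fn}$ counts misclassified signal points while $N_{fp}$ counts misclassified noise points are illustrative assumptions, not the exercise code:

```c
#include <stdio.h>

/* Same threshold classifier as before: returns 1 = signal, 0 = noise. */
static int classify(const double w[2], double b, const double x[2]) {
    return (w[0] * x[0] + w[1] * x[1] + b > 0) ? 1 : 0;
}

int main(void) {
    /* Toy sample sizes (the exercise uses N_s = 800 and N_n = 1000). */
    enum { ns = 3, nn = 3 };
    const double signal[ns][2] = {{4.0, 4.0}, {3.5, 4.2}, {4.3, 3.8}};
    const double noise [nn][2] = {{0.0, 0.0}, {0.5, -0.3}, {-0.2, 0.4}};
    const double w[2] = {0.707, 0.707};   /* made-up weights and bias */
    const double b = -2.83;

    /* Count signal points classified as noise and noise points
     * classified as signal. */
    int n_fn = 0, n_fp = 0;
    for (int i = 0; i < ns; i++) n_fn += (classify(w, b, signal[i]) == 0);
    for (int i = 0; i < nn; i++) n_fp += (classify(w, b, noise[i]) == 1);

    /* alpha = 1 - N_fn/N_s and beta = 1 - N_fp/N_n for this one sample;
     * averaging over the N_t samples gives the quoted estimates. */
    double alpha = 1.0 - (double)n_fn / ns;
    double beta  = 1.0 - (double)n_fp / nn;
    printf("alpha = %.3f, beta = %.3f\n", alpha, beta);
    return 0;
}
```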