ex-7: went on writing the FLD
This commit is contained in: parent ceedd61f00, commit a1b84f022d
@@ -1,9 +1,9 @@
# Exercise 7

## Generating points according to Gaussian distributions {#sec:sampling}

The first task of exercise 7 is to generate two sets of 2D points $(x, y)$
according to two bivariate Gaussian distributions with parameters:

$$
\text{signal} \quad
@@ -44,15 +44,15 @@ must then be implemented in order to assign each point to the right class

## Fisher linear discriminant

### The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It allows this 2D classification problem to be
reduced to a one-dimensional decision problem.

Consider the case of two classes (in this case the signal and the noise): the
simplest representation of a linear discriminant is obtained by taking a linear
function of a sampled 2D point $x$ so that:

$$
\hat{x} = w^T x
@@ -60,15 +60,14 @@ $$

where $w$ is the so-called 'weight vector'. An input point $x$ is commonly
assigned to the first class if $\hat{x} \geqslant w_{th}$ and to the second one
otherwise, where $w_{th}$ is a suitably chosen threshold value.
In general, the projection onto one dimension leads to a considerable loss of
information and classes that are well separated in the original 2D space may
become strongly overlapping in one dimension. However, by adjusting the
components of the weight vector, a projection that maximizes the class
separation can be selected.
To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
$C_2$, so that the means $m_1$ and $m_2$ of the two classes are given by:

$$
m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
@@ -77,29 +76,30 @@ $$
$$

The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:

$$
\hat{m}_2 - \hat{m}_1 = w^T (m_2 - m_1)
$$

This expression can be made arbitrarily large simply by increasing the magnitude
of $w$. To solve this problem, $w$ can be constrained to have unit length, so
that $|w|^2 = 1$. Using a Lagrange multiplier to perform the constrained
maximization, it can be found that $w \propto (m_2 - m_1)$.
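
Explicitly, maximizing $w^T (m_2 - m_1)$ subject to $w^T w = 1$ through the
Lagrangian $L(w, \lambda) = w^T (m_2 - m_1) + \lambda \, (1 - w^T w)$ gives:

$$
\frac{\partial L}{\partial w} = (m_2 - m_1) - 2 \lambda w = 0
\quad \Longrightarrow \quad
w = \frac{m_2 - m_1}{2 \lambda} \propto (m_2 - m_1)
$$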

![The plot on the left shows samples from two classes along with the histograms
resulting from projection onto the line joining the class means: note that
there is considerable overlap in the projected space. The right plot shows the
corresponding projection based on the Fisher linear discriminant, showing the
greatly improved class separation.](images/fisher.png){#fig:overlap}

There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means.
The idea is to solve this by maximizing a function that gives a large separation
between the projected class means while also giving a small variance within
each class, thereby minimizing the class overlap.
The within-classes variance of the transformed data of each class $k$ is given
by:

$$
@@ -107,9 +107,9 @@ $$
$$

The total within-classes variance for the whole data set can be simply defined
as $s^2 = s_1^2 + s_2^2$. The Fisher criterion is therefore defined as the
ratio of the squared between-classes distance to the within-classes variance
and is given by:

$$
J(w) = \frac{(\hat{m}_2 - \hat{m}_1)^2}{s^2}
@@ -122,7 +122,7 @@ $$
w = S_b^{-1} (m_2 - m_1)
$$

where $S_b$ is the total within-classes covariance matrix, given by:

$$
S_b = S_1 + S_2
@@ -142,35 +142,51 @@ projection of the data down to one dimension: the projected data can then be
used to construct a discriminant by choosing a threshold for the
classification.

When implemented, the parameters given in @sec:sampling were used to compute
the covariance matrices $S_1$ and $S_2$ of the two classes and their sum $S$.
Then $S$, being a symmetric and positive-definite matrix, was inverted with
the Cholesky method, already discussed in @sec:MLM.
Lastly, the matrix-vector product was computed with the `gsl_blas_dgemv()`
function provided by GSL.

### The threshold

The cut was fixed by requiring the conditional probabilities of the two classes
to be equal:

$$
t_{\text{cut}} = x \, | \hspace{20pt}
\frac{P(c_1 | x)}{P(c_2 | x)} =
\frac{p(x | c_1) \, p(c_1)}{p(x | c_2) \, p(c_2)} = 1
$$

where $p(x | c_k)$ is the probability for point $x$ along the Fisher projection
line of belonging to the class $k$. If the classes are bivariate Gaussian, as
in the present case, then $p(x | c_k)$ is simply given by its projected normal
distribution $\mathscr{G} (\hat{m}_k, \hat{S}_k)$. With a bit of math, the
solution is then:

$$
t_{\text{cut}} = \frac{b}{a} + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$

where:

- $a = \hat{S}_1^2 - \hat{S}_2^2$
- $b = \hat{m}_2 \, \hat{S}_1^2 - \hat{m}_1 \, \hat{S}_2^2$
- $c = \hat{m}_2^2 \, \hat{S}_1^2 - \hat{m}_1^2 \, \hat{S}_2^2 - 2 \, \hat{S}_1^2 \, \hat{S}_2^2 \, \ln(\alpha)$
- $\alpha = p(c_1) / p(c_2)$

The ratio of the prior probabilities $\alpha$ was computed as:

$$
\alpha = \frac{N_s}{N_n}
$$
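
A direct transcription of this formula might be the following sketch; the names
(`fisher_threshold`, `m1`, `m2`, `s1`, `s2`) are illustrative assumptions, not
part of the original code.

```c
#include <math.h>

/* Sketch: threshold along the projected axis from the projected means
 * (m1, m2), projected standard deviations (s1, s2) and the prior ratio
 * alpha = N_s/N_n. Assumes s1 != s2, otherwise the coefficient a vanishes. */
double fisher_threshold(double m1, double m2,
                        double s1, double s2, double alpha)
{
    double a = s1*s1 - s2*s2;
    double b = m2*s1*s1 - m1*s2*s2;
    double c = m2*m2*s1*s1 - m1*m1*s2*s2
             - 2.0 * s1*s1 * s2*s2 * log(alpha);
    return b/a + sqrt((b/a)*(b/a) - c/a);
}
```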

The projection of the points was accomplished with the function
`gsl_blas_ddot()`, which computes the dot product between two vectors: in this
case, the weight vector and the position of the point to be projected.
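
For a single point, the projection and the comparison with the threshold might
be sketched as follows; the names `classify` and `t_cut` are illustrative, and
the sign convention depends on the orientation of $w$.

```c
#include <gsl/gsl_blas.h>
#include <gsl/gsl_vector.h>

/* Sketch: project a 2D point onto the Fisher axis and compare the result
 * with the threshold t_cut. Returns 1 for signal and 0 for noise, assuming
 * w points from the noise class towards the signal class. */
int classify(const gsl_vector *w, const gsl_vector *point, double t_cut)
{
    double proj;
    gsl_blas_ddot(w, point, &proj);   /* proj = w . point */
    return proj > t_cut;
}
```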