ex-7: started writing about the Fisher discriminant

This commit is contained in:
Giù Marcer 2020-03-31 23:37:49 +02:00 committed by rnhmjoj
parent 4301409842
commit e19ebc7fd1

notes/sections/7.md

@ -0,0 +1,107 @@
# Exercise 7
## Generating points according to gaussian distributions
The first task of exercise 7 is to generate two sets of 2D points $(x, y)$
according to two bivariate gaussian distributions with parameters:
$$
\text{signal} \quad
\begin{cases}
\mu = (0, 0) \\
\sigma_x = \sigma_y = 0.3 \\
\rho = 0.5
\end{cases}
\et
\text{noise} \quad
\begin{cases}
\mu = (4, 4) \\
\sigma_x = \sigma_y = 1 \\
\rho = 0.4
\end{cases}
$$
where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ stand for the
standard deviations in $x$ and $y$ directions respectively and $\rho$ is the
correlation.
In the code, the default settings are $N_s = 800$ points for the signal and
$N_n = 1000$ points for the noise, but both can be changed from the command
line. Each sample was handled as a matrix of dimension $n \times 2$, where $n$
is the number of points in the sample. The `gsl_matrix` type provided by GSL
was employed for this purpose and the function `gsl_ran_bivariate_gaussian()`
was used to generate the points.
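As a rough sketch of how the generation could look (function and variable
names here are illustrative, not necessarily those used in the actual code),
note that `gsl_ran_bivariate_gaussian()` draws a zero-mean pair, so the
desired mean has to be added by hand:

```c
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_rng.h>

/* Fill an n×2 matrix with points drawn from a bivariate gaussian
 * with mean (mux, muy), standard deviations (sx, sy) and
 * correlation rho. */
gsl_matrix *generate_sample(gsl_rng *r, size_t n,
                            double mux, double muy,
                            double sx, double sy, double rho) {
  gsl_matrix *sample = gsl_matrix_alloc(n, 2);
  for (size_t i = 0; i < n; i++) {
    double dx, dy;
    gsl_ran_bivariate_gaussian(r, sx, sy, rho, &dx, &dy);
    gsl_matrix_set(sample, i, 0, mux + dx);
    gsl_matrix_set(sample, i, 1, muy + dy);
  }
  return sample;
}
```

With the parameters above, the signal would then be obtained as
`generate_sample(r, 800, 0, 0, 0.3, 0.3, 0.5)` and the noise as
`generate_sample(r, 1000, 4, 4, 1, 1, 0.4)`, where `r` is a `gsl_rng`
allocated beforehand.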
Then, a classification model must be implemented in order to assign each
point to the class (signal or noise) to which it 'most probably' belongs. The
question is how 'most probably' is to be interpreted and implemented.

## Fisher linear discriminant

The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It allows this 2D classification problem to be
reduced to a decision in one dimension.
Consider the case of two classes (here, the signal and the noise): the
simplest representation of a linear discriminant is obtained by taking a
linear function of a sampled 2D point $x$, so that:
$$
\hat{x} = w x + w_0
$$
where $w$ is called 'weight vector' and $w_0$ is a bias. The negative of the
bias is called 'threshold'. An input point $x$ is assigned to the first class
if $\hat{x} \geqslant 0$ and to the second one otherwise.
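In code, this decision rule is just a dot product followed by a comparison
with the threshold; a minimal sketch (with illustrative names) might be:

```c
/* Classify a 2D point: returns 1 if w·x + w0 >= 0 (first class),
 * 0 otherwise (second class). */
int fld_classify(const double w[2], double w0, const double x[2]) {
  double xhat = w[0] * x[0] + w[1] * x[1] + w0;
  return xhat >= 0;
}
```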
In general, the projection onto one dimension leads to a considerable loss of
information and classes that are well separated in the original 2D space may
become strongly overlapping in one dimension. However, by adjusting the
components of the weight vector, a projection that maximizes the class
separation can be selected.
To begin with, consider a two-class problem in which there are $N_1$ points of
class $C_1$ and $N_2$ points of class $C_2$, so that the means $m_1$ and $m_2$
of the two classes are given by:
$$
m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
\et
m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$
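Since each class is stored as an $n \times 2$ matrix, the class means are
simply the column averages. A possible sketch (illustrative, not necessarily
the actual implementation):

```c
#include <gsl/gsl_matrix.h>

/* Compute the 2D mean of a sample stored as an n×2 matrix. */
void sample_mean(const gsl_matrix *sample, double mean[2]) {
  mean[0] = mean[1] = 0;
  for (size_t i = 0; i < sample->size1; i++) {
    mean[0] += gsl_matrix_get(sample, i, 0);
    mean[1] += gsl_matrix_get(sample, i, 1);
  }
  mean[0] /= sample->size1;
  mean[1] /= sample->size1;
}
```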
The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
\hat{m}_2 - \hat{m}_1 = w (m_2 - m_1)
$$
![The plot on the left shows samples from two classes along with the histograms
resulting from projection onto the line joining the class means: note that
there is considerable overlap in the projected space. The right plot shows the
corresponding projection based on the Fisher linear discriminant, showing the
greatly improved class separation.](images/fisher.png){#fig:overlap}
This expression can be made arbitrarily large simply by increasing the
magnitude of $w$. To solve this problem, $w$ can be constrained to have unit
length, so that $|w|^2 = 1$. Using a Lagrange multiplier to perform the
constrained maximization, it can be found that $w \propto (m_2 - m_1)$.
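More explicitly, the Lagrangian of the constrained problem and its
stationarity condition with respect to $w$ read:
$$
L = w (m_2 - m_1) - \lambda (|w|^2 - 1)
\quad \Longrightarrow \quad
\nabla_w L = (m_2 - m_1) - 2 \lambda w = 0
\quad \Longrightarrow \quad
w \propto (m_2 - m_1)
$$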
There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means.
The idea to solve this issue is to maximize a function that gives a large
separation between the projected class means while also giving a small
variance within each class, thereby minimizing the class overlap.
The within-class variance of the transformed data of each class $C_k$ is given
by:
$$
s_k^2 = \sum_{n \in C_k} (\hat{x}_n - \hat{m}_k)^2
$$
The total within-class variance for the whole data set can simply be defined
as $s^2 = s_1^2 + s_2^2$. The Fisher criterion is therefore defined as the
ratio of the squared distance between the projected class means to the total
within-class variance:
$$
J(w) = \frac{(\hat{m}_2 - \hat{m}_1)^2}{s^2}
$$
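Putting the pieces together, $J(w)$ for a candidate direction $w$ could be
evaluated along these lines (again an illustrative sketch, reusing the
$n \times 2$ matrix layout described above):

```c
#include <gsl/gsl_matrix.h>

/* Projection of the i-th point of a sample onto w. */
static double project(const double w[2], const gsl_matrix *sample, size_t i) {
  return w[0] * gsl_matrix_get(sample, i, 0)
       + w[1] * gsl_matrix_get(sample, i, 1);
}

/* Fisher criterion J(w) for two samples stored as n×2 matrices. */
double fisher_criterion(const double w[2],
                        const gsl_matrix *c1, const gsl_matrix *c2) {
  const gsl_matrix *classes[2] = { c1, c2 };
  double mhat[2], s2 = 0;

  for (int k = 0; k < 2; k++) {
    size_t n = classes[k]->size1;
    /* projected class mean */
    mhat[k] = 0;
    for (size_t i = 0; i < n; i++)
      mhat[k] += project(w, classes[k], i);
    mhat[k] /= n;
    /* within-class variance of the projected points */
    for (size_t i = 0; i < n; i++) {
      double d = project(w, classes[k], i) - mhat[k];
      s2 += d * d;
    }
  }
  return (mhat[1] - mhat[0]) * (mhat[1] - mhat[0]) / s2;
}
```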