ex-7: went on writing the FLD

Giù Marcer 2020-04-01 23:39:19 +02:00 committed by rnhmjoj
parent 3ee0aec13e
commit 37e5bf0cbb


@@ -21,22 +21,31 @@ $$
\end{cases}
$$
where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ are the standard
deviations in the $x$ and $y$ directions respectively and $\rho$ is the
bivariate correlation, hence:
$$
\sigma_{xy} = \rho \sigma_x \sigma_y
$$
where $\sigma_{xy}$ is the covariance of $x$ and $y$.
In the code, default settings are $N_s = 800$ points for the signal and $N_n =
1000$ points for the noise, but both can be changed from the command line. Each
sample was handled as a matrix of dimension $n \times 2$, where $n$ is the number
of points in the sample. The `gsl_matrix` type provided by GSL was employed
for this purpose and the function `gsl_ran_bivariate_gaussian()` was used for
generating the points.
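
As a rough sketch (with illustrative parameter and function names, not
necessarily those of the actual program), the generation of one sample could
look like this:

```c
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_matrix.h>

/* Fill an n x 2 matrix with points drawn from a bivariate Gaussian
 * centred in (mux, muy). Names are illustrative only. */
gsl_matrix *generate_sample(gsl_rng *r, size_t n,
                            double mux, double muy,
                            double sx, double sy, double rho) {
  gsl_matrix *sample = gsl_matrix_alloc(n, 2);
  for (size_t i = 0; i < n; i++) {
    double dx, dy;
    /* gsl_ran_bivariate_gaussian() generates a zero-mean pair,
     * so the means are added afterwards. */
    gsl_ran_bivariate_gaussian(r, sx, sy, rho, &dx, &dy);
    gsl_matrix_set(sample, i, 0, mux + dx);
    gsl_matrix_set(sample, i, 1, muy + dy);
  }
  return sample;
}
```

The generator `r` is assumed to have been set up beforehand, e.g. with
`gsl_rng_env_setup()` and `gsl_rng_alloc()`.
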
Assuming the way the points were generated is unknown, a classification model
must then be implemented in order to assign each point to the class (signal or
noise) to which it 'most probably' belongs. The question is how 'most probably'
is to be interpreted and implemented.

## Fisher linear discriminant
### The theory
The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It allows this 2D classification problem to be
reduced to a one-dimensional decision surface.
@@ -46,12 +55,12 @@ simplest representation of a linear discriminant is obtained by taking a linear
function of a sampled 2D point $x$ so that:
$$
\hat{x} = w^T x
$$
where $w$ is the so-called 'weight vector'. An input point $x$ is commonly
assigned to the first class if $\hat{x} \geqslant w_{th}$ and to the second one
otherwise, where $w_{th}$ is a suitably chosen threshold.
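
As a minimal sketch of this rule (assuming a weight vector $w$ and a threshold
$w_{th}$ have already been determined; the function name is illustrative):

```c
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>

/* Return 1 if the point is assigned to the first class (signal),
 * 0 otherwise. `w` and `w_th` are assumed to be already known. */
int classify(const gsl_vector *x, const gsl_vector *w, double w_th) {
  double x_proj;
  gsl_blas_ddot(w, x, &x_proj);   /* x_proj = w^T x */
  return x_proj >= w_th;
}
```
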
In general, the projection onto one dimension leads to a considerable loss of
information and classes that are well separated in the original 2D space may
become strongly overlapping in one dimension. However, by adjusting the
@@ -71,7 +80,7 @@ The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
\hat{m}_2 - \hat{m}_1 = w^T (m_2 - m_1)
$$
![The plot on the left shows samples from two classes along with the histograms
@@ -105,3 +114,63 @@ by:
$$
J(w) = \frac{(\hat{m}_2 - \hat{m}_1)^2}{s^2}
$$
By differentiating $J(w)$ with respect to $w$, one finds that it is
maximized when:
$$
w = S_b^{-1} (m_2 - m_1)
$$
where $S_b$ is the within-class covariance matrix, given by:
$$
S_b = S_1 + S_2
$$
where $S_1$ and $S_2$ are the covariance matrices of the two classes, each of the form:
$$
\begin{pmatrix}
\sigma_x^2 & \sigma_{xy} \\
\sigma_{xy} & \sigma_y^2
\end{pmatrix}
$$
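
For completeness, here is a sketch of the maximization step, written with the
notation of this section under the assumption that $s^2$ denotes the sum of the
variances of the two projected classes (so that $s^2 = w^T S_b\, w$) and
recalling that the projected means are $\hat{m}_k = w^T m_k$:
$$
J(w) = \frac{(\hat{m}_2 - \hat{m}_1)^2}{s^2}
     = \frac{w^T (m_2 - m_1)(m_2 - m_1)^T w}{w^T S_b\, w}
$$
Setting the derivative of $J$ with respect to $w$ to zero gives
$$
(w^T S_b\, w) \, (m_2 - m_1)(m_2 - m_1)^T w
  = \left[ w^T (m_2 - m_1)(m_2 - m_1)^T w \right] S_b\, w
$$
and, since $(m_2 - m_1)(m_2 - m_1)^T w$ is always parallel to $(m_2 - m_1)$ and
only the direction of $w$ matters, this reduces to $w \propto S_b^{-1} (m_2 - m_1)$.
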
This is not truly a discriminant but rather a specific choice of direction for
projection of the data down to one dimension: the projected data can then be
used to construct a discriminant by choosing a threshold for the
classification.
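
For instance, one common textbook choice for such a threshold (not necessarily
the one adopted in this exercise) is the midpoint between the projected class
means:
$$
w_{th} = \frac{\hat{m}_1 + \hat{m}_2}{2} = \frac{w^T (m_1 + m_2)}{2}
$$
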
### The code
As stated above, the projection vector is given by
$$
w = S_b^{-1} (\mu_1 - \mu_2)
$$
where $\mu_1$ and $\mu_2$ are the means of the two classes (the opposite
ordering with respect to the formula above only changes the sign of $w$).
In the code, the following quantities are computed. The ratio between the
sizes of the two samples:
$$
r = \frac{N_s}{N_n}
$$
The means of the two samples, each handled as a 2D vector, for example:
$$
\mu_1 = (\mu_{1x}, \mu_{1y})
$$
The matrix $S_b$, obtained as $S_b = S_1 + S_2$ and inverted with the Cholesky
method, since it is symmetric and positive-definite. The difference:
$$
\text{diff} = \mu_1 - \mu_2
$$
This difference is then multiplied by $S_b^{-1}$ using the `gsl_blas_dgemv()`
function provided by GSL and the result is finally normalised, again with GSL
functions.
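
Putting these steps together, a minimal sketch of the computation of $w$ with
GSL could look like the following. The wrapper name `fisher_direction` and the
variable names are hypothetical; the `gsl_*` routines are standard GSL calls.

```c
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>
#include <gsl/gsl_linalg.h>

/* Compute w = S_b^{-1} (mu1 - mu2), normalised to unit length.
 * `mu1`, `mu2` and `sb` = S_1 + S_2 are assumed to have been
 * filled already; `sb` is overwritten with its inverse. */
gsl_vector *fisher_direction(const gsl_vector *mu1, const gsl_vector *mu2,
                             gsl_matrix *sb) {
  /* Invert S_b in place with the Cholesky method: S_b is
   * symmetric and positive-definite. */
  gsl_linalg_cholesky_decomp(sb);
  gsl_linalg_cholesky_invert(sb);

  /* diff = mu1 - mu2 */
  gsl_vector *diff = gsl_vector_alloc(2);
  gsl_vector_memcpy(diff, mu1);
  gsl_vector_sub(diff, mu2);

  /* w = S_b^{-1} (mu1 - mu2) via matrix-vector product */
  gsl_vector *w = gsl_vector_alloc(2);
  gsl_blas_dgemv(CblasNoTrans, 1.0, sb, diff, 0.0, w);

  /* Normalise w to unit length */
  gsl_vector_scale(w, 1.0 / gsl_blas_dnrm2(w));

  gsl_vector_free(diff);
  return w;
}
```

Inverting through the Cholesky decomposition exploits the symmetry and
positive-definiteness of $S_b$; for a $2 \times 2$ matrix the cost is
negligible either way.
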