diff --git a/notes/sections/7.md b/notes/sections/7.md
index f0a59fe..fa70a05 100644
--- a/notes/sections/7.md
+++ b/notes/sections/7.md
@@ -21,22 +21,31 @@ $$
 \end{cases}
 $$
 
-where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ stand for the
-standard deviations in $x$ and $y$ directions respectively and $\rho$ is the
-correlation.
-In the code, default settings are $N_s = 800$ points for the signal and $n_n =
+where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ are the standard
+deviations in the $x$ and $y$ directions respectively and $\rho$ is the
+correlation coefficient, hence:
+
+$$
+  \sigma_{xy} = \rho \sigma_x \sigma_y
+$$
+
+where $\sigma_{xy}$ is the covariance of $x$ and $y$.
+In the code, default settings are $N_s = 800$ points for the signal and $N_n =
 1000$ points for the noise but can be changed from the command-line.
 Both samples were handled as matrices of dimension $n \times 2$, where $n$ is
 the number of points in the sample. The `gsl_matrix` structure provided by GSL
 was employed for this purpose and the function `gsl_ran_bivariate_gaussian()`
 was used for generating the points.
 
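+As an illustration, a minimal sketch of how a sample could be filled with
+`gsl_ran_bivariate_gaussian()` is shown below. This is not the repository
+code: the function name `generate_sample` and its parameters are placeholders,
+since only $N_s$ and $N_n$ are quoted here. Note that
+`gsl_ran_bivariate_gaussian()` generates zero-mean pairs, so the mean has to
+be added explicitly.
+
+```c
+#include <gsl/gsl_rng.h>
+#include <gsl/gsl_randist.h>
+#include <gsl/gsl_matrix.h>
+
+/* Fill an n x 2 matrix with points drawn from a bivariate Gaussian of mean
+ * (mx, my), standard deviations (sx, sy) and correlation rho. */
+gsl_matrix *generate_sample(const gsl_rng *r, size_t n,
+                            double mx, double my,
+                            double sx, double sy, double rho) {
+  gsl_matrix *sample = gsl_matrix_alloc(n, 2);
+  for (size_t i = 0; i < n; i++) {
+    double dx, dy;
+    /* zero-mean correlated pair; the mean is added afterwards */
+    gsl_ran_bivariate_gaussian(r, sx, sy, rho, &dx, &dy);
+    gsl_matrix_set(sample, i, 0, mx + dx);
+    gsl_matrix_set(sample, i, 1, my + dy);
+  }
+  return sample;
+}
+```
+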
-Then, a model of classification must be implemented in order to assign each
-point to the right class (signal or noise) to which it 'most probably' belongs
-to. The point is how 'most probably' can be interpreted and implemented.
+Assuming the way the points were generated is unknown, a classification model
+must then be implemented in order to assign each point to the class (signal or
+noise) to which it 'most probably' belongs. The question is how 'most
+probably' is to be interpreted and implemented.
 
 ## Fisher linear discriminant
 
+### The theory
+
 The Fisher linear discriminant (FLD) is a linear classification model based on
 dimensionality reduction. It allows one to reduce this 2D classification
 problem to a one-dimensional one.
@@ -46,12 +55,12 @@ simplest representation of a linear discriminant is obtained by taking a
 linear function of a sampled 2D point $x$ so that:
 
 $$
-  \hat{x} = w x + w_0
+  \hat{x} = w^T x
 $$
 
-where $w$ is called 'weight vector' and $w_0$ is a bias. The negative of the
-bias is called 'threshold'. An input point $x$ is assigned to the first class
-if $\hat{x} \geqslant 0$ and to the second one otherwise.
+where $w$ is the so-called 'weight vector'. An input point $x$ is commonly
+assigned to the first class if $\hat{x} \geqslant w_{th}$ and to the second one
+otherwise, where $w_{th}$ is a suitably chosen threshold.
 In general, the projection onto one dimension leads to a considerable loss of
 information and classes that are well separated in the original 2D space may
 become strongly overlapping in one dimension. However, by adjusting the
@@ -71,7 +80,7 @@ The simplest measure of the separation of the classes is the separation of the
 projected class means.
 This suggests choosing $w$ so as to maximize:
 
 $$
-  \hat{m}_2 − \hat{m}_1 = w (m_2 − m_1)
+  \hat{m}_2 - \hat{m}_1 = w^T (m_2 - m_1)
 $$
 
 ![The plot on the left shows samples from two classes along with the histograms
@@ -105,3 +114,63 @@ by:
 $$
   J(w) = \frac{(\hat{m}_2 - \hat{m}_1)^2}{s^2}
 $$
+
+Differentiating $J(w)$ with respect to $w$, it can be found that $J$ is
+maximized when:
+
+$$
+  w = S_b^{-1} (m_2 - m_1)
+$$
+
+where $S_b$ is the within-class covariance matrix, given by the sum of the
+covariance matrices of the two classes:
+
+$$
+  S_b = S_1 + S_2
+$$
+
+where $S_1$ and $S_2$ each have the form:
+
+$$
+\begin{pmatrix}
+\sigma_x^2 & \sigma_{xy} \\
+\sigma_{xy} & \sigma_y^2
+\end{pmatrix}
+$$
+
+This is not truly a discriminant but rather a specific choice of direction for
+the projection of the data down to one dimension: the projected data can then
+be used to construct a discriminant by choosing a threshold for the
+classification.
+
+### The code
+
+As stated above, the projection vector is given by:
+
+$$
+  w = S_b^{-1} (\mu_1 - \mu_2)
+$$
+
+where $\mu_1 = (\mu_{1x}, \mu_{1y})$ and $\mu_2 = (\mu_{2x}, \mu_{2y})$ are the
+means of the two classes; the overall sign only fixes the orientation of the
+projection axis. In short, the code proceeds as follows:
+
+  - the means $\mu_1$, $\mu_2$ and the covariance matrices $S_1$, $S_2$ are
+    obtained from the two samples, together with the ratio $r = N_s / N_n$
+    between the number of signal and noise points;
+  - the matrix $S_b = S_1 + S_2$ is computed;
+  - $S_b$ is inverted with the Cholesky method, since it is symmetrical and
+    positive-definite;
+  - the difference vector $\mu_1 - \mu_2$ is computed;
+  - the product $S_b^{-1} (\mu_1 - \mu_2)$ is computed with the
+    `gsl_blas_dgemv()` function provided by GSL;
+  - the result is normalised with the GSL vector functions.
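+
+A minimal sketch of these steps is given below, assuming that the means and
+the covariance matrices of the two samples are already available; the function
+name `fisher_direction` is a placeholder, not necessarily the name used in the
+actual code.
+
+```c
+#include <gsl/gsl_vector.h>
+#include <gsl/gsl_matrix.h>
+#include <gsl/gsl_blas.h>
+#include <gsl/gsl_linalg.h>
+
+/* Compute the normalised Fisher projection vector w = Sb^{-1} (mu1 - mu2). */
+void fisher_direction(const gsl_matrix *S1, const gsl_matrix *S2,
+                      const gsl_vector *mu1, const gsl_vector *mu2,
+                      gsl_vector *w) {
+  /* Sb = S1 + S2 */
+  gsl_matrix *Sb = gsl_matrix_alloc(2, 2);
+  gsl_matrix_memcpy(Sb, S1);
+  gsl_matrix_add(Sb, S2);
+
+  /* invert Sb in place through its Cholesky decomposition,
+   * since Sb is symmetric and positive-definite */
+  gsl_linalg_cholesky_decomp(Sb);
+  gsl_linalg_cholesky_invert(Sb);
+
+  /* diff = mu1 - mu2 */
+  gsl_vector *diff = gsl_vector_alloc(2);
+  gsl_vector_memcpy(diff, mu1);
+  gsl_vector_sub(diff, mu2);
+
+  /* w = Sb^{-1} (mu1 - mu2) */
+  gsl_blas_dgemv(CblasNoTrans, 1.0, Sb, diff, 0.0, w);
+
+  /* normalise w to unit length */
+  gsl_vector_scale(w, 1.0 / gsl_blas_dnrm2(w));
+
+  gsl_matrix_free(Sb);
+  gsl_vector_free(diff);
+}
+```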