# Exercise 7

## Generating points according to Gaussian distributions

The first task of exercise 7 is to generate two sets of 2D points $(x, y)$
according to two bivariate Gaussian distributions with parameters:

$$
\text{signal} \quad
\begin{cases}
\mu = (0, 0) \\
\sigma_x = \sigma_y = 0.3 \\
\rho = 0.5
\end{cases}
\et
\text{noise} \quad
\begin{cases}
\mu = (4, 4) \\
\sigma_x = \sigma_y = 1 \\
\rho = 0.4
\end{cases}
$$

where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ are the standard
deviations in the $x$ and $y$ directions respectively and $\rho$ is the
correlation.
In the code, the default settings are $N_s = 800$ points for the signal and
$N_n = 1000$ points for the noise; both can be changed from the command line.
Each sample is handled as a matrix of dimension $n \times 2$, where $n$ is the
number of points in the sample. The `gsl_matrix` type provided by GSL was
employed for this purpose and the function `gsl_ran_bivariate_gaussian()` was
used to generate the points.
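As a reference, a minimal sketch of how each sample could be filled with these
GSL routines is shown below; the function name `generate_sample` and the way
the means are added are illustrative assumptions, not the exact code of the
exercise.

```c
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_rng.h>

/* Fill an n x 2 matrix with points drawn from a bivariate Gaussian
 * with means (mu_x, mu_y), standard deviations (sigma_x, sigma_y)
 * and correlation rho. */
gsl_matrix* generate_sample(gsl_rng *r, size_t n,
                            double mu_x, double mu_y,
                            double sigma_x, double sigma_y, double rho) {
  gsl_matrix *sample = gsl_matrix_alloc(n, 2);
  for (size_t i = 0; i < n; i++) {
    double dx, dy;
    /* gsl_ran_bivariate_gaussian() returns a zero-mean pair,
     * hence the means are added by hand. */
    gsl_ran_bivariate_gaussian(r, sigma_x, sigma_y, rho, &dx, &dy);
    gsl_matrix_set(sample, i, 0, mu_x + dx);
    gsl_matrix_set(sample, i, 1, mu_y + dy);
  }
  return sample;
}
```

With a previously allocated `gsl_rng *r`, the signal sample would then be
obtained as `generate_sample(r, 800, 0, 0, 0.3, 0.3, 0.5)` and the noise
sample as `generate_sample(r, 1000, 4, 4, 1, 1, 0.4)`.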
Then, a classification model must be implemented in order to assign each point
to the class (signal or noise) to which it 'most probably' belongs. The point
is how 'most probably' is to be interpreted and implemented.

## Fisher linear discriminant

The Fisher linear discriminant (FLD) is a linear classification model based on
dimensionality reduction. It allows this 2D classification problem to be
reduced to a one-dimensional one, where the decision surface becomes a simple
threshold.

Consider the case of two classes (here, the signal and the noise): the
simplest representation of a linear discriminant is obtained by taking a
linear function of a sampled 2D point $x$, so that:

$$
  \hat{x} = w^T x + w_0
$$

where $w$ is called the 'weight vector' and $w_0$ is a bias. The negative of
the bias is called the 'threshold'. An input point $x$ is assigned to the
first class if $\hat{x} \geqslant 0$ and to the second one otherwise.
In general, the projection onto one dimension leads to a considerable loss of
information and classes that are well separated in the original 2D space may
strongly overlap in one dimension. However, by adjusting the components of the
weight vector, a projection that maximizes the class separation can be
selected.
To begin with, consider a two-class problem in which there are $N_1$ points of
class $C_1$ and $N_2$ points of class $C_2$, so that the means $m_1$ and $m_2$
of the two classes are given by:

$$
  m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
  \et
  m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$

The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:

$$
  \hat{m}_2 - \hat{m}_1 = w^T (m_2 - m_1)
$$

![The plot on the left shows samples from two classes along with the
histograms resulting from the projection onto the line joining the class
means: note that there is considerable overlap in the projected space. The
right plot shows the corresponding projection based on the Fisher linear
discriminant, with a greatly improved class
separation.](images/fisher.png){#fig:overlap}

This expression can be made arbitrarily large simply by increasing the
magnitude of $w$. To solve this problem, $w$ can be constrained to have unit
length, so that $|w|^2 = 1$. Using a Lagrange multiplier to perform the
constrained maximization, it can be found that $w \propto (m_2 - m_1)$.
There is still a problem with this approach, however, as illustrated in
@fig:overlap: the two classes are well separated in the original 2D space but
have considerable overlap when projected onto the line joining their means.
The idea is therefore to maximize a function that gives a large separation
between the projected class means while also giving a small variance within
each class, thereby minimizing the class overlap.
The within-class variance of the projected data of class $k$ is given by:

$$
  s_k^2 = \sum_{n \in C_k} (\hat{x}_n - \hat{m}_k)^2
$$

The total within-class variance for the whole data set can simply be defined
as $s^2 = s_1^2 + s_2^2$. The Fisher criterion is therefore defined as the
ratio of the squared distance between the projected class means to the total
within-class variance:

$$
  J(w) = \frac{(\hat{m}_2 - \hat{m}_1)^2}{s^2}
$$
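The quantities defined above can be computed directly from the two samples.
The sketch below evaluates the projected means, the within-class sums of
squares and $J(w)$ for a given weight vector; the helper names are
hypothetical and the $n \times 2$ `gsl_matrix` layout described earlier is
assumed.

```c
#include <gsl/gsl_matrix.h>

/* Project an n x 2 sample onto the direction w = (w[0], w[1]) and
 * return the mean of the projected values; if sum2 is not NULL, it
 * is filled with the within-class sum of squares s_k^2. */
double projected_stats(const gsl_matrix *sample, const double w[2],
                       double *sum2) {
  size_t n = sample->size1;
  double mean = 0.0;
  for (size_t i = 0; i < n; i++)
    mean += w[0] * gsl_matrix_get(sample, i, 0)
          + w[1] * gsl_matrix_get(sample, i, 1);
  mean /= (double)n;
  if (sum2 != NULL) {
    *sum2 = 0.0;
    for (size_t i = 0; i < n; i++) {
      double x_hat = w[0] * gsl_matrix_get(sample, i, 0)
                   + w[1] * gsl_matrix_get(sample, i, 1);
      *sum2 += (x_hat - mean) * (x_hat - mean);
    }
  }
  return mean;
}

/* Fisher criterion J(w) = (m2_hat - m1_hat)^2 / (s1^2 + s2^2). */
double fisher_criterion(const gsl_matrix *signal, const gsl_matrix *noise,
                        const double w[2]) {
  double s1, s2;
  double m1_hat = projected_stats(signal, w, &s1);
  double m2_hat = projected_stats(noise, w, &s2);
  double diff = m2_hat - m1_hat;
  return (diff * diff) / (s1 + s2);
}
```

The weight vector that maximizes $J(w)$ then defines the direction onto which
the points are projected before applying the threshold.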