analistica/notes/sections/7.md

6.6 KiB
Raw Blame History

Exercise 7

Generating points according to Gaussian distributions

The first task of exercise 7 is to generate two sets of 2D points $(x, y)$ according to two bivariate Gaussian distributions with parameters:


\text{signal} \quad
\begin{cases}
\mu = (0, 0)              \\
\sigma_x = \sigma_y = 0.3 \\
\rho = 0.5
\end{cases}
\et
\text{noise} \quad
\begin{cases}
\mu = (4, 4)              \\
\sigma_x = \sigma_y = 1   \\
\rho = 0.4
\end{cases}

where \mu stands for the mean, \sigma_x and \sigma_y are the standard deviations in x and y directions respectively and \rho is the bivariate correlation, hence:


  \sigma_{xy} = \rho \sigma_x \sigma_y

where \sigma_{xy} is the covariance of x and y.
In the code, default settings are N_s = 800 points for the signal and $N_n = 1000$ points for the noise but can be changed from the command-line. Both samples were handled as matrices of dimension n x 2, where n is the number of points in the sample. The library gsl_matrix provided by GSL was employed for this purpose and the function gsl_ran_bivariate_gaussian() was used for generating the points.

Assuming not to know how the points were generated, a model of classification must then be implemented in order to assign each point to the right class (signal or noise) to which it 'most probably' belongs to. The point is how 'most probably' can be interpreted and implemented.

Fisher linear discriminant

The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on dimensionality reduction. It allows to reduce this 2D classification problem into a one-dimensional decision surface.

Consider the case of two classes (in this case the signal and the noise): the simplest representation of a linear discriminant is obtained by taking a linear function of a sampled 2D point x so that:


  \hat{x} = w^T x

where w is the so-called 'weight vector'. An input point x is commonly assigned to the first class if \hat{x} \geqslant w_{th} and to the second one otherwise, where w_{th} is a threshold value somehow defined.
In general, the projection onto one dimension leads to a considerable loss of information and classes that are well separated in the original 2D space may become strongly overlapping in one dimension. However, by adjusting the components of the weight vector, a projection that maximizes the classes separation can be selected.
To begin with, consider N_1 points of class C_1 and N_2 points of class C_2, so that the means m_1 and m_2 of the two classes are given by:


  m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
  \et
  m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n

The simplest measure of the separation of the classes is the separation of the projected class means. This suggests to choose w so as to maximize:


  \hat{m}_2  \hat{m}_1 = w^T (m_2  m_1)

This expression can be made arbitrarily large simply by increasing the magnitude of w. To solve this problem, w can be constrained to have unit length, so that | w^2 | = 1. Using a Lagrange multiplier to perform the constrained maximization, it can be found that w \propto (m_2 m_1).

The plot on the left shows samples from two classes along with the histograms
resulting from projection onto the line joining the class means: note that
there is considerable overlap in the projected space. The right plot shows the
corresponding projection based on the Fisher linear discriminant, showing the
greatly improved classes separation.{#fig:overlap}

There is still a problem with this approach, however, as illustrated in @fig:overlap: the two classes are well separated in the original 2D space but have considerable overlap when projected onto the line joining their means.
The idea to solve it is to maximize a function that will give a large separation between the projected classes means while also giving a small variance within each class, thereby minimizing the class overlap.
The within-classes variance of the transformed data of each class k is given by:


  s_k^2 = \sum_{n \in C_k} (\hat{x}_n - \hat{m}_k)^2

The total within-classes variance for the whole data set can be simply defined as s^2 = s_1^2 + s_2^2. The Fisher criterion is therefore defined to be the ratio of the between-classes distance to the within-classes variance and is given by:


  J(w) = \frac{(\hat{m}_2 - \hat{m}_1)^2}{s^2}

Differentiating J(w) with respect to w, it can be found that it is maximized when:


  w = S_b^{-1} (m_2 - m_1)

where S_b is the covariance matrix, given by:


  S_b = S_1 + S_2

where S_1 and S_2 are the covariance matrix of the two classes, namely:


\begin{pmatrix}
\sigma_x^2  & \sigma_{xy} \\
\sigma_{xy} & \sigma_y^2
\end{pmatrix}

This is not truly a discriminant but rather a specific choice of direction for projection of the data down to one dimension: the projected data can then be used to construct a discriminant by choosing a threshold for the classification.

When implemented, the parameters given in @sec:sampling were used to compute the covariance matrices S_1 and S_2 of the two classes and their sum S. Then S, being a symmetrical and positive-definite matrix, was inverted with the Cholesky method, already discussed in @sec:MLM. Lastly, the matrix-vector product was computed with the gsl_blas_dgemv() function provided by GSL.

The threshold

The cut was fixed by the condition of conditional probability being the same for each class:


  t_{\text{cut}} = x \, | \hspace{20pt}
 \frac{P(c_1 | x)}{P(c_2 | x)} = 
 \frac{p(x | c_1) \, p(c_1)}{p(x | c_1) \, p(c_2)} = 1

where p(x | c_k) is the probability for point x along the Fisher projection line of belonging to the class k. If the classes are bivariate Gaussian, as in the present case, then p(x | c_k) is simply given by its projected normal distribution \mathscr{G} (\hat{μ}, \hat{S}). With a bit of math, the solution is then:


  t = \frac{b}{a} + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}

where:

  • a = \hat{S}_1^2 - \hat{S}_2^2
  • b = \hat{m}_2 \, \hat{S}_1^2 - \hat{M}_1 \, \hat{S}_2^2
  • $c = \hat{M}_2^2 , \hat{S}_1^2 - \hat{M}_1^2 , \hat{S}_2^2
    • 2 , \hat{S}_1^2 , \hat{S}_2^2 , \ln(\alpha)$
  • \alpha = p(c_1) / p(c_2)

The ratio of the prior probability \alpha was computed as:


  \alpha = \frac{N_s}{N_n}

The projection of the points was accomplished by the use of the function gsl_blas_ddot, which computed a dot product between two vectors, which in this case were the weight vector and the position of the point to be projected.