diff --git a/notes/sections/7.md b/notes/sections/7.md
index 446392e..26aed2d 100644
--- a/notes/sections/7.md
+++ b/notes/sections/7.md
@@ -1,51 +1,50 @@
 # Exercise 7
 
-## Generating points according to Gaussian distributions {#sec:sampling}
-
-Two sets of 2D points $(x, y)$ - signal and noise - is to be generated according
-to two bivariate Gaussian distributions with parameters:
-$$
-\text{signal} \quad
-\begin{cases}
-\mu = (0, 0) \\
-\sigma_x = \sigma_y = 0.3 \\
-\rho = 0.5
-\end{cases}
-\et
-\text{noise} \quad
-\begin{cases}
-\mu = (4, 4) \\
-\sigma_x = \sigma_y = 1 \\
-\rho = 0.4
-\end{cases}
-$$
+## Generating random points on the plane {#sec:sampling}
+
+Two sets of 2D points $(x, y)$ --- signal and noise --- are to be generated
+according to two bivariate Gaussian distributions with parameters:
+\begin{align*}
+  \text{signal}\:
+  \begin{cases}
+    \mu = (0, 0) \\
+    \sigma_x = \sigma_y = 0.3 \\
+    \rho = 0.5
+  \end{cases}
+  &&
+  \text{noise}\:
+  \begin{cases}
+    \mu = (4, 4) \\
+    \sigma_x = \sigma_y = 1 \\
+    \rho = 0.4
+  \end{cases}
+\end{align*}
 where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ for the standard
 deviations in $x$ and $y$ directions respectively and $\rho$ is the bivariate
 correlation, namely:
 $$
-  \sigma_{xy} = \rho \sigma_x \sigma_y
+  \sigma_{xy} = \rho\, \sigma_x \sigma_y
 $$
-
 where $\sigma_{xy}$ is the covariance of $x$ and $y$.
-In the code, default settings are $N_s = 800$ points for the signal and $N_n =
-1000$ points for the noise but can be customized from the input command-line.
-Both samples were handled as matrices of dimension $n$ x 2, where $n$ is the
-number of points in the sample. The library `gsl_matrix` provided by GSL was
-employed for this purpose and the function `gsl_ran_bivariate_gaussian()` was
-used for generating the points.
+
+In the programs, $N_s = 800$ points for the signal and $N_n = 1000$ points for
+the noise were chosen as defaults but can be customized from the command line.
+Both samples were stored as $n \times 2$ matrices, where $n$ is the number of
+points in the sample. The library `gsl_matrix` provided by GSL was employed for
+this purpose and the function `gsl_ran_bivariate_gaussian()` was used for
+generating the points.
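+
+As a reference, a minimal sketch of the sampling step is given below. The
+helper `generate_sample()` and its arguments are illustrative and do not
+necessarily mirror the actual structure of the programs.
+
+```c
+#include <gsl/gsl_matrix.h>
+#include <gsl/gsl_randist.h>
+#include <gsl/gsl_rng.h>
+
+/* Fill an n x 2 matrix with points drawn from a bivariate Gaussian centered
+ * in (mu_x, mu_y). gsl_ran_bivariate_gaussian() samples around the origin,
+ * so the means are added afterwards. */
+gsl_matrix *generate_sample(gsl_rng *r, size_t n,
+                            double mu_x, double mu_y,
+                            double sigma_x, double sigma_y, double rho) {
+  gsl_matrix *sample = gsl_matrix_alloc(n, 2);
+  for (size_t i = 0; i < n; i++) {
+    double x, y;
+    gsl_ran_bivariate_gaussian(r, sigma_x, sigma_y, rho, &x, &y);
+    gsl_matrix_set(sample, i, 0, mu_x + x);
+    gsl_matrix_set(sample, i, 1, mu_y + y);
+  }
+  return sample;
+}
+```
+
+With the default parameters, the two samples would then be obtained as
+`generate_sample(r, 800, 0, 0, 0.3, 0.3, 0.5)` for the signal and
+`generate_sample(r, 1000, 4, 4, 1, 1, 0.4)` for the noise, where `r` is a
+`gsl_rng` allocated beforehand, e.g. with `gsl_rng_alloc(gsl_rng_default)`
+after a call to `gsl_rng_env_setup()`.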
 
 An example of the two samples is shown in @fig:points.
 
 ![Example of points sampled according to the two Gaussian distributions
 with the given parameters.](images/7-points.pdf){#fig:points}
 
-Assuming not to know how the points were generated, a model of classification
-is then to be implemented in order to assign each point to the right class
-(signal or noise) to which it 'most probably' belongs to. The point is how
-'most probably' can be interpreted and implemented.
-Here, the Fisher linear discriminant and the Perceptron were implemented and
+Assuming the way the points were generated is unknown, a classification model,
+which assigns each point to the class it most probably belongs to, is to be
+implemented. Depending on how 'most probably' is interpreted, many different
+models can be developed.
+Here, the Fisher linear discriminant and the perceptron were implemented and
 described in the following two sections. The results are compared in
-@sec:7_results.
+@sec:class-results.
 
 ## Fisher linear discriminant
 
@@ -54,24 +53,27 @@ described in the following two sections. The results are compared in
 ### The projection direction
 
 The Fisher linear discriminant (FLD) is a linear classification model based on
-dimensionality reduction. It allows to reduce this 2D classification problem
-into a one-dimensional decision surface.
+dimensionality reduction. It does so by projecting the data onto the direction
+that best separates the classes, reducing the problem to a one-dimensional
+one. In the 2D case the projection is onto a line, and the classification
+reduces to simply selecting a threshold.
 
 Consider the case of two classes (in this case signal and noise): the simplest
 representation of a linear discriminant is obtained by taking a linear function
-$\hat{x}$ of a sampled 2D point $x$ so that:
+$\tilde{x}$ of a sampled 2D point $x$ so that:
 $$
-  \hat{x} = w^T x
+  \tilde{x} = w^T x
 $$
-
 where $w$ is the so-called 'weight vector' and $w^T$ stands for its transpose.
-An input point $x$ is commonly assigned to the first class if $\hat{x} \geqslant
+
+An input point $x$ is commonly assigned to the first class if $\tilde{x} >
 w_{th}$ and to the second one otherwise, where $w_{th}$ is a threshold value
-somehow defined. In general, the projection onto one dimension leads to a
+somehow defined. In general, the projection onto one dimension leads to a
 considerable loss of information and classes that are well separated in the
 original 2D space may become strongly overlapping in one dimension. However, by
 adjusting the components of the weight vector, a projection that maximizes the
-classes separation can be selected [@bishop06].
+class separation can be found [@bishop06].
+
 To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class
 $C_2$, so that the means $\mu_1$ and $\mu_2$ of the two classes are given by:
 $$
@@ -83,46 +85,45 @@ $$
 The simplest measure of the separation of the classes is the separation of the
 projected class means. This suggests to choose $w$ so as to maximize:
 $$
-  \hat{\mu}_2 − \hat{\mu}_1 = w^T (\mu_2 − \mu_1)
+  \tilde{\mu}_2 − \tilde{\mu}_1 = w^T (\mu_2 − \mu_1)
 $$
-This expression can be made arbitrarily large simply by increasing the magnitude
-of $w$. To solve this problem, $w$ can be constrained to have unit length, so
-that $| w^2 | = 1$. Using a Lagrange multiplier to perform the constrained
-maximization, it can be found that $w \propto (\mu_2 − \mu_1)$, meaning that the
-line onto the points must be projected is the one joining the class means.
+This expression can be made arbitrarily large simply by increasing the
+magnitude of $w$; fortunately, the problem is easily solved by requiring $w$
+to be normalized: $\|w\|^2 = 1$. Using a Lagrange multiplier to perform the
+constrained maximization, it can be found that $w \propto (\mu_2 − \mu_1)$,
+meaning that the line onto which the points must be projected is the one
+joining the class means.
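+
+Explicitly, introducing a Lagrange multiplier $\lambda$ for the constraint
+$\|w\|^2 = w^T w = 1$ (a standard step, sketched here only for reference),
+the function to maximize becomes:
+$$
+  L = w^T (\mu_2 - \mu_1) + \lambda \, (1 - w^T w)
+$$
+and setting its gradient with respect to $w$ to zero gives
+$(\mu_2 - \mu_1) - 2 \lambda \, w = 0$, that is $w \propto (\mu_2 - \mu_1)$.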
 
 There is still a problem with this approach, however, as illustrated in
 @fig:overlap: the two classes are well separated in the original 2D space but
 have considerable overlap when projected onto the line joining their means
 which maximize their projections distance.
 
 ![The plot on the left shows samples from two classes along with the
-histograms resulting fromthe projection onto the line joining the
-class means: note that there is considerable overlap in the projected
+histograms resulting from the projection onto the line joining the
+class means: note the considerable overlap in the projected
 space. The right plot shows the corresponding projection based on the
 Fisher linear discriminant, showing the greatly improved classes
-separation. Fifure from [@bishop06]](images/7-fisher.png){#fig:overlap}
+separation. Figure taken from [@bishop06]](images/7-fisher.png){#fig:overlap}
 
-The idea to solve it is to maximize a function that will give a large separation
-between the projected classes means while also giving a small variance within
-each class, thereby minimizing the class overlap.
-The within-class variance of the transformed data of each class $k$ is given
-by:
+The overlap of the projections can be reduced by maximizing a function that
+gives, besides a large separation, a small variance within each class. The
+within-class variance of the transformed data of each class $k$ is given by:
 $$
-  \hat{s}_k^2 = \sum_{n \in c_k} (\hat{x}_n - \hat{\mu}_k)^2
+  \tilde{s}_k^2 = \sum_{n \in c_k} (\tilde{x}_n - \tilde{\mu}_k)^2
 $$
 
 The total within-class variance for the whole data set is simply defined as
-$\hat{s}^2 = \hat{s}_1^2 + \hat{s}_2^2$. The Fisher criterion is defined to
-be the ratio of the between-class distance to the within-class variance and is
-given by:
+$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2$. The Fisher criterion is therefore
+defined to be the ratio of the between-class distance to the within-class
+variance and is given by:
 $$
-  F(w) = \frac{(\hat{\mu}_2 - \hat{\mu}_1)^2}{\hat{s}^2}
+  F(w) = \frac{(\tilde{\mu}_2 - \tilde{\mu}_1)^2}{\tilde{s}^2}
 $$
 
 The dependence on $w$ can be made explicit:
 \begin{align*}
-  (\hat{\mu}_2 - \hat{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
+  (\tilde{\mu}_2 - \tilde{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
   &= [w^T (\mu_2 - \mu_1)]^2 \\
   &= [w^T (\mu_2 - \mu_1)][w^T (\mu_2 - \mu_1)] \\
   &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w]
@@ -131,9 +132,9 @@ The dependence on $w$ can be made explicit:
 where $M$ is the between-distance matrix. Similarly, as regards the
 denominator:
 \begin{align*}
-  \hat{s}^2 &= \hat{s}_1^2 + \hat{s}_2^2 = \\
-  &= \sum_{n \in c_1} (\hat{x}_n - \hat{\mu}_1)^2
-     + \sum_{n \in c_2} (\hat{x}_n - \hat{\mu}_2)^2
+  \tilde{s}^2 &= \tilde{s}_1^2 + \tilde{s}_2^2 \\
+  &= \sum_{n \in c_1} (\tilde{x}_n - \tilde{\mu}_1)^2
+     + \sum_{n \in c_2} (\tilde{x}_n - \tilde{\mu}_2)^2
   = w^T \Sigma_w w
 \end{align*}
 
@@ -162,7 +163,7 @@ Differentiating with respect to $w$, it can be found that $F(w)$ is
 maximized when:
 $$
   w = \Sigma_w^{-1} (\mu_2 - \mu_1)
-$$
+$$ {#eq:fisher-weight}
 
 This is not truly a discriminant but rather a specific choice of the direction
 for projection of the data down to one dimension: the projected data can then be
@@ -178,63 +179,73 @@ with the `gsl_blas_dgemv()` function provided by GSL.
 
 ### The threshold
 
-The threshold $t_{\text{cut}}$ was fixed by the condition of conditional
+The threshold $t_{\text{cut}}$ was fixed by the requirement of the conditional
 probability $P(c_k | t_{\text{cut}})$ being the same for both classes $c_k$:
 $$
-  t_{\text{cut}} = x \, | \hspace{20pt}
+  t_{\text{cut}} = x \text{ such that} \quad
   \frac{P(c_1 | x)}{P(c_2 | x)} =
   \frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
 $$
 
 where $P(x | c_k)$ is the probability for point $x$ along the Fisher projection
-line of being sampled according to the class $k$. If each class is a bivariate
-Gaussian, as in the present case, then $P(x | c_k)$ is simply given by its
-projected normal distribution with mean $\hat{m} = w^T m$ and variance $\hat{s}
-= w^T S w$, being $S$ the covariance matrix of the class.
-With a bit of math, the following solution can be found:
+line of being sampled from class $k$. If $\tilde{x} > t_\text{cut}$, then $x$
+most probably belongs to $c_1$, otherwise to $c_2$.
+
+If each class is a bivariate Gaussian distribution, as in the present case,
+then $P(x | c_k)$ is simply given by its projected normal distribution with
+mean $\tilde{\mu} = w^T \mu$ and variance $\tilde{s}^2 = w^T S w$, where $S$ is
+the covariance matrix of the class.
+After some algebra, the threshold is found to be:
 $$
   t_{\text{cut}} = \frac{b}{a} + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
 $$
-
 where:
- - $a = \hat{s}_1^2 - \hat{s}_2^2$
- - $b = \hat{\mu}_2 \, \hat{s}_1^2 - \hat{\mu}_1 \, \hat{s}_2^2$
- - $c = \hat{\mu}_2^2 \, \hat{s}_1^2 - \hat{\mu}_1^2 \, \hat{s}_2^2
-       - 2 \, \hat{s}_1^2 \, \hat{s}_2^2 \, \ln(\alpha)$
+ - $a = \tilde{s}_1^2 - \tilde{s}_2^2$
+ - $b = \tilde{\mu}_2 \, \tilde{s}_1^2 - \tilde{\mu}_1 \, \tilde{s}_2^2$
+ - $c = \tilde{\mu}_2^2 \, \tilde{s}_1^2 - \tilde{\mu}_1^2 \, \tilde{s}_2^2
+       - 2 \, \tilde{s}_1^2 \, \tilde{s}_2^2 \, \ln(\alpha)$
  - $\alpha = P(c_1) / P(c_2)$
 
-The ratio of the prior probabilities $\alpha$ is simply given by:
+In a simulation, the ratio of the prior probabilities $\alpha$ can
+simply be set to:
 $$
   \alpha = \frac{N_s}{N_n}
 $$
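+
+A minimal sketch of the threshold computation is given below. It assumes that
+the projected means $\tilde{\mu}_k$ and standard deviations $\tilde{s}_k$ of
+the two classes have already been computed; the function name and its
+arguments are illustrative.
+
+```c
+#include <math.h>
+
+/* Threshold along the Fisher projection, given the projected means and
+ * standard deviations of the two classes and the prior ratio
+ * alpha = P(c1)/P(c2). It assumes the two projected variances differ,
+ * so that a != 0. */
+double fisher_threshold(double mu1, double mu2,
+                        double s1, double s2, double alpha) {
+  double a = s1 * s1 - s2 * s2;
+  double b = mu2 * s1 * s1 - mu1 * s2 * s2;
+  double c = mu2 * mu2 * s1 * s1 - mu1 * mu1 * s2 * s2
+           - 2 * s1 * s1 * s2 * s2 * log(alpha);
+  return b / a + sqrt((b / a) * (b / a) - c / a);
+}
+```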
 
 The projection of the points was accomplished by the use of the function
-`gsl_blas_ddot()`, which computes the element wise product between two vectors.
+`gsl_blas_ddot()`, which computes the dot product of two vectors.
 
-Results obtained for the same samples in @fig:points are shown in
-@fig:fisher_proj. The weight vector and the treshold were found to be:
+Results obtained for the same samples in @fig:points are shown below in
+@fig:fisher-proj. The weight vector and the threshold were found to be:
 $$
   w = (0.707, 0.707) \et t_{\text{cut}} = 1.323
 $$
-
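+
+With these values, each sampled point can then be assigned to a class as in
+the following minimal sketch, where `w` and `x` are `gsl_vector`s holding the
+weight vector and the point, and the helper name is illustrative:
+
+```c
+#include <gsl/gsl_blas.h>
+#include <gsl/gsl_vector.h>
+
+/* Project a 2D point x onto the Fisher direction w and compare the result
+ * with the threshold, following the rule given above:
+ * w^T x > t_cut -> class c1, otherwise class c2. */
+int fisher_classify(const gsl_vector *w, const gsl_vector *x, double t_cut) {
+  double proj;
+  gsl_blas_ddot(w, x, &proj);   /* proj = w^T x */
+  return proj > t_cut;          /* 1 if assigned to c1, 0 otherwise */
+}
+```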