# Exercise 7

## Generating points according to Gaussian distributions {#sec:sampling}

The first task of exercise 7 is to generate two sets of 2D points $(x, y)$ according to two bivariate Gaussian distributions with parameters:

$$
  \text{signal} \quad
  \begin{cases}
    \mu = (0, 0) \\
    \sigma_x = \sigma_y = 0.3 \\
    \rho = 0.5
  \end{cases}
  \et
  \text{noise} \quad
  \begin{cases}
    \mu = (4, 4) \\
    \sigma_x = \sigma_y = 1 \\
    \rho = 0.4
  \end{cases}
$$

where $\mu$ stands for the mean, $\sigma_x$ and $\sigma_y$ are the standard deviations in the $x$ and $y$ directions respectively and $\rho$ is the bivariate correlation, hence:

$$
  \sigma_{xy} = \rho \sigma_x \sigma_y
$$

where $\sigma_{xy}$ is the covariance of $x$ and $y$.

In the code, the default settings are $N_s = 800$ points for the signal and $N_n = 1000$ points for the noise, but they can be changed from the command line. Both samples were handled as matrices of dimension $n \times 2$, where $n$ is the number of points in the sample. The `gsl_matrix` type provided by GSL was employed for this purpose and the function `gsl_ran_bivariate_gaussian()` was used for generating the points. An example of the two samples is shown in @fig:points.

![Example of points sampled according to the two Gaussian distributions with the given parameters. Noise points in pink and signal points in yellow.](images/points.pdf){#fig:points}

Assuming no knowledge of how the points were generated, a classification model must then be implemented in order to assign each point to the class (signal or noise) to which it 'most probably' belongs. The question is how 'most probably' can be interpreted and implemented.

## Fisher linear discriminant

### The projection direction

The Fisher linear discriminant (FLD) is a linear classification model based on dimensionality reduction. It allows this 2D classification problem to be reduced to a one-dimensional decision problem.
Consider the case of two classes (here, the signal and the noise): the simplest representation of a linear discriminant is obtained by taking a linear function of a sampled 2D point $x$ so that:

$$
  \hat{x} = w^T x
$$

where $w$ is the so-called 'weight vector'. An input point $x$ is commonly assigned to the first class if $\hat{x} \geqslant w_{th}$ and to the second one otherwise, where $w_{th}$ is a suitably defined threshold value. In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original 2D space may become strongly overlapping in one dimension. However, by adjusting the components of the weight vector, a projection that maximizes the class separation can be selected.

To begin with, consider $N_1$ points of class $C_1$ and $N_2$ points of class $C_2$, so that the means $m_1$ and $m_2$ of the two classes are given by:

$$
  m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n
  \et
  m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$

The simplest measure of the separation of the classes is the separation of the projected class means. This suggests choosing $w$ so as to maximize:

$$
  \hat{m}_2 - \hat{m}_1 = w^T (m_2 - m_1)
$$

This expression can be made arbitrarily large simply by increasing the magnitude of $w$. To solve this problem, $w$ can be constrained to have unit length, so that $|w|^2 = 1$. Using a Lagrange multiplier to perform the constrained maximization, it can be found that $w \propto (m_2 - m_1)$.

![The plot on the left shows samples from two classes along with the histograms resulting from projection onto the line joining the class means: note that there is considerable overlap in the projected space.
The right plot shows the corresponding projection based on the Fisher linear discriminant, showing the greatly improved class separation.](images/fisher.png){#fig:overlap}

There is still a problem with this approach, however, as illustrated in @fig:overlap: the two classes are well separated in the original 2D space but have considerable overlap when projected onto the line joining their means. The idea is to maximize a function that gives a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap.

The within-class variance of the transformed data of each class $k$ is given by:

$$
  s_k^2 = \sum_{n \in C_k} (\hat{x}_n - \hat{m}_k)^2
$$

The total within-class variance for the whole data set can be simply defined as $s^2 = s_1^2 + s_2^2$. The Fisher criterion is therefore defined as the ratio of the squared between-class distance to the within-class variance and is given by:

$$
  J(w) = \frac{(\hat{m}_2 - \hat{m}_1)^2}{s^2}
$$

Differentiating $J(w)$ with respect to $w$, it can be found that it is maximized when:

$$
  w = S^{-1} (m_2 - m_1)
$$

where $S$ is the total within-class covariance matrix, given by:

$$
  S = S_1 + S_2
$$

where $S_1$ and $S_2$ are the covariance matrices of the two classes, each of the form:

$$
  \begin{pmatrix}
    \sigma_x^2 & \sigma_{xy} \\
    \sigma_{xy} & \sigma_y^2
  \end{pmatrix}
$$

This is not truly a discriminant but rather a specific choice of direction for the projection of the data down to one dimension: the projected data can then be used to construct a discriminant by choosing a threshold for the classification.

When implemented, the parameters given in @sec:sampling were used to compute the covariance matrices $S_1$ and $S_2$ of the two classes and their sum $S$. Then $S$, being a symmetric and positive-definite matrix, was inverted with the Cholesky method, already discussed in @sec:MLM.
Lastly, the matrix-vector product was computed with the `gsl_blas_dgemv()` function provided by GSL.

### The threshold

The cut was fixed by requiring the posterior probabilities of the two classes to be equal:

$$
  t_{\text{cut}} = x \, | \hspace{20pt}
  \frac{P(c_1 | x)}{P(c_2 | x)} =
  \frac{p(x | c_1) \, p(c_1)}{p(x | c_2) \, p(c_2)} = 1
$$

where $p(x | c_k)$ is the probability density for a point of class $k$ to be found at $x$ along the Fisher projection line. If the classes are bivariate Gaussian, as in the present case, then $p(x | c_k)$ is simply given by the projected normal distribution $\mathscr{G} (\hat{M}_k, \hat{S}_k)$, where $\hat{M}_k$ and $\hat{S}_k$ are the mean and standard deviation of class $k$ projected along $w$. Solving the resulting equation for $x$ gives:

$$
  t_{\text{cut}} = \frac{b}{a}
    + \sqrt{\left( \frac{b}{a} \right)^2 - \frac{c}{a}}
$$

where:

- $a = \hat{S}_1^2 - \hat{S}_2^2$
- $b = \hat{M}_2 \, \hat{S}_1^2 - \hat{M}_1 \, \hat{S}_2^2$
- $c = \hat{M}_2^2 \, \hat{S}_1^2 - \hat{M}_1^2 \, \hat{S}_2^2 - 2 \, \hat{S}_1^2 \, \hat{S}_2^2 \, \ln(\alpha)$
- $\alpha = p(c_1) / p(c_2)$

The ratio of the prior probabilities $\alpha$ was computed as:

$$
  \alpha = \frac{N_s}{N_n}
$$

The projection of the points was accomplished with the function `gsl_blas_ddot()`, which computes the dot product of two vectors: in this case, the weight vector and the position of the point to be projected.
<div id="fig:fisher_proj">
![View from above of the samples.](images/fisher-plane.pdf){height=5.7cm}
![Gaussian of the samples on the projection line.](images/fisher-proj.pdf){height=5.7cm}

Aerial and lateral views of the projection direction, in blue, and the cut, in red.
</div>
Results obtained for the same sample as in @fig:points are shown in @fig:fisher_proj. The weight vector $w$ was found to be:

$$
  w = (0.707, 0.707)
$$

and $t_{\text{cut}}$ lies at a distance of 1.323 from the origin of the axes. Hence, as can be seen, the vector $w$ turned out to be parallel to the line joining the means of the two classes (recall that these are $(0, 0)$ and $(4, 4)$). This happens because both classes have $\sigma_x = \sigma_y$, so that the direction $(1, 1)$ joining the means is an eigenvector of the total covariance matrix $S$.

## Perceptron

In machine learning, the perceptron is an algorithm for supervised learning of linear binary classifiers.

Supervised learning is the machine learning task of inferring a function $f$ that maps an input $x$ to an output $f(x)$ based on a set of training input-output pairs. Each example is a pair consisting of an input object and an output value. The inferred function can then be used for mapping new examples: the algorithm should generalize so as to correctly determine the class labels of unseen instances.

The aim is to determine the weight vector $w$ and the bias $b$ such that the threshold function $f(x)$ satisfies:

$$
  f(x) = x \cdot w + b \hspace{20pt}
  \begin{cases}
    \geqslant 0 \incase x \in \text{signal} \\
    < 0 \incase x \in \text{noise}
  \end{cases}
$$ {#eq:perc}

The training was performed as follows. The initial values were set to $w = (0,0)$ and $b = 0$. Starting from these, the perceptron iteratively improves their estimates. The sample was passed point by point through an iterative procedure, repeated $N_c$ times: each time, the projection $w \cdot x$ of the point was computed and then the variable $\Delta$ was defined as:

$$
  \Delta = r \, [e - \theta(o)]
$$

where:

- $r$ is the learning rate of the perceptron: it is between 0 and 1. The larger $r$, the more volatile the weight changes. In the code, it was set to $r = 0.8$;
- $e$ is the expected value, namely 0 if $x$ is noise and 1 if it is signal;
- $\theta$ is the Heaviside theta function;
- $o$ is the observed value of $f(x)$ defined in @eq:perc.

Then $b$ and $w$ are updated as:

$$
  b \to b + \Delta
  \et
  w \to w + x \Delta
$$
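
The training loop described above can be sketched in plain C as follows. This is a minimal illustration under stated assumptions, not the code used for the exercise: the names `point2`, `perceptron`, `heaviside`, `perceptron_f` and `perceptron_train` are hypothetical, and the Heaviside function is taken to be 1 at zero so that it matches the $\geqslant 0$ convention of @eq:perc.

```c
#include <assert.h>
#include <stddef.h>

/* A 2D point and the perceptron state (weight vector and bias). */
typedef struct { double x, y; } point2;
typedef struct { double w[2]; double b; } perceptron;

/* Heaviside step, assumed to be 1 at zero (f(x) >= 0 means signal). */
int heaviside(double v) { return v >= 0.0 ? 1 : 0; }

/* Threshold function f(x) = x . w + b. */
double perceptron_f(const perceptron *p, point2 pt) {
    return pt.x * p->w[0] + pt.y * p->w[1] + p->b;
}

/* Train on n labelled points (expected[i] is 1 for signal, 0 for noise),
 * passing over the whole sample n_calls times with learning rate r. */
void perceptron_train(perceptron *p, const point2 *pts, const int *expected,
                      size_t n, double r, size_t n_calls) {
    assert(p != NULL && pts != NULL && expected != NULL);
    assert(r > 0.0 && r <= 1.0);

    /* Initial values: w = (0, 0) and b = 0. */
    p->w[0] = p->w[1] = 0.0;
    p->b = 0.0;

    for (size_t call = 0; call < n_calls; call++) {
        for (size_t i = 0; i < n; i++) {
            /* Delta = r * (e - theta(f(x))): zero when the point is
             * already classified correctly, +r or -r otherwise. */
            double delta = r * (expected[i] - heaviside(perceptron_f(p, pts[i])));
            p->b    += delta;
            p->w[0] += pts[i].x * delta;
            p->w[1] += pts[i].y * delta;
        }
    }
}
```

Since $\Delta$ vanishes for correctly classified points, only misclassified points move $w$ and $b$; once a full pass produces no change, further passes leave the result untouched.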
<div id="fig:percep_proj">
![View from above of the samples.](images/percep-plane.pdf){height=5.7cm}
![Gaussian of the samples on the projection line.](images/percep-proj.pdf){height=5.7cm}

Aerial and lateral views of the projection direction, in blue, and the cut, in red.
</div>
It can be shown that, for linearly separable classes, this method converges to the desired function. As stated in the previous section, the weight vector must finally be normalized.

With $N_c = 5$, the values of $w$ and $t_{\text{cut}}$ are stable to the third digit. The following results were obtained:

$$
  w = (0.654, 0.756) \et t_{\text{cut}} = 1.213
$$

where, once again, $t_{\text{cut}}$ is measured from the origin of the axes. In this case, the projection line does not lie along the line joining the means of the two samples. The plots are shown in @fig:percep_proj.

## Efficiency test

A program was implemented in order to check the validity of the two aforementioned methods. A number $N_t$ of test samples was generated and the points were divided into the two classes according to the selected method. At each iteration, false positives and false negatives were recorded using the running statistics method implemented in the `gsl_rstat` library, which is suitable for large datasets that are inconvenient to store in memory all at once.

For each sample, the numbers $N_{fn}$ and $N_{fp}$ of false negatives and false positives were computed as follows. For every noise point $x_n$, the function $f(x_n)$ was computed with the weight vector $w$ and the $t_{\text{cut}}$ given by the employed method, then:

- if $f(x) < 0 \thus$ $N_{fp} \to N_{fp}$
- if $f(x) > 0 \thus$ $N_{fp} \to N_{fp} + 1$

The signal points were treated similarly, increasing $N_{fn}$ whenever a signal point was assigned to the noise.

Finally, the mean and the standard deviation were computed from the $N_{fn}$ and $N_{fp}$ obtained for every sample in order to get the mean purity $\alpha$ and efficiency $\beta$ for the employed statistics:

$$
  \alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s}
  \et
  \beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$

Results for $N_t = 500$ are shown in @tbl:res_comp.
As can be observed, the Fisher method gives a nearly perfect assignment of the points to the class they belong to, with a symmetric distribution of false negatives and false positives, whereas the points divided by the perceptron show slightly more false positives than false negatives, and the results vary more from dataset to dataset. The reason is that the Fisher linear discriminant is an exact analytical result, whereas the perceptron relies on an iterative procedure whose limit is only approached, never exactly reached.

-------------------------------------------------------------------------------------------
            $\alpha$            $\sigma_{\alpha}$   $\beta$             $\sigma_{\beta}$
----------- ------------------- ------------------- ------------------- -------------------
Fisher      0.9999              0.33                0.9999              0.33

Perceptron  0.9999              0.28                0.9995              0.64
-------------------------------------------------------------------------------------------

Table: Results for the Fisher and perceptron methods. $\sigma_{\alpha}$ and $\sigma_{\beta}$ stand for the standard deviations of the false negatives and false positives respectively. {#tbl:res_comp}