From 7c830a7ba02407c54955445cac8ca72238777b12 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Gi=C3=B9=20Marcer?=
Date: Wed, 3 Jun 2020 22:46:17 +0200
Subject: [PATCH] ex-7: review

---
 notes/sections/7.md | 58 +++++++++++++--------------------------------
 1 file changed, 16 insertions(+), 42 deletions(-)

diff --git a/notes/sections/7.md b/notes/sections/7.md
index 847b0f0..6ad4e9d 100644
--- a/notes/sections/7.md
+++ b/notes/sections/7.md
@@ -25,7 +25,7 @@ correlation, namely:
$$
  \sigma_{xy} = \rho\, \sigma_x \sigma_y
$$
-where $\sigma_{xy}$ is the covariance of $x$ and $y$. 
+where $\sigma_{xy}$ is the covariance of $x$ and $y$.
In the programs, $N_s = 800$ points for the signal and $N_n = 1000$ points for
the noise were chosen as default but can be customized from the command-line.
@@ -40,11 +40,10 @@ with the given parameters.](images/7-points.pdf){#fig:points}

Assuming not to know how the points were generated, a model of classification,
which assigns to each point the 'most probable' class it belongs to,
-is implemented. Depending on the interpretation of 'most probable' many
-different models can be developed.
-Here, the Fisher linear discriminant and the perceptron were implemented and
-described in the following two sections. The results are compared in
-@sec:class-results.
+was implemented. Depending on the interpretation of 'most probable', many
+different models can be developed. Here, the Fisher linear discriminant and the
+perceptron were implemented and described in the following two sections. The
+results are compared in @sec:class-results.

## Fisher linear discriminant

@@ -81,13 +80,11 @@ $$
  \et
  \mu_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n
$$
-
The simplest measure of the separation of the classes is the separation of the
projected class means. This suggests choosing $w$ so as to maximize:
$$
  \tilde{\mu}_2 − \tilde{\mu}_1 = w^T (\mu_2 − \mu_1)
$$
-
This expression can be made arbitrarily large simply by increasing the
magnitude of $w$; fortunately, the problem is easily solved by requiring $w$
to be normalised: $|w|^2 = 1$. Using a Lagrange multiplier to perform the
@@ -112,15 +109,13 @@ within-class variance of the transformed data of each class $k$ is given by:
$$
  \tilde{s}_k^2 = \sum_{n \in c_k} (\tilde{x}_n - \tilde{\mu}_k)^2
$$
-
The total within-class variance for the whole data set is simply defined as
-$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2$. The Fisher criterion is therfore
+$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2$. The Fisher criterion is therefore
defined to be the ratio of the between-class distance to the within-class
variance and is given by:
$$
  F(w) = \frac{(\tilde{\mu}_2 - \tilde{\mu}_1)^2}{\tilde{s}^2}
$$
-
The dependence on $w$ can be made explicit:
\begin{align*}
  (\tilde{\mu}_2 - \tilde{\mu}_1)^2 &= (w^T \mu_2 - w^T \mu_1)^2 \\
@@ -129,7 +124,6 @@ The dependence on $w$ can be made explicit:
  &= [w^T (\mu_2 - \mu_1)][(\mu_2 - \mu_1)^T w] = w^T M w
\end{align*}
-
where $M$ is the between-distance matrix. Similarly, as regards the denominator:
\begin{align*}
  \tilde{s}^2 &= \tilde{s}_1^2 + \tilde{s}_2^2 = \\
@@ -137,7 +131,6 @@ where $M$ is the between-distance matrix.
Similarly, as regards the denominator:
   + \sum_{n \in c_2} (\tilde{x}_n - \tilde{\mu}_2)^2 = w^T \Sigma_w w
\end{align*}
-
where $\Sigma_w$ is the total within-class covariance matrix:
\begin{align*}
  \Sigma_w &= \sum_{n \in c_1} (x_n − \mu_1)(x_n − \mu_1)^T
@@ -152,19 +145,16 @@ where $\Sigma_w$ is the total within-class covariance matrix:
    \sigma_{xy} & \sigma_y^2
  \end{pmatrix}_2
\end{align*}
-
where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the two samples.
The Fisher criterion can therefore be rewritten in the form:
$$
  F(w) = \frac{w^T M w}{w^T \Sigma_w w}
$$
-
Differentiating with respect to $w$, it can be found that $F(w)$ is maximized
when:
$$
  w = \Sigma_w^{-1} (\mu_2 - \mu_1)
$$ {#eq:fisher-weight}
-
This is not truly a discriminant but rather a specific choice of the direction
for projection of the data down to one dimension: the projected data can then
be used to construct a discriminant by choosing a threshold for the
@@ -182,11 +172,10 @@ with the `gsl_blas_dgemv()` function provided by GSL.
The threshold $t_{\text{cut}}$ was fixed by the conditional probability
$P(c_k | t_{\text{cut}})$ being the same for both classes $c_k$:
$$
-  t_{\text{cut}} = x \text{ such that}\quad
-  \frac{P(c_1 | x)}{P(c_2 | x)} = 
+  t_{\text{cut}} = x \quad \text{such that}\quad
+  \frac{P(c_1 | x)}{P(c_2 | x)} =
  \frac{P(x | c_1) \, P(c_1)}{P(x | c_2) \, P(c_2)} = 1
$$
-
where $P(x | c_k)$ is the probability for point $x$ along the Fisher projection
line of being sampled from $c_k$. If $\tilde{x} > t_\text{cut}$, then $x$ more
likely belongs to $c_1$; otherwise, to $c_2$.
@@ -213,7 +202,6 @@ simply be set to:
$$
  \alpha = \frac{N_s}{N_n}
$$
-
The projection of the points was accomplished by the use of the function
`gsl_blas_ddot()`, which computes a fast dot product of two vectors.
@@ -228,7 +216,7 @@ $$
![Scatter plot of the samples.](images/7-fisher-plane.pdf)
![Histogram of the Fisher-projected samples.](images/7-fisher-proj.pdf)

-Aerial and lateral views of the samples. Projection line in blu and cut in red.
+Aerial and lateral views of the samples. Projection line in blue and cut in red.

:::

@@ -237,9 +225,9 @@ Aerial and lateral views of the samples. Projection line in blu and cut in red.

This section is really a sidenote which grew too large to fit in a margin, so
it can be safely skipped.

-It can be seen that the weight vector turned out to parallel to the line
+It can be seen that the weight vector turned out to be parallel to the line
joining the means of the two classes (as a reminder: $(0, 0)$ and $(4, 4)$),
-as if the within-class covariances were ignored. Strange!
+as if the within-class covariances were ignored. Weird!

Looking at @eq:fisher-weight, one can be misled into thinking that the inverse
of the total covariance matrix $\Sigma_w$ is (proportional to) the identity,
@@ -257,7 +245,6 @@ $$
    \sigma_{xy} & \sigma_x^2
  \end{pmatrix}_2
$$
-
$\Sigma_w$ takes the symmetrical form
$$
  \Sigma_w = \begin{pmatrix}
@@ -288,7 +275,6 @@ $$
    \tilde{B} & \tilde{A}
  \end{pmatrix}
$$
-
For this reason, $\Sigma_w$ and $\Sigma_w^{-1}$ share the same eigenvectors
$v_1$ and $v_2$:
$$
  v_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
  \et
  v_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$
-
The vector joining the means is clearly a multiple of $v_2$, and so is $w$.

@@ -317,7 +302,6 @@ value: it is expected to return 1 for signal points and 0 for noise points:
$$
  f(x) = \theta(w^T x + b)
$$ {#eq:perc}
-
where $\theta$ is the Heaviside theta function. Note that the bias $b$ is
$-t_\text{cut}$, as defined in the previous section.
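For concreteness, a minimal plain-C sketch of @eq:perc together with the update rule discussed right below could read as follows; every name and number in it is illustrative only and is not taken from the actual exercise code:

```c
#include <stdio.h>

/* Sketch of f(x) = theta(w^T x + b) for 2D points:
 * returns 1 ("signal") if w.x + b > 0, otherwise 0 ("noise"). */
int activation(const double w[2], double b, const double x[2]) {
    return (w[0] * x[0] + w[1] * x[1] + b > 0) ? 1 : 0;
}

/* One update step: delta = r * (e - f(x)), then b -> b + delta
 * and w -> w + delta * x, as in the update rule described below. */
void update(double w[2], double *b, const double x[2], int e, double r) {
    double delta = r * (e - activation(w, *b, x));
    *b   += delta;
    w[0] += delta * x[0];
    w[1] += delta * x[1];
}

int main(void) {
    double w[2] = {0.0, 0.0}, b = 0.0;   /* start from zero weights */
    const double x[2] = {4.1, 3.9};      /* a made-up signal-like point */
    update(w, &b, x, 1, 0.8);            /* expected class e = 1, rate r = 0.8 */
    printf("w = (%g, %g), b = %g\n", w[0], w[1], b);
    return 0;
}
```

Repeating such updates over all the points, for a few iterations, is all the training described in the next paragraphs amounts to.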
@@ -330,7 +314,6 @@ defined as:
$$
  \Delta = r [e - f(x)]
$$
-
where:
- $r \in [0, 1]$ is the learning rate of the perceptron: the larger $r$, the
@@ -340,13 +323,11 @@ where:
    noise;
is used to update $b$ and $w$:
-
$$
  b \to b + \Delta
  \et
  w \to w + \Delta x
$$
-
To see how it works, consider the four possible situations:
- $e = 1 \quad \wedge \quad f(x) = 1 \quad \dot \vee \quad e = 0 \quad \wedge
@@ -369,7 +350,6 @@ $$
  = w^T \cdot x + \Delta |x|^2 = w^T \cdot x - |x|^2 \leq w^T \cdot x
$$
-
Similarly for the case with $e = 1$ and $f(x) = 0$.

![Weight vector and threshold value obtained with the perceptron method as a
@@ -385,18 +365,14 @@ separable, it can be shown (see [@novikoff63]) that this method converges
to the coveted function.
As in the previous section, once found, the weight vector is to be normalized.

-With $N = 5$ iterations, the values of $w$ and $t_{\text{cut}}$ level off up to the third
-digit. The following results were obtained:
-
Different values of the learning rate were tested, all giving the same result,
-converging for a number $N = 3$ of iterations. In @fig:percep-iterations, results are
-shown for $r = 0.8$: as can be seen, for $N = 3$, the values of $w$ and
-$t^{\text{cut}}$ level off.
+converging after $N = 3$ iterations. In @fig:percep-iterations,
+results are shown for $r = 0.8$: as can be seen, for $N = 3$, the values of $w$
+and $t_{\text{cut}}$ level off. The following results were obtained:
$$
  w = (-0.654, -0.756) \et t_{\text{cut}} = 1.213
$$
-
In this case, the projection line is not exactly parallel to the line joining
the means of the two samples. Plots in @fig:percep-proj.
@@ -404,7 +380,7 @@ the means of the two samples. Plots in @fig:percep-proj.
![Scatter plot of the samples.](images/7-percep-plane.pdf)
![Histogram of the projected samples.](images/7-percep-proj.pdf)

-Aerial and lateral views of the samples. Projection line in blu and cut in red.
+Aerial and lateral views of the samples. Projection line in blue and cut in red.

:::

@@ -419,18 +395,16 @@ false negative and false positive were obtained this way: for every noise
point $x_n$, the threshold function $f(x_n)$ was computed, then:
  - if $f(x) = 0 \thus$ $N_{fn} \to N_{fn}$
-  - if $f(x) \neq 0 \thus$ $N_{fn} \to N_{fn} + 1$ 
+  - if $f(x) \neq 0 \thus$ $N_{fn} \to N_{fn} + 1$
and similarly for the positive points. Finally, the mean and standard deviation
were computed from $N_{fn}$ and $N_{fp}$ for every sample and used to estimate
the purity $\alpha$ and efficiency $\beta$ of the classification:
-
$$
  \alpha = 1 - \frac{\text{mean}(N_{fn})}{N_s}
  \et
  \beta = 1 - \frac{\text{mean}(N_{fp})}{N_n}
$$
-
Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the Fisher
discriminant gives a nearly perfect classification with a symmetric
distribution of false negatives and false positives, whereas the perceptron
shows a little more
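For a single sample, the two estimates above can be sketched in plain C as follows; the sizes, the weights and the convention that $N_{fn}$ counts misclassified signal points while $N_{fp}$ counts misclassified noise points are illustrative assumptions, not the exercise code:

```c
#include <stdio.h>

/* Same threshold classifier as before: returns 1 = signal, 0 = noise. */
static int classify(const double w[2], double b, const double x[2]) {
    return (w[0] * x[0] + w[1] * x[1] + b > 0) ? 1 : 0;
}

int main(void) {
    /* Toy sample sizes (the exercise uses N_s = 800 and N_n = 1000). */
    enum { ns = 3, nn = 3 };
    const double signal[ns][2] = {{4.0, 4.0}, {3.5, 4.2}, {4.3, 3.8}};
    const double noise [nn][2] = {{0.0, 0.0}, {0.5, -0.3}, {-0.2, 0.4}};
    const double w[2] = {0.707, 0.707};   /* made-up weights and bias */
    const double b = -2.83;

    /* Count signal points classified as noise and noise points
     * classified as signal. */
    int n_fn = 0, n_fp = 0;
    for (int i = 0; i < ns; i++) n_fn += (classify(w, b, signal[i]) == 0);
    for (int i = 0; i < nn; i++) n_fp += (classify(w, b, noise[i]) == 1);

    /* alpha = 1 - N_fn/N_s and beta = 1 - N_fp/N_n for this one sample;
     * averaging over the N_t samples gives the quoted estimates. */
    double alpha = 1.0 - (double)n_fn / ns;
    double beta  = 1.0 - (double)n_fp / nn;
    printf("alpha = %.3f, beta = %.3f\n", alpha, beta);
    return 0;
}
```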