ex-7: improve efficiency/purity section

Michele Guerini Rocco 2020-07-05 21:23:20 +02:00
parent b7e1857862
commit 747f2f4335
Signed by: rnhmjoj
GPG Key ID: BFBAF4C975F76450


@@ -391,24 +391,28 @@
was generated and the points were classified applying both methods. To avoid
storing large datasets in memory, at each iteration, false positives and
false negatives were recorded using a running statistics method implemented in
the `gsl_rstat` library. For each sample, the numbers $N_{fn}$ and $N_{fp}$ of
false negatives and false positives were obtained in this way: for every signal
point $x_s$, the threshold function $f(x_s)$ was computed, then:
- if $f(x_s) = 1 \thus$ $N_{fn} \to N_{fn}$
- if $f(x_s) = 0 \thus$ $N_{fn} \to N_{fn} + 1$
and similarly, for the noise points:
- if $f(x_n) = 1 \thus$ $N_{fp} \to N_{fp} + 1$
- if $f(x_n) = 0 \thus$ $N_{fp} \to N_{fp}$
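This bookkeeping can be sketched in C; the following is a minimal
illustration, not the repository's actual code: the names `count_errors`,
`classify`, `stat_fn` and `stat_fp` are hypothetical, and points are taken as
scalars for brevity.

```c
#include <stddef.h>
#include <gsl/gsl_rstat.h>

/* Hypothetical sketch: count the classification errors of one sample
 * and feed them to the running statistics. `classify` stands for the
 * threshold function f: it returns 1 for points classified as signal,
 * 0 otherwise. */
void count_errors(const double *signal, size_t N_s,
                  const double *noise,  size_t N_n,
                  int (*classify)(double),
                  gsl_rstat_workspace *stat_fn,
                  gsl_rstat_workspace *stat_fp) {
  size_t n_fn = 0, n_fp = 0;

  /* signal points classified as noise are false negatives */
  for (size_t i = 0; i < N_s; i++)
    if (classify(signal[i]) == 0) n_fn++;

  /* noise points classified as signal are false positives */
  for (size_t i = 0; i < N_n; i++)
    if (classify(noise[i]) == 1) n_fp++;

  /* accumulate the counts without storing the whole dataset */
  gsl_rstat_add((double)n_fn, stat_fn);
  gsl_rstat_add((double)n_fp, stat_fp);
}
```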
Finally, the mean and standard deviation were computed from $N_{fn}$ and
$N_{fp}$ for every sample and used to estimate the significance $\alpha$
and false-positive rate $\beta$ of the classification:
$$
\alpha = \frac{\text{mean}(N_{fn})}{N_s} \et
\beta = \frac{\text{mean}(N_{fp})}{N_n}
$$
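Concretely, once all the samples have been processed, the estimates follow
directly from the accumulated statistics. A sketch continuing the hypothetical
workspaces above (`report`, `stat_fn` and `stat_fp` are again illustrative
names):

```c
#include <stdio.h>
#include <gsl/gsl_rstat.h>

/* Estimate the error rates and their spread from the running
 * statistics accumulated over all the samples. */
void report(gsl_rstat_workspace *stat_fn,
            gsl_rstat_workspace *stat_fp,
            double N_s, double N_n) {
  double alpha = gsl_rstat_mean(stat_fn) / N_s;  /* false-negative rate */
  double beta  = gsl_rstat_mean(stat_fp) / N_n;  /* false-positive rate */
  printf("1-α = %.4f  σ = %.2f\n", 1 - alpha, gsl_rstat_sd(stat_fn));
  printf("1-β = %.4f  σ = %.2f\n", 1 - beta,  gsl_rstat_sd(stat_fp));
}
```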
Results for $N_t = 500$ are shown in @tbl:res_comp. As can be seen, the Fisher
discriminant gives a nearly perfect classification with a symmetric distribution
of false negatives and false positives, whereas the perceptron shows slightly
more false positives than false negatives and is also more variable from dataset to
dataset.
A possible explanation of this fact is that, for linearly separable and normally
distributed points, the Fisher linear discriminant is an exact analytical
@@ -416,13 +420,13 @@
solution, the most powerful one according to the Neyman-Pearson lemma, whereas
the perceptron is only expected to converge to a solution and is therefore
more subject to random fluctuations.
-------------------------------------------------------
              $1-α$    $σ_{1-α}$      $1-β$    $σ_{1-β}$
----------- ---------- ---------- ---------- ----------
Fisher        0.9999      0.33      0.9999      0.33

Perceptron    0.9999      0.28      0.9995      0.64
-------------------------------------------------------
Table: Results for the Fisher and perceptron methods. $\sigma_{1-\alpha}$ and
$\sigma_{1-\beta}$ stand for the standard deviation of the false