ex-1: review

### Kolmogorov-Smirnov test

In order to compare the sample with the Landau distribution, the
Kolmogorov-Smirnov (KS) test was applied. This test can be used to
statistically quantify the distance between the cumulative distribution
function of the Landau distribution and that of the sample. The null
hypothesis is that the sample was drawn from the reference distribution.
The KS statistic for a given cumulative distribution function $F(x)$ is:
$$
D_N = \text{sup}_x |F_N(x) - F(x)|
$$
where:

- $x$ runs over the sample,
- $F(x)$ is the Landau cumulative distribution function,
- $F_N(x)$ is the empirical cumulative distribution function of the sample.
If $N$ numbers were generated, for every point $x$, $F_N(x)$ is simply given by
the number of points preceding the point (itself included) normalized by $N$,
once the sample is sorted in ascending order.
$F(x)$ was computed numerically from the Landau distribution with a maximum
relative error of $10^{-6}$, using the function `gsl_integration_qagiu()`,
found in GSL.
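
As an illustration, the computation of $D_N$ could be implemented along these
lines, where `landau_cdf()` is a hypothetical wrapper around the numerical
integration described above (names and error handling are assumptions, not the
code used in the analysis):

```c
#include <stdlib.h>
#include <math.h>

/* Hypothetical wrapper: Landau CDF computed numerically, as described above. */
double landau_cdf(double x);

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* KS statistic D_N = sup_x |F_N(x) - F(x)| for a sample of size n. */
double ks_statistic(double *sample, size_t n) {
    qsort(sample, n, sizeof(double), cmp_double);
    double d_n = 0;
    for (size_t i = 0; i < n; i++) {
        double f = landau_cdf(sample[i]);
        /* the empirical CDF jumps from i/n to (i+1)/n at each sorted point */
        double below = fabs(f - (double)i / n);
        double above = fabs(f - (double)(i + 1) / n);
        if (below > d_n) d_n = below;
        if (above > d_n) d_n = above;
    }
    return d_n;
}
```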
Under the null hypothesis, the statistic $\sqrt{N}D_N$ is expected to
asymptotically approach a Kolmogorov distribution:
$$
\sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K
$$
where $K$ is the Kolmogorov variable, with cumulative distribution function
given by [@marsaglia03]:
$$
P(K \leqslant K_0) = 1 - p = \frac{\sqrt{2 \pi}}{K_0}
  \sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / 8 K_0^2}
$$
Plugging the observed value $\sqrt{N}D_N$ in $K_0$, the $p$-value can be
computed. At 95% confidence level (which is the probability of accepting the
null hypothesis when it is correct), the compatibility with the Landau
distribution is accepted if $p > 0.05$.
For $N = 50000$, the following results were obtained:

- $D = 0.004$
- $p = 0.38$

Hence, the data were reasonably sampled from a Landau distribution.
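
For reference, the $p$-value computation from the series above can be sketched
as follows (the truncation threshold of the series is an arbitrary choice):

```c
#include <math.h>

/* p-value of the KS test: K_0 = sqrt(N)*D_N plugged into the Kolmogorov
 * CDF series of [@marsaglia03], truncated once the terms become negligible. */
double ks_pvalue(double d_n, size_t n) {
    double k0 = sqrt((double)n) * d_n;
    double sum = 0;
    for (int j = 1; j <= 100; j++) {
        double term = exp(-pow((2*j - 1) * M_PI, 2) / (8 * k0 * k0));
        sum += term;
        if (term < 1e-16) break;   /* the series converges very quickly */
    }
    return 1 - sqrt(2 * M_PI) / k0 * sum;
}
```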
**Note**:
Contrary to what one would expect, the $\chi^2$ test on a histogram is not very
useful in this case. For the test to be significant, the data have to be binned
such that at least several points fall in each bin. However, it can be seen
in @fig:landau that many bins are empty on both the right and left sides of the
distribution, so it would be necessary to fit only the region where the points
tolerance of $10^{-7}$, giving:
$$
\text{expected mode: } m_e = \num{-0.22278298 \pm 0.00000006}
$$
This is a minimization algorithm that begins with a bounded region known to
contain a minimum. The region is described by a lower bound $x_\text{min}$ and
an upper bound $x_\text{max}$, with an estimate of the location of the minimum
$x_e$. The value of the function at $x_e$ must be less than the value of the
function at the ends of the interval, in order to guarantee that a minimum is
contained somewhere within the interval:
$$
f(x_\text{min}) > f(x_e) < f(x_\text{max})
$$
On each iteration the function is interpolated by a parabola passing through
the points $x_\text{min}$, $x_e$, $x_\text{max}$ and the minimum is computed as
the vertex of the parabola. If this point is found to be inside the interval,
it is taken as a guess for the true minimum; otherwise the method falls back to
a golden section (using the ratio $(3 - \sqrt{5})/2 \approx 0.3819660$, proven
to be optimal) of the interval. The value of the function at this new point
$x'$ is calculated. In any case, if the new point is a better estimate of the
minimum, namely if $f(x') < f(x_e)$, then the current estimate of the minimum
is updated.
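
The GSL minimiser implementing this combination of parabolic interpolation and
golden sections is `gsl_min_fminimizer_brent`. A minimal sketch of how the mode
search could be set up (the bracket $[-1, 1]$ and initial guess are
illustrative assumptions):

```c
#include <stdio.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_min.h>
#include <gsl/gsl_randist.h>

/* Minimising the negative PDF is equivalent to finding the mode. */
double neg_landau_pdf(double x, void *params) {
    (void)params;
    return -gsl_ran_landau_pdf(x);
}

int main(void) {
    gsl_function F = { .function = &neg_landau_pdf, .params = NULL };

    /* bracket assumed to satisfy f(x_min) > f(x_e) < f(x_max) */
    double x_e = 0, x_min = -1, x_max = 1;

    gsl_min_fminimizer *s = gsl_min_fminimizer_alloc(gsl_min_fminimizer_brent);
    gsl_min_fminimizer_set(s, &F, x_e, x_min, x_max);

    int status = GSL_CONTINUE;
    for (int iter = 0; iter < 100 && status == GSL_CONTINUE; iter++) {
        gsl_min_fminimizer_iterate(s);
        x_e   = gsl_min_fminimizer_x_minimum(s);
        x_min = gsl_min_fminimizer_x_lower(s);
        x_max = gsl_min_fminimizer_x_upper(s);
        /* stop once the bracketing interval is narrower than the tolerance */
        status = gsl_min_test_interval(x_min, x_max, 1e-7, 0);
    }

    printf("expected mode: %.8f\n", x_e);
    gsl_min_fminimizer_free(s);
    return 0;
}
```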
The new point allows the size of the bounded interval to be reduced, by choosing
the most compact set of points which satisfies the constraint $f(a) > f(x') <
f(b)$ among $f(x_\text{min})$, $f(x_\text{max})$ and $f(x_e)$. The interval is
Once the sample is reduced to less than three points, the mode is computed as the
average. The special case $n=3$ is dealt with by averaging the two closest
points [@robertson74].
To obtain a better estimate of the mode and its error, the above procedure was
bootstrapped. The original sample was treated as a population and used to build
100 other samples of the same size, by *sampling with replacement*. For each one
of the new samples, the above statistic was computed. By simply taking the
mean of these statistics, the following estimate was obtained:
$$
\text{observed mode: } m_o = \num{-0.29 \pm 0.19}
$$
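
A minimal sketch of the resampling loop, assuming a `mode_estimate()` helper
implementing the statistic described above (the helper name is hypothetical):

```c
#include <stdlib.h>
#include <gsl/gsl_rng.h>

/* Hypothetical helper: the mode statistic described above. */
double mode_estimate(const double *sample, size_t n);

/* Bootstrap: build 100 resampled datasets (sampling with replacement)
 * and average the statistic over them. */
double bootstrap_mode(const gsl_rng *r, const double *sample, size_t n) {
    enum { RUNS = 100 };
    double *boot = malloc(n * sizeof(double));
    double sum = 0;
    for (int run = 0; run < RUNS; run++) {
        for (size_t i = 0; i < n; i++)
            boot[i] = sample[gsl_rng_uniform_int(r, n)];
        sum += mode_estimate(boot, n);
    }
    free(boot);
    return sum / RUNS;   /* mean of the bootstrapped statistics */
}
```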
In order to compare the values $m_e$ and $m_o$, the following compatibility
$t$-test was applied:
$$
p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
t = \frac{|m_e - m_o|}{\sqrt{\sigma_e^2 + \sigma_o^2}}
$$
where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
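
This test translates directly into code, for instance using the C standard
library `erf()`:

```c
#include <math.h>

/* Compatibility between two estimates m1 ± s1 and m2 ± s2:
 * p > 0.05 means compatible at 95% confidence level. */
double compatibility(double m1, double s1, double m2, double s2) {
    double t = fabs(m1 - m2) / sqrt(s1 * s1 + s2 * s2);
    return 1 - erf(t / sqrt(2));
}
```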
In this case:
calculate an approximation of the integral:
$$
I(x) = \int_x^{+\infty} f(t)dt
$$
[^1]: This is neither necessary nor the easiest way: it was chosen simply
      because the quantile had already been implemented and was initially
      used for reverse sampling.
accuracy requirement:
$$
|\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel}I)
$$
where the absolute and relative tolerances $\varepsilon_\text{abs}$ and
$\varepsilon_\text{rel}$ were set to \num{1e-10} and \num{1e-6},
respectively.
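
Putting the pieces together, the numerical CDF might be computed along these
lines (the workspace size is an arbitrary choice):

```c
#include <gsl/gsl_integration.h>
#include <gsl/gsl_randist.h>

/* Landau PDF wrapped in the gsl_function interface. */
double landau_pdf(double t, void *params) {
    (void)params;
    return gsl_ran_landau_pdf(t);
}

/* I(x) = integral of f from x to +inf over a semi-infinite interval (QAGIU). */
double landau_tail(double x) {
    gsl_integration_workspace *w = gsl_integration_workspace_alloc(1000);
    gsl_function F = { .function = &landau_pdf, .params = NULL };
    double result, abserr;
    gsl_integration_qagiu(&F, x,
                          1e-10,    /* absolute tolerance */
                          1e-6,     /* relative tolerance */
                          1000, w, &result, &abserr);
    gsl_integration_workspace_free(w);
    return result;
}
```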
done by solving the equation:
$$
p(x) = p_0
$$
for $x$, given a probability value $p_0$, where $p(x)$ is the CDF. The (unique)
root of this equation was found by a root-finding routine
(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method.
The following condition was checked for convergence:
$$
|a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
$$
where $a,b$ are the current interval bounds. The condition immediately gives an
upper bound on the error of the root as $\varepsilon = |a-b|$. The tolerances
here were set to 0 and \num{1e-3}.
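
A sketch of the quantile computation with `gsl_root_fsolver_brent` follows; the
bracket `[lo, hi]` is assumed to contain the root, and `landau_cdf()` is the
hypothetical numerical CDF from before:

```c
#include <gsl/gsl_errno.h>
#include <gsl/gsl_roots.h>

double landau_cdf(double x);   /* hypothetical numerical CDF, as above */

struct qpar { double p0; };

/* The quantile is the (unique) root of p(x) - p0 = 0. */
double quantile_eq(double x, void *params) {
    struct qpar *q = (struct qpar *)params;
    return landau_cdf(x) - q->p0;
}

double landau_quantile(double p0, double lo, double hi) {
    struct qpar q = { p0 };
    gsl_function F = { .function = &quantile_eq, .params = &q };

    gsl_root_fsolver *s = gsl_root_fsolver_alloc(gsl_root_fsolver_brent);
    gsl_root_fsolver_set(s, &F, lo, hi);

    int status = GSL_CONTINUE;
    for (int iter = 0; iter < 100 && status == GSL_CONTINUE; iter++) {
        gsl_root_fsolver_iterate(s);
        lo = gsl_root_fsolver_x_lower(s);
        hi = gsl_root_fsolver_x_upper(s);
        /* |a - b| < eps_abs + eps_rel * min(|a|, |b|), with 0 and 1e-3 */
        status = gsl_root_test_interval(lo, hi, 0, 1e-3);
    }

    double root = gsl_root_fsolver_root(s);
    gsl_root_fsolver_free(s);
    return root;
}
```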
The result of the numerical computation is:
$$
\text{expected median: } m_e = \num{1.3557804 \pm 0.0000091}
$$
while the sample median, obtained again by bootstrapping, was found to be:
$$
\text{observed median: } m_o = \num{1.3605 \pm 0.0062}
$$
As stated above, the median is less sensitive to extreme values than the mode:
this made the result much more precise. Applying again the aforementioned
$t$-test to this statistic:
$$
f_{\text{max}} = f(m_e) \et \text{FWHM} = x_+ - x_- \with
f(x_\pm) = \frac{f_\text{max}}{2}
$$
The function derivative $f'(x)$ was minimized using the same minimization method
used for finding $m_e$. Once $f_\text{max}$ was known, the equation:
$$
f(x) = \frac{f_\text{max}}{2}
$$
was solved by performing the Brent-Dekker method in the ranges $[x_\text{min},
m_e]$ and $[m_e, x_\text{max}]$, where $x_\text{min}$ and $x_\text{max}$ are
the first and last sampled points respectively, once all the points are sorted
in ascending order. This led to the two solutions $x_\pm$.
With a relative tolerance of \num{1e-7}, the following result was obtained:
$$
\text{expected FWHM: } w_e = \num{4.0186457 \pm 0.0000001}
$$
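
The half-maximum search can be sketched by reusing the same Brent-Dekker solver
on both sides of the mode; here `brent_solve()` stands for a wrapper around the
solver loop shown earlier, and `landau_pdf_at()` is a hypothetical evaluation
of $f$:

```c
#include <gsl/gsl_roots.h>

double landau_pdf_at(double x);                            /* hypothetical */
double brent_solve(gsl_function *F, double lo, double hi); /* loop as above */

struct hpar { double half; };

/* f(x) - f_max/2 vanishes exactly at the two solutions x_-, x_+. */
double half_max_eq(double x, void *params) {
    struct hpar *h = (struct hpar *)params;
    return landau_pdf_at(x) - h->half;
}

double fwhm(double f_max, double m_e, double x_min, double x_max) {
    struct hpar h = { f_max / 2 };
    gsl_function F = { .function = &half_max_eq, .params = &h };
    double x_minus = brent_solve(&F, x_min, m_e);   /* left solution  */
    double x_plus  = brent_solve(&F, m_e, x_max);   /* right solution */
    return x_plus - x_minus;
}
```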
On the other hand, obtaining a good estimate of the FWHM from a sample is much
more difficult. In principle, it could be measured by binning the data and
applying the definition to the discretized values; however, this yields very
poor results and depends on a completely arbitrary parameter: the bin width.
A more refined method to construct a nonparametric empirical PDF from
the sample is kernel density estimation (KDE). This method consists in
estimating the density as an average of kernel functions centred on the
sampled points:
$$
f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
  \mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
$$
where $\mathcal{N}$ is the kernel and the parameter $\varepsilon$, called
*bandwidth*, controls the strength of the smoothing. This parameter can be
determined in several ways. For simplicity, it was chosen to use Silverman's
rule of thumb [@silverman86], which gives:
$$
\varepsilon = 0.63 \, S_N
  \left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
$$
where the $0.63$ factor was chosen to compensate for the distortion that
systematically reduces the height of the peaks, which affects the estimation of
the mode, and:

- $S_N$ is the sample standard deviation;
- $d$ is the number of dimensions, in this case $d=1$.
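
A sketch of the resulting estimator, using a Gaussian kernel and the Silverman
bandwidth in $d = 1$ (function names are illustrative):

```c
#include <math.h>

/* Silverman bandwidth for d = 1, with the 0.63 correction factor:
 * eps = 0.63 * S_N * (3N/4)^(-1/5)                                  */
double silverman_bandwidth(double sd, size_t n) {
    return 0.63 * sd * pow(0.75 * (double)n, -0.2);
}

/* Gaussian-kernel density estimate f_eps(x) over a sample of size n. */
double kde(double x, const double *sample, size_t n, double eps) {
    double sum = 0;
    for (size_t i = 0; i < n; i++) {
        double u = (x - sample[i]) / eps;
        sum += exp(-0.5 * u * u) / sqrt(2 * M_PI);   /* standard normal */
    }
    return sum / (n * eps);
}
```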
With the empirical density estimation at hand, the FWHM can be computed by the
to estimate the standard error, giving:
$$
\text{observed FWHM: } w_o = \num{4.06 \pm 0.08}
$$
Applying the $t$-test to these two values gives

- $t=0.495$