ex-1: review
parent aa676cf9a9 · commit 59fba30c12
function and plotted in a 100-bin histogram ranging from -20 to

### Kolmogorov-Smirnov test

In order to compare the sample with the Landau distribution, the
Kolmogorov-Smirnov (KS) test was applied. This test can be used to
statistically quantify the distance between the cumulative distribution
function of the Landau distribution and that of the sample. The null
hypothesis is that the sample was drawn from the reference distribution.
The KS statistic for a given cumulative distribution function $F(x)$ is:
$$
D_N = \text{sup}_x |F_N(x) - F(x)|
$$

where:

- $x$ runs over the sample,
- $F(x)$ is the Landau cumulative distribution function,
- $F_N(x)$ is the empirical cumulative distribution function of the sample.

If $N$ numbers were generated, for every point $x$, $F_N(x)$ is simply given by
the number of points preceding the point (itself included) normalized by $N$,
once the sample is sorted in ascending order.
$F(x)$ was computed numerically from the Landau distribution with a maximum
relative error of $10^{-6}$, using the function `gsl_integration_qagiu()`,
found in GSL.
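The ECDF construction just described can be sketched in a few lines of C. This is an illustrative snippet, not the report's code: `ks_statistic` and `uniform_cdf` are hypothetical names, and the uniform CDF merely stands in for the numerically integrated Landau CDF.

```c
#include <math.h>
#include <stdlib.h>

/* Ascending comparison for qsort(). */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Illustrative reference CDF: uniform on [0, 1]. The report instead
 * integrates the Landau PDF numerically. */
double uniform_cdf(double x)
{
    return x < 0 ? 0 : (x > 1 ? 1 : x);
}

/* D_N = sup_x |F_N(x) - F(x)|: sort the sample, then note that at the
 * i-th sorted point the ECDF jumps from i/N to (i+1)/N, so both sides
 * of the jump must be checked against the reference CDF. */
double ks_statistic(double *sample, size_t n, double (*cdf)(double))
{
    qsort(sample, n, sizeof(double), cmp_double);
    double d = 0;
    for (size_t i = 0; i < n; i++) {
        double f  = cdf(sample[i]);
        double lo = fabs(f - (double)i / n);
        double hi = fabs((double)(i + 1) / n - f);
        if (lo > d) d = lo;
        if (hi > d) d = hi;
    }
    return d;
}
```

Checking both sides of each ECDF jump matters: the supremum is attained at a sample point, but it may sit just before or just after the step.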
asymptotically approach a Kolmogorov distribution:
$$
\sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K
$$

where $K$ is the Kolmogorov variable, with cumulative distribution function
given by [@marsaglia03]:
$$
P(K \leqslant K_0) = 1 - p = \frac{\sqrt{2 \pi}}{K_0}
\sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / 8 K_0^2}
$$

Plugging the observed value $\sqrt{N}D_N$ in $K_0$, the $p$-value can be
computed. At 95% confidence level (which is the probability of accepting the
null hypothesis when it is correct) the compatibility with the Landau
distribution
For $N = 50000$, the following results were obtained:

- $D = 0.004$
- $p = 0.38$

Hence, the data were reasonably sampled from a Landau distribution.

**Note**:
Contrary to what one would expect, the $\chi^2$ test on a histogram is not very
useful in this case. For the test to be significant, the data have to be binned
such that at least several points fall in each bin. However, it can be seen
in @fig:landau that many bins are empty on both the right and left sides of the
distribution, so it would be necessary to fit only the region where the points
tolerance of $10^{-7}$, giving:
$$
\text{expected mode: } m_e = \num{-0.22278298 \pm 0.00000006}
$$

This is a minimization algorithm that begins with a bounded region known to
contain a minimum. The region is described by a lower bound $x_\text{min}$ and
an upper bound $x_\text{max}$, with an estimate of the location of the minimum
$x_e$. The value of the function at $x_e$ must be less than the value of the
function at the ends of the interval, in order to guarantee that a minimum is
contained somewhere within the interval:
$$
f(x_\text{min}) > f(x_e) < f(x_\text{max})
$$

On each iteration the function is interpolated by a parabola passing through
the points $x_\text{min}$, $x_e$, $x_\text{max}$ and the minimum is computed
as the vertex of the parabola. If this point is found to be inside the
interval, it is taken as a guess for the true minimum; otherwise the method
falls back to a golden section of the interval (using the ratio
$(3 - \sqrt{5})/2 \approx 0.3819660$, proven to be optimal). The value of the
function at this new point $x'$ is calculated. In any case, if the new point
is a better estimate of the minimum, namely if $f(x') < f(x_e)$, then the
current estimate of the minimum is updated.

The new point allows the size of the bounded interval to be reduced, by
choosing the most compact set of points which satisfies the constraint
$f(a) > f(x') < f(b)$ among $f(x_\text{min})$, $f(x_\text{max})$ and
$f(x_e)$. The interval is
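The golden-section fallback can be sketched as follows. This is an illustrative snippet, not GSL's implementation (GSL's `gsl_min_fminimizer_brent` interleaves it with the parabolic step described above); `golden_min` and `parab` are hypothetical names.

```c
#include <math.h>

/* Pure golden-section minimization: shrink [a, b] by the ratio
 * (3 - sqrt(5))/2 ≈ 0.3819660 per iteration, always keeping an
 * interior point with a lower function value than both ends. */
double golden_min(double (*f)(double), double a, double b, double tol)
{
    const double r = (3 - sqrt(5.0)) / 2;       /* ≈ 0.3819660 */
    double x1 = a + r * (b - a), x2 = b - r * (b - a);
    double f1 = f(x1), f2 = f(x2);
    while (b - a > tol) {
        if (f1 < f2) {      /* minimum lies in [a, x2] */
            b = x2; x2 = x1; f2 = f1;
            x1 = a + r * (b - a); f1 = f(x1);
        } else {            /* minimum lies in [x1, b] */
            a = x1; x1 = x2; f1 = f2;
            x2 = b - r * (b - a); f2 = f(x2);
        }
    }
    return 0.5 * (a + b);
}

/* Example function with its minimum at x = 1. */
static double parab(double x) { return (x - 1) * (x - 1); }
```

The ratio is "optimal" in the sense that each new interior point reuses one old evaluation: after shrinking, the surviving point sits exactly where the next golden-section point is needed, so only one new function call is made per iteration.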
Once the sample is reduced to fewer than three points, the mode is computed as
the average. The special case $n=3$ is dealt with by averaging the two closest
points [@robertson74].

To obtain a better estimate of the mode and its error, the above procedure was
bootstrapped. The original sample was treated as a population and used to
build 100 other samples of the same size, by *sampling with replacement*. For
each one of the new samples, the above statistic was computed. By simply
taking the
mean of these statistics, the following estimate was obtained:
$$
\text{observed mode: } m_o = \num{-0.29 \pm 0.19}
$$

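The resampling step of the bootstrap can be sketched as below. This is an illustration, not the report's code: `bootstrap_resample` is a hypothetical name, and the C library's `rand()` stands in for whatever generator the report actually uses.

```c
#include <stdlib.h>
#include <stddef.h>

/* One bootstrap resample: draw n values from the original data
 * (treated as a population) uniformly, with replacement. The report
 * repeats this 100 times and averages the statistic computed on each
 * resample to estimate its value and standard error. */
void bootstrap_resample(const double *data, size_t n, double *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = data[rand() % n];   /* modulo bias is negligible here */
}
```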
In order to compare the values $m_e$ and $m_o$, the following compatibility
$t$-test was applied:
$$
p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
t = \frac{|m_e - m_o|}{\sqrt{\sigma_e^2 + \sigma_o^2}}
$$

where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
In this case:
calculate an approximation of the integral:
$$
I(x) = \int_x^{+\infty} f(t)dt
$$

[^1]: This is neither necessary nor the easiest way: it was chosen simply
    because the quantile had already been implemented and was initially
    used for reverse sampling.
accuracy requirement:
$$
|\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel}I)
$$

where the absolute and relative tolerances $\varepsilon_\text{abs}$ and
$\varepsilon_\text{rel}$ were set to \num{1e-10} and \num{1e-6},
respectively.
done by solving the equation:
$$
p(x) = p_0
$$

for $x$, given a probability value $p_0$, where $p(x)$ is the CDF. The
(unique) root of this equation was found by a root-finding routine
(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method.
The following condition was checked for convergence:
$$
|a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
$$

where $a$ and $b$ are the current interval bounds. The condition immediately
gives an upper bound on the error of the root as $\varepsilon = |a - b|$. The
tolerances here were set to 0 and \num{1e-3}.
The result of the numerical computation is:
$$
\text{expected median: } m_e = \num{1.3557804 \pm 0.0000091}
$$

while the sample median, obtained again by bootstrapping, was found to be:
$$
\text{observed median: } m_o = \num{1.3605 \pm 0.0062}
$$

As stated above, the median is less sensitive to extreme values than the mode:
this led to a much more precise result. Applying again the aforementioned
$t$-test to this statistic:
$$
f_{\text{max}} = f(m_e) \et \text{FWHM} = x_+ - x_- \with
f(x_\pm) = \frac{f_\text{max}}{2}
$$

The function derivative $f'(x)$ was minimized using the same minimization
method used for finding $m_e$. Once $f_\text{max}$ was known, the equation:
$$
f(x) = \frac{f_\text{max}}{2}
$$

was solved by performing the Brent-Dekker method (described before) in the
ranges $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$, where $x_\text{min}$
and $x_\text{max}$ are the first and last sampled points respectively, once
all the points are sorted in ascending order. This led to the two solutions
$x_\pm$. With a relative tolerance of \num{1e-7}, the following result was
obtained:
$$
\text{expected FWHM: } w_e = \num{4.0186457 \pm 0.0000001}
$$
On the other hand, obtaining a good estimate of the FWHM from a sample is much
more difficult. In principle, it could be measured by binning the data and
applying the definition to the discretized values; however, this yields very
poor results and depends on a completely arbitrary parameter: the bin width.
A more refined method to construct a nonparametric empirical PDF from the
sample is kernel density estimation (KDE). This method consists in
$$
f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
\mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
$$

where $\mathcal{N}$ is the kernel and the parameter $\varepsilon$, called
*bandwidth*, controls the strength of the smoothing. This parameter can be
determined in several ways. For simplicity, it was chosen to use Silverman's
rule of thumb [@silverman86], which gives:
$$
\varepsilon = 0.63 \, S_N
\left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
$$

where the $0.63$ factor was chosen to compensate for the distortion that
systematically reduces the peak height, which affects the estimation of the
mode, and:

- $S_N$ is the sample standard deviation;
- $d$ is the number of dimensions, in this case $d=1$.

With the empirical density estimation at hand, the FWHM can be computed by the
to estimate the standard error, giving:
$$
\text{observed FWHM: } w_o = \num{4.06 \pm 0.08}
$$

Applying the $t$-test to these two values gives

- $t=0.495$