# Exercise 1
## Random numbers following the Landau distribution
The Landau distribution is a probability density function which can be defined
as follows:
$$
f(x) = \frac{1}{\pi} \int \limits_{0}^{+ \infty} dt \, e^{-t \ln(t) - xt} \sin (\pi t)
$$
![Landau distribution.](images/landau-small.pdf){width=50%}
The GNU Scientific Library (GSL) provides a number of functions for generating
random variates following tens of probability distributions. Thus, the function
for generating numbers from the Landau distribution, namely `gsl_ran_landau()`,
was used.
For the purpose of visualizing the resulting sample, the data was put into
a histogram and plotted with matplotlib. The result is shown in @fig:landau.
![Example of $N$ points generated with the `gsl_ran_landau()`
function and plotted in a 100-bin histogram ranging from -10 to
80.](images/landau-hist.png){#fig:landau}
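For reference, a minimal sketch of this sampling step (the sample size,
generator choice and binning are assumptions matching the figure and the tests
below):

```c
/* Minimal sketch: draw points from the Landau distribution with GSL
 * and bin them. Compile with: gcc sample.c -lgsl -lgslcblas -lm */
#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_histogram.h>

int main(void) {
  const size_t N = 1000;                /* sample size (assumed)        */
  gsl_rng_env_setup();                  /* honour GSL_RNG_* variables   */
  gsl_rng *r = gsl_rng_alloc(gsl_rng_default);

  gsl_histogram *h = gsl_histogram_alloc(100);    /* 100 bins...        */
  gsl_histogram_set_ranges_uniform(h, -10, 80);   /* ...from -10 to 80  */

  for (size_t i = 0; i < N; i++)
    gsl_histogram_increment(h, gsl_ran_landau(r));

  gsl_histogram_fprintf(stdout, h, "%g", "%g");   /* bin edges, counts  */

  gsl_histogram_free(h);
  gsl_rng_free(r);
  return 0;
}
```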
## Randomness testing of the generated sample
### Kolmogorov-Smirnov test
In order to compare the sample with the Landau distribution, the
Kolmogorov-Smirnov (KS) test was applied. This test statistically quantifies the
distance between the cumulative distribution function of the Landau distribution
and the one of the sample. The null hypothesis is that the sample was
drawn from the reference distribution.
The KS statistic for a given cumulative distribution function $F(x)$ is:
$$
D_N = \sup_x |F_N(x) - F(x)|
$$
where:
- $x$ runs over the sample,
- $F(x)$ is the Landau cumulative distribution function, and
- $F_N(x)$ is the empirical cumulative distribution function of the sample.
If $N$ numbers have been generated, once the sample is sorted in ascending
order, $F_N(x)$ at every point $x$ is simply given by the number of sample
points not exceeding $x$ (the point itself included), normalized by $N$.
$F(x)$ was computed numerically from the Landau distribution with a maximum
relative error of $10^{-6}$, using the function `gsl_integration_qagiu()`,
found in GSL.
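Putting the two pieces together, a sketch of the computation of $D_N$ (the
integration tolerances and workspace size are assumptions consistent with the
stated accuracy; the helper names are illustrative):

```c
#include <stdlib.h>
#include <math.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_integration.h>

/* Landau PDF wrapped in the gsl_function format */
double landau_pdf(double x, void *params) {
  (void)params;
  return gsl_ran_landau_pdf(x);
}

/* F(x) = 1 - ∫_x^∞ f(t) dt, via the upper-tail routine */
double landau_cdf(double x, gsl_integration_workspace *w) {
  gsl_function f = { &landau_pdf, NULL };
  double tail, abserr;
  gsl_integration_qagiu(&f, x, 1e-10, 1e-6, w->limit, w, &tail, &abserr);
  return 1 - tail;
}

int cmp_double(const void *a, const void *b) {
  double d = *(const double *)a - *(const double *)b;
  return (d > 0) - (d < 0);
}

/* D_N = sup_x |F_N(x) - F(x)|: the supremum is attained at a sample
 * point, so both sides of each jump of F_N are checked. */
double ks_statistic(double *sample, size_t n) {
  gsl_integration_workspace *w = gsl_integration_workspace_alloc(1000);
  qsort(sample, n, sizeof(double), cmp_double);
  double d = 0;
  for (size_t i = 0; i < n; i++) {
    double F = landau_cdf(sample[i], w);
    double below = fabs(F - (double)i / n);        /* just before the jump */
    double above = fabs((double)(i + 1) / n - F);  /* just after the jump  */
    if (below > d) d = below;
    if (above > d) d = above;
  }
  gsl_integration_workspace_free(w);
  return d;
}
```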
Under the null hypothesis, the distribution of $D_N$ is expected to
asymptotically approach a Kolmogorov distribution:
$$
\sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K
$$
where $K$ is the Kolmogorov variable, with cumulative
distribution function given by:
$$
P(K \leqslant K_0) = 1 - p = \frac{\sqrt{2 \pi}}{K_0}
\sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / 8 K_0^2}
$$
Plugging the observed value $\sqrt{N}D_N$ in for $K_0$, the $p$-value can be
computed. At the 95% confidence level (so that the probability of accepting the
null hypothesis when it is correct is 95%), the compatibility with the Landau
distribution cannot be rejected if $p > \alpha = 0.05$.
To approximate the series, the convergence was accelerated using the Levin
$u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
terminates when the difference between two successive extrapolations reaches a
minimum.
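In code, this amounts to filling an array with the leading terms of the series
and handing it to the accelerator; a sketch, where the truncation order is an
arbitrary assumption:

```c
#include <math.h>
#include <gsl/gsl_math.h>   /* guarantees M_PI */
#include <gsl/gsl_sum.h>

/* p-value of the KS test: p = 1 - P(K <= k0), with the series summed
 * by the Levin u-transform. The truncation at 100 terms is arbitrary:
 * the accelerator stops earlier, when the extrapolation stabilises. */
double ks_pvalue(double k0) {
  enum { NTERMS = 100 };
  double terms[NTERMS];
  for (int j = 1; j <= NTERMS; j++) {
    double x = (2.0 * j - 1) * M_PI / k0;
    terms[j - 1] = exp(-x * x / 8);     /* e^{-(2j-1)²π²/8k0²} */
  }
  gsl_sum_levin_utrunc_workspace *w = gsl_sum_levin_utrunc_alloc(NTERMS);
  double sum, abserr;
  gsl_sum_levin_utrunc_accel(terms, NTERMS, w, &sum, &abserr);
  gsl_sum_levin_utrunc_free(w);
  return 1 - sqrt(2 * M_PI) / k0 * sum;
}
```

The $p$-value of the test is then `ks_pvalue(sqrt(N) * D)`.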
For $N = 1000$, the following results were obtained:
- $D = 0.020$
- $p = 0.79$
Hence, the sample is compatible with having been drawn from a Landau
distribution.
**Note**:
Contrary to what one might expect, the $\chi^2$ test on a histogram is not very
useful in this case. For the test to be significant, the data have to be binned
such that at least several points fall in each bin. However, as can be seen in
@fig:landau, many bins on both the left and the right side of the distribution
are empty, so it would be necessary to fit only the region where the points
cluster, or to use very large bins in the others, making the $\chi^2$ test
impractical.
## Parameters comparison
When a sample of points is generated in a given range, different tests can be
applied in order to check whether they follow a uniform distribution or not.
The idea behind most of them is to measure how far the parameters of the
distribution are from the ones measured in the sample.
The same principle can be used to verify whether the generated sample
effectively follows the Landau distribution. Since the latter turns out to be
a very pathological PDF (its moments, mean and variance included, are
undefined), very few parameters can be easily checked: the mode, the median
and the full width at half maximum (FWHM).
### Mode
![Landau distribution with the mode and the
FWHM = ($x_+ - x_-$) highlighted.](images/landau.pdf)
\begin{figure}
\hypertarget{fig:parameters}{%
\begin{tikzpicture}[overlay]
\begin{scope}[shift={(0,0.4)}]
% Mode
\draw [thick, dashed] (7.57,3.1) -- (7.57,8.55);
\draw [thick, dashed] (1.9,8.55) -- (7.57,8.55);
\node [above right] at (7.6,3.1) {$m_e$};
\node [below right] at (1.9,8.55) {$f(m_e)$};
% FWHM
\draw [thick, dashed] (1.9,5.95) -- (9.05,5.95);
\draw [thick, dashed] (6.85,5.83) -- (6.85,3.1);
\draw [thick, dashed] (8.95,5.83) -- (8.95,3.1);
\node [below right] at (1.9,5.95) {$\frac{f(m_e)}{2}$};
\node [above right] at (6.85,3.1) {$x_-$};
\node [above right] at (8.95,3.1) {$x_+$};
\end{scope}
\end{tikzpicture}}
\end{figure}
The mode of a distribution is the value at which its PDF attains its maximum;
for a set of data values, it is the value that appears most often. Since there
is no closed form for the mode of the Landau PDF, it was computed numerically
by the *golden section* method (`gsl_min_fminimizer_goldensection` in GSL),
applied to $-f$ with an arbitrary error of $10^{-2}$, giving:
$$
\text{expected mode: } m_e = \SI{-0.2227830 \pm 0.0000001}{}
$$
This is a minimization algorithm that begins with a bounded region known to
contain a minimum. The region is described by a lower bound $x_\text{min}$ and
an upper bound $x_\text{max}$, with an estimate of the location of the minimum
$x_e$. The value of the function at $x_e$ must be less than the value of the
function at the ends of the interval, in order to guarantee that a minimum is
contained somewhere within the interval.
$$
f(x_\text{min}) > f(x_e) < f(x_\text{max})
$$
On each iteration the interval is divided by a golden section (using the ratio
$(3 - \sqrt{5})/2 \approx 0.3819660$) and the value of the function at this new
point $x'$ is calculated. If the new point is a better estimate of the minimum,
namely if $f(x') < f(x_e)$, then the current estimate of the minimum is
updated.
The new point allows the size of the bounded interval to be reduced, by
choosing, among $x_\text{min}$, $x_\text{max}$, $x_e$ and $x'$, the most
compact set of points which still satisfies the constraint $f(a) > f(x') <
f(b)$. The interval is reduced until it encloses the true minimum to the
desired tolerance.
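A sketch of this procedure applied to $-f$; the initial bracket is an
assumption (it must satisfy the bracketing condition above):

```c
#include <gsl/gsl_errno.h>
#include <gsl/gsl_min.h>
#include <gsl/gsl_randist.h>

/* -f(x): minimising this locates the maximum of the Landau PDF */
double neg_landau_pdf(double x, void *params) {
  (void)params;
  return -gsl_ran_landau_pdf(x);
}

double landau_mode(void) {
  gsl_function f = { &neg_landau_pdf, NULL };
  gsl_min_fminimizer *s =
      gsl_min_fminimizer_alloc(gsl_min_fminimizer_goldensection);
  double lo = -1, hi = 1;   /* assumed bracket: f(0) < f(±1) must hold */
  gsl_min_fminimizer_set(s, &f, 0, lo, hi);

  int status = GSL_CONTINUE;
  while (status == GSL_CONTINUE) {
    gsl_min_fminimizer_iterate(s);
    lo = gsl_min_fminimizer_x_lower(s);
    hi = gsl_min_fminimizer_x_upper(s);
    status = gsl_min_test_interval(lo, hi, 1e-2, 0);  /* absolute error */
  }
  double mode = gsl_min_fminimizer_x_minimum(s);
  gsl_min_fminimizer_free(s);
  return mode;
}
```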
In the sample, on the other hand, once the data are binned, the mode can be
estimated as the central value of the bin with the most events, and the error
as the half-width of the bins. In this case, with 40 bins between -20 and 20,
the following result was obtained:
$$
\text{observed mode: } m_o = \SI{0 \pm 1}{}
$$
In order to compare the values $m_e$ and $m_o$, the following compatibility
t-test was applied:
$$
p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
t = \frac{|m_e - m_o|}{\sqrt{\sigma_e^2 + \sigma_o^2}}
$$
where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
In this case:
$$
p = 0.82
$$
Thus, the observed value is compatible with the expected one.
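The test itself fits in a few lines; a sketch (plugging in the values above,
$m_e = \SI{-0.2227830 \pm 0.0000001}{}$ and $m_o = \SI{0 \pm 1}{}$, reproduces
$p \approx 0.82$):

```c
#include <math.h>

/* p-value of the two-tailed compatibility test between two values
 * with independent absolute errors */
double compatibility(double expected, double sigma_e,
                     double observed, double sigma_o) {
  double t = fabs(expected - observed)
           / sqrt(sigma_e * sigma_e + sigma_o * sigma_o);
  return 1 - erf(t / sqrt(2));
}
```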
### Median
The median is a central tendency statistics that, unlike the mean, is not
very sensitive to extreme values, albeit less indicative. For this reason
is well suited as test statistic in a pathological case such as the
Landau distribution.
The median of a real probability distribution is defined as the value
such that its cumulative probability is $1/2$. In other words, the median
partitions the probability into two (connected) halves.
The median of a sample, once sorted, is given by its middle element if the
sample size is odd, or the average of the two middle elements otherwise.
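A sketch of the sample median (the comparator is the usual one for `qsort`):

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
  double d = *(const double *)a - *(const double *)b;
  return (d > 0) - (d < 0);
}

/* median of a sample; note that the array is sorted in place */
double sample_median(double *x, size_t n) {
  qsort(x, n, sizeof(double), cmp_double);
  return n % 2 ? x[n / 2]                       /* odd:  middle element  */
               : (x[n / 2 - 1] + x[n / 2]) / 2; /* even: mean of the two */
}
```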
The expected median was derived from the quantile function (QDF) of the Landau
distribution[^1].
Once this is known, the median is simply given by $\text{QDF}(1/2)$.
Since both the CDF and the QDF have no known closed form, they must
be computed numerically. The cumulative probability was computed by
quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
function in GSL). The function calculates an approximation of the integral
$$
I(x) = \int_x^{+\infty} f(t) \, dt
$$
[^1]: This is neither necessary nor the easiest way: it was chosen simply
because the quantile function had already been implemented and was initially
used for inverse transform sampling.
The CDF is then given by $p(x) = 1 - I(x)$. This was done to avoid the
left tail of the distribution, where the integration can sometimes fail.
The integral $I$ is actually mapped beforehand onto $(0, 1]$ by
the change of variable $t = a + (1-u)/u$, because the integration
routine works on definite integrals. The result should satisfy the following
accuracy requirement:
$$
|\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel}I)
$$
The tolerances have been set to \SI{1e-10}{} and \SI{1e-6}{}, respectively.
As for the QDF, this was implemented by numerically inverting the CDF. This is
done by solving the equation
$$
p(x) = p_0
$$
for $x$, given a probability value $p_0$, where $p(x)$ is again the CDF.
The (unique) root of this equation is found by a root-finding routine
(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method. This
algorithm combines a bracketing bisection search, similar in spirit to the one
employed in the mode optimisation, with inverse quadratic interpolation of the
function at each step. The following condition is checked for convergence:
$$
|a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
$$
where $a,b$ are the current interval bounds. The condition immediately gives an
upper bound on the error of the root as $\varepsilon = |a-b|$. The tolerances
here have been set to 0 and \SI{1e-3}{}.
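A sketch of the inversion, assuming a `landau_cdf()` helper that wraps the
numerical CDF described above and an initial bracket that is known to contain
the root:

```c
#include <gsl/gsl_errno.h>
#include <gsl/gsl_roots.h>

double landau_cdf(double x);   /* assumed helper: the numerical CDF */

/* root function: p(x) - p0 */
double cdf_shifted(double x, void *params) {
  double p0 = *(double *)params;
  return landau_cdf(x) - p0;
}

double landau_qdf(double p0) {
  gsl_function f = { &cdf_shifted, &p0 };
  gsl_root_fsolver *s = gsl_root_fsolver_alloc(gsl_root_fsolver_brent);
  double lo = -10, hi = 100;     /* assumed bracket around the root */
  gsl_root_fsolver_set(s, &f, lo, hi);

  int status = GSL_CONTINUE;
  while (status == GSL_CONTINUE) {
    gsl_root_fsolver_iterate(s);
    lo = gsl_root_fsolver_x_lower(s);
    hi = gsl_root_fsolver_x_upper(s);
    /* converged when |a - b| < ε_abs + ε_rel·min(|a|, |b|) */
    status = gsl_root_test_interval(lo, hi, 0, 1e-3);
  }
  double root = gsl_root_fsolver_root(s);
  gsl_root_fsolver_free(s);
  return root;
}
```

The expected median is then obtained as `landau_qdf(0.5)`.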
The result of the numerical computation is:
$$
\text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
$$
while the sample median was found to be
$$
\text{observed median: } m_o = \SI{1.3479314}{}
$$
### FWHM
The same approach was taken as regards the FWHM. This statistic is defined as
the distance between the two points at which the function assumes half times
the maximum value. Even in this case, there is not an analytic expression for
it, thus it was computed numerically ad follow.
First, some definitions must be given:
$$
f_{\text{max}} := f(m_e) \et \text{FWHM} = x_+ - x_- \with
f(x_{\pm}) = \frac{f_{\text{max}}}{2}
$$
then the auxiliary function
$$
f'(x) = \left|f(x) - \frac{f_{\text{max}}}{2}\right|
$$
was minimized with the same minimization method used for finding $m_e$. The
range was divided into $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$ (where
$x_\text{min}$ and $x_\text{max}$ are the limits within which the points were
sampled) in order to find both minima of $f'$,
resulting in:
$$
\text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
$$
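A sketch of this two-sided search; the brackets and starting guesses are
assumptions and must satisfy the minimizer's bracketing condition:

```c
#include <math.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_min.h>
#include <gsl/gsl_randist.h>

/* distance of f from half its maximum: zero at x₋ and x₊ */
double half_dist(double x, void *params) {
  double half_max = *(double *)params;
  return fabs(gsl_ran_landau_pdf(x) - half_max);
}

/* golden-section search for the half-maximum crossing in [lo, hi] */
double crossing(double lo, double hi, double guess, double half_max) {
  gsl_function f = { &half_dist, &half_max };
  gsl_min_fminimizer *s =
      gsl_min_fminimizer_alloc(gsl_min_fminimizer_goldensection);
  gsl_min_fminimizer_set(s, &f, guess, lo, hi);
  int status = GSL_CONTINUE;
  while (status == GSL_CONTINUE) {
    gsl_min_fminimizer_iterate(s);
    status = gsl_min_test_interval(gsl_min_fminimizer_x_lower(s),
                                   gsl_min_fminimizer_x_upper(s), 1e-7, 0);
  }
  double x = gsl_min_fminimizer_x_minimum(s);
  gsl_min_fminimizer_free(s);
  return x;
}

/* FWHM = x₊ - x₋, with the mode m_e found beforehand */
double landau_fwhm(double mode) {
  double half_max = gsl_ran_landau_pdf(mode) / 2;
  double xm = crossing(-10, mode, mode - 1, half_max);  /* left of the peak  */
  double xp = crossing(mode, 80, mode + 1, half_max);   /* right of the peak */
  return xp - xm;
}
```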
On the other hand, the observed FWHM was computed as the distance between
the centers of the two bins whose contents are closest to
$\frac{f_{\text{max}}}{2}$, with twice the bin width taken as the error,
obtaining:
$$
\text{observed FWHM: } w_o = \SI{4 \pm 2}{}
$$
These two values turn out to be compatible as well, with:
$$
p = 0.99
$$