# Exercise 1

## Random numbers following the Landau distribution

The Landau distribution is a probability density function which can be defined
as follows:

$$
f(x) = \frac{1}{\pi} \int \limits_{0}^{+ \infty} dt \, e^{-t \log(t) -xt} \sin (\pi t)
$$

![Landau distribution.](images/landau-small.pdf){width=50%}

The GNU Scientific Library (GSL) provides functions for generating random
variates following tens of probability distributions. The one for the Landau
distribution, namely `gsl_ran_landau()`, was used to generate the sample.
For the purpose of visualizing the resulting sample, the data were put into
a histogram and plotted with matplotlib. The result is shown in @fig:landau.

![Example of $N$ points generated with the `gsl_ran_landau()`
function and plotted in a 100-bin histogram ranging from -10 to
80.](images/landau-hist.png){#fig:landau}

## Randomness testing of the generated sample

### Kolmogorov-Smirnov test

In order to compare the sample with the Landau distribution, the
Kolmogorov-Smirnov (KS) test was applied. This test statistically quantifies the
distance between the cumulative distribution function of the Landau distribution
and that of the sample. The null hypothesis is that the sample was
drawn from the reference distribution.
The KS statistic for a given cumulative distribution function $F(x)$ is:

$$
D_N = \sup_x |F_N(x) - F(x)|
$$

where:

- $x$ runs over the sample,
- $F(x)$ is the Landau cumulative distribution function, and
- $F_N(x)$ is the empirical cumulative distribution function of the sample.

If $N$ numbers have been generated, for every point $x$,
$F_N(x)$ is simply given by the number of points preceding the point (itself
included) normalized by $N$, once the sample is sorted in ascending order.
$F(x)$ was computed numerically from the Landau distribution with a maximum
relative error of $10^{-6}$, using the function `gsl_integration_qagiu()`,
found in GSL.
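
As a minimal sketch of the statistic defined above (not the code used for this exercise), $D_N$ could be computed in Python as follows. The names `ks_statistic` and `uniform_cdf` are hypothetical, and a uniform CDF stands in for the Landau $F(x)$, which would have to be integrated numerically. The sketch also compares $F(x)$ with the value of $F_N$ just before each jump, $(i-1)/N$, since the supremum may be attained there; this is an assumption that goes beyond the description above.

```python
def ks_statistic(sample, cdf):
    """Return D_N = sup_x |F_N(x) - F(x)| for a sample and a reference CDF."""
    xs = sorted(sample)                      # ascending order, as in the text
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # F_N jumps from (i - 1)/N to i/N at x: check both sides of the step
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d


def uniform_cdf(x):
    """Stand-in reference CDF (uniform on [0, 1]); NOT the Landau CDF."""
    return min(max(x, 0.0), 1.0)


# Hypothetical four-point sample, only to show the call
d_n = ks_statistic([0.625, 0.125, 0.875, 0.375], uniform_cdf)
```

Passing the reference CDF as a function keeps the routine independent of how $F(x)$ is obtained.
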

Under the null hypothesis, the distribution of $D_N$ is expected to
asymptotically approach a Kolmogorov distribution:

$$
\sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K
$$

where $K$ is the Kolmogorov variable, with cumulative
distribution function given by:

$$
P(K \leqslant K_0) = 1 - p = \frac{\sqrt{2 \pi}}{K_0}
\sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / (8 K_0^2)}
$$

Plugging the observed value $\sqrt{N}D_N$ in $K_0$, the $p$-value can be
computed. At 95% confidence level (which is the probability of accepting the
null hypothesis when it is correct), the compatibility with the Landau
distribution cannot be rejected if $p > \alpha = 0.05$.
To approximate the series, the convergence was accelerated using the Levin
$u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
terminates when the difference between two successive extrapolations reaches a
minimum.
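
For illustration, the series above can also be summed directly, without the Levin $u$-transform, because its terms decay very quickly. The following Python sketch is not the GSL-based code used here; the names `kolmogorov_cdf` and `ks_p_value`, the tolerance and the term limit are all hypothetical choices.

```python
import math


def kolmogorov_cdf(k0, tol=1e-16, max_terms=1000):
    """P(K <= k0), by plain summation of the series quoted in the text."""
    if k0 <= 0:
        return 0.0
    total = 0.0
    for j in range(1, max_terms + 1):
        term = math.exp(-((2 * j - 1) ** 2) * math.pi ** 2 / (8 * k0 ** 2))
        total += term
        if term < tol * total:       # remaining terms are negligible
            break
    return math.sqrt(2 * math.pi) / k0 * total


def ks_p_value(d_n, n):
    """p-value obtained by plugging sqrt(N) * D_N into K_0."""
    return 1.0 - kolmogorov_cdf(math.sqrt(n) * d_n)
```

A call such as `ks_p_value(d_n, n)` takes the unrounded statistic and the sample size; the null hypothesis is not rejected when the result exceeds $\alpha$.
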

For $N = 1000$, the following results were obtained:

- $D = 0.020$
- $p = 0.79$

Hence, the hypothesis that the data were sampled from a Landau distribution
cannot be rejected.

**Note**:
Contrary to what one would expect, the $\chi^2$ test on a histogram is not very
useful in this case. For the test to be significant, the data have to be binned
such that at least several points fall in each bin. However, it can be seen
(see @fig:landau) that many bins are empty on both the right and the left side
of the distribution, so it would be necessary to fit only the region where the
points cluster or use very large bins elsewhere, making the $\chi^2$ test
impractical.

### Parameters comparison

When a sample of points is generated in a given range, different tests can be
applied in order to check whether they follow a uniform distribution or not. The
idea which lies beneath most of them is to measure how far the parameters of
the distribution are from the ones measured in the sample.
The same principle can be used to verify whether the generated sample actually
follows the Landau distribution. Since it turns out to be a very pathological
PDF (its mean and variance are undefined), only two parameters can be easily
checked: the mode and the full-width-half-maximum (FWHM).

![Landau distribution with emphasized mode and
FWHM = ($x_+ - x_-$).](images/landau.pdf)

\begin{figure}
\hypertarget{fig:parameters}{%
\begin{tikzpicture}[overlay]
\begin{scope}[shift={(0,0.4)}]
% Mode
\draw [thick, dashed] (7.57,3.1) -- (7.57,8.55);
\draw [thick, dashed] (1.9,8.55) -- (7.57,8.55);
\node [above right] at (7.6,3.1) {$m_e$};
\node [below right] at (1.9,8.55) {$f(m_e)$};
% FWHM
\draw [thick, dashed] (1.9,5.95) -- (9.05,5.95);
\draw [thick, dashed] (6.85,5.83) -- (6.85,3.1);
\draw [thick, dashed] (8.95,5.83) -- (8.95,3.1);
\node [below right] at (1.9,5.95) {$\frac{f(m_e)}{2}$};
\node [above right] at (6.85,3.1) {$x_-$};
\node [above right] at (8.95,3.1) {$x_+$};
\end{scope}
\end{tikzpicture}
}
\end{figure}

The mode of a set of data values is defined as the value that appears most
often; for a continuous distribution, it is the maximum of the PDF. Since there
is no way to find an analytic form for the mode of the Landau PDF, it was
numerically estimated through a minimization method (the golden section method
found in GSL) with an arbitrary error of $10^{-2}$, obtaining:

- expected mode $= m_e = \SI{-0.2227830 \pm 0.0000001}{}$

The minimization algorithm begins with a bounded region known to contain a
minimum. The region is described by a lower bound $x_\text{min}$ and an upper
bound $x_\text{max}$, with an estimate of the location of the minimum $x_e$.
The value of the function at $x_e$ must be less than the value of the function
at the ends of the interval, in order to guarantee that a minimum is contained
somewhere within the interval.

$$
f(x_\text{min}) > f(x_e) < f(x_\text{max})
$$

On each iteration the interval is divided in a golden section (using the ratio
$(3 - \sqrt{5})/2 \approx 0.3819660$) and the value of the function at this new
point $x'$ is calculated. If the new point is a better estimate of the minimum,
namely if $f(x') < f(x_e)$, then the current estimate of the minimum is
updated.
The new point allows the size of the bounded interval to be reduced, by choosing
the most compact set of points which satisfies the constraint $f(a) > f(x') <
f(b)$, with $a$ and $b$ chosen among $x_\text{min}$, $x_\text{max}$ and $x_e$.
The interval is reduced until it encloses the true minimum to a desired
tolerance.
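
The procedure described above can be sketched in a few lines of Python. This is the textbook variant of the golden section search, which keeps two interior points, and not GSL's implementation; the name `golden_section_min`, the tolerance and the test function are hypothetical. It assumes the function has a single minimum in the starting interval.

```python
import math

RATIO = (3 - math.sqrt(5)) / 2   # golden section ratio, about 0.3819660


def golden_section_min(f, x_min, x_max, tol=1e-8):
    """Locate the minimum of a unimodal f inside [x_min, x_max]."""
    a, b = x_min, x_max
    c = a + RATIO * (b - a)      # lower interior point
    d = b - RATIO * (b - a)      # upper interior point
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # the minimum lies in [a, d]: the old c becomes the new d
            b, d, fd = d, c, fc
            c = a + RATIO * (b - a)
            fc = f(c)
        else:
            # the minimum lies in [c, b]: the old d becomes the new c
            a, c, fc = c, d, fd
            d = b - RATIO * (b - a)
            fd = f(d)
    return (a + b) / 2


# A maximum (such as a mode) is found by minimizing the negated function
x_best = golden_section_min(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

Thanks to the golden ratio, one of the two interior points is reused at every step, so each iteration costs a single new function evaluation.
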

In the sample, on the other hand, once the data are binned, the mode can be
estimated as the central value of the bin with the most events, and the error
is the half width of the bins. In this case, with 40 bins between -20 and 20,
the following result was obtained:

- observed mode $= m_o = \SI{0 \pm 1}{}$

In order to compare the values $m_e$ and $m_o$, the following compatibility
t-test was applied:

$$
p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
t = \frac{|m_e - m_o|}{\sqrt{\sigma_e^2 + \sigma_o^2}}
$$

where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
In this case:

$$
p = 0.82
$$

Thus, the observed value is compatible with the expected one.
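
As a quick check of the formula, the compatibility test can be written with the error function from Python's standard library. This is only a sketch; the function name `compatibility_p` is hypothetical, and the numbers passed to it are the mode values quoted above.

```python
import math


def compatibility_p(a, sigma_a, b, sigma_b):
    """p = 1 - erf(t / sqrt(2)), with t = |a - b| / sqrt(sigma_a^2 + sigma_b^2)."""
    t = abs(a - b) / math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    return 1.0 - math.erf(t / math.sqrt(2))


# Expected and observed mode with their errors, as quoted in the text
p_mode = compatibility_p(-0.2227830, 0.0000001, 0.0, 1.0)
print(f"p = {p_mode:.2f}")   # the text quotes p = 0.82
```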

---

The same approach was taken as regards the FWHM. It is defined as the distance
between the two points at which the function assumes half of its maximum
value. In this case too, there is no analytic expression for it, thus it
was computed numerically as follows.
First, some definitions must be given:

$$
f_{\text{max}} := f(m_e) \et \text{FWHM} = x_{+} - x_{-} \with
f(x_{\pm}) = \frac{f_{\text{max}}}{2}
$$

Then the function $f'(x)$ defined below (an auxiliary function, not the
derivative of $f$) was minimized using the same minimization method
used for finding $m_e$, dividing the range into $[x_\text{min}, m_e]$ and
$[m_e, x_\text{max}]$ (where $x_\text{min}$ and $x_\text{max}$ are the limits
within which the points have been sampled) in order to be able to find both
minima of the function:

$$
f'(x) = \left| f(x) - \frac{f_{\text{max}}}{2} \right|
$$

resulting in:

$$
\text{expected FWHM} = w_e = \SI{4.0186457 \pm 0.0000001}{}
$$
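
The two-interval trick can be illustrated with a short Python sketch. Since the Landau PDF is not available in the standard library, a unit Gaussian is used as a stand-in, for which the FWHM is known to be $2\sqrt{2 \ln 2}$; the names `golden_min`, `fwhm` and `gauss`, as well as the interval limits, are hypothetical.

```python
import math


def golden_min(f, a, b, tol=1e-10):
    """Textbook golden section search for the minimum of f on [a, b]."""
    r = (3 - math.sqrt(5)) / 2
    c, d = a + r * (b - a), b - r * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = a + r * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = b - r * (b - a)
            fd = f(d)
    return (a + b) / 2


def fwhm(f, mode, x_min, x_max):
    """Minimize |f(x) - f_max / 2| separately on each side of the mode."""
    half = f(mode) / 2

    def distance(x):
        return abs(f(x) - half)

    x_minus = golden_min(distance, x_min, mode)
    x_plus = golden_min(distance, mode, x_max)
    return x_plus - x_minus


def gauss(x):
    """Stand-in PDF shape (unit Gaussian, unnormalized); NOT the Landau PDF."""
    return math.exp(-x * x / 2)


width = fwhm(gauss, 0.0, -5.0, 5.0)
```

Splitting the range at the mode makes the auxiliary function unimodal on each half, which is what the golden section search requires.
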

On the other hand, the observed FWHM was computed as the difference between
the centers of the two bins whose values are closest to
$\frac{f_{\text{max}}}{2}$, and the error was taken as twice the width of the
bins, obtaining:

$$
\text{observed FWHM} = w_o = \SI{4 \pm 2}{}
$$

These two values turn out to be compatible too, with:

$$
p = 0.99
$$
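
The binned estimates of the mode and of the FWHM described in this section can be sketched as follows. This is a hedged illustration rather than the code used here: the function name `histogram_mode_fwhm` and the toy sample are hypothetical, and the half-maximum bins are searched separately on each side of the peak, which is an assumption.

```python
def histogram_mode_fwhm(sample, low, high, nbins):
    """Return ((mode, error), (FWHM, error)) estimated from a histogram."""
    width = (high - low) / nbins
    counts = [0] * nbins
    for x in sample:
        if low <= x < high:
            index = min(int((x - low) / width), nbins - 1)
            counts[index] += 1
    centers = [low + (i + 0.5) * width for i in range(nbins)]

    # Mode: center of the fullest bin, error = half the bin width
    peak = max(range(nbins), key=lambda i: counts[i])
    half = counts[peak] / 2

    # FWHM: bins closest to half the maximum, on each side of the peak
    left = min(range(0, peak + 1), key=lambda i: abs(counts[i] - half))
    right = min(range(peak, nbins), key=lambda i: abs(counts[i] - half))

    mode = (centers[peak], width / 2)
    fwhm = (centers[right] - centers[left], 2 * width)   # error: twice the bin width
    return mode, fwhm


# Toy sample with bin counts 1, 2, 4, 2, 1 over five unit-width bins
toy = [0.5] + [1.5] * 2 + [2.5] * 4 + [3.5] * 2 + [4.5]
mode, fwhm = histogram_mode_fwhm(toy, 0.0, 5.0, 5)
```

Both estimates are limited by the bin width, which is why their errors are so much larger than those of the expected values.
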