analistica/notes/sections/1.md
2020-03-06 02:24:32 +01:00

8.2 KiB
Raw Blame History

Exercise 1

Random numbers following the Landau distribution

The Landau distribution is a probability density function which can be defined as follows:


  f(x) = \int \limits_{0}^{+ \infty} dt \, e^{-t log(t) -xt} \sin (\pi t)

Landau distribution.{width=50%}

The GNU Scientific Library (GSL) provides a number of functions for generating random variates following tens of probability distributions. Thus, the function for generating numbers from the Landau distribution, namely gsl_ran_landau(), was used.
For the purpose of visualizing the resulting sample, the data was put into an histogram and plotted with matplotlib. The result is shown in @fig:landau.

Example of N points generated with the gsl_ran_landau()
function and plotted in a 100-bins histogram ranging from -10 to
80.{#fig:landau}

Randomness testing of the generated sample

Kolmogorov-Smirnov test

In order to compare the sample with the Landau distribution, the Kolmogorov-Smirnov (KS) test was applied. This test statistically quantifies the distance between the cumulative distribution function of the Landau distribution and the one of the sample. The null hypothesis is that the sample was drawn from the reference distribution.
The KS statistic for a given cumulative distribution function F(x) is:


  D_N = \text{sup}_x |F_N(x) - F(x)|

where:

  • x runs over the sample,
  • F(x) is the Landau cumulative distribution and function
  • F_N(x) is the empirical cumulative distribution function of the sample.

If N numbers have been generated, for every point x, F_N(x) is simply given by the number of points preceding the point (itself included) normalized by N, once the sample is sorted in ascending order.
F(x) was computed numerically from the Landau distribution with a maximum relative error of 10^{-6}, using the function gsl_integration_qagiu(), found in GSL.

Under the null hypothesis, the distribution of D_N is expected to asymptotically approach a Kolmogorov distribution:


  \sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K

where K is the Kolmogorov variable, with cumulative distribution function given by:


  P(K \leqslant K_0) = 1 - p = \frac{\sqrt{2 \pi}}{K_0}
  \sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / 8 K_0^2}

Plugging the observed value \sqrt{N}D_N in K_0, the $p$-value can be computed. At 95% confidence level (which is the probability of confirming the null hypothesis when correct) the compatibility with the Landau distribution cannot be disproved if p > α = 0.05.
To approximate the series, the convergence was accelerated using the Levin $u$-transform with the gsl_sum_levin_utrunc_accel() function. The algorithm terminates when the difference between two successive extrapolations reaches a minimum.

For N = 1000, the following results were obtained:

  • D = 0.020
  • p = 0.79

Hence, the data was reasonably sampled from a Landau distribution.

Note: Contrary to what one would expect, the \chi^2 test on a histogram is not very useful in this case. For the test to be significant, the data has to be binned such that at least several points fall in each bin. However, it can be seen (@fig:Landau) that many bins are empty both in the right and left side of the distribution, so it would be necessary to fit only the region where the points cluster or use very large bins in the others, making the \chi^2 test unpractical.

Parameters comparison

When a sample of points is generated in a given range, different tests can be applied in order to check whether they follow an even distribution or not. The idea which lies beneath most of them is to measure how far the parameters of the distribution are from the ones measured in the sample.
The same principle can be used to verify if the generated sample effectively follows the Landau distribution. Since it turns out to be a very pathological PDF, only two parameters can be easily checked: the mode and the full-width-half-maximum (FWHM).

Landau distribution with emphatized mode and
FWHM = (x_+ - x_-).

\begin{figure} \hypertarget{fig:parameters}{% \begin{tikzpicture}[overlay] \begin{scope}[shift={(0,0.4)}] % Mode \draw [thick, dashed] (7.57,3.1) -- (7.57,8.55); \draw [thick, dashed] (1.9,8.55) -- (7.57,8.55); \node [above right] at (7.6,3.1) {$m_e$}; \node [below right] at (1.9,8.55) {$f(m_e)$}; % FWHM \draw [thick, dashed] (1.9,5.95) -- (9.05,5.95); \draw [thick, dashed] (6.85,5.83) -- (6.85,3.1); \draw [thick, dashed] (8.95,5.83) -- (8.95,3.1); \node [below right] at (1.9,5.95) {$\frac{f(m_e)}{2}$}; \node [above right] at (6.85,3.1) {$x_-$}; \node [above right] at (8.95,3.1) {$x_+$}; \end{scope} \end{tikzpicture} } \end{figure}

The mode of a set of data values is defined as the value that appears most often, namely: it is the maximum of the PDF. Since there is no way to find an analytic form for the mode of the Landau PDF, it was numerically estimated through a minimization method (found in GSL, called method 'golden section') with an arbitrary error of 10^{-2}, obtaining:

  • expected mode = m_e = \SI{-0.2227830 \pm 0.0000001}{}

The minimization algorithm begins with a bounded region known to contain a minimum. The region is described by a lower bound x_\text{min} and an upper bound x_\text{max}, with an estimate of the location of the minimum x_e. The value of the function at x_e must be less than the value of the function at the ends of the interval, in order to guarantee that a minimum is contained somewhere within the interval.


  f(x_\text{min}) > f(x_e) < f(x_\text{max})

On each iteration the interval is divided in a golden section (using the ratio (3 - \sqrt{5}/2 \approx 0.3819660 and the value of the function at this new point x' is calculated. If the new point is a better estimate of the minimum, namely is f(x') < f(x_e), then the current estimate of the minimum is updated.
The new point allows the size of the bounded interval to be reduced, by choosing the most compact set of points which satisfies the constraint $f(a) > f(x') < f(b)$ between f(x_\text{min}), f(x_\text{min}) and f(x_e). The interval is reduced until it encloses the true minimum to a desired tolerance.

In the sample, on the other hand, once the data were binned, the mode can be estimated as the central value of the bin with maximum events and the error is the half width of the bins. In this case, with 40 bins between -20 and 20, the following result was obtained:

  • observed mode = m_o = \SI{0 \pm 1}{}

In order to compare the values m_e and x_0, the following compatibility t-test was applied:


  p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
  t = \frac{|m_e - m_o|}{\sqrt{\sigma_e^2 + \sigma_o^2}}

where \sigma_e and \sigma_o are the absolute errors of m_e and $m_o$ respectively. At 95% confidence level, the values are compatible if p > 0.05. In this case:


  p = 0.82

Thus, the observed value is compatible with the expected one.


The same approach was taken as regards the FWHM. It is defined as the distance between the two points at which the function assumes half times the maximum value. Even in this case, there is not an analytic expression for it, thus it was computed numerically ad follow.
First, some definitions must be given:


  f_{\text{max}} := f(m_e) \et \text{FWHM} = x_{+} - x_{-} \with
  f(x_{\pm}) = \frac{f_{\text{max}}}{2}

then the function f'(x) was minimized using the same minimization method used for finding m_e, dividing the range into [x_\text{min}, m_e] and [m_e, x_\text{max}] (where x_\text{min} and x_\text{max} are the limits in which the points have been sampled) in order to be able to find both the minima of the function:


  f'(x) = |f(x) - \frac{f_{\text{max}}}{2}|

resulting in:


  \text{expected FWHM} = w_e = \SI{4.0186457 \pm 0.0000001}{}

On the other hand, the observed FWHM was computed as the difference between the center of the bins with the values closer to $\frac{f_{\text{max}}}{2}$ and the error was taken as twice the width of the bins, obtaining:


  \text{observed FWHM} = w_o = \SI{4 \pm 2}{}

This two values turn out to be compatible too, with:


  p = 0.99