363 lines
14 KiB
Markdown
363 lines
14 KiB
Markdown
# Exercise 1
|
||
|
||
## Random numbers following the Landau distribution
|
||
|
||
The Landau distribution is a probability density function which can be defined
|
||
as follows:
|
||
|
||
$$
|
||
f(x) = \int \limits_{0}^{+ \infty} dt \, e^{-t log(t) -xt} \sin (\pi t)
|
||
$$
|
||
|
||
![Landau distribution.](images/landau-small.pdf){width=50%}
|
||
|
||
The GNU Scientific Library (GSL) provides a number of functions for generating
|
||
random variates following tens of probability distributions. Thus, the function
|
||
for generating numbers from the Landau distribution, namely `gsl_ran_landau()`,
|
||
was used.
|
||
For the purpose of visualizing the resulting sample, the data was put into
|
||
an histogram and plotted with matplotlib. The result is shown in @fig:landau.
|
||
|
||
![Example of N points generated with the `gsl_ran_landau()`
|
||
function and plotted in a 100-bins histogram ranging from -10 to
|
||
80.](images/landau-hist.png){#fig:landau}
|
||
|
||
|
||
## Randomness testing of the generated sample
|
||
|
||
### Kolmogorov-Smirnov test
|
||
|
||
In order to compare the sample with the Landau distribution, the
|
||
Kolmogorov-Smirnov (KS) test was applied. This test statistically quantifies the
|
||
distance between the cumulative distribution function of the Landau distribution
|
||
and the one of the sample. The null hypothesis is that the sample was
|
||
drawn from the reference distribution.
|
||
The KS statistic for a given cumulative distribution function $F(x)$ is:
|
||
|
||
$$
|
||
D_N = \text{sup}_x |F_N(x) - F(x)|
|
||
$$
|
||
|
||
where:
|
||
|
||
- $x$ runs over the sample,
|
||
- $F(x)$ is the Landau cumulative distribution and function
|
||
- $F_N(x)$ is the empirical cumulative distribution function of the sample.
|
||
|
||
If $N$ numbers have been generated, for every point $x$,
|
||
$F_N(x)$ is simply given by the number of points preceding the point (itself
|
||
included) normalized by $N$, once the sample is sorted in ascending order.
|
||
$F(x)$ was computed numerically from the Landau distribution with a maximum
|
||
relative error of $10^{-6}$, using the function `gsl_integration_qagiu()`,
|
||
found in GSL.
|
||
|
||
Under the null hypothesis, the distribution of $D_N$ is expected to
|
||
asymptotically approach a Kolmogorov distribution:
|
||
|
||
$$
|
||
\sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K
|
||
$$
|
||
|
||
where $K$ is the Kolmogorov variable, with cumulative
|
||
distribution function given by:
|
||
|
||
$$
|
||
P(K \leqslant K_0) = 1 - p = \frac{\sqrt{2 \pi}}{K_0}
|
||
\sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / 8 K_0^2}
|
||
$$
|
||
|
||
Plugging the observed value $\sqrt{N}D_N$ in $K_0$, the $p$-value can be
|
||
computed. At 95% confidence level (which is the probability of confirming the
|
||
null hypothesis when correct) the compatibility with the Landau distribution
|
||
cannot be disproved if $p > α = 0.05$.
|
||
To approximate the series, the convergence was accelerated using the Levin
|
||
$u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
|
||
terminates when the difference between two successive extrapolations reaches a
|
||
minimum.
|
||
|
||
\clearpage
|
||
|
||
For $N = 50000$, the following results were obtained:
|
||
|
||
- $D = 0.004$
|
||
- $p = 0.38$
|
||
|
||
Hence, the data was reasonably sampled from a Landau distribution.
|
||
|
||
**Note**:
|
||
Contrary to what one would expect, the $\chi^2$ test on a histogram is not very
|
||
useful in this case. For the test to be significant, the data has to be binned
|
||
such that at least several points fall in each bin. However, it can be seen
|
||
(@fig:landau) that many bins are empty both in the right and left side of the
|
||
distribution, so it would be necessary to fit only the region where the points
|
||
cluster or use very large bins in the others, making the $\chi^2$ test
|
||
unpractical.
|
||
|
||
|
||
## Parameters comparison
|
||
|
||
When a sample of points is generated in a given range, different tests can be
|
||
applied in order to check whether they follow an even distribution or not. The
|
||
idea which lies beneath most of them is to measure how far the parameters of
|
||
the distribution are from the ones measured in the sample.
|
||
The same principle can be used to verify if the generated sample effectively
|
||
follows the Landau distribution. Since it turns out to be a very pathological
|
||
PDF, very few parameters can be easily checked: mode, median and full width at
|
||
half maximum (FWHM).
|
||
|
||
### Mode
|
||
|
||
![Landau distribution with emphatized mode and
|
||
FWHM = ($x_+ - x_-$).](images/landau.pdf)
|
||
|
||
\begin{figure}
|
||
\hypertarget{fig:parameters}{%
|
||
\begin{tikzpicture}[overlay]
|
||
\begin{scope}[shift={(0,0.4)}]
|
||
% Mode
|
||
\draw [thick, dashed] (7.57,3.1) -- (7.57,8.55);
|
||
\draw [thick, dashed] (1.9,8.55) -- (7.57,8.55);
|
||
\node [above right] at (7.6,3.1) {$m_e$};
|
||
\node [below right] at (1.9,8.55) {$f(m_e)$};
|
||
% FWHM
|
||
\draw [thick, dashed] (1.9,5.95) -- (9.05,5.95);
|
||
\draw [thick, dashed] (6.85,5.83) -- (6.85,3.1);
|
||
\draw [thick, dashed] (8.95,5.83) -- (8.95,3.1);
|
||
\node [below right] at (1.9,5.95) {$\frac{f(m_e)}{2}$};
|
||
\node [above right] at (6.85,3.1) {$x_-$};
|
||
\node [above right] at (8.95,3.1) {$x_+$};
|
||
\end{scope}
|
||
\end{tikzpicture}}
|
||
\end{figure}
|
||
|
||
The mode of a set of data values is defined as the value that appears most
|
||
often, namely: it is the maximum of the PDF. Since there is no closed form for
|
||
the mode of the Landau PDF, it was computed numerically by the *Brent*
|
||
algorithm (`gsl_min_fminimizer_brent` in GSL), applied to $-f$ with a relative
|
||
tolerance of $10^{-7}$, giving:
|
||
$$
|
||
\text{expected mode: } m_e = \SI{-0.22278298 \pm 0.00000006}{}
|
||
$$
|
||
|
||
This is a minimization algorithm that begins with a bounded region known to
|
||
contain a minimum. The region is described by a lower bound $x_\text{min}$ and
|
||
an upper bound $x_\text{max}$, with an estimate of the location of the minimum
|
||
$x_e$. The value of the function at $x_e$ must be less than the value of the
|
||
function at the ends of the interval, in order to guarantee that a minimum is
|
||
contained somewhere within the interval.
|
||
|
||
$$
|
||
f(x_\text{min}) > f(x_e) < f(x_\text{max})
|
||
$$
|
||
|
||
On each iteration the function is interpolated by a parabola passing though the
|
||
points $x_\text{min}$, $x_e$, $x_\text{max}$ and the minimum is computed as the
|
||
vertex of the parabola. If this point is found to be inside the interval it's
|
||
taken as a guess for the true minimum; otherwise the method falls
|
||
back to a golden section (using the ratio $(3 - \sqrt{5})/2 \approx 0.3819660$
|
||
proven to be optimal) of the interval. The value of the function at this new
|
||
point $x'$ is calculated. In any case if the new point is a better estimate of
|
||
the minimum, namely if $f(x') < f(x_e)$, then the current estimate of the
|
||
minimum is updated.
|
||
The new point allows the size of the bounded interval to be reduced, by choosing
|
||
the most compact set of points which satisfies the constraint $f(a) > f(x') <
|
||
f(b)$ between $f(x_\text{min})$, $f(x_\text{min})$ and $f(x_e)$. The interval is
|
||
reduced until it encloses the true minimum to a desired tolerance.
|
||
The error of the result is estimated by the length of the final interval.
|
||
|
||
On the other hand, to compute the mode of the sample the half-sample mode (HSM)
|
||
or *Robertson-Cryer* estimator was used. This estimator was chosen because makes
|
||
no assumptions on the underlying distribution and is not computationally expensive.
|
||
The HSM is obtained by iteratively identifying the half modal interval, which
|
||
is the smallest interval containing half of the observation. Once the sample is
|
||
reduced to less that three points the mode is computed as the average. The
|
||
special case $n=3$ is dealt with by averaging the two closer points.
|
||
|
||
To obtain a better estimate of the mode and its error the above procedure was
|
||
bootstrapped. The original sample is treated as a population and used to build
|
||
other samples, of the same size, by *sampling with replacements*. For each one
|
||
of the new samples the above statistic is computed. By simply taking the
|
||
mean of these statistics the following estimate was obtained
|
||
$$
|
||
\text{observed mode: } m_o = \SI{-0.29 \pm 0.19}{}
|
||
$$
|
||
|
||
In order to compare the values $m_e$ and $x_0$, the following compatibility
|
||
t-test was applied:
|
||
|
||
$$
|
||
p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
|
||
t = \frac{|m_e - m_o|}{\sqrt{\sigma_e^2 + \sigma_o^2}}
|
||
$$
|
||
|
||
where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
|
||
respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
|
||
In this case:
|
||
|
||
- t = 1.012
|
||
- p = 0.311
|
||
|
||
Thus, the observed mode is compatible with the mode of the Landau distribution,
|
||
however the result is quite imprecise.
|
||
|
||
|
||
### Median
|
||
|
||
The median is a central tendency statistics that, unlike the mean, is not
|
||
very sensitive to extreme values, albeit less indicative. For this reason
|
||
is well suited as test statistic in a pathological case such as the
|
||
Landau distribution.
|
||
The median of a real probability distribution is defined as the value
|
||
such that its cumulative probability is $1/2$. In other words the median
|
||
partitions the probability in two (connected) halves.
|
||
The median of a sample, once sorted, is given by its middle element if the
|
||
sample size is odd, or the average of the two middle elements otherwise.
|
||
|
||
The expected median was derived from the quantile function (QDF) of the Landau
|
||
distribution[^1].
|
||
Once this is know, the median is simply given by $\text{QDF}(1/2)$.
|
||
Since both the CDF and QDF have no known closed form they must
|
||
be computed numerically. The cumulative probability has been computed by
|
||
quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
|
||
function in GSL). The function calculate an approximation of the integral
|
||
|
||
$$
|
||
I(x) = \int_x^{+\infty} f(t)dt
|
||
$$
|
||
|
||
[^1]: This is neither necessary nor the easiest way: it was chosen simply
|
||
because the quantile had been already implemented and was initially
|
||
used for reverse sampling.
|
||
|
||
|
||
The $CDF$ is then given by $p(x) = 1 - I(x)$. This was done to avoid the
|
||
left tail of the distribution, where the integration can sometimes fail.
|
||
The integral $I$ is actually mapped beforehand onto $(0, 1]$ by
|
||
the change of variable $t = a + (1-u)/u$, because the integration
|
||
routine works on definite integrals. The result should satisfy the following
|
||
accuracy requirement:
|
||
$$
|
||
|\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel}I)
|
||
$$
|
||
|
||
The tolerances have been set to \SI{1e-10}{} and \SI{1e-6}{}, respectively.
|
||
As for the QDF, this was implemented by numerically inverting the CDF. This is
|
||
done by solving the equation
|
||
$$
|
||
p(x) = p_0
|
||
$$
|
||
|
||
for x, given a probability value $p_0$, where $p(x)$ is again the CDF.
|
||
The (unique) root of this equation is found by a root-finding routine
|
||
(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method. This
|
||
algorithm consists in a bisection search, similar to the one employed in the
|
||
mode optimisation, but improved by interpolating the function with a parabola
|
||
at each step. The following condition is checked for convergence:
|
||
$$
|
||
|a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
|
||
$$
|
||
|
||
where $a,b$ are the current interval bounds. The condition immediately gives an
|
||
upper bound on the error of the root as $\varepsilon = |a-b|$. The tolerances
|
||
here have been set to 0 and \SI{1e-3}{}.
|
||
|
||
The result of the numerical computation is:
|
||
$$
|
||
\text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
|
||
$$
|
||
|
||
while the sample median, obtained again by bootstrapping, was found to be
|
||
$$
|
||
\text{observed median: } m_e = \SI{1.3605 \pm 0.0062}{}
|
||
$$
|
||
|
||
Applying again the t-test from before to this statistic:
|
||
|
||
- $t=0.761$
|
||
- $p=0.446$
|
||
|
||
This result is much more precise than the mode and the two values show
|
||
a good agreement.
|
||
|
||
|
||
### FWHM
|
||
|
||
For a unimodal distribution (having a single peak) this statistic is defined as
|
||
the distance between the two points at which the PDF attains half the maximum
|
||
value. For the Landau distribution, again, there is no analytic expression
|
||
known, thus the FWHM was computed numerically as follows. First of all, some
|
||
definitions must be given:
|
||
$$
|
||
f_{\text{max}} = f(m_e) \et \text{FWHM} = x_+ - x_- \with
|
||
f(x_\pm) = \frac{f_\text{max}}{2}
|
||
$$
|
||
|
||
The function $f'(x)$ was minimized using the same minimization method
|
||
used for finding $m_e$. Once $f_\text{max}$ is known, the equation
|
||
$$
|
||
f'(x) = \frac{f_\text{max}}{2}
|
||
$$
|
||
|
||
is solved by performing the Brent-Dekker method (described before) in the
|
||
ranges $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$ yielding the two
|
||
solutions $x_\pm$. With a relative tolerance of \SI{1e-7}{} the following
|
||
result was obtained:
|
||
|
||
$$
|
||
\text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
|
||
$$
|
||
\vspace{-1em}
|
||
|
||
![Example of a Moyal distribution density obtained by
|
||
the KDE method described above. The rug plot shows the original
|
||
sample used in the reconstruction.](images/landau-kde.pdf)
|
||
|
||
On the other hand, obtaining a good estimate of the FWHM from a sample is much
|
||
more difficult. In principle it could be measured by binning the data and
|
||
applying the definition to the discretised values, however this yields very
|
||
poor results and depends on an completely arbitrary parameter: the bin width.
|
||
A more refined method to construct an nonparametric empirical PDF function from
|
||
the sample is a kernel density estimation (KDE). This method consist in
|
||
convolving the (ordered) data with a smooth symmetrical kernel, in this cause a
|
||
standard gaussian function. Given a sample $\{x_i\}_{i=1}^N$, the empirical PDF
|
||
is thus constructed as
|
||
$$
|
||
f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
|
||
\mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
|
||
$$
|
||
|
||
where $\varepsilon$ is called the *bandwidth* and is a parameter that controls
|
||
the strength of the smoothing. This parameter can be determined in several
|
||
ways: bootstrapping, cross-validation, etc. For simplicity it was chosen
|
||
to use Silverman's rule of thumb, which gives
|
||
$$
|
||
\varepsilon = 0.63 S_N
|
||
\left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
|
||
$$
|
||
|
||
where
|
||
|
||
- $S_N$ is the sample standard deviation.
|
||
|
||
- $d$ is ne number of dimensions, in this case $d=1$
|
||
|
||
The $0.63$ factor was chosen to compensate for the distortion that
|
||
systematically reduces the peaks height, which affects the estimation of the
|
||
mode.
|
||
|
||
With the empirical density estimation at hand, the FWHM can be computed by the
|
||
same numerical method described for the true PDF. Again this was bootstrapped
|
||
to estimate the standard error giving:
|
||
$$
|
||
\text{observed FWHM: } w_o = \SI{4.06 \pm 0.08}{}
|
||
$$
|
||
|
||
Applying the t-test to these two values gives
|
||
|
||
- $t=0.495$
|
||
- $p=0.620$
|
||
|
||
which shows a very good agreement and proves the estimator is robust.
|
||
For reference, the initial estimation based on an histogram gave a rather
|
||
inadequate \si{4 \pm 2}.
|