rnhmjoj 64f8a0a76e ex-1: rewrite the chapters for bootstrapping and kde

2020-07-05 11:35:27 +02:00

14 KiB

Raw Blame History

Exercise 1

Random numbers following the Landau distribution

The Landau distribution is a probability density function which can be defined as follows:


  f(x) = \int \limits_{0}^{+ \infty} dt \, e^{-t log(t) -xt} \sin (\pi t)

{width=50%}

The GNU Scientific Library (GSL) provides a number of functions for generating random variates following tens of probability distributions. Thus, the function for generating numbers from the Landau distribution, namely gsl_ran_landau(), was used.
For the purpose of visualizing the resulting sample, the data was put into an histogram and plotted with matplotlib. The result is shown in @fig:landau.

{#fig:landau}

Randomness testing of the generated sample

Kolmogorov-Smirnov test

In order to compare the sample with the Landau distribution, the Kolmogorov-Smirnov (KS) test was applied. This test statistically quantifies the distance between the cumulative distribution function of the Landau distribution and the one of the sample. The null hypothesis is that the sample was drawn from the reference distribution.
The KS statistic for a given cumulative distribution function F(x) is:


  D_N = \text{sup}_x |F_N(x) - F(x)|

where:

x runs over the sample,
F(x) is the Landau cumulative distribution and function
F_N(x) is the empirical cumulative distribution function of the sample.

If N numbers have been generated, for every point x, F_N(x) is simply given by the number of points preceding the point (itself included) normalized by N, once the sample is sorted in ascending order.
F(x) was computed numerically from the Landau distribution with a maximum relative error of 10^{-6}, using the function gsl_integration_qagiu(), found in GSL.

Under the null hypothesis, the distribution of D_N is expected to asymptotically approach a Kolmogorov distribution:


  \sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K

where K is the Kolmogorov variable, with cumulative distribution function given by:


  P(K \leqslant K_0) = 1 - p = \frac{\sqrt{2 \pi}}{K_0}
  \sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / 8 K_0^2}

Plugging the observed value \sqrt{N}D_N in K_0, the $p$-value can be computed. At 95% confidence level (which is the probability of confirming the null hypothesis when correct) the compatibility with the Landau distribution cannot be disproved if p > α = 0.05.
To approximate the series, the convergence was accelerated using the Levin $u$-transform with the gsl_sum_levin_utrunc_accel() function. The algorithm terminates when the difference between two successive extrapolations reaches a minimum.

\clearpage

For N = 50000, the following results were obtained:

D = 0.004
p = 0.38

Hence, the data was reasonably sampled from a Landau distribution.

Note: Contrary to what one would expect, the \chi^2 test on a histogram is not very useful in this case. For the test to be significant, the data has to be binned such that at least several points fall in each bin. However, it can be seen (@fig:landau) that many bins are empty both in the right and left side of the distribution, so it would be necessary to fit only the region where the points cluster or use very large bins in the others, making the \chi^2 test unpractical.

Parameters comparison

When a sample of points is generated in a given range, different tests can be applied in order to check whether they follow an even distribution or not. The idea which lies beneath most of them is to measure how far the parameters of the distribution are from the ones measured in the sample.
The same principle can be used to verify if the generated sample effectively follows the Landau distribution. Since it turns out to be a very pathological PDF, very few parameters can be easily checked: mode, median and full width at half maximum (FWHM).

Mode

\begin{figure} \hypertarget{fig:parameters}{% \begin{tikzpicture}[overlay] \begin{scope}[shift={(0,0.4)}] % Mode \draw [thick, dashed] (7.57,3.1) -- (7.57,8.55); \draw [thick, dashed] (1.9,8.55) -- (7.57,8.55); \node [above right] at (7.6,3.1) {$m_e$}; \node [below right] at (1.9,8.55) {$f(m_e)$}; % FWHM \draw [thick, dashed] (1.9,5.95) -- (9.05,5.95); \draw [thick, dashed] (6.85,5.83) -- (6.85,3.1); \draw [thick, dashed] (8.95,5.83) -- (8.95,3.1); \node [below right] at (1.9,5.95) {$\frac{f(m_e)}{2}$}; \node [above right] at (6.85,3.1) {$x_-$}; \node [above right] at (8.95,3.1) {$x_+$}; \end{scope} \end{tikzpicture}} \end{figure}

The mode of a set of data values is defined as the value that appears most often, namely: it is the maximum of the PDF. Since there is no closed form for the mode of the Landau PDF, it was computed numerically by the Brent algorithm (gsl_min_fminimizer_brent in GSL), applied to -f with a relative tolerance of 10^{-7}, giving:


  \text{expected mode: } m_e = \SI{-0.22278298 \pm 0.00000006}{}

This is a minimization algorithm that begins with a bounded region known to contain a minimum. The region is described by a lower bound x_\text{min} and an upper bound x_\text{max}, with an estimate of the location of the minimum x_e. The value of the function at x_e must be less than the value of the function at the ends of the interval, in order to guarantee that a minimum is contained somewhere within the interval.


  f(x_\text{min}) > f(x_e) < f(x_\text{max})

On each iteration the function is interpolated by a parabola passing though the points x_\text{min}, x_e, x_\text{max} and the minimum is computed as the vertex of the parabola. If this point is found to be inside the interval it's taken as a guess for the true minimum; otherwise the method falls back to a golden section (using the ratio $(3 - \sqrt{5})/2 \approx 0.3819660$ proven to be optimal) of the interval. The value of the function at this new point x' is calculated. In any case if the new point is a better estimate of the minimum, namely if f(x') < f(x_e), then the current estimate of the minimum is updated.
The new point allows the size of the bounded interval to be reduced, by choosing the most compact set of points which satisfies the constraint $f(a) > f(x') < f(b)$ between f(x_\text{min}), f(x_\text{min}) and f(x_e). The interval is reduced until it encloses the true minimum to a desired tolerance. The error of the result is estimated by the length of the final interval.

On the other hand, to compute the mode of the sample the half-sample mode (HSM) or Robertson-Cryer estimator was used. This estimator was chosen because makes no assumptions on the underlying distribution and is not computationally expensive. The HSM is obtained by iteratively identifying the half modal interval, which is the smallest interval containing half of the observation. Once the sample is reduced to less that three points the mode is computed as the average. The special case n=3 is dealt with by averaging the two closer points.

To obtain a better estimate of the mode and its error the above procedure was bootstrapped. The original sample is treated as a population and used to build other samples, of the same size, by sampling with replacements. For each one of the new samples the above statistic is computed. By simply taking the mean of these statistics the following estimate was obtained


  \text{observed mode: } m_o = \SI{-0.29 \pm 0.19}{}

In order to compare the values m_e and x_0, the following compatibility t-test was applied:


  p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
  t = \frac{|m_e - m_o|}{\sqrt{\sigma_e^2 + \sigma_o^2}}

where \sigma_e and \sigma_o are the absolute errors of m_e and $m_o$ respectively. At 95% confidence level, the values are compatible if p > 0.05. In this case:

t = 1.012
p = 0.311

Thus, the observed mode is compatible with the mode of the Landau distribution, however the result is quite imprecise.

Median

The median is a central tendency statistics that, unlike the mean, is not very sensitive to extreme values, albeit less indicative. For this reason is well suited as test statistic in a pathological case such as the Landau distribution.
The median of a real probability distribution is defined as the value such that its cumulative probability is 1/2. In other words the median partitions the probability in two (connected) halves. The median of a sample, once sorted, is given by its middle element if the sample size is odd, or the average of the two middle elements otherwise.

The expected median was derived from the quantile function (QDF) of the Landau distribution¹. Once this is know, the median is simply given by \text{QDF}(1/2). Since both the CDF and QDF have no known closed form they must be computed numerically. The cumulative probability has been computed by quadrature-based numerical integration of the PDF (gsl_integration_qagiu() function in GSL). The function calculate an approximation of the integral


  I(x) = \int_x^{+\infty} f(t)dt

The CDF is then given by p(x) = 1 - I(x). This was done to avoid the left tail of the distribution, where the integration can sometimes fail. The integral I is actually mapped beforehand onto (0, 1] by the change of variable t = a + (1-u)/u, because the integration routine works on definite integrals. The result should satisfy the following accuracy requirement:


  |\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel}I)

The tolerances have been set to \SI{1e-10}{} and \SI{1e-6}{}, respectively.
As for the QDF, this was implemented by numerically inverting the CDF. This is done by solving the equation


  p(x) = p_0

for x, given a probability value p_0, where p(x) is again the CDF. The (unique) root of this equation is found by a root-finding routine (gsl_root_fsolver_brent in GSL) based on the Brent-Dekker method. This algorithm consists in a bisection search, similar to the one employed in the mode optimisation, but improved by interpolating the function with a parabola at each step. The following condition is checked for convergence:


  |a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)

where a,b are the current interval bounds. The condition immediately gives an upper bound on the error of the root as \varepsilon = |a-b|. The tolerances here have been set to 0 and \SI{1e-3}{}.

The result of the numerical computation is:


  \text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}

while the sample median, obtained again by bootstrapping, was found to be


  \text{observed median: } m_e = \SI{1.3605 \pm 0.0062}{}

Applying again the t-test from before to this statistic:

t=0.761
p=0.446

This result is much more precise than the mode and the two values show a good agreement.

FWHM

For a unimodal distribution (having a single peak) this statistic is defined as the distance between the two points at which the PDF attains half the maximum value. For the Landau distribution, again, there is no analytic expression known, thus the FWHM was computed numerically as follows. First of all, some definitions must be given:


  f_{\text{max}} = f(m_e) \et \text{FWHM} = x_+ - x_- \with
  f(x_\pm) = \frac{f_\text{max}}{2}

The function f'(x) was minimized using the same minimization method used for finding m_e. Once f_\text{max} is known, the equation


  f'(x) = \frac{f_\text{max}}{2}

is solved by performing the Brent-Dekker method (described before) in the ranges [x_\text{min}, m_e] and [m_e, x_\text{max}] yielding the two solutions x_\pm. With a relative tolerance of \SI{1e-7}{} the following result was obtained:


\text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}

\vspace{-1em}

On the other hand, obtaining a good estimate of the FWHM from a sample is much more difficult. In principle it could be measured by binning the data and applying the definition to the discretised values, however this yields very poor results and depends on an completely arbitrary parameter: the bin width.
A more refined method to construct an nonparametric empirical PDF function from the sample is a kernel density estimation (KDE). This method consist in convolving the (ordered) data with a smooth symmetrical kernel, in this cause a standard gaussian function. Given a sample \{x_i\}_{i=1}^N, the empirical PDF is thus constructed as


  f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
    \mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)

where \varepsilon is called the bandwidth and is a parameter that controls the strength of the smoothing. This parameter can be determined in several ways: bootstrapping, cross-validation, etc. For simplicity it was chosen to use Silverman's rule of thumb, which gives


  \varepsilon = 0.63 S_N
    \left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}

where

S_N is the sample standard deviation.
d is ne number of dimensions, in this case d=1

The 0.63 factor was chosen to compensate for the distortion that systematically reduces the peaks height, which affects the estimation of the mode.

With the empirical density estimation at hand, the FWHM can be computed by the same numerical method described for the true PDF. Again this was bootstrapped to estimate the standard error giving:


  \text{observed FWHM: } w_o = \SI{4.06 \pm 0.08}{}

Applying the t-test to these two values gives

t=0.495
p=0.620

which shows a very good agreement and proves the estimator is robust. For reference, the initial estimation based on an histogram gave a rather inadequate \si{4 \pm 2}.

This is neither necessary nor the easiest way: it was chosen simply because the quantile had been already implemented and was initially used for reverse sampling. ↩︎

14 KiB Raw Blame History Unescape Escape