analistica/notes/sections/1.md

# Exercise 1 {#sec:Landau}


## Random numbers following the Landau distribution

The Landau distribution is a probability density function which can be defined
as follows:
$$
  f(x) = \int \limits_{0}^{+ \infty} dt \, e^{-t \log(t) -xt} \sin (\pi t)
$$

![Landau distribution.](images/1-landau-small.pdf){width=50%}

The GNU Scientific Library (GSL) provides a number of functions for generating
random variates following tens of probability distributions. Thus, the function
for generating numbers from the Landau distribution, namely `gsl_ran_landau()`,
was used.  
For the purpose of visualizing the resulting sample, the data was put into
an histogram and plotted with matplotlib. The result is shown in @fig:landau.

![Example of N = 10'000 points generated with the `gsl_ran_landau()`
function and plotted in a 100-bins histogram ranging from -20 to
80.](images/1-landau-histo.pdf){#fig:landau}


## Randomness testing of the generated sample


### Kolmogorov-Smirnov test

In order to compare the sample with the Landau distribution, the
Kolmogorov-Smirnov (KS) test was applied. This test statistically quantifies the
distance between the cumulative distribution function of the Landau distribution
and the one of the sample. The null hypothesis is that the sample was
drawn from the reference distribution.  
The KS statistic for a given cumulative distribution function $F(x)$ is:
$$
  D_N = \text{sup}_x |F_N(x) - F(x)|
$$

where:

  - $x$ runs over the sample,
  - $F(x)$ is the Landau cumulative distribution function,
  - $F_N(x)$ is the empirical cumulative distribution function of the sample.

If $N$ numbers have been generated, for every point $x$,
$F_N(x)$ is simply given by the number of points preceding the point (itself
included) normalized by $N$, once the sample is sorted in ascending order.  
$F(x)$ was computed numerically from the Landau distribution with a maximum
relative error of $10^{-6}$, using the function `gsl_integration_qagiu()`,
found in GSL.

Under the null hypothesis, the distribution of $D_N$ is expected to
asymptotically approach a Kolmogorov distribution:
$$
  \sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K
$$

where $K$ is the Kolmogorov variable, with cumulative distribution function
given by [@marsaglia03]:
$$
  P(K \leqslant K_0) = 1 - p = \frac{\sqrt{2 \pi}}{K_0}
  \sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / 8 K_0^2}
$$

Plugging the observed value $\sqrt{N}D_N$ in $K_0$, the $p$-value can be
computed. At 95% confidence level (which is the probability of confirming the
null hypothesis when correct) the compatibility with the Landau distribution
cannot be disproved if $p > α = 0.05$.  
To approximate the series, the convergence was accelerated using the Levin
$u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
terminates when the difference between two successive extrapolations reaches a
minimum.

For $N = 50000$, the following results were obtained:

  - $D = 0.004$
  - $p = 0.38$

Hence, the data was reasonably sampled from a Landau distribution.

**Note**:
Contrary to what one would expect, the $\chi^2$ test on a histogram is not very
useful in this case. For the test to be significant, the data has to be binned
such that at least several points fall in each bin. However, it can be seen
in @fig:landau that many bins are empty both in the right and left side of the
distribution, so it would be necessary to fit only the region where the points
cluster or use very large bins in the others, making the $\chi^2$ test
unpractical.


### Parameters comparison

When a sample of points is generated in a given range, different tests can be
applied in order to check whether they follow a given distribution or not. The
idea which lies beneath most of them is to measure how far the parameters of
the distribution are from the ones measured in the sample.  
The same principle can be used to verify if the generated sample effectively
follows the Landau distribution. Since it turns out to be a very pathological
PDF, very few parameters can be easily checked: mode, median and full width at
half maximum (FWHM).

![Landau distribution with emphatized mode $m_e$ and
  FWHM = ($x_+ - x_-$).](images/1-landau.pdf)

\begin{figure}
\hypertarget{fig:parameters}{%
\begin{tikzpicture}[overlay]
  \begin{scope}[shift={(0,0.4)}]
  % Mode
  \draw [thick, dashed] (7.57,3.1) -- (7.57,8.55);
  \draw [thick, dashed] (1.9,8.55) -- (7.57,8.55);
  \node [above right] at (7.6,3.1)  {$m_e$};
  \node [below right] at (1.9,8.55) {$f(m_e)$};
  % FWHM
  \draw [thick, dashed] (1.9,5.95)  -- (9.05,5.95);
  \draw [thick, dashed] (6.85,5.83) -- (6.85,3.1);
  \draw [thick, dashed] (8.95,5.83)    -- (8.95,3.1);
  \node [below right] at (1.9,5.95) {$\frac{f(m_e)}{2}$};
  \node [above right] at (6.85,3.1) {$x_-$};
  \node [above right] at (8.95,3.1)  {$x_+$};
  \end{scope}
\end{tikzpicture}}
\end{figure}

\vspace{30pt}

#### Mode

The mode of a set of data values is defined as the value that appears most
often, namely: it is the maximum of the PDF. Since there is no closed form for
the mode of the Landau PDF, it was computed numerically by the *Brent*
algorithm (`gsl_min_fminimizer_brent` in GSL), applied to $-f$ with a relative
tolerance of $10^{-7}$, giving:
$$
  \text{expected mode: } m_e = \num{-0.22278298 \pm 0.00000006}
$$

This is a minimization algorithm that begins with a bounded region known to
contain a minimum. The region is described by a lower bound $x_\text{min}$ and
an upper bound $x_\text{max}$, with an estimate of the location of the minimum
$x_e$.  The value of the function at $x_e$ must be less than the value of the
function at the ends of the interval, in order to guarantee that a minimum is
contained somewhere within the interval.
$$
  f(x_\text{min}) > f(x_e) < f(x_\text{max})
$$

On each iteration the function is interpolated by a parabola passing though the
points $x_\text{min}$, $x_e$, $x_\text{max}$ and the minimum is computed as the
vertex of the parabola. If this point is found to be inside the interval it's
taken as a guess for the true minimum; otherwise the method falls
back to a golden section (using the ratio $(3 - \sqrt{5})/2 \approx 0.3819660$
proven to be optimal) of the interval. The value of the function at this new
point $x'$ is calculated. In any case if the new point is a better estimate of
the minimum, namely if $f(x') < f(x_e)$, then the current estimate of the
minimum is updated.  
The new point allows the size of the bounded interval to be reduced, by choosing
the most compact set of points which satisfies the constraint $f(a) > f(x') <
f(b)$ between $f(x_\text{min})$, $f(x_\text{min})$ and $f(x_e)$. The interval is
reduced until it encloses the true minimum to a desired tolerance.
The error of the result is estimated by the length of the final interval.

On the other hand, to compute the mode of the sample, the half-sample mode (HSM)
or *Robertson-Cryer* estimator was used. This estimator was chosen because makes
no assumptions on the underlying distribution and is not computationally
expensive. The HSM is obtained by iteratively identifying the half modal
interval, which is the smallest interval containing half of the observation.
Once the sample is reduced to less than three points the mode is computed as the
average. The special case $n=3$ is dealt with by averaging the two closer points
[@robertson74].

To obtain a better estimate of the mode and its error the above procedure was
bootstrapped. The original sample was treated as a population and used to build
100 other samples of the same size, by *sampling with replacements*. For each one
of the new samples, the above statistic was computed. By simply taking the
mean of these statistics the following estimate was obtained:
$$
  \text{observed mode: } m_o = \num{-0.29 \pm 0.19}
$$

In order to compare the values $m_e$ and $m_0$, the following compatibility
$t$-test was applied:
$$
  p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
  t = \frac{|m_e - m_o|}{\sqrt{\sigma_e^2 + \sigma_o^2}}
$$

where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
In this case:

  - $t = 1.012$
  - $p = 0.311$

Thus, the observed mode is compatible with the mode of the Landau distribution,
although the result is quite imprecise.


#### Median

The median is a central tendency statistics that, unlike the mean, is not
very sensitive to extreme values, albeit less indicative. For this reason
is well suited as test statistic in a pathological case such as the Landau
distribution.  
The median of a probability distribution is defined as the value such that its
cumulative probability is $1/2$. In other words, the median partitions the
probability in two (connected) halves. The median of a sample, once sorted, is
given by its middle element if the sample size is odd, or the average of the two
middle elements otherwise.

The expected median was derived from the quantile function (QDF) of the Landau
distribution[^1].
Once this is know, the median is simply given by $\text{QDF}(1/2)$. Since both
the CDF and QDF have no known closed form, they must be computed numerically.
The cumulative probability was computed by quadrature-based numerical
integration of the PDF (`gsl_integration_qagiu()` function in GSL). The function
calculate an approximation of the integral:
$$
  I(x) = \int_x^{+\infty} f(t)dt
$$

[^1]: This is neither necessary nor the easiest way: it was chosen simply
because the quantile had been already implemented and was initially
used for reverse sampling.

The CDF is then given by $p(x) = 1 - I(x)$. This was done to avoid the
left tail of the distribution, where the integration can sometimes fail.
The integral $I$ was actually mapped beforehand onto $(0, 1]$ by
the change of variable $t = x + (1-u)/u$, because the integration
routine works on definite integrals. The result should satisfy the following
accuracy requirement:
$$
  |\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel}I)
$$

where the absolute and relative tolerances $\varepsilon_\text{abs}$ and
$\varepsilon_\text{rel}$ were set to \num{1e-10} and \num{1e-6},
respectively.  
As for the QDF, this was implemented by numerically inverting the CDF. This was
done by solving the equation;
$$
  p(x) = p_0
$$

for x, given a probability value $p_0$, where $p(x)$ is the CDF. The (unique)
root of this equation was found by a root-finding routine
(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method.
The following condition was checked for convergence:
$$
  |a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
$$

where $a,b$ are the current interval bounds. The condition immediately gives an
upper bound on the error of the root as $\varepsilon = |a-b|$.  The tolerances
here were set to 0 and \num{1e-3}.  
The result of the numerical computation is:
$$
  \text{expected median: } m_e = \num{1.3557804 \pm 0.0000091}
$$

while the sample median, obtained again by bootstrapping, was found to be:
$$
  \text{observed median: } m_e = \num{1.3605 \pm 0.0062}
$$

As stated above, the median is less sensitive to extreme values with respect to
the mode: this lead the result to be much more precise. Applying again the
aforementioned $t$-test to this statistic:

  - $t=0.761$
  - $p=0.446$

Hence, the two values show a good agreement.


#### FWHM

For a unimodal distribution (having a single peak) this statistic is defined as
the distance between the two points at which the PDF attains half the maximum
value. For the Landau distribution, again, there is no analytic expression
known, thus the FWHM was computed numerically as follows. First of all, some
definitions must be given:
$$
  f_{\text{max}} = f(m_e) \et \text{FWHM} = x_+ - x_- \with
  f(x_\pm) = \frac{f_\text{max}}{2}
$$

The function derivative $f'(x)$ was minimized using the same minimization method
used for finding $m_e$. Once $f_\text{max}$ was known, the equation:
$$
  f(x) = \frac{f_\text{max}}{2}
$$

was solved by performing the Brent-Dekker method (described before) in the
ranges $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$ yielding the two
solutions $x_\pm$. With a relative tolerance of \num{1e-7}, the following
result was obtained:
$$
\text{expected FWHM: } w_e = \num{4.0186457 \pm 0.0000001}
$$
\vspace{-1em}

![Example of a Moyal distribution density obtained by the KDE method. The rug
  plot shows the original sample used in the reconstruction.
](images/1-landau-kde.pdf)

On the other hand, obtaining a good estimate of the FWHM from a sample is much
more difficult. In principle, it could be measured by binning the data and
applying the definition to the discretised values, however this yields very
poor results and depends on an completely arbitrary parameter: the bin width.  
A more refined method to construct a nonparametric empirical PDF function from
the sample is a kernel density estimation (KDE). This method consist in
convolving the (ordered) data with a smooth symmetrical kernel: in this case a
standard Gaussian function. Given a sample of values $\{x_i\}_{i=1}^N$, the
empirical PDF is defined as:
$$
  f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
    \mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
$$

where $\mathcal{N}$ is the kernel and the parameter $\varepsilon$, called
*bandwidth*, controls the strength of the smoothing. This parameter can be
determined in several ways: bootstrapping, cross-validation, etc. For
simplicity, it was chosen to use Silverman's rule of thumb [@silverman86],
which gives:
$$
  \varepsilon = 0.63 S_N
    \left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
$$

where the $0.63$ factor was chosen to compensate for the distortion that
systematically reduces the peaks height, which affects the estimation of the
mode, and:

  - $S_N$ is the sample standard deviation,
  - $d$ is the number of dimensions, in this case $d=1$.

With the empirical density estimation at hand, the FWHM can be computed by the
same numerical method described for the true PDF. Again this was bootstrapped
to estimate the standard error giving:
$$
  \text{observed FWHM: } w_o = \num{4.06 \pm 0.08}
$$

Applying the $t$-test to these two values gives

  - $t=0.495$
  - $p=0.620$

which shows a very good agreement and proves the estimator is robust.
For reference, the initial estimation based on an histogram gave a rather
inadequate \si{4 \pm 2}.
-												ex-6: started writing about the histograms comparison

											
										
										
											2020-05-01 23:56:35 +02:00
+								# Exercise 1 {#sec:Landau}
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
-												ex-2: revised and typo-fixed

											
										
										
											2020-05-08 23:07:37 +02:00
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								## Random numbers following the Landau distribution
 								The Landau distribution is a probability density function which can be defined
 								as follows:
 								$$
-												ex-6: started writing about the histograms comparison

											
										
										
											2020-05-01 23:56:35 +02:00
+								  f(x) = \int \limits_{0}^{+ \infty} dt \, e^{-t \log(t) -xt} \sin (\pi t)
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
-												images: neatly renamed all the images

											
										
										
											2020-05-20 18:45:27 +02:00
+								![Landau distribution.](images/1-landau-small.pdf){width=50%}
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
 								The GNU Scientific Library (GSL) provides a number of functions for generating
 								random variates following tens of probability distributions. Thus, the function
 								for generating numbers from the Landau distribution, namely `gsl_ran_landau()`,
 								was used.
 								For the purpose of visualizing the resulting sample, the data was put into
 								an histogram and plotted with matplotlib. The result is shown in @fig:landau.
-												ex-6: started writing about the histograms comparison

											
										
										
											2020-05-01 23:56:35 +02:00
+								![Example of N = 10'000 points generated with the `gsl_ran_landau()`
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								function and plotted in a 100-bins histogram ranging from -20 to
-												images: neatly renamed all the images

											
										
										
											2020-05-20 18:45:27 +02:00
+.](images/1-landau-histo.pdf){#fig:landau}
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
 								## Randomness testing of the generated sample
-												ex-2: revised and typo-fixed

											
										
										
											2020-05-08 23:07:37 +02:00
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								### Kolmogorov-Smirnov test
 								In order to compare the sample with the Landau distribution, the
 								Kolmogorov-Smirnov (KS) test was applied. This test statistically quantifies the
 								distance between the cumulative distribution function of the Landau distribution
 								and the one of the sample. The null hypothesis is that the sample was
 								drawn from the reference distribution.
 								The KS statistic for a given cumulative distribution function $F(x)$ is:
 								$$
 								  D_N = \text{sup}_x |F_N(x) - F(x)|
 								$$
 								where:
 								  - $x$ runs over the sample,
-												ex-6: started writing about the histograms comparison

											
										
										
											2020-05-01 23:56:35 +02:00
+								  - $F(x)$ is the Landau cumulative distribution function,
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								  - $F_N(x)$ is the empirical cumulative distribution function of the sample.
 								If $N$ numbers have been generated, for every point $x$,
 								$F_N(x)$ is simply given by the number of points preceding the point (itself
 								included) normalized by $N$, once the sample is sorted in ascending order.
 								$F(x)$ was computed numerically from the Landau distribution with a maximum
 								relative error of $10^{-6}$, using the function `gsl_integration_qagiu()`,
 								found in GSL.
 								Under the null hypothesis, the distribution of $D_N$ is expected to
 								asymptotically approach a Kolmogorov distribution:
 								$$
 								  \sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K
 								$$
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								where $K$ is the Kolmogorov variable, with cumulative distribution function
 								given by [@marsaglia03]:
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
 								  P(K \leqslant K_0) = 1 - p = \frac{\sqrt{2 \pi}}{K_0}
 								  \sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / 8 K_0^2}
 								$$
 								Plugging the observed value $\sqrt{N}D_N$ in $K_0$, the $p$-value can be
 								computed. At 95% confidence level (which is the probability of confirming the
 								null hypothesis when correct) the compatibility with the Landau distribution
 								cannot be disproved if $p > α = 0.05$.
 								To approximate the series, the convergence was accelerated using the Levin
 								$u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
 								terminates when the difference between two successive extrapolations reaches a
 								minimum.
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								For $N = 50000$, the following results were obtained:
 								  - $D = 0.004$
 								  - $p = 0.38$
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
 								Hence, the data was reasonably sampled from a Landau distribution.
 								**Note**:
 								Contrary to what one would expect, the $\chi^2$ test on a histogram is not very
 								useful in this case. For the test to be significant, the data has to be binned
 								such that at least several points fall in each bin. However, it can be seen
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								in @fig:landau that many bins are empty both in the right and left side of the
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								distribution, so it would be necessary to fit only the region where the points
 								cluster or use very large bins in the others, making the $\chi^2$ test
 								unpractical.
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								### Parameters comparison
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
 								When a sample of points is generated in a given range, different tests can be
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								applied in order to check whether they follow a given distribution or not. The
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								idea which lies beneath most of them is to measure how far the parameters of
 								the distribution are from the ones measured in the sample.
 								The same principle can be used to verify if the generated sample effectively
 								follows the Landau distribution. Since it turns out to be a very pathological
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								PDF, very few parameters can be easily checked: mode, median and full width at
 								half maximum (FWHM).
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								![Landau distribution with emphatized mode $m_e$ and
-												images: neatly renamed all the images

											
										
										
											2020-05-20 18:45:27 +02:00
+								  FWHM = ($x_+ - x_-$).](images/1-landau.pdf)
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
 								\begin{figure}
 								\hypertarget{fig:parameters}{%
 								\begin{tikzpicture}[overlay]
 								  \begin{scope}[shift={(0,0.4)}]
 								  % Mode
 								  \draw [thick, dashed] (7.57,3.1) -- (7.57,8.55);
 								  \draw [thick, dashed] (1.9,8.55) -- (7.57,8.55);
 								  \node [above right] at (7.6,3.1)  {$m_e$};
 								  \node [below right] at (1.9,8.55) {$f(m_e)$};
 								  % FWHM
 								  \draw [thick, dashed] (1.9,5.95)  -- (9.05,5.95);
 								  \draw [thick, dashed] (6.85,5.83) -- (6.85,3.1);
 								  \draw [thick, dashed] (8.95,5.83)    -- (8.95,3.1);
 								  \node [below right] at (1.9,5.95) {$\frac{f(m_e)}{2}$};
 								  \node [above right] at (6.85,3.1) {$x_-$};
 								  \node [above right] at (8.95,3.1)  {$x_+$};
 								  \end{scope}
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								\end{tikzpicture}}
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								\end{figure}
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								\vspace{30pt}
 								#### Mode
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								The mode of a set of data values is defined as the value that appears most
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								often, namely: it is the maximum of the PDF. Since there is no closed form for
 								the mode of the Landau PDF, it was computed numerically by the *Brent*
 								algorithm (`gsl_min_fminimizer_brent` in GSL), applied to $-f$ with a relative
 								tolerance of $10^{-7}$, giving:
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								$$
-												notes: use \num for numbers with exponential notation

											
										
										
											2020-05-26 10:07:30 +02:00
+								  \text{expected mode: } m_e = \num{-0.22278298 \pm 0.00000006}
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								$$
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								This is a minimization algorithm that begins with a bounded region known to
 								contain a minimum. The region is described by a lower bound $x_\text{min}$ and
 								an upper bound $x_\text{max}$, with an estimate of the location of the minimum
 								$x_e$.  The value of the function at $x_e$ must be less than the value of the
 								function at the ends of the interval, in order to guarantee that a minimum is
 								contained somewhere within the interval.
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
 								  f(x_\text{min}) > f(x_e) < f(x_\text{max})
 								$$
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								On each iteration the function is interpolated by a parabola passing though the
 								points $x_\text{min}$, $x_e$, $x_\text{max}$ and the minimum is computed as the
 								vertex of the parabola. If this point is found to be inside the interval it's
 								taken as a guess for the true minimum; otherwise the method falls
 								back to a golden section (using the ratio $(3 - \sqrt{5})/2 \approx 0.3819660$
 								proven to be optimal) of the interval. The value of the function at this new
 								point $x'$ is calculated. In any case if the new point is a better estimate of
 								the minimum, namely if $f(x') < f(x_e)$, then the current estimate of the
 								minimum is updated.
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								The new point allows the size of the bounded interval to be reduced, by choosing
 								the most compact set of points which satisfies the constraint $f(a) > f(x') <
 								f(b)$ between $f(x_\text{min})$, $f(x_\text{min})$ and $f(x_e)$. The interval is
 								reduced until it encloses the true minimum to a desired tolerance.
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								The error of the result is estimated by the length of the final interval.
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								On the other hand, to compute the mode of the sample, the half-sample mode (HSM)
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								or *Robertson-Cryer* estimator was used. This estimator was chosen because makes
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								no assumptions on the underlying distribution and is not computationally
 								expensive. The HSM is obtained by iteratively identifying the half modal
 								interval, which is the smallest interval containing half of the observation.
-												ex-1: reviewed

											
										
										
											2020-05-19 20:51:13 +02:00
+								Once the sample is reduced to less than three points the mode is computed as the
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								average. The special case $n=3$ is dealt with by averaging the two closer points
 								[@robertson74].
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								To obtain a better estimate of the mode and its error the above procedure was
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								bootstrapped. The original sample was treated as a population and used to build
 other samples of the same size, by *sampling with replacements*. For each one
 								of the new samples, the above statistic was computed. By simply taking the
 								mean of these statistics the following estimate was obtained:
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								$$
-												notes: use \num for numbers with exponential notation

											
										
										
											2020-05-26 10:07:30 +02:00
+								  \text{observed mode: } m_o = \num{-0.29 \pm 0.19}
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								$$
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								In order to compare the values $m_e$ and $m_0$, the following compatibility
-												notes: use math mode to write "t-test"

											
										
										
											2020-05-28 20:49:08 +02:00
+								$t$-test was applied:
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
 								  p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
 								  t = \frac{|m_e - m_o|}{\sqrt{\sigma_e^2 + \sigma_o^2}}
 								$$
 								where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
 								respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
 								In this case:
-												ex-1: reviewed

											
										
										
											2020-05-19 20:51:13 +02:00
+								  - $t = 1.012$
 								  - $p = 0.311$
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								Thus, the observed mode is compatible with the mode of the Landau distribution,
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								although the result is quite imprecise.
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								#### Median
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
 								The median is a central tendency statistics that, unlike the mean, is not
 								very sensitive to extreme values, albeit less indicative. For this reason
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								is well suited as test statistic in a pathological case such as the Landau
 								distribution.
 								The median of a probability distribution is defined as the value such that its
 								cumulative probability is $1/2$. In other words, the median partitions the
 								probability in two (connected) halves. The median of a sample, once sorted, is
 								given by its middle element if the sample size is odd, or the average of the two
 								middle elements otherwise.
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
 								The expected median was derived from the quantile function (QDF) of the Landau
 								distribution[^1].
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								Once this is know, the median is simply given by $\text{QDF}(1/2)$. Since both
 								the CDF and QDF have no known closed form, they must be computed numerically.
 								The cumulative probability was computed by quadrature-based numerical
 								integration of the PDF (`gsl_integration_qagiu()` function in GSL). The function
 								calculate an approximation of the integral:
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								$$
 								  I(x) = \int_x^{+\infty} f(t)dt
 								$$
 								[^1]: This is neither necessary nor the easiest way: it was chosen simply
 								because the quantile had been already implemented and was initially
 								used for reverse sampling.
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								The CDF is then given by $p(x) = 1 - I(x)$. This was done to avoid the
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								left tail of the distribution, where the integration can sometimes fail.
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								The integral $I$ was actually mapped beforehand onto $(0, 1]$ by
 								the change of variable $t = x + (1-u)/u$, because the integration
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								routine works on definite integrals. The result should satisfy the following
 								accuracy requirement:
 								$$
 								  |\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel}I)
 								$$
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								where the absolute and relative tolerances $\varepsilon_\text{abs}$ and
-												notes: use \num for numbers with exponential notation

											
										
										
											2020-05-26 10:07:30 +02:00
+								$\varepsilon_\text{rel}$ were set to \num{1e-10} and \num{1e-6},
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								respectively.
 								As for the QDF, this was implemented by numerically inverting the CDF. This was
 								done by solving the equation;
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								$$
 								  p(x) = p_0
 								$$
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								for x, given a probability value $p_0$, where $p(x)$ is the CDF. The (unique)
 								root of this equation was found by a root-finding routine
-												ex-1: reviewed

											
										
										
											2020-05-19 20:51:13 +02:00
+								(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method.
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								The following condition was checked for convergence:
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								$$
 								  |a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
 								$$
 								where $a,b$ are the current interval bounds. The condition immediately gives an
 								upper bound on the error of the root as $\varepsilon = |a-b|$.  The tolerances
-												notes: use \num for numbers with exponential notation

											
										
										
											2020-05-26 10:07:30 +02:00
+								here were set to 0 and \num{1e-3}.
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								The result of the numerical computation is:
 								$$
-												notes: use \num for numbers with exponential notation

											
										
										
											2020-05-26 10:07:30 +02:00
+								  \text{expected median: } m_e = \num{1.3557804 \pm 0.0000091}
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								$$
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								while the sample median, obtained again by bootstrapping, was found to be:
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								$$
-												notes: use \num for numbers with exponential notation

											
										
										
											2020-05-26 10:07:30 +02:00
+								  \text{observed median: } m_e = \num{1.3605 \pm 0.0062}
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
+								$$
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								As stated above, the median is less sensitive to extreme values with respect to
 								the mode: this lead the result to be much more precise. Applying again the
-												notes: use math mode to write "t-test"

											
										
										
											2020-05-28 20:49:08 +02:00
+								aforementioned $t$-test to this statistic:
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
 								  - $t=0.761$
 								  - $p=0.446$
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								Hence, the two values show a good agreement.
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								#### FWHM
-												ex-1: write section on median test

											
										
										
											2020-04-01 01:37:03 +02:00
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								For a unimodal distribution (having a single peak) this statistic is defined as
 								the distance between the two points at which the PDF attains half the maximum
 								value. For the Landau distribution, again, there is no analytic expression
 								known, thus the FWHM was computed numerically as follows. First of all, some
 								definitions must be given:
 								$$
 								  f_{\text{max}} = f(m_e) \et \text{FWHM} = x_+ - x_- \with
 								  f(x_\pm) = \frac{f_\text{max}}{2}
 								$$
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								The function derivative $f'(x)$ was minimized using the same minimization method
 								used for finding $m_e$. Once $f_\text{max}$ was known, the equation:
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
-												ex-1: reviewed

											
										
										
											2020-05-19 20:51:13 +02:00
+								  f(x) = \frac{f_\text{max}}{2}
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								was solved by performing the Brent-Dekker method (described before) in the
 								ranges $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$ yielding the two
-												notes: use \num for numbers with exponential notation

											
										
										
											2020-05-26 10:07:30 +02:00
+								solutions $x_\pm$. With a relative tolerance of \num{1e-7}, the following
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								result was obtained:
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
-												notes: use \num for numbers with exponential notation

											
										
										
											2020-05-26 10:07:30 +02:00
+								\text{expected FWHM: } w_e = \num{4.0186457 \pm 0.0000001}
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								\vspace{-1em}
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								![Example of a Moyal distribution density obtained by the KDE method. The rug
-												ex-1: reviewed

											
										
										
											2020-05-19 20:51:13 +02:00
+								  plot shows the original sample used in the reconstruction.
-												ex-6: section results completed

											
										
										
											2020-05-21 18:50:21 +02:00
+								](images/1-landau-kde.pdf)
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								On the other hand, obtaining a good estimate of the FWHM from a sample is much
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								more difficult. In principle, it could be measured by binning the data and
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								applying the definition to the discretised values, however this yields very
 								poor results and depends on an completely arbitrary parameter: the bin width.
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								A more refined method to construct a nonparametric empirical PDF function from
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								the sample is a kernel density estimation (KDE). This method consist in
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								convolving the (ordered) data with a smooth symmetrical kernel: in this case a
 								standard Gaussian function. Given a sample of values $\{x_i\}_{i=1}^N$, the
 								empirical PDF is defined as:
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								  f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
 								    \mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								where $\mathcal{N}$ is the kernel and the parameter $\varepsilon$, called
 								*bandwidth*, controls the strength of the smoothing. This parameter can be
 								determined in several ways: bootstrapping, cross-validation, etc. For
 								simplicity, it was chosen to use Silverman's rule of thumb [@silverman86],
 								which gives:
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								  \varepsilon = 0.63 S_N
 								    \left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								where the $0.63$ factor was chosen to compensate for the distortion that
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
+								systematically reduces the peaks height, which affects the estimation of the
-												ex-1: revised and typo-fixed

											
										
										
											2020-05-07 23:18:59 +02:00
+								mode, and:
 								  - $S_N$ is the sample standard deviation,
 								  - $d$ is the number of dimensions, in this case $d=1$.
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
 								With the empirical density estimation at hand, the FWHM can be computed by the
 								same numerical method described for the true PDF. Again this was bootstrapped
 								to estimate the standard error giving:
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
-												notes: use \num for numbers with exponential notation

											
										
										
											2020-05-26 10:07:30 +02:00
+								  \text{observed FWHM: } w_o = \num{4.06 \pm 0.08}
-												initial commit

											
										
										
											2020-03-06 02:24:32 +01:00
+								$$
-												notes: use math mode to write "t-test"

											
										
										
											2020-05-28 20:49:08 +02:00
+								Applying the $t$-test to these two values gives
-												ex-1: rewrite the chapters for bootstrapping and kde

											
										
										
											2020-04-11 19:30:46 +02:00
 								  - $t=0.495$
 								  - $p=0.620$
 								which shows a very good agreement and proves the estimator is robust.
 								For reference, the initial estimation based on an histogram gave a rather
 								inadequate \si{4 \pm 2}.