ex-1: rewrite the chapters for bootstrapping and kde

This commit is contained in:
Michele Guerini Rocco 2020-04-11 19:30:46 +02:00
parent 343fe3aac3
commit 64f8a0a76e
2 changed files with 113 additions and 55 deletions

BIN
notes/images/landau-kde.pdf Normal file

Binary file not shown.

View File

@ -75,10 +75,12 @@ $u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
terminates when the difference between two successive extrapolations reaches a
minimum.
For $N = 1000$, the following results were obtained:
\clearpage
- $D = 0.020$
- $p = 0.79$
For $N = 50000$, the following results were obtained:
- $D = 0.004$
- $p = 0.38$
Hence, the data was reasonably sampled from a Landau distribution.
@ -100,8 +102,8 @@ idea which lies beneath most of them is to measure how far the parameters of
the distribution are from the ones measured in the sample.
The same principle can be used to verify if the generated sample effectively
follows the Landau distribution. Since it turns out to be a very pathological
PDF, very few parameters can be easily checked: mode, median and
full-width-half-maximum (FWHM).
PDF, very few parameters can be easily checked: mode, median and full width at
half maximum (FWHM).
### Mode
@ -129,12 +131,12 @@ full-width-half-maximum (FWHM).
\end{figure}
The mode of a set of data values is defined as the value that appears most
often, namely: it is the maximum of the PDF. Since there is no closed form
for the mode of the Landau PDF, it was computed numerically by the *golden
section* method (`gsl_min_fminimizer_goldensection` in GSL), applied to $-f$
with an arbitrary error of $10^{-2}$, giving:
often, namely: it is the maximum of the PDF. Since there is no closed form for
the mode of the Landau PDF, it was computed numerically by the *Brent*
algorithm (`gsl_min_fminimizer_brent` in GSL), applied to $-f$ with a relative
tolerance of $10^{-7}$, giving:
$$
\text{expected mode: } m_e = \SI{-0.2227830 \pm 0.0000001}{}
\text{expected mode: } m_e = \SI{-0.22278298 \pm 0.00000006}{}
$$
This is a minimization algorithm that begins with a bounded region known to
@ -148,23 +150,36 @@ $$
f(x_\text{min}) > f(x_e) < f(x_\text{max})
$$
On each iteration the interval is divided in a golden section (using the ratio
($(3 - \sqrt{5})/2 \approx 0.3819660$) and the value of the function at this new
point $x'$ is calculated. If the new point is a better estimate of the minimum,
namely if $f(x') < f(x_e)$, then the current estimate of the minimum is
updated.
On each iteration the function is interpolated by a parabola passing though the
points $x_\text{min}$, $x_e$, $x_\text{max}$ and the minimum is computed as the
vertex of the parabola. If this point is found to be inside the interval it's
taken as a guess for the true minimum; otherwise the method falls
back to a golden section (using the ratio $(3 - \sqrt{5})/2 \approx 0.3819660$
proven to be optimal) of the interval. The value of the function at this new
point $x'$ is calculated. In any case if the new point is a better estimate of
the minimum, namely if $f(x') < f(x_e)$, then the current estimate of the
minimum is updated.
The new point allows the size of the bounded interval to be reduced, by choosing
the most compact set of points which satisfies the constraint $f(a) > f(x') <
f(b)$ between $f(x_\text{min})$, $f(x_\text{min})$ and $f(x_e)$. The interval is
reduced until it encloses the true minimum to a desired tolerance.
The error of the result is estimated by the length of the final interval.
In the sample, on the other hand, once the data were binned, the mode can be
estimated as the central value of the bin with maximum events and the error
is the half width of the bins. In this case, with 40 bins between -20 and 20,
the following result was obtained:
On the other hand, to compute the mode of the sample the half-sample mode (HSM)
or *Robertson-Cryer* estimator was used. This estimator was chosen because makes
no assumptions on the underlying distribution and is not computationally expensive.
The HSM is obtained by iteratively identifying the half modal interval, which
is the smallest interval containing half of the observation. Once the sample is
reduced to less that three points the mode is computed as the average. The
special case $n=3$ is dealt with by averaging the two closer points.
To obtain a better estimate of the mode and its error the above procedure was
bootstrapped. The original sample is treated as a population and used to build
other samples, of the same size, by *sampling with replacements*. For each one
of the new samples the above statistic is computed. By simply taking the
mean of these statistics the following estimate was obtained
$$
\text{observed mode: } m_o = \SI{0 \pm 1}{}
\text{observed mode: } m_o = \SI{-0.29 \pm 0.19}{}
$$
In order to compare the values $m_e$ and $x_0$, the following compatibility
@ -179,11 +194,11 @@ where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
In this case:
$$
p = 0.82
$$
- t = 1.012
- p = 0.311
Thus, the observed value is compatible with the expected one.
Thus, the observed mode is compatible with the mode of the Landau distribution,
however the result is quite imprecise.
### Median
@ -200,9 +215,9 @@ sample size is odd, or the average of the two middle elements otherwise.
The expected median was derived from the quantile function (QDF) of the Landau
distribution[^1].
Once this is know, the median is simply given by $QDF(1/2)$.
Once this is know, the median is simply given by $\text{QDF}(1/2)$.
Since both the CDF and QDF have no known closed form they must
be computed numerically. The comulative probability has been computed by
be computed numerically. The cumulative probability has been computed by
quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
function in GSL). The function calculate an approximation of the integral
@ -247,58 +262,101 @@ upper bound on the error of the root as $\varepsilon = |a-b|$. The tolerances
here have been set to 0 and \SI{1e-3}{}.
The result of the numerical computation is:
$$
\text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
$$
while the sample median was found to be
while the sample median, obtained again by bootstrapping, was found to be
$$
\text{observed median: } m_e = \SI{1.3605 \pm 0.0062}{}
$$
$$
\text{observed median: } m_e = \SI{1.3479314}{}
$$
Applying again the t-test from before to this statistic:
- $t=0.761$
- $p=0.446$
This result is much more precise than the mode and the two values show
a good agreement.
### FWHM
The same approach was taken as regards the FWHM. This statistic is defined as
the distance between the two points at which the function assumes half times
the maximum value. Even in this case, there is not an analytic expression for
it, thus it was computed numerically ad follow.
First, some definitions must be given:
For a unimodal distribution (having a single peak) this statistic is defined as
the distance between the two points at which the PDF attains half the maximum
value. For the Landau distribution, again, there is no analytic expression
known, thus the FWHM was computed numerically as follows. First of all, some
definitions must be given:
$$
f_{\text{max}} := f(m_e) \et \text{FWHM} = x_+ - x_- \with
f(x_{\pm}) = \frac{f_{\text{max}}}{2}
f_{\text{max}} = f(m_e) \et \text{FWHM} = x_+ - x_- \with
f(x_\pm) = \frac{f_\text{max}}{2}
$$
then the function $f'(x)$ was minimized using the same minimization method
used for finding $m_e$, dividing the range into $[x_\text{min}, m_e]$ and
$[m_e, x_\text{max}]$ (where $x_\text{min}$ and $x_\text{max}$ are the limits
in which the points have been sampled) in order to be able to find both the
minima of the function:
The function $f'(x)$ was minimized using the same minimization method
used for finding $m_e$. Once $f_\text{max}$ is known, the equation
$$
f'(x) = \left|f(x) - \frac{f_{\text{max}}}{2}\right|
f'(x) = \frac{f_\text{max}}{2}
$$
resulting in:
is solved by performing the Brent-Dekker method (described before) in the
ranges $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$ yielding the two
solutions $x_\pm$. With a relative tolerance of \SI{1e-7}{} the following
result was obtained:
$$
\text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
\text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
$$
\vspace{-1em}
![Example of a Moyal distribution density obtained by
the KDE method described above. The rug plot shows the original
sample used in the reconstruction.](images/landau-kde.pdf)
On the other hand, obtaining a good estimate of the FWHM from a sample is much
more difficult. In principle it could be measured by binning the data and
applying the definition to the discretised values, however this yields very
poor results and depends on an completely arbitrary parameter: the bin width.
A more refined method to construct an nonparametric empirical PDF function from
the sample is a kernel density estimation (KDE). This method consist in
convolving the (ordered) data with a smooth symmetrical kernel, in this cause a
standard gaussian function. Given a sample $\{x_i\}_{i=1}^N$, the empirical PDF
is thus constructed as
$$
f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
\mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
$$
On the other hand, the observed FWHM was computed as the difference between
the center of the bins with the values closer to $\frac{f_{\text{max}}}{2}$
and the error was taken as twice the width of the bins, obtaining:
where $\varepsilon$ is called the *bandwidth* and is a parameter that controls
the strength of the smoothing. This parameter can be determined in several
ways: bootstrapping, cross-validation, etc. For simplicity it was chosen
to use Silverman's rule of thumb, which gives
$$
\text{observed FWHM: } w_o = \SI{4 \pm 2}{}
\varepsilon = 0.63 S_N
\left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
$$
This two values turn out to be compatible too, with:
where
- $S_N$ is the sample standard deviation.
- $d$ is ne number of dimensions, in this case $d=1$
The $0.63$ factor was chosen to compensate for the distortion that
systematically reduces the peaks height, which affects the estimation of the
mode.
With the empirical density estimation at hand, the FWHM can be computed by the
same numerical method described for the true PDF. Again this was bootstrapped
to estimate the standard error giving:
$$
p = 0.99
\text{observed FWHM: } w_o = \SI{4.06 \pm 0.08}{}
$$
Applying the t-test to these two values gives
- $t=0.495$
- $p=0.620$
which shows a very good agreement and proves the estimator is robust.
For reference, the initial estimation based on an histogram gave a rather
inadequate \si{4 \pm 2}.