ex-1: rewrite the chapters for bootstrapping and kde

Michele Guerini Rocco 2020-04-11 19:30:46 +02:00
parent 343fe3aac3
commit 64f8a0a76e
2 changed files with 113 additions and 55 deletions

BIN
notes/images/landau-kde.pdf Normal file

Binary file not shown.


@ -75,10 +75,12 @@ $u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
terminates when the difference between two successive extrapolations reaches a terminates when the difference between two successive extrapolations reaches a
minimum. minimum.
For $N = 1000$, the following results were obtained: \clearpage
- $D = 0.020$ For $N = 50000$, the following results were obtained:
- $p = 0.79$
- $D = 0.004$
- $p = 0.38$
Hence, the data was reasonably sampled from a Landau distribution. Hence, the data was reasonably sampled from a Landau distribution.
@ -100,8 +102,8 @@ idea which lies beneath most of them is to measure how far the parameters of
the distribution are from the ones measured in the sample. the distribution are from the ones measured in the sample.
The same principle can be used to verify if the generated sample effectively The same principle can be used to verify if the generated sample effectively
follows the Landau distribution. Since it turns out to be a very pathological follows the Landau distribution. Since it turns out to be a very pathological
PDF, very few parameters can be easily checked: mode, median and PDF, very few parameters can be easily checked: mode, median and full width at
full-width-half-maximum (FWHM). half maximum (FWHM).
### Mode ### Mode
@ -129,12 +131,12 @@ full-width-half-maximum (FWHM).
\end{figure} \end{figure}
The mode of a set of data values is defined as the value that appears most The mode of a set of data values is defined as the value that appears most
often, namely: it is the maximum of the PDF. Since there is no closed form often, namely: it is the maximum of the PDF. Since there is no closed form for
for the mode of the Landau PDF, it was computed numerically by the *golden the mode of the Landau PDF, it was computed numerically by the *Brent*
section* method (`gsl_min_fminimizer_goldensection` in GSL), applied to $-f$ algorithm (`gsl_min_fminimizer_brent` in GSL), applied to $-f$ with a relative
with an arbitrary error of $10^{-2}$, giving: tolerance of $10^{-7}$, giving:
$$ $$
\text{expected mode: } m_e = \SI{-0.2227830 \pm 0.0000001}{} \text{expected mode: } m_e = \SI{-0.22278298 \pm 0.00000006}{}
$$ $$
This is a minimization algorithm that begins with a bounded region known to This is a minimization algorithm that begins with a bounded region known to
@ -148,23 +150,36 @@ $$
f(x_\text{min}) > f(x_e) < f(x_\text{max}) f(x_\text{min}) > f(x_e) < f(x_\text{max})
$$ $$
On each iteration the interval is divided in a golden section (using the ratio On each iteration the function is interpolated by a parabola passing through the
($(3 - \sqrt{5})/2 \approx 0.3819660$) and the value of the function at this new points $x_\text{min}$, $x_e$, $x_\text{max}$ and the minimum is computed as the
point $x'$ is calculated. If the new point is a better estimate of the minimum, vertex of the parabola. If this point is found to be inside the interval it's
namely if $f(x') < f(x_e)$, then the current estimate of the minimum is taken as a guess for the true minimum; otherwise the method falls
updated. back to a golden section (using the ratio $(3 - \sqrt{5})/2 \approx 0.3819660$
proven to be optimal) of the interval. The value of the function at this new
point $x'$ is calculated. In any case if the new point is a better estimate of
the minimum, namely if $f(x') < f(x_e)$, then the current estimate of the
minimum is updated.
The new point allows the size of the bounded interval to be reduced, by choosing The new point allows the size of the bounded interval to be reduced, by choosing
the most compact set of points which satisfies the constraint $f(a) > f(x') < the most compact set of points which satisfies the constraint $f(a) > f(x') <
f(b)$ between $f(x_\text{min})$, $f(x_\text{max})$ and $f(x_e)$. The interval is f(b)$ between $f(x_\text{min})$, $f(x_\text{max})$ and $f(x_e)$. The interval is
reduced until it encloses the true minimum to a desired tolerance. reduced until it encloses the true minimum to a desired tolerance.
The error of the result is estimated by the length of the final interval.
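The interval-shrinking step can be illustrated with a minimal Python sketch. It implements only the golden section fallback, not the parabolic interpolation of Brent's algorithm, and it is not the GSL routine: the function name `golden_section_min` and the absolute tolerance `abs_tol` are assumptions made for this example (the text above uses a relative tolerance).

```python
import math

# Ratio quoted in the text: (3 - sqrt(5)) / 2 ≈ 0.3819660
GOLDEN = (3 - math.sqrt(5)) / 2


def golden_section_min(f, a, b, abs_tol=1e-7, max_iter=200):
    """Shrink [a, b] around a minimum of f, assumed unique in the bracket.

    Returns the midpoint of the final interval and its length,
    the latter serving as the error estimate.
    """
    # Two interior points placed symmetrically at the golden ratio.
    c = a + GOLDEN * (b - a)
    d = b - GOLDEN * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(max_iter):
        if b - a <= abs_tol:
            break
        if fc < fd:
            # The minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = a + GOLDEN * (b - a)
            fc = f(c)
        else:
            # The minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = b - GOLDEN * (b - a)
            fd = f(d)
    return 0.5 * (a + b), b - a


# Usage: minimise a parabola with its vertex at x = 1.5.
x_best, err = golden_section_min(lambda x: (x - 1.5) ** 2, -3.0, 4.0)
```

Each iteration reuses one of the two interior points, so only one new function evaluation is needed per step.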
In the sample, on the other hand, once the data were binned, the mode can be On the other hand, to compute the mode of the sample the half-sample mode (HSM)
estimated as the central value of the bin with maximum events and the error or *Robertson-Cryer* estimator was used. This estimator was chosen because it makes
is the half width of the bins. In this case, with 40 bins between -20 and 20, no assumptions on the underlying distribution and is not computationally expensive.
the following result was obtained: The HSM is obtained by iteratively identifying the half modal interval, which
is the smallest interval containing half of the observations. Once the sample is
reduced to fewer than three points the mode is computed as the average. The
special case $n=3$ is dealt with by averaging the two closer points.
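The HSM procedure can be sketched in a few lines of Python. This is an illustrative version written for these notes, not the code used to produce the results: the name `half_sample_mode` and the tie-breaking choices (first narrowest window, left pair when the three points are equally spaced) are assumptions.

```python
import math


def half_sample_mode(data):
    """Half-sample mode: repeatedly keep the narrowest half of the sorted data."""
    x = sorted(data)
    if not x:
        raise ValueError("empty sample")
    while len(x) > 3:
        n = len(x)
        h = math.ceil(n / 2)  # number of points in a half sample
        # Start index of the narrowest window holding h consecutive points
        # (the first such window is kept in case of ties).
        j = min(range(n - h + 1), key=lambda i: x[i + h - 1] - x[i])
        x = x[j:j + h]
    if len(x) == 3:
        # Special case: keep the two closer points (left pair on a tie).
        x = x[:2] if x[1] - x[0] <= x[2] - x[1] else x[1:]
    # One or two points left: return their average.
    return sum(x) / len(x)


# Usage: the points cluster around 1.3, away from the outliers.
mode = half_sample_mode([10.0, 1.25, -4.0, 2.0, 1.375, 0.0, 5.0, 1.0])
```

Sorting dominates the cost, so the estimate takes $O(n \log n)$ operations, in line with the remark that the estimator is not computationally expensive.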
To obtain a better estimate of the mode and its error the above procedure was
bootstrapped. The original sample is treated as a population and used to build
other samples, of the same size, by *sampling with replacement*. For each one
of the new samples the above statistic is computed. By simply taking the
mean of these statistics the following estimate was obtained:
$$ $$
\text{observed mode: } m_o = \SI{0 \pm 1}{} \text{observed mode: } m_o = \SI{-0.29 \pm 0.19}{}
$$ $$
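The bootstrap described above can be sketched as follows. The text only states that the estimate is the mean of the resampled statistics; taking the standard deviation of the replicates as the error is an assumption of this sketch, as are the function name `bootstrap` and the default number of resamples `n_boot`.

```python
import random
import statistics


def bootstrap(sample, statistic, n_boot=1000, rng=None):
    """Return (mean, standard deviation) of `statistic` over resampled data."""
    rng = rng or random.Random()
    n = len(sample)
    # Each replicate is the statistic of a same-size sample drawn with replacement.
    replicates = [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]
    return statistics.mean(replicates), statistics.stdev(replicates)


# Usage: bootstrap the median of a synthetic Gaussian sample.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]
estimate, error = bootstrap(data, statistics.median, n_boot=300, rng=rng)
```

Any statistic that maps a list of numbers to a number can be passed in, so the same helper would serve for the mode, the median and the FWHM.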
In order to compare the values $m_e$ and $m_o$, the following compatibility
@ -179,11 +194,11 @@ where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
respectively. At 95% confidence level, the values are compatible if $p > 0.05$. respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
In this case: In this case:
$$ - t = 1.012
p = 0.82 - p = 0.311
$$
Thus, the observed value is compatible with the expected one. Thus, the observed mode is compatible with the mode of the Landau distribution;
however, the result is quite imprecise.
### Median ### Median
@ -200,9 +215,9 @@ sample size is odd, or the average of the two middle elements otherwise.
The expected median was derived from the quantile function (QDF) of the Landau The expected median was derived from the quantile function (QDF) of the Landau
distribution[^1]. distribution[^1].
Once this is know, the median is simply given by $QDF(1/2)$. Once this is known, the median is simply given by $\text{QDF}(1/2)$.
Since both the CDF and QDF have no known closed form they must Since both the CDF and QDF have no known closed form they must
be computed numerically. The comulative probability has been computed by be computed numerically. The cumulative probability has been computed by
quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()` quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
function in GSL). The function calculates an approximation of the integral
@ -247,58 +262,101 @@ upper bound on the error of the root as $\varepsilon = |a-b|$. The tolerances
here have been set to 0 and \SI{1e-3}{}. here have been set to 0 and \SI{1e-3}{}.
The result of the numerical computation is: The result of the numerical computation is:
$$ $$
\text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{} \text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
$$ $$
while the sample median was found to be while the sample median, obtained again by bootstrapping, was found to be
$$
\text{observed median: } m_o = \SI{1.3605 \pm 0.0062}{}
$$
$$ Applying again the t-test from before to this statistic:
\text{observed median: } m_e = \SI{1.3479314}{}
$$ - $t=0.761$
- $p=0.446$
This result is much more precise than the mode and the two values show
a good agreement.
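The definition of the compatibility test falls outside the lines shown here, so the following Python sketch assumes its usual form: $t = |x_e - x_o| / \sqrt{\sigma_e^2 + \sigma_o^2}$ with a two-sided Gaussian $p$-value $p = 1 - \text{erf}(t/\sqrt{2})$. The function name `compatibility` is hypothetical. Under this assumption the median values quoted above give $t$ and $p$ close to the ones reported.

```python
import math


def compatibility(x_e, sigma_e, x_o, sigma_o):
    """Assumed form of the test: Gaussian t statistic and two-sided p-value."""
    # Distance between the values in units of the combined error.
    t = abs(x_e - x_o) / math.hypot(sigma_e, sigma_o)
    # Probability of a standard normal deviate exceeding t in absolute value.
    p = math.erfc(t / math.sqrt(2))
    return t, p


# Usage with the expected and observed medians quoted above.
t, p = compatibility(1.3557804, 0.0000091, 1.3605, 0.0062)
compatible = p > 0.05  # 95% confidence level criterion
```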
### FWHM ### FWHM
The same approach was taken as regards the FWHM. This statistic is defined as For a unimodal distribution (having a single peak) this statistic is defined as
the distance between the two points at which the function assumes half times the distance between the two points at which the PDF attains half the maximum
the maximum value. Even in this case, there is not an analytic expression for value. For the Landau distribution, again, there is no analytic expression
it, thus it was computed numerically ad follow. known, thus the FWHM was computed numerically as follows. First of all, some
First, some definitions must be given: definitions must be given:
$$ $$
f_{\text{max}} := f(m_e) \et \text{FWHM} = x_+ - x_- \with f_{\text{max}} = f(m_e) \et \text{FWHM} = x_+ - x_- \with
f(x_{\pm}) = \frac{f_{\text{max}}}{2} f(x_\pm) = \frac{f_\text{max}}{2}
$$ $$
then the function $f'(x)$ was minimized using the same minimization method The function $-f(x)$ was minimized using the same minimization method
used for finding $m_e$, dividing the range into $[x_\text{min}, m_e]$ and used for finding $m_e$. Once $f_\text{max}$ is known, the equation
$[m_e, x_\text{max}]$ (where $x_\text{min}$ and $x_\text{max}$ are the limits
in which the points have been sampled) in order to be able to find both the
minima of the function:
$$ $$
f'(x) = \left|f(x) - \frac{f_{\text{max}}}{2}\right| f(x) = \frac{f_\text{max}}{2}
$$ $$
resulting in: is solved by performing the Brent-Dekker method (described before) in the
ranges $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$ yielding the two
solutions $x_\pm$. With a relative tolerance of \SI{1e-7}{} the following
result was obtained:
$$ $$
\text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{} \text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
$$ $$
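As an illustration of the definition $f(x_\pm) = f_\text{max}/2$, the sketch below finds the two half-maximum points of a unimodal density, one on each side of the mode. Plain bisection is used here as a simpler stand-in for the Brent-Dekker root finder, and a standard Gaussian replaces the Landau PDF because its FWHM, $2\sqrt{2 \ln 2}$, is known in closed form. The names `bisect_root` and `fwhm` are hypothetical.

```python
import math


def bisect_root(g, a, b, tol=1e-10, max_iter=200):
    """Find a root of g in [a, b]; g(a) and g(b) must have opposite signs."""
    ga = g(a)
    if ga * g(b) > 0:
        raise ValueError("root not bracketed")
    for _ in range(max_iter):
        mid = 0.5 * (a + b)
        gm = g(mid)
        if gm == 0 or b - a < tol:
            return mid
        if ga * gm < 0:
            b = mid  # sign change in [a, mid]
        else:
            a, ga = mid, gm  # sign change in [mid, b]
    return 0.5 * (a + b)


def fwhm(f, mode, x_min, x_max):
    """Width of f at half of its maximum, solving on each side of the mode."""
    half = f(mode) / 2

    def g(x):
        return f(x) - half

    x_minus = bisect_root(g, x_min, mode)
    x_plus = bisect_root(g, mode, x_max)
    return x_plus - x_minus


def gauss(x):
    """Standard normal density, used only as a test function."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


width = fwhm(gauss, 0.0, -10.0, 10.0)
```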
\vspace{-1em}
On the other hand, the observed FWHM was computed as the difference between ![Example of a Moyal distribution density obtained by
the center of the bins with the values closer to $\frac{f_{\text{max}}}{2}$ the KDE method described above. The rug plot shows the original
and the error was taken as twice the width of the bins, obtaining: sample used in the reconstruction.](images/landau-kde.pdf)
On the other hand, obtaining a good estimate of the FWHM from a sample is much
more difficult. In principle it could be measured by binning the data and
applying the definition to the discretised values, however this yields very
poor results and depends on a completely arbitrary parameter: the bin width.
A more refined method to construct a nonparametric empirical PDF function from
the sample is a kernel density estimation (KDE). This method consists in
convolving the (ordered) data with a smooth symmetrical kernel, in this case a
standard Gaussian function. Given a sample $\{x_i\}_{i=1}^N$, the empirical PDF
is thus constructed as
$$ $$
\text{observed FWHM: } w_o = \SI{4 \pm 2}{} f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
\mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
$$ $$
This two values turn out to be compatible too, with: where $\varepsilon$ is called the *bandwidth* and is a parameter that controls
the strength of the smoothing. This parameter can be determined in several
ways: bootstrapping, cross-validation, etc. For simplicity, Silverman's rule of
thumb was used, which gives
$$ $$
p = 0.99 \varepsilon = 0.63 S_N
\left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
$$ $$
where
- $S_N$ is the sample standard deviation.
- $d$ is the number of dimensions, in this case $d=1$
The $0.63$ factor was chosen to compensate for the distortion that
systematically reduces the peak height, which affects the estimation of the
mode.
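The estimator and the bandwidth rule above translate almost directly into code. The following Python sketch is only meant to make the two formulas concrete: the names `silverman_bandwidth` and `gaussian_kde` are hypothetical, and the direct sum costs $O(N)$ per evaluation, which may be too slow for large samples.

```python
import math
import statistics


def silverman_bandwidth(sample, d=1, factor=0.63):
    """Bandwidth from the rule quoted above, with the 0.63 correction factor."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation S_N
    return factor * s * ((d + 2) / 4 * n) ** (-1 / (d + 4))


def gaussian_kde(sample, bandwidth):
    """Return the empirical density built from standard Gaussian kernels."""
    n = len(sample)
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centred on each data point.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Usage on a toy sample.
sample = [0.0, 1.0, 2.0, 3.0, 4.0]
eps = silverman_bandwidth(sample)
pdf = gaussian_kde(sample, eps)
```

Since each kernel integrates to one, the resulting density is normalised for any positive bandwidth.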
With the empirical density estimation at hand, the FWHM can be computed by the
same numerical method described for the true PDF. Again this was bootstrapped
to estimate the standard error giving:
$$
\text{observed FWHM: } w_o = \SI{4.06 \pm 0.08}{}
$$
Applying the t-test to these two values gives
- $t=0.495$
- $p=0.620$
which shows a very good agreement and suggests the estimator is robust.
For reference, the initial estimation based on a histogram gave a rather
inadequate \SI{4 \pm 2}{}.