ex-1: rewrite the chapters for bootstrapping and kde
This commit is contained in:
parent
343fe3aac3
commit
64f8a0a76e
BIN
notes/images/landau-kde.pdf
Normal file
Binary file not shown.
@ -75,10 +75,12 @@ $u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
|
|||||||
terminates when the difference between two successive extrapolations reaches a
|
terminates when the difference between two successive extrapolations reaches a
|
||||||
minimum.
|
minimum.
|
||||||
|
|
||||||
For $N = 1000$, the following results were obtained:
|
\clearpage
|
||||||
|
|
||||||
- $D = 0.020$
|
For $N = 50000$, the following results were obtained:
|
||||||
- $p = 0.79$
|
|
||||||
|
- $D = 0.004$
|
||||||
|
- $p = 0.38$
|
||||||
|
|
||||||
Hence, the data was reasonably sampled from a Landau distribution.
|
Hence, the data was reasonably sampled from a Landau distribution.
|
||||||
|
|
||||||
@ -100,8 +102,8 @@ idea which lies beneath most of them is to measure how far the parameters of
|
|||||||
the distribution are from the ones measured in the sample.
|
the distribution are from the ones measured in the sample.
|
||||||
The same principle can be used to verify if the generated sample effectively
|
The same principle can be used to verify if the generated sample effectively
|
||||||
follows the Landau distribution. Since it turns out to be a very pathological
|
follows the Landau distribution. Since it turns out to be a very pathological
|
||||||
PDF, very few parameters can be easily checked: mode, median and
|
PDF, very few parameters can be easily checked: mode, median and full width at
|
||||||
full-width-half-maximum (FWHM).
|
half maximum (FWHM).
|
||||||
|
|
||||||
### Mode
|
### Mode
|
||||||
|
|
||||||
@ -129,12 +131,12 @@ full-width-half-maximum (FWHM).
|
|||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
The mode of a set of data values is defined as the value that appears most
|
The mode of a set of data values is defined as the value that appears most
|
||||||
often, namely: it is the maximum of the PDF. Since there is no closed form
|
often, namely: it is the maximum of the PDF. Since there is no closed form for
|
||||||
for the mode of the Landau PDF, it was computed numerically by the *golden
|
the mode of the Landau PDF, it was computed numerically by the *Brent*
|
||||||
section* method (`gsl_min_fminimizer_goldensection` in GSL), applied to $-f$
|
algorithm (`gsl_min_fminimizer_brent` in GSL), applied to $-f$ with a relative
|
||||||
with an arbitrary error of $10^{-2}$, giving:
|
tolerance of $10^{-7}$, giving:
|
||||||
$$
|
$$
|
||||||
\text{expected mode: } m_e = \SI{-0.2227830 \pm 0.0000001}{}
|
\text{expected mode: } m_e = \SI{-0.22278298 \pm 0.00000006}{}
|
||||||
$$
|
$$
|
||||||
|
|
||||||
This is a minimization algorithm that begins with a bounded region known to
|
This is a minimization algorithm that begins with a bounded region known to
|
||||||
@ -148,23 +150,36 @@ $$
|
|||||||
f(x_\text{min}) > f(x_e) < f(x_\text{max})
|
f(x_\text{min}) > f(x_e) < f(x_\text{max})
|
||||||
$$
|
$$
|
||||||
|
|
||||||
On each iteration the interval is divided in a golden section (using the ratio
|
On each iteration the function is interpolated by a parabola passing through the
|
||||||
($(3 - \sqrt{5})/2 \approx 0.3819660$) and the value of the function at this new
|
points $x_\text{min}$, $x_e$, $x_\text{max}$ and the minimum is computed as the
|
||||||
point $x'$ is calculated. If the new point is a better estimate of the minimum,
|
vertex of the parabola. If this point is found to be inside the interval it's
|
||||||
namely if $f(x') < f(x_e)$, then the current estimate of the minimum is
|
taken as a guess for the true minimum; otherwise the method falls
|
||||||
updated.
|
back to a golden section (using the ratio $(3 - \sqrt{5})/2 \approx 0.3819660$
|
||||||
|
proven to be optimal) of the interval. The value of the function at this new
|
||||||
|
point $x'$ is calculated. In any case if the new point is a better estimate of
|
||||||
|
the minimum, namely if $f(x') < f(x_e)$, then the current estimate of the
|
||||||
|
minimum is updated.
|
||||||
The new point allows the size of the bounded interval to be reduced, by choosing
|
The new point allows the size of the bounded interval to be reduced, by choosing
|
||||||
the most compact set of points which satisfies the constraint $f(a) > f(x') <
|
the most compact set of points which satisfies the constraint $f(a) > f(x') <
|
||||||
f(b)$ between $f(x_\text{min})$, $f(x_\text{max})$ and $f(x_e)$. The interval is
|
f(b)$ between $f(x_\text{min})$, $f(x_\text{max})$ and $f(x_e)$. The interval is
|
||||||
reduced until it encloses the true minimum to a desired tolerance.
|
reduced until it encloses the true minimum to a desired tolerance.
|
||||||
|
The error of the result is estimated by the length of the final interval.
|
||||||
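The golden section fallback described above can be sketched in plain C. This is only an illustration under stated assumptions: it omits the parabolic interpolation step of Brent's algorithm, it is not the GSL implementation, and the names `golden_min` and `shifted_parabola` are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Golden section search for the minimum of a unimodal f on [a, b].
   Simplified sketch: only the golden section step is shown, not the
   parabolic interpolation used by Brent's algorithm. */
double golden_min(double (*f)(double), double a, double b, double tol)
{
    assert(a < b && tol > 0.0);
    const double r = (3.0 - sqrt(5.0)) / 2.0;   /* about 0.3819660 */
    double x1 = a + r * (b - a);
    double x2 = b - r * (b - a);
    double f1 = f(x1), f2 = f(x2);

    while (b - a > tol) {
        if (f1 < f2) {              /* minimum is bracketed by [a, x2] */
            b = x2;
            x2 = x1; f2 = f1;       /* old x1 becomes the new x2 */
            x1 = a + r * (b - a);
            f1 = f(x1);
        } else {                    /* minimum is bracketed by [x1, b] */
            a = x1;
            x1 = x2; f1 = f2;       /* old x2 becomes the new x1 */
            x2 = b - r * (b - a);
            f2 = f(x2);
        }
    }
    /* The length of the final interval bounds the error of the result. */
    return 0.5 * (a + b);
}

/* Hypothetical test function with its minimum at x = 1. */
double shifted_parabola(double x)
{
    return (x - 1.0) * (x - 1.0);
}
```

To locate a mode, the same routine would be applied to $-f$, as done in the text for the Landau PDF.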
|
|
||||||
In the sample, on the other hand, once the data were binned, the mode can be
|
On the other hand, to compute the mode of the sample the half-sample mode (HSM)
|
||||||
estimated as the central value of the bin with maximum events and the error
|
or *Robertson-Cryer* estimator was used. This estimator was chosen because it makes
|
||||||
is the half width of the bins. In this case, with 40 bins between -20 and 20,
|
no assumptions on the underlying distribution and is not computationally expensive.
|
||||||
the following result was obtained:
|
The HSM is obtained by iteratively identifying the half modal interval, which
|
||||||
|
is the smallest interval containing half of the observations. Once the sample is
|
||||||
|
reduced to fewer than three points the mode is computed as the average. The
|
||||||
|
special case $n=3$ is dealt with by averaging the two closer points.
|
||||||
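A minimal sketch of the half-sample mode follows, assuming the input array is already sorted in ascending order. The function name `half_sample_mode` is hypothetical, and returning the middle point when the three remaining points are equally spaced is an assumption not stated in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Half-sample mode of an array sorted in ascending order. */
double half_sample_mode(const double *x, size_t n)
{
    assert(x != NULL && n > 0);

    while (n > 3) {
        size_t h = (n + 1) / 2;             /* size of the half sample */
        size_t best = 0;
        double width = x[h - 1] - x[0];
        /* Find the narrowest window containing h consecutive points. */
        for (size_t j = 1; j + h <= n; j++) {
            double w = x[j + h - 1] - x[j];
            if (w < width) {
                width = w;
                best = j;
            }
        }
        x += best;                          /* keep only the modal window */
        n = h;
    }

    if (n == 1) return x[0];
    if (n == 2) return 0.5 * (x[0] + x[1]);

    /* n == 3: average the two closer points. */
    double left = x[1] - x[0], right = x[2] - x[1];
    if (left < right) return 0.5 * (x[0] + x[1]);
    if (right < left) return 0.5 * (x[1] + x[2]);
    return x[1];                            /* assumption: tie gives the middle point */
}
```

Each pass halves the sample, so after sorting the cost is roughly linear in the sample size.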
|
|
||||||
|
To obtain a better estimate of the mode and its error the above procedure was
|
||||||
|
bootstrapped. The original sample is treated as a population and used to build
|
||||||
|
other samples, of the same size, by *sampling with replacement*. For each one
|
||||||
|
of the new samples the above statistic is computed. By simply taking the
|
||||||
|
mean of these statistics the following estimate was obtained
|
||||||
$$
|
$$
|
||||||
\text{observed mode: } m_o = \SI{0 \pm 1}{}
|
\text{observed mode: } m_o = \SI{-0.29 \pm 0.19}{}
|
||||||
$$
|
$$
|
||||||
|
|
||||||
In order to compare the values $m_e$ and $m_o$, the following compatibility
|
In order to compare the values $m_e$ and $m_o$, the following compatibility
|
||||||
@ -179,11 +194,11 @@ where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
|
|||||||
respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
|
respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
|
||||||
In this case:
|
In this case:
|
||||||
|
|
||||||
$$
|
- $t = 1.012$
|
||||||
p = 0.82
|
- $p = 0.311$
|
||||||
$$
|
|
||||||
|
|
||||||
Thus, the observed value is compatible with the expected one.
|
Thus, the observed mode is compatible with the mode of the Landau distribution,
|
||||||
|
although the result is quite imprecise.
|
||||||
|
|
||||||
|
|
||||||
### Median
|
### Median
|
||||||
@ -200,9 +215,9 @@ sample size is odd, or the average of the two middle elements otherwise.
|
|||||||
|
|
||||||
The expected median was derived from the quantile function (QDF) of the Landau
|
The expected median was derived from the quantile function (QDF) of the Landau
|
||||||
distribution[^1].
|
distribution[^1].
|
||||||
Once this is known, the median is simply given by $QDF(1/2)$.
|
Once this is known, the median is simply given by $\text{QDF}(1/2)$.
|
||||||
Since both the CDF and QDF have no known closed form they must
|
Since both the CDF and QDF have no known closed form they must
|
||||||
be computed numerically. The comulative probability has been computed by
|
be computed numerically. The cumulative probability has been computed by
|
||||||
quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
|
quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
|
||||||
function in GSL). The function calculates an approximation of the integral
|
function in GSL). The function calculates an approximation of the integral
|
||||||
|
|
||||||
@ -247,58 +262,101 @@ upper bound on the error of the root as $\varepsilon = |a-b|$. The tolerances
|
|||||||
here have been set to 0 and \SI{1e-3}{}.
|
here have been set to 0 and \SI{1e-3}{}.
|
||||||
|
|
||||||
The result of the numerical computation is:
|
The result of the numerical computation is:
|
||||||
|
|
||||||
$$
|
$$
|
||||||
\text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
|
\text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
|
||||||
$$
|
$$
|
||||||
|
|
||||||
while the sample median was found to be
|
while the sample median, obtained again by bootstrapping, was found to be
|
||||||
|
$$
|
||||||
|
\text{observed median: } m_o = \SI{1.3605 \pm 0.0062}{}
|
||||||
|
$$
|
||||||
|
|
||||||
$$
|
Applying again the t-test from before to this statistic:
|
||||||
\text{observed median: } m_e = \SI{1.3479314}{}
|
|
||||||
$$
|
- $t=0.761$
|
||||||
|
- $p=0.446$
|
||||||
|
|
||||||
|
This result is much more precise than the mode and the two values show
|
||||||
|
a good agreement.
|
||||||
|
|
||||||
|
|
||||||
### FWHM
|
### FWHM
|
||||||
|
|
||||||
The same approach was taken as regards the FWHM. This statistic is defined as
|
For a unimodal distribution (having a single peak) this statistic is defined as
|
||||||
the distance between the two points at which the function assumes half times
|
the distance between the two points at which the PDF attains half the maximum
|
||||||
the maximum value. Even in this case, there is not an analytic expression for
|
value. For the Landau distribution, again, there is no analytic expression
|
||||||
it, thus it was computed numerically ad follow.
|
known, thus the FWHM was computed numerically as follows. First of all, some
|
||||||
First, some definitions must be given:
|
definitions must be given:
|
||||||
|
|
||||||
$$
|
$$
|
||||||
f_{\text{max}} := f(m_e) \et \text{FWHM} = x_+ - x_- \with
|
f_{\text{max}} = f(m_e) \et \text{FWHM} = x_+ - x_- \with
|
||||||
f(x_{\pm}) = \frac{f_{\text{max}}}{2}
|
f(x_\pm) = \frac{f_\text{max}}{2}
|
||||||
$$
|
$$
|
||||||
|
|
||||||
then the function $f'(x)$ was minimized using the same minimization method
|
The function $f'(x)$ was minimized using the same minimization method
|
||||||
used for finding $m_e$, dividing the range into $[x_\text{min}, m_e]$ and
|
used for finding $m_e$. Once $f_\text{max}$ is known, the equation
|
||||||
$[m_e, x_\text{max}]$ (where $x_\text{min}$ and $x_\text{max}$ are the limits
|
|
||||||
in which the points have been sampled) in order to be able to find both the
|
|
||||||
minima of the function:
|
|
||||||
|
|
||||||
$$
|
$$
|
||||||
f'(x) = \left|f(x) - \frac{f_{\text{max}}}{2}\right|
|
f(x) = \frac{f_\text{max}}{2}
|
||||||
$$
|
$$
|
||||||
|
|
||||||
resulting in:
|
is solved by performing the Brent-Dekker method (described before) in the
|
||||||
|
ranges $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$ yielding the two
|
||||||
|
solutions $x_\pm$. With a relative tolerance of \SI{1e-7}{} the following
|
||||||
|
result was obtained:
|
||||||
|
|
||||||
$$
|
$$
|
||||||
\text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
|
\text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
|
||||||
$$
|
$$
|
||||||
|
\vspace{-1em}
|
||||||
|
|
||||||
On the other hand, the observed FWHM was computed as the difference between
|
![Example of a Moyal distribution density obtained by
|
||||||
the center of the bins with the values closer to $\frac{f_{\text{max}}}{2}$
|
the KDE method described above. The rug plot shows the original
|
||||||
and the error was taken as twice the width of the bins, obtaining:
|
sample used in the reconstruction.](images/landau-kde.pdf)
|
||||||
|
|
||||||
|
On the other hand, obtaining a good estimate of the FWHM from a sample is much
|
||||||
|
more difficult. In principle it could be measured by binning the data and
|
||||||
|
applying the definition to the discretised values, however this yields very
|
||||||
|
poor results and depends on a completely arbitrary parameter: the bin width.
|
||||||
|
A more refined method to construct a nonparametric empirical PDF from
|
||||||
|
the sample is a kernel density estimation (KDE). This method consists in
|
||||||
|
convolving the (ordered) data with a smooth symmetrical kernel, in this case a
|
||||||
|
standard Gaussian function. Given a sample $\{x_i\}_{i=1}^N$, the empirical PDF
|
||||||
|
is thus constructed as
|
||||||
$$
|
$$
|
||||||
\text{observed FWHM: } w_o = \SI{4 \pm 2}{}
|
f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
|
||||||
|
\mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
|
||||||
$$
|
$$
|
||||||
|
|
||||||
This two values turn out to be compatible too, with:
|
where $\varepsilon$ is called the *bandwidth* and is a parameter that controls
|
||||||
|
the strength of the smoothing. This parameter can be determined in several
|
||||||
|
ways: bootstrapping, cross-validation, etc. For simplicity, Silverman's rule
|
||||||
|
of thumb was used, which gives
|
||||||
$$
|
$$
|
||||||
p = 0.99
|
\varepsilon = 0.63 S_N
|
||||||
|
\left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
|
||||||
$$
|
$$
|
||||||
|
|
||||||
|
where
|
||||||
|
|
||||||
|
- $S_N$ is the sample standard deviation.
|
||||||
|
|
||||||
|
- $d$ is the number of dimensions, in this case $d=1$
|
||||||
|
|
||||||
|
The $0.63$ factor was chosen to compensate for the distortion that
|
||||||
|
systematically reduces the peak height, which affects the estimation of the
|
||||||
|
mode.
|
||||||
|
|
||||||
|
With the empirical density estimation at hand, the FWHM can be computed by the
|
||||||
|
same numerical method described for the true PDF. Again this was bootstrapped
|
||||||
|
to estimate the standard error giving:
|
||||||
|
$$
|
||||||
|
\text{observed FWHM: } w_o = \SI{4.06 \pm 0.08}{}
|
||||||
|
$$
|
||||||
|
|
||||||
|
Applying the t-test to these two values gives
|
||||||
|
|
||||||
|
- $t=0.495$
|
||||||
|
- $p=0.620$
|
||||||
|
|
||||||
|
which shows a very good agreement and suggests the estimator is robust.
|
||||||
|
For reference, the initial estimation based on a histogram gave a rather
|
||||||
|
inadequate \SI{4 \pm 2}{}.
|
||||||
|