ex-1: rewrite the chapters for bootstrapping and kde

2020-04-11 19:30:46 +02:00 · 2020-04-11 19:30:46 +02:00 · 64f8a0a76e
commit 64f8a0a76e
parent 343fe3aac3
2 changed files with 113 additions and 55 deletions
--- a/notes/images/landau-kde.pdf
+++ b/notes/images/landau-kde.pdf
--- a/notes/sections/1.md
+++ b/notes/sections/1.md
@ -75,10 +75,12 @@ $u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
 terminates when the difference between two successive extrapolations reaches a
 minimum.

-For $N = 1000$, the following results were obtained:
+\clearpage

-  - $D = 0.020$
-  - $p = 0.79$
+For $N = 50000$, the following results were obtained:
+
+  - $D = 0.004$
+  - $p = 0.38$

 Hence, the data was reasonably sampled from a Landau distribution.

@ -100,8 +102,8 @@ idea which lies beneath most of them is to measure how far the parameters of
 the distribution are from the ones measured in the sample.  
 The same principle can be used to verify if the generated sample effectively
 follows the Landau distribution. Since it turns out to be a very pathological
-PDF, very few parameters can be easily checked: mode, median and
-full-width-half-maximum (FWHM).
+PDF, very few parameters can be easily checked: mode, median and full width at
+half maximum (FWHM).

 ### Mode

@ -129,12 +131,12 @@ full-width-half-maximum (FWHM).
 \end{figure}

 The mode of a set of data values is defined as the value that appears most
-often, namely: it is the maximum of the PDF. Since there is no closed form
-for the mode of the Landau PDF, it was computed numerically by the *golden
-section* method (`gsl_min_fminimizer_goldensection` in GSL), applied to $-f$
-with an arbitrary error of $10^{-2}$, giving:
+often, namely: it is the maximum of the PDF. Since there is no closed form for
+the mode of the Landau PDF, it was computed numerically by the *Brent*
+algorithm (`gsl_min_fminimizer_brent` in GSL), applied to $-f$ with a relative
+tolerance of $10^{-7}$, giving:
 $$
-  \text{expected mode: } m_e = \SI{-0.2227830 \pm 0.0000001}{}
+  \text{expected mode: } m_e = \SI{-0.22278298 \pm 0.00000006}{}
 $$

 This is a minimization algorithm that begins with a bounded region known to
@ -148,23 +150,36 @@ $$
  f(x_\text{min}) > f(x_e) < f(x_\text{max})
 $$

-On each iteration the interval is divided in a golden section (using the ratio
-($(3 - \sqrt{5})/2 \approx 0.3819660$) and the value of the function at this new
-point $x'$ is calculated. If the new point is a better estimate of the minimum,
-namely if $f(x') < f(x_e)$, then the current estimate of the minimum is
-updated.  
+On each iteration the function is interpolated by a parabola passing though the
+points $x_\text{min}$, $x_e$, $x_\text{max}$ and the minimum is computed as the
+vertex of the parabola. If this point is found to be inside the interval it's
+taken as a guess for the true minimum; otherwise the method falls
+back to a golden section (using the ratio $(3 - \sqrt{5})/2 \approx 0.3819660$
+proven to be optimal) of the interval. The value of the function at this new
+point $x'$ is calculated. In any case if the new point is a better estimate of
+the minimum, namely if $f(x') < f(x_e)$, then the current estimate of the
+minimum is updated.  
 The new point allows the size of the bounded interval to be reduced, by choosing
 the most compact set of points which satisfies the constraint $f(a) > f(x') <
 f(b)$ between $f(x_\text{min})$, $f(x_\text{min})$ and $f(x_e)$. The interval is
 reduced until it encloses the true minimum to a desired tolerance.
+The error of the result is estimated by the length of the final interval.

-In the sample, on the other hand, once the data were binned, the mode can be
-estimated as the central value of the bin with maximum events and the error
-is the half width of the bins. In this case, with 40 bins between -20 and 20,
-the following result was obtained:
+On the other hand, to compute the mode of the sample the half-sample mode (HSM)
+or *Robertson-Cryer* estimator was used. This estimator was chosen because makes
+no assumptions on the underlying distribution and is not computationally expensive.
+The HSM is obtained by iteratively identifying the half modal interval, which
+is the smallest interval containing half of the observation. Once the sample is
+reduced to less that three points the mode is computed as the average. The
+special case $n=3$ is dealt with by averaging the two closer points.

+To obtain a better estimate of the mode and its error the above procedure was
+bootstrapped. The original sample is treated as a population and used to build
+other samples, of the same size, by *sampling with replacements*. For each one
+of the new samples the above statistic is computed. By simply taking the
+mean of these statistics the following estimate was obtained
 $$
-  \text{observed mode: } m_o = \SI{0 \pm 1}{}
+  \text{observed mode: } m_o = \SI{-0.29 \pm 0.19}{}
 $$

 In order to compare the values $m_e$ and $x_0$, the following compatibility
@ -179,11 +194,11 @@ where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
 respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
 In this case:

-$$
-  p = 0.82
-$$
+  - t = 1.012
+  - p = 0.311

-Thus, the observed value is compatible with the expected one.
+Thus, the observed mode is compatible with the mode of the Landau distribution,
+however the result is quite imprecise.


 ### Median
@ -200,9 +215,9 @@ sample size is odd, or the average of the two middle elements otherwise.

 The expected median was derived from the quantile function (QDF) of the Landau
 distribution[^1].
-Once this is know, the median is simply given by $QDF(1/2)$.
+Once this is know, the median is simply given by $\text{QDF}(1/2)$.
 Since both the CDF and QDF have no known closed form they must
-be computed numerically. The comulative probability has been computed by
+be computed numerically. The cumulative probability has been computed by
 quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
 function in GSL). The function calculate an approximation of the integral

@ -247,58 +262,101 @@ upper bound on the error of the root as $\varepsilon = |a-b|$.  The tolerances
 here have been set to 0 and \SI{1e-3}{}.

 The result of the numerical computation is:
-
 $$
  \text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
 $$

-while the sample median was found to be
+while the sample median, obtained again by bootstrapping, was found to be
+$$
+  \text{observed median: } m_e = \SI{1.3605 \pm 0.0062}{}
+$$

-$$
-  \text{observed median: } m_e = \SI{1.3479314}{}
-$$
+Applying again the t-test from before to this statistic:
+
+  - $t=0.761$
+  - $p=0.446$
+
+This result is much more precise than the mode and the two values show
+a good agreement.


 ### FWHM

-The same approach was taken as regards the FWHM. This statistic is defined as
-the distance between the two points at which the function assumes half times
-the maximum value. Even in this case, there is not an analytic expression for
-it, thus it was computed numerically ad follow.  
-First, some definitions must be given:
-
+For a unimodal distribution (having a single peak) this statistic is defined as
+the distance between the two points at which the PDF attains half the maximum
+value. For the Landau distribution, again, there is no analytic expression
+known, thus the FWHM was computed numerically as follows. First of all, some
+definitions must be given:
 $$
-  f_{\text{max}} := f(m_e) \et \text{FWHM} = x_+ - x_- \with
-  f(x_{\pm}) = \frac{f_{\text{max}}}{2}
+  f_{\text{max}} = f(m_e) \et \text{FWHM} = x_+ - x_- \with
+  f(x_\pm) = \frac{f_\text{max}}{2}
 $$

-then the function $f'(x)$ was minimized using the same minimization method
-used for finding $m_e$, dividing the range into $[x_\text{min}, m_e]$ and
-$[m_e, x_\text{max}]$ (where $x_\text{min}$ and $x_\text{max}$ are the limits
-in which the points have been sampled) in order to be able to find both the
-minima of the function:
-
+The function $f'(x)$ was minimized using the same minimization method
+used for finding $m_e$. Once $f_\text{max}$ is known, the equation
 $$
-  f'(x) = \left|f(x) - \frac{f_{\text{max}}}{2}\right|
+  f'(x) = \frac{f_\text{max}}{2}
 $$

-resulting in:
+is solved by performing the Brent-Dekker method (described before) in the
+ranges  $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$ yielding the two
+solutions $x_\pm$. With a relative tolerance of \SI{1e-7}{} the following
+result was obtained:

 $$
 \text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
 $$
+\vspace{-1em}

-On the other hand, the observed FWHM was computed as the difference between
-the center of the bins with the values closer to $\frac{f_{\text{max}}}{2}$
-and the error was taken as twice the width of the bins, obtaining:
+![Example of a Moyal distribution density obtained by
+  the KDE method described above. The rug plot shows the original
+  sample used in the reconstruction.](images/landau-kde.pdf)

+On the other hand, obtaining a good estimate of the FWHM from a sample is much
+more difficult. In principle it could be measured by binning the data and
+applying the definition to the discretised values, however this yields very
+poor results and depends on an completely arbitrary parameter: the bin width.  
+A more refined method to construct an nonparametric empirical PDF function from
+the sample is a kernel density estimation (KDE). This method consist in
+convolving the (ordered) data with a smooth symmetrical kernel, in this cause a
+standard gaussian function. Given a sample $\{x_i\}_{i=1}^N$, the empirical PDF
+is thus constructed as
 $$
-  \text{observed FWHM: } w_o = \SI{4 \pm 2}{}
+  f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
+    \mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
 $$

-This two values turn out to be compatible too, with:
-
+where $\varepsilon$ is called the *bandwidth* and is a parameter that controls
+the strength of the smoothing. This parameter can be determined in several
+ways: bootstrapping, cross-validation, etc. For simplicity it was chosen
+to use Silverman's rule of thumb, which gives
 $$
-  p = 0.99
+  \varepsilon = 0.63 S_N
+    \left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
 $$

+where 
+
+  - $S_N$ is the sample standard deviation.
+
+  - $d$ is ne number of dimensions, in this case $d=1$
+
+The $0.63$ factor was chosen to compensate for the distortion that
+systematically reduces the peaks height, which affects the estimation of the
+mode.
+
+With the empirical density estimation at hand, the FWHM can be computed by the
+same numerical method described for the true PDF. Again this was bootstrapped
+to estimate the standard error giving:
+$$
+  \text{observed FWHM: } w_o = \SI{4.06 \pm 0.08}{}
+$$
+
+Applying the t-test to these two values gives
+
+  - $t=0.495$
+  - $p=0.620$
+
+which shows a very good agreement and proves the estimator is robust.
+For reference, the initial estimation based on an histogram gave a rather
+inadequate \si{4 \pm 2}.