ex-1: rewrite the chapters for bootstrapping and kde

2020-04-11 19:30:46 +02:00 · 2020-04-11 19:30:46 +02:00 · 64f8a0a76e
commit 64f8a0a76e
parent 343fe3aac3
2 changed files with 113 additions and 55 deletions
--- a/notes/images/landau-kde.pdf
+++ b/notes/images/landau-kde.pdf
--- a/notes/sections/1.md
+++ b/notes/sections/1.md
@@ -75,10 +75,12 @@ $u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
 terminates when the difference between two successive extrapolations reaches a
 minimum.
-For $N = 1000$, the following results were obtained:
+\clearpage
-  - $D = 0.020$
+For $N = 50000$, the following results were obtained:
-  - $p = 0.79$
+
  - $D = 0.004$
  - $p = 0.38$
 Hence, the data was reasonably sampled from a Landau distribution.
@@ -100,8 +102,8 @@ idea which lies beneath most of them is to measure how far the parameters of
 the distribution are from the ones measured in the sample.  
 The same principle can be used to verify if the generated sample effectively
 follows the Landau distribution. Since it turns out to be a very pathological
-PDF, very few parameters can be easily checked: mode, median and
+PDF, very few parameters can be easily checked: mode, median and full width at
-full-width-half-maximum (FWHM).
+half maximum (FWHM).
 ### Mode
@@ -129,12 +131,12 @@ full-width-half-maximum (FWHM).
 \end{figure}
 The mode of a set of data values is defined as the value that appears most
-often, namely: it is the maximum of the PDF. Since there is no closed form
+often, namely: it is the maximum of the PDF. Since there is no closed form for
-for the mode of the Landau PDF, it was computed numerically by the *golden
+the mode of the Landau PDF, it was computed numerically by the *Brent*
-section* method (`gsl_min_fminimizer_goldensection` in GSL), applied to $-f$
+algorithm (`gsl_min_fminimizer_brent` in GSL), applied to $-f$ with a relative
-with an arbitrary error of $10^{-2}$, giving:
+tolerance of $10^{-7}$, giving:
 $$
-  \text{expected mode: } m_e = \SI{-0.2227830 \pm 0.0000001}{}
+  \text{expected mode: } m_e = \SI{-0.22278298 \pm 0.00000006}{}
 $$
 This is a minimization algorithm that begins with a bounded region known to
@@ -148,23 +150,36 @@ $$
  f(x_\text{min}) > f(x_e) < f(x_\text{max})
 $$
-On each iteration the interval is divided in a golden section (using the ratio
+On each iteration the function is interpolated by a parabola passing through the
-($(3 - \sqrt{5})/2 \approx 0.3819660$) and the value of the function at this new
+points $x_\text{min}$, $x_e$, $x_\text{max}$ and the minimum is computed as the
-point $x'$ is calculated. If the new point is a better estimate of the minimum,
+vertex of the parabola. If this point is found to be inside the interval it's
-namely if $f(x') < f(x_e)$, then the current estimate of the minimum is
+taken as a guess for the true minimum; otherwise the method falls
-updated.  
+back to a golden section (using the ratio $(3 - \sqrt{5})/2 \approx 0.3819660$
 proven to be optimal) of the interval. The value of the function at this new
 point $x'$ is calculated. In any case if the new point is a better estimate of
 the minimum, namely if $f(x') < f(x_e)$, then the current estimate of the
 minimum is updated.  
 The new point allows the size of the bounded interval to be reduced, by choosing
 the most compact set of points which satisfies the constraint $f(a) > f(x') <
 f(b)$ between $f(x_\text{min})$, $f(x_\text{max})$ and $f(x_e)$. The interval is
 reduced until it encloses the true minimum to a desired tolerance.
 The error of the result is estimated by the length of the final interval.
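
The golden-section fallback and the bracket update described above can be illustrated with a minimal Python sketch. It is not the GSL implementation: the parabolic interpolation step of Brent's algorithm is left out, the names `golden_section_min` and `moyal_pdf` are hypothetical, and the Moyal density is used as a stand-in because the Landau PDF has no closed form.

```python
import math

# Golden-section fraction quoted in the text: (3 - sqrt(5)) / 2 ≈ 0.381966
GOLDEN = (3 - math.sqrt(5)) / 2


def golden_section_min(f, a, b, tol=1e-7, max_iter=200):
    """Minimise a unimodal function f on [a, b].

    Returns (estimate, length of the final bracketing interval).
    """
    x = a + GOLDEN * (b - a)  # current best estimate, a < x < b
    fx = f(x)
    for _ in range(max_iter):
        if b - a < tol:
            break
        # Place the trial point in the larger of the two sub-intervals.
        if b - x > x - a:
            u = x + GOLDEN * (b - x)
        else:
            u = x - GOLDEN * (x - a)
        fu = f(u)
        if fu < fx:
            # The trial point is better: the old estimate becomes a bound.
            if u > x:
                a = x
            else:
                b = x
            x, fx = u, fu
        else:
            # The trial point is worse: it becomes a bound itself.
            if u > x:
                b = u
            else:
                a = u
    return x, b - a


def moyal_pdf(x):
    """Moyal density (stand-in for the Landau PDF); its mode is at x = 0."""
    return math.exp(-0.5 * (x + math.exp(-x))) / math.sqrt(2 * math.pi)


# Maximise the PDF by minimising -f, as done in the text.
mode, err = golden_section_min(lambda x: -moyal_pdf(x), -2.0, 3.0)
print(f"mode = {mode:.6f}, interval length = {err:.1e}")
```

The returned interval length plays the role of the error estimate mentioned above. Brent's method would typically reach the same tolerance in fewer function evaluations thanks to the parabolic steps.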
-In the sample, on the other hand, once the data were binned, the mode can be
+On the other hand, to compute the mode of the sample the half-sample mode (HSM)
-estimated as the central value of the bin with maximum events and the error
+or *Robertson-Cryer* estimator was used. This estimator was chosen because it makes
-is the half width of the bins. In this case, with 40 bins between -20 and 20,
+no assumptions on the underlying distribution and is not computationally expensive.
-the following result was obtained:
+The HSM is obtained by iteratively identifying the half modal interval, which
 is the smallest interval containing half of the observations. Once the sample is
 reduced to fewer than three points the mode is computed as the average. The
 special case $n=3$ is dealt with by averaging the two closer points.
 To obtain a better estimate of the mode and its error the above procedure was
 bootstrapped. The original sample is treated as a population and used to build
 other samples, of the same size, by *sampling with replacement*. For each one
 of the new samples the above statistic is computed. By simply taking the
 mean of these statistics the following estimate was obtained
 $$
-  \text{observed mode: } m_o = \SI{0 \pm 1}{}
+  \text{observed mode: } m_o = \SI{-0.29 \pm 0.19}{}
 $$
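
The half-sample mode and its bootstrap can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names are hypothetical, ties in the $n=3$ case are resolved towards the lower pair, the error is taken as the standard deviation of the bootstrap replicas (the text only states that their mean is used as the estimate), and the demo sample is Gaussian rather than Landau-distributed.

```python
import random
import statistics


def half_sample_mode(data):
    """Half-sample mode (Robertson-Cryer) of a non-empty sequence."""
    x = sorted(data)
    while len(x) > 3:
        h = (len(x) + 1) // 2  # number of points in a half sample
        # Width of every window of h consecutive ordered points.
        widths = [x[i + h - 1] - x[i] for i in range(len(x) - h + 1)]
        start = widths.index(min(widths))  # narrowest window
        x = x[start:start + h]
    if len(x) == 3:
        # Keep the two closer points (ties go to the lower pair).
        x = x[:2] if x[1] - x[0] <= x[2] - x[1] else x[1:]
    return sum(x) / len(x)


def bootstrap(data, statistic, n_boot=200, seed=1):
    """Resample with replacement; return (mean, std. dev.) of the statistic."""
    rng = random.Random(seed)
    n = len(data)
    replicas = [statistic(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.mean(replicas), statistics.stdev(replicas)


# Demo on a Gaussian sample centred at 2 (not the Landau data of the text).
demo_rng = random.Random(0)
sample = [demo_rng.gauss(2.0, 1.0) for _ in range(500)]
estimate, error = bootstrap(sample, half_sample_mode)
print(f"bootstrapped HSM = {estimate:.2f} +/- {error:.2f}")
```

The same `bootstrap` helper works with any statistic taking a list, for example `statistics.median`.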
 In order to compare the values $m_e$ and $m_o$, the following compatibility
@@ -179,11 +194,11 @@ where $\sigma_e$ and $\sigma_o$ are the absolute errors of $m_e$ and $m_o$
 respectively. At 95% confidence level, the values are compatible if $p > 0.05$.
 In this case:
-$$
+  - t = 1.012
-  p = 0.82
+  - p = 0.311
 $$
-Thus, the observed value is compatible with the expected one.
+Thus, the observed mode is compatible with the mode of the Landau distribution,
 however the result is quite imprecise.
 ### Median
@@ -200,9 +215,9 @@ sample size is odd, or the average of the two middle elements otherwise.
 The expected median was derived from the quantile function (QDF) of the Landau
 distribution[^1].
-Once this is know, the median is simply given by $QDF(1/2)$.
+Once this is known, the median is simply given by $\text{QDF}(1/2)$.
 Since both the CDF and QDF have no known closed form they must
-be computed numerically. The comulative probability has been computed by
+be computed numerically. The cumulative probability has been computed by
 quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
 function in GSL). The function calculates an approximation of the integral
@@ -247,58 +262,101 @@ upper bound on the error of the root as $\varepsilon = |a-b|$.  The tolerances
 here have been set to 0 and \SI{1e-3}{}.
 The result of the numerical computation is:
 $$
  \text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
 $$
-while the sample median was found to be
+while the sample median, obtained again by bootstrapping, was found to be
 $$
   \text{observed median: } m_o = \SI{1.3605 \pm 0.0062}{}
 $$
-$$
+Applying again the t-test from before to this statistic:
-  \text{observed median: } m_e = \SI{1.3479314}{}
+
-$$
+  - $t=0.761$
  - $p=0.446$
 This result is much more precise than the mode and the two values show
 a good agreement.
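
As a quick check of the compatibility test, here is a small Python sketch. The definition of $t$ is not visible in this excerpt, so the form used here is an assumption: $t = |x_e - x_o| / \sqrt{\sigma_e^2 + \sigma_o^2}$ with a two-sided Gaussian $p$-value $p = 1 - \text{erf}(t/\sqrt{2})$. The function name `compatibility` is hypothetical. With the median values quoted above it gives $t \approx 0.76$ and $p \approx 0.45$, in line with the reported figures.

```python
import math


def compatibility(x_e, sigma_e, x_o, sigma_o):
    """Return (t, p) for two values with independent Gaussian errors.

    Assumed form: t = |x_e - x_o| / sqrt(sigma_e^2 + sigma_o^2),
    p = 1 - erf(t / sqrt(2)), i.e. the two-sided tail probability.
    """
    t = abs(x_e - x_o) / math.sqrt(sigma_e**2 + sigma_o**2)
    p = math.erfc(t / math.sqrt(2))
    return t, p


# Expected and observed median, as quoted in the text.
t, p = compatibility(1.3557804, 0.0000091, 1.3605, 0.0062)
print(f"t = {t:.3f}, p = {p:.3f}")
print("compatible at 95% CL" if p > 0.05 else "not compatible at 95% CL")
```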
 ### FWHM
-The same approach was taken as regards the FWHM. This statistic is defined as
+For a unimodal distribution (having a single peak) this statistic is defined as
-the distance between the two points at which the function assumes half times
+the distance between the two points at which the PDF attains half the maximum
-the maximum value. Even in this case, there is not an analytic expression for
+value. For the Landau distribution, again, there is no analytic expression
-it, thus it was computed numerically ad follow.  
+known, thus the FWHM was computed numerically as follows. First of all, some
-First, some definitions must be given:
+definitions must be given:
 $$
-  f_{\text{max}} := f(m_e) \et \text{FWHM} = x_+ - x_- \with
+  f_{\text{max}} = f(m_e) \et \text{FWHM} = x_+ - x_- \with
-  f(x_{\pm}) = \frac{f_{\text{max}}}{2}
+  f(x_\pm) = \frac{f_\text{max}}{2}
 $$
-then the function $f'(x)$ was minimized using the same minimization method
+The function $f'(x)$ was minimized using the same minimization method
-used for finding $m_e$, dividing the range into $[x_\text{min}, m_e]$ and
+used for finding $m_e$. Once $f_\text{max}$ is known, the equation
 $[m_e, x_\text{max}]$ (where $x_\text{min}$ and $x_\text{max}$ are the limits
 in which the points have been sampled) in order to be able to find both the
 minima of the function:
 $$
-  f'(x) = \left|f(x) - \frac{f_{\text{max}}}{2}\right|
+  f'(x) = \frac{f_\text{max}}{2}
 $$
-resulting in:
+is solved by performing the Brent-Dekker method (described before) in the
 ranges  $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$ yielding the two
 solutions $x_\pm$. With a relative tolerance of \SI{1e-7}{} the following
 result was obtained:
 $$
 \text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
 $$
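
The two-sided root search can be sketched in Python as follows. Plain bisection is swapped in for the Brent-Dekker method to keep the example short, the names `bisect` and `fwhm` are hypothetical, and a standard Gaussian (whose FWHM is known to be $2\sqrt{2 \ln 2}$) stands in for the Landau PDF.

```python
import math


def bisect(g, lo, hi, tol=1e-12, max_iter=200):
    """Find a root of g in [lo, hi], assuming g changes sign on the interval."""
    g_lo, g_hi = g(lo), g(hi)
    if g_lo * g_hi > 0:
        raise ValueError("root is not bracketed")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_lo * g_mid <= 0:
            hi = mid                  # root lies in the lower half
        else:
            lo, g_lo = mid, g_mid     # root lies in the upper half
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def fwhm(f, mode, x_min, x_max):
    """Full width at half maximum of a unimodal density f."""
    half = f(mode) / 2

    def g(x):
        return f(x) - half

    x_minus = bisect(g, x_min, mode)  # solution to the left of the mode
    x_plus = bisect(g, mode, x_max)   # solution to the right of the mode
    return x_plus - x_minus


def gauss_pdf(x):
    """Standard Gaussian density, used here only as a test case."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


width = fwhm(gauss_pdf, 0.0, -10.0, 10.0)
print(f"FWHM = {width:.7f}, exact = {2 * math.sqrt(2 * math.log(2)):.7f}")
```

Splitting the range at the mode guarantees exactly one sign change of $f(x) - f_\text{max}/2$ in each half for a unimodal density, which is what makes a bracketing method applicable.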
 \vspace{-1em}
-On the other hand, the observed FWHM was computed as the difference between
+![Example of a Moyal distribution density obtained by
-the center of the bins with the values closer to $\frac{f_{\text{max}}}{2}$
+  the KDE method described above. The rug plot shows the original
-and the error was taken as twice the width of the bins, obtaining:
+  sample used in the reconstruction.](images/landau-kde.pdf)
 On the other hand, obtaining a good estimate of the FWHM from a sample is much
 more difficult. In principle it could be measured by binning the data and
 applying the definition to the discretised values, however this yields very
 poor results and depends on a completely arbitrary parameter: the bin width.  
 A more refined method to construct a nonparametric empirical PDF from
 the sample is kernel density estimation (KDE). This method consists in
 convolving the (ordered) data with a smooth symmetrical kernel, in this case a
 standard Gaussian function. Given a sample $\{x_i\}_{i=1}^N$, the empirical PDF
 is thus constructed as
 $$
-  \text{observed FWHM: } w_o = \SI{4 \pm 2}{}
+  f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
    \mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
 $$
-This two values turn out to be compatible too, with:
+where $\varepsilon$ is called the *bandwidth* and is a parameter that controls
-
+the strength of the smoothing. This parameter can be determined in several
 ways: bootstrapping, cross-validation, etc. For simplicity it was chosen
 to use Silverman's rule of thumb, which gives
 $$
-  p = 0.99
+  \varepsilon = 0.63 S_N
    \left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
 $$
 where 
  - $S_N$ is the sample standard deviation.
  - $d$ is the number of dimensions, in this case $d=1$.
 The $0.63$ factor was chosen to compensate for the distortion that
 systematically reduces the peaks height, which affects the estimation of the
 mode.
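
The estimator and the bandwidth rule above translate almost literally into Python. The following is a minimal sketch, not the code used for the results: the names `silverman_bandwidth` and `gaussian_kde` are hypothetical, the sample standard deviation is computed with the $N-1$ convention (an assumption), and the demo sample is Gaussian.

```python
import math
import random
import statistics


def silverman_bandwidth(sample, d=1, factor=0.63):
    """Bandwidth from the rule quoted in the text, with the 0.63 prefactor."""
    n = len(sample)
    s_n = statistics.stdev(sample)  # sample standard deviation (N - 1)
    return factor * s_n * ((d + 2) / 4 * n) ** (-1 / (d + 4))


def gaussian_kde(sample, bandwidth):
    """Return the empirical PDF f_eps(x) built with a standard Gaussian kernel."""
    norm = 1 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps of width `bandwidth` centred on each point.
        return norm * sum(
            math.exp(-0.5 * ((x - x_i) / bandwidth) ** 2) for x_i in sample
        )

    return density


# Demo on a small Gaussian sample (not the Landau data of the text).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]
eps = silverman_bandwidth(data)
f_eps = gaussian_kde(data, eps)
print(f"bandwidth = {eps:.3f}, f_eps(0) = {f_eps(0.0):.3f}")
```

Each evaluation of the density costs $O(N)$, so for repeated evaluations inside a bootstrap loop a smaller sample or a vectorised implementation may be preferable.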
 With the empirical density estimation at hand, the FWHM can be computed by the
 same numerical method described for the true PDF. Again this was bootstrapped
 to estimate the standard error giving:
 $$
  \text{observed FWHM: } w_o = \SI{4.06 \pm 0.08}{}
 $$
 Applying the t-test to these two values gives
  - $t=0.495$
  - $p=0.620$
 which shows a very good agreement and proves the estimator is robust.
 For reference, the initial estimation based on a histogram gave a rather
 inadequate \SI{4 \pm 2}{}.