ex-1: write section on median test
This commit is contained in:
parent ae34e5e224
commit 12b8f778f2
@@ -41,7 +41,7 @@ $$

where:

- $x$ runs over the sample,
- $F(x)$ is the Landau cumulative distribution function,
- $F_N(x)$ is the empirical cumulative distribution function of the sample.

If $N$ numbers have been generated, for every point $x$,
@@ -92,7 +92,7 @@ cluster or use very large bins in the others, making the $\chi^2$ test
impractical.

## Parameters comparison

When a sample of points is generated in a given range, different tests can be
applied in order to check whether they follow a uniform distribution or not. The
@@ -100,9 +100,11 @@ idea which lies beneath most of them is to measure how far the parameters of
the distribution are from the ones measured in the sample.
The same principle can be used to verify if the generated sample effectively
follows the Landau distribution. Since it turns out to be a very pathological
PDF, very few parameters can be easily checked: mode, median and
full-width-half-maximum (FWHM).

### Mode

![Landau distribution with emphasized mode and
FWHM = ($x_+ - x_-$).](images/landau.pdf)

@@ -123,33 +125,33 @@ full-width-half-maximum (FWHM).
\node [above right] at (6.85,3.1) {$x_-$};
\node [above right] at (8.95,3.1) {$x_+$};
\end{scope}
\end{tikzpicture}}
\end{figure}

The mode of a set of data values is defined as the value that appears most
often, namely: it is the maximum of the PDF. Since there is no closed form
for the mode of the Landau PDF, it was computed numerically by the *golden
section* method (`gsl_min_fminimizer_goldensection` in GSL), applied to $-f$
with an arbitrary error of $10^{-2}$, giving:
$$
\text{expected mode: } m_e = \SI{-0.2227830 \pm 0.0000001}{}
$$

This is a minimization algorithm that begins with a bounded region known to
contain a minimum. The region is described by a lower bound $x_\text{min}$ and
an upper bound $x_\text{max}$, with an estimate of the location of the minimum
$x_e$. The value of the function at $x_e$ must be less than the value of the
function at the ends of the interval, in order to guarantee that a minimum is
contained somewhere within the interval.

$$
f(x_\text{min}) > f(x_e) < f(x_\text{max})
$$

On each iteration the interval is divided in a golden section (using the ratio
$(3 - \sqrt{5})/2 \approx 0.3819660$) and the value of the function at this new
point $x'$ is calculated. If the new point is a better estimate of the minimum,
namely if $f(x') < f(x_e)$, then the current estimate of the minimum is
updated.
The new point allows the size of the bounded interval to be reduced, by choosing
the most compact set of points which satisfies the constraint $f(a) > f(x') <
@@ -161,7 +163,9 @@ estimated as the central value of the bin with maximum events and the error
is the half width of the bins. In this case, with 40 bins between -20 and 20,
the following result was obtained:

$$
\text{observed mode: } m_o = \SI{0 \pm 1}{}
$$

In order to compare the values $m_e$ and $m_o$, the following compatibility
t-test was applied:
@@ -181,16 +185,90 @@ $$

Thus, the observed value is compatible with the expected one.


### Median

The median is a central tendency statistic that, unlike the mean, is not
very sensitive to extreme values, albeit less indicative. For this reason
it is well suited as a test statistic in a pathological case such as the
Landau distribution.
The median of a real probability distribution is defined as the value
such that its cumulative probability is $1/2$. In other words, the median
partitions the probability into two (connected) halves.
The median of a sample, once sorted, is given by its middle element if the
sample size is odd, or the average of the two middle elements otherwise.

The expected median was derived from the quantile function (QDF) of the Landau
distribution[^1].
Once this is known, the median is simply given by $QDF(1/2)$.
Since both the CDF and QDF have no known closed form, they must
be computed numerically. The cumulative probability has been computed by
quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
function in GSL). The function calculates an approximation of the integral

$$
I(x) = \int_x^{+\infty} f(t) \, dt
$$

[^1]: This is neither necessary nor the easiest way: it was chosen simply
because the quantile had already been implemented and was initially
used for inverse transform sampling.

The CDF is then given by $p(x) = 1 - I(x)$. This was done to avoid the
left tail of the distribution, where the integration can sometimes fail.
The integral $I$ is actually mapped beforehand onto $(0, 1]$ by
the change of variable $t = a + (1-u)/u$, because the integration
routine works on definite integrals. The result should satisfy the following
accuracy requirement:
$$
|\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel} I)
$$

The tolerances have been set to \SI{1e-10}{} and \SI{1e-6}{}, respectively.
As for the QDF, this was implemented by numerically inverting the CDF. This is
done by solving the equation
$$
p(x) = p_0
$$

for $x$, given a probability value $p_0$, where $p(x)$ is again the CDF.
The (unique) root of this equation is found by a root-finding routine
(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method. This
algorithm consists of a bisection search, similar to the one employed in the
mode optimisation, but improved by interpolating the function with a parabola
at each step. The following condition is checked for convergence:
$$
|a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
$$

where $a,b$ are the current interval bounds. The condition immediately gives an
upper bound on the error of the root as $\varepsilon = |a - b|$. The tolerances
here have been set to 0 and \SI{1e-3}{}.

The result of the numerical computation is:

$$
\text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
$$

while the sample median was found to be

$$
\text{observed median: } m_o = \SI{1.3479314}{}
$$

### FWHM

The same approach was taken as regards the FWHM. This statistic is defined as
the distance between the two points at which the function assumes half the
maximum value. Even in this case, there is no analytic expression for
it, thus it was computed numerically as follows.
First, some definitions must be given:

$$
f_{\text{max}} := f(m_e) \et \text{FWHM} = x_+ - x_- \with
f(x_{\pm}) = \frac{f_{\text{max}}}{2}
$$

@@ -201,13 +279,13 @@ in which the points have been sampled) in order to be able to find both the

minima of the function:

$$
f'(x) = \left|f(x) - \frac{f_{\text{max}}}{2}\right|
$$

resulting in:

$$
\text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
$$

On the other hand, the observed FWHM was computed as the difference between
@@ -215,7 +293,7 @@ the center of the bins with the values closer to $\frac{f_{\text{max}}}{2}$
and the error was taken as twice the width of the bins, obtaining:

$$
\text{observed FWHM: } w_o = \SI{4 \pm 2}{}
$$

These two values turn out to be compatible too, with:

@@ -315,7 +315,7 @@ on a theorem that proves the existence of a number $\mu_k$ such that
\Delta_k \mu_k = \|D_k \vec\delta_k\| &&
(H_k + \mu_k D_k^TD_k) \vec\delta_k = -\nabla_k
\end{align*}
Using the approximation[^2] $H\approx J^TJ$, obtained by computing the Hessian
of the first-order Taylor expansion of $\chi^2$, $\vec\delta_k$ can
be found by solving the system

@@ -326,7 +326,7 @@ $$
\end{cases}
$$

[^2]: Here $J_{ij} = \partial f_i/\partial x_j$ is the Jacobian matrix of the
vector-valued function $\vec f(\vec x)$.

The algorithm terminates if one of the following conditions is satisfied: