ex-1: write section on median test

2020-04-01 01:37:03 +02:00 · 2020-04-01 01:37:03 +02:00 · 12b8f778f2
commit 12b8f778f2
parent ae34e5e224
2 changed files with 109 additions and 31 deletions
--- a/notes/sections/1.md
+++ b/notes/sections/1.md
@ -41,7 +41,7 @@ $$
 where:
  - $x$ runs over the sample,
-  - $F(x)$ is the Landau cumulative distribution and function 
+  - $F(x)$ is the Landau cumulative distribution function,
  - $F_N(x)$ is the empirical cumulative distribution function of the sample.
 If $N$ numbers have been generated, for every point $x$,
@ -92,7 +92,7 @@ cluster or use very large bins in the others, making the $\chi^2$ test
 impractical.
-### Parameters comparison
+## Parameters comparison
 When a sample of points is generated in a given range, different tests can be
 applied in order to check whether they follow an even distribution or not. The
@ -100,9 +100,11 @@ idea which lies beneath most of them is to measure how far the parameters of
 the distribution are from the ones measured in the sample.  
 The same principle can be used to verify if the generated sample effectively
 follows the Landau distribution. Since it turns out to be a very pathological
-PDF, only two parameters can be easily checked: the mode and the
+PDF, very few parameters can be easily checked: mode, median and
 full-width-half-maximum (FWHM).
 ### Mode
 ![Landau distribution with emphasized mode and
  FWHM = ($x_+ - x_-$).](images/landau.pdf)
@ -123,33 +125,33 @@ full-width-half-maximum (FWHM).
  \node [above right] at (6.85,3.1) {$x_-$};
  \node [above right] at (8.95,3.1)  {$x_+$};
  \end{scope}
-\end{tikzpicture}
+\end{tikzpicture}}
 }
 \end{figure}
 The mode of a set of data values is defined as the value that appears most
-often, namely: it is the maximum of the PDF. Since there is no way to find
+often, namely: it is the maximum of the PDF. Since there is no closed form
-an analytic form for the mode of the Landau PDF, it was numerically estimated
+for the mode of the Landau PDF, it was computed numerically by the *golden
-through a minimization method (found in GSL, called method 'golden section')
+section* method (`gsl_min_fminimizer_goldensection` in GSL), applied to $-f$
-with an arbitrary error of $10^{-2}$, obtaining:
+with an arbitrary error of $10^{-2}$, giving:
 $$
  \text{expected mode: } m_e = \SI{-0.2227830 \pm 0.0000001}{}
 $$
-  - expected mode $= m_e = \SI{-0.2227830 \pm 0.0000001}{}$
+This is a minimization algorithm that begins with a bounded region known to
-
+contain a minimum. The region is described by a lower bound $x_\text{min}$ and
-The minimization algorithm begins with a bounded region known to contain a
+an upper bound $x_\text{max}$, with an estimate of the location of the minimum
-minimum. The region is described by a lower bound $x_\text{min}$ and an upper
+$x_e$.  The value of the function at $x_e$ must be less than the value of the
-bound $x_\text{max}$, with an estimate of the location of the minimum $x_e$.
+function at the ends of the interval, in order to guarantee that a minimum is
-The value of the function at $x_e$ must be less than the value of the function
+contained somewhere within the interval.
 at the ends of the interval, in order to guarantee that a minimum is contained
 somewhere within the interval.
 $$
  f(x_\text{min}) > f(x_e) < f(x_\text{max})
 $$
 On each iteration the interval is divided in a golden section (using the ratio
-($3 - \sqrt{5}/2 \approx 0.3819660$ and the value of the function at this new
+($(3 - \sqrt{5})/2 \approx 0.3819660$) and the value of the function at this new
 point $x'$ is calculated. If the new point is a better estimate of the minimum,
-namely is $f(x') < f(x_e)$, then the current estimate of the minimum is
+namely if $f(x') < f(x_e)$, then the current estimate of the minimum is
 updated.  
 The new point allows the size of the bounded interval to be reduced, by choosing
 the most compact set of points which satisfies the constraint $f(a) > f(x') <
@ -161,7 +163,9 @@ estimated as the central value of the bin with maximum events and the error
 is the half width of the bins. In this case, with 40 bins between -20 and 20,
 the following result was obtained:
-  - observed mode $= m_o = \SI{0 \pm 1}{}$
+$$
  \text{observed mode: } m_o = \SI{0 \pm 1}{}
 $$
 In order to compare the values $m_e$ and $m_o$, the following compatibility
 t-test was applied:
@ -181,16 +185,90 @@ $$
 Thus, the observed value is compatible with the expected one.
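The golden section search described above for the expected mode can be illustrated with a minimal Python sketch. It is not the GSL implementation: the names `golden_section_min`, `tol` and `max_iter` are hypothetical, and a simple parabola stands in for $-f$, since the Landau PDF is not available in the standard library.

```python
import math

# Golden ratio fraction used to place the new point, (3 - sqrt(5)) / 2.
GOLDEN = (3 - math.sqrt(5)) / 2


def golden_section_min(f, x_min, x_e, x_max, tol=1e-6, max_iter=200):
    """Minimise f on [x_min, x_max] starting from an interior estimate x_e."""
    f_e = f(x_e)
    # The bracket must satisfy f(x_min) > f(x_e) < f(x_max).
    if not (f(x_min) > f_e < f(x_max)):
        raise ValueError("the interval does not bracket a minimum")
    for _ in range(max_iter):
        if x_max - x_min < tol:
            break
        # Place the new point inside the larger of the two sub-intervals.
        if x_max - x_e > x_e - x_min:
            x_new = x_e + GOLDEN * (x_max - x_e)
        else:
            x_new = x_e - GOLDEN * (x_e - x_min)
        f_new = f(x_new)
        if f_new < f_e:
            # Better estimate: the old estimate becomes a bound.
            if x_new > x_e:
                x_min = x_e
            else:
                x_max = x_e
            x_e, f_e = x_new, f_new
        else:
            # Estimate unchanged: the new point becomes a bound.
            if x_new > x_e:
                x_max = x_new
            else:
                x_min = x_new
    return x_e


# Stand-in function with a known minimum at x = 2 (an assumption for the demo).
x_star = golden_section_min(lambda x: (x - 2.0) ** 2, 0.0, 1.0, 5.0)
```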
 ---
-The same approach was taken as regards the FWHM. It is defined as the distance
+### Median
-between the two points at which the function assumes half times the maximum
+
-value. Even in this case, there is not an analytic expression for it, thus it
+The median is a central tendency statistic that, unlike the mean, is not
-was computed numerically ad follow.  
+very sensitive to extreme values, albeit less indicative. For this reason
 it is well suited as a test statistic in a pathological case such as the
 Landau distribution.  
 The median of a real probability distribution is defined as the value
 such that its cumulative probability is $1/2$. In other words the median
 partitions the probability in two (connected) halves.
 The median of a sample, once sorted, is given by its middle element if the
 sample size is odd, or the average of the two middle elements otherwise.
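As a minimal sketch of the sample median just defined (the function name `sample_median` is illustrative and not taken from the original code):

```python
def sample_median(values):
    """Return the median of a non-empty sequence of numbers."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("the sample is empty")
    mid = n // 2
    if n % 2 == 1:
        # Odd size: the middle element of the sorted sample.
        return data[mid]
    # Even size: the average of the two middle elements.
    return 0.5 * (data[mid - 1] + data[mid])
```

Sorting costs $O(N \log N)$; a selection algorithm would be faster for very large samples, but is not needed for this illustration.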
 The expected median was derived from the quantile function (QDF) of the Landau
 distribution[^1].
 Once this is known, the median is simply given by $QDF(1/2)$.
 Since both the CDF and QDF have no known closed form, they must
 be computed numerically. The cumulative probability has been computed by
 quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
 function in GSL). The function calculates an approximation of the integral
 $$
  I(x) = \int_x^{+\infty} f(t)dt
 $$
 [^1]: This is neither necessary nor the easiest way: it was chosen simply
 because the quantile had been already implemented and was initially
 used for reverse sampling.
 The $CDF$ is then given by $p(x) = 1 - I(x)$. This was done to avoid the
 left tail of the distribution, where the integration can sometimes fail.
 The integral $I$ is actually mapped beforehand onto $(0, 1]$ by
 the change of variable $t = a + (1-u)/u$, because the integration
 routine works on definite integrals. The result should satisfy the following
 accuracy requirement:
 $$
  |\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel}I)
 $$
 The tolerances have been set to \SI{1e-10}{} and \SI{1e-6}{}, respectively.  
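Written out, and taking $a = x$ as the lower limit, the change of variable mentioned above turns the improper integral into one over a finite interval:
$$
  I(x) = \int_x^{+\infty} f(t)dt
       = \int_0^1 f\left(x + \frac{1-u}{u}\right) \frac{du}{u^2}
$$
since $dt = -du/u^2$, while $t = x$ corresponds to $u = 1$ and
$t \to +\infty$ to $u \to 0^+$.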
 As for the QDF, this was implemented by numerically inverting the CDF. This is
 done by solving the equation
 $$
  p(x) = p_0
 $$
 for $x$, given a probability value $p_0$, where $p(x)$ is again the CDF.
 The (unique) root of this equation is found by a root-finding routine
 (`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method. This
 algorithm consists in a bisection search, similar to the one employed in the
 mode optimisation, but improved by interpolating the function with a parabola
 at each step. The following condition is checked for convergence:
 $$
  |a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
 $$
 where $a,b$ are the current interval bounds. The condition immediately gives an
 upper bound on the error of the root as $\varepsilon = |a-b|$.  The tolerances
 here have been set to 0 and \SI{1e-3}{}.
 The result of the numerical computation is:
 $$
  \text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
 $$
 while the sample median was found to be
 $$
  \text{observed median: } m_o = \SI{1.3479314}{}
 $$
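The inversion of the CDF used for the expected median, together with its stopping rule, can be illustrated by a short Python sketch. Two assumptions are made loudly here: plain bisection replaces the Brent-Dekker method, and a Cauchy distribution centred in 1 (whose CDF has a closed form) stands in for the Landau one. The names `cauchy_cdf` and `quantile` are hypothetical.

```python
import math


def cauchy_cdf(x, loc=1.0):
    """CDF of a Cauchy distribution centred in loc (stand-in for the Landau CDF)."""
    return 0.5 + math.atan(x - loc) / math.pi


def quantile(cdf, p0, a, b, eps_abs=0.0, eps_rel=1e-3, max_iter=200):
    """Solve cdf(x) = p0 on [a, b] by bisection.

    Returns the root estimate and the bound |a - b| on its error.
    """
    fa, fb = cdf(a) - p0, cdf(b) - p0
    if fa * fb > 0:
        raise ValueError("the interval does not bracket the root")
    for _ in range(max_iter):
        # Same convergence test as in the text.
        if abs(a - b) < eps_abs + eps_rel * min(abs(a), abs(b)):
            break
        mid = 0.5 * (a + b)
        fm = cdf(mid) - p0
        # Keep the half-interval on which the function changes sign.
        if fa * fm <= 0:
            b, fb = mid, fm
        else:
            a, fa = mid, fm
    return 0.5 * (a + b), abs(a - b)


# The median is the quantile at p0 = 1/2; for this stand-in it equals 1.
median, err = quantile(cauchy_cdf, 0.5, -50.0, 50.0)
```

With a zero absolute tolerance the test relies entirely on the relative term, so it can only be met when the root is away from zero, as is the case both here and for the Landau median.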
 ### FWHM
 The same approach was taken for the FWHM. This statistic is defined as
 the distance between the two points at which the function assumes half
 its maximum value. Even in this case, there is no analytic expression for
 it, thus it was computed numerically as follows.  
 First, some definitions must be given:
 $$
-  f_{\text{max}} := f(m_e) \et \text{FWHM} = x_{+} - x_{-} \with
+  f_{\text{max}} := f(m_e) \et \text{FWHM} = x_+ - x_- \with
  f(x_{\pm}) = \frac{f_{\text{max}}}{2}
 $$
@ -201,13 +279,13 @@ in which the points have been sampled) in order to be able to find both the
 minima of the function:
 $$
-  f'(x) = |f(x) - \frac{f_{\text{max}}}{2}|
+  f'(x) = \left|f(x) - \frac{f_{\text{max}}}{2}\right|
 $$
 resulting in:
 $$
-  \text{expected FWHM} = w_e = \SI{4.0186457 \pm 0.0000001}{}
+  \text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
 $$
 On the other hand, the observed FWHM was computed as the difference between
@ -215,7 +293,7 @@ the center of the bins with the values closer to $\frac{f_{\text{max}}}{2}$
 and the error was taken as twice the width of the bins, obtaining:
 $$
-  \text{observed FWHM} = w_o = \SI{4 \pm 2}{}
+  \text{observed FWHM: } w_o = \SI{4 \pm 2}{}
 $$
 These two values turn out to be compatible too, with:
--- a/notes/sections/3.md
+++ b/notes/sections/3.md
@ -315,7 +315,7 @@ on a theorem that proves the existence of a number $\mu_k$ such that
   \Delta_k \mu_k = \|D_k \vec\delta_k\| &&
    (H_k + \mu_k D_k^TD_k) \vec\delta_k = -\nabla_k
 \end{align*}
-Using the approximation[^1] $H\approx J^TJ$, obtained by computing the Hessian
+Using the approximation[^2] $H\approx J^TJ$, obtained by computing the Hessian
 of the first-order Taylor expansion of $\chi^2$, $\vec\delta_k$ can
 be found by solving the system
@ -326,7 +326,7 @@ $$
 \end{cases}
 $$
-[^1]: Here $J_{ij} = \partial f_i/\partial x_j$ is the Jacobian matrix of the
+[^2]: Here $J_{ij} = \partial f_i/\partial x_j$ is the Jacobian matrix of the
 vector-valued function $\vec f(\vec x)$.
 The algorithm terminates if one of the following conditions is satisfied: