ex-1: write section on median test

This commit is contained in:
Michele Guerini Rocco 2020-04-01 01:37:03 +02:00
parent ae34e5e224
commit 12b8f778f2
2 changed files with 109 additions and 31 deletions

View File

@ -41,7 +41,7 @@ $$
where: where:
- $x$ runs over the sample, - $x$ runs over the sample,
- $F(x)$ is the Landau cumulative distribution and function - $F(x)$ is the Landau cumulative distribution and function
- $F_N(x)$ is the empirical cumulative distribution function of the sample. - $F_N(x)$ is the empirical cumulative distribution function of the sample.
If $N$ numbers have been generated, for every point $x$, If $N$ numbers have been generated, for every point $x$,
@ -92,7 +92,7 @@ cluster or use very large bins in the others, making the $\chi^2$ test
unpractical. unpractical.
### Parameters comparison ## Parameters comparison
When a sample of points is generated in a given range, different tests can be When a sample of points is generated in a given range, different tests can be
applied in order to check whether they follow an even distribution or not. The applied in order to check whether they follow an even distribution or not. The
@ -100,9 +100,11 @@ idea which lies beneath most of them is to measure how far the parameters of
the distribution are from the ones measured in the sample. the distribution are from the ones measured in the sample.
The same principle can be used to verify if the generated sample effectively The same principle can be used to verify if the generated sample effectively
follows the Landau distribution. Since it turns out to be a very pathological follows the Landau distribution. Since it turns out to be a very pathological
PDF, only two parameters can be easily checked: the mode and the PDF, very few parameters can be easily checked: mode, median and
full-width-half-maximum (FWHM). full-width-half-maximum (FWHM).
### Mode
![Landau distribution with emphatized mode and ![Landau distribution with emphatized mode and
FWHM = ($x_+ - x_-$).](images/landau.pdf) FWHM = ($x_+ - x_-$).](images/landau.pdf)
@ -123,33 +125,33 @@ full-width-half-maximum (FWHM).
\node [above right] at (6.85,3.1) {$x_-$}; \node [above right] at (6.85,3.1) {$x_-$};
\node [above right] at (8.95,3.1) {$x_+$}; \node [above right] at (8.95,3.1) {$x_+$};
\end{scope} \end{scope}
\end{tikzpicture} \end{tikzpicture}}
}
\end{figure} \end{figure}
The mode of a set of data values is defined as the value that appears most The mode of a set of data values is defined as the value that appears most
often, namely: it is the maximum of the PDF. Since there is no way to find often, namely: it is the maximum of the PDF. Since there is no closed form
an analytic form for the mode of the Landau PDF, it was numerically estimated for the mode of the Landau PDF, it was computed numerically by the *golden
through a minimization method (found in GSL, called method 'golden section') section* method (`gsl_min_fminimizer_goldensection` in GSL), applied to $-f$
with an arbitrary error of $10^{-2}$, obtaining: with an arbitrary error of $10^{-2}$, giving:
$$
\text{expected mode: } m_e = \SI{-0.2227830 \pm 0.0000001}{}
$$
- expected mode $= m_e = \SI{-0.2227830 \pm 0.0000001}{}$ This is a minimization algorithm that begins with a bounded region known to
contain a minimum. The region is described by a lower bound $x_\text{min}$ and
The minimization algorithm begins with a bounded region known to contain a an upper bound $x_\text{max}$, with an estimate of the location of the minimum
minimum. The region is described by a lower bound $x_\text{min}$ and an upper $x_e$. The value of the function at $x_e$ must be less than the value of the
bound $x_\text{max}$, with an estimate of the location of the minimum $x_e$. function at the ends of the interval, in order to guarantee that a minimum is
The value of the function at $x_e$ must be less than the value of the function contained somewhere within the interval.
at the ends of the interval, in order to guarantee that a minimum is contained
somewhere within the interval.
$$ $$
f(x_\text{min}) > f(x_e) < f(x_\text{max}) f(x_\text{min}) > f(x_e) < f(x_\text{max})
$$ $$
On each iteration the interval is divided in a golden section (using the ratio On each iteration the interval is divided in a golden section (using the ratio
($3 - \sqrt{5}/2 \approx 0.3819660$ and the value of the function at this new ($(3 - \sqrt{5})/2 \approx 0.3819660$) and the value of the function at this new
point $x'$ is calculated. If the new point is a better estimate of the minimum, point $x'$ is calculated. If the new point is a better estimate of the minimum,
namely is $f(x') < f(x_e)$, then the current estimate of the minimum is namely if $f(x') < f(x_e)$, then the current estimate of the minimum is
updated. updated.
The new point allows the size of the bounded interval to be reduced, by choosing The new point allows the size of the bounded interval to be reduced, by choosing
the most compact set of points which satisfies the constraint $f(a) > f(x') < the most compact set of points which satisfies the constraint $f(a) > f(x') <
@ -161,7 +163,9 @@ estimated as the central value of the bin with maximum events and the error
is the half width of the bins. In this case, with 40 bins between -20 and 20, is the half width of the bins. In this case, with 40 bins between -20 and 20,
the following result was obtained: the following result was obtained:
- observed mode $= m_o = \SI{0 \pm 1}{}$ $$
\text{observed mode: } m_o = \SI{0 \pm 1}{}
$$
In order to compare the values $m_e$ and $x_0$, the following compatibility In order to compare the values $m_e$ and $x_0$, the following compatibility
t-test was applied: t-test was applied:
@ -181,16 +185,90 @@ $$
Thus, the observed value is compatible with the expected one. Thus, the observed value is compatible with the expected one.
---
The same approach was taken as regards the FWHM. It is defined as the distance ### Median
between the two points at which the function assumes half times the maximum
value. Even in this case, there is not an analytic expression for it, thus it The median is a central tendency statistics that, unlike the mean, is not
was computed numerically ad follow. very sensitive to extreme values, albeit less indicative. For this reason
is well suited as test statistic in a pathological case such as the
Landau distribution.
The median of a real probability distribution is defined as the value
such that its cumulative probability is $1/2$. In other words the median
partitions the probability in two (connected) halves.
The median of a sample, once sorted, is given by its middle element if the
sample size is odd, or the average of the two middle elements otherwise.
The expected median was derived from the quantile function (QDF) of the Landau
distribution[^1].
Once this is know, the median is simply given by $QDF(1/2)$.
Since both the CDF and QDF have no known closed form they must
be computed numerically. The comulative probability has been computed by
quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
function in GSL). The function calculate an approximation of the integral
$$
I(x) = \int_x^{+\infty} f(t)dt
$$
[^1]: This is neither necessary nor the easiest way: it was chosen simply
because the quantile had been already implemented and was initially
used for reverse sampling.
The $CDF$ is then given by $p(x) = 1 - I(x)$. This was done to avoid the
left tail of the distribution, where the integration can sometimes fail.
The integral $I$ is actually mapped beforehand onto $(0, 1]$ by
the change of variable $t = a + (1-u)/u$, because the integration
routine works on definite integrals. The result should satisfy the following
accuracy requirement:
$$
|\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel}I)
$$
The tolerances have been set to \SI{1e-10}{} and \SI{1e-6}{}, respectively.
As for the QDF, this was implemented by numerically inverting the CDF. This is
done by solving the equation
$$
p(x) = p_0
$$
for x, given a probability value $p_0$, where $p(x)$ is again the CDF.
The (unique) root of this equation is found by a root-finding routine
(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method. This
algorithm consists in a bisection search, similar to the one employed in the
mode optimisation, but improved by interpolating the function with a parabola
at each step. The following condition is checked for convergence:
$$
|a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
$$
where $a,b$ are the current interval bounds. The condition immediately gives an
upper bound on the error of the root as $\varepsilon = |a-b|$. The tolerances
here have been set to 0 and \SI{1e-3}{}.
The result of the numerical computation is:
$$
\text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
$$
while the sample median was found to be
$$
\text{observed median: } m_e = \SI{1.3479314}{}
$$
### FWHM
The same approach was taken as regards the FWHM. This statistic is defined as
the distance between the two points at which the function assumes half times
the maximum value. Even in this case, there is not an analytic expression for
it, thus it was computed numerically ad follow.
First, some definitions must be given: First, some definitions must be given:
$$ $$
f_{\text{max}} := f(m_e) \et \text{FWHM} = x_{+} - x_{-} \with f_{\text{max}} := f(m_e) \et \text{FWHM} = x_+ - x_- \with
f(x_{\pm}) = \frac{f_{\text{max}}}{2} f(x_{\pm}) = \frac{f_{\text{max}}}{2}
$$ $$
@ -201,13 +279,13 @@ in which the points have been sampled) in order to be able to find both the
minima of the function: minima of the function:
$$ $$
f'(x) = |f(x) - \frac{f_{\text{max}}}{2}| f'(x) = \left|f(x) - \frac{f_{\text{max}}}{2}\right|
$$ $$
resulting in: resulting in:
$$ $$
\text{expected FWHM} = w_e = \SI{4.0186457 \pm 0.0000001}{} \text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
$$ $$
On the other hand, the observed FWHM was computed as the difference between On the other hand, the observed FWHM was computed as the difference between
@ -215,7 +293,7 @@ the center of the bins with the values closer to $\frac{f_{\text{max}}}{2}$
and the error was taken as twice the width of the bins, obtaining: and the error was taken as twice the width of the bins, obtaining:
$$ $$
\text{observed FWHM} = w_o = \SI{4 \pm 2}{} \text{observed FWHM: } w_o = \SI{4 \pm 2}{}
$$ $$
This two values turn out to be compatible too, with: This two values turn out to be compatible too, with:

View File

@ -315,7 +315,7 @@ on a theorem that proves the existence of a number $\mu_k$ such that
\Delta_k \mu_k = \|D_k \vec\delta_k\| && \Delta_k \mu_k = \|D_k \vec\delta_k\| &&
(H_k + \mu_k D_k^TD_k) \vec\delta_k = -\nabla_k (H_k + \mu_k D_k^TD_k) \vec\delta_k = -\nabla_k
\end{align*} \end{align*}
Using the approximation[^1] $H\approx J^TJ$, obtained by computing the Hessian Using the approximation[^2] $H\approx J^TJ$, obtained by computing the Hessian
of the first-order Taylor expansion of $\chi^2$, $\vec\delta_k$ can of the first-order Taylor expansion of $\chi^2$, $\vec\delta_k$ can
be found by solving the system be found by solving the system
@ -326,7 +326,7 @@ $$
\end{cases} \end{cases}
$$ $$
[^1]: Here $J_{ij} = \partial f_i/\partial x_j$ is the Jacobian matrix of the [^2]: Here $J_{ij} = \partial f_i/\partial x_j$ is the Jacobian matrix of the
vector-valued function $\vec f(\vec x)$. vector-valued function $\vec f(\vec x)$.
The algorithm terminates if on of the following condition are satisfied: The algorithm terminates if on of the following condition are satisfied: