ex-1: write section on median test
This commit is contained in:
parent ae34e5e224
commit 12b8f778f2
@@ -41,7 +41,7 @@ $$

where:

- $x$ runs over the sample,
- $F(x)$ is the Landau cumulative distribution function,
- $F_N(x)$ is the empirical cumulative distribution function of the sample.

If $N$ numbers have been generated, for every point $x$,
@@ -92,7 +92,7 @@ cluster or use very large bins in the others, making the $\chi^2$ test
impractical.

## Parameters comparison

When a sample of points is generated in a given range, different tests can be
applied in order to check whether they follow a uniform distribution or not. The
@@ -100,9 +100,11 @@ idea which lies beneath most of them is to measure how far the parameters of
the distribution are from the ones measured in the sample.
The same principle can be used to verify if the generated sample effectively
follows the Landau distribution. Since it turns out to be a very pathological
PDF, very few parameters can be easily checked: mode, median and
full-width-half-maximum (FWHM).

### Mode

![Landau distribution with emphasized mode and
FWHM = ($x_+ - x_-$).](images/landau.pdf)

@@ -123,33 +125,33 @@ full-width-half-maximum (FWHM).
\node [above right] at (6.85,3.1) {$x_-$};
\node [above right] at (8.95,3.1) {$x_+$};
\end{scope}
\end{tikzpicture}}
\end{figure}

The mode of a set of data values is defined as the value that appears most
often, namely: it is the maximum of the PDF. Since there is no closed form
for the mode of the Landau PDF, it was computed numerically by the *golden
section* method (`gsl_min_fminimizer_goldensection` in GSL), applied to $-f$
with an arbitrary error of $10^{-2}$, giving:
$$
\text{expected mode: } m_e = \SI{-0.2227830 \pm 0.0000001}{}
$$

This is a minimization algorithm that begins with a bounded region known to
contain a minimum. The region is described by a lower bound $x_\text{min}$ and
an upper bound $x_\text{max}$, with an estimate of the location of the minimum
$x_e$. The value of the function at $x_e$ must be less than the value of the
function at the ends of the interval, in order to guarantee that a minimum is
contained somewhere within the interval.

$$
f(x_\text{min}) > f(x_e) < f(x_\text{max})
$$

On each iteration the interval is divided in a golden section (using the ratio
$(3 - \sqrt{5})/2 \approx 0.3819660$) and the value of the function at this new
point $x'$ is calculated. If the new point is a better estimate of the minimum,
namely if $f(x') < f(x_e)$, then the current estimate of the minimum is
updated.
The new point allows the size of the bounded interval to be reduced, by choosing
the most compact set of points which satisfies the constraint $f(a) > f(x') <
@@ -161,7 +163,9 @@ estimated as the central value of the bin with maximum events and the error
is the half width of the bins. In this case, with 40 bins between -20 and 20,
the following result was obtained:

$$
\text{observed mode: } m_o = \SI{0 \pm 1}{}
$$

In order to compare the values $m_e$ and $m_o$, the following compatibility
t-test was applied:
@@ -181,16 +185,90 @@ $$

Thus, the observed value is compatible with the expected one.


### Median

The median is a central tendency statistic that, unlike the mean, is not
very sensitive to extreme values, albeit less indicative. For this reason
it is well suited as a test statistic in a pathological case such as the
Landau distribution.
The median of a real probability distribution is defined as the value
such that its cumulative probability is $1/2$. In other words, the median
partitions the probability into two (connected) halves.
The median of a sample, once sorted, is given by its middle element if the
sample size is odd, or the average of the two middle elements otherwise.

The expected median was derived from the quantile function (QDF) of the Landau
distribution[^1].
Once this is known, the median is simply given by $QDF(1/2)$.
Since both the CDF and QDF have no known closed form, they must
be computed numerically. The cumulative probability has been computed by
quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
function in GSL). The function calculates an approximation of the integral

$$
I(x) = \int_x^{+\infty} f(t) \, dt
$$

[^1]: This is neither necessary nor the easiest way: it was chosen simply
because the quantile had already been implemented and was initially
used for inverse transform sampling.

The CDF is then given by $p(x) = 1 - I(x)$. This was done to avoid the
left tail of the distribution, where the integration can sometimes fail.
The integral $I$ is actually mapped beforehand onto $(0, 1]$ by
the change of variable $t = a + (1-u)/u$, because the integration
routine works on definite integrals. The result should satisfy the following
accuracy requirement:
$$
|\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel} I)
$$

The tolerances have been set to \SI{1e-10}{} and \SI{1e-6}{}, respectively.
As for the QDF, this was implemented by numerically inverting the CDF. This is
done by solving the equation
$$
p(x) = p_0
$$

for $x$, given a probability value $p_0$, where $p(x)$ is again the CDF.
The (unique) root of this equation is found by a root-finding routine
(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method. This
algorithm consists of a bisection search, similar to the one employed in the
mode optimisation, but improved by interpolating the function with a parabola
at each step. The following condition is checked for convergence:
$$
|a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
$$

where $a,b$ are the current interval bounds. The condition immediately gives an
upper bound on the error of the root as $\varepsilon = |a - b|$. The tolerances
here have been set to 0 and \SI{1e-3}{}.

The result of the numerical computation is:

$$
\text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
$$

while the sample median was found to be

$$
\text{observed median: } m_o = \SI{1.3479314}{}
$$

### FWHM

The same approach was taken as regards the FWHM. This statistic is defined as
the distance between the two points at which the function assumes half the
maximum value. Even in this case, there is no analytic expression for
it, thus it was computed numerically as follows.
First, some definitions must be given:

$$
f_{\text{max}} := f(m_e) \et \text{FWHM} = x_+ - x_- \with
f(x_{\pm}) = \frac{f_{\text{max}}}{2}
$$

@@ -201,13 +279,13 @@ in which the points have been sampled) in order to be able to find both the

minima of the function:

$$
f'(x) = \left|f(x) - \frac{f_{\text{max}}}{2}\right|
$$

resulting in:

$$
\text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
$$

On the other hand, the observed FWHM was computed as the difference between
@@ -215,7 +293,7 @@ the center of the bins with the values closer to $\frac{f_{\text{max}}}{2}$
and the error was taken as twice the width of the bins, obtaining:

$$
\text{observed FWHM: } w_o = \SI{4 \pm 2}{}
$$

These two values turn out to be compatible too, with:

@@ -315,7 +315,7 @@ on a theorem that proves the existence of a number $\mu_k$ such that
\Delta_k \mu_k = \|D_k \vec\delta_k\| &&
(H_k + \mu_k D_k^TD_k) \vec\delta_k = -\nabla_k
\end{align*}
Using the approximation[^2] $H\approx J^TJ$, obtained by computing the Hessian
of the first-order Taylor expansion of $\chi^2$, $\vec\delta_k$ can
be found by solving the system

@@ -326,7 +326,7 @@ $$
\end{cases}
$$

[^2]: Here $J_{ij} = \partial f_i/\partial x_j$ is the Jacobian matrix of the
vector-valued function $\vec f(\vec x)$.

The algorithm terminates if one of the following conditions is satisfied: