ex-1: revised and typo-fixed

Giù Marcer 2020-05-07 23:18:59 +02:00 committed by rnhmjoj
parent aa36fca43e
commit 1bd691a24b
5 changed files with 84 additions and 91 deletions


@@ -94,6 +94,7 @@ int main(int argc, char** argv) {
 	pdf.function = &landau_pdf;
 	pdf.params = NULL;
+	// number of bootstrap samples
 	const size_t boots = 100;
 	double mode_e = numeric_mode(min, max, &pdf, 1);


@@ -6,11 +6,13 @@ CCOMPILE = \
 	mkdir -p $(@D); \
 	$(CC) $(CFLAGS) $^ -o $@
 
-ex-1: ex-1/bin/main ex-1/bin/pdf
+ex-1: ex-1/bin/main ex-1/bin/pdf ex-1/bin/histo
 ex-1/bin/main: ex-1/main.c ex-1/landau.c ex-1/tests.c ex-1/bootstrap.c
 	$(CCOMPILE)
 ex-1/bin/pdf: ex-1/pdf.c
 	$(CCOMPILE)
+ex-1/bin/histo: ex-1/histo.c
+	$(CCOMPILE)
 
 ex-2: ex-2/bin/fancy ex-2/bin/fancier ex-2/bin/limit ex-2/bin/naive ex-2/bin/recip
 ex-2/bin/%: ex-2/%.c

Binary file not shown.


Binary file not shown.


@@ -4,7 +4,6 @@
 The Landau distribution is a probability density function which can be defined
 as follows:
 $$
 f(x) = \int \limits_{0}^{+ \infty} dt \, e^{-t \log(t) -xt} \sin (\pi t)
 $$
@@ -19,8 +18,8 @@ For the purpose of visualizing the resulting sample, the data was put into
 an histogram and plotted with matplotlib. The result is shown in @fig:landau.
 
 ![Example of N = 10'000 points generated with the `gsl_ran_landau()`
-function and plotted in a 100-bins histogram ranging from -10 to
-80.](images/landau-hist.png){#fig:landau}
+function and plotted in a 100-bins histogram ranging from -20 to
+80.](images/landau-histo.pdf){#fig:landau}
 
 ## Randomness testing of the generated sample
@@ -33,7 +32,6 @@ distance between the cumulative distribution function of the Landau distribution
 and the one of the sample. The null hypothesis is that the sample was
 drawn from the reference distribution.
 The KS statistic for a given cumulative distribution function $F(x)$ is:
 $$
 D_N = \text{sup}_x |F_N(x) - F(x)|
 $$
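The supremum above is attained at a jump of the empirical CDF, so the statistic reduces to a single pass over the sorted sample. A plain-C sketch (illustrative names, not code from this repository; `unit_cdf` is a stand-in reference CDF):

```c
#include <math.h>
#include <stddef.h>

/* D_N = sup_x |F_N(x) - F(x)| for a sorted sample against a reference CDF.
 * The empirical CDF jumps from i/N to (i+1)/N at each sorted point, so the
 * supremum is attained just below or just above one of those jumps. */
double ks_statistic(const double *sorted, size_t n, double (*cdf)(double)) {
    double d = 0;
    for (size_t i = 0; i < n; i++) {
        double f  = cdf(sorted[i]);
        double lo = fabs(f - (double)i / n);        /* just below the jump */
        double hi = fabs((double)(i + 1) / n - f);  /* just above the jump */
        if (lo > d) d = lo;
        if (hi > d) d = hi;
    }
    return d;
}

/* illustrative reference distribution: uniform on [0, 1] */
double unit_cdf(double x) { return x < 0 ? 0 : (x > 1 ? 1 : x); }
```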
@@ -53,14 +51,12 @@ found in GSL.
 Under the null hypothesis, the distribution of $D_N$ is expected to
 asymptotically approach a Kolmogorov distribution:
 $$
 \sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K
 $$
-where $K$ is the Kolmogorov variable, with cumulative
-distribution function given by:
+where $K$ is the Kolmogorov variable, with cumulative distribution function
+given by [@marsaglia03]:
 $$
 P(K \leqslant K_0) = 1 - p = \frac{\sqrt{2 \pi}}{K_0}
 \sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / 8 K_0^2}
@@ -75,8 +71,6 @@ $u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
 terminates when the difference between two successive extrapolations reaches a
 minimum.
 
-\clearpage
 For $N = 50000$, the following results were obtained:
 
 - $D = 0.004$
@@ -88,16 +82,16 @@ Hence, the data was reasonably sampled from a Landau distribution.
 Contrary to what one would expect, the $\chi^2$ test on a histogram is not very
 useful in this case. For the test to be significant, the data has to be binned
 such that at least several points fall in each bin. However, it can be seen
-(@fig:landau) that many bins are empty both in the right and left side of the
+in @fig:landau that many bins are empty both in the right and left side of the
 distribution, so it would be necessary to fit only the region where the points
 cluster or use very large bins in the others, making the $\chi^2$ test
 unpractical.
 
-## Parameters comparison
+### Parameters comparison
 
 When a sample of points is generated in a given range, different tests can be
-applied in order to check whether they follow an even distribution or not. The
+applied in order to check whether they follow a given distribution or not. The
 idea which lies beneath most of them is to measure how far the parameters of
 the distribution are from the ones measured in the sample.
 The same principle can be used to verify if the generated sample effectively
@@ -105,9 +99,7 @@ follows the Landau distribution. Since it turns out to be a very pathological
 PDF, very few parameters can be easily checked: mode, median and full width at
 half maximum (FWHM).
 
-### Mode
-
-![Landau distribution with emphatized mode and
+![Landau distribution with emphatized mode $m_e$ and
 FWHM = ($x_+ - x_-$).](images/landau.pdf)
 
 \begin{figure}
@@ -130,6 +122,10 @@ half maximum (FWHM).
 \end{tikzpicture}}
 \end{figure}
 
+\vspace{30pt}
+
+#### Mode
+
 The mode of a set of data values is defined as the value that appears most
 often, namely: it is the maximum of the PDF. Since there is no closed form for
 the mode of the Landau PDF, it was computed numerically by the *Brent*
@@ -145,7 +141,6 @@ an upper bound $x_\text{max}$, with an estimate of the location of the minimum
 $x_e$. The value of the function at $x_e$ must be less than the value of the
 function at the ends of the interval, in order to guarantee that a minimum is
 contained somewhere within the interval.
 $$
 f(x_\text{min}) > f(x_e) < f(x_\text{max})
 $$
@@ -165,26 +160,26 @@ f(b)$ between $f(x_\text{min})$, $f(x_\text{min})$ and $f(x_e)$. The interval is
 reduced until it encloses the true minimum to a desired tolerance.
 The error of the result is estimated by the length of the final interval.
 
-On the other hand, to compute the mode of the sample the half-sample mode (HSM)
+On the other hand, to compute the mode of the sample, the half-sample mode (HSM)
 or *Robertson-Cryer* estimator was used. This estimator was chosen because makes
-no assumptions on the underlying distribution and is not computationally expensive.
-The HSM is obtained by iteratively identifying the half modal interval, which
-is the smallest interval containing half of the observation. Once the sample is
-reduced to less that three points the mode is computed as the average. The
-special case $n=3$ is dealt with by averaging the two closer points.
+no assumptions on the underlying distribution and is not computationally
+expensive. The HSM is obtained by iteratively identifying the half modal
+interval, which is the smallest interval containing half of the observation.
+Once the sample is reduced to less that three points the mode is computed as the
+average. The special case $n=3$ is dealt with by averaging the two closer points
+[@robertson74].
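The HSM iteration just described fits in a few lines of plain C (an illustrative sketch assuming sorted input, not the repository's implementation):

```c
#include <stddef.h>

/* Half-sample mode: repeatedly restrict a sorted sample to the narrowest
 * window holding half the observations, until three or fewer points remain. */
double half_sample_mode(const double *x, size_t n) {
    while (n > 3) {
        size_t half = (n + 1) / 2;   /* window size: half the observations */
        size_t best = 0;
        for (size_t i = 0; i + half <= n; i++)
            if (x[i + half - 1] - x[i] < x[best + half - 1] - x[best])
                best = i;
        x += best;                   /* keep only the half modal interval */
        n = half;
    }
    if (n == 3)  /* special case: average the two closer points */
        return (x[1] - x[0] <= x[2] - x[1]) ? (x[0] + x[1]) / 2
                                            : (x[1] + x[2]) / 2;
    return n == 2 ? (x[0] + x[1]) / 2 : x[0];
}
```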
 To obtain a better estimate of the mode and its error the above procedure was
-bootstrapped. The original sample is treated as a population and used to build
-other samples, of the same size, by *sampling with replacements*. For each one
-of the new samples the above statistic is computed. By simply taking the
-mean of these statistics the following estimate was obtained
+bootstrapped. The original sample was treated as a population and used to build
+100 other samples of the same size, by *sampling with replacements*. For each one
+of the new samples, the above statistic was computed. By simply taking the
+mean of these statistics the following estimate was obtained:
 $$
 \text{observed mode: } m_o = \SI{-0.29 \pm 0.19}{}
 $$
-In order to compare the values $m_e$ and $x_0$, the following compatibility
+In order to compare the values $m_e$ and $m_0$, the following compatibility
 t-test was applied:
 $$
 p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
 t = \frac{|m_e - m_o|}{\sqrt{\sigma_e^2 + \sigma_o^2}}
@@ -198,29 +193,28 @@ In this case:
 - p = 0.311
 
 Thus, the observed mode is compatible with the mode of the Landau distribution,
-however the result is quite imprecise.
+although the result is quite imprecise.
 
-### Median
+#### Median
 
 The median is a central tendency statistics that, unlike the mean, is not
 very sensitive to extreme values, albeit less indicative. For this reason
-is well suited as test statistic in a pathological case such as the
-Landau distribution.
+is well suited as test statistic in a pathological case such as the Landau
+distribution.
-The median of a real probability distribution is defined as the value
-such that its cumulative probability is $1/2$. In other words the median
-partitions the probability in two (connected) halves.
-The median of a sample, once sorted, is given by its middle element if the
-sample size is odd, or the average of the two middle elements otherwise.
+The median of a probability distribution is defined as the value such that its
+cumulative probability is $1/2$. In other words, the median partitions the
+probability in two (connected) halves. The median of a sample, once sorted, is
+given by its middle element if the sample size is odd, or the average of the two
+middle elements otherwise.
 
 The expected median was derived from the quantile function (QDF) of the Landau
 distribution[^1].
-Once this is know, the median is simply given by $\text{QDF}(1/2)$.
-Since both the CDF and QDF have no known closed form they must
-be computed numerically. The cumulative probability has been computed by
-quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
-function in GSL). The function calculate an approximation of the integral
+Once this is know, the median is simply given by $\text{QDF}(1/2)$. Since both
+the CDF and QDF have no known closed form, they must be computed numerically.
+The cumulative probability was computed by quadrature-based numerical
+integration of the PDF (`gsl_integration_qagiu()` function in GSL). The function
+calculate an approximation of the integral:
 $$
 I(x) = \int_x^{+\infty} f(t)dt
 $$
@@ -229,58 +223,57 @@ $$
 because the quantile had been already implemented and was initially
 used for reverse sampling.
 
-The $CDF$ is then given by $p(x) = 1 - I(x)$. This was done to avoid the
+The CDF is then given by $p(x) = 1 - I(x)$. This was done to avoid the
 left tail of the distribution, where the integration can sometimes fail.
-The integral $I$ is actually mapped beforehand onto $(0, 1]$ by
-the change of variable $t = a + (1-u)/u$, because the integration
+The integral $I$ was actually mapped beforehand onto $(0, 1]$ by
+the change of variable $t = x + (1-u)/u$, because the integration
 routine works on definite integrals. The result should satisfy the following
 accuracy requirement:
 $$
 |\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel}I)
 $$
-The tolerances have been set to \SI{1e-10}{} and \SI{1e-6}{}, respectively.
+where the absolute and relative tolerances $\varepsilon_\text{abs}$ and
+$\varepsilon_\text{rel}$ were set to \SI{1e-10}{} and \SI{1e-6}{},
+respectively.
 
-As for the QDF, this was implemented by numerically inverting the CDF. This is
-done by solving the equation
+As for the QDF, this was implemented by numerically inverting the CDF. This was
+done by solving the equation;
 $$
 p(x) = p_0
 $$
-for x, given a probability value $p_0$, where $p(x)$ is again the CDF.
-The (unique) root of this equation is found by a root-finding routine
-(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method. This
-algorithm consists in a bisection search, similar to the one employed in the
-mode optimisation, but improved by interpolating the function with a parabola
-at each step. The following condition is checked for convergence:
+for x, given a probability value $p_0$, where $p(x)$ is the CDF. The (unique)
+root of this equation was found by a root-finding routine
+(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method it too.
+The following condition was checked for convergence:
 $$
 |a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
 $$
 where $a,b$ are the current interval bounds. The condition immediately gives an
 upper bound on the error of the root as $\varepsilon = |a-b|$. The tolerances
-here have been set to 0 and \SI{1e-3}{}.
+here were set to 0 and \SI{1e-3}{}.
 
 The result of the numerical computation is:
 $$
 \text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
 $$
-while the sample median, obtained again by bootstrapping, was found to be
+while the sample median, obtained again by bootstrapping, was found to be:
 $$
 \text{observed median: } m_e = \SI{1.3605 \pm 0.0062}{}
 $$
 
-Applying again the t-test from before to this statistic:
+As stated above, the median is less sensitive to extreme values with respect to
+the mode: this lead the result to be much more precise. Applying again the
+aforementioned t-test to this statistic:
 
 - $t=0.761$
 - $p=0.446$
 
-This result is much more precise than the mode and the two values show
-a good agreement.
+Hence, the two values show a good agreement.
 
-### FWHM
+#### FWHM
 
 For a unimodal distribution (having a single peak) this statistic is defined as
 the distance between the two points at which the PDF attains half the maximum
@@ -292,58 +285,55 @@ $$
 f(x_\pm) = \frac{f_\text{max}}{2}
 $$
 
-The function $f'(x)$ was minimized using the same minimization method
-used for finding $m_e$. Once $f_\text{max}$ is known, the equation
+The function derivative $f'(x)$ was minimized using the same minimization method
+used for finding $m_e$. Once $f_\text{max}$ was known, the equation:
 $$
 f'(x) = \frac{f_\text{max}}{2}
 $$
-is solved by performing the Brent-Dekker method (described before) in the
+was solved by performing the Brent-Dekker method (described before) in the
 ranges $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$ yielding the two
-solutions $x_\pm$. With a relative tolerance of \SI{1e-7}{} the following
+solutions $x_\pm$. With a relative tolerance of \SI{1e-7}{}, the following
 result was obtained:
 $$
 \text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
 $$
 
 \vspace{-1em}
 
-![Example of a Moyal distribution density obtained by
-the KDE method described above. The rug plot shows the original
-sample used in the reconstruction.](images/landau-kde.pdf)
+![Example of a Moyal distribution density obtained by the KDE method. The rug
+plot shows the original sample used in the reconstruction. The 0.6 factor
+compensate for the otherwise peak height reduction.](images/landau-kde.pdf)
 
 On the other hand, obtaining a good estimate of the FWHM from a sample is much
-more difficult. In principle it could be measured by binning the data and
+more difficult. In principle, it could be measured by binning the data and
 applying the definition to the discretised values, however this yields very
 poor results and depends on an completely arbitrary parameter: the bin width.
-A more refined method to construct an nonparametric empirical PDF function from
+A more refined method to construct a nonparametric empirical PDF function from
 the sample is a kernel density estimation (KDE). This method consist in
-convolving the (ordered) data with a smooth symmetrical kernel, in this cause a
-standard gaussian function. Given a sample $\{x_i\}_{i=1}^N$, the empirical PDF
-is thus constructed as
+convolving the (ordered) data with a smooth symmetrical kernel: in this case a
+standard Gaussian function. Given a sample of values $\{x_i\}_{i=1}^N$, the
+empirical PDF is defined as:
 $$
 f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
 \mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
 $$
-where $\varepsilon$ is called the *bandwidth* and is a parameter that controls
-the strength of the smoothing. This parameter can be determined in several
-ways: bootstrapping, cross-validation, etc. For simplicity it was chosen
-to use Silverman's rule of thumb, which gives
+where $\mathcal{N}$ is the kernel and the parameter $\varepsilon$, called
+*bandwidth*, controls the strength of the smoothing. This parameter can be
+determined in several ways: bootstrapping, cross-validation, etc. For
+simplicity, it was chosen to use Silverman's rule of thumb [@silverman86],
+which gives:
 $$
 \varepsilon = 0.63 S_N
 \left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
 $$
-where
-
-- $S_N$ is the sample standard deviation.
-- $d$ is ne number of dimensions, in this case $d=1$
-
-The $0.63$ factor was chosen to compensate for the distortion that
+where the $0.63$ factor was chosen to compensate for the distortion that
 systematically reduces the peaks height, which affects the estimation of the
-mode.
+mode, and:
+
+- $S_N$ is the sample standard deviation,
+- $d$ is the number of dimensions, in this case $d=1$.
 
 With the empirical density estimation at hand, the FWHM can be computed by the
 same numerical method described for the true PDF. Again this was bootstrapped