ex-1: revised and typo-fixed

parent aa36fca43e
commit 1bd691a24b
@@ -94,6 +94,7 @@ int main(int argc, char** argv) {
 pdf.function = &landau_pdf;
 pdf.params = NULL;
 
+// number of bootstrap samples
 const size_t boots = 100;
 
 double mode_e = numeric_mode(min, max, &pdf, 1);
makefile
@@ -6,11 +6,13 @@ CCOMPILE = \
 	mkdir -p $(@D); \
 	$(CC) $(CFLAGS) $^ -o $@
 
-ex-1: ex-1/bin/main ex-1/bin/pdf
+ex-1: ex-1/bin/main ex-1/bin/pdf ex-1/bin/histo
 ex-1/bin/main: ex-1/main.c ex-1/landau.c ex-1/tests.c ex-1/bootstrap.c
 	$(CCOMPILE)
 ex-1/bin/pdf: ex-1/pdf.c
 	$(CCOMPILE)
+ex-1/bin/histo: ex-1/histo.c
+	$(CCOMPILE)
 
 ex-2: ex-2/bin/fancy ex-2/bin/fancier ex-2/bin/limit ex-2/bin/naive ex-2/bin/recip
 ex-2/bin/%: ex-2/%.c
BIN (binary image changed; before: 16 KiB; not shown)
BIN notes/images/landau-histo.pdf (new file; binary not shown)
@@ -4,7 +4,6 @@
 
 The Landau distribution is a probability density function which can be defined
 as follows:
-
 $$
 f(x) = \int \limits_{0}^{+ \infty} dt \, e^{-t \log(t) -xt} \sin (\pi t)
 $$
@@ -19,8 +18,8 @@ For the purpose of visualizing the resulting sample, the data was put into
 a histogram and plotted with matplotlib. The result is shown in @fig:landau.
 
 ![Example of N = 10'000 points generated with the `gsl_ran_landau()`
-function and plotted in a 100-bins histogram ranging from -10 to
-80.](images/landau-hist.png){#fig:landau}
+function and plotted in a 100-bin histogram ranging from -20 to
+80.](images/landau-histo.pdf){#fig:landau}
 
 
 ## Randomness testing of the generated sample
@@ -33,7 +32,6 @@ distance between the cumulative distribution function of the Landau distribution
 and that of the sample. The null hypothesis is that the sample was
 drawn from the reference distribution.
 The KS statistic for a given cumulative distribution function $F(x)$ is:
-
 $$
 D_N = \text{sup}_x |F_N(x) - F(x)|
 $$
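As a plain-C sketch (no GSL needed), the statistic above can be computed from a sorted sample and a reference CDF; `unif`, a uniform CDF on $[0, 1]$, is a hypothetical stand-in for the Landau CDF, which has no closed form:

```c
#include <math.h>
#include <stddef.h>

/* KS statistic D_N = sup_x |F_N(x) - F(x)| for a sorted sample and a
 * reference CDF F. The empirical CDF jumps from i/N to (i+1)/N at x[i],
 * so both one-sided gaps are checked at every point. */
double ks_statistic(const double *x, size_t n, double (*F)(double)) {
    double d = 0;
    for (size_t i = 0; i < n; i++) {
        double lo = fabs(F(x[i]) - (double)i / n);
        double hi = fabs((double)(i + 1) / n - F(x[i]));
        if (lo > d) d = lo;
        if (hi > d) d = hi;
    }
    return d;
}

/* demo reference: uniform CDF on [0, 1], a hypothetical stand-in for the
 * Landau CDF (which has no closed form) */
static double unif(double x) { return x; }
```

Checking both one-sided gaps matters because the supremum can be attained just before or just after a jump of the empirical CDF.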
@@ -53,14 +51,12 @@ found in GSL.
 
 Under the null hypothesis, the distribution of $D_N$ is expected to
 asymptotically approach a Kolmogorov distribution:
-
 $$
 \sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K
 $$
 
-where $K$ is the Kolmogorov variable, with cumulative
-distribution function given by:
-
+where $K$ is the Kolmogorov variable, with cumulative distribution function
+given by [@marsaglia03]:
 $$
 P(K \leqslant K_0) = 1 - p = \frac{\sqrt{2 \pi}}{K_0}
 \sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / 8 K_0^2}
@@ -75,8 +71,6 @@ $u$-transform with the `gsl_sum_levin_utrunc_accel()` function. The algorithm
 terminates when the difference between two successive extrapolations reaches a
 minimum.
 
-\clearpage
-
 For $N = 50000$, the following results were obtained:
 
 - $D = 0.004$
@@ -88,16 +82,16 @@ Hence, the data was reasonably sampled from a Landau distribution.
 Contrary to what one would expect, the $\chi^2$ test on a histogram is not very
 useful in this case. For the test to be significant, the data has to be binned
 such that at least several points fall in each bin. However, it can be seen
-(@fig:landau) that many bins are empty both in the right and left side of the
+in @fig:landau that many bins are empty on both the right and left sides of the
 distribution, so it would be necessary to fit only the region where the points
 cluster or use very large bins in the others, making the $\chi^2$ test
 impractical.
 
 
-## Parameters comparison
+### Parameters comparison
 
 When a sample of points is generated in a given range, different tests can be
-applied in order to check whether they follow an even distribution or not. The
+applied in order to check whether they follow a given distribution or not. The
 idea which lies beneath most of them is to measure how far the parameters of
 the distribution are from the ones measured in the sample.
 The same principle can be used to verify if the generated sample effectively
@@ -105,9 +99,7 @@ follows the Landau distribution. Since it turns out to be a very pathological
 PDF, very few parameters can be easily checked: mode, median and full width at
 half maximum (FWHM).
 
-### Mode
-
-![Landau distribution with emphatized mode and
+![Landau distribution with emphasized mode $m_e$ and
 FWHM = ($x_+ - x_-$).](images/landau.pdf)
 
 \begin{figure}
@@ -130,6 +122,10 @@ half maximum (FWHM).
 \end{tikzpicture}}
 \end{figure}
 
+\vspace{30pt}
+
+#### Mode
+
 The mode of a set of data values is defined as the value that appears most
 often, namely: it is the maximum of the PDF. Since there is no closed form for
 the mode of the Landau PDF, it was computed numerically by the *Brent*
@@ -145,7 +141,6 @@ an upper bound $x_\text{max}$, with an estimate of the location of the minimum
 $x_e$. The value of the function at $x_e$ must be less than the value of the
 function at the ends of the interval, in order to guarantee that a minimum is
 contained somewhere within the interval.
-
 $$
 f(x_\text{min}) > f(x_e) < f(x_\text{max})
 $$
@@ -165,26 +160,26 @@ f(b)$ between $f(x_\text{min})$, $f(x_\text{max})$ and $f(x_e)$. The interval is
 reduced until it encloses the true minimum to a desired tolerance.
 The error of the result is estimated by the length of the final interval.
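A minimal bracketing minimiser in the spirit of the search described above can be sketched with a golden-section step (the actual computation uses GSL's Brent minimiser, which speeds this up with parabolic interpolation); `parab` is a hypothetical demo function:

```c
#include <math.h>

/* Golden-section search: shrink a bracketing interval [a, b] around the
 * minimum until it is smaller than tol (a sketch; GSL's Brent minimiser
 * additionally interpolates the function with a parabola at each step). */
double minimize(double (*f)(double), double a, double b, double tol) {
    const double g = 0.618033988749895;  /* inverse golden ratio */
    while (b - a > tol) {
        double x1 = b - g * (b - a);
        double x2 = a + g * (b - a);
        if (f(x1) < f(x2)) b = x2;  /* minimum lies in [a, x2] */
        else               a = x1;  /* minimum lies in [x1, b] */
    }
    return (a + b) / 2;  /* error bounded by the final interval length */
}

/* hypothetical demo function with minimum at x = 2 */
static double parab(double x) { return (x - 2) * (x - 2); }
```

As in the notes, the length of the final interval directly bounds the error of the returned location.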
 
-On the other hand, to compute the mode of the sample the half-sample mode (HSM)
-or *Robertson-Cryer* estimator was used. This estimator was chosen because makes
-no assumptions on the underlying distribution and is not computationally expensive.
-The HSM is obtained by iteratively identifying the half modal interval, which
-is the smallest interval containing half of the observation. Once the sample is
-reduced to less that three points the mode is computed as the average. The
-special case $n=3$ is dealt with by averaging the two closer points.
+On the other hand, to compute the mode of the sample, the half-sample mode (HSM)
+or *Robertson-Cryer* estimator was used. This estimator was chosen because it
+makes no assumptions on the underlying distribution and is not computationally
+expensive. The HSM is obtained by iteratively identifying the half modal
+interval, which is the smallest interval containing half of the observations.
+Once the sample is reduced to less than three points, the mode is computed as
+the average. The special case $n=3$ is dealt with by averaging the two closest
+points [@robertson74].
 
 To obtain a better estimate of the mode and its error the above procedure was
-bootstrapped. The original sample is treated as a population and used to build
-other samples, of the same size, by *sampling with replacements*. For each one
-of the new samples the above statistic is computed. By simply taking the
-mean of these statistics the following estimate was obtained
+bootstrapped. The original sample was treated as a population and used to build
+100 other samples of the same size, by *sampling with replacement*. For each one
+of the new samples, the above statistic was computed. By simply taking the
+mean of these statistics the following estimate was obtained:
 $$
 \text{observed mode: } m_o = \SI{-0.29 \pm 0.19}{}
 $$
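The bootstrap loop itself is short; a sketch using the C library's `rand()` for brevity (the actual code presumably draws from a GSL generator, and `sample_mean` is just a demo statistic):

```c
#include <stdlib.h>

/* Bootstrap estimate of a statistic: resample the original data with
 * replacement `boots` times and average the statistic over the replicas
 * (a sketch; rand() has modulo bias, fine for illustration only). */
double bootstrap_mean(const double *x, size_t n, size_t boots,
                      double (*stat)(const double *, size_t)) {
    double *replica = malloc(n * sizeof *replica);
    double mean = 0;
    for (size_t b = 0; b < boots; b++) {
        for (size_t i = 0; i < n; i++)
            replica[i] = x[rand() % n];   /* sampling with replacement */
        mean += stat(replica, n);
    }
    free(replica);
    return mean / boots;
}

/* demo statistic: the sample mean */
static double sample_mean(const double *x, size_t n) {
    double s = 0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s / n;
}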
 
-In order to compare the values $m_e$ and $x_0$, the following compatibility
+In order to compare the values $m_e$ and $m_o$, the following compatibility
 t-test was applied:
-
 $$
 p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
 t = \frac{|m_e - m_o|}{\sqrt{\sigma_e^2 + \sigma_o^2}}
@@ -198,29 +193,28 @@ In this case:
 - p = 0.311
 
 Thus, the observed mode is compatible with the mode of the Landau distribution,
-however the result is quite imprecise.
+although the result is quite imprecise.
 
 
-### Median
+#### Median
 
-The median is a central tendency statistics that, unlike the mean, is not
-very sensitive to extreme values, albeit less indicative. For this reason
-is well suited as test statistic in a pathological case such as the
-Landau distribution.
-The median of a real probability distribution is defined as the value
-such that its cumulative probability is $1/2$. In other words the median
-partitions the probability in two (connected) halves.
-The median of a sample, once sorted, is given by its middle element if the
-sample size is odd, or the average of the two middle elements otherwise.
+The median is a central tendency statistic that, unlike the mean, is not
+very sensitive to extreme values, albeit less indicative. For this reason it
+is well suited as a test statistic in a pathological case such as the Landau
+distribution.
+The median of a probability distribution is defined as the value such that its
+cumulative probability is $1/2$. In other words, the median partitions the
+probability in two (connected) halves. The median of a sample, once sorted, is
+given by its middle element if the sample size is odd, or the average of the two
+middle elements otherwise.
 
 The expected median was derived from the quantile function (QDF) of the Landau
 distribution[^1].
-Once this is know, the median is simply given by $\text{QDF}(1/2)$.
-Since both the CDF and QDF have no known closed form they must
-be computed numerically. The cumulative probability has been computed by
-quadrature-based numerical integration of the PDF (`gsl_integration_qagiu()`
-function in GSL). The function calculate an approximation of the integral
-
+Once this is known, the median is simply given by $\text{QDF}(1/2)$. Since both
+the CDF and QDF have no known closed form, they must be computed numerically.
+The cumulative probability was computed by quadrature-based numerical
+integration of the PDF (`gsl_integration_qagiu()` function in GSL). The function
+calculates an approximation of the integral:
 $$
 I(x) = \int_x^{+\infty} f(t)dt
 $$
@@ -229,58 +223,57 @@
 because the quantile had been already implemented and was initially
 used for reverse sampling.
 
-The $CDF$ is then given by $p(x) = 1 - I(x)$. This was done to avoid the
+The CDF is then given by $p(x) = 1 - I(x)$. This was done to avoid the
 left tail of the distribution, where the integration can sometimes fail.
-The integral $I$ is actually mapped beforehand onto $(0, 1]$ by
-the change of variable $t = a + (1-u)/u$, because the integration
+The integral $I$ was actually mapped beforehand onto $(0, 1]$ by
+the change of variable $t = x + (1-u)/u$, because the integration
 routine works on definite integrals. The result should satisfy the following
 accuracy requirement:
 $$
 |\text{result} - I| \le \max(\varepsilon_\text{abs}, \varepsilon_\text{rel}I)
 $$
 
-The tolerances have been set to \SI{1e-10}{} and \SI{1e-6}{}, respectively.
-As for the QDF, this was implemented by numerically inverting the CDF. This is
-done by solving the equation
+where the absolute and relative tolerances $\varepsilon_\text{abs}$ and
+$\varepsilon_\text{rel}$ were set to \SI{1e-10}{} and \SI{1e-6}{},
+respectively.
+As for the QDF, this was implemented by numerically inverting the CDF. This was
+done by solving the equation:
 $$
 p(x) = p_0
 $$
 
-for x, given a probability value $p_0$, where $p(x)$ is again the CDF.
-The (unique) root of this equation is found by a root-finding routine
-(`gsl_root_fsolver_brent` in GSL) based on the Brent-Dekker method. This
-algorithm consists in a bisection search, similar to the one employed in the
-mode optimisation, but improved by interpolating the function with a parabola
-at each step. The following condition is checked for convergence:
+for $x$, given a probability value $p_0$, where $p(x)$ is the CDF. The (unique)
+root of this equation was found by a root-finding routine
+(`gsl_root_fsolver_brent` in GSL), it too based on the Brent-Dekker method.
+The following condition was checked for convergence:
 $$
 |a - b| < \varepsilon_\text{abs} + \varepsilon_\text{rel} \min(|a|, |b|)
 $$
 
 where $a,b$ are the current interval bounds. The condition immediately gives an
 upper bound on the error of the root as $\varepsilon = |a-b|$. The tolerances
-here have been set to 0 and \SI{1e-3}{}.
+here were set to 0 and \SI{1e-3}{}.
 
 The result of the numerical computation is:
 $$
 \text{expected median: } m_e = \SI{1.3557804 \pm 0.0000091}{}
 $$
 
-while the sample median, obtained again by bootstrapping, was found to be
+while the sample median, obtained again by bootstrapping, was found to be:
 $$
 \text{observed median: } m_o = \SI{1.3605 \pm 0.0062}{}
 $$
 
-Applying again the t-test from before to this statistic:
+As stated above, the median is less sensitive to extreme values with respect to
+the mode: this led the result to be much more precise. Applying again the
+aforementioned t-test to this statistic:
 
 - $t=0.761$
 - $p=0.446$
 
-This result is much more precise than the mode and the two values show
-a good agreement.
+Hence, the two values show a good agreement.
 
 
-### FWHM
+#### FWHM
 
 For a unimodal distribution (having a single peak) this statistic is defined as
 the distance between the two points at which the PDF attains half the maximum
@@ -292,58 +285,55 @@
 f(x_\pm) = \frac{f_\text{max}}{2}
 $$
 
-The function $f'(x)$ was minimized using the same minimization method
-used for finding $m_e$. Once $f_\text{max}$ is known, the equation
+The function $-f(x)$ was minimized using the same minimization method
+used for finding $m_e$. Once $f_\text{max}$ was known, the equation:
 $$
 f(x) = \frac{f_\text{max}}{2}
 $$
 
-is solved by performing the Brent-Dekker method (described before) in the
+was solved by performing the Brent-Dekker method (described before) in the
 ranges $[x_\text{min}, m_e]$ and $[m_e, x_\text{max}]$ yielding the two
-solutions $x_\pm$. With a relative tolerance of \SI{1e-7}{} the following
+solutions $x_\pm$. With a relative tolerance of \SI{1e-7}{}, the following
 result was obtained:
 
 $$
 \text{expected FWHM: } w_e = \SI{4.0186457 \pm 0.0000001}{}
 $$
 \vspace{-1em}
 
-![Example of a Moyal distribution density obtained by
-the KDE method described above. The rug plot shows the original
-sample used in the reconstruction.](images/landau-kde.pdf)
+![Example of a Moyal distribution density obtained by the KDE method. The rug
+plot shows the original sample used in the reconstruction. The 0.63 factor
+compensates for the otherwise reduced peak height.](images/landau-kde.pdf)
 
 On the other hand, obtaining a good estimate of the FWHM from a sample is much
-more difficult. In principle it could be measured by binning the data and
+more difficult. In principle, it could be measured by binning the data and
 applying the definition to the discretised values, however this yields very
-poor results and depends on an completely arbitrary parameter: the bin width.
-A more refined method to construct an nonparametric empirical PDF function from
-the sample is a kernel density estimation (KDE). This method consist in
-convolving the (ordered) data with a smooth symmetrical kernel, in this cause a
-standard gaussian function. Given a sample $\{x_i\}_{i=1}^N$, the empirical PDF
-is thus constructed as
+poor results and depends on a completely arbitrary parameter: the bin width.
+A more refined method to construct a nonparametric empirical PDF function from
+the sample is a kernel density estimation (KDE). This method consists in
+convolving the (ordered) data with a smooth symmetrical kernel: in this case a
+standard Gaussian function. Given a sample of values $\{x_i\}_{i=1}^N$, the
+empirical PDF is defined as:
 $$
 f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
 \mathcal{N}\left(\frac{x-x_i}{\varepsilon}\right)
 $$
 
-where $\varepsilon$ is called the *bandwidth* and is a parameter that controls
-the strength of the smoothing. This parameter can be determined in several
-ways: bootstrapping, cross-validation, etc. For simplicity it was chosen
-to use Silverman's rule of thumb, which gives
+where $\mathcal{N}$ is the kernel and the parameter $\varepsilon$, called
+*bandwidth*, controls the strength of the smoothing. This parameter can be
+determined in several ways: bootstrapping, cross-validation, etc. For
+simplicity, it was chosen to use Silverman's rule of thumb [@silverman86],
+which gives:
 $$
 \varepsilon = 0.63 S_N
 \left(\frac{d + 2}{4}N\right)^{-1/(d + 4)}
 $$
 
-where
-
-- $S_N$ is the sample standard deviation.
-
-- $d$ is ne number of dimensions, in this case $d=1$
-
-The $0.63$ factor was chosen to compensate for the distortion that
-systematically reduces the peaks height, which affects the estimation of the
-mode.
+where the $0.63$ factor was chosen to compensate for the distortion that
+systematically reduces the peak height, which affects the estimation of the
+mode, and:
+
+- $S_N$ is the sample standard deviation,
+- $d$ is the number of dimensions, in this case $d=1$.
 
 With the empirical density estimation at hand, the FWHM can be computed by the
 same numerical method described for the true PDF. Again this was bootstrapped