2020-04-26 00:30:18 +02:00
|
|
|
|
# Statistical analysis
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Description
|
|
|
|
|
|
2020-04-28 00:23:13 +02:00
|
|
|
|
This repository is structured as follows:
|
2020-04-26 00:30:18 +02:00
|
|
|
|
|
2020-06-09 14:26:18 +02:00
|
|
|
|
- `lectures`: notes and slides of the course lectures
|
2020-04-26 00:30:18 +02:00
|
|
|
|
|
|
|
|
|
- `notes`: an explanation of the solutions of the exercises
|
|
|
|
|
|
2020-06-09 14:26:18 +02:00
|
|
|
|
- `slides`: a slideshow about some further researches
|
|
|
|
|
|
2020-04-28 00:23:13 +02:00
|
|
|
|
* `ex-n`: programs written for each exercise
|
2020-04-26 00:30:18 +02:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Building the documents
|
|
|
|
|
|
|
|
|
|
The two documents `excercise.pdf` and `lectures.pdf` are written in Pandoc
|
|
|
|
|
markdown. XeTeX (with some standard LaTeX packages), the
|
|
|
|
|
[pandoc-crossref](https://github.com/lierdakil/pandoc-crossref) filter and a
|
|
|
|
|
Make program are required to build. Simply typing `make` in the respective
|
|
|
|
|
directory will build the document, provided the above dependencies are met.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Building the programs
|
|
|
|
|
|
|
|
|
|
The programs used to solve the exercise are written in standard C99 (with the
|
|
|
|
|
only exception of the `#pragma once` clause) and require the following
|
|
|
|
|
libraries to build:
|
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
- [GMP] (≥ 6.2)
|
|
|
|
|
- [GSL] (≥ 2.6)
|
|
|
|
|
* [pkg-config] (≥ 0.29, build-time only)
|
2020-04-26 00:30:18 +02:00
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
To generate plots, Python (version 3) with
|
|
|
|
|
|
|
|
|
|
- [numpy] (≥ 1.18)
|
|
|
|
|
- [matplotlib] (≥ 2.2)
|
|
|
|
|
- [scipy] (≥ 1.4, optional)
|
|
|
|
|
|
|
|
|
|
is required to generate plots.
|
|
|
|
|
|
|
|
|
|
[GMP]: https://gmplib.org/
|
|
|
|
|
[GSL]: https://www.gnu.org/software/gsl/
|
|
|
|
|
[pkg-config]: https://www.freedesktop.org/wiki/Software/pkg-config/
|
|
|
|
|
[numpy]: https://numpy.org/
|
|
|
|
|
[scipy]: https://www.scipy.org/scipylib/index.html
|
|
|
|
|
[matplotlib]: https://matplotlib.org/
|
2020-04-26 00:30:18 +02:00
|
|
|
|
|
2020-04-30 22:28:27 +02:00
|
|
|
|
For convenience, a `shell.nix` file is provided to set up the build environment.
|
2020-04-26 00:30:18 +02:00
|
|
|
|
See this [guide](https://nixos.org/nix/manual/#chap-quick-start) if you have
|
|
|
|
|
never used Nix before. Running `nix-shell` in the top-level will drop you into
|
|
|
|
|
the development shell.
|
|
|
|
|
|
2020-05-19 16:01:52 +02:00
|
|
|
|
Once ready, invoke `make` with the program you wish to build. For example:
|
2020-04-26 00:30:18 +02:00
|
|
|
|
|
|
|
|
|
$ make ex-1/bin/main
|
|
|
|
|
|
2020-05-19 16:01:52 +02:00
|
|
|
|
or, to build every program of an exercise:
|
2020-04-26 00:30:18 +02:00
|
|
|
|
|
|
|
|
|
$ make ex-1
|
|
|
|
|
|
2020-05-19 16:01:52 +02:00
|
|
|
|
To clean up the build results run:
|
2020-04-26 00:30:18 +02:00
|
|
|
|
|
|
|
|
|
$ make clean
|
|
|
|
|
|
|
|
|
|
## Running the programs
|
|
|
|
|
|
|
|
|
|
Notes:
|
|
|
|
|
|
|
|
|
|
- Many programs generate random numbers using a PRNG that is seeded with a
|
|
|
|
|
fixed value, for reproducibility. It's possible to test the program on
|
|
|
|
|
different samples by changing the seed via the environment variable
|
|
|
|
|
`GSL_RNG_SEED`.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### Exercise 1
|
|
|
|
|
|
2020-06-09 14:26:18 +02:00
|
|
|
|
`ex-1/bin/main` generate random numbers following either the Landau or Moyal
|
|
|
|
|
distributions (controlled by the argument `-m`) and run a series of statistical
|
|
|
|
|
test to check if the points where samples from a Landau.
|
2020-04-27 23:51:34 +02:00
|
|
|
|
The size of the sample can be controlled with the argument `-n N`.
|
2020-04-26 00:30:18 +02:00
|
|
|
|
The program outputs the result of a Kolmogorov-Smirnov test and t-tests
|
|
|
|
|
comparing the sample mode, FWHM and median, in this order.
|
2020-04-27 23:51:34 +02:00
|
|
|
|
|
2020-06-03 14:27:26 +02:00
|
|
|
|
`ex-1/bin/pdf` prints a list of x-y points of the Landau PDF to the `stdout`.
|
2020-04-27 23:51:34 +02:00
|
|
|
|
The output can be redirected to `ex-1/pdf-plot.py` to generate a plot.
|
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
(optional) `ex-1/plots/kde.py` makes the example plot (shown in exercises.pdf,
|
|
|
|
|
fig. 4) of the kernel density estimation used to compute a non-parametric FWHM
|
|
|
|
|
from a sample of random points. To run this program you must additionally
|
|
|
|
|
install [scipy].
|
|
|
|
|
|
2020-06-09 14:26:18 +02:00
|
|
|
|
(optional) `ex-1/plots/slides.py` makes two plots. The first (shown
|
|
|
|
|
in fig. 3, exercises.pdf) is an illustration of the Landau distribution
|
|
|
|
|
FWHM and the second (shown in slides.pdf) is a comparison of the Landau
|
|
|
|
|
and Moyal distributions.
|
|
|
|
|
|
2020-04-27 23:51:34 +02:00
|
|
|
|
|
|
|
|
|
### Exercise 2
|
|
|
|
|
|
|
|
|
|
Every program in `ex-2` computes the best available approximation (with a given
|
|
|
|
|
method) to the Euler-Mascheroni γ constant and prints[1]:
|
|
|
|
|
|
2020-05-19 16:01:52 +02:00
|
|
|
|
1. the leading decimal digits of the approximate value found;
|
2020-04-27 23:51:34 +02:00
|
|
|
|
|
2020-05-19 16:01:52 +02:00
|
|
|
|
2. the exact decimal digits of γ;
|
2020-04-27 23:51:34 +02:00
|
|
|
|
|
|
|
|
|
3. the absolute difference between the 1. and 2.
|
|
|
|
|
|
|
|
|
|
[1]: Some program may also print additional debugging information.
|
|
|
|
|
|
|
|
|
|
`ex-2/bin/fancy`, `ex-2/bin/fancier` can compute γ to a variable precision and
|
|
|
|
|
take therefore the required number of decimal places as their only argument.
|
|
|
|
|
The exact γ digits (used in comparison) are limited to 50 and 500 places,
|
|
|
|
|
respectively.
|
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
`ex-2/bin/fast` is a highly optimized version of `ex-2/bin/fancier`, meant to
|
|
|
|
|
compute a very large number of digits and therefore doesn't come with a
|
|
|
|
|
verified, fixed, approximation of γ.
|
|
|
|
|
|
2020-06-09 14:32:36 +02:00
|
|
|
|
`ex-2/digits` containes compressed text files of the first 1M digits
|
|
|
|
|
of γ, obtained from `ex-2/bin/fast` and from `sympy` (using `mpmath`).
|
|
|
|
|
|
2020-04-27 23:51:34 +02:00
|
|
|
|
|
|
|
|
|
### Exercise 3
|
|
|
|
|
|
|
|
|
|
`ex-3/bin/main` generates a sample of particle decay events and attempts to
|
|
|
|
|
recover the distribution parameters via both a MLE and a χ² method. In both
|
|
|
|
|
cases the best fit and the parameter covariance matrix are printed.
|
|
|
|
|
The program then performs a t-test to assert the compatibility of the data with
|
|
|
|
|
two hypothesis and print the results in a table.
|
|
|
|
|
|
2020-05-19 16:01:52 +02:00
|
|
|
|
To plot a 2D histogram of the generated sample do:
|
2020-04-27 23:51:34 +02:00
|
|
|
|
|
|
|
|
|
$ ex-3/bin/main -i | ex-3/plot.py
|
|
|
|
|
|
|
|
|
|
In addition the program accepts a few more parameters to control the histogram
|
|
|
|
|
and number of events, run it with `-h` to see their usage.
|
|
|
|
|
|
|
|
|
|
Note: the histogram parameters affect the computation of the χ² and the
|
|
|
|
|
relative parameter estimation.
|
2020-04-28 00:23:13 +02:00
|
|
|
|
|
|
|
|
|
|
2020-04-30 22:28:27 +02:00
|
|
|
|
### Exercise 4
|
|
|
|
|
|
|
|
|
|
`ex-4/bin/main` generates a sample of particles with random oriented momentum
|
|
|
|
|
and creates an histogram with average vertical component, in modulus, versus
|
|
|
|
|
horizontal component. It is possible to set the maximum momentum with the
|
|
|
|
|
option `-p`. A χ² fit and a t-test compatibility are performed with respect
|
|
|
|
|
to the expected distribution and results are printed.
|
|
|
|
|
|
2020-05-19 16:01:52 +02:00
|
|
|
|
To plot a histogram of the generated sample do:
|
2020-04-30 22:28:27 +02:00
|
|
|
|
|
|
|
|
|
$ ex-4/bin/main -o | ex-4/plot.py
|
|
|
|
|
|
|
|
|
|
It is possible to set the number of particles and bins with the options `-n`
|
|
|
|
|
and `-b`.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### Exercise 5
|
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
`ex-5/main` compute estimations of the integral of exp(x) between 0 and 1
|
2020-05-24 22:04:31 +02:00
|
|
|
|
using several methods: a plain Monte Carlo, the MISER and VEGAS algorithms
|
|
|
|
|
with different number of samples. The program takes no arguments and prints
|
|
|
|
|
a table of the result and its error for each method.
|
|
|
|
|
To visualise the results, you can plot the table by doing:
|
2020-05-19 16:01:52 +02:00
|
|
|
|
|
2020-06-03 22:27:59 +02:00
|
|
|
|
$ ex-5/bin/main | ex-5/plot.py
|
|
|
|
|
|
|
|
|
|
(optional) `ex-6/plots/fit.py` makes the plot (shown in exercises.pdf, fig. 13)
|
|
|
|
|
of the standard deviation vs function calls for the plain MC method. The
|
|
|
|
|
program takes the tabular results of `ex-5/bin/main` as input, so run it as:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
$ ex-5/bin/main | ex-5/plot.py
|
2020-04-30 22:28:27 +02:00
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
|
2020-04-28 00:23:13 +02:00
|
|
|
|
### Exercise 6
|
|
|
|
|
|
|
|
|
|
`ex-6/bin/main` simulates a Fraunhöfer diffraction experiment. The program
|
|
|
|
|
prints to `stdout` the bin counts of the intensity as a function of the
|
2020-05-22 15:27:16 +02:00
|
|
|
|
diffraction angle. To plot a histogram do:
|
|
|
|
|
|
|
|
|
|
$ ex-6/bin/main | ex-6/plot.py
|
2020-04-28 00:23:13 +02:00
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
The program convolves the original signal with a gaussian kernel (`-s` to
|
|
|
|
|
change the kernel σ), optionally adds a gaussian noise (`-n` to change the
|
2020-05-22 15:27:16 +02:00
|
|
|
|
noise σ) and performs either a naive deconvolution by a FFT (`-m fft` mode)
|
|
|
|
|
or applying the Richardson-Lucy deconvolution algorithm (`-m rl` mode).
|
2020-04-28 00:23:13 +02:00
|
|
|
|
|
2020-05-22 15:27:16 +02:00
|
|
|
|
The `-o`, `-c` and `-d` options control whether the original, convolved or
|
|
|
|
|
deconvolved histogram counts should be printed to `stdout`. For more options
|
2020-04-28 00:23:13 +02:00
|
|
|
|
run the program with `-h` to see the usage screen.
|
2020-04-28 22:16:21 +02:00
|
|
|
|
|
2020-05-22 15:27:16 +02:00
|
|
|
|
`ex-6/bin/test` simulates a customizable number of experiments and prints
|
2020-05-24 18:18:58 +02:00
|
|
|
|
to `stdout` the histograms of the distribution of the EMD from the original
|
|
|
|
|
signal to:
|
|
|
|
|
|
|
|
|
|
1. the result of the FFT deconvolution
|
|
|
|
|
2. the result of the Richardson-Lucy deconvolution
|
|
|
|
|
3. the convolved signal (with noise if `-n` has been given)
|
|
|
|
|
|
|
|
|
|
It also prints to `stderr` the average, standard deviation and skewness of
|
|
|
|
|
each distribution. To plot the histograms, do:
|
2020-05-22 15:27:16 +02:00
|
|
|
|
|
|
|
|
|
$ ex-6/bin/test | ex-6/dist-plot.py
|
|
|
|
|
|
|
|
|
|
The program accepts some parameters to control the histogram and number of
|
|
|
|
|
events, run it with `-h` to see their usage.
|
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
(optional) `ex-6/plots/emd.py` makes the plots of the EMD statistics of the RL
|
2020-05-24 22:04:31 +02:00
|
|
|
|
deconvolution (shown in exercises.pdf, section 6.6) as a function of the number
|
2020-05-24 18:18:58 +02:00
|
|
|
|
of rounds. The programs sources its data from two files in the same directory,
|
|
|
|
|
these were obtained by running `ex-6/bin/test`. Do:
|
2020-05-22 15:27:16 +02:00
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
$ ex-6/plots/emd.py noisy
|
2020-05-22 15:27:16 +02:00
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
for the plots of the experiment with gaussian noise, and
|
2020-05-22 15:27:16 +02:00
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
$ ex-6/plots/emd.py noiseless
|
2020-05-22 15:27:16 +02:00
|
|
|
|
|
2020-05-24 18:18:58 +02:00
|
|
|
|
for the experiment without noise.
|
2020-05-22 15:27:16 +02:00
|
|
|
|
|
2020-04-28 22:16:21 +02:00
|
|
|
|
|
|
|
|
|
### Exercise 7
|
|
|
|
|
|
|
|
|
|
`ex-7/bin/main` generates a sample with two classes of 2D points (signal,
|
|
|
|
|
noise) and trains either a Fisher linear discriminant or a single perceptron to
|
|
|
|
|
classify them (`-m` argument to change mode). Alternatively the weights can be
|
|
|
|
|
set manually via the `-w` argument. In either case the program then prints the
|
|
|
|
|
classified data in this order: signal then noise.
|
|
|
|
|
|
|
|
|
|
To plot the result of the linear classification pipe the output to
|
|
|
|
|
`ex-7/plot.py`. The program generates two figures:
|
2020-05-19 16:01:52 +02:00
|
|
|
|
- a scatter plot showing the Fisher projection line and the cut line;
|
|
|
|
|
- two histograms of the projected data and the cut line.
|
2020-04-28 22:16:21 +02:00
|
|
|
|
|
|
|
|
|
`ex-7/bin/test` takes a model trained in `ex-7/bin/main` and test it against
|
|
|
|
|
newly generated datasets (`-i` to set the number of test iterations). The
|
|
|
|
|
program prints the statistics of the number of false positives, false
|
|
|
|
|
negatives and finally the purity and efficiency of the classification.
|
2020-06-03 22:27:59 +02:00
|
|
|
|
|
|
|
|
|
(optional) `ex-7/plots/fisher.py` makes the example plot (shown in
|
|
|
|
|
exercises.pdf, fig. 27) of the naïve projection vs Fisher projection.
|