# Statistical analysis

## Description

This repository is structured as follows:

- `lectures`: notes and slides of the course lectures

- `notes`: an explanation of the solutions of the exercises

- `slides`: a slideshow about some further research

- `ex-n`: programs written for each exercise

## Building the documents

The two documents `exercises.pdf` and `lectures.pdf` are written in Pandoc
markdown. XeTeX (with some standard LaTeX packages), the
[pandoc-crossref](https://github.com/lierdakil/pandoc-crossref) filter and a
Make program are required to build them. Running `make` in the respective
directory builds the document, provided the above dependencies are met.

## Building the programs

The programs used to solve the exercises are written in standard C99 (with the
sole exception of the `#pragma once` directive) and require the following
libraries to build:

- [GMP] (≥ 6.2)
- [GSL] (≥ 2.6)
- [pkg-config] (≥ 0.29, build-time only)

To generate plots, Python (version 3) with

- [numpy] (≥ 1.18)
- [matplotlib] (≥ 2.2)
- [scipy] (≥ 1.4, optional)

is required.

[GMP]: https://gmplib.org/
[GSL]: https://www.gnu.org/software/gsl/
[pkg-config]: https://www.freedesktop.org/wiki/Software/pkg-config/
[numpy]: https://numpy.org/
[scipy]: https://www.scipy.org/scipylib/index.html
[matplotlib]: https://matplotlib.org/

For convenience, a `shell.nix` file is provided to set up the build environment.
See this [guide](https://nixos.org/nix/manual/#chap-quick-start) if you have
never used Nix before. Running `nix-shell` in the top-level directory will drop
you into the development shell.

Once ready, invoke `make` with the program you wish to build. For example:

    $ make ex-1/bin/main

or, to build every program of an exercise:

    $ make ex-1

To clean up the build results run:

    $ make clean

## Running the programs

Notes:

- Many programs generate random numbers using a PRNG that is seeded with a
  fixed value, for reproducibility. It's possible to test the program on
  different samples by changing the seed via the environment variable
  `GSL_RNG_SEED`.

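As a rough sketch of how the seed could be varied from a script, the snippet
below builds the environment of a child process with `GSL_RNG_SEED` set. The
helper names `seeded_env` and `run_with_seed` are hypothetical (they are not
part of this repository), and the usage shown in the comment assumes that
`ex-1/bin/main` has already been built.

```python
import os
import subprocess


def seeded_env(seed, base=None):
    """Return a copy of the environment with GSL_RNG_SEED set to `seed`."""
    env = dict(os.environ if base is None else base)
    env["GSL_RNG_SEED"] = str(seed)
    return env


def run_with_seed(program, seed, *args):
    """Run `program` with the given seed and return what it wrote to stdout."""
    result = subprocess.run(
        [program, *args],
        env=seeded_env(seed),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage, once the program has been built:
#   for seed in (1, 2, 3):
#       print(run_with_seed("ex-1/bin/main", seed))
```

From an interactive shell the same effect is obtained by prefixing the command
with the variable assignment, for example `GSL_RNG_SEED=2 ex-1/bin/main`.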
### Exercise 1

`ex-1/bin/main` generates random numbers following either the Landau or the
Moyal distribution (controlled by the argument `-m`) and runs a series of
statistical tests to check whether the points were sampled from a Landau
distribution.
The size of the sample can be controlled with the argument `-n N`.
The program outputs the results of a Kolmogorov-Smirnov test and of t-tests
comparing the sample mode, FWHM and median, in this order.

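To illustrate the idea behind the Kolmogorov-Smirnov test, here is a minimal
sketch in plain Python. It is not the code of `ex-1/bin/main`: for simplicity it
compares a Moyal sample with the Moyal CDF (which has a closed form) instead of
the Landau one, and all function names are made up for this example.

```python
import math
import random


def moyal_cdf(x):
    """CDF of the standard Moyal distribution."""
    return math.erfc(math.exp(-x / 2) / math.sqrt(2))


def moyal_sample(n, rng):
    """Draw n Moyal variates using X = -ln(Z^2), with Z standard normal."""
    sample = []
    while len(sample) < n:
        z = rng.gauss(0.0, 1.0)
        if z != 0.0:  # avoid log(0)
            sample.append(-math.log(z * z))
    return sample


def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        d = max(d, f - i / n, (i + 1) / n - f)
    return d


rng = random.Random(1)
d = ks_statistic(moyal_sample(2000, rng), moyal_cdf)
print(f"D = {d:.4f}")
```

A small value of D means the sample is compatible with the reference
distribution; testing the same sample against a different CDF would be expected
to give a larger D.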
`ex-1/bin/pdf` prints a list of x-y points of the Landau PDF to `stdout`.
The output can be piped to `ex-1/pdf-plot.py` to generate a plot.

(optional) `ex-1/plots/kde.py` makes the example plot (shown in exercises.pdf,
fig. 4) of the kernel density estimation used to compute a non-parametric FWHM
from a sample of random points. To run this program you must additionally
install [scipy].

(optional) `ex-1/plots/slides.py` makes two plots. The first (shown
in fig. 3, exercises.pdf) is an illustration of the Landau distribution
FWHM and the second (shown in slides.pdf) is a comparison of the Landau
and Moyal distributions.

### Exercise 2

Every program in `ex-2` computes the best available approximation (with a given
method) to the Euler-Mascheroni constant γ and prints[1]:

1. the leading decimal digits of the approximate value found;

2. the exact decimal digits of γ;

3. the absolute difference between 1. and 2.

[1]: Some programs may also print additional debugging information.

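The three printed quantities can be mimicked with a short double-precision
sketch. It is only an illustration and not one of the methods implemented in
`ex-2`: it uses the first terms of the asymptotic expansion of the harmonic
numbers, and the names `GAMMA` and `gamma_approx` are chosen for this example.

```python
import math

# Reference value of γ, correct to double precision.
GAMMA = 0.5772156649015329


def gamma_approx(n):
    """Approximate γ from the n-th harmonic number.

    Uses H_n - ln(n) - 1/(2n) + 1/(12n^2), the first terms of the
    asymptotic expansion of the harmonic numbers.
    """
    harmonic = math.fsum(1.0 / k for k in range(1, n + 1))
    return harmonic - math.log(n) - 1.0 / (2 * n) + 1.0 / (12 * n * n)


approx = gamma_approx(1000)
print(f"approx: {approx:.15f}")
print(f"exact:  {GAMMA:.15f}")
print(f"diff:   {abs(approx - GAMMA):.3e}")
```

The first neglected term of the expansion is of order 1/(120 n⁴), so the
accuracy here is limited by double precision rather than by the formula; going
further requires arbitrary-precision arithmetic.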
`ex-2/bin/fancy` and `ex-2/bin/fancier` can compute γ to a variable precision
and therefore take the required number of decimal places as their only
argument. The exact γ digits (used in the comparison) are limited to 50 and
500 places, respectively.

`ex-2/bin/fast` is a highly optimized version of `ex-2/bin/fancier`, meant to
compute a very large number of digits; it therefore doesn't come with a
verified, fixed approximation of γ.

`ex-2/digits` contains compressed text files of the first 1M digits
of γ, obtained from `ex-2/bin/fast` and from `sympy` (using `mpmath`).

### Exercise 3

`ex-3/bin/main` generates a sample of particle decay events and attempts to
recover the distribution parameters via both an MLE and a χ² method. In both
cases the best fit and the parameter covariance matrix are printed.
The program then performs a t-test to assess the compatibility of the data with
two hypotheses and prints the results in a table.

To plot a 2D histogram of the generated sample do:

    $ ex-3/bin/main -i | ex-3/plot.py

In addition, the program accepts a few more parameters to control the histogram
and the number of events; run it with `-h` to see their usage.

Note: the histogram parameters affect the computation of the χ² and the
corresponding parameter estimates.

### Exercise 4

`ex-4/bin/main` generates a sample of particles with randomly oriented momentum
and creates a histogram of the average vertical component, in modulus, versus
the horizontal component. It is possible to set the maximum momentum with the
option `-p`. A χ² fit and a t-test of compatibility are performed with respect
to the expected distribution and the results are printed.

To plot a histogram of the generated sample do:

    $ ex-4/bin/main -o | ex-4/plot.py

It is possible to set the number of particles and bins with the options `-n`
and `-b`.

### Exercise 5

`ex-5/bin/main` computes estimates of the integral of exp(x) between 0 and 1
using several methods: plain Monte Carlo and the MISER and VEGAS algorithms,
each with different numbers of samples. The program takes no arguments and
prints a table of the result and its error for each method.
To visualise the results, you can plot the table by doing:

    $ ex-5/bin/main | ex-5/plot.py

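For reference, the plain Monte Carlo method can be sketched in a few lines of
Python. This is an illustrative re-implementation that assumes uniform
sampling and the usual standard-error estimate; it is not the code run by
`ex-5/bin/main`, and `plain_mc` is a name made up for this example.

```python
import math
import random


def plain_mc(f, a, b, n, rng):
    """Plain Monte Carlo estimate of the integral of f over [a, b].

    Returns the estimate and its standard error.
    """
    values = [f(rng.uniform(a, b)) for _ in range(n)]
    mean = math.fsum(values) / n
    var = math.fsum((v - mean) ** 2 for v in values) / (n - 1)
    width = b - a
    return width * mean, width * math.sqrt(var / n)


rng = random.Random(1)
exact = math.e - 1
for n in (1_000, 10_000, 100_000):
    estimate, error = plain_mc(math.exp, 0.0, 1.0, n, rng)
    print(f"{n:>7d}  {estimate:.6f} ± {error:.6f}  (diff {abs(estimate - exact):.6f})")
```

The error is expected to shrink as 1/√n, which is why adaptive methods such as
MISER and VEGAS are worth comparing against.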
(optional) `ex-5/plots/fit.py` makes the plot (shown in exercises.pdf, fig. 13)
of the standard deviation vs function calls for the plain MC method. The
program takes the tabular results of `ex-5/bin/main` as input, so run it as:

    $ ex-5/bin/main | ex-5/plots/fit.py

### Exercise 6

`ex-6/bin/main` simulates a Fraunhofer diffraction experiment. The program
prints to `stdout` the bin counts of the intensity as a function of the
diffraction angle. To plot a histogram do:

    $ ex-6/bin/main | ex-6/plot.py

The program convolves the original signal with a Gaussian kernel (`-s` to
change the kernel σ), optionally adds Gaussian noise (`-n` to change the
noise σ) and performs either a naive deconvolution by FFT (`-m fft` mode)
or a deconvolution with the Richardson-Lucy algorithm (`-m rl` mode).

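The Richardson-Lucy iteration can be sketched as follows. This is a simplified
illustration in plain Python, not the implementation in `ex-6`: it assumes a
circular (wrap-around) convolution, a flat starting guess and noise-free data,
and every function name is invented for the example.

```python
import math


def gaussian_kernel(sigma, size):
    """Normalised Gaussian kernel with an odd number of bins."""
    half = size // 2
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve(signal, kernel):
    """Circular convolution of `signal` with a centred `kernel`."""
    n, half = len(signal), len(kernel) // 2
    return [
        sum(kernel[j] * signal[(i - (j - half)) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def richardson_lucy(data, kernel, rounds):
    """Iterate u <- u * ((data / (u * K)) * K_flipped), with * the convolution."""
    flipped = kernel[::-1]
    estimate = [sum(data) / len(data)] * len(data)  # flat starting guess
    for _ in range(rounds):
        blurred = convolve(estimate, kernel)
        ratio = [d / b if b > 0 else 0.0 for d, b in zip(data, blurred)]
        correction = convolve(ratio, flipped)
        estimate = [u * c for u, c in zip(estimate, correction)]
    return estimate


def distance(a, b):
    """Sum of the absolute bin-by-bin differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


signal = [0.0] * 40
signal[12], signal[25] = 100.0, 60.0  # two sharp peaks
kernel = gaussian_kernel(sigma=2.0, size=13)
smeared = convolve(signal, kernel)
restored = richardson_lucy(smeared, kernel, rounds=100)
print(distance(smeared, signal), distance(restored, signal))
```

With a normalised kernel each round preserves the total number of counts, and
the restored histogram should end up closer to the original than the smeared
one.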
The `-o`, `-c` and `-d` options control whether the original, convolved or
deconvolved histogram counts should be printed to `stdout`. For more options
run the program with `-h` to see the usage screen.

`ex-6/bin/test` simulates a customizable number of experiments and prints
to `stdout` the histograms of the distribution of the EMD from the original
signal to:

1. the result of the FFT deconvolution
2. the result of the Richardson-Lucy deconvolution
3. the convolved signal (with noise if `-n` has been given)

It also prints to `stderr` the average, standard deviation and skewness of
each distribution. To plot the histograms, do:

    $ ex-6/bin/test | ex-6/dist-plot.py

The program accepts some parameters to control the histogram and the number of
events; run it with `-h` to see their usage.

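Assuming that EMD stands for the earth mover's distance, the comparison of two
histograms with the same binning can be sketched as below. The normalisation
and the units (bin widths) are assumptions made for this example, so the
numbers need not match those produced by `ex-6/bin/test`.

```python
def emd(first, second):
    """Earth mover's distance between two histograms with equal binning.

    Both histograms are normalised to unit total, and the distance is
    measured in units of the bin width.
    """
    total_first, total_second = sum(first), sum(second)
    distance = 0.0
    carried = 0.0  # amount of "earth" carried over to the next bin
    for a, b in zip(first, second):
        carried += a / total_first - b / total_second
        distance += abs(carried)
    return distance


print(emd([1, 0, 0], [0, 0, 1]))  # 2.0: all the content moves by two bins
print(emd([2, 1, 1], [1, 1, 2]))  # 0.5: a quarter of the content moves by two bins
```

In one dimension the distance reduces to the sum of the absolute differences of
the cumulative histograms, which is what the running `carried` term computes.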
(optional) `ex-6/plots/emd.py` makes the plots of the EMD statistics of the RL
deconvolution (shown in exercises.pdf, section 6.6) as a function of the number
of rounds. The program sources its data from two files in the same directory,
which were obtained by running `ex-6/bin/test`. Do:

    $ ex-6/plots/emd.py noisy

for the plots of the experiment with Gaussian noise, and

    $ ex-6/plots/emd.py noiseless

for the experiment without noise.

### Exercise 7

`ex-7/bin/main` generates a sample with two classes of 2D points (signal,
noise) and trains either a Fisher linear discriminant or a single perceptron to
classify them (`-m` argument to change mode). Alternatively, the weights can be
set manually via the `-w` argument. In either case the program then prints the
classified data in this order: signal then noise.

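As an illustration of the Fisher linear discriminant, the sketch below computes
the projection direction for two small sets of 2D points. It is a toy example
with hand-picked points and hypothetical function names, not the training code
of `ex-7/bin/main`; the sign convention (pointing from signal to noise) is an
assumption, and the choice of the cut is not shown.

```python
def mean(points):
    """Mean of a list of 2D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def scatter(points):
    """Within-class scatter matrix entries (sxx, sxy, syy) of 2D points."""
    mx, my = mean(points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxx, sxy, syy


def fisher_direction(signal, noise):
    """Fisher projection direction w = S_w^-1 (mean_noise - mean_signal)."""
    a, b = scatter(signal), scatter(noise)
    sxx, sxy, syy = (p + q for p, q in zip(a, b))
    det = sxx * syy - sxy * sxy
    (sx, sy), (nx, ny) = mean(signal), mean(noise)
    dx, dy = nx - sx, ny - sy
    # Multiply the difference of the means by the inverse of the 2x2 matrix.
    return ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)


def project(points, w):
    """Project each point onto the direction w."""
    return [w[0] * x + w[1] * y for x, y in points]


signal = [(0, 0), (1, 0), (0, 1), (1, 1)]
noise = [(4, 4), (5, 4), (4, 5), (5, 5)]
w = fisher_direction(signal, noise)
print(w)
print(project(signal, w), project(noise, w))
```

Once the points are projected on `w`, the classification reduces to a
one-dimensional cut between the two groups of projected values.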
To plot the result of the linear classification, pipe the output to
`ex-7/plot.py`. The program generates two figures:

- a scatter plot showing the Fisher projection line and the cut line;
- two histograms of the projected data and the cut line.

`ex-7/bin/test` takes a model trained in `ex-7/bin/main` and tests it against
newly generated datasets (`-i` to set the number of test iterations). The
program prints the statistics of the number of false positives and false
negatives and finally the purity and efficiency of the classification.

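Purity and efficiency can be computed from the classification counts as in the
sketch below. It assumes the usual definitions (signal is the "positive" class)
and may not match the conventions of `ex-7/bin/test` exactly; the function name
is hypothetical.

```python
def purity_efficiency(true_pos, false_pos, false_neg):
    """Purity and efficiency of a signal selection.

    purity     = correctly selected signal / everything selected as signal
    efficiency = correctly selected signal / all true signal
    """
    purity = true_pos / (true_pos + false_pos)
    efficiency = true_pos / (true_pos + false_neg)
    return purity, efficiency


print(purity_efficiency(true_pos=90, false_pos=10, false_neg=30))
```

In this example 90 of the 100 selected points are true signal (purity 0.9) and
90 of the 120 true signal points are selected (efficiency 0.75).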
(optional) `ex-7/plots/fisher.py` makes the example plot (shown in
exercises.pdf, fig. 27) of the naïve projection vs Fisher projection.