analistica/notes/sections/3.md

# Exercise 3 {#sec:3}

## Meson decay events generation

A number of $N = 50'000$ points on the unit sphere, each representing a
particle detection event, must be generated according to then angular
probability distribution function $F_0$:

\begin{align*}
  F_0 (\theta, \phi) = \frac{3}{4 \pi} \Bigg[
  \frac{1}{2} (1 - \alpha_0) + \frac{1}{2} (3\alpha_0 - 1) \cos^2(\theta)& \\
  - \beta_0 \sin^2(\theta) \cos(2 \phi)
  - \gamma_0 \sqrt{2} \sin(2 \theta) \cos(\phi)& \Bigg]
\end{align*}

where $\theta$ and $\phi$ are, respectively, the polar and azimuthal angles, and

$$
  \alpha_0 = 0.65 \et \beta_0 = 0.06 \et \gamma_0 = -0.18
$$

To generate the points a *try and catch* method was employed:

1. generate a point $(\theta , \phi)$ uniformly on a unit sphere
2. generate a third value $Y$ uniformly in $[0, 1]$
3. if @eq:requirement is satisfied save the point
4. repeat from 1.

$$
  Y < F_0(\theta, \phi)
$$ {#eq:requirement}

To increase the efficiency of the procedure, the $Y$ values were actually
generated in $[0, 0.2]$, since $\max F_0 = 0.179$ for the given parameters
(an other option would have been generating the numbers in the range $[0 ; 1]$,
but all the points with $Y_{\theta, \phi} > 0.179$ would have been rejected with
the same probability, namely 1).

While $\phi$ can simply be generated uniformly between 0 and $2 \pi$, for
$\theta$ one has to be more careful: by generating evenly distributed numbers
between 0 and $\pi$, the points turn out to cluster on the poles. The point
$(\theta,\phi)$ is *not* uniform on the sphere but is uniform on a cylindrical
projection of it. (The difference can be appreciated in @fig:compare).

The proper distribution of $\theta$ is obtained by applying a transform
to a variable $x$ with uniform distribution in $[0, 1)$,
easily generated by the GSL function `gsl_rng_uniform()`.

<div id="fig:compare">
  ![Uniformly distributed points with $\theta$ evenly distributed between
  0 and $\pi$.](images/histo-i-u.pdf){width=45%}
  ![Points uniformly distributed on a spherical
  surface.](images/histo-p-u.pdf){width=45%}

  ![Sample generated according to $F$ with $\theta$ evenly distributed between
  0 and $\pi$.](images/histo-i-F.pdf){width=45%}
  ![Sample generated according to $F$ with $\theta$ properly
  distributed.](images/histo-p-F.pdf){width=45%}

Examples of samples. On the left, points with $\theta$ evenly distributed
between 0 and $\pi$; on the right, points with $\theta$ properly distributed.
</div>

The transformation can be found by imposing the angular PDF to be a constant:

\begin{align*}
  \frac{d^2 P}{d \Omega^2} = \text{const} = \frac{1}{4 \pi}
  \hspace{20pt} &\Longrightarrow \hspace{20pt}
  d^2P = \frac{1}{4 \pi} \, d \Omega^2 = \frac{1}{4 \pi} \,
  d \phi \sin(\theta) d \theta
  \\
  &\Longrightarrow \hspace{20pt}
  \frac{dP}{d \theta} = \int_0^{2 \pi} d \phi \, \frac{1}{4 \pi} \sin(\theta)=
  \frac{1}{4 \pi} \sin(\theta) 2 \pi = \frac{1}{2} \sin(\theta)
\end{align*}
\begin{align*}
  \theta = \theta (x)
  \hspace{20pt} &\Longrightarrow \hspace{20pt}
  \frac{dP}{d \theta} = \frac{dP}{dx} \cdot \left| \frac{dx}{d \theta} \right|
  = \left. \frac{dP}{dx} \middle/ \, \left| \frac{d \theta}{dx} \right| \right.
  \\
  &\Longrightarrow \hspace{20pt}
  \frac{1}{2} \sin(\theta) = \left. 1 \middle/ \, \left|
  \frac{d \theta}{dx} \right| \right.
\end{align*}

If $\theta$ is chosen to grew together with $x$, then the absolute value can be
omitted:

\begin{align*}
  \frac{d \theta}{dx} = \frac{2}{\sin(\theta)}
  \hspace{20pt} &\Longrightarrow \hspace{20pt}
  d \theta \sin(\theta) = 2 dx
  \\
  &\Longrightarrow \hspace{20pt}
  - \cos (\theta') |_{0}^{\theta} = 2[x(\theta) - x(0)] = 2(x - 0) = 2x
  \\
  &\Longrightarrow \hspace{20pt}
  - \cos(\theta) + 1 = 2x
  \\
  &\Longrightarrow \hspace{20pt}
  \theta = \text{acos} (1 -2x)
\end{align*}

Since 1 is not in $[0 , 1)$, $\theta = \pi$ can't be generated. Being it just
a single point, the effect of this omission is negligible.

\vspace{30pt}


## Parameter estimation

The sample must now be used to estimate the parameters $\alpha_0$,
$\beta_0$ and $\gamma_0$ of the angular distribution $F_0$.


### Maximum Likelihood method

The Maximum Likelihood method (MLM) is based on the observation that the best
estimate {$\bar{\alpha}$, $\bar{\beta}$, $\bar{\gamma}$} of the parameters
{$\alpha$, $\beta$, $\gamma$} should maximize the Likelihood function $L$
defined in @eq:Like0, where the index $i$ runs over the sample and $F$ is the
function $F_0$ with free parameters $\alpha$, $\beta$ and $\gamma$:

$$
  L = \prod_i F(\theta_i, \phi_i) = \prod_i F_i
$$ {#eq:Like0}

The reason is that the points were generated according to the probability
distribution function $F$ with parameters $\alpha_0$, $\beta_0$ and $\gamma_0$,
thus the probability $P$ of sampling this special sample is, by definition, the
product of their probabilities $F_i$ and it must be maximum for {$\alpha$,
$\beta$, $\gamma$} = {$\alpha_0$, $\beta_0$, $\gamma_0$}. Thus, by viewing $F_i$
as a function of the parameters, by maximizing $P = L$ with respect
to $\alpha$, $\beta$ and $\gamma$, a good estimate should be found.
Instead of actually maximising $L$, the function $-\ln(L)$
was minimised, as minimisation methods as usually described in literature
and $\ln$ since it simplifies the math:

$$
  L = \prod_i F_i \thus - \ln(L) = - \sum_{i} \ln(F_i)
$$ {#eq:Like}


The problem of multidimensional minimisation was tackled using the GSL library
`gsl_multimin.h`, which implements a variety of methods, including the
conjugate gradient Fletcher-Reeves algorithm, used in the solution, which requires
the implementation of $-\ln(L)$ and its gradient.

To minimise a function $f$, given an initial guess point $x_0$, the gradient
$\nabla f$ is used to find the initial direction $v_0$ (the steepest descent) along
which a line minimisation is performed: the minimum occurs where the
directional derivative along $v_0$ is zero:
$$
  \frac{\partial f}{\partial v_0} = \nabla f \cdot v_0 = 0
$$
or, equivalently, at the point where the gradient is orthogonal to $v_0$.
After this first step the following procedure is iterated:

  1. Find the steepest descent $v_n$.
  2. Compute the *conjugate* direction: $w_n = v_n + \beta_n w_{n-1}$
  3. Perform a line minimisation along $w_n$
  4. Update the position $x_{n+1} = x_n + \alpha_n w_n$

where $\alpha_n$ is line minimum and $\beta_n$ is given by the Fletcher-Reeves
formula:
$$
  \beta = \frac{\|\nabla f(x_n)\|^2}{\|\nabla f(x_{n-1})\|^2}
$$

The accuracy of each line minimisation is controlled the parameter `tol`,
meaning:

$$
  w \cdot \nabla f < \text{tol} \, \|w\| \, \|\nabla f\|
$$

The minimisation is quite unstable and this forced the starting point to be
taken very close to the solution, namely:

$$
  \alpha_{sp} = 0.79 \et
  \beta_{sp} = 0.02 \et
  \gamma_{sp} = -0.17
$$

The overall minimisation ends when the gradient module is smaller than $10^{-4}$.
The program took 25 of the above iterations to reach the result shown in @eq:Like_res.

The Cramér-Rao bound states that the covariance matrix of parameters estimated
by MLM is greater than the of the inverse of the Hessian matrix of $-\log(L)$
at the minimum. Thus, the Hessian matrix was computed
analytically and inverted by a Cholesky decomposition, which states that
every positive definite symmetric square matrix $H$ can be written as the
product of a lower triangular matrix $L$ and its transpose $L^T$:

$$
  H = L \cdot L^T
$$

The Hessian matrix is, indeed, both symmetric and positive definite, when
computed in a minimum. Once decomposed the inverse is given by

$$
  H^{-1} = (L \cdot L^T)^{-1} = (L^{-1})^T \cdot L^{-1}
$$
The inverse of $L$ is easily computed since it's a triangular matrix.

The GSL library provides two functions `gsl_linalg_cholesky_decomp()` and
`gsl_linalg_cholesky_invert()` to decompose and invert in-place a matrix.
The result is shown below:

$$
  \bar{\alpha_L} = 0.6541 \et
  \bar{\beta_L}  = 0.06125 \et
  \bar{\gamma_L} = -0.1763
$$ {#eq:Like_res}

$$
\text{cov}_L = \begin{pmatrix}
 9.225 \cdot 10^{-6} & -1.785 \cdot 10^{-6} & 2.41  \cdot 10^{-7} \\
-1.785 \cdot 10^{-6} &  4.413 \cdot 10^{-6} & 1.58  \cdot 10^{-6} \\
  2.41 \cdot 10^{-7} &  1.58  \cdot 10^{-6} & 2.602 \cdot 10^{-6}
\end{pmatrix}
$$ {#eq:Like_cov}

Hence:
\begin{align*}
  \alpha_L &=  0.654  \pm 0.003  \\
  \beta_L  &=  0.0613 \pm 0.0021 \\
  \gamma_L &= -0.1763 \pm 0.0016
\end{align*}


### Least squares estimation

Another method that can be used to estimate {$\alpha$, $\beta$, $\gamma$} are
the least squares (LSQ), which is a multidimensional minimisation
specialised to the case of sum of squared functions called residuals.
The residuals can be anything in general but are usually interpreted as the
difference between an expected (fit) and an observed value. The least squares
then correspond to the expected values that best fit the observations.

To apply the LSQ method, the data must be properly binned, meaning that each
bin should contain a significant number of events (say greater than four or
five): in this case, 30 bins for $\theta$ and 60 for $\phi$ turned out
to be satisfactory. The expected values were given as the product of $N$, $F$
computed in the geometric center of the bin and the solid angle enclosed by the
bin itself, namely:

$$
  E_{\theta, \phi} = N F(\theta, \phi) \, \Delta \theta \, \Delta \phi \,
  \sin{\theta}
$$

Once the data are binned, the number of events in each bin plays the role of the
observed value, while the expected one is given by the probability of finding
an event in that bin multiplied by the total number of points, $N$. Then, the
$\chi^2$, defined as follow, must be minimized with respect to the three
parameters to be estimated:

$$
  \chi^2 = \sum_i f_i^2 = \|\vec f\|^2 \with f_i = \frac{O_i - E_i}{\sqrt{E_i}}
$$

where the sum runs over the bins, $O_i$ are the observed values and $E_i$ the
expected ones. Besides the more generic `gsl_multimin.h` library, GSL provides a
special one for least square minimisation, called `gsl_multifit_nlinear.h`. A
trust region method was used to minimize the $\chi^2$.
In such methods, the $\chi^2$, its gradient and its Hessian matrix are computed
in a series of points in the parameters space till the gradient and the matrix
reach given proper values, meaning that the minimum has been well approached,
where "well" is related to that values.
If {$x_1 \dots x_p$} are the parameters which must be estimated, first $\chi^2$
is computed in the starting point $\vec{x_k}$, then its value in a point
$\vec{x}_k + \vec{\delta}$ is modelled by a function $m_k (\vec{\delta})$ which
is the second order Taylor expansion around the point $\vec{x}_k$, namely:

$$
  \chi^2 (\vec{x}_k + \vec{\delta}) \sim m_k (\vec{\delta}) =
  \chi^2 (\vec{x}_k) + \nabla_k^T  \vec{\delta} + \frac{1}{2}
  \vec{\delta}^T H_k \vec{\delta}
$$

where $\nabla_k$ and $H_k$ are the gradient and the Hessian matrix of $\chi^2$
in the point $\vec{x}_k$ and the superscript $T$ stands for the transpose. The
initial problem is reduced the so called trust-region subproblem (TRS), which
is the minimisation of $m_k(\vec{\delta})$ in a region where
$m_k(\vec{\delta})$ should be a good approximation of $\chi^2 (\vec{x}_k
+ \vec{\delta})$, given by the condition:
$$
  \|D_k \vec\delta\| < \Delta_k
$$
If $D_k = I$ the region is a hypersphere of radius $\Delta_k$
centered at $\vec{x}_k$ but can be deformed according to $D_k$. This is
necessary in the case one or more parameters have scale very different scales.

Given an initial point $x_0$, radius $\Delta_0$, scale matrix $D_0$
and step $\vec\delta_0$ the LSQ algorithm consist in the following iteration:

  1. Construct the function $m_k(\vec\delta)$.
  2. Solve the TRS for step $\vec\delta_k$.
  3. Check whether the solution actually decreases $\chi^2$
      1. If positive increase the trust region radius $\Delta_{k+1} =
         \alpha\Delta_k$ and shift the position $\vec x_{k+1} = \vec x_k +
         \vec \delta_k$.
      2. If negative decrease the radius $\Delta_{k+1} = \Delta_k/\beta$.
  4. Repeat

This method is advantageous compared to a general minimisation because
the TRS usually amounts to solving a linear system. There are many algorithms
to solve the problem, in this case the Levenberg-Marquardt was used. It is based
on a theorem that proves the existence of a number $\mu_k$ such that
\begin{align*}
   \Delta_k \mu_k = \|D_k \vec\delta_k\| &&
    (H_k + \mu_k D_k^TD_k) \vec\delta_k = -\nabla_k
\end{align*}
Using the approximation[^1] $H\approx J^TJ$, obtained by computing the Hessian
of the first-order Taylor expansion of $\chi^2$, $\vec\delta_k$ can
be found by solving the system

$$
\begin{cases}
  J_k \vec{\delta_k} + \vec{f_k} = 0 \\
  \sqrt{\mu_k} D_k \vec{\delta_k} = 0
\end{cases}
$$

[^1]: Here $J_{ij} = \partial f_i/\partial x_j$ is the Jacobian matrix of the
vector-valued function $\vec f(\vec x)$.

The algorithm terminates if on of the following condition are satisfied:

  1. $|\delta_i| \leq \text{xtol} (|x_i| + \text{xtol})$ for every component
     of $\vec \delta$.
  2. $\max_i |\nabla_i \cdot \max(x_i, 1)| \leq \text{gtol}\cdot \max(\chi^2, 1)$
  3. $\|\vec f(\vec x+ \vec\delta) - \vec f(\vec x)|| \leq \text{ftol}
     \cdot \max(\|\vec f(\vec x)\|, 1)$

These tolerance have all been set to \SI{1e-8}{}. The program converged in 7
iterations giving the results below. The covariance of the parameters can again
been estimated through the Hessian matrix at the minimum. The following results
were obtained:

$$
  \bar{\alpha_{\chi}} = 0.6537 \et
  \bar{\beta_{\chi}} = 0.05972 \et
  \bar{\gamma_{\chi}} = -0.1736
$$ {#eq:lsq-res}

$$
\text{cov}_{\chi} = \begin{pmatrix}
 9.244 \cdot 10^{-6} & -1.66  \cdot 10^{-6} & 2.383 \cdot 10^{-7} \\
-1.66  \cdot 10^{-6} &  4.495 \cdot 10^{-6} & 1.505 \cdot 10^{-6} \\
 2.383 \cdot 10^{-7} &  1.505 \cdot 10^{-6} & 2.681 \cdot 10^{-6}
\end{pmatrix}
$$

Hence:
\begin{align*}
  &\alpha_{\chi} = 0.654   \pm 0.003   \\
  &\beta_{\chi}  = 0.0597  \pm 0.0021  \\
  &\gamma_{\chi} = -0.1736 \pm 0.0016
\end{align*}


### Results compatibility

In order to compare the values $x_L$ and $x_{\chi}$ obtained from both methods
with the correct ones (namely {$\alpha_0$, $\beta_0$, $\gamma_0$}), the
following compatibility t-test was applied:

$$
  p = 1 - \text{erf}\left(\frac{t}{\sqrt{2}}\right)\ \with
  t = \frac{x_i - x_0}{\sqrt{\sigma_{x_i}^2 + \sigma_{x_0}^2}}
  = \frac{x_i - x_0}{\sigma_{x_i}}
$$

where $i$ stands either for the MLM or LSQ parameters and $\sigma_{x_i}$ is the
uncertainty of the value $x_i$. At 95% confidence level, the values are
compatible if $p > 0.05$.

Likelihood results:

----------------------------
 par         $p$-value
------------ ---------------
 $\alpha_L$  0.18

 $\beta_L$   0.55

 $\gamma_L$  0.023
----------------------------

Table: Likelihood results compatibility.

$\chi^2$ results:

---------------------------------
 par              $p$-value
----------------- ---------------
 $\alpha_{\chi}$  0.22

 $\beta_{\chi}$   0.89

 $\gamma_{\chi}$  0.0001
---------------------------------

Table: $\chi^2$ results compatibility.

It can be concluded that only the third parameter, $\gamma$ is not compatible
with the expected one in both cases. An in-depth analysis of the algebraic
arrangement of $F$ would be required in order to justify this outcome.

\vspace{30pt}

## Isotropic hypothesis testing

What if the probability distribution function was isotropic? Could it be
compatible with the found results?  If $F$ was isotropic, then $\alpha_I$,
$\beta_I$ and $\gamma_I$ would be $1/3$, 0, and 0 respectively, since this
gives $F_I = 1/{4 \pi}$. The t-test gives a $p$-value approximately zero for all
the three parameters, meaning that there is no compatibility at all with this
hypothesis.