ex-6: written EMD description and implementation

Giù Marcer 2020-05-03 00:06:02 +02:00 committed by rnhmjoj
parent df5f5f9ac6
commit aa36fca43e


@ -404,16 +404,9 @@ deconvolved signal is always the same 'distance' far form the convolved one:
if it very smooth, the deconvolved signal is very smooth too and if the
convolved is less smooth, it is less smooth too.
The possibility of adding Poisson noise to the convolved histogram was also
implemented, to check whether the deconvolution is affected by this kind of
interference. The case with $\sigma = \Delta \theta$ was taken as an example.
In @fig:poisson the results are shown for both methods when
Poisson noise with mean $\mu = 50$ is employed.
In both cases, the addition of the noise seems to partially affect the
deconvolution. When the FFT method is applied, the noise adds small spikes nearly
everywhere on the curve, most evidently at the edges, where the
expected values are very small. On the other hand, the Richardson-Lucy routine is
less affected by this further complication.
The original signal is shown below for convenience.
![Example of an intensity histogram.](images/fraun-original.pdf){#fig:original}
<div id="fig:results1">
![Convolved signal.](images/fraun-conv-0.05.pdf){width=12cm}
@ -447,6 +440,17 @@ width.
Results for $\sigma = \Delta \theta$, where $\Delta \theta$ is the bin width.
</div>
The possibility of adding Poisson noise to the convolved histogram was also
implemented, to check whether the deconvolution is affected by this kind of
interference. The case with $\sigma = \Delta \theta$ was taken as an example.
In @fig:poisson the results are shown for both methods when
Poisson noise with mean $\mu = 50$ is employed.
In both cases, the addition of the noise seems to partially affect the
deconvolution. When the FFT method is applied, the noise adds small spikes nearly
everywhere on the curve, most evidently at the edges, where the
expected values are very small. On the other hand, the Richardson-Lucy routine is
less affected by this further complication.
<div id="fig:poisson">
![Deconvolved signal with FFT.](images/fraun-noise-fft.pdf){width=12cm}
@ -455,8 +459,8 @@ Results for $\sigma = \Delta \theta$, where $\Delta \theta$ is the bin width.
Results for $\sigma = \Delta \theta$, with Poisson noise.
</div>
In order to quantify the similarity of the deconvolution outcome with the
original signal, a null hypothesis test was devised.
In order to quantify the similarity of a deconvolution outcome with the original
signal, a null hypothesis test was devised.
As in @sec:Landau, the original sample was treated as a population from
which other samples of the same size were drawn with replacement. For each
new sample, the earth mover's distance with respect to the original signal was
@ -469,13 +473,91 @@ a region and the EMD is the minimum cost of turning one pile into the other,
where the cost is the amount of dirt moved times the distance by which it is
moved. It is valid only if the two distributions have the same integral, that
is if the two piles have the same amount of dirt.
Computing the EMD is based on a solution of the transportation problem.
Computing the EMD is based on a solution to the well-known transportation
problem, which can be formalized as follows.
\textcolor{red}{earth mover's distance}
Consider two vectors:
In this case, where the EMD must be applied to two histograms, the procedure
simplifies considerably, boiling down to the difference of the cumulative
functions of the two histograms.
$$
P = \{ (p_1, w_{p1}) \dots (p_m, w_{pm}) \} \et
Q = \{ (q_1, w_{q1}) \dots (q_n, w_{qn}) \}
$$
where $p_i$ and $q_i$ are the 'values' and $w_{pi}$ and $w_{qi}$ are their
weights. The entries $d_{ij}$ of the ground distance matrix $D_{ij}$ are
defined as the distances between $p_i$ and $q_j$.
The aim is to find the flow $F = \{f_{ij}\}$, where $f_{ij}$ is the flow
between $p_i$ and $q_j$ (which would be the quantity of moved dirt), which
minimizes the cost $W$:
$$
W (P, Q, F) = \sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}
$$
with the constraints:
\begin{align*}
&f_{ij} \ge 0 \hspace{15pt} &1 \le i \le m \wedge 1 \le j \le n \\
&\sum_{j = 1}^n f_{ij} \le w_{pi} &1 \le i \le m \\
&\sum_{i = 1}^m f_{ij} \le w_{qj} &1 \le j \le n
\end{align*}
$$
\sum_{i = 1}^m \sum_{j = 1}^n f_{ij}
= \text{min} \left( \sum_{i = 1}^m w_{pi}, \sum_{j = 1}^n w_{qj} \right)
$$
The first constraint allows moving 'dirt' from $P$ to $Q$ and not vice versa.
The next two constraints limit the amount of supplies that can be sent by the
values in $P$ to their weights, and the amount received by the values in $Q$ to
no more than their weights; the last constraint forces the maximum possible
amount of supplies to be moved. The total moved amount is the total flow. Once
the transportation problem is solved, and the optimal flow is found, the earth
mover's distance $D$ is defined as the work normalized by the total flow:
$$
D (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}}
{\sum_{i = 1}^m \sum_{j=1}^n f_{ij}}
$$
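
As a concrete check of the definitions above, the following minimal Python
sketch evaluates the cost $W$ and the normalized distance $D$ for a given flow,
after verifying the four constraints. The function name `emd_from_flow`, the
tolerance `tol` and the toy data are illustrative assumptions and are not taken
from the original code; the flow is supplied by hand instead of being obtained
from a transportation solver, and the ground distance is assumed to be
$d_{ij} = |p_i - q_j|$.

```python
def emd_from_flow(p, wp, q, wq, flow, tol=1e-12):
    """Check the transportation constraints on `flow` and return (W, D).

    Hypothetical helper: it does not search for the optimal flow, it only
    evaluates the one it is given.
    """
    m, n = len(p), len(q)
    # Ground distances d_ij = |p_i - q_j| (assumption: one-dimensional values).
    d = [[abs(p[i] - q[j]) for j in range(n)] for i in range(m)]

    # Constraint 1: f_ij >= 0
    assert all(flow[i][j] >= -tol for i in range(m) for j in range(n))
    # Constraint 2: sum_j f_ij <= w_pi
    assert all(sum(flow[i]) <= wp[i] + tol for i in range(m))
    # Constraint 3: sum_i f_ij <= w_qj
    assert all(sum(flow[i][j] for i in range(m)) <= wq[j] + tol
               for j in range(n))
    # Constraint 4: total flow equals min(total weight of P, total weight of Q)
    total = sum(sum(row) for row in flow)
    assert abs(total - min(sum(wp), sum(wq))) <= tol

    work = sum(flow[i][j] * d[i][j] for i in range(m) for j in range(n))
    return work, work / total


# Toy data: two piles at positions 0 and 1, two holes at positions 1 and 2.
p, wp = [0.0, 1.0], [0.5, 0.5]
q, wq = [1.0, 2.0], [0.5, 0.5]
flow = [[0.5, 0.0],   # all of p_1 goes to q_1
        [0.0, 0.5]]   # all of p_2 goes to q_2
work, dist = emd_from_flow(p, wp, q, wq, flow)
```

Here every unit of dirt travels a distance of 1, so both the work and the
normalized distance evaluate to 1.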
In this case, where the EMD must be applied to two same-length histograms, the
procedure simplifies considerably. By representing both histograms with two
vectors $u$ and $v$, the equation above boils down to [@ramdas17]:
$$
D (u, v) = \sum_i |U_i - V_i|
$$
where the sum runs over the entries of the vectors $U$ and $V$, which are the
cumulative vectors of the histograms.
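
The cumulative formula can be sketched in a few lines of Python. The function
name `emd_cumulative` and the sample histograms are assumptions made for
illustration only; the sketch also assumes that the two histograms have the
same total content and that distances are measured in units of the bin width.

```python
from itertools import accumulate


def emd_cumulative(u, v):
    """EMD between two same-length histograms as sum_i |U_i - V_i|."""
    if len(u) != len(v):
        raise ValueError("histograms must have the same number of bins")
    U = list(accumulate(u))  # cumulative vector of u
    V = list(accumulate(v))  # cumulative vector of v
    return sum(abs(a - b) for a, b in zip(U, V))


# Hypothetical histograms: the same shape shifted by one bin.
u = [0.5, 0.5, 0.0]
v = [0.0, 0.5, 0.5]
d = emd_cumulative(u, v)
```

For these two histograms the cumulative vectors are $U = (0.5, 1, 1)$ and
$V = (0, 0.5, 1)$, so the distance is $0.5 + 0.5 + 0 = 1$.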
In the code, the following equivalent recursive routine was implemented.
$$
D (u, v) = \sum_i |D_i| \with
\begin{cases}
D_i = v_i - u_i + D_{i-1} \\
D_0 = 0
\end{cases}
$$
In fact:
\begin{align*}
D (u, v) &= \sum_i |D_i| = |D_0| + |D_1| + |D_2| + |D_3| + \dots \\
&= 0 + |v_1 - u_1 + D_0| +
|v_2 - u_2 + D_1| +
|v_3 - u_3 + D_2| + \dots \\
&= |v_1 - u_1| +
|v_1 - u_1 + v_2 - u_2| +
|v_1 - u_1 + v_2 - u_2 + v_3 - u_3| + \dots \\
&= |v_1 - u_1| +
|v_1 + v_2 - (u_1 + u_2)| +
|v_1 + v_2 + v_3 - (u_1 + u_2 + u_3)| + \dots \\
&= |V_1 - U_1| + |V_2 - U_2| + |V_3 - U_3| + \dots \\
&= \sum_i |U_i - V_i|
\end{align*}
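
The recursion can be sketched as a single loop, shown here in Python next to
the cumulative form so that the two can be compared. This is only an
illustration under the same assumptions as the derivation above; the names
`emd_recursive` and `emd_cumulative` and the sample histograms are
hypothetical and do not reproduce the original implementation.

```python
from itertools import accumulate


def emd_recursive(u, v):
    """EMD via the recursion D_i = v_i - u_i + D_{i-1}, with D_0 = 0."""
    d_prev = 0.0  # D_0
    total = 0.0
    for ui, vi in zip(u, v):
        d_prev = vi - ui + d_prev  # D_i
        total += abs(d_prev)       # accumulate |D_i|
    return total


def emd_cumulative(u, v):
    """Reference form: sum_i |U_i - V_i| over the cumulative vectors."""
    U, V = accumulate(u), accumulate(v)
    return sum(abs(a - b) for a, b in zip(U, V))


# Hypothetical histograms with the same total content (6 entries each).
u = [3, 1, 0, 2]
v = [1, 2, 2, 1]
d_rec = emd_recursive(u, v)
d_cum = emd_cumulative(u, v)
```

With these inputs the partial sums are $D_1 = -2$, $D_2 = -1$, $D_3 = 1$ and
$D_4 = 0$, so both functions return 4. The loop needs no storage for the
cumulative vectors, which is one possible reason to prefer it.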
\textcolor{red}{EMD}
These distances were used to build their empirical cumulative distribution.