ex-6: EMD section completed

Giù Marcer 2020-05-21 09:51:36 +02:00 committed by rnhmjoj
parent ee1d2242eb
commit 000cc827a2
2 changed files with 50 additions and 62 deletions

@ -38,8 +38,8 @@ int show_help(char **argv) {
/* Performs an experiment consisting in /* Performs an experiment consisting in
* *
* 1. Measuring the distribution I(θ) by reverse * 1. Measuring the distribution I(θ) sampling from
* sampling from an RNG; * an RNG;
* 2. Convolving the I(θ) sample with a kernel * 2. Convolving the I(θ) sample with a kernel
* to simulate the instrumentation response; * to simulate the instrumentation response;
* 3. Applying a gaussian noise with σ=opts.noise * 3. Applying a gaussian noise with σ=opts.noise

@ -423,36 +423,22 @@ deconvolved outcome with the original signal was quantified using the earth
mover's distance. mover's distance.
In statistics, the earth mover's distance (EMD) is the measure of distance In statistics, the earth mover's distance (EMD) is the measure of distance
between two probability distributions [@cock41]. Informally, the distributions between two distributions [@cock41]. Informally, if one imagines the two
are interpreted as two different ways of piling up a certain amount of dirt over distributions as two piles of different amounts of dirt in their respective
a region and the EMD is the minimum cost of turning one pile into the other, regions, the EMD is the minimum cost of turning one pile into the other,
where the cost is the amount of dirt moved times the distance by which it is making the first one as similar as possible to the second one, where the
moved. It is valid only if the two distributions have the same integral, that cost is the amount of dirt moved times the distance by which it is moved.
is if the two piles have the same amount of dirt.
Computing the EMD is based on a solution to the transportation problem, which Computing the EMD is based on a solution to the transportation problem, which
can be formalized as follows. can be formalized as follows.
Consider two vectors $P$ and $Q$ which represent the two probability Consider two vectors $P$ and $Q$ which represent the two distributions whose
distributions whose EMD has to be measured: EMD has to be measured:
$$ $$
P = \{ (p_1, w_{p1}) \dots (p_m, w_{pm}) \} \et P = \{ (p_1, w_{p1}) \dots (p_m, w_{pm}) \} \et
Q = \{ (q_1, w_{q1}) \dots (q_n, w_{qn}) \} Q = \{ (q_1, w_{q1}) \dots (q_n, w_{qn}) \}
$$ $$
The histogram P must be dismantled so as to obtain the histogram Q,
which is empty at the start, but I know I will want to have w_qj in each bin
located at position qj.
- I only move from P to Q
- I move no more than each entry of P
- I obtain no more than each entry of Q
- I move everything I can: either I obtain all of Q or I have used up P
and so they need not come out equal!
where $p_i$ and $q_i$ are the 'values' (that is, the location of the dirt) and where $p_i$ and $q_i$ are the 'values' (that is, the location of the dirt) and
$w_{pi}$ and $w_{qi}$ are the 'weights' (that is, the quantity of dirt). A $w_{pi}$ and $w_{qi}$ are the 'weights' (that is, the quantity of dirt). A
ground distance matrix $D_{ij}$ is defined such that its entries $d_{ij}$ are the ground distance matrix $D_{ij}$ is defined such that its entries $d_{ij}$ are the
@ -464,28 +450,40 @@ $$
W (P, Q, F) = \sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij} W (P, Q, F) = \sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}
$$ $$
with the constraints: The $Q$ region is to be considered empty at the beginning: the
'dirt' present in $P$ must be moved to $Q$ in order to match its distribution
as closely as possible. Namely, the following constraints must be
satisfied:
\begin{align*} \begin{align*}
&f_{ij} \ge 0 \hspace{15pt} &1 \le i \le m \wedge 1 \le j \le n \\ &\text{1.} \hspace{20pt} f_{ij} \ge 0 \hspace{15pt}
&\sum_{j = 1}^n f_{ij} \le w_{pi} &1 \le i \le m \\ &1 \le i \le m \wedge 1 \le j \le n
&\sum_{i = 1}^m f_{ij} \le w_{qj} &1 \le j \le n \\
\end{align*} &\text{2.} \hspace{20pt} \sum_{j = 1}^n f_{ij} \le w_{pi}
$$ &1 \le i \le m
\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} \\
&\text{3.} \hspace{20pt} \sum_{i = 1}^m f_{ij} \le w_{qj}
&1 \le j \le n
\\
&\text{4.} \hspace{20pt} \sum_{i = 1}^m \sum_{j = 1}^n f_{ij}
= \text{min} \left( \sum_{i = 1}^m w_{pi}, \sum_{j = 1}^n w_{qj} \right) = \text{min} \left( \sum_{i = 1}^m w_{pi}, \sum_{j = 1}^n w_{qj} \right)
$$ \end{align*}
The first constraint allows moving 'dirt' from $P$ to $Q$ and not vice versa. The first constraint allows moving dirt from $P$ to $Q$ and not vice versa; the
The next two constraints limit the amount of supplies that can be sent by the second limits the amount of dirt moved from each position in $P$ so as not to
values in $P$ to their weights, and the values in $Q$ to receive no more exceed the available quantity; the third sets a limit to the dirt moved to each
supplies than their weights; the last constraint forces to move the maximum position in $Q$ so as not to exceed the required quantity; and the last one
amount of supplies possible. The total moved amount is the total flow. Once the forces the maximum possible amount of dirt to be moved: either all the dirt
transportation problem is solved, and the optimal flow is found, the earth present in $P$ has been moved, or the $Q$ distribution is obtained.
mover's distance $D$ is defined as the work normalized by the total flow: The total moved amount is the total flow. If the two distributions have the
same amount of dirt, then all the dirt present in $P$ is necessarily moved to
$Q$ and the flow equals the total amount of available dirt.
Once the transportation problem is solved and the optimal flow is found, the
EMD is defined as the work normalized by the total flow:
$$ $$
D (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}} \text{EMD} (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}}
{\sum_{i = 1}^m \sum_{j=1}^n f_{ij}} {\sum_{i = 1}^m \sum_{j=1}^n f_{ij}}
$$ $$
@ -494,28 +492,29 @@ procedure simplifies a lot. By representing both histograms with two vectors $u$
and $v$, the equation above boils down to [@ramdas17]: and $v$, the equation above boils down to [@ramdas17]:
$$ $$
D (u, v) = \sum_i |U_i - V_i| \text{EMD} (u, v) = \sum_i |U_i - V_i|
$$ $$
where the sum runs over the entries of the vectors $U$ and $V$, which are the where the sum runs over the entries of the vectors $U$ and $V$, which are the
cumulative vectors of the histograms. cumulative vectors of the histograms. In the code, the following equivalent
In the code, the following equivalent recursive routine was implemented. recursive routine was implemented.
$$ $$
D (u, v) = \sum_i |D_i| \with \text{EMD} (u, v) = \sum_i |\text{EMD}_i| \with
\begin{cases} \begin{cases}
D_i = v_i - u_i + D_{i-1} \\ \text{EMD}_i = v_i - u_i + \text{EMD}_{i-1} \\
D_0 = 0 \text{EMD}_0 = 0
\end{cases} \end{cases}
$$ $$
In fact: In fact:
\begin{align*} \begin{align*}
D (u, v) &= \sum_i |D_i| = |D_0| + |D_1| + |D_2| + |D_3| + \dots \\ \text{EMD} (u, v) &= \sum_i |\text{EMD}_i| = |\text{EMD}_0| + |\text{EMD}_1|
&= 0 + |v_1 - u_1 + D_0| + + |\text{EMD}_2| + |\text{EMD}_3| + \dots \\
|v_2 - u_2 + D_1| + &= 0 + |v_1 - u_1 + \text{EMD}_0| +
|v_3 - u_3 + D_2| + \dots \\ |v_2 - u_2 + \text{EMD}_1| +
|v_3 - u_3 + \text{EMD}_2| + \dots \\
&= |v_1 - u_1| + &= |v_1 - u_1| +
|v_1 - u_1 + v_2 - u_2| + |v_1 - u_1 + v_2 - u_2| +
|v_1 - u_1 + v_2 - u_2 + v_3 - u_3| + \dots \\ |v_1 - u_1 + v_2 - u_2 + v_3 - u_3| + \dots \\
@ -526,19 +525,8 @@ In fact:
&= \sum_i |U_i - V_i| &= \sum_i |U_i - V_i|
\end{align*} \end{align*}
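
The recursive routine above might be sketched in C as follows. This is only an illustration, not the code from the repository: the function name `emd_1d` and its signature are hypothetical, and it assumes that the two histograms share the same binning, have `n` bins each and hold the same total weight, so that the cumulative formula applies.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions):
 * 1D earth mover's distance between two histograms `u` and `v`
 * with `n` bins each, in units of the bin width.
 *
 * Implements EMD = sum_i |EMD_i| with
 *   EMD_i = v_i - u_i + EMD_{i-1},  EMD_0 = 0,
 * which equals sum_i |U_i - V_i| for the cumulative vectors U, V.
 */
double emd_1d(const double *u, const double *v, size_t n) {
  assert(u != NULL && v != NULL);

  double carry = 0.0;  /* EMD_{i-1}: dirt still to be moved past bin i */
  double total = 0.0;  /* running sum of |EMD_i| */

  for (size_t i = 0; i < n; i++) {
    carry += v[i] - u[i];   /* EMD_i = v_i - u_i + EMD_{i-1} */
    total += fabs(carry);   /* accumulate |EMD_i| */
  }
  return total;
}
```

For instance, with $u = (1, 0, 0)$ and $v = (0, 0, 1)$ the partial terms are $-1, -1, 0$, so the function returns 2: one unit of dirt moved by two bins.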
This simple formula enabled comparisons to be made between a great number of
\textcolor{red}{EMD} results.
These distances were used to build their empirical cumulative distribution.
\textcolor{red}{empirical distribution}
At 95% confidence level, the compatibility of the deconvolved signal with
the original one cannot be disproved if its distance from the original signal
is greater than \textcolor{red}{value}.
\textcolor{red}{counts}
## Results comparison {#sec:conv_results} ## Results comparison {#sec:conv_results}