ex-6: EMD section completed

Giù Marcer 2020-05-21 09:51:36 +02:00 committed by rnhmjoj
parent ee1d2242eb
commit 000cc827a2
2 changed files with 50 additions and 62 deletions

@ -38,8 +38,8 @@ int show_help(char **argv) {
/* Performs an experiment consisting in
*
* 1. Measuring the distribution I(θ) by reverse
* sampling from an RNG;
* 1. Measuring the distribution I(θ) sampling from
* an RNG;
* 2. Convolving the I(θ) sample with a kernel
* to simulate the instrumentation response;
* 3. Applying a gaussian noise with σ=opts.noise

@ -423,36 +423,22 @@ deconvolved outcome with the original signal was quantified using the earth
mover's distance.
In statistics, the earth mover's distance (EMD) is the measure of distance
between two probability distributions [@cock41]. Informally, the distributions
are interpreted as two different ways of piling up a certain amount of dirt over
a region and the EMD is the minimum cost of turning one pile into the other,
where the cost is the amount of dirt moved times the distance by which it is
moved. It is valid only if the two distributions have the same integral, that
is if the two piles have the same amount of dirt.
between two distributions [@cock41]. Informally, if one imagines the two
distributions as two piles of dirt, possibly of different amounts, each over
its own region, the EMD is the minimum cost of turning one pile into the
other, making the first as similar as possible to the second, where the
cost is the amount of dirt moved times the distance by which it is moved.
Computing the EMD is based on a solution to the transportation problem, which
can be formalized as follows.
Consider two vectors $P$ and $Q$ which represent the two probability
distributions whose EMD has to be measured:
Consider two vectors $P$ and $Q$ which represent the two distributions whose
EMD has to be measured:
$$
P = \{ (p_1, w_{p1}) \dots (p_m, w_{pm}) \} \et
Q = \{ (q_1, w_{q1}) \dots (q_n, w_{qn}) \}
$$
The histogram P must be taken apart so as to obtain the histogram Q, which is
empty at the start, but I know that I want to end up with w_qj in each bin
located at position qj.
- I only move from P to Q
- I move no more than each entry of P
- I receive no more than each entry of Q
- I move everything I can: either I obtain all of Q or I have used up P
and they do not have to come out equal, then!
where $p_i$ and $q_i$ are the 'values' (that is, the location of the dirt) and
$w_{pi}$ and $w_{qi}$ are the 'weights' (that is, the quantity of dirt). A
ground distance matrix $D_{ij}$ is defined such that its entries $d_{ij}$ are the
@ -464,28 +450,40 @@ $$
W (P, Q, F) = \sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}
$$
with the constraints:
The $Q$ region is to be considered empty at the beginning: the 'dirt' present
in $P$ must be moved to $Q$ so as to reproduce its distribution as closely as
possible. Namely, the following constraints must be satisfied:
\begin{align*}
&f_{ij} \ge 0 \hspace{15pt} &1 \le i \le m \wedge 1 \le j \le n \\
&\sum_{j = 1}^n f_{ij} \le w_{pi} &1 \le i \le m \\
&\sum_{i = 1}^m f_{ij} \le w_{qj} &1 \le j \le n
\end{align*}
$$
\sum_{i = 1}^m \sum_{j = 1}^n f_{ij}
&\text{1.} \hspace{20pt} f_{ij} \ge 0 \hspace{15pt}
&1 \le i \le m \wedge 1 \le j \le n
\\
&\text{2.} \hspace{20pt} \sum_{j = 1}^n f_{ij} \le w_{pi}
&1 \le i \le m
\\
&\text{3.} \hspace{20pt} \sum_{i = 1}^m f_{ij} \le w_{qj}
&1 \le j \le n
\\
&\text{4.} \hspace{20pt} \sum_{i = 1}^m \sum_{j = 1}^n f_{ij}
= \text{min} \left( \sum_{i = 1}^m w_{pi}, \sum_{j = 1}^n w_{qj} \right)
$$
\end{align*}
The first constraint allows moving 'dirt' from $P$ to $Q$ and not vice versa.
The next two constraints limit the amount of supplies that can be sent by the
values in $P$ to their weights, and the values in $Q$ to receive no more
supplies than their weights; the last constraint forces the maximum possible
amount of supplies to be moved. The total moved amount is the total flow. Once the
transportation problem is solved, and the optimal flow is found, the earth
mover's distance $D$ is defined as the work normalized by the total flow:
The first constraint allows moving dirt from $P$ to $Q$ and not vice versa; the
second limits the amount of dirt moved from each position in $P$ so as not to
exceed the available quantity; the third limits the dirt moved to each
position in $Q$ so as not to exceed the required quantity; and the last one
forces the maximum possible amount of dirt to be moved: either all the dirt
present in $P$ has been moved, or the $Q$ distribution is obtained.
The total moved amount is the total flow. If the two distributions have the
same amount of dirt, then all the dirt present in $P$ is necessarily moved to
$Q$ and the flow equals the total amount of available dirt.
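
As an illustration only, the four constraints can be checked on a candidate
flow with a short C routine. This is a minimal sketch and not part of the
exercise code: the function name `flow_is_feasible`, the row-major storage of
$f_{ij}$ and the tolerance `eps` are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the m×n flow `f` (row-major)
 * satisfies the four constraints for the weights `wp` (length m)
 * and `wq` (length n). `eps` is an assumed tolerance for the
 * floating point comparisons.
 */
bool flow_is_feasible(const double *f, const double *wp, const double *wq,
                      size_t m, size_t n, double eps) {
  assert(f != NULL && wp != NULL && wq != NULL);
  double total = 0, sum_p = 0, sum_q = 0;

  for (size_t i = 0; i < m; i++) {
    double row = 0;
    for (size_t j = 0; j < n; j++) {
      if (f[i*n + j] < 0) return false;   /* 1. no flow from Q to P */
      row += f[i*n + j];
    }
    if (row > wp[i] + eps) return false;  /* 2. no more than available */
    total += row;
    sum_p += wp[i];
  }

  for (size_t j = 0; j < n; j++) {
    double col = 0;
    for (size_t i = 0; i < m; i++) col += f[i*n + j];
    if (col > wq[j] + eps) return false;  /* 3. no more than required */
    sum_q += wq[j];
  }

  /* 4. the total flow must equal the smaller of the two totals */
  double max_flow = sum_p < sum_q ? sum_p : sum_q;
  double diff = total - max_flow;
  if (diff < 0) diff = -diff;
  return diff <= eps;
}
```

For instance, with $w_p = (1, 1)$ and $w_q = (1, 0.5)$ the flow that moves one
unit from $p_1$ to $q_1$ and half a unit from $p_2$ to $q_2$ passes the check,
while a flow that stops after the first move violates the fourth constraint.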
Once the transportation problem is solved and the optimal flow is found, the
EMD is defined as the work normalized by the total flow:
$$
D (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}}
\text{EMD} (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}}
{\sum_{i = 1}^m \sum_{j=1}^n f_{ij}}
$$
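
The normalization above can be sketched in C as follows. The sketch assumes
that a flow already solving the transportation problem is available; finding
that flow is not shown. The name `emd_from_flow` and the row-major layout of
the two matrices are assumptions of the example, not the routine used in the
exercise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: EMD as the work normalized by the total flow.
 * `f` is assumed to be an optimal flow and `d` the ground distance
 * matrix, both m×n and stored row-major.
 */
double emd_from_flow(const double *f, const double *d, size_t m, size_t n) {
  assert(f != NULL && d != NULL);
  double work = 0;  /* sum of f_ij * d_ij */
  double flow = 0;  /* sum of f_ij        */

  for (size_t i = 0; i < m; i++) {
    for (size_t j = 0; j < n; j++) {
      work += f[i*n + j] * d[i*n + j];
      flow += f[i*n + j];
    }
  }

  assert(flow > 0);  /* the ratio is undefined if nothing was moved */
  return work / flow;
}
```

Dividing by the total flow makes the result independent of the overall amount
of dirt, which matters when the two distributions have different totals.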
@ -494,28 +492,29 @@ procedure simplifies a lot. By representing both histograms with two vectors $u$
and $v$, the equation above boils down to [@ramdas17]:
$$
D (u, v) = \sum_i |U_i - V_i|
\text{EMD} (u, v) = \sum_i |U_i - V_i|
$$
where the sum runs over the entries of the vectors $U$ and $V$, which are the
cumulative vectors of the histograms.
In the code, the following equivalent recursive routine was implemented.
cumulative vectors of the histograms. In the code, the following equivalent
recursive routine was implemented.
$$
D (u, v) = \sum_i |D_i| \with
\text{EMD} (u, v) = \sum_i |\text{EMD}_i| \with
\begin{cases}
D_i = v_i - u_i + D_{i-1} \\
D_0 = 0
\text{EMD}_i = v_i - u_i + \text{EMD}_{i-1} \\
\text{EMD}_0 = 0
\end{cases}
$$
In fact:
\begin{align*}
D (u, v) &= \sum_i |D_i| = |D_0| + |D_1| + |D_2| + |D_3| + \dots \\
&= 0 + |v_1 - u_1 + D_0| +
|v_2 - u_2 + D_1| +
|v_3 - u_3 + D_2| + \dots \\
\text{EMD} (u, v) &= \sum_i |\text{EMD}_i| = |\text{EMD}_0| + |\text{EMD}_1|
+ |\text{EMD}_2| + |\text{EMD}_3| + \dots \\
&= 0 + |v_1 - u_1 + \text{EMD}_0| +
|v_2 - u_2 + \text{EMD}_1| +
|v_3 - u_3 + \text{EMD}_2| + \dots \\
&= |v_1 - u_1| +
|v_1 - u_1 + v_2 - u_2| +
|v_1 - u_1 + v_2 - u_2 + v_3 - u_3| + \dots \\
@ -526,19 +525,8 @@ In fact:
&= \sum_i |U_i - V_i|
\end{align*}
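
A minimal C sketch of the recursive routine is given here for illustration.
It assumes two histograms with the same number of bins and the same total
content, with unit distance between adjacent bins; the name `emd_1d` is
hypothetical and the actual implementation in the exercise may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: EMD between two histograms `u` and `v`
 * with `n` bins each, computed through the recursion
 *
 *   EMD_i = v_i - u_i + EMD_{i-1},  EMD_0 = 0
 *
 * and summing |EMD_i| over the bins.
 */
double emd_1d(const double *u, const double *v, size_t n) {
  assert(u != NULL && v != NULL);
  double carry = 0;  /* EMD_i: net dirt to be carried past bin i */
  double total = 0;  /* running sum of |EMD_i|                   */

  for (size_t i = 0; i < n; i++) {
    carry += v[i] - u[i];
    total += carry < 0 ? -carry : carry;
  }
  return total;
}
```

For example, with $u = (1, 0, 0)$ and $v = (0, 0, 1)$ the partial sums are
$-1, -1, 0$, so the routine returns 2: one unit of dirt moved by two bins.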
\textcolor{red}{EMD}
These distances were used to build their empirical cumulative distribution.
\textcolor{red}{empirical distribution}
At 95% confidence level, the compatibility of the deconvolved signal with
the original one cannot be disproved if its distance from the original signal
is greater than \textcolor{red}{value}.
\textcolor{red}{counts}
This simple formula enabled comparisons to be made between a great number of
results.
## Results comparison {#sec:conv_results}