ex-6: EMD section completed

2020-05-21 09:51:36 +02:00 · 000cc827a2
commit 000cc827a2
parent ee1d2242eb
2 changed files with 50 additions and 62 deletions
--- a/ex-6/test.c
+++ b/ex-6/test.c
@@ -38,8 +38,8 @@ int show_help(char **argv) {

 /* Performs an experiment consisting in
 *
- *  1. Measuring the distribution I(θ) by reverse
- *     sampling from an RNG;
+ *  1. Measuring the distribution I(θ) by sampling
+ *     from an RNG;
 *  2. Convolving the I(θ) sample with a kernel
 *     to simulate the instrumentation response;
 *  3. Applying a gaussian noise with σ=opts.noise
--- a/notes/sections/6.md
+++ b/notes/sections/6.md
@@ -423,36 +423,22 @@ deconvolved outcome with the original signal was quantified using the earth
 mover's distance.

 In statistics, the earth mover's distance (EMD) is the measure of distance
-between two probability distributions [@cock41]. Informally, the distributions
-are interpreted as two different ways of piling up a certain amount of dirt over
-a region and the EMD is the minimum cost of turning one pile into the other,
-where the cost is the amount of dirt moved times the distance by which it is
-moved. It is valid only if the two distributions have the same integral, that
-is if the two piles have the same amount of dirt.  
+between two distributions [@cock41]. Informally, if one imagines the two
+distributions as two piles of dirt, possibly of different amounts, each in its
+own region, the EMD is the minimum cost of reshaping the first pile so that it
+resembles the second as closely as possible, where the cost is the amount of
+dirt moved times the distance by which it is moved.  
 Computing the EMD is based on a solution to the transportation problem, which
 can be formalized as follows.

-Consider two vectors $P$ and $Q$ which represent the two probability
-distributions whose EMD has to be measured:
+Consider two vectors $P$ and $Q$ which represent the two distributions whose
+EMD has to be measured:

 $$
  P = \{ (p_1, w_{p1}) \dots (p_m, w_{pm}) \} \et
  Q = \{ (q_1, w_{q1}) \dots (q_n, w_{qn}) \}
 $$

-Histogram P must be dismantled so as to obtain histogram Q, which is
-initially empty, but I know that I want w_qj in every bin located at
-position qj.
- I only move from P to Q
- I move no more than each entry of P
- I obtain no more than each entry of Q
- I move everything I can: either I obtain all of Q or I run out of P
-
-and so they do not have to come out equal!
-
-
-
-
 where $p_i$ and $q_i$ are the 'values' (that is, the location of the dirt) and
 $w_{pi}$ and $w_{qi}$ are the 'weights' (that is, the quantity of dirt). A
 ground distance matrix $D_{ij}$ is defined such that its entries $d_{ij}$ are the
@@ -464,28 +450,40 @@ $$
  W (P, Q, F) = \sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}
 $$

-with the constraints:
+The $Q$ region is to be considered empty at the beginning: the 'dirt'
+present in $P$ must be moved to $Q$ so as to reproduce the $Q$
+distribution as closely as possible. Namely, the following constraints
+must be satisfied:

 \begin{align*}
-  &f_{ij} \ge 0 \hspace{15pt}       &1 \le i \le m \wedge 1 \le j \le n \\
-  &\sum_{j = 1}^n f_{ij} \le w_{pi} &1 \le i \le m                      \\
-  &\sum_{i = 1}^m f_{ij} \le w_{qj} &1 \le j \le n
-\end{align*}
-$$
-  \sum_{j = 1}^n f_{ij} \sum_{j = 1}^m f_{ij} \le w_{qj}
+  &\text{1.} \hspace{20pt} f_{ij} \ge 0 \hspace{15pt}
+  &1 \le i \le m \wedge 1 \le j \le n
+  \\
+  &\text{2.} \hspace{20pt} \sum_{j = 1}^n f_{ij} \le w_{pi}
+  &1 \le i \le m
+  \\
+  &\text{3.} \hspace{20pt} \sum_{i = 1}^m f_{ij} \le w_{qj}
+  &1 \le j \le n
+  \\
+  &\text{4.} \hspace{20pt} \sum_{i = 1}^m \sum_{j = 1}^n f_{ij}
  = \text{min} \left( \sum_{i = 1}^m w_{pi}, \sum_{j = 1}^n w_{qj} \right)
-$$
+\end{align*}

-The first constraint allows moving 'dirt' from $P$ to $Q$ and not vice versa.
-The next two constraints limits the amount of supplies that can be sent by the
-values in $P$ to their weights, and the values in $Q$ to receive no more
-supplies than their weights; the last constraint forces to move the maximum
-amount of supplies possible. The total moved amount is the total flow. Once the
-transportation problem is solved, and the optimal flow is found, the earth
-mover's distance $D$ is defined as the work normalized by the total flow: 
+The first constraint allows moving dirt from $P$ to $Q$ and not vice versa; the
+second limits the amount of dirt moved from each position in $P$ so as not to
+exceed the available quantity; the third limits the dirt moved to each
+position in $Q$ so as not to exceed the required quantity; and the last one
+forces the maximum possible amount of dirt to be moved: either all the dirt
+present in $P$ has been moved, or the $Q$ distribution is obtained.  
+The total moved amount is the total flow. If the two distributions have the
+same amount of dirt, then all the dirt present in $P$ is necessarily moved to
+$Q$ and the flow equals the total amount of available dirt.
+
+Once the transportation problem is solved and the optimal flow is found, the
+EMD is defined as the work normalized by the total flow: 

 $$
-  D (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}}
+  \text{EMD} (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}}
                  {\sum_{i = 1}^m \sum_{j=1}^n f_{ij}}
 $$

@@ -494,28 +492,29 @@ procedure simplifies a lot. By representing both histograms with two vectors $u$
 and $v$, the equation above boils down to [@ramdas17]:

 $$
-  D (u, v) = \sum_i |U_i - V_i|
+  \text{EMD} (u, v) = \sum_i |U_i - V_i|
 $$

 where the sum runs over the entries of the vectors $U$ and $V$, which are the
-cumulative vectors of the histograms.  
-In the code, the following equivalent recursive routine was implemented.
+cumulative vectors of the histograms. In the code, the following equivalent
+recursive routine was implemented.

 $$
-  D (u, v) = \sum_i |D_i| \with
+  \text{EMD} (u, v) = \sum_i |\text{EMD}_i| \with
  \begin{cases}
-    D_i = v_i - u_i + D_{i-1} \\ 
-    D_0 = 0
+   \text{EMD}_i = v_i - u_i + \text{EMD}_{i-1} \\ 
+   \text{EMD}_0 = 0
  \end{cases}
 $$

 In fact:

 \begin{align*}
-  D (u, v) &= \sum_i |D_i| = |D_0| + |D_1| + |D_2| + |D_3| + \dots \\
-           &= 0 + |v_1 - u_1 + D_0| + 
-                  |v_2 - u_2 + D_1| + 
-                  |v_3 - u_3 + D_2| + \dots                        \\
+  \text{EMD} (u, v) &= \sum_i |\text{EMD}_i| = |\text{EMD}_0| + |\text{EMD}_1|
+                     + |\text{EMD}_2| + |\text{EMD}_3| + \dots     \\
+           &= 0 + |v_1 - u_1 + \text{EMD}_0| + 
+                  |v_2 - u_2 + \text{EMD}_1| + 
+                  |v_3 - u_3 + \text{EMD}_2| + \dots               \\
           &= |v_1 - u_1| + 
              |v_1 - u_1 + v_2 - u_2| + 
              |v_1 - u_1 + v_2 - u_2 + v_3 - u_3| + \dots          \\
@@ -526,19 +525,8 @@ In fact:
           &= \sum_i |U_i - V_i|
 \end{align*}
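
The recursion derived above can be sketched in C as follows. This is only an illustrative sketch, not the routine actually used in `ex-6`: the function name `emd_1d` is hypothetical, and it assumes that both histograms share the same `n` bins, that the bins have unit width, and that `u` and `v` have the same total weight.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not taken from the repository): 1D earth
 * mover's distance between two histograms `u` and `v` defined on
 * the same `n` bins, assuming unit bin width and equal total weight.
 */
double emd_1d(const double *u, const double *v, size_t n) {
  assert(n == 0 || (u != NULL && v != NULL));

  double running = 0.0; /* EMD_i: dirt carried over from bin i to i+1 */
  double total = 0.0;   /* accumulates the sum of |EMD_i|             */

  for (size_t i = 0; i < n; i++) {
    /* EMD_i = v_i - u_i + EMD_{i-1}, with EMD_0 = 0 */
    running += v[i] - u[i];
    total += fabs(running);
  }
  return total;
}
```

For instance, with `u = {0.5, 0.5, 0}` and `v = {0, 0.5, 0.5}` the cumulative vectors are $U = (0.5, 1, 1)$ and $V = (0, 0.5, 1)$, so the function returns $0.5 + 0.5 + 0 = 1$: half of the dirt has to travel a distance of two bins.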

-
-\textcolor{red}{EMD}
-
-These distances were used to build their empirical cumulative distribution.
-
-\textcolor{red}{empirical distribution}
-
-At 95% confidence level, the compatibility of the deconvolved signal with
-the original one cannot be disproved if its distance from the original signal
-is greater than  \textcolor{red}{value}.
-
-\textcolor{red}{counts}
-
+This simple formula enabled comparisons to be made between a great number of
+results.

 ## Results comparison {#sec:conv_results}