ex-6: EMD section completed

2020-05-21 09:51:36 +02:00 · 2020-05-21 09:51:36 +02:00 · 000cc827a2
commit 000cc827a2
parent ee1d2242eb
2 changed files with 50 additions and 62 deletions
--- a/ex-6/test.c
+++ b/ex-6/test.c
@@ -38,8 +38,8 @@ int show_help(char **argv) {
 /* Performs an experiment consisting in
 *
- *  1. Measuring the distribution I(θ) by reverse
+ *  1. Measuring the distribution I(θ) sampling from
- *     sampling from an RNG;
+ *     an RNG;
 *  2. Convolving the I(θ) sample with a kernel
 *     to simulate the instrumentation response;
 *  3. Applying a gaussian noise with σ=opts.noise
--- a/notes/sections/6.md
+++ b/notes/sections/6.md
@@ -423,36 +423,22 @@ deconvolved outcome with the original signal was quantified using the earth
 mover's distance.
 In statistics, the earth mover's distance (EMD) is a measure of distance
-between two probability distributions [@cock41]. Informally, the distributions
+between two distributions [@cock41]. Informally, if one imagines the two
-are interpreted as two different ways of piling up a certain amount of dirt over
+distributions as two piles of different amounts of dirt in their respective
-a region and the EMD is the minimum cost of turning one pile into the other,
+regions, the EMD is the minimum cost of turning one pile into the other,
-where the cost is the amount of dirt moved times the distance by which it is
+making the first one as similar as possible to the second one, where the
-moved. It is valid only if the two distributions have the same integral, that
+cost is the amount of dirt moved times the distance by which it is moved.  
 is if the two piles have the same amount of dirt.  
 Computing the EMD is based on a solution to the transportation problem, which
 can be formalized as follows.
-Consider two vectors $P$ and $Q$ which represent the two probability
+Consider two vectors $P$ and $Q$ which represent the two distributions whose
-distributions whose EMD has to be measured:
+EMD has to be measured:
 $$
  P = \{ (p_1, w_{p1}) \dots (p_m, w_{pm}) \} \et
  Q = \{ (q_1, w_{q1}) \dots (q_n, w_{qn}) \}
 $$
 Histogram P must be torn down so as to obtain histogram Q, which starts out
 empty, but I know I will want to have w_qj in each bin located at
 position qj.
 - I only move from P to Q
 - I move no more than each entry of P
 - I obtain no more than each entry of Q
 - I move everything I can: either I obtain all of Q or I have used up P
 and so they do not have to come out equal!
 where $p_i$ and $q_i$ are the 'values' (that is, the location of the dirt) and
 $w_{pi}$ and $w_{qi}$ are the 'weights' (that is, the quantity of dirt). A
 ground distance matrix $D_{ij}$ is defined such that its entries $d_{ij}$ are the
@@ -464,28 +450,40 @@ $$
  W (P, Q, F) = \sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}
 $$
-with the constraints:
+The $Q$ region is to be considered empty at the beginning: the
 'dirt' present in $P$ must be moved to $Q$ so as to reproduce its
 distribution as closely as possible. Namely, the following constraints must be
 satisfied:
 \begin{align*}
-  &f_{ij} \ge 0 \hspace{15pt}       &1 \le i \le m \wedge 1 \le j \le n \\
+  &\text{1.} \hspace{20pt} f_{ij} \ge 0 \hspace{15pt}
-  &\sum_{j = 1}^n f_{ij} \le w_{pi} &1 \le i \le m                      \\
+  &1 \le i \le m \wedge 1 \le j \le n
-  &\sum_{i = 1}^m f_{ij} \le w_{qj} &1 \le j \le n
+  \\
-\end{align*}
+  &\text{2.} \hspace{20pt} \sum_{j = 1}^n f_{ij} \le w_{pi}
-$$
+  &1 \le i \le m
-  \sum_{j = 1}^n f_{ij} \sum_{j = 1}^m f_{ij} \le w_{qj}
+  \\
  &\text{3.} \hspace{20pt} \sum_{i = 1}^m f_{ij} \le w_{qj}
  &1 \le j \le n
  \\
  &\text{4.} \hspace{20pt} \sum_{i = 1}^m \sum_{j = 1}^n f_{ij}
  = \text{min} \left( \sum_{i = 1}^m w_{pi}, \sum_{j = 1}^n w_{qj} \right)
-$$
+\end{align*}
-The first constraint allows moving 'dirt' from $P$ to $Q$ and not vice versa.
+The first constraint allows moving dirt from $P$ to $Q$ and not vice versa; the
-The next two constraints limits the amount of supplies that can be sent by the
+second limits the amount of dirt moved by each position in $P$ in order to not
-values in $P$ to their weights, and the values in $Q$ to receive no more
+exceed the available quantity; the third sets a limit to the dirt moved to each
-supplies than their weights; the last constraint forces to move the maximum
+position in $Q$ in order to not exceed the required quantity and the last one
-amount of supplies possible. The total moved amount is the total flow. Once the
+forces the maximum possible amount of dirt to be moved: either all the dirt
-transportation problem is solved, and the optimal flow is found, the earth
+present in $P$ has been moved, or the $Q$ distribution is obtained.  
-mover's distance $D$ is defined as the work normalized by the total flow: 
+The total moved amount is the total flow. If the two distributions have the
 same amount of dirt, then all the dirt present in $P$ is necessarily moved to
 $Q$ and the flow equals the total amount of available dirt.
 Once the transportation problem is solved and the optimal flow is found, the
 EMD is defined as the work normalized by the total flow: 
 $$
-  D (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}}
+  \text{EMD} (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}}
                  {\sum_{i = 1}^m \sum_{j=1}^n f_{ij}}
 $$
@@ -494,28 +492,29 @@ procedure simplifies a lot. By representing both histograms with two vectors $u$
 and $v$, the equation above boils down to [@ramdas17]:
 $$
-  D (u, v) = \sum_i |U_i - V_i|
+  \text{EMD} (u, v) = \sum_i |U_i - V_i|
 $$
 where the sum runs over the entries of the vectors $U$ and $V$, which are the
-cumulative vectors of the histograms.  
+cumulative vectors of the histograms. In the code, the following equivalent
-In the code, the following equivalent recursive routine was implemented.
+recursive routine was implemented.
 $$
-  D (u, v) = \sum_i |D_i| \with
+  \text{EMD} (u, v) = \sum_i |\text{EMD}_i| \with
  \begin{cases}
-    D_i = v_i - u_i + D_{i-1} \\ 
+   \text{EMD}_i = v_i - u_i + \text{EMD}_{i-1} \\ 
-    D_0 = 0
+   \text{EMD}_0 = 0
  \end{cases}
 $$
 In fact:
 \begin{align*}
-  D (u, v) &= \sum_i |D_i| = |D_0| + |D_1| + |D_2| + |D_3| + \dots \\
+  \text{EMD} (u, v) &= \sum_i |\text{EMD}_i| = |\text{EMD}_0| + |\text{EMD}_1|
-           &= 0 + |v_1 - u_1 + D_0| + 
+                     + |\text{EMD}_2| + |\text{EMD}_3| + \dots     \\
-                  |v_2 - u_2 + D_1| + 
+           &= 0 + |v_1 - u_1 + \text{EMD}_0| + 
-                  |v_3 - u_3 + D_2| + \dots                        \\
+                  |v_2 - u_2 + \text{EMD}_1| + 
                  |v_3 - u_3 + \text{EMD}_2| + \dots               \\
           &= |v_1 - u_1| + 
              |v_1 - u_1 + v_2 - u_2| + 
              |v_1 - u_1 + v_2 - u_2 + v_3 - u_3| + \dots          \\
@@ -526,19 +525,8 @@ In fact:
           &= \sum_i |U_i - V_i|
 \end{align*}
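The recursion derived above can be sketched in C as follows. This is only an illustrative sketch, not the routine from the repository: the name `emd_1d` and its signature are hypothetical, and it assumes that the two histograms share the same binning, have the same total weight and that adjacent bins are one unit of distance apart.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not taken from the repository): EMD between
 * two histograms u and v with n bins each.
 * Assumptions: same binning, same total weight, unit bin spacing.
 */
double emd_1d(const double *u, const double *v, size_t n) {
  assert(u != NULL && v != NULL);

  double carry = 0.0; /* EMD_{i-1}; starts from EMD_0 = 0 */
  double total = 0.0; /* running sum of |EMD_i| */

  for (size_t i = 0; i < n; i++) {
    carry += v[i] - u[i]; /* EMD_i = v_i - u_i + EMD_{i-1} */
    total += fabs(carry);
  }
  return total;
}
```

For instance, with `u = {0.5, 0.5, 0}` and `v = {0, 0.5, 0.5}` the partial sums are $-0.5$, $-0.5$ and $0$, so the function returns 1: half of the dirt is moved by two bins.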
-
+This simple formula enabled comparisons to be made between a great number of
-\textcolor{red}{EMD}
+results.
 These distances were used to build their empirical cumulative distribution.
 \textcolor{red}{empirical distribution}
 At 95% confidence level, the compatibility of the deconvolved signal with
 the original one cannot be disproved if its distance from the original signal
 is greater than \textcolor{red}{value}.
 \textcolor{red}{counts}
 ## Results comparison {#sec:conv_results}