From 000cc827a22eeaf7acd9848b258e7985804a1aca Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Gi=C3=B9=20Marcer?= Date: Thu, 21 May 2020 09:51:36 +0200 Subject: [PATCH] ex-6: EMD section completed --- ex-6/test.c | 4 +- notes/sections/6.md | 108 ++++++++++++++++++++------------------------ 2 files changed, 50 insertions(+), 62 deletions(-) diff --git a/ex-6/test.c b/ex-6/test.c index 85eeb93..17e0e50 100644 --- a/ex-6/test.c +++ b/ex-6/test.c @@ -38,8 +38,8 @@ int show_help(char **argv) { /* Performs an experiment consisting in * - * 1. Measuring the distribution I(θ) by reverse - * sampling from an RNG; + * 1. Measuring the distribution I(θ) by sampling from + * an RNG; * 2. Convolving the I(θ) sample with a kernel * to simulate the instrumentation response; * 3. Applying a gaussian noise with σ=opts.noise diff --git a/notes/sections/6.md b/notes/sections/6.md index 91d1cc5..007672c 100644 --- a/notes/sections/6.md +++ b/notes/sections/6.md @@ -423,36 +423,22 @@ deconvolved outcome with the original signal was quantified using the earth mover's distance. In statistics, the earth mover's distance (EMD) is the measure of distance -between two probability distributions [@cock41]. Informally, the distributions -are interpreted as two different ways of piling up a certain amount of dirt over -a region and the EMD is the minimum cost of turning one pile into the other, -where the cost is the amount of dirt moved times the distance by which it is -moved. It is valid only if the two distributions have the same integral, that -is if the two piles have the same amount of dirt. +between two distributions [@cock41]. Informally, if one imagines the two +distributions as two piles of dirt, possibly of different amounts, in their +respective regions, the EMD is the minimum cost of turning one pile into the +other, making the first as similar as possible to the second, where the cost +is the amount of dirt moved times the distance by which it is moved. 
Computing the EMD is based on a solution to the transportation problem, which can be formalized as follows. -Consider two vectors $P$ and $Q$ which represent the two probability -distributions whose EMD has to be measured: +Consider two vectors $P$ and $Q$ which represent the two distributions whose +EMD has to be measured: $$ P = \{ (p_1, w_{p1}) \dots (p_m, w_{pm}) \} \et Q = \{ (q_1, w_{q1}) \dots (q_n, w_{qn}) \} $$ -Histogram P must be torn down so as to obtain histogram Q, -which starts out empty but which I know should end up with w_qj in each bin at -position qj. -- I only move from P to Q -- I move no more than each entry of P -- I obtain no more than each entry of Q -- I move everything I can: either I obtain all of Q or I have exhausted P - -and so they do not have to come out equal! - - - - where $p_i$ and $q_i$ are the 'values' (that is, the location of the dirt) and $w_{pi}$ and $w_{qi}$ are the 'weights' (that is, the quantity of dirt). A ground distance matrix $D_{ij}$ is defined such as its entries $d_{ij}$ are the @@ -464,28 +450,40 @@ $$ W (P, Q, F) = \sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij} $$ -with the constraints: +The $Q$ region is to be considered empty at the beginning: the 'dirt' present +in $P$ must be moved to $Q$ so as to reproduce the $Q$ distribution as closely +as possible. 
Namely, the following constraints must be +satisfied: \begin{align*} - &f_{ij} \ge 0 \hspace{15pt} &1 \le i \le m \wedge 1 \le j \le n \\ - &\sum_{j = 1}^n f_{ij} \le w_{pi} &1 \le i \le m \\ - &\sum_{i = 1}^m f_{ij} \le w_{qj} &1 \le j \le n -\end{align*} -$$ - \sum_{j = 1}^n f_{ij} \sum_{j = 1}^m f_{ij} \le w_{qj} + &\text{1.} \hspace{20pt} f_{ij} \ge 0 \hspace{15pt} + &1 \le i \le m \wedge 1 \le j \le n + \\ + &\text{2.} \hspace{20pt} \sum_{j = 1}^n f_{ij} \le w_{pi} + &1 \le i \le m + \\ + &\text{3.} \hspace{20pt} \sum_{i = 1}^m f_{ij} \le w_{qj} + &1 \le j \le n + \\ + &\text{4.} \hspace{20pt} \sum_{i = 1}^m \sum_{j = 1}^n f_{ij} = \text{min} \left( \sum_{i = 1}^m w_{pi}, \sum_{j = 1}^n w_{qj} \right) -$$ +\end{align*} -The first constraint allows moving 'dirt' from $P$ to $Q$ and not vice versa. -The next two constraints limits the amount of supplies that can be sent by the -values in $P$ to their weights, and the values in $Q$ to receive no more -supplies than their weights; the last constraint forces to move the maximum -amount of supplies possible. The total moved amount is the total flow. Once the -transportation problem is solved, and the optimal flow is found, the earth -mover's distance $D$ is defined as the work normalized by the total flow: +The first constraint allows moving dirt from $P$ to $Q$ and not vice versa; the +second limits the amount of dirt moved from each position in $P$ so as not to +exceed the available quantity; the third limits the dirt moved to each +position in $Q$ so as not to exceed the required quantity; and the last one +forces the maximum possible amount of dirt to be moved: either all the dirt +present in $P$ has been moved, or the $Q$ distribution is obtained. +The total moved amount is the total flow. If the two distributions have the +same amount of dirt, all the dirt present in $P$ is necessarily moved to +$Q$ and the flow equals the total amount of available dirt. 
+ +Once the transportation problem is solved and the optimal flow is found, the +EMD is defined as the work normalized by the total flow: $$ - D (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}} + \text{EMD} (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}} {\sum_{i = 1}^m \sum_{j=1}^n f_{ij}} $$ @@ -494,28 +492,29 @@ procedure simplifies a lot. By representing both histograms with two vectors $u$ and $v$, the equation above boils down to [@ramdas17]: $$ - D (u, v) = \sum_i |U_i - V_i| + \text{EMD} (u, v) = \sum_i |U_i - V_i| $$ where the sum runs over the entries of the vectors $U$ and $V$, which are the -cumulative vectors of the histograms. -In the code, the following equivalent recursive routine was implemented. +cumulative vectors of the histograms. In the code, the following equivalent +recursive routine was implemented. $$ - D (u, v) = \sum_i |D_i| \with + \text{EMD} (u, v) = \sum_i |\text{EMD}_i| \with \begin{cases} - D_i = v_i - u_i + D_{i-1} \\ - D_0 = 0 + \text{EMD}_i = v_i - u_i + \text{EMD}_{i-1} \\ + \text{EMD}_0 = 0 \end{cases} $$ In fact: \begin{align*} - D (u, v) &= \sum_i |D_i| = |D_0| + |D_1| + |D_2| + |D_3| + \dots \\ - &= 0 + |v_1 - u_1 + D_0| + - |v_2 - u_2 + D_1| + - |v_3 - u_3 + D_2| + \dots \\ + \text{EMD} (u, v) &= \sum_i |\text{EMD}_i| = |\text{EMD}_0| + |\text{EMD}_1| + + |\text{EMD}_2| + |\text{EMD}_3| + \dots \\ + &= 0 + |v_1 - u_1 + \text{EMD}_0| + + |v_2 - u_2 + \text{EMD}_1| + + |v_3 - u_3 + \text{EMD}_2| + \dots \\ &= |v_1 - u_1| + |v_1 - u_1 + v_2 - u_2| + |v_1 - u_1 + v_2 - u_2 + v_3 - u_3| + \dots \\ @@ -526,19 +525,8 @@ In fact: &= \sum_i |U_i - V_i| \end{align*} - -\textcolor{red}{EMD} - -These distances were used to build their empirical cumulative distribution. - -\textcolor{red}{empirical distribution} - -At 95% confidence level, the compatibility of the deconvolved signal with -the original one cannot be disporoved if its distance from the original signal -is grater than \textcolor{red}{value}. 
- -\textcolor{red}{counts} - +This simple formula enabled comparisons to be made between a great number of +results. ## Results comparison {#sec:conv_results}
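
As an illustration of the recursive routine described in the EMD notes patched above, the one-dimensional EMD between two histograms could be computed along these lines. This is a minimal sketch, not the code in `ex-6`: the name `emd_1d` and its signature are hypothetical, and the two histograms are assumed to share the same binning (with unit bin spacing) and the same total weight.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not the routine shipped in ex-6).
 *
 * Computes the 1D earth mover's distance between two histograms
 * u and v with n bins each, using the recursion
 *
 *   EMD_0 = 0,  EMD_i = v_i - u_i + EMD_{i-1},  EMD = sum_i |EMD_i|
 *
 * Assumptions: same binning, unit bin spacing, equal total weight.
 */
double emd_1d(const double *u, const double *v, size_t n) {
  double carry = 0.0; /* EMD_i: difference of the cumulative sums */
  double total = 0.0; /* running sum of |EMD_i| */

  for (size_t i = 0; i < n; i++) {
    carry += v[i] - u[i];
    total += fabs(carry);
  }
  return total;
}
```

For `u = {1, 0, 0}` and `v = {0, 0, 1}` the running terms are -1, -1 and 0, so the sketch returns 2: one unit of dirt moved across two bins.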