ex-6: written EMD description and implementation

2020-05-03 00:06:02 +02:00 · 2020-05-03 00:06:02 +02:00 · aa36fca43e
commit aa36fca43e
parent df5f5f9ac6
1 changed files with 99 additions and 17 deletions
--- a/notes/sections/6.md
+++ b/notes/sections/6.md
@ -404,16 +404,9 @@ deconvolved signal is always the same 'distance' far form the convolved one:
 if it very smooth, the deconvolved signal is very smooth too and if the
 convolved is less smooth, it is less smooth too.

-It was also implemented the possibility to add a Poisson noise to the
-convolved histogram to check weather the deconvolution is affected or not by
-this kind of interference. It was took as an example the case with $\sigma =
-\Delta \theta$. In @fig:poisson the results are shown for both methods when a
-Poisson noise with mean $\mu = 50$ is employed.  
-In both cases, the addition of the noise seems to partially affect the
-deconvolution. When the FFT method is applied, it adds little spikes nearly
-everywhere on the curve and it is particularly evident on the edges, where the
-expected data are very small. On the other hand, the Richardson-Lucy routine is
-less affected by this further complication.
+The original signal is shown below for convenience.
+
+![Example of an intensity histogram.](images/fraun-original.pdf){#fig:original}

 <div id="fig:results1">
 ![Convolved signal.](images/fraun-conv-0.05.pdf){width=12cm}
@ -447,6 +440,17 @@ width.
 Results for $\sigma = \Delta \theta$, where $\Delta \theta$ is the bin width.
 </div>

+The possibility of adding Poisson noise to the convolved histogram was also
+implemented, to check whether the deconvolution is affected by this kind of
+interference. The case with $\sigma = \Delta \theta$ was taken as an example.
+In @fig:poisson the results are shown for both methods when Poisson noise
+with mean $\mu = 50$ is employed.  
+In both cases, the addition of the noise seems to partially affect the
+deconvolution. When the FFT method is applied, it adds little spikes nearly
+everywhere on the curve, and they are particularly evident at the edges, where
+the expected data are very small. On the other hand, the Richardson-Lucy
+routine is less affected by this further complication.
+
 <div id="fig:poisson">
 ![Deconvolved signal with FFT.](images/fraun-noise-fft.pdf){width=12cm}

@ -455,8 +459,8 @@ Results for $\sigma = \Delta \theta$, where $\Delta \theta$ is the bin width.
 Results for $\sigma = \Delta \theta$, with Poisson noise.
 </div>

-In order to quantify the similarity of the deconvolution outcome with the
-original signal, a null hypotesis test was made up.  
+In order to quantify the similarity of a deconvolution outcome with the original
+signal, a null hypothesis test was devised.  
 As in @sec:Landau, the original sample was treated as a population from
 which other samples of the same size were sampled with replacement. For each
 new sample, the earth mover's distance with respect to the original signal was
@ -469,13 +473,91 @@ a region and the EMD is the minimum cost of turning one pile into the other,
 where the cost is the amount of dirt moved times the distance by which it is
 moved. It is valid only if the two distributions have the same integral, that
 is if the two piles have the same amount of dirt.  
-Computing the EMD is based on a solution of transportation problem. 
+Computing the EMD is based on a solution to the well-known transportation
+problem, which can be formalized as follows.

-\textcolor{red}{earth mover's distance}
+Consider two vectors:

-In this case, where the EMD must be applied to two histograms, the procedure
-simplifies a lot boiling down to the difference of the comulative functions of
-the two histograms.
+$$
+  P = \{ (p_1, w_{p1}) \dots (p_m, w_{pm}) \} \et
+  Q = \{ (q_1, w_{q1}) \dots (q_n, w_{qn}) \}
+$$
+
+where $p_i$ and $q_j$ are the 'values' and $w_{pi}$ and $w_{qj}$ are their
+weights. The entries $d_{ij}$ of the ground distance matrix $D$ are
+defined as the distances between $p_i$ and $q_j$.  
+The aim is to find the flow $F = \{f_{ij}\}$, where $f_{ij}$ is the flow
+between $p_i$ and $q_j$ (which would be the quantity of moved dirt), which
+minimizes the cost $W$:
+
+$$
+  W (P, Q, F) = \sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}
+$$
+
+with the constraints:
+
+\begin{align*}
+  &f_{ij} \ge 0 \hspace{15pt}       &1 \le i \le m \wedge 1 \le j \le n \\
+  &\sum_{j = 1}^n f_{ij} \le w_{pi} &1 \le i \le m                      \\
+  &\sum_{i = 1}^m f_{ij} \le w_{qj} &1 \le j \le n
+\end{align*}
+$$
+  \sum_{i = 1}^m \sum_{j = 1}^n f_{ij}
+  = \text{min} \left( \sum_{i = 1}^m w_{pi}, \sum_{j = 1}^n w_{qj} \right)
+$$
+
+The first constraint allows 'dirt' to be moved from $P$ to $Q$ and not vice
+versa. The next two constraints limit the amount of supplies that the values
+in $P$ can send to their weights, and prevent the values in $Q$ from receiving
+more supplies than their weights; the last constraint forces the maximum
+possible amount of supplies to be moved. The total moved amount is the total
+flow. Once the transportation problem is solved and the optimal flow is found,
+the earth mover's distance $D$ is defined as the work normalized by the total flow:
+
+$$
+  D (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}}
+                  {\sum_{i = 1}^m \sum_{j=1}^n f_{ij}}
+$$
+
+In this case, where the EMD must be applied to two same-length histograms, the
+procedure simplifies considerably. By representing both histograms with two
+vectors $u$ and $v$, the equation above boils down to [@ramdas17]:
+
+$$
+  D (u, v) = \sum_i |U_i - V_i|
+$$
+
+where the sum runs over the entries of the vectors $U$ and $V$, which are the
+cumulative vectors of the histograms.  
+In the code, the following equivalent recursive routine was implemented.
+
+$$
+  D (u, v) = \sum_i |D_i| \with
+  \begin{cases}
+    D_i = v_i - u_i + D_{i-1} \\ 
+    D_0 = 0
+  \end{cases}
+$$
+
+In fact:
+
+\begin{align*}
+  D (u, v) &= \sum_i |D_i| = |D_0| + |D_1| + |D_2| + |D_3| + \dots \\
+           &= 0 + |v_1 - u_1 + D_0| + 
+                  |v_2 - u_2 + D_1| + 
+                  |v_3 - u_3 + D_2| + \dots                        \\
+           &= |v_1 - u_1| + 
+              |v_1 - u_1 + v_2 - u_2| + 
+              |v_1 - u_1 + v_2 - u_2 + v_3 - u_3| + \dots          \\
+           &= |v_1 - u_1| + 
+              |v_1 + v_2 - (u_1 + u_2)| + 
+              |v_1 + v_2 + v_3 - (u_1 + u_2 + u_3)| + \dots        \\
+           &= |V_1 - U_1| + |V_2 - U_2| + |V_3 - U_3| + \dots      \\
+           &= \sum_i |U_i - V_i|
+\end{align*}
+
+
+\textcolor{red}{EMD}

 These distances were used to build their empirical cumulative distribution.
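
The equivalence between the cumulative form and the recursive routine derived in the added text can be checked with a minimal sketch. This is not the code from the commit: the function names `emd_cumulative` and `emd_recursive` are hypothetical, and the sketch assumes two histograms of the same length, with the same total content and unit bin width.

```python
from itertools import accumulate


def emd_cumulative(u, v):
    """EMD as the sum of |U_i - V_i|, with U and V the cumulative vectors."""
    if len(u) != len(v):
        raise ValueError("histograms must have the same length")
    U = accumulate(u)  # running sums of u
    V = accumulate(v)  # running sums of v
    return sum(abs(a - b) for a, b in zip(U, V))


def emd_recursive(u, v):
    """Same quantity via D_i = v_i - u_i + D_{i-1}, starting from D_0 = 0."""
    if len(u) != len(v):
        raise ValueError("histograms must have the same length")
    total = 0.0
    d = 0.0  # D_0
    for u_i, v_i in zip(u, v):
        d = v_i - u_i + d  # D_i
        total += abs(d)
    return total
```

For instance, `emd_recursive([1, 0, 0], [0, 0, 1])` gives 2: one unit of 'dirt' is moved across two bins. The recursive form needs a single pass and no storage for the cumulative vectors, which is a plausible reason to prefer it in an implementation.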