ex-6: written EMD description and implementation

2020-05-03 00:06:02 +02:00 · 2020-05-03 00:06:02 +02:00 · aa36fca43e
commit aa36fca43e
parent df5f5f9ac6
1 changed files with 99 additions and 17 deletions
--- a/notes/sections/6.md
+++ b/notes/sections/6.md
@ -404,16 +404,9 @@ deconvolved signal is always the same 'distance' far form the convolved one:
 if it is very smooth, the deconvolved signal is very smooth too, and if the
 convolved signal is less smooth, the deconvolved one is less smooth too.
-It was also implemented the possibility to add a Poisson noise to the
+The original signal is shown below for convenience.
-convolved histogram to check weather the deconvolution is affected or not by
+
-this kind of interference. It was took as an example the case with $\sigma =
+![Example of an intensity histogram.](images/fraun-original.pdf){#fig:original}
 \Delta \theta$. In @fig:poisson the results are shown for both methods when a
 Poisson noise with mean $\mu = 50$ is employed.  
 In both cases, the addition of the noise seems to partially affect the
 deconvolution. When the FFT method is applied, it adds little spikes nearly
 everywhere on the curve and it is particularly evident on the edges, where the
 expected data are very small. On the other hand, the Richardson-Lucy routine is
 less affected by this further complication.
 <div id="fig:results1">
 ![Convolved signal.](images/fraun-conv-0.05.pdf){width=12cm}
@ -447,6 +440,17 @@ width.
 Results for $\sigma = \Delta \theta$, where $\Delta \theta$ is the bin width.
 </div>
 The possibility of adding Poisson noise to the convolved histogram was also
 implemented, to check whether the deconvolution is affected by this kind of
 interference. The case with $\sigma = \Delta \theta$ was taken as an example.
 In @fig:poisson the results are shown for both methods when
 Poisson noise with mean $\mu = 50$ is employed.  
 In both cases, the addition of the noise seems to partially affect the
 deconvolution. When the FFT method is applied, it adds little spikes nearly
 everywhere on the curve and it is particularly evident on the edges, where the
 expected data are very small. On the other hand, the Richardson-Lucy routine is
 less affected by this further complication.
 <div id="fig:poisson">
 ![Deconvolved signal with FFT.](images/fraun-noise-fft.pdf){width=12cm}
@ -455,8 +459,8 @@ Results for $\sigma = \Delta \theta$, where $\Delta \theta$ is the bin width.
 Results for $\sigma = \Delta \theta$, with Poisson noise.
 </div>
-In order to quantify the similarity of the deconvolution outcome with the
+In order to quantify the similarity of a deconvolution outcome with the original
-original signal, a null hypotesis test was made up.  
+signal, a null hypothesis test was devised.  
 As in @sec:Landau, the original sample was treated as a population from
 which other samples of the same size were drawn with replacement. For each
 new sample, the earth mover's distance with respect to the original signal was
@ -469,13 +473,91 @@ a region and the EMD is the minimum cost of turning one pile into the other,
 where the cost is the amount of dirt moved times the distance by which it is
 moved. It is valid only if the two distributions have the same integral, that
 is if the two piles have the same amount of dirt.  
-Computing the EMD is based on a solution of transportation problem. 
+Computing the EMD is based on a solution to the well-known transportation
 problem, which can be formalized as follows.
-\textcolor{red}{earth mover's distance}
+Consider two vectors:
-In this case, where the EMD must be applied to two histograms, the procedure
+$$
-simplifies a lot boiling down to the difference of the comulative functions of
+  P = \{ (p_1, w_{p1}) \dots (p_m, w_{pm}) \} \et
-the two histograms.
+  Q = \{ (q_1, w_{q1}) \dots (q_n, w_{qn}) \}
 $$
 where $p_i$ and $q_j$ are the 'values' and $w_{pi}$ and $w_{qj}$ are their
 weights. The entries $d_{ij}$ of the ground distance matrix $D_{ij}$ are
 defined as the distances between $p_i$ and $q_j$.  
 The aim is to find the flow $F =$ {$f_{ij}$}, where $f_{ij}$ is the flow
 between $p_i$ and $q_j$ (which would be the quantity of moved dirt), which
 minimizes the cost $W$:
 $$
  W (P, Q, F) = \sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}
 $$
 with the constraints:
 \begin{align*}
  &f_{ij} \ge 0 \hspace{15pt}       &1 \le i \le m \wedge 1 \le j \le n \\
  &\sum_{j = 1}^n f_{ij} \le w_{pi} &1 \le i \le m                      \\
   &\sum_{i = 1}^m f_{ij} \le w_{qj} &1 \le j \le n
 \end{align*}
 $$
   \sum_{i = 1}^m \sum_{j = 1}^n f_{ij}
  = \text{min} \left( \sum_{i = 1}^m w_{pi}, \sum_{j = 1}^n w_{qj} \right)
 $$
 The first constraint allows moving 'dirt' from $P$ to $Q$ and not vice versa.
 The next two constraints limit the amount of supplies that the values in $P$
 can send to their weights, and the amount that the values in $Q$ can receive
 to their weights; the last constraint forces the maximum possible amount of
 supplies to be moved. The total moved amount is the total flow. Once the
 transportation problem is solved and the optimal flow is found, the earth
 mover's distance $D$ is defined as the work normalized by the total flow: 
 $$
  D (P, Q) = \frac{\sum_{i = 1}^m \sum_{j = 1}^n f_{ij} d_{ij}}
                  {\sum_{i = 1}^m \sum_{j=1}^n f_{ij}}
 $$
 In this case, where the EMD must be applied to two same-length histograms, the
 procedure simplifies considerably. By representing the two histograms as
 vectors $u$ and $v$, the equation above boils down to [@ramdas17]:
 $$
  D (u, v) = \sum_i |U_i - V_i|
 $$
 where the sum runs over the entries of the vectors $U$ and $V$, which are the
 cumulative vectors of the histograms.  
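As a quick check of the cumulative formula, consider two hypothetical three-bin
histograms (values chosen only for illustration, not taken from the data) with
the same total content:
$$
  u = (1, 2, 1) \hspace{15pt} v = (2, 1, 1)
  \hspace{15pt} \Longrightarrow \hspace{15pt}
  U = (1, 3, 4) \hspace{15pt} V = (2, 3, 4)
$$
$$
  D (u, v) = |1 - 2| + |3 - 3| + |4 - 4| = 1
$$
which matches the intuition: one unit of 'dirt' has to be moved by one bin.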
 In the code, the following equivalent recursive routine was implemented.
 $$
  D (u, v) = \sum_i |D_i| \with
  \begin{cases}
    D_i = v_i - u_i + D_{i-1} \\ 
    D_0 = 0
  \end{cases}
 $$
 In fact:
 \begin{align*}
  D (u, v) &= \sum_i |D_i| = |D_0| + |D_1| + |D_2| + |D_3| + \dots \\
           &= 0 + |v_1 - u_1 + D_0| + 
                  |v_2 - u_2 + D_1| + 
                  |v_3 - u_3 + D_2| + \dots                        \\
           &= |v_1 - u_1| + 
              |v_1 - u_1 + v_2 - u_2| + 
              |v_1 - u_1 + v_2 - u_2 + v_3 - u_3| + \dots          \\
            &= |v_1 - u_1| + 
               |v_1 + v_2 - (u_1 + u_2)| + 
               |v_1 + v_2 + v_3 - (u_1 + u_2 + u_3)| + \dots        \\
           &= |V_1 - U_1| + |V_2 - U_2| + |V_3 - U_3| + \dots      \\
           &= \sum_i |U_i - V_i|
 \end{align*}
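The equivalence derived above can also be checked numerically. The following
is a minimal Python sketch, not the routine used in the exercise: the function
names `emd_cumulative` and `emd_recursive` and the sample histograms are
assumptions made only for illustration. Both functions assume histograms with
the same number of bins and the same total content, and return the distance in
units of the bin width.

```python
from itertools import accumulate


def emd_cumulative(u, v):
    """EMD as the sum of |U_i - V_i| over the cumulative vectors U and V."""
    U = list(accumulate(u))
    V = list(accumulate(v))
    return sum(abs(a - b) for a, b in zip(U, V))


def emd_recursive(u, v):
    """Same quantity via D_i = v_i - u_i + D_{i-1}, starting from D_0 = 0."""
    d = 0.0      # running value of D_{i-1}
    total = 0.0  # accumulates sum_i |D_i|
    for ui, vi in zip(u, v):
        d = vi - ui + d
        total += abs(d)
    return total


# Hypothetical histograms with equal total content.
u = [1, 2, 1]
v = [2, 1, 1]
print(emd_cumulative(u, v), emd_recursive(u, v))
```

The recursive form needs a single pass and no temporary cumulative vectors,
which is presumably why it is convenient in practice.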
 \textcolor{red}{EMD}
 These distances were used to build their empirical cumulative distribution.