From 01a5f06cc5c2a659b54309f5edf2d3486d33c2bb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Gi=C3=B9=20Marcer?=
Date: Sun, 15 Mar 2020 21:42:41 +0100
Subject: [PATCH] ex-5: complete importance sampling

---
 notes/sections/5.md | 114 +++++++++++++++++++++++---------------------
 1 file changed, 59 insertions(+), 55 deletions(-)

diff --git a/notes/sections/5.md b/notes/sections/5.md
index be503c3..dcbe8c4 100644
--- a/notes/sections/5.md
+++ b/notes/sections/5.md
@@ -108,8 +108,7 @@
 $$
 
 $$
-  \sigma_i^2 = \frac{1}{n_i - 1} \sum_j \left( \frac{x_j - \bar{x}_i}{n_i}
-  \right)^2
+  \sigma_i^2 = \frac{1}{n_i - 1} \sum_j \left( x_j - \bar{x}_i \right)^2
   \thus {\sigma^2_x}_i = \frac{1}{n_i^2} \sum_j \sigma_i^2 = \frac{\sigma_i^2}{n_i}
 $$
 
@@ -225,11 +224,14 @@ diff, seems to seesaw around the correct value.
 
 ## Importance sampling
 
-In statistics, importance sampling is a technique for estimating properties of
-a given distribution, while only having samples generated from a different
-distribution than the distribution of interest.
-Consider a sample of $n$ points {$x_i$} generated according to a probability
-distribition function $P$ which gives thereby the following expected value:
+In statistics, importance sampling is a method in which points are sampled
+from a distribution shaped like the integrand $f$ itself, so that they
+cluster in the regions that make the largest contribution to the integral.
+
+Recall that $I = V \cdot \langle f \rangle$, hence only $\langle f \rangle$
+must be estimated. Consider, then, a sample of $n$ points {$x_i$} generated
+according to a probability distribution function $P$, which gives the
+following expected value:
 
 $$
   E [x, P] = \frac{1}{n} \sum_i x_i
@@ -238,21 +240,46 @@ $$
 
 with variance:
 
 $$
-  \sigma^2 [E, P] = \frac{\sigma^2 [x, P]}{n}
+  \sigma^2 [E, P] = \frac{\sigma^2 [x, P]}{n}
+  \with \sigma^2 [x, P] = \frac{1}{n - 1} \sum_i \left( x_i - E [x, P] \right)^2
 $$
 
-where $i$ runs over the sample and $\sigma^2 [x, P]$ is the variance of the
-sorted points.
-The idea is to sample them from a different distribution to lower the variance
-of $E[x, P]$. This is accomplished by choosing a random variable $y \geq 0$ such
-that $E[y ,P] = 1$. Then, a new probability $P^{(y)}$ is defined in order to
-satisfy:
+where $i$ runs over the sample.
+In plain MC, $\langle f \rangle$ is estimated as the expected value of the
+points {$f(x_i)$}, with the $x_i$ drawn uniformly in $\Omega$ (constant $P$).
+The idea is to sample the points from a different distribution in order to
+lower the variance of $E[x, P]$, which amounts to lowering $\sigma^2 [x, P]$.
+This is accomplished by choosing a random variable $y$ and defining a new
+probability $P^{(y)}$ which satisfies:
 $$
   E [x, P] = E \left[ \frac{x}{y}, P^{(y)} \right]
 $$
-This new estimate is better then former one if:
+which is to say:
+
+$$
+  I = \int \limits_{\Omega} dx \, f(x) =
+      \int \limits_{\Omega} dx \, \frac{f(x)}{g(x)} \, g(x) =
+      \int \limits_{\Omega} dx \, w(x) \, g(x)
+$$
+
+where $E \, \longleftrightarrow \, I$ and:
+
+$$
+  \begin{cases}
+    f(x) \, \longleftrightarrow \, x \\
+    1 \, \longleftrightarrow \, P
+  \end{cases}
+  \et
+  \begin{cases}
+    w(x) \, \longleftrightarrow \, \frac{x}{y} \\
+    g(x) \, \longleftrightarrow \, y = P^{(y)}
+  \end{cases}
+$$
+
+Here the symbol $\longleftrightarrow$ marks the correspondence between the
+variables.
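+
+As a rough numerical check of this rearrangement, the sketch below compares a
+plain MC estimate of $I$ with the importance-sampled one obtained by averaging
+the weights $w(x) = f(x)/g(x)$. The integrand $f(x) = e^x$ on $[0, 1]$, the
+sampling density $g(x) = 2(1 + x)/3$ and the use of the GSL random number
+generator are assumptions made only for this illustration and are not taken
+from the exercise:
+
+```c
+/* Compile with: gcc importance.c -lgsl -lgslcblas -lm */
+#include <math.h>
+#include <stdio.h>
+#include <gsl/gsl_rng.h>
+
+int main(void) {
+  const size_t n = 100000;
+  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
+
+  double sp = 0, sp2 = 0;   /* plain MC: sums of f and f^2          */
+  double sw = 0, sw2 = 0;   /* importance sampling: sums of w, w^2  */
+
+  for (size_t i = 0; i < n; i++) {
+    /* Plain MC: x uniform in [0, 1], sample the integrand f(x) = e^x. */
+    double x  = gsl_rng_uniform(r);
+    double fx = exp(x);
+    sp  += fx;
+    sp2 += fx * fx;
+
+    /* Importance sampling: draw x from g(x) = 2(1 + x)/3, which roughly
+       follows f, by inverting its CDF G(x) = (2x + x^2)/3, and sample
+       the weight w(x) = f(x)/g(x) instead.                              */
+    double u  = gsl_rng_uniform(r);
+    double xg = sqrt(1.0 + 3.0 * u) - 1.0;
+    double w  = exp(xg) / (2.0 * (1.0 + xg) / 3.0);
+    sw  += w;
+    sw2 += w * w;
+  }
+
+  /* Both estimates converge to I = e - 1, but the importance-sampled
+     one comes with a visibly smaller standard error.                  */
+  double mp = sp / n, mw = sw / n;
+  printf("plain MC   : %.5f +- %.5f\n", mp, sqrt((sp2 / n - mp * mp) / n));
+  printf("importance : %.5f +- %.5f\n", mw, sqrt((sw2 / n - mw * mw) / n));
+  printf("exact      : %.5f\n", exp(1.0) - 1.0);
+
+  gsl_rng_free(r);
+  return 0;
+}
+```
+
+Since $g$ roughly follows the shape of $f$, the weights $w(x_i)$ fluctuate
+much less than the raw values $f(x_i)$: this is exactly the reduction of
+$\sigma^2 [x, P]$ that the change of probability is meant to achieve.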
+
+This new estimate is better than the former if:
 
 $$
   \sigma^2 \left[ \frac{x}{y}, P^{(y)} \right] < \sigma^2 [x, P]
 $$
@@ -261,55 +288,32 @@ The best variable $y$ would be:
 $$
-  y^{\star} = \frac{x}{E [x, P]} \thus \frac{x}{y^{\star}} = E [x, P]
+  y^{\star} = \frac{x}{E [x, P]} \, \longleftrightarrow \, \frac{f(x)}{I}
+  \thus \frac{x}{y^{\star}} = E [x, P]
 $$
 
-and a single sample under $P^{(y^{\star})}$ suffices to give its value.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+and even a single sample under $P^{(y^{\star})}$ would be sufficient to give
+its value. Obviously, this choice cannot be made in practice, since $E [x, P]$
+is not known a priori.
+However, it gives an insight into what importance sampling does. In fact,
+given that:
+
+$$
+  E [x, P] = \int \limits_{a = - \infty}^{a = + \infty}
+  a \, P(x \in [a, a + da])
+$$
+
+the best change of probability $P^{(y^{\star})}$ redistributes the law of $x$
+so that each value is sampled with a frequency proportional to its weight in
+$E [x, P]$, namely:
+
+$$
+  P^{(y^{\star})}(x \in [a, a + da]) =
+  \frac{1}{E [x, P]} \, a \, P (x \in [a, a + da])
+$$
 
 ---
 
-The logic underlying importance sampling lies in a simple rearrangement of terms
-in the integral to be computed:
-
-$$
-  I = \int \limits_{\Omega} dx f(x) =
-      \int \limits_{\Omega} dx \, \frac{f(x)}{g(x)} \, g(x)=
-      \int \limits_{\Omega} dx \, w(x) \, g(x)
-$$
-
-where $w(x)$ is called 'importance function': a good importance function will be
-large when the integrand is large and small otherwise.
-
----
-
-
-For example, in some of these points the function value is lower compared to
-others and therefore contributes less to the whole integral.
 
 ### VEGAS
 
 \textcolor{red}{WIP}
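+
+While this section is still a work in progress, the sketch below shows how a
+VEGAS estimate can be obtained from the GSL Monte Carlo routines; the
+integrand $f(x) = e^x$ on $[0, 1]$ and the number of calls are illustrative
+assumptions and not the actual choices made in the exercise:
+
+```c
+/* Compile with: gcc vegas-sketch.c -lgsl -lgslcblas -lm */
+#include <math.h>
+#include <stdio.h>
+#include <gsl/gsl_rng.h>
+#include <gsl/gsl_monte_vegas.h>
+
+/* Integrand written in the form required by gsl_monte_function. */
+double integrand(double *x, size_t dim, void *params) {
+  (void)dim; (void)params;
+  return exp(x[0]);
+}
+
+int main(void) {
+  double xl[1] = {0.0};   /* lower integration limit */
+  double xu[1] = {1.0};   /* upper integration limit */
+  double res, err;
+  gsl_monte_function F = {&integrand, 1, NULL};
+
+  gsl_rng_env_setup();
+  gsl_rng *r = gsl_rng_alloc(gsl_rng_default);
+  gsl_monte_vegas_state *s = gsl_monte_vegas_alloc(1);
+
+  /* VEGAS iteratively adapts its sampling grid to the integrand,
+     concentrating the points where the integrand is largest.      */
+  gsl_monte_vegas_integrate(&F, xl, xu, 1, 100000, r, s, &res, &err);
+  printf("VEGAS : %.5f +- %.5f\n", res, err);
+  printf("exact : %.5f\n", exp(1.0) - 1.0);
+
+  gsl_monte_vegas_free(s);
+  gsl_rng_free(r);
+  return 0;
+}
+```
+
+Here the adaptively refined grid plays the role of the importance function
+$g$ discussed above, built automatically from the sampled values of $f$.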