# Exercise 5 The following integral is to be evaluated comparing different Monte Carlo techniques. \begin{figure} \hypertarget{fig:exp}{% \centering \begin{tikzpicture} \definecolor{cyclamen}{RGB}{146, 24, 43} % Integral \filldraw [cyclamen!15!white, domain=0:5, variable=\x] (0,0) -- plot({\x},{exp(\x/5)}) -- (5,0) -- cycle; \draw [cyclamen] (5,0) -- (5,2.7182818); \node [below] at (5,0) {1}; % Axis \draw [thick, <-] (0,4) -- (0,0); \draw [thick, ->] (-2,0) -- (7,0); \node [below right] at (7,0) {$x$}; \node [above left] at (0,4) {$e^{x}$}; % Plot \draw [domain=-2:7, smooth, variable=\x, cyclamen, ultra thick] plot ({\x},{exp(\x/5)}); % Equation \node [above] at (2.5, 2.5) {$I = \int\limits_0^1 dx \, e^x$}; \end{tikzpicture} \caption{Plot of the integral to be evaluated.} } \end{figure} whose exact value is 1.7182818285... The three most popular Monte Carlo (MC) methods where applied: plain MC, Miser and Vegas. Besides this popularity fact, these three method were chosen for being implemented in the GSL libraries `gsl_monte_plain`, `gsl_monte_miser` and `gsl_monte_vegas` respectively. ## Plain Monte Carlo When an integral $I$ over a $n-$dimensional space $\Omega$ of volume $V$ of a function $f$ has to be evaluated, that is: $$ I = \int\limits_{\Omega} dx \, f(x) \with V = \int\limits_{\Omega} dx $$ the simplest MC method approach is to sample $N$ points $x_i$ evenly distributed in $V$ and approx $I$ as: $$ I \sim I_N = \frac{V}{N} \sum_{i=1}^N f(x_i) = V \cdot \langle f \rangle $$ with $I_N \rightarrow I$ for $N \rightarrow + \infty$ because of the law of large numbers, whereas the sample variance can be estimated as: $$ \sigma^2_f = \frac{1}{N - 1} \sum_{i = 1}^N \left( f(x_i) - \langle f \rangle \right)^2 \et \sigma^2_I = \frac{V^2}{N^2} \sum_{i = 1}^N \sigma^2_f = \frac{V^2}{N} \sigma^2_f $$ Thus, the error decreases as $1/\sqrt{N}$. Unlike in deterministic methods, the estimate of the error is not a strict error bound: random sampling may not uncover all the important features of the integrand and this can result in an underestimate of the error. In this case $f(x) = e^{x}$ and $\Omega = [0,1]$, hence $V = 1$. Since the proximity of $I_N$ to $I$ is related to $N$, the accuracy of the method lies in how many points are generated, namely how many function calls are executed when the iterative method is implemented. In @fig:MC and @fig:MI, results obtained with the plain MC method are shown in red. In @tbl:MC, some of them are listed: the estimated integrals $I^{\text{oss}}$ are compared to the expected value $I$ and the differences diff between them are given. ![Estimated values of $I$ obatined by Plain MC technique with different number of function calls; logarithmic scale; errorbars showing their estimated uncertainties. As can be seen, the process does a sort o seesaw around the correct value.](images/5-MC_MC.pdf){#fig:MC} --------------------------------------------------------------------------- calls $I^{\text{oss}}$ $\sigma$ diff ------------------ ------------------ ------------------ ------------------ \num{5e5} 1.7166435813 0.0006955691 0.0016382472 \num{5e6} 1.7181231109 0.0002200309 0.0001587176 \num{5e7} 1.7183387184 0.0000695809 0.0000568899 --------------------------------------------------------------------------- Table: Some MC results with three different numbers of function calls. Differences between computed and exact values are given in diff. {#tbl:MC} As can be seen, $\sigma$ is always of the same order of magnitude of diff, except for very low numbers of function calls. Even with \num{5e7} calls, $I^{\text{oss}}$ still differs from $I$ at the fifth decimal digit, meaning that this method shows a really slow convergence. In fact, since the $\sigma$s dependence on the number $C$ of function calls is confirmed: $$ \begin{cases} \sigma_1 = \num{6.95569e-4} \longleftrightarrow C_1 = \num{5e6} \\ \sigma_2 = \num{6.95809e-5} \longleftrightarrow C_1 = \num{5e8} \end{cases} $$ $$ \thus \frac{\sigma_1}{\sigma_2} = 9.9965508 \sim 10 = \sqrt{\frac{C_2}{C_1}} $$ if an error of $\sim 1^{-n}$ is required, a number $\propto 10^{2n}$ of function calls should be executed, meaning that for $\sigma \sim 1^{-10} \rightarrow C = \num{1e20}$, which would be impractical. ## Stratified sampling In statistics, stratified sampling is a method of sampling from a population partitioned into subpopulations. Stratification, indeed, is the process of dividing the primary sample into subgroups (strata) before sampling random within each stratum. Given the mean $\bar{x}_i$ and variance ${\sigma^2_x}_i$ of an entity $x$ sorted with simple random sampling in the $i^{\text{th}}$ strata, such as: $$ \bar{x}_i = \frac{1}{n_i} \sum_j x_j $$ and: $$ \sigma_i^2 = \frac{1}{n_i - 1} \sum_j \left( x_j - \bar{x}_i \right)^2 \thus {\sigma^2_x}_i = \frac{1}{n_i^2} \sum_j \sigma_i^2 = \frac{\sigma_i^2}{n_i} $$ where: - $j$ runs over the points $x_j$ sampled in the $i^{\text{th}}$ stratum, - $n_i$ is the number of points sorted in it, - $\sigma_i^2$ is the variance associated with the $j^{\text{th}}$ point. then, the mean $\bar{x}$ and variance $\sigma_x^2$ estimated with stratified sampling for the whole population are: $$ \bar{x} = \frac{1}{N} \sum_i N_i \bar{x}_i \et \sigma_x^2 = \sum_i \left( \frac{N_i}{N} \right)^2 {\sigma_x}^2_i = \sum_i \left( \frac{N_i}{N} \right)^2 \frac{\sigma^2_i}{n_i} $$ where $i$ runs over the strata, $N_i$ is the weight of the $i^{\text{th}}$ stratum and $N$ is the sum of all strata weights. In practical terms, it can produce a weighted mean that has less variability than the arithmetic mean of a simple random sample of the whole population. In fact, if measurements within strata have lower standard deviation, the final result will have a smaller error in estimation with respect to the one otherwise obtained with simple sampling. For this reason, stratified sampling is used as a method of variance reduction when MC methods are used to estimate population statistics from a known population [@ridder17]. ### MISER The MISER technique aims to reduce the integration error through the use of recursive stratified sampling. As stated before, according to the law of large numbers, for a large number of extracted points, the estimation of the integral $I$ can be computed as: $$ I= V \cdot \langle f \rangle $$ Since $V$ is known (in this case, $V = 1$), it is sufficient to estimate $\langle f \rangle$. Consider two disjoint regions $a$ and $b$, such that $a \cup b = \Omega$, in which $n_a$ and $n_b$ points are respectively uniformly sampled. Given the Monte Carlo estimates of the means $\langle f \rangle_a$ and $\langle f \rangle_b$ of those points and their variances $\sigma_a^2$ and $\sigma_b^2$, if the weights $N_a$ and $N_b$ of $\langle f \rangle_a$ and $\langle f \rangle_b$ are chosen unitary, then the variance $\sigma^2$ of the combined estimate $\langle f \rangle$: $$ \langle f \rangle = \frac{1}{2} \left( \langle f \rangle_a + \langle f \rangle_b \right) $$ is given by: $$ \sigma^2 = \frac{\sigma_a^2}{4n_a} + \frac{\sigma_b^2}{4n_b} $$ It can be shown that this variance is minimized by distributing the points such that: $$ \frac{n_a}{n_a + n_b} = \frac{\sigma_a}{\sigma_a + \sigma_b} $$ Hence, the smallest error estimate is obtained by allocating sample points in proportion to the standard deviation of the function in each sub-region. The whole integral estimate and its variance are therefore given by: $$ I = V \cdot \langle f \rangle \et \sigma_I^2 = V^2 \cdot \sigma^2 $$ When implemented, MISER is in fact a recursive method. First, all the possible bisections of $\Omega$ are tested and the one which minimizes the combined variance of the two sub-regions is selected. In order to speed up the algorithm, the variance in the sub-regions is estimated with a fraction of the total number of available points (function calls) which is default set to 0.1. The remaining points are allocated to the sub-regions using the formula for $n_a$ and $n_b$, once the variances are computed. The same procedure is then repeated recursively for each of the two half-spaces from the best bisection. When less than 32*16 (default option) function calls are available in a single section, the integral and the error in this section are estimated using the plain Monte Carlo algorithm. The final individual values and their error estimates are then combined upwards to give an overall result and an estimate of its error [@sayah19]. ![Estimations $I^{\text{oss}}$ of the integral $I$ obtained for the three implemented method for different values of function calls. Errorbars showing their estimated uncertainties.](images/5-MC_MC_MI.pdf){#fig:MI} Results for this particular sample are shown in black in @fig:MI and some of them are listed in @tbl:MI. Except for the first very little number of calls, the improvement with respect to the Plain MC technique (in red) is appreciable. --------------------------------------------------------------------------- calls $I^{\text{oss}}$ $\sigma$ diff ------------------ ------------------ ------------------ ------------------ \num{5e5} 1.7182850738 0.0000021829 0.0000032453 \num{5e6} 1.7182819143 0.0000001024 0.0000000858 \num{5e7} 1.7182818221 0.0000000049 0.0000000064 --------------------------------------------------------------------------- Table: MISER results with different numbers of function calls. Differences between computed and exact values are given in diff. {#tbl:MI} This time the convergence is much faster: already with 500'000 number of points, the correct results differs from the computed one at the fifth decimal digit. Once again, the error lies always in the same order of magnitude of diff. ## Importance sampling In statistics, importance sampling is a method which samples points from the probability distribution $f$ itself, so that the points cluster in the regions that make the largest contribution to the integral. Remind that $I = V \cdot \langle f \rangle$ and therefore only $\langle f \rangle$ is to be estimated. Consider a sample of $n$ points {$x_i$} generated according to a probability distribution function $P$ which gives thereby the following expected value: $$ E [x, P] = \frac{1}{n} \sum_i x_i $$ with variance: $$ \sigma^2 [E, P] = \frac{\sigma^2 [x, P]}{n} \with \sigma^2 [x, P] = \frac{1}{n -1} \sum_i \left( x_i - E [x, P] \right)^2 $$ where $i$ runs over the sample. In the case of plain MC, $\langle f \rangle$ is estimated as the expected value of points {$f(x_i)$} sorted with $P (x_i) = 1 \quad \forall i$, since they are evenly distributed in $\Omega$. The idea is to sample points from a different distribution to lower the variance of $E[x, P]$, which results in lowering $\sigma^2 [x, P]$. This is accomplished by choosing a random variable $y$ and defining a new probability $P^{(y)}$ in order to satisfy: $$ E [x, P] = E \left[ \frac{x}{y}, P^{(y)} \right] $$ which is to say: $$ I = \int \limits_{\Omega} dx f(x) = \int \limits_{\Omega} dx \, \frac{f(x)}{g(x)} \, g(x)= \int \limits_{\Omega} dx \, w(x) \, g(x) $$ where $E \, \longleftrightarrow \, I$ and: $$ \begin{cases} f(x) \, \longleftrightarrow \, x \\ 1 \, \longleftrightarrow \, P \end{cases} \et \begin{cases} g(x) \, \longleftrightarrow \, y = P^{(y)} \\ w(x) = \frac{f(x)}{g(x)} \, \longleftrightarrow \, \frac{x}{y} \end{cases} $$ Where the symbol $\longleftrightarrow$ points out the connection between the variables. The best variable $y$ would be: $$ y = \frac{x}{E [x, P]} \, \longleftrightarrow \, \frac{f(x)}{I} \thus \frac{x}{y} = E [x, P] $$ and even a single sample under $P^{(y)}$ would be sufficient to give its value: it can be shown that under this probability distribution the variance vanishes. Obviously, it is not possible to take exactly this choice, since $E [x, P]$ is not given a priori. However, this gives an insight into what importance sampling does. In fact, given that: $$ E [x, P] = \int \limits_{a = - \infty}^{a = + \infty} da \, a P(x \in [a, a + da]) $$ the best probability $P^{(y)}$ redistributes the law of $x$ so that its samples frequencies are sorted directly according to their weights in $E[x, P]$, namely: $$ P^{(y}(x \in [a, a + da]) = \frac{1}{E [x, P]} a P (x \in [a, a + da]) $$ In conclusion, since certain values of $x$ have more impact on $E [x, P]$ than others, these "important" values must be emphasized by sampling them more frequently. As a consequence, the estimator variance will be reduced. ### VEGAS The VEGAS algorithm of Lepage is based on importance sampling. It aims to reduce the integration error by concentrating points in the regions that make the largest contribution to the integral. As stated before, in practice it is impossible to sample points from the best distribution $P^{(y)}$: only a good approximation can be achieved. In GSL, the VEGAS algorithm approximates the distribution by histogramming the function $f$ in different subregions with a iterative method [@lepage78], namely: - a fixed number of points (function calls) is evenly generated in the whole region; - the volume $V$ is divided into $N$ intervals with width $\Delta x_i = \Delta x \, \forall \, i$, where $N$ is limited by the computer storage space available and must be held constant from iteration to iteration. Default $N = 50$; - each interval is then divided into $m_i + 1$ subintervals, where: $$ m_i = K \frac{\bar{f}_i \Delta x_i}{\sum_j \bar{f}_j \Delta x_j} $$ where $j$ runs over all the intervals and $\bar{f}_i$ is the average value of $f$ in the interval. Hence, $m_i$ is therefore a measure of the "importance" of the interval with respect to the others: the higher $\bar{f}_i$, the higher $m_i$. The stiffness $K$ is default set to 1.5; - as it is desirable to restore the number of intervals to its original value $N$, groups of the new intervals must be amalgamated into larger intervals, the number of subintervals in each group being constant. The net effect is to alter the intervals sizes, while keeping the total number constant, so that the smallest intervals occur where $f$ is largest; - the new grid is used and further refined in subsequent iterations until the optimal grid has been obtained. By default, the number of iterations is set to 5. At the end, a cumulative estimate of the integral $I$ and its error $\sigma_I$ are made based on weighted average: $$ I = \sigma_I^2 \sum_i \frac{I_i}{\sigma_i^2} \with \sigma_I^2 = \left( \sum_i \frac{1}{\sigma_i^2} \right)^{-1} $$ where $I_i$ and $\sigma_i$ are are the integral and standard deviation estimated in each interval $\Delta x_i$ of the last iteration using the plain MC technique. For the results to be reliable, the chi-squared per degree of freedom $\chi_r^2$ must be consistent with 1. While performing the eiterations, if the $\chi_r^2$ value exceed 1.5, the cycle breaks. Clearly, once again a better estimation is achieved with a greater number of function calls. For this particular sample, the most accurate results are shown in @fig:MI_VE and some of them are listed in @tbl:VEGAS. ---------------------------------------------------------------------------------------------- calls $I^{\text{oss}}$ $\sigma$ diff $\chi_r^2$ ------------------ ------------------ ------------------ ------------------ ------------------ \num{5e5} 1.7182818281 0.0000000012 0.0000000004 1.457 \num{5e6} 1.7182818284 0.0000000000 0.0000000001 0.632 \num{5e7} 1.7182818285 0.0000000000 0.0000000000 0.884 ---------------------------------------------------------------------------------------------- Table: Some VEGAS results with different numbers of function calls. {#tbl:VEGAS} As can be appreciated in @fig:MI_VE, the VEGAS algorithm manages to compute the integral value in a most accurate way with respect to MISER. The $\chi_r^2$ turns out to be close enough to 1 to guarantee a good estimation of $I$, goodness which is also confirmed by the very small difference between estimation and exact value, as shown in @tbl:VEGAS: with a number of \num{5e7} of function calls, the difference is smaller than \num{1e-10}. ![Only the most accurate results are shown in order to stress the differences between VEGAS (in gray) and MISER (in black) methods results.](images/5-MC_MI_VE.pdf){#fig:MI_VE} In conclusion, between a plain Monte Carlo technique, stratified sampling and importance sampling, the last one turned out to be the most powerful mean to obtain a good estimation of the integrand.