# Exercise 5 The following integral is to be evaluated comparing different Monte Carlo techniques. \vspace{20pt} $$ I = \int_0^1 \!\!\! dx \, e^x $$ \vspace{-70pt} \begin{figure} \hypertarget{fig:exp}{% \centering \begin{tikzpicture} % Integral \filldraw [cyclamen!15!white, domain=0:5, variable=\x] (0,0) -- plot({\x},{exp(\x/5)}) -- (5,0) -- cycle; \draw [cyclamen] (5,0) -- (5,2.7182818); \node [below] at (5,0) {1}; % Axis \draw [thick, <-] (0,4) -- (0,0); \draw [thick, ->] (-2,0) -- (7,0); \node [below right] at (7,0) {$x$}; \node [above left] at (0,4) {$e^{x}$}; % Plot \draw [domain=-2:7, smooth, variable=\x, cyclamen, ultra thick] plot ({\x},{exp(\x/5)}); \end{tikzpicture} \caption{Plot of the integral to be evaluated.} } \end{figure} whose exact value is 1.7182818285... The three most popular Monte Carlo (MC) methods where applied: plain MC, Miser and Vegas. Besides being commonly used, these were chosen for also being implemented in the GSL libraries `gsl_monte_plain`, `gsl_monte_miser` and `gsl_monte_vegas`. ## Plain Monte Carlo When an integral $I$ over a $n-$dimensional space $\Omega$ of volume $V$ of a function $f$ has to be evaluated, that is: $$ I = \int\limits_{\Omega} dx \, f(x) \with V = \int\limits_{\Omega} dx $$ the simplest MC method approach is to sample $N$ points $x_i$ in $V$ and approximate $I$ as: $$ I \approx I_N = \frac{V}{N} \sum_{i=1}^N f(x_i) = V \cdot \avg{f} $$ If $x_i$ are uniformly distributed, $I_N \rightarrow I$ for $N \rightarrow + \infty$ by the law of large numbers, whereas the integral variance $\sigma^2_I$ can be estimated as: $$ \sigma^2_f = \frac{1}{N - 1} \sum_{i = 1}^N \left( f(x_i) - \avg{f} \right)^2 \et \sigma^2_I = \frac{V^2}{N^2} \sum_{i = 1}^N \sigma^2_f = \frac{V^2}{N} \sigma^2_f $$ Thus, the error decreases as $1/\sqrt{N}$. Unlike in deterministic methods, the error estimate is not a strict bound: random sampling may not cover all the important features of the integrand and this can result in an underestimation of the error. In this case $f(x) = e^{x}$ and $\Omega = [0,1]$, hence $V = 1$. ![Estimated values of $I$ obatined by Plain MC technique with different number of function calls; logarithmic scale; errorbars showing their estimated uncertainties. As can be seen, the process does a sort o seesaw around the correct value.](images/5-MC_MC.pdf){#fig:plain-mc-iter} Since the distance from $I$ of $I_N$ is related to $N$, the accuracy of the method lies in how many points are generated, namely how many function calls are executed when the iterative method is implemented. In @fig:plain-mc-iter and @fig:miser-iter, results obtained with the plain MC method are shown in red. In @tbl:plain-mc-res, some of them are listed: the estimated integrals $I^{\text{oss}}$ are compared to the expected value $I$ and the differences between them are given. --------------------------------------------------------------------------- calls $I^{\text{oss}}$ $\sigma$ diff ------------------ ------------------ ------------------ ------------------ \num{5e5} 1.7166435813 0.0006955691 0.0016382472 \num{5e6} 1.7181231109 0.0002200309 0.0001587176 \num{5e7} 1.7183387184 0.0000695809 0.0000568899 --------------------------------------------------------------------------- Table: Some MC results with three different numbers of function calls. Differences between computed and exact values are given in diff. {#tbl:plain-mc-res} As can be seen, $\sigma$ is always of the same order of magnitude of diff, except for very low numbers of function calls. Even with \num{5e7} calls, $I^{\text{oss}}$ still differs from $I$ at the fifth decimal place, meaning that this method shows a really slow convergence. The $\sigma$ dependence on the number $C$ of function calls was checked with a least square minimization by modeling the data with the function: $$ \sigma = \frac{a}{x^b} $$ The obtained result (@fig:err_fit) confirms the expected result $b^{\text{exp}} = 0.5$, with an observed value of $\sim 0.499$. Given this dependence, for an error of $10^{-n}$, a number $\propto 10^{2n}$ of function calls is needed. To compute an integral within double precision, an impossibly large number of $\sigma \sim 10^{32}$ calls is needed, which makes this method unpractical for high-precision applications. ![Plain MC uncertainties estimations $\sigma$ as a function of the number of function calls $C$. Observed values in red, predicted dependence in gray.](images/5-fit.pdf){#fig:err_fit} ## Stratified sampling In statistics, stratified sampling is a method of sampling from a population partitioned into subpopulations. Stratification, indeed, is the process of dividing the primary sample into subgroups (strata) before sampling within each stratum. Given a sample $\{x_j\}_i$ of the $i$-th strata, its mean $\bar{x}_i$ and variance ${\sigma^2_x}_i$, are given by: $$ \bar{x}_i = \frac{1}{n_i} \sum_j x_j $$ and $$ \sigma_i^2 = \frac{1}{n_i - 1} \sum_j \left( x_j - \bar{x}_i \right)^2 \thus {\sigma^2_x}_i = \frac{1}{n_i^2} \sum_j \sigma_i^2 = \frac{\sigma_i^2}{n_i} $$ where: - $j$ runs over the points $x_j$ of the sample, - $n_i$ is the size of the sample, - $\sigma_i^2$ is the variance associated to every point of the $i$-th stratum. An estimation of the mean $\bar{x}$ and variance $\sigma_x^2$ for the whole population are then given by the stratified sampling as follows: $$ \bar{x} = \frac{1}{N} \sum_i N_i \bar{x}_i \et \sigma_x^2 = \sum_i \left( \frac{N_i}{N} \right)^2 {\sigma_x}^2_i = \sum_i \left( \frac{N_i}{N} \right)^2 \frac{\sigma^2_i}{n_i} $$ where: - $i$ runs over the strata, - $N_i$ is the weight of the $i$-th stratum - $N$ is the sum of all strata weights. In practical terms, it can produce a weighted mean that has less variability than the arithmetic mean of a simple random sample of the whole population. In fact, if measurements within strata have lower standard deviation, the final result will have a smaller error in estimation with respect to the one otherwise obtained with simple sampling. For this reason, stratified sampling is used as a method of variance reduction when MC methods are used to estimate population statistics from a known population. For examples, see [@ridder17]. ### MISER The MISER technique aims at reducing the integration error through the use of recursive stratified sampling. As stated before, according to the law of large numbers, for a great number of extracted points, the estimation of the integral $I$ can be computed as: $$ I= V \cdot \avg{f} $$ Since $V$ is known (in this case, $V = 1$), it is sufficient to estimate $\avg{f}$. Consider two disjoint regions $a$ and $b$, such that $a \cup b = \Omega$, in which $n_a$ and $n_b$ points are respectively uniformly sampled. Given the Monte Carlo estimates of the means $\avg{f}_a$ and $\avg{f}_b$ of those points and their variances $\sigma_a^2$ and $\sigma_b^2$, if the weights $N_a$ and $N_b$ of $\avg{f}_a$ and $\avg{f}_b$ are chosen unitary, then the variance $\sigma^2$ of the combined estimate $\avg{f}$: $$ \avg{f} = \frac{1}{2} \left( \avg{f}_a + \avg{f}_b \right) $$ is given by: $$ \sigma^2 = \frac{\sigma_a^2}{4n_a} + \frac{\sigma_b^2}{4n_b} $$ It can be shown that this variance is minimized by distributing the points such that: $$ \frac{n_a}{n_a + n_b} = \frac{\sigma_a}{\sigma_a + \sigma_b} $$ Hence, the smallest error estimate is obtained by allocating sample points in proportion to the standard deviation of the function in each sub-region. The whole integral estimate and its variance are therefore given by: $$ I = V \cdot \avg{f} \et \sigma_I^2 = V^2 \cdot \sigma^2 $$ When implemented, MISER is in fact a recursive method. First, all the possible bisections of $\Omega$ are tested and the one which minimizes the combined variance of the two sub-regions is selected. In order to speed up the algorithm, the variance in the sub-regions is estimated with a fraction of the total number of available points (function calls), in GSL it is default set to 0.1. The remaining points are allocated to the sub-regions using the formula for $n_a$ and $n_b$, once the variances are computed. This procedure is then repeated recursively for each of the two half-regions from the best bisection. When the allocated calls for a region running out (less than 512 in GSL), the method falls back to a plain Monte Carlo. The final individual values and their error estimates are then combined upwards to give an overall result and an estimate of its error [@sayah19]. ![Estimations $I^{\text{oss}}$ of the integral $I$ obtained for plain MC and MISER for different values of function calls. Errorbars showing their estimated uncertainties.](images/5-MC_MC_MI.pdf){#fig:miser-iter} Results for this particular sample are shown in black in @fig:miser-iter and compared with the plain MC results (in red). Some of them are listed in @tbl:miser-res. Except for the first very little number of calls, the improvement with respect to the Plain MC technique is appreciable. The convergence is much faster than a plain MC: at 500'000 function calls, the estimate agrees with the exact integral to the fifth decimal place. Once again, the standard deviation and the difference share the same magnitude. -------------------------------------------------------------------- calls $I^{\text{oss}}$ $\sigma$ diff ----------- ------------------ ------------------ ------------------ \num{5e5} 1.7182850738 0.0000021829 0.0000032453 \num{5e6} 1.7182819143 0.0000001024 0.0000000858 \num{5e7} 1.7182818221 0.0000000049 0.0000000064 -------------------------------------------------------------------- Table: MISER results with different numbers of function calls. Differences between computed and exact values are given in diff. {#tbl:miser-res} ## Importance sampling In Monte Carlo methods, importance sampling is a technique which samples points from distribution whose shape is close to the integrand $f$ itself. This way, the points cluster in the regions that make the largest contribution to the integral $\int f(x)dx$ and consequently decrease the variance. In a plain MC the points are sampled uniformly, so their probability density is given by: $$ g(x) = \frac{1}{V} \quad \forall x \in \Omega $$ and the integral can be written as: $$ I = \int_\Omega dx f(x) = V \int_\Omega f(x) \frac{1}{V}dx \approx V \avg{f} $$ More generally, consider a distribution $h(x)$ and similarly do: $$ I = \int_\Omega dx f(x) = \int_\Omega dx \, \frac{f(x)}{h(x)} \, h(x) = \Exp \left[ \frac{f}{h}, h \right] $$ where $\Exp[X, h]$ is the expected value of $X$ wrt $h$. Also note that $h$ has to vanish outside $\Omega$ for this to hold. To reduce the variance, as anticipated, $h$ must be close to $f$. Assuming they are proportional, $h(x) = \alpha |f(x)|$, it follows that: $$ \Exp \left[ \frac{f}{h}, h \right] = \frac{1}{\alpha} \et \Var \left[ \frac{f}{h}, h \right] = 0 $$ For the expected value to give the original $I$, the proportionality constant must be taken to be $I^{-1}$, meaning: $$ h(z) = \frac{1}{I}\, |f(z)| $$ The sampling from this $h$ would produce a perfect result with zero variance. Of course, this is nonsense: if $I$ is known in advance, there would be no need to do a Monte Carlo integration to begin with. Nonetheless, this example serves to prove how variance reduction is achieved by sampling from an approximation of the integrand. In conclusion, since certain values of $x$ have more impact on $\Exp[f/h, h]$ than others, these "important" values must be emphasized by sampling them more frequently. As a consequence, the estimator variance will be reduced. ### VEGAS The VEGAS algorithm [@lepage78] of G. P. Lepage is based on importance sampling. As stated before, it is in practice impossible to sample points from the best distribution $h(x)$: only a good approximation can be achieved. The VEGAS algorithm attempts this by building a histogram of the function $f$ in different subregions with an iterative method, namely: - a fixed number of points (function calls) is generated uniformly in the whole region; - the volume $V$ is divided into $N$ intervals of width $\Delta x_i = \Delta x \, \forall \, i$, where $N$ is limited by the computer storage space available and must be held constant from iteration to iteration. (In GSL this defaults to $N = 50$); - each interval is then divided into $m_i + 1$ subintervals, where: $$ m_i = K \frac{\bar{f}_i \Delta x_i}{\sum_j \bar{f}_j \Delta x_j} $$ where $j$ runs over all the intervals and $\bar{f}_i$ is the average value of $f$ in the interval. Hence, $m_i$ is therefore a measure of the "importance" of the interval with respect to the others: the higher $\bar{f}_i$, the higher $m_i$. The constant $K$, called stiffness, defaults to 1.5 in GSL; - as it is desirable to restore the number of intervals to its original value $N$, groups of the new intervals must be merged into larger intervals, the number of subintervals in each group being constant. The net effect is to alter the intervals sizes, while keeping the total number constant, so that the smallest intervals occur where $f$ is largest; - the function is integrated with a plain MC method in each interval and the sum of the integrals is taken as the $j$-th estimate of $I$. Its error is given by the sum of the variances in each interval; - the new grid is further refined in subsequent iterations. By default, the number of iterations is 5 in GSL. The final estimate of the integral $I$ and its error $\sigma_I$ are made based on weighted average: $$ \avg{I} = \sigma_I^2 \sum_i \frac{I_i}{\sigma_i^2} \with \frac{1}{\sigma_I^2} = \sum_i \frac{1}{\sigma_i^2} $$ where $I_i$ and $\sigma_i$ are are the integral and standard deviation estimated in each iteration. The reliability of the result is asserted by a chi-squared per degree of freedom $\chi_r^2$, which should be close to 1 for a good estimation. At a given iteration $i$, the $\chi^2_i$ is computed as follows: $$ \chi^2_i = \sum_{j \le i} \frac{(I_j - \avg{I})^2}{\sigma_j^2} $$ While performing the iterations, if the value of $\chi_r^2$ exceeds 1.5, the routine stops since it is not making progress. Clearly, a better estimation is achieved with a greater number of function calls. For this particular sample, the most accurate results are shown in @fig:vegas-iter and some of them are listed in @tbl:vegas-res. ![Only the most accurate results are shown in order to stress the differences between VEGAS (in gray) and MISER (in black) methods results.](images/5-MC_MI_VE.pdf){#fig:vegas-iter} As can be appreciated in @fig:vegas-iter, the VEGAS algorithm manages to compute the integral value more accurately compared to MISER. The $\chi_r^2$ turns out to be close enough to 1 to guarantee a good estimation of $I$, goodness which is also confirmed by the very small difference shown in @tbl:vegas-res. In fact, with a number of \num{5e7} function calls, the difference is smaller than \num{1e-10}. ---------------------------------------------------------------------------------------------- calls $I^{\text{oss}}$ $\sigma$ diff $\chi_r^2$ ------------------ ------------------ ------------------ ------------------ ------------------ \num{5e5} 1.7182818281 0.0000000012 0.0000000004 1.457 \num{5e6} 1.7182818284 0.0000000000 0.0000000001 0.632 \num{5e7} 1.7182818285 0.0000000000 0.0000000000 0.884 ---------------------------------------------------------------------------------------------- Table: Best VEGAS results with different numbers of function calls. {#tbl:vegas-res} In conclusion, between a plain Monte Carlo technique, stratified sampling and importance sampling, the last turned out to be the most powerful mean to obtain a good estimation of the integrand.