# Sample statistics

## Sample statistics

How to estimate sample median, mode and FWHM?

. . .

- \only<3>\strike{Binning data $\hence$ depends wildly on bin-width}

. . .

- Alternative solutions
  - Robust estimators
  - Kernel density estimation

## Sample median
\Begin{block}{Algorithm}
::: incremental
1. Sample points
2. Sort sample in ascending order
3. Take the middle element if the sample size is odd, or the average of the two middle elements if it is even
:::
\End{block}
\setbeamercovered{}
\begin{center}
\begin{tikzpicture}[remember picture, >=Stealth]
% placeholder
\draw [ultra thick, transparent] (-0.35,0.7) -- (-0.35,-0.7);
% line
\draw <1-> [line width=3, ->, cyclamen] (-5,0) -- (5,0);
\node <1-> [right] at (5,0) {$x$};
% points
\draw <1-> [yellow!50!black, fill=yellow] (-4.6,-0.1) rectangle (-4.8,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-4,-0.1) rectangle (-4.2,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-3.3,-0.1) rectangle (-3.5,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-2.3,-0.1) rectangle (-2.5,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-0.6,-0.1) rectangle (-0.8,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-0.1,-0.1) rectangle (0.1,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (1.1,-0.1) rectangle (1.3,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (2,-0.1) rectangle (2.2,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (2.7,-0.1) rectangle (2.9,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (4,-0.1) rectangle (4.2,0.1);
% nodes
\node <2-> [below] at (-4.7,-0.1) {1};
\node <2-> [below] at (-4.1,-0.1) {2};
\node <2-> [below] at (-3.4,-0.1) {3};
\node <2-> [below] at (-2.4,-0.1) {4};
\node <2-> [below] at (-0.7,-0.1) {5};
\node <2-> [below] at ( 0 ,-0.1) {6};
\node <2-> [below] at ( 1.2,-0.1) {7};
\node <2-> [below] at ( 2.1,-0.1) {8};
\node <2-> [below] at ( 2.8,-0.1) {9};
\node <2-> [below] at ( 4.1,-0.1) {10};
\draw <3-> [ultra thick] (-0.35,0.7) -- (-0.35,-0.7);
\end{tikzpicture}
\end{center}
\setbeamercovered{transparent}
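The median algorithm above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_median` is hypothetical, and the ten values are read off the figure as an assumption.

```python
def sample_median(sample):
    """Median of a non-empty sequence of numbers."""
    ordered = sorted(sample)  # step 2: sort in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("empty sample")
    mid = n // 2
    if n % 2 == 1:
        # odd size: the middle element
        return ordered[mid]
    # even size: average of the two middle elements
    return 0.5 * (ordered[mid - 1] + ordered[mid])


# Positions of the ten points in the figure (assumed from the drawing).
points = [-4.7, -4.1, -3.4, -2.4, -0.7, 0.0, 1.2, 2.1, 2.8, 4.1]
print(sample_median(points))
```

With ten points the result is the midpoint between the fifth and sixth ordered values, which is where the vertical bar sits in the figure.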
## Sample mode
**Half Sample Mode** [@robertson74]
\Begin{block}{Algorithm}
::: incremental
1. Sample points
2. Find the smallest interval containing half of the points
3. Repeat on the new interval (iterative)
4. If fewer than four points remain, take the average of the closest two
:::
\End{block}
\centering
\setbeamercovered{}
\begin{tikzpicture}[remember picture, >=Stealth]
% line
\draw <1-> [line width=3, ->, cyclamen] (-5,0) -- (5,0);
\node <1-> [right] at (5,0) {$x$};
% points
\draw <1-> [yellow!50!black, fill=yellow] (-4.6,-0.1) rectangle (-4.8,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-4,-0.1) rectangle (-4.2,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-3.3,-0.1) rectangle (-3.5,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-2.3,-0.1) rectangle (-2.5,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-0.6,-0.1) rectangle (-0.8,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-0.1,-0.1) rectangle (0.1,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (1.1,-0.1) rectangle (1.3,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (2,-0.1) rectangle (2.2,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (2.7,-0.1) rectangle (2.9,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (4,-0.1) rectangle (4.2,0.1);
% nodes
\node <1-> at (-1,-0.3) (1a) {};
\node <1-> at (3.1,0.3) (1b) {};
\node <1-> at (0.9,-0.3) (2a) {};
\node <1-> at (1.8,-0.3) (3a) {};
\node <1-> at (2.45,-0.7) (f1) {};
\node <1-> at (2.45,0.7) (f2) {};
% algorithm
\draw <2-> [gray, fill=gray, opacity=0.5] (1a) rectangle (1b);
\draw <3-> [gray, fill=gray, opacity=0.6] (2a) rectangle (1b);
\draw <4-> [cyclamen, thick] (3a) rectangle (1b);
\draw <5-> [ultra thick] (f1) -- (f2);
\end{tikzpicture}
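A minimal sketch of the half sample mode steps above, in Python. Two details are assumptions not fixed by the slide: "half of the points" is taken as $\lceil n/2 \rceil$, and with three points at equal spacing the middle one is returned. The name `half_sample_mode` is hypothetical.

```python
import math


def half_sample_mode(sample):
    """Half sample mode of a non-empty sequence of numbers."""
    x = sorted(sample)
    if not x:
        raise ValueError("empty sample")
    while len(x) >= 4:
        h = math.ceil(len(x) / 2)  # assumed meaning of "half of the points"
        # start index of the narrowest window holding h consecutive points
        start = min(range(len(x) - h + 1), key=lambda i: x[i + h - 1] - x[i])
        x = x[start:start + h]
    if len(x) == 3:
        left, right = x[1] - x[0], x[2] - x[1]
        if left == right:
            return x[1]  # tie: assumed convention, return the middle point
        x = x[:2] if left < right else x[1:]  # keep the closest pair
    return sum(x) / len(x)


# Positions of the ten points in the figure (assumed from the drawing).
points = [-4.7, -4.1, -3.4, -2.4, -0.7, 0.0, 1.2, 2.1, 2.8, 4.1]
print(half_sample_mode(points))
```

On these ten points the windows shrink from five points to three, matching the shaded rectangles in the figure, and the estimate is the average of the closest pair among the last three.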
## Sample FWHM
$$
  \text{FWHM} = x_+ - x_- \with L(x_{\pm}) = \frac{L_{\text{max}}}{2}
$$
\setbeamercovered{transparent}
. . .
**Kernel Density Estimation**
:::: {.columns}
::: {.column width=50% .c}
- Empirical PDF construction:

  $$
    f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
    G \left( \frac{x-x_i}{\varepsilon} \right)
  $$

- The parameter $\varepsilon$ (the bandwidth) controls the
  sharpness of the empirical PDF
:::
::: {.column width=50%}
\setbeamercovered{}
\begin{center}
\begin{tikzpicture}
% placeholder
\draw [transparent] (-2.7,-0.2) rectangle (3,3.3);
% bandwidth 1
\node <4,5> [left] at (2.9,3) {$\varepsilon = 1$};
% points
\draw <3-> [yellow!50!black, fill=yellow] (-1.2,-0.2) rectangle (-1,0);
\draw <3-> [yellow!50!black, fill=yellow] (-0.1,-0.2) rectangle (0.1,0);
\draw <3-> [yellow!50!black, fill=yellow] (0.7,-0.2) rectangle (0.9,0);
\draw <3-> [yellow!50!black, fill=yellow] (1.3,-0.2) rectangle (1.5,0);
% lines 1
\draw <4,5> [cyclamen, dashed] (-1.1,0.1) -- (-1.1,1);
\draw <4,5> [cyclamen, dashed] (0,0.1) -- (0,1);
\draw <4,5> [cyclamen, dashed] (1.4,0.1) -- (1.4,1);
\draw <4,5> [cyclamen, dashed] (0.8,0.1) -- (0.8,1);
% Gaussians 1
\draw <4,5> [domain=-2.6:0.4, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x + 1.1)*(\x + 1.1)) + 0.1});
\draw <4,5> [domain=-1.5:1.5, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-\x*\x) + 0.1});
\draw <4,5> [domain=-0.7:2.3, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x - 0.8)*(\x - 0.8)) + 0.1});
\draw <4,5> [domain=-0.1:2.9, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x - 1.4)*(\x - 1.4)) + 0.1});
% sum 1
\draw <5> [domain=-2.6:2.9, smooth, variable=\x, yellow, very thick]
plot ({\x}, {exp(-(\x + 1.1)*(\x + 1.1)) +
exp(-\x*\x) +
exp(-(\x - 1.4)*(\x - 1.4)) +
exp(-(\x - 0.8)*(\x - 0.8)) + 0.1});
% bandwidth 2
\node <6> [left] at (2.9,3) {$\varepsilon = 0.5$};
% lines 2
\draw <6> [cyclamen, dashed] (-1.1,0.1) -- (-1.1,2);
\draw <6> [cyclamen, dashed] (0,0.1) -- (0,2);
\draw <6> [cyclamen, dashed] (1.4,0.1) -- (1.4,2);
\draw <6> [cyclamen, dashed] (0.8,0.1) -- (0.8,2);
% Gaussians 2
\draw <6> [domain=-2.6:0.4, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x + 1.1)*(\x + 1.1)/0.25)/0.5 + 0.1});
\draw <6> [domain=-1.5:1.5, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-\x*\x/0.25)/0.5 + 0.1});
\draw <6> [domain=-0.7:2.3, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x - 0.8)*(\x - 0.8)/0.25)/0.5 + 0.1});
\draw <6> [domain=-0.1:2.9, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x - 1.4)*(\x - 1.4)/0.25)/0.5 + 0.1});
% sum
\draw <6> [domain=-2.6:2.9, smooth, variable=\x, yellow, very thick]
plot ({\x}, {exp(-(\x + 1.1)*(\x + 1.1)/0.25)/0.5 +
exp(-\x*\x/0.25)/0.5 +
exp(-(\x - 1.4)*(\x - 1.4)/0.25)/0.5 +
exp(-(\x - 0.8)*(\x - 0.8)/0.25)/0.5 + 0.1});
\end{tikzpicture}
\end{center}
\setbeamercovered{transparent}
:::
::::
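The empirical PDF formula above translates almost directly into code. The sketch below assumes $G$ is the standard normal density, which the slide does not state explicitly. The names `gaussian` and `empirical_pdf` are hypothetical, and the four sample values are read off the figure.

```python
import math


def gaussian(u):
    """Standard normal density, used here as the kernel G (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def empirical_pdf(x, sample, eps, kernel=gaussian):
    """f_eps(x) = 1 / (N * eps) * sum_i G((x - x_i) / eps)."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    n = len(sample)
    return sum(kernel((x - xi) / eps) for xi in sample) / (n * eps)


# The four points drawn in the figure (assumed from the drawing).
points = [-1.1, 0.0, 0.8, 1.4]
for eps in (1.0, 0.5):
    values = [round(empirical_pdf(x, points, eps), 3) for x in (-1.0, 0.0, 1.0)]
    print(eps, values)
```

A smaller $\varepsilon$ gives narrower, taller bumps around each point, while the $1/(N\varepsilon)$ prefactor keeps the total area equal to one.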
## Sample FWHM
**Silverman's rule of thumb** [@silver86]:

$$
  \varepsilon = 0.88 \, S_N
  \left( \frac{d + 2}{4}N \right)^{-1/(d + 4)}
$$

where:

- $S_N$ is the sample standard deviation
- $d$ is the number of dimensions (here $d = 1$)

. . .

Minimization (Brent) for $\quad f_{\varepsilon_{\text{max}}}$

Root finding (Brent-Dekker) for $\quad f_{\varepsilon}(x_{\pm}) =
\frac{f_{\varepsilon_{\text{max}}}}{2}$

## Sample FWHM
![](images/kde.pdf)
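The bandwidth rule and the half-maximum search from the previous slide can be sketched as follows. This is not the Brent and Brent-Dekker routines named there: to stay within the Python standard library it swaps in a plain grid search for the maximum and bisection for the two roots. The kernel is assumed Gaussian, and all function names are hypothetical.

```python
import math
import statistics


def kde(x, sample, eps):
    """Gaussian-kernel density estimate f_eps(x) (kernel choice assumed)."""
    norm = len(sample) * eps * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / eps) ** 2) for xi in sample) / norm


def silverman_bandwidth(sample, d=1):
    """Bandwidth from the rule quoted on the slide, 0.88 prefactor included."""
    n = len(sample)
    s_n = statistics.stdev(sample)  # sample standard deviation S_N
    return 0.88 * s_n * ((d + 2) / 4 * n) ** (-1 / (d + 4))


def bisect_root(g, a, b, iterations=100):
    """Bisection on [a, b]; assumes g(a) and g(b) have opposite signs."""
    ga = g(a)
    for _ in range(iterations):
        mid = 0.5 * (a + b)
        gm = g(mid)
        if (ga < 0) == (gm < 0):
            a, ga = mid, gm  # root lies in [mid, b]
        else:
            b = mid          # root lies in [a, mid]
    return 0.5 * (a + b)


def kde_fwhm(sample, eps, grid_points=2001):
    """FWHM of the KDE: x_plus - x_minus with f(x_pm) = f_max / 2."""
    lo = min(sample) - 10.0 * eps
    hi = max(sample) + 10.0 * eps
    step = (hi - lo) / (grid_points - 1)
    grid = [lo + i * step for i in range(grid_points)]
    # Grid search stands in for Brent minimization of -f_eps.
    x_max = max(grid, key=lambda x: kde(x, sample, eps))
    half = 0.5 * kde(x_max, sample, eps)

    def g(x):
        return kde(x, sample, eps) - half

    # Bisection stands in for Brent-Dekker root finding on each side.
    x_minus = bisect_root(g, lo, x_max)
    x_plus = bisect_root(g, x_max, hi)
    return x_plus - x_minus


data = [-1.1, 0.0, 0.8, 1.4]
eps = silverman_bandwidth(data)
print(eps, kde_fwhm(data, eps))
```

The sketch assumes the estimated PDF crosses half of its maximum once on each side of the peak. For a strongly multimodal estimate the bisection may land on a different crossing, and a coarse grid may miss a narrow peak.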
## Bootstrap
Estimating confidence intervals:

. . .

\Begin{block}{Algorithm}
::: incremental
1. Sample $N$ points from the PDF
2. Draw $M$ new samples of $N$ points from it, with replacement
3. Compute the estimator for each new sample
4. Compute the mean and standard deviation of the $M$ estimates
:::
\End{block}
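The bootstrap steps above can be sketched with the Python standard library. The function name `bootstrap`, the default $M = 1000$ and the Gaussian test sample are illustrative assumptions, not values taken from the slides.

```python
import random
import statistics


def bootstrap(sample, estimator, m=1000, seed=None):
    """Mean and standard deviation of an estimator over m resamples."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(m):
        # step 2: draw n points from the sample, with replacement
        resample = rng.choices(sample, k=n)
        # step 3: evaluate the estimator on the new sample
        estimates.append(estimator(resample))
    # step 4: summary statistics of the m estimates
    return statistics.mean(estimates), statistics.stdev(estimates)


# Step 1: sample N points from a PDF (a standard normal, as an example).
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]
centre, error = bootstrap(data, statistics.median, m=500, seed=1)
print(centre, error)
```

Any function of a sample can be passed as `estimator`, so the same routine could in principle be used for the median, mode and FWHM estimators from the earlier slides.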