# Sample statistics
## Sample statistics
How to estimate the sample median, mode and FWHM?
. . .
- \only<3>\strike{Binning data $\hence$ depends strongly on the bin width}
. . .
- Alternative solutions
  - Robust estimators
  - Kernel density estimation
## Sample median
\Begin{block}{Algorithm}
::: incremental
1. Sample points
2. Sort sample in ascending order
3. Take the middle element if the sample size is odd,
   or the average of the two middle elements if it is even
:::
\End{block}
\setbeamercovered{}
\begin{center}
\begin{tikzpicture}[remember picture, >=Stealth]
% placeholder
\draw [ultra thick, transparent] (-0.35,0.7) -- (-0.35,-0.7);
% line
\draw <1-> [line width=3, ->, cyclamen] (-5,0) -- (5,0);
\node <1-> [right] at (5,0) {$x$};
% points
\draw <1-> [yellow!50!black, fill=yellow] (-4.6,-0.1) rectangle (-4.8,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-4,-0.1) rectangle (-4.2,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-3.3,-0.1) rectangle (-3.5,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-2.3,-0.1) rectangle (-2.5,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-0.6,-0.1) rectangle (-0.8,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-0.1,-0.1) rectangle (0.1,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (1.1,-0.1) rectangle (1.3,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (2,-0.1) rectangle (2.2,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (2.7,-0.1) rectangle (2.9,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (4,-0.1) rectangle (4.2,0.1);
% nodes
\node <2-> [below] at (-4.7,-0.1) {1};
\node <2-> [below] at (-4.1,-0.1) {2};
\node <2-> [below] at (-3.4,-0.1) {3};
\node <2-> [below] at (-2.4,-0.1) {4};
\node <2-> [below] at (-0.7,-0.1) {5};
\node <2-> [below] at ( 0 ,-0.1) {6};
\node <2-> [below] at ( 1.2,-0.1) {7};
\node <2-> [below] at ( 2.1,-0.1) {8};
\node <2-> [below] at ( 2.8,-0.1) {9};
\node <2-> [below] at ( 4.1,-0.1) {10};
\draw <3-> [ultra thick] (-0.35,0.7) -- (-0.35,-0.7);
\end{tikzpicture}
\end{center}
\setbeamercovered{transparent}
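
## Sample median

A minimal Python sketch of the sort-and-pick steps, assuming a non-empty list
of numbers. The function name `sample_median` is illustrative and not taken
from the original material.

```python
def sample_median(points):
    """Median by sorting: middle element (odd size) or mean of the two middle ones (even size)."""
    xs = sorted(points)          # step 2: ascending order
    n = len(xs)
    if n == 0:
        raise ValueError("empty sample")
    mid = n // 2
    if n % 2 == 1:               # odd size: single middle element
        return xs[mid]
    return 0.5 * (xs[mid - 1] + xs[mid])   # even size: average the two middle elements
```

With ten points, as in the figure, the result lies halfway between the 5th and 6th sorted points.
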
## Sample mode
**Half Sample Mode** [@robertson74]
\Begin{block}{Algorithm}
::: incremental
1. Sample points
2. Find the smallest interval containing half of the points
3. Repeat on the new interval (iteratively)
4. If fewer than four points remain, take the average of the closest two
:::
\End{block}
\centering
\setbeamercovered{}
\begin{tikzpicture}[remember picture, >=Stealth]
% line
\draw <1-> [line width=3, ->, cyclamen] (-5,0) -- (5,0);
\node <1-> [right] at (5,0) {$x$};
% points
\draw <1-> [yellow!50!black, fill=yellow] (-4.6,-0.1) rectangle (-4.8,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-4,-0.1) rectangle (-4.2,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-3.3,-0.1) rectangle (-3.5,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-2.3,-0.1) rectangle (-2.5,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-0.6,-0.1) rectangle (-0.8,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (-0.1,-0.1) rectangle (0.1,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (1.1,-0.1) rectangle (1.3,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (2,-0.1) rectangle (2.2,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (2.7,-0.1) rectangle (2.9,0.1);
\draw <1-> [yellow!50!black, fill=yellow] (4,-0.1) rectangle (4.2,0.1);
% nodes
\node <1-> at (-1,-0.3) (1a) {};
\node <1-> at (3.1,0.3) (1b) {};
\node <1-> at (0.9,-0.3) (2a) {};
\node <1-> at (1.8,-0.3) (3a) {};
\node <1-> at (2.45,-0.7) (f1) {};
\node <1-> at (2.45,0.7) (f2) {};
% algorithm
\draw <2-> [gray, fill=gray, opacity=0.5] (1a) rectangle (1b);
\draw <3-> [gray, fill=gray, opacity=0.6] (2a) rectangle (1b);
\draw <4-> [cyclamen, thick] (3a) rectangle (1b);
\draw <5-> [ultra thick] (f1) -- (f2);
\end{tikzpicture}
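
## Sample mode

A possible Python sketch of the half sample mode iteration. Two details are
assumptions, since the slide does not fix them: "half" is taken as
$\lceil n/2 \rceil$ points, and for three equally spaced points the middle one
is returned. The name `half_sample_mode` is illustrative.

```python
import math

def half_sample_mode(points):
    """Iteratively shrink to the narrowest window holding half of the points."""
    xs = sorted(points)
    if not xs:
        raise ValueError("empty sample")
    while len(xs) >= 4:
        n = len(xs)
        h = math.ceil(n / 2)     # assumed size of the "half sample"
        # index of the window of h consecutive points with the smallest width
        best = min(range(n - h + 1), key=lambda i: xs[i + h - 1] - xs[i])
        xs = xs[best:best + h]   # repeat on the new interval
    if len(xs) == 1:
        return xs[0]
    if len(xs) == 2:
        return 0.5 * (xs[0] + xs[1])
    # three points left: average the closest pair
    left, right = xs[1] - xs[0], xs[2] - xs[1]
    if left < right:
        return 0.5 * (xs[0] + xs[1])
    if right < left:
        return 0.5 * (xs[1] + xs[2])
    return xs[1]                 # tie: middle point (assumed convention)
```
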
## Sample FWHM
$$
\text{FWHM} = x_+ - x_- \with L(x_{\pm}) = \frac{L_{\text{max}}}{2}
$$
\setbeamercovered{transparent}
. . .
**Kernel Density Estimation**
:::: {.columns}
::: {.column width=50% .c}
- empirical PDF construction:
$$
f_\varepsilon(x) = \frac{1}{N\varepsilon} \sum_{i = 1}^N
G \left( \frac{x-x_i}{\varepsilon} \right)
$$
- The bandwidth $\varepsilon$ controls the
sharpness of the empirical PDF
:::
::: {.column width=50%}
\setbeamercovered{}
\begin{center}
\begin{tikzpicture}
% placeholder
\draw [transparent] (-2.7,-0.2) rectangle (3,3.3);
% bandwidth 1
\node <4,5> [left] at (2.9,3) {$\varepsilon = 1$};
% points
\draw <3-> [yellow!50!black, fill=yellow] (-1.2,-0.2) rectangle (-1,0);
\draw <3-> [yellow!50!black, fill=yellow] (-0.1,-0.2) rectangle (0.1,0);
\draw <3-> [yellow!50!black, fill=yellow] (0.7,-0.2) rectangle (0.9,0);
\draw <3-> [yellow!50!black, fill=yellow] (1.3,-0.2) rectangle (1.5,0);
% lines 1
\draw <4,5> [cyclamen, dashed] (-1.1,0.1) -- (-1.1,1);
\draw <4,5> [cyclamen, dashed] (0,0.1) -- (0,1);
\draw <4,5> [cyclamen, dashed] (1.4,0.1) -- (1.4,1);
\draw <4,5> [cyclamen, dashed] (0.8,0.1) -- (0.8,1);
% Gaussians 1
\draw <4,5> [domain=-2.6:0.4, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x + 1.1)*(\x + 1.1)) + 0.1});
\draw <4,5> [domain=-1.5:1.5, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-\x*\x) + 0.1});
\draw <4,5> [domain=-0.7:2.3, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x - 0.8)*(\x - 0.8)) + 0.1});
\draw <4,5> [domain=-0.1:2.9, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x - 1.4)*(\x - 1.4)) + 0.1});
% sum 1
\draw <5> [domain=-2.6:2.9, smooth, variable=\x, yellow, very thick]
plot ({\x}, {exp(-(\x + 1.1)*(\x + 1.1)) +
exp(-\x*\x) +
exp(-(\x - 1.4)*(\x - 1.4)) +
exp(-(\x - 0.8)*(\x - 0.8)) + 0.1});
% bandwidth 2
\node <6> [left] at (2.9,3) {$\varepsilon = 0.5$};
% lines 2
\draw <6> [cyclamen, dashed] (-1.1,0.1) -- (-1.1,2);
\draw <6> [cyclamen, dashed] (0,0.1) -- (0,2);
\draw <6> [cyclamen, dashed] (1.4,0.1) -- (1.4,2);
\draw <6> [cyclamen, dashed] (0.8,0.1) -- (0.8,2);
% Gaussians 2
\draw <6> [domain=-2.6:0.4, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x + 1.1)*(\x + 1.1)/0.25)/0.5 + 0.1});
\draw <6> [domain=-1.5:1.5, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-\x*\x/0.25)/0.5 + 0.1});
\draw <6> [domain=-0.7:2.3, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x - 0.8)*(\x - 0.8)/0.25)/0.5 + 0.1});
\draw <6> [domain=-0.1:2.9, smooth, variable=\x, cyclamen, very thick]
plot ({\x}, {exp(-(\x - 1.4)*(\x - 1.4)/0.25)/0.5 + 0.1});
% sum
\draw <6> [domain=-2.6:2.9, smooth, variable=\x, yellow, very thick]
plot ({\x}, {exp(-(\x + 1.1)*(\x + 1.1)/0.25)/0.5 +
exp(-\x*\x/0.25)/0.5 +
exp(-(\x - 1.4)*(\x - 1.4)/0.25)/0.5 +
exp(-(\x - 0.8)*(\x - 0.8)/0.25)/0.5 + 0.1});
\end{tikzpicture}
\end{center}
\setbeamercovered{transparent}
:::
::::
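
## Sample FWHM

A minimal Python sketch of the empirical PDF $f_\varepsilon$, assuming $G$ is
the standard normal density (the figure draws Gaussians, but the normalization
is an assumption). The names `gaussian_kernel` and `kde` are illustrative.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, assumed here for the kernel G."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, eps):
    """f_eps(x) = 1 / (N eps) * sum_i G((x - x_i) / eps)."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / eps) for xi in sample) / (n * eps)

# The four points of the figure, evaluated with the two bandwidths shown there
points = [-1.1, 0.0, 0.8, 1.4]
wide = kde(0.5, points, 1.0)
narrow = kde(0.5, points, 0.5)
```

A smaller $\varepsilon$ gives narrower, taller bumps around each point, so the sum follows the data more closely.
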
## Sample FWHM
**Silverman's rule of thumb** [@silver86]:
$$
\varepsilon = 0.88 \, S_N
\left( \frac{d + 2}{4}N \right)^{-1/(d + 4)}
$$
where:
- $S_N$ is the sample standard deviation
- $d$ is the number of dimensions (here $d = 1$)
. . .
Minimization of $-f_\varepsilon$ (Brent) for $\quad f_{\varepsilon_{\text{max}}}$

Root finding (Brent-Dekker) for $\quad f_{\varepsilon}(x_{\pm}) =
\frac{f_{\varepsilon_{\text{max}}}}{2}$
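
## Sample FWHM

A rough, standard-library Python sketch of the whole chain. It is not the
method named on the slide: a grid scan and plain bisection stand in for Brent
and Brent-Dekker. It also assumes a Gaussian kernel, a unimodal $f_\varepsilon$
and the $N - 1$ convention for $S_N$. All function names are illustrative.

```python
import math
import statistics

def silverman_bandwidth(sample, d=1):
    """eps = 0.88 * S_N * ((d + 2) / 4 * N)^(-1 / (d + 4)), as on the slide."""
    n = len(sample)
    s_n = statistics.stdev(sample)   # assumption: N - 1 in the denominator
    return 0.88 * s_n * ((d + 2) / 4 * n) ** (-1 / (d + 4))

def kde(x, sample, eps):
    """Empirical PDF with an (assumed) standard normal kernel."""
    norm = len(sample) * eps * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / eps) ** 2) for xi in sample) / norm

def bisect_root(g, a, b, tol=1e-10):
    """Root of g in [a, b] by bisection (stand-in for Brent-Dekker)."""
    ga = g(a)
    if ga * g(b) > 0:
        raise ValueError("root not bracketed")
    while b - a > tol:
        m = 0.5 * (a + b)
        gm = g(m)
        if ga * gm <= 0:
            b = m
        else:
            a, ga = m, gm
    return 0.5 * (a + b)

def fwhm(sample, eps, grid=2000):
    """x_plus - x_minus, where f_eps(x_pm) is half the maximum of f_eps."""
    lo = min(sample) - 5 * eps
    hi = max(sample) + 5 * eps
    xs = [lo + (hi - lo) * i / grid for i in range(grid + 1)]
    # grid scan for the maximum (stand-in for Brent minimization of -f_eps)
    x_max = max(xs, key=lambda x: kde(x, sample, eps))
    half = kde(x_max, sample, eps) / 2
    g = lambda x: kde(x, sample, eps) - half
    x_minus = bisect_root(g, lo, x_max)   # crossing on the left of the peak
    x_plus = bisect_root(g, x_max, hi)    # crossing on the right of the peak
    return x_plus - x_minus
```

For a single point the estimate reduces to the FWHM of one Gaussian, $2\sqrt{2 \ln 2}\,\varepsilon$, which is a convenient sanity check.
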
## Sample FWHM
![](images/kde.pdf)
## Bootstrap
Estimating a confidence interval:
. . .
\Begin{block}{Algorithm}
::: incremental
1. Sample $N$ points from the PDF
2. Resample with replacement $M$ times
3. Compute the statistic on each new sample
4. Compute the mean and standard deviation of the results
:::
\End{block}
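
## Bootstrap

A minimal Python sketch of the resampling loop, assuming the original sample
is already available (step 1) and that each resample has the same size $N$.
The names `bootstrap`, `statistic`, `m` and `seed` are illustrative.

```python
import random
import statistics

def bootstrap(sample, statistic, m=1000, seed=None):
    """Mean and standard deviation of `statistic` over m resamples with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    values = []
    for _ in range(m):
        resample = rng.choices(sample, k=n)   # draw n points with replacement
        values.append(statistic(resample))
    return statistics.mean(values), statistics.stdev(values)

# Example: spread of the median of a small, made-up sample
center, spread = bootstrap([1.0, 2.0, 3.0, 4.0, 5.0], statistics.median, m=200, seed=0)
```

The returned standard deviation can serve as an uncertainty estimate for the chosen statistic.
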