# Kolmogorov-Smirnov test

## KS

Quantify distance between expected and observed CDFs.

KS statistic:

:::: {.columns}
::: {.column width=50% .c}
$$
  D_N = \sup_x |F_N(x) - F(x)|
$$

- $F(x)$ is the expected CDF
- $F_N(x)$ is the empirical CDF
    - sort points in ascending order
    - count the points at or below $x$, divide by $N$

. . .

:::
::: {.column width=50%}
\setbeamercovered{}
\begin{center}
\begin{tikzpicture}[>=Stealth]
% empiric
\draw [cyclamen, thick, fill=cyclamen!20!white]
(-2.5,0) -- (-2.5,0.5) -- (-1.5,0.5) -- (-1.5,1) -- (-0.9,1) --
(-0.9,1.5) -- (-0.1,1.5) -- (-0.1,2) -- (1,2) -- (1,2.5) --
(1.2,2.5) -- (1.2,3) -- (1.3,3) -- (1.3,3.5) -- (1.6,3.5) --
(1.6,4) -- (2.3,4) -- (2.3,4.5) -- (2.5,4.5) -- (2.5,0) --
cycle;
% points
\draw [yellow!50!black, fill=yellow] (-2.6,-0.1) rectangle (-2.4,0.1); %-2.5
\draw [yellow!50!black, fill=yellow] (-1.6,-0.1) rectangle (-1.4,0.1); %-1.5
\draw [yellow!50!black, fill=yellow] (-1,-0.1) rectangle (-0.8,0.1); %-0.9
\draw [yellow!50!black, fill=yellow] (-0.2,-0.1) rectangle (0,0.1); %-0.1
\draw [yellow!50!black, fill=yellow] (0.9,-0.1) rectangle (1.1,0.1); % 1
\draw [yellow!50!black, fill=yellow] (1.1,-0.1) rectangle (1.3,0.1); % 1.2
\draw [yellow!50!black, fill=yellow] (1.2,-0.1) rectangle (1.4,0.1); % 1.3
\draw [yellow!50!black, fill=yellow] (1.5,-0.1) rectangle (1.7,0.1); % 1.6
\draw [yellow!50!black, fill=yellow] (2.2,-0.1) rectangle (2.4,0.1); % 2.3
% expected
\pause
\draw[domain=-2.5:2.5, yscale=5, smooth, variable=\x, yellow, very thick]
plot ({\x}, {((atan(\x)*pi/180) + pi/2)/pi});
\pause
\draw [very thick, cyclamen, <->] (1,2.5) -- (1,3.75);
\end{tikzpicture}
\end{center}
\setbeamercovered{transparent}
:::
::::
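
## KS

A minimal Python sketch of the statistic, not the implementation behind these slides. It assumes the expected CDF is the standard Cauchy CDF (the curve plotted in the figure) and reuses the nine points marked there. The names `cauchy_cdf` and `ks_statistic` are illustrative. Since $F_N$ is a step function, the supremum can only occur at a sample point, just before or right at the jump.

```python
import math

def cauchy_cdf(x):
    # Assumed expected CDF F(x): standard Cauchy, as plotted in the figure.
    return (math.atan(x) + math.pi / 2) / math.pi

def ks_statistic(sample, cdf):
    """D_N = sup_x |F_N(x) - F(x)| for a continuous expected CDF."""
    xs = sorted(sample)              # sort points in ascending order
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # F_N jumps from (i-1)/N to i/N at the i-th sorted point:
        # check the gap on both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Points marked on the x axis of the figure.
points = [-2.5, -1.5, -0.9, -0.1, 1.0, 1.2, 1.3, 1.6, 2.3]
d_n = ks_statistic(points, cauchy_cdf)  # largest gap falls at x = 1
```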
2020-06-07 14:32:03 +02:00
2020-06-10 16:23:33 +02:00
## KS
2020-06-07 14:32:03 +02:00
2020-06-13 18:42:32 +02:00
$\bold{H_0}$: points sampled from reference distribution
2020-06-07 14:32:03 +02:00
2020-06-13 17:47:37 +02:00
::: incremental
- $\sqrt{N}D_N \xrightarrow{N \rightarrow + \infty} K$, independent of $F$
- Kolmogorov variable $K$ with CDF:
  $$
    P(K \leqslant K_0) = \frac{\sqrt{2 \pi}}{K_0}
    \sum_{j = 1}^{+ \infty} e^{-(2j - 1)^2 \pi^2 / (8 K_0^2)}
  $$
- $p$-value for the observed $K_0 = \sqrt{N} D_N$: $p = 1 - P(K \leqslant K_0)$

:::
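
## KS

A hedged sketch of how the $p$-value could be evaluated from the series above. The truncation at 100 terms is an assumption that should be adequate for moderate $K_0$ (the series converges more slowly as $K_0$ grows), and the function names are illustrative.

```python
import math

def kolmogorov_cdf(k0, terms=100):
    """P(K <= k0) from the series above, truncated at `terms` (assumed cutoff)."""
    if k0 <= 0:
        return 0.0
    total = sum(
        math.exp(-((2 * j - 1) ** 2) * math.pi ** 2 / (8 * k0 ** 2))
        for j in range(1, terms + 1)
    )
    value = math.sqrt(2 * math.pi) / k0 * total
    # Clamp to [0, 1] to absorb rounding and truncation error.
    return min(1.0, max(0.0, value))

def ks_p_value(d_n, n):
    """p = 1 - P(K <= K_0), with K_0 = sqrt(N) * D_N (asymptotic in N)."""
    k0 = math.sqrt(n) * d_n
    return 1.0 - kolmogorov_cdf(k0)

# Hypothetical inputs: D_N = 0.2 measured on N = 50 points.
p = ks_p_value(0.2, 50)
```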