PROBABILITY AND STATISTICS

Probability and Statistics Cookbook

Copyright © Matthias Vallentin, 2011 · [email protected] · 12th December 2011

This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California in Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics

1 Distribution Overview

1.1 Discrete Distributions

Notation: γ(s, x) and Γ(x) refer to the Gamma functions (see §22.1); B(x, y) and I_x refer to the Beta functions (see §22.2).

Uniform Unif{a, …, b}
  F_X(x) = 0 (x < a);  (⌊x⌋ − a + 1)/(b − a + 1) (a ≤ x ≤ b);  1 (x > b)
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2    V[X] = ((b − a + 1)² − 1)/12    M_X(s) = (e^{as} − e^{(b+1)s}) / ((b − a + 1)(1 − e^s))

Bernoulli Bern(p)
  F_X(x) = 0 (x < 0);  1 − p (0 ≤ x < 1);  1 (x ≥ 1)
  f_X(x) = p^x (1 − p)^{1−x},  x ∈ {0, 1}
  E[X] = p    V[X] = p(1 − p)    M_X(s) = 1 − p + pe^s

Binomial Bin(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np    V[X] = np(1 − p)    M_X(s) = (1 − p + pe^s)^n

Multinomial Mult(n, p),  p = (p_1, …, p_k)
  f_X(x) = n!/(x_1! ⋯ x_k!) · p_1^{x_1} ⋯ p_k^{x_k},  Σ_{i=1}^k x_i = n
  E[X_i] = np_i    V[X_i] = np_i(1 − p_i)    M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric Hyp(N, m, n)
  F_X(x) ≈ Φ((x − np)/√(np(1 − p))) with p = m/N
  f_X(x) = C(m, x) C(N − m, n − x) / C(N, n)
  E[X] = nm/N    V[X] = nm(N − n)(N − m) / (N²(N − 1))

Negative Binomial NBin(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p    V[X] = r(1 − p)/p²    M_X(s) = (p / (1 − (1 − p)e^s))^r

Geometric Geo(p)
  F_X(x) = 1 − (1 − p)^{⌊x⌋},  x ≥ 1
  f_X(x) = p(1 − p)^{x−1},  x ∈ N⁺
  E[X] = 1/p    V[X] = (1 − p)/p²    M_X(s) = pe^s / (1 − (1 − p)e^s)

Poisson Po(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i / i!
  f_X(x) = λ^x e^{−λ} / x!,  x ∈ {0, 1, 2, …}
  E[X] = λ    V[X] = λ    M_X(s) = e^{λ(e^s − 1)}

[Figure: PMFs of the discrete uniform, Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]

1.2 Continuous Distributions

Uniform Unif(a, b)
  F_X(x) = 0 (x < a);  (x − a)/(b − a) (a ≤ x < b);  1 (x ≥ b)
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2    V[X] = (b − a)²/12    M_X(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal N(µ, σ²)
  F_X(x) = Φ((x − µ)/σ), where Φ(z) = ∫_{−∞}^{z} φ(t) dt
  f_X(x) = φ(x) = 1/(σ√(2π)) exp(−(x − µ)²/(2σ²))
  E[X] = µ    V[X] = σ²    M_X(s) = exp(µs + σ²s²/2)

Log-Normal ln N(µ, σ²)
  F_X(x) = 1/2 + 1/2 erf[(ln x − µ)/√(2σ²)]
  f_X(x) = 1/(x√(2πσ²)) exp(−(ln x − µ)²/(2σ²))
  E[X] = e^{µ + σ²/2}    V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal MVN(µ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(x − µ)ᵀ Σ⁻¹ (x − µ)/2)
  E[X] = µ    V[X] = Σ    M_X(s) = exp(µᵀs + sᵀΣs/2)

Student's t Student(ν)
  F_X(x) = I_z(ν/2, ν/2) with z = (x + √(x² + ν)) / (2√(x² + ν))
  f_X(x) = Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) · (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1)    V[X] = ν/(ν − 2) (ν > 2)

Chi-square χ²_k
  F_X(x) = γ(k/2, x/2) / Γ(k/2)
  f_X(x) = 1/(2^{k/2} Γ(k/2)) x^{k/2 − 1} e^{−x/2}
  E[X] = k    V[X] = 2k    M_X(s) = (1 − 2s)^{−k/2},  s < 1/2

F F(d₁, d₂)
  F_X(x) = I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2)
  f_X(x) = √[(d₁x)^{d₁} d₂^{d₂} / (d₁x + d₂)^{d₁+d₂}] / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2) (d₂ > 2)    V[X] = 2d₂²(d₁ + d₂ − 2) / (d₁(d₂ − 2)²(d₂ − 4)) (d₂ > 4)

Exponential Exp(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β    V[X] = β²    M_X(s) = 1/(1 − βs),  s < 1/β

Gamma Gamma(α, β)
  F_X(x) = γ(α, x/β) / Γ(α)
  f_X(x) = 1/(Γ(α) β^α) x^{α−1} e^{−x/β}
  E[X] = αβ    V[X] = αβ²    M_X(s) = (1 − βs)^{−α},  s < 1/β

Inverse Gamma InvGamma(α, β)
  F_X(x) = Γ(α, β/x) / Γ(α)
  f_X(x) = β^α/Γ(α) · x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) (α > 1)    V[X] = β²/((α − 1)²(α − 2)) (α > 2)    M_X(s) = 2(−βs)^{α/2}/Γ(α) · K_α(√(−4βs))  (K_α: modified Bessel function of the second kind)

Dirichlet Dir(α)
  f_X(x) = Γ(Σ_{i=1}^k α_i) / ∏_{i=1}^k Γ(α_i) · ∏_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i / Σ_{j=1}^k α_j    V[X_i] = E[X_i](1 − E[X_i]) / (1 + Σ_{j=1}^k α_j)

Beta Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = Γ(α + β)/(Γ(α)Γ(β)) x^{α−1}(1 − x)^{β−1}
  E[X] = α/(α + β)    V[X] = αβ/((α + β)²(α + β + 1))    M_X(s) = 1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k)    V[X] = λ²Γ(1 + 2/k) − E[X]²    M_X(s) = Σ_{n=0}^∞ (s^n λ^n / n!) Γ(1 + n/k)

Pareto Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α,  x ≥ x_m
  f_X(x) = α x_m^α / x^{α+1},  x ≥ x_m
  E[X] = αx_m/(α − 1) (α > 1)    V[X] = x_m² α/((α − 1)²(α − 2)) (α > 2)    M_X(s) = α(−x_m s)^α Γ(−α, −x_m s),  s < 0

[Figure: PDFs of the Normal, Log-normal, Student's t, continuous uniform, χ², F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions for several parameter settings.]

2

Probability Theory

Law of Total Probability

Definitions • • • •

P [B] =

Sample space Ω Outcome (point or element) ω ∈ Ω Event A ⊆ Ω σ-algebra A

P [B|Ai ] P [Ai ]

n G

Ai

i=1

Bayes’ Theorem P [B | Ai ] P [Ai ] P [Ai | B] = Pn j=1 P [B | Aj ] P [Aj ] Inclusion-Exclusion Principle n n [ X Ai = (−1)r−1

• Probability Distribution P

i=1

1. P [A] ≥ 0 ∀A 2. P [Ω] = 1 "∞ # ∞ G X 3. P Ai = P [Ai ]

3

r=1

X

Ω=

n G

Ai

i=1

r \ Aij

i≤i1 <···
Random Variables

Random Variable (RV)

i=1

X:Ω→R

• Probability space (Ω, A, P) Probability Mass Function (PMF)

Properties • • • • • • • •

Ω=

i=1

1. ∅ ∈ A S∞ 2. A1 , A2 , . . . , ∈ A =⇒ i=1 Ai ∈ A 3. A ∈ A =⇒ ¬A ∈ A

i=1

n X

P [∅] = 0 B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B) P [¬A] = 1 − P [A] P [B] = P [A ∩ B] + P [¬A ∩ B] P [Ω] = 1 P [∅] = 0 S T T S ¬( n An ) = n ¬An ¬( n An ) = n ¬An DeMorgan S T P [ n An ] = 1 − P [ n ¬An ] P [A ∪ B] = P [A] + P [B] − P [A ∩ B]

=⇒ P [A ∪ B] ≤ P [A] + P [B] • P [A ∪ B] = P [A ∩ ¬B] + P [¬A ∩ B] + P [A ∩ B] • P [A ∩ ¬B] = P [A] − P [A ∩ B]

fX (x) = P [X = x] = P [{ω ∈ Ω : X(ω) = x}] Probability Density Function (PDF) Z P [a ≤ X ≤ b] =

b

f (x) dx a

Cumulative Distribution Function (CDF) FX : R → [0, 1]

FX (x) = P [X ≤ x]

1. Nondecreasing: x1 < x2 =⇒ F (x1 ) ≤ F (x2 ) 2. Normalized: limx→−∞ = 0 and limx→∞ = 1 3. Right-Continuous: limy↓x F (y) = F (x)

Continuity of Probabilities • A1 ⊂ A2 ⊂ . . . =⇒ limn→∞ P [An ] = P [A] • A1 ⊃ A2 ⊃ . . . =⇒ limn→∞ P [An ] = P [A]

S∞ whereA = i=1 Ai T∞ whereA = i=1 Ai

Z

a≤b

a

Independence ⊥ ⊥

fY |X (y | x) =

A⊥ ⊥ B ⇐⇒ P [A ∩ B] = P [A] P [B]

f (x, y) fX (x)

Independence

Conditional Probability P [A | B] =

b

fY |X (y | x)dy

P [a ≤ Y ≤ b | X = x] =

P [A ∩ B] P [B]

P [B] > 0

1. P [X ≤ x, Y ≤ y] = P [X ≤ x] P [Y ≤ y] 2. fX,Y (x, y) = fX (x)fY (y) 6

3.1

Z

Transformations

• E [XY ] =

xyfX,Y (x, y) dFX (x) dFY (y) X,Y

Transformation function

• E [ϕ(Y )] 6= ϕ(E [X]) (cf. Jensen inequality) • P [X ≥ Y ] = 0 =⇒ E [X] ≥ E [Y ] ∧ P [X = Y ] = 1 =⇒ E [X] = E [Y ] ∞ X • E [X] = P [X ≥ x]

Z = ϕ(X) Discrete X

  fZ (z) = P [ϕ(X) = z] = P [{x : ϕ(x) = z}] = P X ∈ ϕ−1 (z) =

f (x)

x∈ϕ−1 (z)

x=1

Sample mean n

X ¯n = 1 X Xi n i=1

Continuous Z FZ (z) = P [ϕ(X) ≤ z] =

with Az = {x : ϕ(x) ≤ z}

f (x) dx Az

Special case if ϕ strictly monotone dx d 1 fZ (z) = fX (ϕ−1 (z)) ϕ−1 (z) = fX (x) = fX (x) dz dz |J|

Conditional expectation Z • E [Y | X = x] = yf (y | x) dy • E [X] = E [E [X | Y ]] • E[ϕ(X, Y ) | X = x] =

ϕ(x, y)fY |X (y | x) dx Z −∞ ∞

The Rule of the Lazy Statistician • E [ϕ(Y, Z) | X = x] =

Z E [Z] =

• E [Y + Z | X] = E [Y | X] + E [Z | X] • E [ϕ(X)Y | X] = ϕ(X)E [Y | X] • E[Y | X] = c =⇒ Cov [X, Y ] = 0

Z dFX (x) = P [X ∈ A]

IA (x) dFX (x) =

ϕ(y, z)f(Y,Z)|X (y, z | x) dy dz −∞

ϕ(x) dFX (x)

Z E [IA (x)] =



Z

A

Convolution Z • Z := X + Y



fX,Y (x, z − x) dx

fZ (z) = −∞

Z

X,Y ≥0

Z

z

fX,Y (x, z − x) dx

=

0



• Z := |X − Y | • Z :=

4

X Y

fZ (z) = 2 fX,Y (x, z + x) dx 0 Z ∞ Z ∞ ⊥ ⊥ fZ (z) = |x|fX,Y (x, xz) dx = xfx (x)fX (x)fY (xz) dx −∞

−∞

Expectation

5

Variance

Definition and properties     2 2 • V [X] = σX = E (X − E [X])2 = E X 2 − E [X] " n # n X X X • V Xi = V [Xi ] + 2 Cov [Xi , Yj ] i=1

• V

" n X

# Xi =

i=1

Definition and properties

Z • E [X] = µX =

x dFX (x) =

i6=j

V [Xi ]

iff Xi ⊥ ⊥ Xj

i=1

Standard deviation X  xfX (x)     x Z      xfX (x)

• P [X = c] = 1 =⇒ E [c] = c • E [cX] = c E [X] • E [X + Y ] = E [X] + E [Y ]

i=1 n X

sd[X] =

X discrete

p

V [X] = σX

Covariance X continuous

• • • • •

Cov [X, Y ] = E [(X − E [X])(Y − E [Y ])] = E [XY ] − E [X] E [Y ] Cov [X, a] = 0 Cov [X, X] = V [X] Cov [X, Y ] = Cov [Y, X] Cov [aX, bY ] = abCov [X, Y ]

7

• Cov [X + a, Y + b] = Cov [X, Y ]   n m n X m X X X   • Cov Xi , Yj = Cov [Xi , Yj ] i=1

j=1

Correlation

• limn→∞ Bin (n, p) = Po (np) (n large, p small) • limn→∞ Bin (n, p) = N (np, np(1 − p)) (n large, p far from 0 and 1) Negative Binomial

i=1 j=1

• • • •

Cov [X, Y ] ρ [X, Y ] = p V [X] V [Y ]

Independence

• X ∼ NBin(1, p) = Geo(p)
• X ∼ NBin(r, p) = Σ_{i=1}^r Geo(p)
• X_i ∼ NBin(r_i, p) independent ⇒ Σ X_i ∼ NBin(Σ r_i, p)
• X ∼ NBin(r, p) and Y ∼ Bin(s + r, p) ⇒ P[X ≤ s] = P[Y ≥ r]

Poisson

X⊥ ⊥ Y =⇒ ρ [X, Y ] = 0 ⇐⇒ Cov [X, Y ] = 0 ⇐⇒ E [XY ] = E [X] E [Y ]

• Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒

n X

Xi ∼ Po

i=1

Sample variance

n X

! λi

i=1

  X n X n λi  Xj ∼ Bin  Xj , Pn • Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ Xi j=1 j=1 λj j=1

n

1 X ¯ n )2 S2 = (Xi − X n − 1 i=1 Conditional variance     2 • V [Y | X] = E (Y − E [Y | X])2 | X = E Y 2 | X − E [Y | X] • V [Y ] = E [V [Y | X]] + V [E [Y | X]]

Exponential • Xi ∼ Exp (β) ∧ Xi ⊥ ⊥ Xj =⇒

n X

Xi ∼ Gamma (n, β)

i=1

• Memoryless property: P [X > x + y | X > y] = P [X > x]

6

Inequalities

Normal • X ∼ N µ, σ 2

Cauchy-Schwarz

2

• • •

    E [XY ] ≤ E X 2 E Y 2

Markov P [ϕ(X) ≥ t] ≤

E [ϕ(X)] t

Chebyshev P [|X − E [X]| ≥ t] ≤ Chernoff

 P [X ≥ (1 + δ)µ] ≤



δ > −1

E [ϕ(X)] ≥ ϕ(E [X]) ϕ convex

Distribution Relationships

Binomial • Xi ∼ Bern (p) =⇒

n X

X−µ σ



2

Gamma



Jensen

7



∼ N (0, 1)   X ∼ N µ, σ ∧ Z = aX + b =⇒ Z ∼ N aµ + b, a2 σ 2    X ∼ N µ1 , σ12 ∧ Y ∼ N µ2 , σ22 =⇒ X + Y ∼ N µ1 + µ2 , σ12 + σ22   P P P 2 Xi ∼ N µi , σi2 =⇒ X ∼N i µi , i σi i i   − Φ a−µ P [a < X ≤ b] = Φ b−µ σ σ =⇒

• Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ⁻¹(1 − α)

V [X] t2

eδ (1 + δ)1+δ



Xi ∼ Bin (n, p)

i=1

• X ∼ Bin (n, p) , Y ∼ Bin (m, p) =⇒ X + Y ∼ Bin (n + m, p)

• X ∼ Gamma (α, β) ⇐⇒ X/β ∼ Gamma (α, 1) Pα • Gamma (α, β) ∼ i=1 Exp (β) P P • Xi ∼ Gamma (αi , β) ∧ Xi ⊥ ⊥ Xj =⇒ Xi ∼ Gamma ( i αi , β) i Z ∞ Γ(α) • = xα−1 e−λx dx λα 0 Beta 1 Γ(α + β) α−1 • xα−1 (1 − x)β−1 = x (1 − x)β−1 B(α, β) Γ(α)Γ(β)   B(α + k, β)   α+k−1 • E Xk = = E X k−1 B(α, β) α+β+k−1 • Beta (1, 1) ∼ Unif (0, 1) 8

8

Probability and Moment Generating Functions   • GX (t) = E tX

σX (Y − E [Y ]) σY p V [X | Y ] = σX 1 − ρ2

E [X | Y ] = E [X] + ρ

|t| < 1 

t

Xt

• MX (t) = GX (e ) = E e



=E

"∞ # X (Xt)i i!

i=0

  ∞ X E Xi = · ti i! i=0

• P [X = 0] = GX (0) • P [X = 1] = G0X (0) • P [X = i] =

Conditional mean and variance

9.3

(i) GX (0)

Multivariate Normal (Precision matrix Σ−1 )   V [X1 ] · · · Cov [X1 , Xk ]   .. .. .. Σ=  . . .

Covariance matrix Σ

i! • E [X] = G0X (1− )   (k) • E X k = MX (0)   X! (k) • E = GX (1− ) (X − k)!

Cov [Xk , X1 ] · · · If X ∼ N (µ, Σ),

2

• V [X] = G00X (1− ) + G0X (1− ) − (G0X (1− )) d

−n/2

• GX (t) = GY (t) =⇒ X = Y

9 9.1

V [Xk ]

fX (x) = (2π)

|Σ|



1 exp − (x − µ)T Σ−1 (x − µ) 2



Properties

Multivariate Distributions

• • • •

Standard Bivariate Normal

Let X, Y ∼ N (0, 1) ∧ X ⊥ ⊥ Z where Y = ρX +

−1/2

p

1 − ρ2 Z

Z ∼ N (0, 1) ∧ X = µ + Σ1/2 Z =⇒ X ∼ N (µ, Σ) X ∼ N (µ, Σ) =⇒ Σ−1/2 (X − µ) ∼ N (0, 1)  X ∼ N (µ, Σ) =⇒ AX ∼ N Aµ, AΣAT  X ∼ N (µ, Σ) ∧ kak = k =⇒ aT X ∼ N aT µ, aT Σa
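The property X = µ + Σ^{1/2} Z is how multivariate normal draws are produced in practice. A minimal sketch, using a Cholesky factor as the matrix square root; the particular µ and Σ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                    # example mean (assumption)
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])    # example covariance (assumption)

L = np.linalg.cholesky(Sigma)                 # lower-triangular factor, L L^T = Sigma
Z = rng.standard_normal((100_000, 2))         # Z ~ N(0, I)
X = mu + Z @ L.T                              # X = mu + Sigma^(1/2) Z  =>  X ~ N(mu, Sigma)

print(X.mean(axis=0))                         # ~ mu
print(np.cov(X, rowvar=False))                # ~ Sigma
```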

Joint density f (x, y) =

  2 1 x + y 2 − 2ρxy p exp − 2(1 − ρ2 ) 2π 1 − ρ2

10

Convergence

Let {X1 , X2 , . . .} be a sequence of rv’s and let X be another rv. Let Fn denote the cdf of Xn and let F denote the cdf of X.

Conditionals (Y | X = x) ∼ N ρx, 1 − ρ2



and

(X | Y = y) ∼ N ρy, 1 − ρ2

 Types of convergence

Independence

D

1. In distribution (weakly, in law): Xn → X

X⊥ ⊥ Y ⇐⇒ ρ = 0

lim Fn (t) = F (t)

9.2

n→∞

Bivariate Normal

P

2. In probability: Xn → X

  Let X ∼ N µx , σx2 and Y ∼ N µy , σy2 . f (x, y) = " z=

x − µx σx

2πσx σy

2

 +

1 p



z exp − 2 2(1 − ρ2 ) 1−ρ

y − µy σy

∀t where F continuous

2

 − 2ρ

x − µx σx



(∀ε > 0) lim P [|Xn − X| > ε] = 0



y − µy σy

n→∞ as

#

3. Almost surely (strongly): Xn → X h i h i P lim Xn = X = P ω ∈ Ω : lim Xn (ω) = X(ω) = 1 n→∞

n→∞

9

qm

4. In quadratic mean (L2 ): Xn → X

CLT notations Zn ≈ N (0, 1)   σ2 ¯ Xn ≈ N µ, n   σ2 ¯ Xn − µ ≈ N 0, n  √ 2 ¯ n − µ) ≈ N 0, σ n(X √ ¯ n(Xn − µ) ≈ N (0, 1) n

  lim E (Xn − X)2 = 0

n→∞

Relationships qm

P

D

• Xn → X =⇒ Xn → X =⇒ Xn → X as P • Xn → X =⇒ Xn → X D P • Xn → X ∧ (∃c ∈ R) P [X = c] = 1 =⇒ Xn → X • • • •

Xn Xn Xn Xn

P

→X qm →X P →X P →X

∧ Yn ∧ Yn ∧ Yn =⇒

P

P

→ Y =⇒ Xn + Yn → X + Y qm qm → Y =⇒ Xn + Yn → X + Y P P → Y =⇒ Xn Yn → XY P ϕ(Xn ) → ϕ(X)

D

Continuity correction

D

 x + 12 − µ √ σ/ n     x − 12 − µ ¯ √ P Xn ≥ x ≈ 1 − Φ σ/ n

• Xn → X =⇒ ϕ(Xn ) → ϕ(X) qm • Xn → b ⇐⇒ limn→∞ E [Xn ] = b ∧ limn→∞ V [Xn ] = 0 qm ¯n → • X1 , . . . , Xn iid ∧ E [X] = µ ∧ V [X] < ∞ ⇐⇒ X µ

  ¯n ≤ x ≈ Φ P X

Slutzky’s Theorem D

P

D

• Xn → X and Yn → c =⇒ Xn + Yn → X + c D P D • Xn → X and Yn → c =⇒ Xn Yn → cX D D D • In general: Xn → X and Yn → Y =⇒ 6 Xn + Yn → X + Y

10.1

Delta method  Yn ≈ N

11

Law of Large Numbers (LLN)



σ2 µ, n



 =⇒ ϕ(Yn ) ≈ N

σ2 ϕ(µ), (ϕ (µ)) n 0

2



Statistical Inference iid

Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] < ∞.

Let X1 , · · · , Xn ∼ F if not otherwise noted.

Weak (WLLN)

11.1 n→∞

• Point estimator θbn of θ is a rv: θbn = g(X1 , . . . , Xn ) h i • bias(θbn ) = E θbn − θ

as ¯n → X µ

n→∞

P • Consistency: θbn → θ • Sampling distribution: F (θbn ) r h i b • Standard error: se(θn ) = V θbn h i h i • Mean squared error: mse = E (θbn − θ)2 = bias(θbn )2 + V θbn

Strong (WLLN)

10.2

Central Limit Theorem (CLT)

Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] = σ 2 . √ ¯ ¯n − µ n(Xn − µ) D X →Z Zn := q   = σ ¯n V X lim P [Zn ≤ z] = Φ(z)

n→∞

Point Estimation

¯n → µ X P

where Z ∼ N (0, 1)

z∈R
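A quick simulation of the CLT statement above: standardized means of iid draws (here from a skewed Exponential, an arbitrary choice for illustration) are compared against Φ at a few points.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n, reps = 50, 20_000
beta = 2.0                                          # Exp(beta): mean beta, variance beta^2 (illustrative)
X = rng.exponential(beta, size=(reps, n))
Zn = np.sqrt(n) * (X.mean(axis=1) - beta) / beta    # Zn = (X̄n − µ) / (σ/√n)

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
for z in (-1.0, 0.0, 1.0, 2.0):
    print(z, (Zn <= z).mean(), Phi(z))              # empirical P[Zn ≤ z] vs. Φ(z)
```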

• limn→∞ bias(θbn ) = 0 ∧ limn→∞ se(θbn ) = 0 =⇒ θbn is consistent θbn − θ D • Asymptotic normality: → N (0, 1) se • Slutzky’s Theorem often lets us replace se(θbn ) by some (weakly) consistent estimator σ bn . 10

11.2

Normal-Based Confidence Interval 



b 2 . Let zα/2 Suppose θbn ≈ N θ, se   and P −zα/2 < Z < zα/2 = 1 − α where Z ∼ N (0, 1). Then

  = Φ−1 (1 − (α/2)), i.e., P Z > zα/2 = α/2

b Cn = θbn ± zα/2 se

11.4 • • • •

Statistical Functionals Statistical functional: T (F ) Plug-in estimator of θ = (F ): θbn = T (Fbn ) R Linear functional: T (F ) = ϕ(x) dFX (x) Plug-in estimator for linear functional: n

Z

11.3

1X ϕ(Xi ) ϕ(x) dFbn (x) = n i=1   b 2 =⇒ T (Fbn ) ± zα/2 se b • Often: T (Fbn ) ≈ N T (F ), se T (Fbn ) =

Empirical distribution

Empirical Distribution Function (ECDF) Pn Fbn (x) =

i=1

I(Xi ≤ x) n

( 1 I(Xi ≤ x) = 0

Xi ≤ x Xi > x

Properties (for any fixed x) h i • E Fbn = F (x) h i F (x)(1 − F (x)) • V Fbn = n F (x)(1 − F (x)) D • mse = →0 n P • Fbn → F (x) Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X1 , . . . , Xn ∼ F )   2 P sup F (x) − Fbn (x) > ε = 2e−2nε

• pth quantile: F −1 (p) = inf{x : F (x) ≥ p} ¯n • µ b=X n 1 X ¯ n )2 • σ b2 = (Xi − X n − 1 i=1 Pn 1 b)3 i=1 (Xi − µ n • κ b= σ b3 j Pn ¯ n )(Yi − Y¯n ) (Xi − X qP • ρb = qP i=1 n n 2 ¯ ¯ (X − X ) i n i=1 i=1 (Yi − Yn )

12

Parametric Inference

 Let F = f (x; θ) : θ ∈ Θ be a parametric model with parameter space Θ ⊂ Rk and parameter θ = (θ1 , . . . , θk ).

12.1

Method of Moments

j th moment   αj (θ) = E X j =

x

Nonparametric 1 − α confidence band for F L(x) = max{Fbn − n , 0} U (x) = min{Fbn + n , 1} s   1 2 log = 2n α

P [L(x) ≤ F (x) ≤ U (x) ∀x] ≥ 1 − α
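A minimal sketch of the empirical CDF and the DKW confidence band above; the standard normal sample is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.standard_normal(200))
n, alpha = len(x), 0.05

Fhat = np.arange(1, n + 1) / n                  # ECDF evaluated at the sorted sample
eps = np.sqrt(np.log(2 / alpha) / (2 * n))      # ε_n = sqrt(log(2/α) / (2n))
L = np.clip(Fhat - eps, 0, 1)                   # lower band
U = np.clip(Fhat + eps, 0, 1)                   # upper band
print(eps, L[:3], U[:3])                        # P[L(x) ≤ F(x) ≤ U(x) for all x] ≥ 1 − α
```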

Z

xj dFX (x)

j th sample moment n

α bj =

1X j X n i=1 i

Method of moments estimator (MoM) α1 (θ) = α b1 α2 (θ) = α b2 .. .. .=. αk (θ) = α bk
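As a concrete example, for Gamma(α, β) with E[X] = αβ and V[X] = αβ² (the parametrization used in §1.2), matching the first two moments gives β̂ = σ̂²/x̄ and α̂ = x̄/β̂. A small sketch on simulated data (the true parameters are an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=beta_true, size=5_000)

m1 = x.mean()                  # first sample moment (≈ αβ)
v = x.var()                    # central second moment (≈ αβ²)
beta_hat = v / m1              # solve αβ = m1, αβ² = v
alpha_hat = m1 / beta_hat
print(alpha_hat, beta_hat)     # close to (3, 2) for large n
```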

11

Properties of the MoM estimator • θbn exists with probability tending to 1 P • Consistency: θbn → θ • Asymptotic normality: √

D

n(θb − θ) → N (0, Σ)

  where Σ = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T , ∂ −1 g = (g1 , . . . , gk ) and gj = ∂θ αj (θ)

12.2

Maximum Likelihood

Likelihood: Ln : Θ → [0, ∞) Ln (θ) =

n Y

f (Xi ; θ)

• Equivariance: θ̂n is the mle ⇒ φ(θ̂n) is the mle of φ(θ)
• Asymptotic normality:
  1. se ≈ √(1/In(θ)) and (θ̂n − θ)/se →ᴰ N(0, 1)
  2. ŝe ≈ √(1/In(θ̂n)) and (θ̂n − θ)/ŝe →ᴰ N(0, 1)
• Asymptotic optimality (efficiency), i.e., smallest variance for large samples: if θ̃n is any other estimator, the asymptotic relative efficiency is are(θ̃n, θ̂n) = V[θ̂n]/V[θ̃n] ≤ 1

i=1

• Approximately the Bayes estimator Log-likelihood `n (θ) = log Ln (θ) =

n X

log f (Xi ; θ)

i=1

12.2.1

Delta Method

b where ϕ is differentiable and ϕ0 (θ) 6= 0: If τ = ϕ(θ)

Maximum likelihood estimator (mle)

(b τn − τ ) D → N (0, 1) b τ) se(b

Ln (θbn ) = sup Ln (θ) θ

b is the mle of τ and where τb = ϕ(θ)

Score function s(X; θ) =

∂ log f (X; θ) ∂θ

b b b = ϕ0 (θ) b θn ) se se(

Fisher information I(θ) = Vθ [s(X; θ)] In (θ) = nI(θ) Fisher information (exponential family)   ∂ I(θ) = Eθ − s(X; θ) ∂θ Observed Fisher information Inobs (θ) = − Properties of the mle P • Consistency: θbn → θ

12.3

Multiparameter Models

Let θ = (θ1 , . . . , θk ) and θb = (θb1 , . . . , θbk ) be the mle. Hjj =

∂ 2 `n ∂θ2

Hjk =

Fisher information matrix 

n ∂2 X log f (Xi ; θ) ∂θ2 i=1

∂ 2 `n ∂θj ∂θk

··· .. . Eθ [Hk1 ] · · ·

Eθ [H11 ]  .. In (θ) = −  .

 Eθ [H1k ]  ..  . Eθ [Hkk ]

Under appropriate regularity conditions (θb − θ) ≈ N (0, Jn )

12

with Jn (θ) = In−1 . Further, if θbj is the j th component of θ, then

13

Hypothesis Testing H0 : θ ∈ Θ0

(θbj − θj ) D → N (0, 1) bj se h i b 2j = Jn (j, j) and Cov θbj , θbk = Jn (j, k) where se

12.3.1

Multiparameter delta method

Let τ = ϕ(θ1 , . . . , θk ) and let the gradient of ϕ be  ∂ϕ  ∂θ1   .   ∇ϕ =   ..   ∂ϕ  ∂θk 

Definitions • • • • • • • • • • •

Null hypothesis H0 Alternative hypothesis H1 Simple hypothesis θ = θ0 Composite hypothesis θ > θ0 or θ < θ0 Two-sided test: H0 : θ = θ0 versus H1 : θ 6= θ0 One-sided test: H0 : θ ≤ θ0 versus H1 : θ > θ0 Critical value c Test statistic T Rejection region R = {x : T (x) > c} Power function β(θ) = P [X ∈ R] Power of a test: 1 − P [Type II error] = 1 − β = inf β(θ)

(b τ − τ) D → N (0, 1) b τ) se(b

θ∈Θ0

H0 true H1 true

b ∇ϕ

T

Type II Error (β)

1−Fθ (T (X))

p-value < 0.01 0.01 − 0.05 0.05 − 0.1 > 0.1

  b Jbn ∇ϕ

b and ∇ϕ b = ∇ϕ b. and Jbn = Jn (θ) θ=θ

12.4

Retain H0 √

Reject H0 Type √ I Error (α) (power)

 • p-value = supθ∈Θ0 Pθ [T (X) ≥ T (x)] = inf α : T (x) ∈ Rα  • p-value = supθ∈Θ0 Pθ [T (X ? ) ≥ T (X)] = inf α : T (X) ∈ Rα | {z }

where b τ) = se(b

θ∈Θ1

• Test size: α = P [Type I error] = sup β(θ)

p-value

b Then, Suppose ∇ϕ θ=θb 6= 0 and τb = ϕ(θ).

r 

H1 : θ ∈ Θ1

versus

Parametric Bootstrap

Sample from f (x; θbn ) instead of from Fbn , where θbn could be the mle or method of moments estimator.
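A minimal parametric-bootstrap sketch for an Exp(β) sample, using the mle β̂ = x̄ and resampling from f(x; β̂); the data are simulated here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(2.0, size=100)               # "observed" data (simulated, illustrative)
n = len(x)

beta_hat = x.mean()                              # mle of β for Exp(β) (mean parametrization)
B = 2_000
boot = np.empty(B)
for b in range(B):
    xb = rng.exponential(beta_hat, size=n)       # sample from f(x; β̂), not from F̂n
    boot[b] = xb.mean()                          # re-estimate on the simulated sample

se_boot = boot.std(ddof=1)
print(beta_hat, se_boot, beta_hat / np.sqrt(n))  # bootstrap se vs. analytic se β/√n
```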

since T (X ? )∼Fθ

evidence very strong evidence against H0 strong evidence against H0 weak evidence against H0 little or no evidence against H0

Wald test
• Two-sided test: H0: θ = θ0 versus H1: θ ≠ θ0
• Reject H0 when |W| > z_{α/2}, where W = (θ̂ − θ0)/ŝe
• P[|W| > z_{α/2}] → α
• p-value = P_{θ0}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)

Likelihood ratio test (LRT)
• T(X) = sup_{θ∈Θ} Ln(θ) / sup_{θ∈Θ0} Ln(θ) = Ln(θ̂n) / Ln(θ̂_{n,0})
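A numeric sketch of the Wald test above for a Bernoulli proportion with H0: p = 0.5, taking θ̂ = p̂ and ŝe = √(p̂(1 − p̂)/n); the data-generating p = 0.6 is an assumption for illustration:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.6, size=200)          # data from p = 0.6 (illustrative)
p0 = 0.5

p_hat = x.mean()
se_hat = np.sqrt(p_hat * (1 - p_hat) / len(x))
W = (p_hat - p0) / se_hat

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
p_value = 2 * Phi(-abs(W))                  # p-value ≈ 2Φ(−|w|)
print(W, p_value)                           # reject H0 at level α if |W| > z_{α/2}
```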

13

D

• λ(X) = 2 log T (X) → χ2r−q where

k X

iid

Zi2 ∼ χ2k and Z1 , . . . , Zk ∼ N (0, 1)

 i=1  • p-value = Pθ0 [λ(X) > λ(x)] ≈ P χ2r−q > λ(x) Multinomial LRT   Xk X1 ,..., • mle: pbn = n n Xj k  Y Ln (b pn ) pbj • T (X) = = Ln (p0 ) p0j j=1   k X pbj D Xj log • λ(X) = 2 → χ2k−1 p 0j j=1

• xn = (x1 , . . . , xn ) • Prior density f (θ) • Likelihood f (xn | θ): joint density of the data n Y In particular, X n iid =⇒ f (xn | θ) = f (xi | θ) = Ln (θ) i=1

• Posterior density f (θ | xn ) R • Normalizing constant cn = f (xn ) = f (x | θ)f (θ) dθ • Kernel: part of a density that depends Ron θ R θLn (θ)f (θ) • Posterior mean θ¯n = θf (θ | xn ) dθ = R Ln (θ)f (θ) dθ

14.1

• The approximate size α LRT rejects H0 when λ(X) ≥ χ2k−1,α

Credible Intervals

Posterior interval

Pearson Chi-square Test

n

k X (Xj − E [Xj ])2 where E [Xj ] = np0j under H0 • T = E [Xj ] j=1 D

f (θ | xn ) dθ = 1 − α

P [θ ∈ (a, b) | x ] = a

Equal-tail credible interval

χ2k−1

• T →   • p-value = P χ2k−1 > T (x)

Z

a

f (θ | xn ) dθ =

Z

−∞

D

2 • Faster → Xk−1 than LRT, hence preferable for small n



f (θ | xn ) dθ = α/2

b

Highest posterior density (HPD) region Rn

Independence testing • I rows, J columns, X multinomial sample of size n = I ∗ J X • mles unconstrained: pbij = nij X

• mles under H0 : pb0ij = pbi· pb·j = Xni· n·j   PI PJ nX • LRT: λ = 2 i=1 j=1 Xij log Xi· Xij·j PI PJ (X −E[X ])2 • PearsonChiSq: T = i=1 j=1 ijE[Xij ]ij D

• LRT and Pearson → χ2k ν, where ν = (I − 1)(J − 1)

14

b

Z

1. P [θ ∈ Rn ] = 1 − α 2. Rn = {θ : f (θ | xn ) > k} for some k Rn is unimodal =⇒ Rn is an interval
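A small conjugate example tying credible intervals to §14.3.1: with a Beta(α, β) prior on a Bernoulli p, the posterior is Beta(α + Σxᵢ, β + n − Σxᵢ), and an equal-tail interval comes from its quantiles. The prior, the data, and the availability of scipy are assumptions of this sketch:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=50)             # Bernoulli data (illustrative)
a0, b0 = 2.0, 2.0                             # Beta(2, 2) prior (assumption)

a_post = a0 + x.sum()                         # posterior Beta(α + Σx, β + n − Σx)
b_post = b0 + len(x) - x.sum()

alpha = 0.05
lo, hi = beta.ppf([alpha / 2, 1 - alpha / 2], a_post, b_post)
print(a_post / (a_post + b_post), (lo, hi))   # posterior mean and 95% equal-tail interval
```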

14.2

Function of parameters

Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ }. Posterior CDF for τ

Bayesian Inference

H(r | xn ) = P [ϕ(θ) ≤ τ | xn ] =

Z

f (θ | xn ) dθ

A

Bayes’ Theorem

Posterior density

f (x | θ)f (θ) f (x | θ)f (θ) f (θ | x) = =R ∝ Ln (θ)f (θ) n f (x ) f (x | θ)f (θ) dθ

h(τ | xn ) = H 0 (τ | xn ) Bayesian delta method

Definitions n

• X = (X1 , . . . , Xn )

  b b se b ϕ0 (θ) τ | X n ≈ N ϕ(θ), 14

14.3

Priors

Continuous likelihood (subscript c denotes constant) Likelihood

Conjugate prior

Unif (0, θ)

Pareto(xm , k)

Exp (λ)

Gamma (α, β)

Choice • Subjective bayesianism. • Objective bayesianism. • Robust bayesianism.

i=1

 2

 2

N µ, σc

N µ0 , σ0

Types N µc , σ 2

• Flat: f (θ) ∝ constant R∞ • Proper: −∞ f (θ) dθ = 1 R∞ • Improper: −∞ f (θ) dθ = ∞ • Jeffrey’s prior (transformation-invariant): f (θ) ∝

p

I(θ)

f (θ) ∝

N µ, σ 2

p



Scaled Inverse Chisquare(ν, σ02 )



Normalscaled Inverse Gamma(λ, ν, α, β)

det(I(θ)) MVN(µ, Σc )

MVN(µ0 , Σ0 )

• Conjugate: f (θ) and f (θ | xn ) belong to the same parametric family MVN(µc , Σ) 14.3.1

Conjugate Priors Discrete likelihood

Likelihood Bern (p) Bin (p)

Conjugate prior Beta (α, β) Beta (α, β)

Posterior hyperparameters α+ α+

n X i=1 n X

xi , β + n − xi , β +

i=1

NBin (p) Po (λ)

Beta (α, β) Gamma (α, β)

α + rn, β + α+

n X

n X

n X

i=1

xi

Dir (α)

α+

n X

xi , β + n x(i)

i=1

Geo (p)

Beta (α, β)

n X

InverseWishart(κ, Ψ)

Pareto(xmc , k)

Gamma (α, β)

Pareto(xm , kc )

Pareto(x0 , k0 )

Gamma (αc , β)

Gamma (α0 , β0 )

Pn    n µ0 1 i=1 xi + + 2 , / σ2 σ2 σ02 σc  0 c−1 n 1 + 2 σ02 σc Pn νσ02 + i=1 (xi − µ)2 ν + n, ν+n 

n νλ + n¯ x , ν + n, α + , ν+n 2 n 1X γ(¯ x − λ)2 2 β+ (xi − x ¯) + 2 i=1 2(n + γ) −1 Σ−1 0 + nΣc

α + n, β +

n X i=1

−1

 −1 Σ−1 x ¯ , 0 µ0 + nΣ

 −1 −1 Σ−1 0 + nΣc n X n + κ, Ψ + (xi − µc )(xi − µc )T i=1 n X

xi xm c i=1 x0 , k0 − kn where k0 > kn n X α0 + nαc , β0 + xi α + n, β +

log

i=1

xi

i=1

Ni −

n X

14.4 xi

Bayesian Testing

If H0 : θ ∈ Θ0 :

i=1

Z Prior probability P [H0 ] =

i=1

i=1

Multinomial(p)

Posterior hyperparameters  max x(n) , xm , k + n n X α + n, β + xi

Posterior probability P [H0 | xn ] =

f (θ) dθ ZΘ0

f (θ | xn ) dθ

Θ0

Let H0 , . . . , HK−1 be K hypotheses. Suppose θ ∼ f (θ | Hk ), xi

f (xn | Hk )P [Hk ] P [Hk | xn ] = PK , n k=1 f (x | Hk )P [Hk ]

15

Marginal likelihood f (xn | Hi ) =

Z

1. Estimate VF [Tn ] with VFbn [Tn ]. 2. Approximate VFbn [Tn ] using simulation:

f (xn | θ, Hi )f (θ | Hi ) dθ

∗ ∗ (a) Repeat the following B times to get Tn,1 , . . . , Tn,B , an iid sample from b the sampling distribution implied by Fn i. Sample uniformly X ∗ , . . . , Xn∗ ∼ Fbn .

Θ

Posterior odds (of Hi relative to Hj ) P [Hi | xn ] P [Hj | xn ]

=

f (xn | Hi ) f (xn | Hj ) | {z }

Bayes Factor BFij

×

P [Hi ] P [Hj ] | {z }

1

ii. Compute Tn∗ = g(X1∗ , . . . , Xn∗ ). (b) Then

prior odds

Bayes factor log10 BF10

p∗ =

15

0 − 0.5 0.5 − 1 1−2 >2 p 1−p BF10 1+

p 1−p BF10

BF10

evidence

1 − 1.5 1.5 − 10 10 − 100 > 100

Weak Moderate Strong Decisive

vboot

B B X 1 X ∗ ∗ bb = 1 T − =V T n,b Fn B B r=1 n,r

!2

b=1

16.1.1

where p = P [H1 ] and p∗ = P [H1 | xn ]

Bootstrap Confidence Intervals

Normal-based interval b boot Tn ± zα/2 se

Exponential Family

Pivotal interval

Scalar parameter

1. 2. 3. 4.

fX (x | θ) = h(x) exp {η(θ)T (x) − A(θ)} = h(x)g(θ) exp {η(θ)T (x)} Vector parameter ( fX (x | θ) = h(x) exp

s X

Location parameter θ = T (F ) Pivot Rn = θbn − θ Let H(r) = P [Rn ≤ r] be the cdf of Rn ∗ ∗ Let Rn,b = θbn,b − θbn . Approximate H using bootstrap:

)

B 1 X ∗ b H(r) = I(Rn,b ≤ r) B

ηi (θ)Ti (x) − A(θ)

i=1

b=1

= h(x) exp {η(θ) · T (x) − A(θ)} = h(x)g(θ) exp {η(θ) · T (x)} Natural form fX (x | η) = h(x) exp {η · T(x) − A(η)} = h(x)g(η) exp {η · T(x)}  = h(x)g(η) exp η T T(x)

∗ ∗ , . . . , θbn,B ) 5. θβ∗ = β sample quantile of (θbn,1 ∗ ∗ 6. rβ∗ = β sample quantile of (Rn,1 , . . . , Rn,B ), i.e., rβ∗ = θβ∗ − θbn   7. Approximate 1 − α confidence interval Cn = a ˆ, ˆb where

a ˆ= ˆb =

16 16.1

Sampling Methods The Bootstrap

Let Tn = g(X1 , . . . , Xn ) be a statistic.

  b −1 1 − α = θbn − H 2 b −1 α = θbn − H 2

∗ θbn − r1−α/2 =

∗ 2θbn − θ1−α/2

∗ θbn − rα/2 =

∗ 2θbn − θα/2

Percentile interval: Cn = (θ*_{α/2}, θ*_{1−α/2})
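A minimal nonparametric-bootstrap sketch for Tn = median (an arbitrary statistic chosen for illustration), producing the bootstrap se and the normal, percentile, and pivotal intervals above; the sample is simulated only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=200)            # observed sample (simulated, illustrative)
T = np.median                                 # the statistic Tn = g(X1, ..., Xn)

theta_hat = T(x)
B = 5_000
boot = np.array([T(rng.choice(x, size=len(x), replace=True)) for _ in range(B)])

se_boot = boot.std(ddof=1)
z = 1.96                                      # z_{α/2} for α = 0.05
normal_ci = (theta_hat - z * se_boot, theta_hat + z * se_boot)
percentile_ci = tuple(np.quantile(boot, [0.025, 0.975]))
pivotal_ci = (2 * theta_hat - np.quantile(boot, 0.975),
              2 * theta_hat - np.quantile(boot, 0.025))
print(se_boot, normal_ci, percentile_ci, pivotal_ci)
```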

16.2

Rejection Sampling

Setup • We can easily sample from g(θ) • We want to sample from h(θ), but it is difficult k(θ) k(θ) dθ • Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ • We know h(θ) up to a proportional constant: h(θ) = R

Algorithm 1. Draw θcand ∼ g(θ) 2. Generate u ∼ Unif (0, 1) k(θcand ) 3. Accept θcand if u ≤ M g(θcand ) 4. Repeat until B values of θcand have been accepted Example • • • •

Loss functions • Squared error loss: L(θ, a) = (θ − a)2 ( K1 (θ − a) a − θ < 0 • Linear loss: L(θ, a) = K2 (a − θ) a − θ ≥ 0 • Absolute error loss: L(θ, a) = |θ − a| (linear loss with K1 = K2 ) • Lp loss: L(θ, a) = |θ − a|p ( 0 a=θ • Zero-one loss: L(θ, a) = 1 a 6= θ

17.1

• We can easily sample from the prior g(θ) = f(θ)
• Target is the posterior h(θ) ∝ k(θ) = f(xⁿ | θ) f(θ)
• Envelope condition: f(xⁿ | θ) ≤ f(xⁿ | θ̂n) = Ln(θ̂n) ≡ M
• Algorithm: 1. Draw θ_cand ∼ f(θ)  2. Generate u ∼ Unif(0, 1)  3. Accept θ_cand if u ≤ Ln(θ_cand)/Ln(θ̂n)
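A sketch of this posterior rejection sampler for Bernoulli data with a Unif(0, 1) prior; the data-generating p and the sample size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=40)           # Bernoulli data (illustrative)
s, n = x.sum(), len(x)

def log_lik(p):                             # log Ln(p) for Bernoulli data
    return s * np.log(p) + (n - s) * np.log(1 - p)

p_mle = s / n                               # envelope: Ln(θ) ≤ Ln(θ̂) = M
samples = []
while len(samples) < 5_000:
    cand = rng.uniform()                    # draw from the prior f(θ) = Unif(0, 1)
    u = rng.uniform()
    if np.log(u) <= log_lik(cand) - log_lik(p_mle):   # accept w.p. Ln(cand)/M
        samples.append(cand)

print(np.mean(samples))                     # ≈ posterior mean (s + 1)/(n + 2) under the flat prior
```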

16.3

• Decision rule: synonymous for an estimator θb • Action a ∈ A: possible value of the decision rule. In the estimation b context, the action is just an estimate of θ, θ(x). • Loss function L: consequences of taking action a when true state is θ or b L : Θ × A → [−k, ∞). discrepancy between θ and θ,

Posterior risk Z

h i b b L(θ, θ(x))f (θ | x) dθ = Eθ|X L(θ, θ(x))

Z

h i b b L(θ, θ(x))f (x | θ) dx = EX|θ L(θ, θ(X))

r(θb | x) = (Frequentist) risk b = R(θ, θ) Bayes risk ZZ

Importance Sampling

b = r(f, θ)

Sample from an importance function g rather than target density h. Algorithm to obtain an approximation to E [q(θ) | xn ]:

Decision Theory

Definitions • Unknown quantity affecting our decision: θ ∈ Θ

h i b b L(θ, θ(x))f (x, θ) dx dθ = Eθ,X L(θ, θ(X))

h h ii h i b = Eθ EX|θ L(θ, θ(X) b b r(f, θ) = Eθ R(θ, θ) h h ii h i b = EX Eθ|X L(θ, θ(X) b r(f, θ) = EX r(θb | X)

iid

1. Sample from the prior: θ_1, …, θ_B ∼ f(θ)
2. w_i = Ln(θ_i) / Σ_{j=1}^B Ln(θ_j),  i = 1, …, B
3. E[q(θ) | xⁿ] ≈ Σ_{i=1}^B q(θ_i) w_i
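A sketch of these three steps for the same Bernoulli/Unif(0, 1) setup used in §16.2 (an illustrative assumption), estimating the posterior mean with q(θ) = θ:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=40)             # Bernoulli data (illustrative)
s, n = x.sum(), len(x)

B = 50_000
theta = rng.uniform(size=B)                   # θ_1, ..., θ_B from the prior f(θ) = Unif(0, 1)
log_w = s * np.log(theta) + (n - s) * np.log(1 - theta)   # log Ln(θ_i)
w = np.exp(log_w - log_w.max())               # stabilize before normalizing
w /= w.sum()                                  # w_i = Ln(θ_i) / Σ_j Ln(θ_j)

post_mean = np.sum(theta * w)                 # E[q(θ) | xⁿ] ≈ Σ q(θ_i) w_i
print(post_mean, (s + 1) / (n + 2))           # compare with the exact posterior mean
```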

17

Risk

17.2

Admissibility

• θb0 dominates θb if

b ∀θ : R(θ, θb0 ) ≤ R(θ, θ) b ∃θ : R(θ, θb0 ) < R(θ, θ)

• θb is inadmissible if there is at least one other estimator θb0 that dominates it. Otherwise it is called admissible.

17

17.3

Bayes Rule

Residual sums of squares (rss)

Bayes rule (or Bayes estimator) rss(βb0 , βb1 ) =

b = inf e r(f, θ) e • r(f, θ) θ R b b b = r(θb | x)f (x) dx • θ(x) = inf r(θ | x) ∀x =⇒ r(f, θ)

n X

ˆ2i

i=1

Least square estimates Theorems

βbT = (βb0 , βb1 )T : min rss b0 ,β b1 β

• Squared error loss: posterior mean • Absolute error loss: posterior median • Zero-one loss: posterior mode

17.4

¯n βb0 = Y¯n − βb1 X Pn Pn ¯ n )(Yi − Y¯n ) ¯ (Xi − X i=1 Xi Yi − nXY Pn βb1 = i=1 = P n ¯ 2 2 2 i=1 (Xi − Xn ) i=1 Xi − nX h i β  0 E βb | X n = β1   P h i σ 2 n−1 ni=1 Xi2 −X n n b V β |X = −X n 1 nsX r Pn 2 σ b i=1 Xi √ b βb0 ) = se( n sX n σ b √ b βb1 ) = se( sX n

Minimax Rules

Maximum risk b = sup R(θ, θ) b ¯ θ) R(

¯ R(a) = sup R(θ, a)

θ

θ

Minimax rule e e = inf sup R(θ, θ) b = inf R( ¯ θ) sup R(θ, θ) θ

θe

θe

θ

b =c θb = Bayes rule ∧ ∃c : R(θ, θ) Least favorable prior θbf = Bayes rule ∧ R(θ, θbf ) ≤ r(f, θbf ) ∀θ

18

Linear Regression

Pn b2 = where s2X = n−1 i=1 (Xi − X n )2 and σ Further properties:

Pn

ˆ2i i=1 

(unbiased estimate).

P P • Consistency: βb0 → β0 and βb1 → β1 • Asymptotic normality:

Definitions • Response variable Y • Covariate X (aka predictor variable or feature)

18.1

1 n−2

βb0 − β0 D → N (0, 1) b βb0 ) se(

Simple Linear Regression

βb1 − β1 D → N (0, 1) b βb1 ) se(

• Approximate 1 − α confidence intervals for β0 and β1 :

Model Yi = β0 + β1 Xi + i

and

E [i | Xi ] = 0, V [i | Xi ] = σ 2

b βb0 ) βb0 ± zα/2 se(

b βb1 ) and βb1 ± zα/2 se(

Fitted line • Wald test for H0 : β1 = 0 vs. H1 : β1 6= 0: reject H0 if |W | > zα/2 where b βb1 ). W = βb1 /se(
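The least-squares estimates and standard errors of this subsection can be checked numerically; a minimal numpy sketch on simulated data (the true coefficients and noise level are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(0, 10, n)
Y = 1.5 + 2.0 * X + rng.normal(0, 1.0, n)        # β0 = 1.5, β1 = 2, σ = 1 (illustrative)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
resid = Y - (b0 + b1 * X)
sigma2_hat = np.sum(resid ** 2) / (n - 2)        # unbiased estimate of σ²

sX = np.sqrt(np.sum((X - X.mean()) ** 2) / n)    # s_X
se_b1 = np.sqrt(sigma2_hat) / (sX * np.sqrt(n))
se_b0 = np.sqrt(sigma2_hat) * np.sqrt(np.mean(X ** 2)) / (sX * np.sqrt(n))
print((b0, se_b0), (b1, se_b1))                  # β̂j ± z_{α/2} ŝe(β̂j) gives the CIs
```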

rb(x) = βb0 + βb1 x Predicted (fitted) values Ybi = rb(Xi ) Residuals



ˆi = Yi − Ybi = Yi − βb0 + βb1 Xi

R2 

Pn b Pn 2 2 ˆ rss i=1 (Yi − Y ) R = Pn = 1 − Pn i=1 i 2 = 1 − 2 tss i=1 (Yi − Y ) i=1 (Yi − Y ) 2

18

If the (k × k) matrix X T X is invertible,

Likelihood L= L1 =

n Y i=1 n Y

f (Xi , Yi ) =

n Y

fX (Xi ) ×

i=1

n Y

βb = (X T X)−1 X T Y h i V βb | X n = σ 2 (X T X)−1

fY |X (Yi | Xi ) = L1 × L2

i=1

βb ≈ N β, σ 2 (X T X)−1

fX (Xi )



Estimate regression function

i=1 n Y

(

2 1 X Yi − (β0 − β1 Xi ) fY |X (Yi | Xi ) ∝ σ −n exp − 2 L2 = 2σ i i=1

) rb(x) =

k X

βbj xj

j=1

Under the assumption of Normality, the least squares estimator is also the mle

Unbiased estimate for σ

2

n 1 X 2 ˆ σ b = n − k i=1 i

n

2

1X 2 ˆ σ b2 = n i=1 i

ˆ = X βb − Y

mle

18.2

¯ µ b=X

Prediction

Observe X = x∗ of the covariate and want to predict their outcome Y∗ . Yb∗ = βb0 + βb1 x∗ i h i h i h i V Yb∗ = V βb0 + x2∗ V βb1 + 2x∗ Cov βb0 , βb1 h

Prediction interval ξbn2 = σ b2

  Pn 2 i=1 (Xi − X∗ ) P ¯ 2j + 1 n i (Xi − X)

n−k 2 σ n

1 − α Confidence interval b βbj ) βbj ± zα/2 se(

18.4

Model Selection

Consider predicting a new observation Y ∗ for covariates X ∗ and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n. Issues • Underfitting: too few covariates yields high bias • Overfitting: too many covariates yields high variance

Yb∗ ± zα/2 ξbn

18.3

σ b2 =

Procedure 1. Assign a score to each model 2. Search through all models to find the one with the highest score

Multiple Regression Y = Xβ + 

Hypothesis testing H0 : βj = 0 vs. H1 : βj 6= 0 ∀j ∈ J

where 

X11  .. X= . Xn1

··· .. . ···

 X1k ..  .  Xnk



 β1   β =  ... 

  1  ..  =.

βk

n

Mean squared prediction error (mspe) h i mspe = E (Yb (S) − Y ∗ )2 Prediction risk

Likelihood  1 2 −n/2 L(µ, Σ) = (2πσ ) exp − 2 rss 2σ 

R(S) =

n X

mspei =

i=1

n X

h i E (Ybi (S) − Yi∗ )2

i=1

Training error rss = (y − Xβ)T (y − Xβ) = kY − Xβk2 =

N X (Yi − xTi β)2 i=1

btr (S) = R

n X (Ybi (S) − Yi )2 i=1

19

R2 btr (S) rss(S) R R2 (S) = 1 − =1− =1− tss tss

Pn b 2 i=1 (Yi (S) − Y ) P n 2 i=1 (Yi − Y )

Frequentist risk Z h i Z R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx

The training error is a downward-biased estimate of the prediction risk. h i btr (S) < R(S) E R h i btr (S)) = E R btr (S) − R(S) = −2 bias(R

n X

h i b(x) = E fbn (x) − f (x) h i v(x) = V fbn (x)

h i Cov Ybi , Yi

i=1

Adjusted R

2

R2 (S) = 1 −

n − 1 rss n − k tss

19.1.1

Mallow’s Cp statistic

Histograms

Definitions

b btr (S) + 2kb R(S) =R σ 2 = lack of fit + complexity penalty • • • •

Akaike Information Criterion (AIC) AIC(S) = `n (βbS , σ bS2 ) − k Bayesian Information Criterion (BIC)

Number of bins m 1 Binwidth h = m Bin Bj has νj observations R Define pbj = νj /n and pj = Bj f (u) du

Histogram estimator

k BIC(S) = `n (βbS , σ bS2 ) − log n 2 Validation and training bV (S) = R

m X

fbn (x) =

(Ybi∗ (S) − Yi∗ )2

m = |{validation data}|, often

i=1

Leave-one-out cross-validation bCV (S) = R

n X

2

(Yi − Yb(i) ) =

i=1

n X i=1

Yi − Ybi (S) 1 − Uii (S)

!2

U (S) = XS (XST XS )−1 XS (“hat matrix”)

19 19.1

Non-parametric Function Estimation

n n or 4 2

m X pbj j=1

h

I(x ∈ Bj )

h i p j E fbn (x) = h h i p (1 − p ) j j V fbn (x) = nh2 Z 1 h2 2 b (f 0 (u)) du + R(fn , f ) ≈ 12 nh !1/3 1 6 h∗ = 1/3 R 2 du n (f 0 (u))  2/3 Z 1/3 3 C 2 C= (f 0 (u)) du R∗ (fbn , f ) ≈ 2/3 4 n

Density Estimation

R Estimate f (x), where f (x) = P [X ∈ A] = A f (x) dx. Integrated square error (ise) Z  Z 2 L(f, fbn ) = f (x) − fbn (x) dx = J(h) + f 2 (x) dx

Cross-validation estimate of E [J(h)] Z JbCV (h) =

n

m

2Xb 2 n+1 X 2 f(−i) (Xi ) = pb fbn2 (x) dx − − n i=1 (n − 1)h (n − 1)h j=1 j 20

19.1.2

Kernel Density Estimator (KDE)

k-nearest Neighbor Estimator X 1 Yi where Nk (x) = {k values of x1 , . . . , xn closest to x} rb(x) = k

Kernel K • • • •

i:xi ∈Nk (x)

K(x) ≥ 0 R K(x) dx = 1 R xK(x) dx = 0 R 2 2 >0 x K(x) dx ≡ σK

Nadaraya-Watson Kernel Estimator rb(x) =

n X

wi (x)Yi

i=1

KDE

x−xi  h Pn x−xj K j=1 h

K

wi (x) =

  n 1X1 x − Xi fbn (x) = K n i=1 h h Z Z 1 1 4 00 2 b R(f, fn ) ≈ (hσK ) (f (x)) dx + K 2 (x) dx 4 nh Z Z −2/5 −1/5 −1/5 c c2 c3 2 2 h∗ = 1 c = σ , c = K (x) dx, c = (f 00 (x))2 dx 1 2 3 K n1/5 Z 4/5 Z 1/5 c4 5 2 2/5 ∗ 2 00 2 b R (f, fn ) = 4/5 K (x) dx (f ) dx c4 = (σK ) 4 n | {z }

R(b rn , r) ≈

h4 4 Z

+



∈ [0, 1]

4 Z  2 f 0 (x) x2 K 2 (x) dx r00 (x) + 2r0 (x) dx f (x) R σ 2 K 2 (x) dx dx nhf (x) Z

c1 n1/5 c2 R∗ (b rn , r) ≈ 4/5 n h∗ ≈

C(K)

Cross-validation estimate of E [J(h)]

Epanechnikov Kernel ( K(x) =



3 √ 4 5(1−x2 /5)

|x| <

0

otherwise

JbCV (h) =

5

n X

(Yi − rb(−i) (xi ))2 =

i=1

n X i=1

(Yi − rb(xi ))2 1−

Pn

j=1

Cross-validation estimate of E [J(h)] Z JbCV (h) =

K ∗ (x) = K (2) (x) − 2K(x)

19.2

19.3

  n n n 2Xb 1 X X ∗ Xi − Xj 2 2 b fn (x) dx − f(−i) (Xi ) ≈ + K(0) K n i=1 hn2 i=1 j=1 h nh K (2) (x) =

Approximation r(x) =

∞ X j=1

βj φj (x) ≈

J X

βj φj (x)

i=1

Multivariate regression

Non-parametric Regression

Estimate f (x) where f (x) = E [Y | X = x]. Consider pairs of points (x1 , Y1 ), . . . , (xn , Yn ) related by

K(0)  x−x  j K h

Smoothing Using Orthogonal Functions

Z K(x − y)K(y) dy

!2

where

ηi = i

Y = Φβ + η  φ0 (x1 )  .. and Φ =  .

··· .. . φ0 (xn ) · · ·

 φJ (x1 ) ..  .  φJ (xn )

Least squares estimator Yi = r(xi ) + i E [i ] = 0 V [i ] = σ 2

βb = (ΦT Φ)−1 ΦT Y 1 ≈ ΦT Y (for equally spaced observations only) n

21

Cross-validation estimate of E [J(h)]  2 n J X X bCV (J) = Yi − R φj (xi )βbj,(−i)  i=1

20

j=1

20.2

Poisson Processes

Poisson process • {Xt : t ∈ [0, ∞)} = number of events up to and including time t • X0 = 0 • Independent increments:

Stochastic Processes

Stochastic Process ( {0, ±1, . . . } = Z discrete T = [0, ∞) continuous

{Xt : t ∈ T } • Notations Xt , X(t) • State space X • Index set T

20.1

∀t0 < · · · < tn : Xt1 − Xt0 ⊥ ⊥ ··· ⊥ ⊥ Xtn − Xtn−1 • Intensity function λ(t) – P [Xt+h − Xt = 1] = λ(t)h + o(h) – P [Xt+h − Xt = 2] = o(h) • Xs+t − Xs ∼ Po (m(s + t) − m(s)) where m(t) =

Markov Chains

Markov chain

Rt 0

λ(s) ds

Homogeneous Poisson process

P [Xn = x | X0 , . . . , Xn−1 ] = P [Xn = x | Xn−1 ]

∀n ∈ T, x ∈ X λ(t) ≡ λ =⇒ Xt ∼ Po (λt)

Transition probabilities pij ≡ P [Xn+1 = j | Xn = i] pij (n) ≡ P [Xm+n = j | Xm = i]

λ>0

Waiting times n-step Wt := time at which Xt occurs

Transition matrix P (n-step: Pn ) • (i, j) element is pij • pij > 0 P • i pij = 1

  1 Wt ∼ Gamma t, λ

Chapman-Kolmogorov

Interarrival times

pij (m + n) =

X

pij (m)pkj (n)

St = Wt+1 − Wt

k

Pm+n = Pm Pn Pn = P × · · · × P = Pn

St ∼ Exp

  1 λ

Marginal probability µn = (µn (1), . . . , µn (N ))

where

µi (i) = P [Xn = i]

St

µ0 , initial distribution µn = µ0 Pn
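A small numpy illustration of the transition matrix, the Chapman–Kolmogorov relation P_{m+n} = P_m P_n, and the marginal µn = µ0 Pⁿ; the two-state chain is an arbitrary example:

```python
import numpy as np

P = np.array([[0.9, 0.1],          # example transition matrix (assumption); rows sum to 1
              [0.4, 0.6]])
mu0 = np.array([1.0, 0.0])         # initial distribution

P5 = np.linalg.matrix_power(P, 5)
# Chapman–Kolmogorov: P^2 P^3 = P^5
assert np.allclose(np.linalg.matrix_power(P, 2) @ np.linalg.matrix_power(P, 3), P5)

mu5 = mu0 @ P5                     # µn = µ0 Pⁿ
print(P5)
print(mu5, mu5.sum())              # a probability vector
```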

Wt−1

Wt

t 22

21

Time Series

21.1

Mean function

Z

Strictly stationary



µxt = E [xt ] =

Stationary Time Series

xft (x) dx

P [xt1 ≤ c1 , . . . , xtk ≤ ck ] = P [xt1 +h ≤ c1 , . . . , xtk +h ≤ ck ]

−∞

Autocovariance function γx (s, t) = E [(xs − µs )(xt − µt )] = E [xs xt ] − µs µt   γx (t, t) = E (xt − µt )2 = V [xt ] Autocorrelation function (ACF)

∀k ∈ N, tk , ck , h ∈ Z Weakly stationary   ∀t ∈ Z • E x2t < ∞  2 • E xt = m ∀t ∈ Z • γx (s, t) = γx (s + r, t + r)

γ(s, t) Cov [xs , xt ] =p ρ(s, t) = p V [xs ] V [xt ] γ(s, s)γ(t, t) Cross-covariance function (CCV)

∀r, s, t ∈ Z

Autocovariance function

γxy (s, t) = E [(xs − µxs )(yt − µyt )]

• • • • •

Cross-correlation function (CCF) ρxy (s, t) = p

γxy (s, t) γx (s, s)γy (t, t)

γ(h) = E [(xt+h − µ)(xt − µ)]   γ(0) = E (xt − µ)2 γ(0) ≥ 0 γ(0) ≥ |γ(h)| γ(h) = γ(−h)

∀h ∈ Z

Backshift operator B k (xt ) = xt−k

Autocorrelation function (ACF)

Difference operator

γ(t + h, t) γ(h) Cov [xt+h , xt ] ρx (h) = p =p = γ(0) V [xt+h ] V [xt ] γ(t + h, t + h)γ(t, t)

∇d = (1 − B)d White noise 2 • wt ∼ wn(0, σw )

• • • •

Jointly stationary time series

iid

2 0, σw



Gaussian: wt ∼ N E [wt ] = 0 t ∈ T V [wt ] = σ 2 t ∈ T γw (s, t) = 0 s 6= t ∧ s, t ∈ T

γxy (h) = E [(xt+h − µx )(yt − µy )] ρxy (h) = p

Random walk Linear process

• Drift δ Pt • xt = δt + j=1 wj • E [xt ] = δt

xt = µ +

k X j=−k

aj xt−j

∞ X

ψj wt−j

where

j=−∞

Symmetric moving average mt =

γxy (h) γx (0)γy (h)

where aj = a−j ≥ 0 and

k X j=−k

aj = 1

γ(h) =

∞ X

|ψj | < ∞

j=−∞

2 σw

∞ X

ψj+h ψj

j=−∞

23

21.2

Estimation of Correlation

21.3.1

Detrending

Least squares

Sample mean n

1X x ¯= xt n t=1 Sample variance  n  |h| 1 X 1− γx (h) V [¯ x] = n n h=−n

1. Choose trend model, e.g., µt = β0 + β1 t + β2 t2 2. Minimize rss to obtain trend estimate µ bt = βb0 + βb1 t + βb2 t2 3. Residuals , noise wt Moving average • The low-pass filter vt is a symmetric moving average mt with aj =

Sample autocovariance function n−h 1 X γ b(h) = (xt+h − x ¯)(xt − x ¯) n t=1

vt =

1 2k+1 :

k X 1 xt−1 2k + 1 i=−k

Pk 1 • If 2k+1 i=−k wt−j ≈ 0, a linear trend function µt = β0 + β1 t passes without distortion

Sample autocorrelation function ρb(h) =

γ b(h) γ b(0)

Differencing • µt = β0 + β1 t =⇒ ∇xt = β1

Sample cross-variance function γ bxy (h) =

n−h 1 X (xt+h − x ¯)(yt − y) n t=1

21.4

ARIMA models

Autoregressive polynomial φ(z) = 1 − φ1 z − · · · − φp zp

Sample cross-correlation function γ bxy (h) ρbxy (h) = p γ bx (0)b γy (0)

z ∈ C ∧ φp 6= 0

Autoregressive operator φ(B) = 1 − φ1 B − · · · − φp B p

Properties 1 • σρbx (h) = √ if xt is white noise n 1 • σρbxy (h) = √ if xt or yt is white noise n

21.3

Non-Stationary Time Series

Autoregressive model order p, AR (p) xt = φ1 xt−1 + · · · + φp xt−p + wt ⇐⇒ φ(B)xt = wt AR (1) • xt = φk (xt−k ) +

k−1 X

φj (wt−j )

k→∞,|φ|<1

j=0

Classical decomposition model

=

∞ X

φj (wt−j )

j=0

|

{z

}

linear process

xt = µt + st + wt • µt = trend • st = seasonal component • wt = random noise term

• E [xt ] =

P∞

j=0

φj (E [wt−j ]) = 0

• γ(h) = Cov [xt+h , xt ] = • ρ(h) =

γ(h) γ(0)

2 h σw φ 1−φ2

= φh

• ρ(h) = φρ(h − 1) h = 1, 2, . . .
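A short simulation checking the AR(1) facts above: the sample ACF of x_t = φ x_{t−1} + w_t should track ρ(h) = φ^h (the particular φ and σ_w are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
phi, sigma_w, n = 0.7, 1.0, 20_000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal(0, sigma_w)   # AR(1) recursion

def acf(series, h):                                  # sample autocorrelation ρ̂(h)
    c = series - series.mean()
    return np.sum(c[h:] * c[:-h]) / np.sum(c ** 2) if h > 0 else 1.0

for h in range(5):
    print(h, acf(x, h), phi ** h)                    # ρ̂(h) vs. ρ(h) = φ^h
```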

24

Moving average polynomial

Seasonal ARIMA

θ(z) = 1 + θ1 z + · · · + θq zq

z ∈ C ∧ θq 6= 0

Moving average operator θ(B) = 1 + θ1 B + · · · + θp B p

21.4.1

MA (q) (moving average model order q) xt = wt + θ1 wt−1 + · · · + θq wt−q ⇐⇒ xt = θ(B)wt E [xt ] =

q X

• Denoted by ARIMA (p, d, q) × (P, D, Q)s d s • ΦP (B s )φ(B)∇D s ∇ xt = δ + ΘQ (B )θ(B)wt Causality and Invertibility

ARMA (p, q) is causal (future-independent) ⇐⇒ ∃{ψj } : xt =

θj E [wt−j ] = 0

γ(h) = Cov [xt+h , xt ] =

2 σw 0

Pq−h j=0

θj θj+h

j=0

ψj < ∞ such that

wt−j = ψ(B)wt

j=0

j=0

(

∞ X

P∞

0≤h≤q h>q

ARMA (p, q) is invertible ⇐⇒ ∃{πj } :

MA (1) xt = wt + θwt−1  2 2  (1 + θ )σw h = 0 2 γ(h) = θσw h=1   0 h>1 ( θ h=1 2 ρ(h) = (1+θ ) 0 h>1

π(B)xt =

P∞

j=0

∞ X

πj < ∞ such that

Xt−j = wt

j=0

Properties • ARMA (p, q) causal ⇐⇒ roots of φ(z) lie outside the unit circle ψ(z) =

∞ X

ψj z j =

j=0

θ(z) φ(z)

|z| ≤ 1

ARMA (p, q) xt = φ1 xt−1 + · · · + φp xt−p + wt + θ1 wt−1 + · · · + θq wt−q

• ARMA (p, q) invertible ⇐⇒ roots of θ(z) lie outside the unit circle

φ(B)xt = θ(B)wt

π(z) =

Partial autocorrelation function (PACF)

∞ X

πj z j =

j=0

• xh−1 , regression of xi on {xh−1 , xh−2 , . . . , x1 } i • φhh = corr(xh − xh−1 , x0 − xh−1 ) h≥2 0 h • E.g., φ11 = corr(x1 , x0 ) = ρ(1)

φ(z) θ(z)

|z| ≤ 1

Behavior of the ACF and PACF for causal and invertible ARMA models

ACF PACF

ARIMA (p, d, q) ∇d xt = (1 − B)d xt is ARMA (p, q)

AR (p) tails off cuts off after lag p

MA (q) cuts off after lag q tails off q

ARMA (p, q) tails off tails off

φ(B)(1 − B)d xt = θ(B)wt Exponentially Weighted Moving Average (EWMA) xt = xt−1 + wt − λwt−1 xt =

∞ X

(1 − λ)λj−1 xt−j + wt

when |λ| < 1

21.5

Spectral Analysis

Periodic process xt = A cos(2πωt + φ) = U1 cos(2πωt) + U2 sin(2πωt)

j=1

x ˜n+1 = (1 − λ)xn + λ˜ xn

• Frequency index ω (cycles per unit time), period 1/ω

25

• Amplitude A • Phase φ • U1 = A cos φ and U2 = A sin φ often normally distributed rv’s

Discrete Fourier Transform (DFT) d(ωj ) = n−1/2

xt =

q X

Fourier/Fundamental frequencies (Uk1 cos(2πωk t) + Uk2 sin(2πωk t))

ωj = j/n

k=1

• Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rv’s with variances σk2 Pq • γ(h) = k=1 σk2 cos(2πωk h)   Pq • γ(0) = E x2t = k=1 σk2

Inverse DFT xt = n−1/2

j=0

Scaled Periodogram

σ 2 −2πiω0 h σ 2 2πiω0 h e + e = 2 2 Z 1/2 = e2πiωh dF (ω)

4 I(j/n) n !2 n 2X = xt cos(2πtj/n + n t=1

P (j/n) =

−1/2

Spectral distribution function ω < −ω0 −ω ≤ ω < ω0 ω ≥ ω0

22 22.1

γ(h)e−2πiωh



h=−∞ h=−∞

!2

Gamma Function ∞

ts−1 e−t dt 0 Z ∞ • Upper incomplete: Γ(s, x) = ts−1 e−t dt x Z x • Lower incomplete: γ(s, x) = ts−1 e−t dt

• Ordinary: Γ(s) =

Spectral density ∞ X

n

2X xt sin(2πtj/n n t=1

Math Z

• F (−∞) = F (−1/2) = 0 • F (∞) = F (1/2) = γ(0)

f (ω) =

d(ωj )e2πiωj t

I(j/n) = |d(j/n)|2

γ(h) = σ 2 cos(2πω0 h)

  0 F (ω) = σ 2 /2   2 σ

n−1 X

Periodogram

Spectral representation of a periodic process

P∞

xt e−2πiωj t

i=1

Periodic mixture

• Needs

n X

|γ(h)| < ∞ =⇒ γ(h) =

1 1 ≤ω≤ 2 2

R 1/2 −1/2

e2πiωh f (ω) dω

• f (ω) ≥ 0 • f (ω) = f (−ω) • f (ω) = f (1 − ω) R 1/2 • γ(0) = V [xt ] = −1/2 f (ω) dω

0

h = 0, ±1, . . .

• Γ(α + 1) = αΓ(α) α>1 • Γ(n) = (n − 1)! n∈N √ • Γ(1/2) = π

22.2

Beta Function

Z 1 Γ(x)Γ(y) tx−1 (1 − t)y−1 dt = • Ordinary: B(x, y) = B(y, x) = Γ(x + y) 0 Z x • Incomplete: B(x; a, b) = ta−1 (1 − t)b−1 dt

2 • White noise: fw (ω) = σw • ARMA (p, q) , φ(B)xt = θ(B)wt :

|θ(e−2πiω )|2 fx (ω) = |φ(e−2πiω )|2 Pp Pq where φ(z) = 1 − k=1 φk z k and θ(z) = 1 + k=1 θk z k 2 σw

0

• Regularized incomplete: a+b−1 B(x; a, b) a,b∈N X (a + b − 1)! Ix (a, b) = = xj (1 − x)a+b−1−j B(a, b) j!(a + b − 1 − j)! j=a

26

Stirling numbers, 2nd kind       n n−1 n−1 =k + k k k−1

• I0 (a, b) = 0 I1 (a, b) = 1 • Ix (a, b) = 1 − I1−x (b, a)

22.3

Series

Finite •

n(n + 1) k= 2



(2k − 1) = n2



k=1 n X



k=1 n X

k=1 n X



k2 =

ck =

k=0

cn+1 − 1 c−1

n   X n k=0 n  X

k

n

=2

k=0 ∞ X

Balls and Urns |B| = n, |U | = m B : D, U : ¬D

• Binomial Theorem: n   X n n−k k a b = (a + b)n k

B : ¬D, U : D

k=0

c 6= 1

k > n : Pn,k = 0

f :B→U f arbitrary n

m 

B : D, U : ¬D

 n+n−1 n m   X n k=1

1 , 1−p

∞ X

p |p| < 1 1−p k=1 !   ∞ X d 1 1 k = p = dp 1 − p (1 − p)2 pk =

kpk−1 =

B : ¬D, U : ¬D |p| < 1

ordered

(n − i) =

i=0

unordered

Pn,k

f injective ( mn m ≥ n 0 else   m n ( 1 m≥n 0 else ( 1 m≥n 0 else

f surjective   n m! m 

 n−1 m−1   n m Pn,m

f bijective ( n! m = n 0 else ( 1 m=n 0 else ( 1 m=n 0 else ( 1 m=n 0 else

References

[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, 2006.

w/o replacement nk =

k

D = distinguishable, ¬D = indistinguishable.

[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.

Sampling

k−1 Y

n ≥ 1 : Pn,0 = 0, P0,0 = 1

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.

Combinatorics

k out of n

m X k=1

k=0

22.4

Pn,i

k=0

d dp k=0 k=0  ∞  X r+k−1 k • x = (1 − x)−r r ∈ N+ k k=0 ∞   X α k • p = (1 + p)α |p| < 1 , α ∈ C k •

n X

• Vandermonde’s Identity:    r   X m n m+n = k r−k r

k=0

Infinite ∞ X • pk =

Pn+k,k =

i=1

   r+k r+n+1 = k n k=0     n X k n+1 • = m m+1

n(n + 1)(2n + 1) 6 k=1 2  n X n(n + 1) • k3 = 2 •

Partitions

Binomial

n X

1≤k≤n

  ( 1 n=0 n = 0 0 else

n! (n − k)!

  n nk n! = = k k! k!(n − k)!

w/ replacement

[4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.

nk

[5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.

    n−1+r n−1+r = r n−1

[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003. 27

28

Univariate distribution relationships, courtesy Leemis and McQueston [2].