[R语言] Statistics and R - Week 3-

作者: 半为花间酒 | 来源:发表于2020-04-19 13:40 被阅读0次

[R语言] Statistics and R - Week 3-
[R语言] Statistics and R - Week 1
[R语言] Statistics and R - Week 2-
[R语言] Statistics and R - Week 2-
[R语言] Statistics and R - Week 2-
R语言-0基础学习4-实战1-常见操作
克拉美罗界(Cramér–Rao Bound ,CRB)
An example of Generalized Bayes'
Some words that U need to know t
【转】Fitting Non-Linear Growth Cur

Week 3: Inference

Part 1: Confidence Intervals

参考书籍：《Data Analysis for the Life Sciences》

参考视频：

Data Analysis for the Life Sciences Series - rafalib

Professional Certificate in Data Analysis for Life Sciences (Harvard University) - edX

Confidence Intervals

A confidence interval includes information about your estimated effect size and the uncertainty associated with this estimate.

Confidence Interval For Population Mean

dat <- rio::import("mice_pheno.csv")

# population average
chowPopulation <- dat[dat$Sex=="F" & dat$Diet=="chow",3]
mu_chow <- mean(chowPopulation)

# sample average
N <- 30
chow <- sample(chowPopulation,N)
mean(chow)

# CLT
se <- sd(chow) / sqrt(N)

Defining The Interval

Keep in mind that saying 95% of random intervals will fall on the true value is not the same as saying there is a 95% chance that the true value falls in our interval.

(mean(chow) - mean(chowPopulation)) / se 
# 符合正态正态分布 N(0~1)

图源：https://oneclass.com/blog/university-of-maryland/56656-5-core-concepts-in-stat-440-at-the-university-of-maryland.en.html

$-2 \leq \sqrt{N} (\bar{X}-\mu_X)/s_X \leq 2$

pnorm(2) - pnorm(-2)
# [1] 0.9544997

# 2是一个近似值，本质应该用如下值
qnorm(1- 0.05/2)
# [1] 1.959964

Now do some basic algebra to clear out everything and leave $\mu_X$ alone in the middle and you get that the following event:

$\bar{X}-2 s_X/\sqrt{N} \leq \mu_X \leq \bar{X}+2s_X/\sqrt{N}$
has a probability of 95%.

Again, the definition of the confidence interval is that 95% of random intervals will contain the true, fixed value $\mu_X$ .

参考：如何理解95%置信区间

计算置信区间

Q <- qnorm(1- 0.05/2)
interval <- c(mean(chow)-Q*se, mean(chow)+Q*se )
interval
# [1] 23.11161 25.62239
between(mu_chow,interval[1],interval[2])
# [1] TRUE

which happens to cover $\mu_X$ or mean(chowPopulation). However, we can take another sample and we might not be as lucky.

In fact, the theory tells us that we will cover $\mu_X$ 95% of the time.

绘制置信区间模拟图

B <- 250
N <- 30
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
# 画总体参数均值线
abline(v=mean(chowPopulation))
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- 
    mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  # 根据覆盖总体均值与否分配颜色
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

You can run this repeatedly to see what happens. You will see that in about 5% of the cases, we fail to cover $\mu_X$ .

Small Sample Size And The CLT

library(rafalib)
mypar(1,2)
B <- 250

# 大样本
N <- 30
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- 
    mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

# 小样本
N <- 5
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- 
    mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

可以看到不覆盖总体均数的区间明显变多，因为小样本情况下CLT延续正态分布的判断会增大误差，即Q值计算错误。此时需要用t-分布

mypar(1,2)
B <- 250

# 按正态分布
N <- 5
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- 
    mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

# 按照t分布
N <- 5
Q <- qt(1- 0.05/2,df=N-1)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- 
    mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

Now the intervals are made bigger, which makes the intervals larger and hence cover $\mu_X$ more frequently

N <- 5
qt(1 - 0.05/2, df=N-1)
# [1] 2.776445
qnorm(1 - 0.05/2)
# [1] 1.959964

This is because the t-distribution has fatter tails

参考：https://staff.aist.go.jp/t.ihara/t.html

Connection Between Confidence Intervals and p-values

课本里阐述的很清楚，这里直接放原文

If we are talking about a t-test p-value, we are asking if differences as extreme as the one we observe, $\bar{Y} - \bar{X}$ , are likely when the difference between the population averages is actually equal to zero. So we can form a confidence interval with the observed difference. Instead of writing $\bar{Y} - \bar{X}$ repeatedly, let's define this difference as a new variable $d \equiv \bar{Y} - \bar{X}$ .

Suppose you use CLT and report $d \pm 2 s_d/\sqrt{N}$ , with $s_d = \sqrt{s_X^2+s_Y^2}$ , as a 95% confidence interval for the difference and this interval does not include 0 (a false positive).

Because the interval does not include 0, this implies that either $D - 2 s_d/\sqrt{N} > 0$ or $d + 2 s_d/\sqrt{N} < 0$ . This suggests that either $\sqrt{N}d/s_d > 2$ or $\sqrt{N}d/s_d < 2$ . This then implies that the t-statistic is more extreme than 2, which in turn suggests that the p-value must be smaller than 0.05 (approximately, for a more exact calculation use qnorm(.05/2) instead of 2). The same calculation can be made if we use the t-distribution instead of CLT (with qt(.05/2, df=2*N-2)).

置信区间和p值很核心的一句话

In summary, if a 95% or 99% confidence interval does not include 0, then the p-value must be smaller than 0.05 or 0.01 respectively.

[R语言] Statistics and R - Week 3-
Week 3: Inference Part 1: Confidence Intervals 参考书籍：《Data...
[R语言] Statistics and R - Week 1
Week 1: R 参考书籍：《Data Analysis for the Life Sciences》参考视频：...
[R语言] Statistics and R - Week 2-
Week 2: Random Variables and Central Limit Theorem Part 3...
[R语言] Statistics and R - Week 2-
Week 2: Random Variables and Central Limit Theorem Part 2...
[R语言] Statistics and R - Week 2-
Week 2: Random Variables and Central Limit Theorem Part 1...
R语言-0基础学习4-实战1-常见操作
R语言学习系列R语言-0基础学习1-数据结构R语言-0基础学习2-构建子集R语言-0基础学习3-循环排序信息处理函...
克拉美罗界(Cramér–Rao Bound ,CRB)
In estimation theory and statistics, the Cramér–Rao bound...
An example of Generalized Bayes'
BOOK: Applied Bayesian Statistics With R and OpenBUGS Exa...
Some words that U need to know t
statistics analysis/analytics/analyze/analyst Algorithm R...
【转】Fitting Non-Linear Growth Cur
from Fitting Non-Linear Growth Curves in R · Statistics @...