美文网首页Data Analysis for the Life Science
[R语言] Statistics and R - Week 3-

[R语言] Statistics and R - Week 3-

作者: 半为花间酒 | 来源:发表于2020-04-19 13:40 被阅读0次

Week 3: Inference

Part 1: Confidence Intervals


参考书籍:《Data Analysis for the Life Sciences》

参考视频:

  1. Data Analysis for the Life Sciences Series - rafalib
  2. Professional Certificate in Data Analysis for Life Sciences (Harvard University) - edX

Confidence Intervals

A confidence interval includes information about your estimated effect size and the uncertainty associated with this estimate.

Confidence Interval For Population Mean

dat <- rio::import("mice_pheno.csv")

# population average
chowPopulation <- dat[dat$Sex=="F" & dat$Diet=="chow",3]
mu_chow <- mean(chowPopulation)

# sample average
N <- 30
chow <- sample(chowPopulation,N)
mean(chow)

# CLT
se <- sd(chow) / sqrt(N)

Defining The Interval

Keep in mind that saying 95% of random intervals will fall on the true value is not the same as saying there is a 95% chance that the true value falls in our interval.

(mean(chow) - mean(chowPopulation)) / se 
# 符合正态正态分布 N(0~1)

图源:https://oneclass.com/blog/university-of-maryland/56656-5-core-concepts-in-stat-440-at-the-university-of-maryland.en.html

-2 \leq \sqrt{N} (\bar{X}-\mu_X)/s_X \leq 2

pnorm(2) - pnorm(-2)
# [1] 0.9544997

# 2是一个近似值,本质应该用如下值
qnorm(1- 0.05/2)
# [1] 1.959964

Now do some basic algebra to clear out everything and leave \mu_X alone in the middle and you get that the following event:

\bar{X}-2 s_X/\sqrt{N} \leq \mu_X \leq \bar{X}+2s_X/\sqrt{N}
has a probability of 95%.

Again, the definition of the confidence interval is that 95% of random intervals will contain the true, fixed value \mu_X.

参考:如何理解95%置信区间

  • 计算置信区间
Q <- qnorm(1- 0.05/2)
interval <- c(mean(chow)-Q*se, mean(chow)+Q*se )
interval
# [1] 23.11161 25.62239
between(mu_chow,interval[1],interval[2])
# [1] TRUE

which happens to cover \mu_X or mean(chowPopulation). However, we can take another sample and we might not be as lucky.

In fact, the theory tells us that we will cover \mu_X 95% of the time.

  • 绘制置信区间模拟图
B <- 250
N <- 30
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
# 画总体参数均值线
abline(v=mean(chowPopulation))
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- 
    mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  # 根据覆盖总体均值与否分配颜色
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

You can run this repeatedly to see what happens. You will see that in about 5% of the cases, we fail to cover \mu_X.

Small Sample Size And The CLT

library(rafalib)
mypar(1,2)
B <- 250

# 大样本
N <- 30
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- 
    mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

# 小样本
N <- 5
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- 
    mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

可以看到不覆盖总体均数的区间明显变多,因为小样本情况下CLT延续正态分布的判断会增大误差,即Q值计算错误。此时需要用t-分布

mypar(1,2)
B <- 250

# 按正态分布
N <- 5
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- 
    mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

# 按照t分布
N <- 5
Q <- qt(1- 0.05/2,df=N-1)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- 
    mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

Now the intervals are made bigger, which makes the intervals larger and hence cover \mu_X more frequently

N <- 5
qt(1 - 0.05/2, df=N-1)
# [1] 2.776445
qnorm(1 - 0.05/2)
# [1] 1.959964

This is because the t-distribution has fatter tails

参考:https://staff.aist.go.jp/t.ihara/t.html

Connection Between Confidence Intervals and p-values

课本里阐述的很清楚,这里直接放原文

If we are talking about a t-test p-value, we are asking if differences as extreme as the one we observe, \bar{Y} - \bar{X}, are likely when the difference between the population averages is actually equal to zero. So we can form a confidence interval with the observed difference. Instead of writing \bar{Y} - \bar{X} repeatedly, let's define this difference as a new variable d \equiv \bar{Y} - \bar{X} .

Suppose you use CLT and report d \pm 2 s_d/\sqrt{N}, with s_d = \sqrt{s_X^2+s_Y^2}, as a 95% confidence interval for the difference and this interval does not include 0 (a false positive).

Because the interval does not include 0, this implies that either D - 2 s_d/\sqrt{N} > 0 or d + 2 s_d/\sqrt{N} < 0. This suggests that either \sqrt{N}d/s_d > 2 or \sqrt{N}d/s_d < 2. This then implies that the t-statistic is more extreme than 2, which in turn suggests that the p-value must be smaller than 0.05 (approximately, for a more exact calculation use qnorm(.05/2) instead of 2). The same calculation can be made if we use the t-distribution instead of CLT (with qt(.05/2, df=2*N-2)).

置信区间和p值很核心的一句话

In summary, if a 95% or 99% confidence interval does not include 0, then the p-value must be smaller than 0.05 or 0.01 respectively.

相关文章

网友评论

    本文标题:[R语言] Statistics and R - Week 3-

    本文链接:https://www.haomeiwen.com/subject/qffybhtx.html