- [R语言] Statistics and R - Week 3-
- [R语言] Statistics and R - Week 1
- [R语言] Statistics and R - Week 2-
- [R语言] Statistics and R - Week 2-
- [R语言] Statistics and R - Week 2-
- R语言-0基础学习4-实战1-常见操作
- 克拉美罗界(Cramér–Rao Bound ,CRB)
- An example of Generalized Bayes'
- Some words that U need to know t
- 【转】Fitting Non-Linear Growth Cur
Week 3: Inference
Part 1: Confidence Intervals
参考书籍:《Data Analysis for the Life Sciences》
参考视频:
Confidence Intervals
A confidence interval includes information about your estimated effect size and the uncertainty associated with this estimate.
Confidence Interval For Population Mean
dat <- rio::import("mice_pheno.csv")
# population average
chowPopulation <- dat[dat$Sex=="F" & dat$Diet=="chow",3]
mu_chow <- mean(chowPopulation)
# sample average
N <- 30
chow <- sample(chowPopulation,N)
mean(chow)
# CLT
se <- sd(chow) / sqrt(N)
Defining The Interval
Keep in mind that saying 95% of random intervals will fall on the true value is not the same as saying there is a 95% chance that the true value falls in our interval.
(mean(chow) - mean(chowPopulation)) / se
# 符合正态正态分布 N(0~1)
pnorm(2) - pnorm(-2)
# [1] 0.9544997
# 2是一个近似值,本质应该用如下值
qnorm(1- 0.05/2)
# [1] 1.959964
Now do some basic algebra to clear out everything and leave alone in the middle and you get that the following event:
has a probability of 95%.
Again, the definition of the confidence interval is that 95% of random intervals will contain the true, fixed value
.
参考:如何理解95%置信区间
- 计算置信区间
Q <- qnorm(1- 0.05/2)
interval <- c(mean(chow)-Q*se, mean(chow)+Q*se )
interval
# [1] 23.11161 25.62239
between(mu_chow,interval[1],interval[2])
# [1] TRUE
which happens to cover or
mean(chowPopulation). However, we can take another sample and we might not be as lucky.
In fact, the theory tells us that we will cover 95% of the time.
- 绘制置信区间模拟图
B <- 250
N <- 30
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
xlab="weight",ylab="interval",ylim=c(1,B))
# 画总体参数均值线
abline(v=mean(chowPopulation))
for (i in 1:B) {
chow <- sample(chowPopulation,N)
se <- sd(chow)/sqrt(N)
interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
covered <-
mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
# 根据覆盖总体均值与否分配颜色
color <- ifelse(covered,1,2)
lines(interval, c(i,i),col=color)
}
You can run this repeatedly to see what happens. You will see that in about 5% of the cases, we fail to cover .
Small Sample Size And The CLT
library(rafalib)
mypar(1,2)
B <- 250
# 大样本
N <- 30
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
chow <- sample(chowPopulation,N)
se <- sd(chow)/sqrt(N)
interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
covered <-
mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
color <- ifelse(covered,1,2)
lines(interval, c(i,i),col=color)
}
# 小样本
N <- 5
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
chow <- sample(chowPopulation,N)
se <- sd(chow)/sqrt(N)
interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
covered <-
mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
color <- ifelse(covered,1,2)
lines(interval, c(i,i),col=color)
}
可以看到不覆盖总体均数的区间明显变多,因为小样本情况下CLT延续正态分布的判断会增大误差,即Q值计算错误。此时需要用t-分布
mypar(1,2)
B <- 250
# 按正态分布
N <- 5
Q <- qnorm(1- 0.05/2)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
chow <- sample(chowPopulation,N)
se <- sd(chow)/sqrt(N)
interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
covered <-
mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
color <- ifelse(covered,1,2)
lines(interval, c(i,i),col=color)
}
# 按照t分布
N <- 5
Q <- qt(1- 0.05/2,df=N-1)
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
chow <- sample(chowPopulation,N)
se <- sd(chow)/sqrt(N)
interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
covered <-
mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
color <- ifelse(covered,1,2)
lines(interval, c(i,i),col=color)
}
Now the intervals are made bigger, which makes the intervals larger and hence cover
more frequently
N <- 5
qt(1 - 0.05/2, df=N-1)
# [1] 2.776445
qnorm(1 - 0.05/2)
# [1] 1.959964
This is because the t-distribution has fatter tails
参考:https://staff.aist.go.jp/t.ihara/t.html
Connection Between Confidence Intervals and p-values
课本里阐述的很清楚,这里直接放原文
If we are talking about a t-test p-value, we are asking if differences as extreme as the one we observe, , are likely when the difference between the population averages is actually equal to zero. So we can form a confidence interval with the observed difference. Instead of writing
repeatedly, let's define this difference as a new variable
.
Suppose you use CLT and report , with
, as a 95% confidence interval for the difference and this interval does not include 0 (a false positive).
Because the interval does not include 0, this implies that either or
. This suggests that either
or
. This then implies that the t-statistic is more extreme than 2, which in turn suggests that the p-value must be smaller than 0.05 (approximately, for a more exact calculation use
qnorm(.05/2) instead of 2). The same calculation can be made if we use the t-distribution instead of CLT (with qt(.05/2, df=2*N-2)).
置信区间和p值很核心的一句话
In summary, if a 95% or 99% confidence interval does not include 0, then the p-value must be smaller than 0.05 or 0.01 respectively.














网友评论