前言
cisTopic,用于分析和注释scATAC-seq的结果,一般来说,scATAC测量的是单个细胞中染色质开放的程度
对于scATAC的分析有三种类型,第一种是直接得到peaks的单细胞矩阵,这里的peak即为染色质开放的区域(行为peak的位置信息;列为cells);第二种是将基因组按照window/bin等长划分,比如按照500bp来划分(1-500bp,501-1000bp,1001-1500bp...),基于scATAC测序的结果,在所划分的区域内计算开放的活性分数(行为window/bin的位置信息;列为cells);第三种就是需要基于数据库信息,在开放的区域内,寻找一些TF或是其他因子结合的motif(基于数据库中的注释),并计算这些motif相应的开放活性(行为motif的位置信息;列为cells),这一组方式的基本假设是TF或是其他因子更容易结合在染色质开放的区域
cisRopic数学原理
如上图所示,cisRopic首先先获得scATAC的矩阵(二进制矩阵),那么0表示该cell中没有这个peak,1表示该cell中有这个peak
然后针对每一个cell,将其中的peak分为 topic 1 和 topic 2(这里的topic 1 和 topic 2 可以理解为两个不同的生物学功能,即这些peak划分为具有两种不同生物学功能的大类)
注:这里的两个topic是一个抽象的概念,具体的生物学意义还是要基于注释来赋予
下一步,对每一个cell中的peak进行吉布斯采样(吉布斯采样的理解与M-H采样类似,传送门)
估计出每个topic的抽样比例的期望(也可以理解为在某个cell中,peak属于某个topic的比例)
具体原理如下:
- zi is the current assignment to be made,
- z–i is the rest of assignments in the dataset,
- t is the given topic,
- r is the given region,
- P(zi = t | z–i, r) is the probability of assigning the given region r to a regulatory topic t given the rest of the assignments in the dataset,
is the number of times the given region r is assigned to topic t across the dataset without consideration of the current assignment to be done,
- β is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over regions in a topic
(here, we use symmetric Dirichlet priors for all topics, using 0.1 as the value for β),
is the total number of assignments to topic t through the dataset,
- R is the total number of regions in the dataset,
expresses the probability of region r under topic t,
is the total number of region assignments to topic t within the given cell c (without consideration of the region to be assigned),
- α is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over topics in a cell θ(c) (here, we use symmetric Dirichlet priors for all cells, using 50/T as the value for α),
is the total number of assignments within the given cell c,
- T is the total number of topics in the model (the total number of topics has to be provided; see ‘Model selection’), and
is the probability of topic t under cell c.
其中 的意义是在给定任意peak位置信息(Region)的条件下,该peak(Region)属于 topic t 的概率
而 代表在 cell c 中出现 topic t 的概率;
代表在 topic t中出现 region r 的概率
根据贝叶斯理论,我们可以得知如下的正比关系:
那么要估计,则可以依靠这种正比关系,对
和
这两个分布进行吉布斯采样,然后估算它们乘积的期望。由于这种正比关系的存在,右式乘积的期望就等于左式的概率的期望,可参考:传送门
代码分析
# install
devtools::install_github("aertslab/cisTopic")
library(cisTopic)
# 读入数据
data(counts_mel)
cisTopicObject <- createcisTopicObject(counts_mel, project.name='scH3K27Ac_melanoma')
data(cellData_mel)
cisTopicObject <- addCellMetadata(cisTopicObject, cell.data = cellData_mel)
# 计算
## 这里的topic后面的数字代表分topic的个数
cisTopicObject <- runWarpLDAModels(cisTopicObject, topic=c(2:15, 20, 25, 35, 40, 45, 50),
seed=123, nCores=17, addModels=FALSE)
对于topic = 2 的模型,诸如MM001_1这一类为细胞信息;行名“[1,],[2,]”代表两个topic;里面的数值代表每个cell中,peak属于某个topic的比例
内部函数:
object = cisTopicObject
topic=c(2:15, 20, 25, 35, 40, 45, 50)
nCores=1
seed=123
iterations = 500
alpha = 50
beta=0.1
alphaByTopic=TRUE
tmp=NULL
addModels=FALSE
returnType='allModels'
object@calc.params[['runWarpLDAModels']] <- NULL
object@models <- list()
object@calc.params[['runWarpLDAModels']] <- c(as.list(environment(), all = TRUE)[names(formals("runWarpLDAModels"))[-1]])
# Initialize
input <- Matrix::t(object@binary.count.matrix)
#假设topic = 2
t = 2
library(text2vec)
mod = .runWarpLDA_perTopic(input, n_topics=t, doc_topic_prior=alpha/t, topic_word_prior = beta, n_iter=iterations, tmp=tmp)
names(models) <- laply(1:length(models), function(x) sapply(models[x], function(y) nrow(y$topic_sums))
models <- .addModels(c(object@models, models))
object@models <- models
## 其中函数.addModels
.addModels <- function(
modelList
){
names(modelList) <- laply(1:length(modelList), function(x) sapply(modelList[x], function(y) nrow(y$topic_sums)))
modelList <- modelList[as.character(sort(as.numeric(names(modelList))))]
modelList <- modelList[unique(names(modelList))]
return(modelList)
}
这里解释两个参数:
- α is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over topics in a cell θ(c) ;
- β is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over regions in a topic ϕ(t)r (here, we use symmetric Dirichlet priors for all topics, using 0.1 as the value for β),
我们可以看到,基于吉布斯采样作者调用的是R包text2vec,而对于.runWarpLDA_perTopic函数,其内部调用的是R包text2vec中的函数
.runWarpLDA_perTopic <- function(input, n_topics, doc_topic_prior, topic_word_prior, n_iter, convergence_tol = 0.0001, n_check_convergence = 25, progressbar = FALSE, normalize='none', tmp=tmp){
lda_model <- LDA$new(n_topics = n_topics, doc_topic_prior = doc_topic_prior, topic_word_prior = topic_word_prior)
doc_topic_distr <- lda_model$fit_transform(x = input,
n_iter = n_iter,
convergence_tol = convergence_tol,
n_check_convergence = n_check_convergence,
progressbar = progressbar, normalize=normalize)
model <- list()
model$topics <- lda_model$components
model$topic_sums <- as.matrix(rowSums(lda_model$components))
model$document_sums <- t(doc_topic_distr)
model$document_expects <- t(doc_topic_distr)
model$log.likelihoods <- t(attributes(doc_topic_distr)$likelihood)
model$perplexity <- perplexity(input, lda_model$topic_word_distribution, doc_topic_distr)
if(!is.null(tmp)){
saveRDS(model, file=paste0(tmp, n_topics, '_topic.Rds'))
}
return(model)
}












网友评论