美文网首页
LDA模型在scATAC-seq中的应用

LDA模型在scATAC-seq中的应用

作者: 小潤澤 | 来源:发表于2022-01-11 16:11 被阅读0次

前言

cisTopic,用于分析和注释scATAC-seq的结果,一般来说,scATAC测量的是单个细胞中染色质开放的程度
对于scATAC的分析有三种类型,第一种是直接得到peaks的单细胞矩阵,这里的peak即为染色质开放的区域(行为peak的位置信息;列为cells);第二种是将基因组按照window/bin等长划分,比如按照500bp来划分(1-500bp,501-1000bp,1001-1500bp...),基于scATAC测序的结果,在所划分的区域内计算开放的活性分数(行为window/bin的位置信息;列为cells);第三种就是需要基于数据库信息,在开放的区域内,寻找一些TF或是其他因子结合的motif(基于数据库中的注释),并计算这些motif相应的开放活性(行为motif的位置信息;列为cells),这一组方式的基本假设是TF或是其他因子更容易结合在染色质开放的区域

cisRopic数学原理

如上图所示,cisRopic首先先获得scATAC的矩阵(二进制矩阵),那么0表示该cell中没有这个peak,1表示该cell中有这个peak
然后针对每一个cell,将其中的peak分为 topic 1 和 topic 2(这里的topic 1 和 topic 2 可以理解为两个不同的生物学功能,即这些peak划分为具有两种不同生物学功能的大类)

注:这里的两个topic是一个抽象的概念,具体的生物学意义还是要基于注释来赋予

下一步,对每一个cell中的peak进行吉布斯采样(吉布斯采样的理解与M-H采样类似,传送门
估计出每个topic的抽样比例的期望(也可以理解为在某个cell中,peak属于某个topic的比例)

具体原理如下:


  • zi is the current assignment to be made,
  • z–i is the rest of assignments in the dataset,
  • t is the given topic,
  • r is the given region,
  • P(zi=t | z–i, r) is the probability of assigning the given region r to a regulatory topic t given the rest of the assignments in the dataset,
  • n_{ - i,t}^{(r)} is the number of times the given region r is assigned to topic t across the dataset without consideration of the current assignment to be done,
  • β is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over regions in a topic \phi _r^{(t)} (here, we use symmetric Dirichlet priors for all topics, using 0.1 as the value for β),
  • n_{ - i,t} is the total number of assignments to topic t through the dataset,
  • R is the total number of regions in the dataset,
  • \left( {n_{ - i,t}^{\left( r \right)} + \beta } \right)/\left( {n_{ - i,t} + R\beta } \right) expresses the probability of region r under topic t,
  • n_{ - i,t}^{(c)} is the total number of region assignments to topic t within the given cell c (without consideration of the region to be assigned),
  • α is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over topics in a cell θ(c) (here, we use symmetric Dirichlet priors for all cells, using 50/T as the value for α),
  • n_{ - i}^{\left( c \right)} is the total number of assignments within the given cell c,
  • T is the total number of topics in the model (the total number of topics has to be provided; see ‘Model selection’), and
  • \left( {n_{ - i,t}^{(c)} + \alpha } \right)/\left( {n_{ - i}^{(c)} + T\alpha } \right) is the probability of topic t under cell c.

其中 P\left( {z_i = t{\mathrm{|}}z_{ - i},r} \right) 的意义是在给定任意peak位置信息(Region)的条件下,该peak(Region)属于 topic t 的概率

\left( {n_{ - i,t}^{(c)} + \alpha } \right)/\left( {n_{ - i}^{(c)} + T\alpha } \right) 代表在 cell c 中出现 topic t 的概率;\left( {n_{ - i,t}^{\left( r \right)} + \beta } \right)/\left( {n_{ - i,t} + R\beta } \right) 代表在 topic t中出现 region r 的概率
根据贝叶斯理论,我们可以得知如下的正比关系:

那么要估计P\left( {z_i = t{\mathrm{|}}z_{ - i},r} \right),则可以依靠这种正比关系,对 \left( {n_{ - i,t}^{(c)} + \alpha } \right)/\left( {n_{ - i}^{(c)} + T\alpha } \right)\left( {n_{ - i,t}^{\left( r \right)} + \beta } \right)/\left( {n_{ - i,t} + R\beta } \right) 这两个分布进行吉布斯采样,然后估算它们乘积的期望。由于这种正比关系的存在,右式乘积的期望就等于左式的概率的期望,可参考:传送门

代码分析

# install
devtools::install_github("aertslab/cisTopic")
library(cisTopic)

# 读入数据
data(counts_mel) 
cisTopicObject <- createcisTopicObject(counts_mel, project.name='scH3K27Ac_melanoma')

data(cellData_mel)
cisTopicObject <- addCellMetadata(cisTopicObject, cell.data = cellData_mel)

# 计算
## 这里的topic后面的数字代表分topic的个数
cisTopicObject <- runWarpLDAModels(cisTopicObject, topic=c(2:15, 20, 25, 35, 40, 45, 50), 
                                   seed=123, nCores=17, addModels=FALSE)

对于topic = 2 的模型,诸如MM001_1这一类为细胞信息;行名“[1,],[2,]”代表两个topic;里面的数值代表每个cell中,peak属于某个topic的比例

内部函数:

object = cisTopicObject
topic=c(2:15, 20, 25, 35, 40, 45, 50)
nCores=1
seed=123
iterations = 500
alpha = 50
beta=0.1
alphaByTopic=TRUE
tmp=NULL
addModels=FALSE
returnType='allModels'


object@calc.params[['runWarpLDAModels']] <- NULL
object@models <- list()
object@calc.params[['runWarpLDAModels']] <- c(as.list(environment(), all = TRUE)[names(formals("runWarpLDAModels"))[-1]])

# Initialize
input <- Matrix::t(object@binary.count.matrix)

#假设topic = 2
t = 2
library(text2vec)
mod = .runWarpLDA_perTopic(input, n_topics=t, doc_topic_prior=alpha/t, topic_word_prior = beta, n_iter=iterations, tmp=tmp)

names(models) <- laply(1:length(models), function(x) sapply(models[x], function(y) nrow(y$topic_sums))
models <- .addModels(c(object@models, models))                         
object@models <- models

## 其中函数.addModels
.addModels <- function(
  modelList
){
  names(modelList) <- laply(1:length(modelList), function(x) sapply(modelList[x], function(y) nrow(y$topic_sums)))
  modelList <- modelList[as.character(sort(as.numeric(names(modelList))))]
  modelList <- modelList[unique(names(modelList))]
  return(modelList)
}

这里解释两个参数:

  1. α is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over topics in a cell θ(c)
  2. β is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over regions in a topic ϕ(t)r (here, we use symmetric Dirichlet priors for all topics, using 0.1 as the value for β),

我们可以看到,基于吉布斯采样作者调用的是R包text2vec,而对于.runWarpLDA_perTopic函数,其内部调用的是R包text2vec中的函数

.runWarpLDA_perTopic <- function(input, n_topics, doc_topic_prior, topic_word_prior, n_iter, convergence_tol = 0.0001, n_check_convergence = 25, progressbar = FALSE, normalize='none', tmp=tmp){
  lda_model <- LDA$new(n_topics = n_topics, doc_topic_prior = doc_topic_prior, topic_word_prior = topic_word_prior)
  doc_topic_distr <- lda_model$fit_transform(x = input, 
                                             n_iter = n_iter,
                                             convergence_tol = convergence_tol, 
                                             n_check_convergence = n_check_convergence,
                                             progressbar = progressbar, normalize=normalize)
  
  model <- list()
  model$topics <- lda_model$components
  model$topic_sums <- as.matrix(rowSums(lda_model$components))
  model$document_sums <- t(doc_topic_distr)
  model$document_expects <- t(doc_topic_distr)
  model$log.likelihoods <- t(attributes(doc_topic_distr)$likelihood)
  model$perplexity <- perplexity(input, lda_model$topic_word_distribution, doc_topic_distr)
  if(!is.null(tmp)){
    saveRDS(model, file=paste0(tmp, n_topics, '_topic.Rds'))
  }
  return(model)
}

相关文章

网友评论

      本文标题:LDA模型在scATAC-seq中的应用

      本文链接:https://www.haomeiwen.com/subject/azhvcrtx.html