LDA模型在scATAC-seq中的应用

作者: 小潤澤 | 来源:发表于2022-01-11 16:11 被阅读0次

LDA模型在scATAC-seq中的应用
10 主题模型 - 代码案例一 - LDA主题模型初识
浅谈人工智能产品设计——LDA主题模型
LDA模型伪代码块2
12 主题模型 - 代码案例三 - scikit-learn中的
14 主题模型 - 代码参考文档 - 1.txt\2.txt\3
13 主题模型 - 代码案例四 - scikit-learn中的
词典构造方法之LDA主题模型
主题模型LDA理解与应用
第七篇|Spark平台下基于LDA的k-means算法实现

前言

cisTopic，用于分析和注释scATAC-seq的结果，一般来说，scATAC测量的是单个细胞中染色质开放的程度
对于scATAC的分析有三种类型，第一种是直接得到peaks的单细胞矩阵，这里的peak即为染色质开放的区域（行为peak的位置信息；列为cells）；第二种是将基因组按照window/bin等长划分，比如按照500bp来划分（1-500bp，501-1000bp，1001-1500bp...），基于scATAC测序的结果，在所划分的区域内计算开放的活性分数（行为window/bin的位置信息；列为cells）；第三种就是需要基于数据库信息，在开放的区域内，寻找一些TF或是其他因子结合的motif（基于数据库中的注释），并计算这些motif相应的开放活性（行为motif的位置信息；列为cells），这一组方式的基本假设是TF或是其他因子更容易结合在染色质开放的区域

cisRopic数学原理

如上图所示，cisRopic首先先获得scATAC的矩阵（二进制矩阵），那么0表示该cell中没有这个peak，1表示该cell中有这个peak
然后针对每一个cell，将其中的peak分为 topic 1 和 topic 2（这里的topic 1 和 topic 2 可以理解为两个不同的生物学功能，即这些peak划分为具有两种不同生物学功能的大类）

注：这里的两个topic是一个抽象的概念，具体的生物学意义还是要基于注释来赋予

下一步，对每一个cell中的peak进行吉布斯采样（吉布斯采样的理解与M-H采样类似，传送门）
估计出每个topic的抽样比例的期望（也可以理解为在某个cell中，peak属于某个topic的比例）

具体原理如下：

z_i is the current assignment to be made,

z_–i is the rest of assignments in the dataset,

t is the given topic,

r is the given region,

P(z_i= t | z_–i, r) is the probability of assigning the given region r to a regulatory topic t given the rest of the assignments in the dataset,

$n_{ - i,t}^{(r)}$ is the number of times the given region r is assigned to topic t across the dataset without consideration of the current assignment to be done,

β is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over regions in a topic $\phi _r^{(t)}$ (here, we use symmetric Dirichlet priors for all topics, using 0.1 as the value for β),

$n_{ - i,t}$ is the total number of assignments to topic t through the dataset,

R is the total number of regions in the dataset,

$\left( {n_{ - i,t}^{\left( r \right)} + \beta } \right)/\left( {n_{ - i,t} + R\beta } \right)$ expresses the probability of region r under topic t,

$n_{ - i,t}^{(c)}$ is the total number of region assignments to topic t within the given cell c (without consideration of the region to be assigned),

α is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over topics in a cell θ^(c) (here, we use symmetric Dirichlet priors for all cells, using 50/T as the value for α),

$n_{ - i}^{\left( c \right)}$ is the total number of assignments within the given cell c,

T is the total number of topics in the model (the total number of topics has to be provided; see ‘Model selection’), and

$\left( {n_{ - i,t}^{(c)} + \alpha } \right)/\left( {n_{ - i}^{(c)} + T\alpha } \right)$ is the probability of topic t under cell c.

其中 $P\left( {z_i = t{\mathrm{|}}z_{ - i},r} \right)$ 的意义是在给定任意peak位置信息（Region）的条件下，该peak（Region）属于 topic t 的概率

而 $\left( {n_{ - i,t}^{(c)} + \alpha } \right)/\left( {n_{ - i}^{(c)} + T\alpha } \right)$ 代表在 cell c 中出现 topic t 的概率； $\left( {n_{ - i,t}^{\left( r \right)} + \beta } \right)/\left( {n_{ - i,t} + R\beta } \right)$ 代表在 topic t中出现 region r 的概率
根据贝叶斯理论，我们可以得知如下的正比关系：

那么要估计 $P\left( {z_i = t{\mathrm{|}}z_{ - i},r} \right)$ ，则可以依靠这种正比关系，对 $\left( {n_{ - i,t}^{(c)} + \alpha } \right)/\left( {n_{ - i}^{(c)} + T\alpha } \right)$ 和 $\left( {n_{ - i,t}^{\left( r \right)} + \beta } \right)/\left( {n_{ - i,t} + R\beta } \right)$ 这两个分布进行吉布斯采样，然后估算它们乘积的期望。由于这种正比关系的存在，右式乘积的期望就等于左式的概率的期望，可参考：传送门

代码分析

# install
devtools::install_github("aertslab/cisTopic")
library(cisTopic)

# 读入数据
data(counts_mel) 
cisTopicObject <- createcisTopicObject(counts_mel, project.name='scH3K27Ac_melanoma')

data(cellData_mel)
cisTopicObject <- addCellMetadata(cisTopicObject, cell.data = cellData_mel)

# 计算
## 这里的topic后面的数字代表分topic的个数
cisTopicObject <- runWarpLDAModels(cisTopicObject, topic=c(2:15, 20, 25, 35, 40, 45, 50), 
                                   seed=123, nCores=17, addModels=FALSE)

对于topic = 2 的模型，诸如MM001_1这一类为细胞信息；行名“[1,]，[2,]”代表两个topic；里面的数值代表每个cell中，peak属于某个topic的比例

内部函数：

object = cisTopicObject
topic=c(2:15, 20, 25, 35, 40, 45, 50)
nCores=1
seed=123
iterations = 500
alpha = 50
beta=0.1
alphaByTopic=TRUE
tmp=NULL
addModels=FALSE
returnType='allModels'


object@calc.params[['runWarpLDAModels']] <- NULL
object@models <- list()
object@calc.params[['runWarpLDAModels']] <- c(as.list(environment(), all = TRUE)[names(formals("runWarpLDAModels"))[-1]])

# Initialize
input <- Matrix::t(object@binary.count.matrix)

#假设topic = 2
t = 2
library(text2vec)
mod = .runWarpLDA_perTopic(input, n_topics=t, doc_topic_prior=alpha/t, topic_word_prior = beta, n_iter=iterations, tmp=tmp)

names(models) <- laply(1:length(models), function(x) sapply(models[x], function(y) nrow(y$topic_sums))
models <- .addModels(c(object@models, models))                         
object@models <- models

## 其中函数.addModels
.addModels <- function(
  modelList
){
  names(modelList) <- laply(1:length(modelList), function(x) sapply(modelList[x], function(y) nrow(y$topic_sums)))
  modelList <- modelList[as.character(sort(as.numeric(names(modelList))))]
  modelList <- modelList[unique(names(modelList))]
  return(modelList)
}

这里解释两个参数：

α is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over topics in a cell θ^(c) ；

β is the Dirichlet hyperparameter of the prior distribution for the categorical distribution over regions in a topic ϕ^(t)r (here, we use symmetric Dirichlet priors for all topics, using 0.1 as the value for β),

我们可以看到，基于吉布斯采样作者调用的是R包text2vec，而对于.runWarpLDA_perTopic函数，其内部调用的是R包text2vec中的函数

.runWarpLDA_perTopic <- function(input, n_topics, doc_topic_prior, topic_word_prior, n_iter, convergence_tol = 0.0001, n_check_convergence = 25, progressbar = FALSE, normalize='none', tmp=tmp){
  lda_model <- LDA$new(n_topics = n_topics, doc_topic_prior = doc_topic_prior, topic_word_prior = topic_word_prior)
  doc_topic_distr <- lda_model$fit_transform(x = input, 
                                             n_iter = n_iter,
                                             convergence_tol = convergence_tol, 
                                             n_check_convergence = n_check_convergence,
                                             progressbar = progressbar, normalize=normalize)
  
  model <- list()
  model$topics <- lda_model$components
  model$topic_sums <- as.matrix(rowSums(lda_model$components))
  model$document_sums <- t(doc_topic_distr)
  model$document_expects <- t(doc_topic_distr)
  model$log.likelihoods <- t(attributes(doc_topic_distr)$likelihood)
  model$perplexity <- perplexity(input, lda_model$topic_word_distribution, doc_topic_distr)
  if(!is.null(tmp)){
    saveRDS(model, file=paste0(tmp, n_topics, '_topic.Rds'))
  }
  return(model)
}

LDA模型在scATAC-seq中的应用
前言 cisTopic，用于分析和注释scATAC-seq的结果，一般来说，scATAC测量的是单个细胞中染色质开...
10 主题模型 - 代码案例一 - LDA主题模型初识
08 主题模型 - LDA09 主题模型 - LDA参数学习-Gibbs采样安装 lda 库使用第三方的lda库...
浅谈人工智能产品设计——LDA主题模型
一、LDA模型简介 LDA是Latent Dirichlet Allocation（潜在狄利克雷分配模型）的缩写，...
LDA模型伪代码块2
LDA模型应用：一眼看穿希拉里的邮件我们拿到希拉里泄露的邮件，跑一把LDA，看看她平时都在聊什么。首先，导入我...
12 主题模型 - 代码案例三 - scikit-learn中的
11 主题模型 - 代码案例二 - scikit-learn中的LDA模型13 主题模型 - 代码案例四 - sc...
14 主题模型 - 代码参考文档 - 1.txt\2.txt\3
11 主题模型 - 代码案例二 - scikit-learn中的LDA模型12 主题模型 - 代码案例三 - sc...
13 主题模型 - 代码案例四 - scikit-learn中的
11 主题模型 - 代码案例二 - scikit-learn中的LDA模型12 主题模型 - 代码案例三 - sc...
词典构造方法之LDA主题模型
词典构造方法之LDA主题模型主题模型LDA原理理解 LDA是一种非监督学习技术，可以用来识别大规模文档集（doc...
主题模型LDA理解与应用
本文主要用于理解主题模型LDA（Latent Dirichlet Allocation）其背后的数学原理及其推导过...
第七篇|Spark平台下基于LDA的k-means算法实现
本文主要在Spark平台下实现一个机器学习应用，该应用主要涉及LDA主题模型以及K-means聚类。通过本文你可以...