GO、KEGG富集分析（二）纯无参自定义物种

作者: 大号在这里 | 来源:发表于2020-08-13 12:59 被阅读0次

GO、KEGG富集分析（二）纯无参自定义物种
2021-06-29：Go注释的相关操作记录
miRNA功能富集分析
富集分析与可视化：R 实现
【R语言】dplyr对数据分组取各组前几行
加载R包org.Hs.eg.db出错，避坑指南！
几年过去了，现在这种简单而又烂大街的生信SCI还有期刊接收
几年过去了，现在这种简单而又烂大街的生信SCI还有期刊接收
【R】气泡图和柱形图展示挑选的KEGG通路
2018-01-02 clusterprofiler

示例数据
链接：https://pan.baidu.com/s/1fyy8XWgZcFJRUmHDsnprUQ
提取码：4k7k

一、差异表达基因的GO富集分析

可首先使用Blast2GO对基因进行GO注释（方法参照前面发的文章），然后将注释好的基因-GO对应关系表，以及差异表达基因名称列表输入至enricher()执行就可以了。

#加载 clusterProfiler
library(clusterProfiler)

#### GO
#预先获得的 GO 注释文件，基因和 GO 的对应关系
go_anno <- read.delim('go_anno.txt', header = FALSE, stringsAsFactors = FALSE)
names(go_anno) <- c('gene_id', 'ID')

#添加 GO 注释详情
#所有 GO 的注释描述均可在 GO 官网自定义获取：http://geneontology.org/
go_class <- read.delim('go_class.txt', header = FALSE, stringsAsFactors = FALSE)
names(go_class) <- c('ID', 'Description', 'Ontology')
go_anno <- merge(go_anno, go_class, by = 'ID', all.x = TRUE)

#目标基因列表
gene_select <- read.delim('gene_select.txt', stringsAsFactors = FALSE)$gene_id

#GO 富集分析
#默认以所有注释到 GO 的基因为背景集，也可通过 universe 参数输入背景集
#默认以 p<0.05 为标准，Benjamini 方法校正 p 值，q 值阈值 0.2
#默认输出 top500 富集结果
go_rich <- enricher(gene = gene_select,
    TERM2GENE = go_anno[c('ID', 'gene_id')], 
    TERM2NAME = go_anno[c('ID', 'Description')], 
    pvalueCutoff = 0.05, 
    pAdjustMethod = 'BH', 
    qvalueCutoff = 0.2, 
    maxGSSize = 500)

#输出默认结果，即根据上述 p 值等阈值筛选后的
write.table(go_rich, 'go_tmp.txt', sep = '\t', row.names = FALSE, quote = FALSE)
tmp <- read.delim('go_tmp.txt')
tmp <- merge(tmp, go_class[c('ID', 'Ontology')], by = 'ID')
tmp <- tmp[c(10, 1:9)]
tmp <- tmp[order(tmp$pvalue), ]
write.table(tmp, 'go_rich.significant.txt', sep = '\t', row.names = FALSE, quote = FALSE)

#同样地，如果想输出所有富集结果（不考虑 p 值阈值等），将 p、q 等值设置为 1 即可
#或者直接在 enrichResult 类对象中直接提取需要的结果
#names(attributes(go_rich))  #查看对象“go_rich”包含元素
#result <- go_rich@result    #其中的 result，即为所有基因的富集结果

#result <- merge(result, go_class[c('ID', 'Ontology')], by = 'ID')
#result <- result[c(10, 1:9)]
#result <- result[order(result$pvalue), ]
#write.table(result, 'go_rich.all.txt', sep = '\t', row.names = FALSE, quote = FALSE)

#一些常见作图方法
#上文提到的 clusterProfiler 包自带的柱形图、气泡图、网络图等默认方法略
#这些参考上文 KEGG
pdf("go_all.pdf",width=12,height=8)
dotplot(go_rich)
dev.off()

#此外，对于 GO 的包含关系，即 GO 的有向无环图，在这种纯自定义富集分析的情况下，仍然可以使用 topGO 展示
#此时需要在作图前手动修改 enrichResult 类对象中的元素，主要为 result 统计表和 ontology 的 GO 分类，便于 topGO 识别
#但需注意的是一次只能展示一类（BP、CC 或 MF）GO，如下以展示富集的“biological process”（即 BP）为例

#BiocManager::install('topGO')
#BiocManager::install('Rgraphviz')

pdf("go_topGO_BP.pdf",width=12,height=8)
library(topGO)

go_rich_BP <- go_rich
go_rich_BP@result <- go_rich_BP@result[as.vector(subset(tmp, Ontology == 'biological process')$ID), ]
go_rich_BP@ontology <- 'BP'
plotGOgraph(go_rich_BP)
dev.off()

输出结果：参考教程（一）中图表

Ontology    ID  Description GeneRatio   BgRatio pvalue  p.adjust    qvalue  geneID  Count
molecular function  GO:0009924  octadecanal decarbonylase activity  9/245   19/4743 1.32808568665708e-07    1.79955610542034e-05    1.63564237198819e-05    gene57/gene63/gene111/gene470/gene743/gene1330/gene1759/gene1781/gene2567   9
molecular function  GO:1990465  aldehyde oxygenase (deformylating) activity 9/245   19/4743 1.32808568665708e-07    1.79955610542034e-05    1.63564237198819e-05    gene57/gene63/gene111/gene470/gene743/gene1330/gene1759/gene1781/gene2567   9
biological process  GO:0008610  lipid biosynthetic process  8/245   16/4743 4.06447422367092e-07    3.67157504871607e-05    3.33714725732981e-05    gene57/gene63/gene111/gene743/gene1330/gene1759/gene1781/gene2567   8
biological process  GO:1901002  positive regulation of response to salt stress  8/245   19/4743 2.08340398033236e-06    0.000141150619667518    0.000128293824052046    gene2478/gene1039/gene4446/gene3884/gene3927/gene4266/gene2235/gene2516 8
biological process  GO:0009737  response to abscisic acid   9/245   42/4743 0.000224689503380151    0.0104477067328498  0.00949607059812333 gene331/gene1039/gene4446/gene743/gene1759/gene1781/gene1934/gene2008/gene2567  9
biological process  GO:0009651  response to salt stress 12/245  71/4743 0.000231314540210696    0.0104477067328498  0.00949607059812333 gene1745/gene57/gene63/gene111/gene743/gene2075/gene757/gene1083/gene2008/gene2235/gene2934/gene2516    12
biological process  GO:1902584  positive regulation of response to water deprivation    5/245   14/4743 0.000481168949421839    0.0170332577939266  0.0154817724753498  gene2478/gene3884/gene3927/gene4266/gene2235    5
molecular function  GO:0050660  flavin adenine dinucleotide binding 6/245   21/4743 0.000502826798344696    0.0170332577939266  0.0154817724753498  gene3226/gene650/gene2305/gene3164/gene4707/gene2735    6
molecular function  GO:0016491  oxidoreductase activity 10/245  61/4743 0.000984384203413422    0.0274772114324556  0.0249744318321795  gene3226/gene63/gene111/gene1781/gene650/gene2305/gene3164/gene4707/gene2963/gene3953   10
biological process  GO:0009688  abscisic acid biosynthetic process  4/245   10/4743 0.00113959706484006 0.0274772114324556  0.0249744318321795  gene3169/gene1273/gene4157/gene229  4
biological process  GO:0009739  response to gibberellin 4/245   10/4743 0.00113959706484006 0.0274772114324556  0.0249744318321795  gene1039/gene4446/gene2075/gene2563 4
cellular component  GO:0009507  chloroplast 34/245  386/4743    0.00121670308926003 0.0274772114324556  0.0249744318321795  gene763/gene2127/gene2508/gene499/gene498/gene2477/gene1978/gene173/gene918/gene3169/gene1804/gene2904/gene2735/gene2739/gene2541/gene3927/gene156/gene1830/gene2549/gene2900/gene1424/gene2290/gene444/gene314/gene2973/gene1166/gene2008/gene229/gene1653/gene3953/gene717/gene1116/gene2622/gene3361 34
biological process  GO:0009611  response to wounding    6/245   25/4743 0.00137847729769354 0.0287359498211499  0.0261185172194565  gene1401/gene3230/gene4734/gene820/gene2008/gene2516    6
biological process  GO:0009832  plant-type cell wall biogenesis 4/245   12/4743 0.00247430845317722 0.0478955422007877  0.0435329457175541  gene2513/gene1676/gene3472/gene4253 4

二、差异表达基因的KEGG富集分析

对于无参考数据库的物种而言，因为没有现成的基因功能注释信息（这是最让人难受的），首先需要使用基因序列与kobas公开的的kegg蛋白序列进行blast比对，然后进行注释。
对于差异表达基因的获得，常见方法如limma、edgeR、DESeq2等。自定义阈值筛选后，即可获得一串差异表达基因的名称列表。将得到的基因-pathway对应关系表输入enricher()作为该物种的背景集，并提供用于作富集的基因名称，即可鉴定功能富集。

#Bioconductor 安装 clusterProfiler
#BiocManager::install('clusterProfiler')
#加载 clusterProfiler
library(clusterProfiler)

#### KEGG
#“kegg_anno.txt”为预先获得的 KEGG 注释文件，包含基因和 KEGG pathway 的对应关系
#我用蛋白序列做的 kegg 注释，后续又对应的蛋白-基因关系，不用担心同一途径中基因名称出现重复的问题，enricher() 函数可自动去重
kegg_anno <- read.delim('kegg_anno.txt', colClasses = 'character')
head(kegg_anno)

#目标基因列表
gene_select <- read.delim('gene_select.txt', stringsAsFactors = FALSE)$gene_id
head(gene_select)

#KEGG 富集分析
#默认以所有注释到 KEGG 的基因为背景集，也可通过 universe 参数指定其中的一个子集作为背景集
#默认以 p<0.05 为标准，Benjamini 方法校正 p 值，q 值阈值 0.2
#默认输出 top500 富集结果
kegg_rich <- enricher(gene = gene_select,
    TERM2GENE = kegg_anno[c('pathway_id', 'gene_id')], 
    TERM2NAME = kegg_anno[c('pathway_id', 'pathway_description')], 
    pvalueCutoff = 0.05, 
    pAdjustMethod = 'BH', 
    qvalueCutoff = 0.2, 
    maxGSSize = 500)

#输出默认结果，即根据上述 p 值等阈值筛选后的
write.table(kegg_rich, 'kegg_rich.significant.txt', sep = '\t', row.names = FALSE, quote = FALSE)

#clusterProfiler 包里的一些默认作图方法，例如
pdf("kegg.pdf",width=10,height=10)
barplot(kegg_rich)  #富集柱形图
dotplot(kegg_rich)  #富集气泡图
cnetplot(kegg_rich) #网络图展示富集功能和基因的包含关系
emapplot(kegg_rich) #网络图展示各富集功能之间共有基因关系
heatplot(kegg_rich) #热图展示富集功能和基因的包含关系
dev.off()
#如果想输出所有富集结果（不考虑 p 值阈值等）
#可以在上述 enricher() 函数中，将 p、q 等阈值设置为 1
#或者在 enrichResult 类对象中直接提取需要的结果
#names(attributes(kegg_rich))  #查看对象“kegg_rich”包含元素
#result <- kegg_rich@result    #其中的 result，即为所有基因的富集结果
#write.table(result, 'kegg_rich.all.txt', sep = '\t', row.names = FALSE, quote = FALSE)

其它更多效果图，可自行参阅clusterProfiler文档。

另外，以上结果均为根据p值等阈值筛选后的，其它特殊情况，有时我们可能想查看所有的结果（不管它是否显著）。

#如果想输出所有富集结果（不考虑 p 值阈值等）
#可以在上述 enricher() 函数中，将 p、q 等阈值设置为 1
 
#或者在 enrichResult 类对象中直接提取需要的结果
names(attributes(kegg_rich))  #查看对象“kegg_rich”包含元素
result <- kegg_rich@result    #其中的 result，即为所有基因的富集结果
write.table(result, 'kegg_rich.all.txt', sep = '\t', row.names = FALSE, quote = FALSE)

三、基因集富集分析（GSEA）

clusterProfiler还提供了自定义的基因集富集分析（GSEA）的功能，方便了无参时的GSEA。

#### GSEA
#加载 clusterProfiler
library(clusterProfiler)
library(enrichplot)
library(ReactomePA)

#基因表达的 log2FC
gene_express <- read.delim('gene_express.txt', stringsAsFactors = FALSE)
genelist <- gene_express$log2FoldChange
names(genelist) <- gene_express$gene_id
genelist <- genelist[order(genelist, decreasing = TRUE)]

#基因集富集分析，以 GO 富集为例
#因为示例数据集的 log2FC 是我瞎凑的，所以 p<0.05 绝对不会有结果的，为了演示数据就使用的 p<1 阈值，正常情况下视情况选择合适 p 值
#默认输出 top500 富集结果
gsea <- GSEA(genelist, 
    TERM2GENE = go_anno[c('ID', 'gene_id')], 
    TERM2NAME = go_anno[c('ID', 'Description')], 
    pvalueCutoff = 1, 
    pAdjustMethod = 'BH',
    maxGSSize = 500)

write.table(gsea, 'gsea.txt', sep = '\t', row.names = FALSE, quote = FALSE)
#查看 enrichResult 类对象中的元素
#names(attributes(gsea))
#head(gsea@result)
#同上文提到的，clusterProfiler 包里的一些默认作图方法，例如
pdf("gsea.pdf",width=10,height=10)
dotplot(gsea)  #富集气泡图
emapplot(gsea) #网络图展示各富集功能之间共有基因关系
#gseaplot(gsea, 'GO:0004386')    #基因集富集得分图
gseaplot2(gsea, 'GO:0004386')    #基因集富集得分图
dev.off()

结果：

ID  Description setSize enrichmentScore NES pvalue  p.adjust    qvalues rank    leading_edge    core_enrichment
GO:0004386  helicase activity   61  0.565523816609853   1.69505350745673    0.0014367816091954  0.470521541950113   0.459482038429407   595 tags=30%, list=16%, signal=25%  gene240/gene254/gene304/gene408/gene466/gene530/gene572/gene641/gene673/gene701/gene778/gene510/gene106/gene2090/gene257/gene561/gene1072/gene1249
GO:0006979  response to oxidative stress    16  0.681540602697324   1.57292409726776    0.00841750841750842 0.470521541950113   0.459482038429407   337 tags=38%, list=9%, signal=34%   gene454/gene553/gene592/gene678/gene681/gene758
GO:0045454  cell redox homeostasis  72  0.508482852259299   1.5589346336912 0.00843881856540084 0.470521541950113   0.459482038429407   659 tags=31%, list=18%, signal=26%  gene194/gene232/gene262/gene328/gene329/gene429/gene444/gene454/gene455/gene472/gene588/gene704/gene175/gene156/gene1808/gene150/gene1506/gene1322/gene1436/gene511/gene1534/gene1879
GO:0016671  oxidoreductase activity, acting on a sulfur group of donors, disulfide as acceptor  11  0.75643380614145    1.62208988553364    0.00847457627118644 0.470521541950113   0.459482038429407   601 tags=55%, list=16%, signal=46%  gene329/gene454/gene455/gene156/gene150/gene1436

如下展示了基因集富集得分图：

GO、KEGG富集分析（二）纯无参自定义物种
示例数据链接：https://pan.baidu.com/s/1fyy8XWgZcFJRUmHDsnprUQ提取码...
2021-06-29：Go注释的相关操作记录
主要参考：(4条消息) 无参转录组GO、KEGG富集分析——diamond+idmapping+GOstats_y...
miRNA功能富集分析
相信大家都做过基因的功能富集分析，小编前面也花了不少篇幅给大家介绍GO和KEGG富集分析 ☞GO和KEGG富集分析...
富集分析与可视化：R 实现
一. GO/KEGG （一）富集分析（二）GOChord、弦图什么的脱离clusterprofiler包，需要...
【R语言】dplyr对数据分组取各组前几行
前面我们介绍过GO富集分析和结果可视化 1.GO和KEGG富集分析视频讲解[https://link.zhihu....
加载R包org.Hs.eg.db出错，避坑指南！
前面我们讲到的GO和KEGG富集分析都会用到org.Hs.eg.db这个包 ☞GO和KEGG富集结果如何显示基因s...
几年过去了，现在这种简单而又烂大街的生信SCI还有期刊接收
几年前，这种下载GEO数据～差异分析～功能分析（GO富集分析、KEGG富集分析、PPI分析）～筛选hub基因～hu...
几年过去了，现在这种简单而又烂大街的生信SCI还有期刊接收
几年前，这种下载GEO数据～差异分析～功能分析（GO富集分析、KEGG富集分析、PPI分析）～筛选hub基因～hu...
【R】气泡图和柱形图展示挑选的KEGG通路
KEGG富集分析相信大家都并不陌生。前面小编给大家系统的介绍了 ☞ R进行GO和KEGG富集分析及结果可视化[ht...
2018-01-02 clusterprofiler
GO 和 KEGG富集分析包，简单好用！代码例子： gene1.txt