Building an org.eg.db annotation package directly from NCBI data

Author: 深山夕照深秋雨OvO | Published: 2025-09-12 00:55


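As everyone knows, clusterProfiler is arguably the most popular enrichment-analysis package right now.
But apart from a handful of model organisms, most species have no ready-made background library (OrgDb package), which is painful.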

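What can you do? The R package AnnotationForge provides the function makeOrgPackageFromNCBI,
which fetches data directly from NCBI and builds the background library from it. Quite a few species are annotated at NCBI by now, so this covers part of the need. (Of course, what if NCBI has no data either, because we assembled and annotated a genome de novo ourselves? That will be covered in the next post.)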

# Pick a directory to hold the package and the tarball (in other words, the intermediate files)
# Remember this path; you will need it again later
BUILD_DIR=/path/to/orgdb_tgu
mkdir -p "$BUILD_DIR"
cd "$BUILD_DIR"

# Build the package source directory with R
Rscript -e 'if(!requireNamespace("AnnotationForge", quietly=TRUE)) BiocManager::install("AnnotationForge");
AnnotationForge::makeOrgPackageFromNCBI(
  version   = "0.1",
  maintainer= "kuangzhuoran <you@example.com>",
  author    = "kuangzhuoran <you@example.com>",
  outputDir = "'"$BUILD_DIR"'",
  tax_id    = "59729",             # Taeniopygia guttata
  genus     = "Taeniopygia",
  species   = "guttata"
)'

# Bundle the source directory into a tar.gz
R CMD build org.Tguttata.eg.db

# When it finishes, a file named something like org.Tguttata.eg.db_0.1.tar.gz appears in the directory

# It is a good idea to set up a shared library directory so that every job node can reuse it
SHARE_LIB=/beegfs/home/kuangzhuoran/Rlib
mkdir -p "$SHARE_LIB"

# Install the package (from the tar.gz)
R CMD INSTALL org.Tguttata.eg.db_0.1.tar.gz --library="$SHARE_LIB"

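2.5 If everything goes well, you will see a folder like this:

(Figure: output of a successful build)

Then you only need to run the following in R, with /path/to/Rlib replaced by the shared library directory used above: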

.libPaths(c("/path/to/Rlib", .libPaths()))
library(org.Tguttata.eg.db)

and the package is ready to use.
It works exactly like the OrgDb packages for model organisms.
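
As an alternative to calling `.libPaths()` at the top of every script, the shared library can be exposed through the `R_LIBS` environment variable, which R reads at startup (colon-separated on Unix-like systems). The sketch below is only an illustration: `prepend_r_libs` is a hypothetical helper name, and `/path/to/Rlib` is the same placeholder path used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: put a directory at the front of R_LIBS,
# unless it is already listed there.
prepend_r_libs() {
  case ":${R_LIBS:-}:" in
    *":$1:"*) ;;                              # already present, nothing to do
    *) R_LIBS="$1${R_LIBS:+:$R_LIBS}" ;;      # prepend, keeping any old value
  esac
  export R_LIBS
}
```

For example, `prepend_r_libs /path/to/Rlib` in a job script, before `Rscript` is called, should let `library(org.Tguttata.eg.db)` find the package without editing the R code itself.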

As you might have guessed, something did go wrong:

processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Error in (function (cond) : error in evaluating the argument 'table' in selecting a method for function '%in%': Server denied you to change to the given directory

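I asked GPT, and the error is probably related to the server's network: the server I use has many network restrictions that I am not aware of, hence the error. That does not matter, though, because by the time the code gets this far, all the files that need to be downloaded have already been downloaded: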

# These files have finished downloading
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
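
Before rebuilding, it may be worth confirming that the five files really are complete. The following is a minimal sketch, not part of AnnotationForge: `check_ncbi_files` is a hypothetical helper, and it assumes the files sit in the directory you pass in (the build directory, if `NCBIFilesDir` was left at its default).

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that the five NCBI files listed above exist,
# are non-empty, and pass a gzip integrity test.
check_ncbi_files() {
  local dir="$1" status=0 f
  for f in gene2pubmed.gz gene2accession.gz gene2refseq.gz gene_info.gz gene2go.gz; do
    if [ ! -s "$dir/$f" ]; then
      echo "MISSING  $f"
      status=1
    elif ! gzip -t "$dir/$f" 2>/dev/null; then
      echo "CORRUPT  $f"
      status=1
    else
      echo "OK       $f"
    fi
  done
  return "$status"
}
```

Calling `check_ncbi_files /path/to/orgdb_tgu` prints one line per file and returns a non-zero status if any file is missing or truncated.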

cd /path/to/orgdb_tgu # this is the path you were asked to remember earlier
R CMD build org.Tguttata.eg.db
R CMD INSTALL org.Tguttata.eg.db_0.1.tar.gz --library=/path/to/Rlib

Now you can finally find, under /path/to/Rlib, the folder shown in the figure from section 2.5 above.

Finally, take a look at the help page for the function:

?makeOrgPackageFromNCBI

## Make an organism package from annotations available from NCBI.

### Description

The `makeOrgPackageFromNCBI` function allows the user to make an organism package from NCBI annotations available from the NCBI.

### Usage

```

  makeOrgPackageFromNCBI(
    version,
    maintainer,
    author,
    outputDir=getwd(),
    tax_id,
    genus=NULL,
    species=NULL,
    NCBIFilesDir=getwd(),
    databaseOnly=FALSE,
    useDeprecatedStyle=FALSE,
    rebuildCache=TRUE,
    verbose=TRUE,
    ensemblVersion=NULL)

```

### Arguments

| Argument | Description |
| --- | --- |
| `version` | Package version in 'x.y.z' format. |
| `maintainer` | Package maintainer followed by email |
| `author` | Creator of package. |
| `outputDir` | Path where the package source should be assembled. |
| `tax_id` | The Taxonomy ID that represents the organism. |
| `genus` | Single string indicating the genus. |
| `species` | Single string indicating the species. |
| `NCBIFilesDir` | When a path is given, the files used to create the DB are saved locally. |
| `databaseOnly` | When TRUE, a DB is created without the package infrastructure. Used for OrgDb packages hosted on AnnotationHub. |
| `useDeprecatedStyle` | Legacy support for older package style with bimaps. |
| `rebuildCache` | When TRUE, the files used to create the DB are refreshed (i.e., re-downloaded) if the timestamp is greater than 24 hours old. When FALSE, the temporary NCBI.sqlite DB and final package are re-generated from local files in `outputDir`. Used internally and for testing. |
| `verbose` | When TRUE, status messages are printed. |
| `ensemblVersion` | Ensembl version to use. When NULL, uses the current version. |

### Details

`makeOrgPackageFromNCBI` downloads multiple files and assembles a 33 GB database in `NCBIFilesDir`. The first time the function is run it may take well over an hour; subsequent calls reuse files from the cache and are much faster. The default behavior of `makeOrgPackageFromNCBI` attempts to refresh the cached files each day (suppress with `rebuildCache = FALSE`).

The files that are downloaded from NCBI may take longer to download than the default timeout permits. We encourage users to set a `options(timeout=xxx)` to encourage the files to finish downloading. Adjust the timelimit according to download speed and capacity.
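
One way to apply the timeout advice is to put `options(timeout = ...)` in front of the build call in the same `Rscript` invocation. The sketch below only prints the R code so it can be inspected first. `orgdb_r_expr` is a hypothetical helper, 3600 seconds is an arbitrary illustrative value, and the output directory is assumed to contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the R code for one build, with the download
# timeout raised first. Arguments: timeout in seconds, output directory.
# tax_id, genus and species are the zebra finch values from the example.
orgdb_r_expr() {
  local timeout_s="$1" out_dir="$2"
  cat <<EOF
options(timeout = ${timeout_s})
AnnotationForge::makeOrgPackageFromNCBI(
  version    = "0.1",
  maintainer = "Some One <so@someplace.org>",
  author     = "Some One <so@someplace.org>",
  outputDir  = "${out_dir}",
  tax_id     = "59729",
  genus      = "Taeniopygia",
  species    = "guttata"
)
EOF
}
```

Once the printed code looks right, it could be run with `Rscript -e "$(orgdb_r_expr 3600 "$BUILD_DIR")"`.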

Depending on the organism, the database file could reach up to 49 G. You will need ~62G free for downloading files and creating the largest database as of February 2022.
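
Given these sizes, a quick free-space check before starting can save a failed run. This is a rough sketch: `has_free_gb` is a hypothetical helper, it relies on the POSIX `df -Pk` output format, and the 62 in the usage note simply echoes the figure quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# REQUIRED_GB gibibytes available. Arguments: DIR, REQUIRED_GB.
has_free_gb() {
  local dir="$1" required_gb="$2" avail_kb
  # With -P, df prints one header line and one data line; field 4 is the
  # available space in 1024-byte blocks.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}
```

For example, `has_free_gb /path/to/orgdb_tgu 62 || echo "not enough free space" >&2` warns before any download begins.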

Some orgDbs are already provided through `AnnotationHub`. See package `AnnotationHub::AnnotationHub`

### Value

Nothing returned to the R session. Just creates an organism annotation package.

### Author(s)

M. Carlson

### Examples


```

## Not run: 
## Makes an organism package for Zebra Finch from NCBI:

makeOrgPackageFromNCBI(version = "0.1",
                       author = "Some One <so@someplace.org>",
                       maintainer = "Some One <so@someplace.org>",
                       outputDir = ".",
                       tax_id = "59729",
                       genus = "Taeniopygia",
                       species = "guttata")

## End(Not run)
```