Taxonomic annotation of viruses with vConTACT2


Author: 胡童远 | Published: 2021-10-12 15:58

Paper: Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks
Summary: gene-sharing networks are used to assign taxonomy to viral genomes recovered from metagenomes
Journal: Nature Biotechnology
Year: 2019

bitbucket: https://bitbucket.org/MAVERICLab/vcontact2/wiki/Home

Installation

conda create -n vcontact2 python=3
conda activate vcontact2
conda install -y -c bioconda vcontact2
conda install -y -c bioconda mcl blast diamond
# -y, --yes             Do not ask for confirmation.

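Getting the ClusterONE dependency

# Download the clustering tool and move it into the conda bin directory (alternatively, download it on Windows and copy it over)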
wget -c http://www.paccanarolab.org/static_content/clusterone/cluster_one-1.0.jar

java -jar cluster_one-1.0.jar -h

Inspecting the databases:

Installing vcontact2 also downloads the reference databases.

ls -alh /route/miniconda3/envs/vcontact2/lib/python3.8/site-packages/vcontact2/data

Counting protein sequences:

zcat ViralRefSeq-prokaryotes-v94.faa.gz | grep '^>' | wc -l
268145
zcat ViralRefSeq-prokaryotes-v88.faa.gz | grep '^>' | wc -l
230992
zcat ViralRefSeq-prokaryotes-v85.faa.gz | grep '^>' | wc -l
231165
zcat ViralRefSeq-prokaryotes-v201.faa.gz | grep '^>' | wc -l
363514
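
The per-release counts above can also be collected with one small helper. This is a minimal sketch, assuming the .faa.gz files sit in the current directory; count_fasta_records is a hypothetical name chosen for illustration, not part of vConTACT2.

```shell
# count_fasta_records FILE.gz
# Print the number of FASTA records (lines starting with ">") in a gzipped file.
count_fasta_records() {
    gzip -dc -- "$1" | grep -c '^>'
}

# Example loop (commented out so that sourcing this file has no side effects):
# for f in ViralRefSeq-prokaryotes-v*.faa.gz; do
#     printf '%s\t%s\n' "$f" "$(count_fasta_records "$f")"
# done
```

Using grep -c instead of piping to wc -l avoids one process and prints the same number.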

Protein headers

zcat ViralRefSeq-prokaryotes-v94.faa.gz | grep '^>' | head
>NP_037662.1 terminase small subunit [Escherichia virus HK022]
>NP_037663.1 terminase large subunit [Escherichia virus HK022]
>NP_037664.1 head portal protein [Escherichia virus HK022]
>NP_037665.1 head maturation protease [Escherichia virus HK022]
>NP_037666.1 major capsid subunit precursor [Escherichia virus HK022]
>NP_037667.1 gp6 [Escherichia virus HK022]

Taxonomy annotations

less -S ViralRefSeq-prokaryotes-v94.Merged-reference.csv | head
Organism/Name,origin,order,family,subfamily,genus
Acholeplasma virus L2,RefSeq-94,,Plasmaviridae,,Plasmavirus
Acholeplasma virus MV-L51,RefSeq-94,,Inoviridae,,Plectrovirus

Protein-to-contig mapping

less -S ViralRefSeq-prokaryotes-v94.protein2contig.csv | head
protein_id,contig_id,keywords
NP_955551.1,Acholeplasma virus L2,envelope protein
NP_040808.1,Acholeplasma virus L2,envelope protein
NP_040809.1,Acholeplasma virus L2,hypothetical protein L2_02
NP_040810.1,Acholeplasma virus L2,hypothetical protein L2_03
NP_040811.1,Acholeplasma virus L2,hypothetical protein L2_04
NP_040812.1,Acholeplasma virus L2,hypothetical protein L2_05
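
Both CSV files shown above can be summarised by counting rows per value of one column, for example genomes per family (column 4 of the Merged-reference file) or proteins per genome (column 2 of the protein2contig file). The following is a rough sketch; count_by_column is a hypothetical helper, and it assumes that the counted column and all columns before it contain no embedded commas.

```shell
# count_by_column FILE.csv COLUMN_NUMBER
# Skip the header row, then print "<count><TAB><value>" for every distinct
# value in the chosen column, largest count first.
count_by_column() {
    awk -F',' -v col="$2" '
        NR > 1 { n[$col]++ }
        END    { for (k in n) printf "%d\t%s\n", n[k], k }
    ' "$1" |
        sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Example calls (commented out):
# count_by_column ViralRefSeq-prokaryotes-v94.Merged-reference.csv 4 | head
# count_by_column ViralRefSeq-prokaryotes-v94.protein2contig.csv 2 | head
```

Rows with an empty value in the chosen column (for example a genome without a family) are counted under an empty label.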

vcontact2 parameters

vcontact2 --help

--raw-proteins FASTA-formatted proteins file
--proteins-fp A file linking the protein name and the genome names (csv or tsv)
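--rel-mode {BLASTP,Diamond,MMSeqs2} Protein alignment method used to compute protein sequence similarity
--pcs-mode {ClusterONE,MCL} Protein clustering method
--vcs-mode {ClusterONE,MCL} Viral (genome) clustering method
--c1-bin Path to "cluster_one-1.0.jar"
--db Reference database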
{None,
ProkaryoticViralRefSeq85-ICTV [default],
ProkaryoticViralRefSeq85-Merged,
ProkaryoticViralRefSeq88-Merged,
ProkaryoticViralRefSeq94-Merged,
ProkaryoticViralRefSeq97-Merged,
ProkaryoticViralRefSeq201-Merged,
ArchaeaViralRefSeq85-Merged,
ArchaeaViralRefSeq94-Merged,
ArchaeaViralRefSeq97-Merged,
ArchaeaViralRefSeq201-Merged}

Recommended parameters:

vcontact2 --raw-proteins [proteins file] \
--rel-mode 'Diamond' \
--proteins-fp [gene-to-genome mapping file] \
--db 'ProkaryoticViralRefSeq94-Merged' \
--pcs-mode MCL \
--vcs-mode ClusterONE \
--c1-bin [path to ClusterONE] \
--output-dir [target output directory]

Input data:

1 Predict protein sequences with Prodigal

Extract the IDs and sequences of medium-quality contigs

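# IDs of contigs whose quality tier (column 8 of quality_summary.tsv) is Medium-quality
awk -F"\t" '$8=="Medium-quality" {print $1}' quality_summary.tsv > medium_more.contigs
# Sequences of those contigs; assumes every record in combined.fna is one header line plus one sequence line
: > medium_more.fna   # start from an empty file so reruns do not append duplicates
while read -r i
do
    # anchor on the header and match the ID as a whole word, so contig_1 does not also pull in contig_10
    grep -A 1 -w "^>$i" combined.fna >> medium_more.fna
    echo "$i done..."
done < medium_more.contigs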
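
The loop above rereads combined.fna once per ID and relies on every record occupying exactly two lines. A single-pass alternative is sketched below; extract_fasta_by_id is a hypothetical helper name, and it assumes that a record's ID is the header text up to the first whitespace.

```shell
# extract_fasta_by_id IDS_FILE FASTA_FILE
# Print every FASTA record whose ID appears in IDS_FILE (one ID per line).
# Multi-line sequences are kept intact and IDs are compared exactly.
extract_fasta_by_id() {
    awk '
        # first file: remember the wanted IDs
        NR == FNR { if ($1 != "") want[$1] = 1; next }
        # header line: decide whether this record is wanted
        /^>/      { keep = (substr($1, 2) in want) }
        # print the header and all following sequence lines of wanted records
        keep
    ' "$1" "$2"
}

# Example call (commented out):
# extract_fasta_by_id medium_more.contigs combined.fna > medium_more.fna
```

Because the IDs are looked up in a table rather than matched as patterns, an ID such as contig_1 cannot accidentally select contig_10.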

Gene prediction and translation

prodigal \
-a ./prodigal/out.faa \
-d ./prodigal/out.fna \
-f gff \
-g 11 \
-o ./prodigal/out.gff \
-p single \
-s ./prodigal/out.stat \
-i ./checkv/output_sop/medium_more.fna

2 Prepare the gene2genome file

conda activate vcontact2
vcontact2_gene2genome \
--proteins out.faa \
--output out_map.csv \
--source-type Prodigal-FAA

The output file must have a .csv suffix; otherwise the downstream analysis fails.

Parameters:
--source-type
{VIRSorter,Prodigal-coords,Prodigal-FAA,MetaGeneMark,NCBICodingSequence,NCBIFasta}
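
To make the role of the mapping file concrete, here is a rough sketch of how a protein-to-contig table could be derived from a Prodigal protein FASTA by hand. It is not the implementation of vcontact2_gene2genome. It assumes that Prodigal names each protein as the contig ID followed by an underscore and a gene number, and it fills the keywords column with a placeholder value; prodigal_gene2genome is a hypothetical name.

```shell
# prodigal_gene2genome PROTEINS_FAA
# Print a protein_id,contig_id,keywords table for a Prodigal protein FASTA.
# Assumption: protein IDs look like "<contig_id>_<gene number>".
prodigal_gene2genome() {
    awk '
        BEGIN { print "protein_id,contig_id,keywords" }
        /^>/ {
            protein = substr($1, 2)           # header text up to the first space
            contig = protein
            sub(/_[0-9]+$/, "", contig)       # drop the trailing gene number
            print protein "," contig ",None"  # keywords: placeholder value
        }
    ' "$1"
}

# Example call (commented out):
# prodigal_gene2genome out.faa > out_map.csv
```

Comparing a few rows of this table with the file written by vcontact2_gene2genome is a quick way to confirm that contig IDs were parsed as expected.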

Log:

vcontact2_gene2genome:174: DeprecationWarning: 'U' mode is deprecated
  with open(results.proteins, 'rU') as proteins_fh:

Result:

3 Run vcontact2 to obtain protein clusters (PCs) and viral clusters (VCs)

vcontact2 \
--rel-mode 'Diamond' \
--pcs-mode MCL \
--vcs-mode ClusterONE \
--c1-bin /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/ \
--db 'ProkaryoticViralRefSeq94-Merged' \
--verbose --threads 4 \
--raw-proteins ./prodigal/out.faa \
--proteins-fp ./prodigal/out_map.csv \
--output-dir ./vcontact2/

Log:

============================This is vConTACT2 0.9.19

----------------------------------Pre-Analysis

INFO:vcontact2: Found Diamond
INFO:vcontact2: Found MCL
INFO:vcontact2: Identified 4 CPUs
INFO:vcontact2: Using reference database: ProkaryoticViralRefSeq94-Merged
INFO:vcontact2: Using existing directory ./vcontact2/

------------------------------Reference databases

INFO:vcontact2: Merging ProkaryoticViralRefSeq94-Merged to user sequences...
INFO:vcontact2: Creating Diamond database and running Diamond...
INFO:vcontact2.protein_clusters: Creating Diamond database...
INFO:vcontact2.protein_clusters: Running Diamond...

-------------------------------Protein clustering

INFO:vcontact2: Loading proteins...
INFO:vcontact2: Merging ProkaryoticViralRefSeq94-Merged to user gene-to-genome mapping...
DEBUG:vcontact2: Read 268201 proteins from ./prodigal/out_map.csv.
DEBUG:vcontact2.protein_clusters: Generating abc file...
DEBUG:vcontact2.protein_clusters: Running MCL...
INFO:vcontact2: Building the cluster and profiles 
INFO:vcontact2: Saving intermediate files...

----------------------------------Loading data

DEBUG:vcontact2: Read 2617 entries from ./vcontact2/vConTACT2_contigs.csv
INFO:vcontact2: Read 232886 entries (dropped 2328 singletons)

--------------------------------Adding Taxonomy

------------------------Calculating Similarity Networks

DEBUG:vcontact2.pcprofiles: Hypergeometric contig-similarity network:
DEBUG:vcontact2.pcprofiles: 21269 PCs present in strictly more than 3 contigs
DEBUG:vcontact2.pcprofiles: Hypergeometric PCs-similarity network
DEBUG:vcontact2: Network Contigs

------------------------Contig Clustering & Affiliation

DEBUG:vcontact2.contig_clusters: 3 taxonomic levels detected: genus, order, family
INFO:vcontact2.contig_clusters: Exporting for ClusterONE
DEBUG:vcontact2.contig_clusters: Saving network in file ./vcontact2/c1.ntw (9513ines).
INFO:vcontact2.contig_clusters: Clustering the PC Similarity-Network using ClusterONE
INFO:vcontact2.contig_clusters: Running clusterONE
DEBUG:vcontact2.contig_clusters: ClusterONE results are being saved to ./vcontact2/c1.clusters.
INFO:vcontact2.contig_clusters: 346 clusters loaded (singletons and non-connected nodes are dropped).
INFO:vcontact2.contig_clusters: Computing membership matrix...
ERROR:vcontact2: Error in viral clusters
ERROR:vcontact2: type object 'object' has no attribute 'dtype'

Traceback (most recent call last):
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/bin/vcontact2", line 637, in main
    vc = vcontact2.cluster_refinements.ViralClusters(gc.contigs, profiles_fp, optimize=options.optimize)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/vcontact2/cluster_refinements.py", line 37, in __in
    self.metrics = pd.DataFrame(columns=summary_headers)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/frame.py", line 411, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 242, i
    val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1221, in construc
    dtype = dtype.dtype
AttributeError: type object 'object' has no attribute 'dtype'

debug round 1

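Installing pandas 1.2.3 made conda downgrade vcontact2 automatically, and the run still failed, this time because DataFrame.ix no longer exists in pandas 1.0 and later.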

conda install pandas=1.2.3

Downloading and Extracting Packages
certifi-2021.10.8    | 145 KB    | ####################################### | 100%
pandas-1.2.3         | 12.1 MB   | ####################################### | 100%
vcontact2-0.9.15     | 98.0 MB   | ####################################### | 100%
openssl-3.0.0        | 2.9 MB    | ####################################### | 100%
ca-certificates-2021 | 139 KB    | ####################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

# Run
# vConTACT2 0.9.13
ERROR:vcontact2: Error in contig clustering
ERROR:vcontact2: 'DataFrame' object has no attribute 'ix'
AttributeError: 'DataFrame' object has no attribute 'ix'

debug round 2

After this step vcontact2 is still at version 0.9.15: conda update vcontact2 did not upgrade it and only updated dependencies such as numpy.

conda update vcontact2

Downloading and Extracting Packages
numpy-1.21.2         | 6.2 MB    | ####################################### | 100%

conda list

vcontact2                 0.9.15                     py_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
pandas                    1.2.3            py38h51da96c_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
numpy                     1.21.2           py38he2449b9_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge

ERROR:vcontact2: Error in contig clustering
ERROR:vcontact2: 'DataFrame' object has no attribute 'ix'
AttributeError: 'DataFrame' object has no attribute 'ix'

debug round 3

With pandas at 0.25.3, downgrading numpy from 1.21.2 to 1.19.5 resolved the error:

conda list
pandas                    0.25.3
numpy                     1.21.2
conda install numpy=1.19.5

------------------------Contig Clustering & Affiliation-------------------------
DEBUG:vcontact2.contig_clusters: 3 taxonomic levels detected: genus, order, family
INFO:vcontact2.contig_clusters: Exporting for ClusterONE
DEBUG:vcontact2.contig_clusters: Network file already exist.
INFO:vcontact2.contig_clusters: Clustering the PC Similarity-Network using ClusterONE
DEBUG:vcontact2.contig_clusters: ClusterONE file ./vcontact2/c1.clusters already exist.
INFO:vcontact2.contig_clusters: 346 clusters loaded (singletons and non-connected nodes are dropped).
INFO:vcontact2.contig_clusters: Computing membership matrix...
DEBUG:vcontact2.cluster_refinements: 3 taxonomic levels detected: genus, order, family
INFO:vcontact2.cluster_refinements: Optimizing on distance: 9
INFO:vcontact2.evaluations: Performance evaluations at the genus level...
INFO:vcontact2.cluster_refinements: Identified a single best composite score 2.761417223726828 for distance 9
INFO:vcontact2.cluster_refinements: Merging optimal distance determined from performance evaluations.
DEBUG:vcontact2.evaluations: 3 taxonomic levels detected: order, family, genus
INFO:vcontact2.evaluations: Performance evaluations at the order level...
INFO:vcontact2.evaluations: Performance evaluations at the family level...
INFO:vcontact2.evaluations: Performance evaluations at the genus level...
INFO:vcontact2.cluster_refinements:              PPV  Sensitivity  Accuracy
order   1.000000     0.351764  0.593097
family  0.994381     0.630965  0.792098
genus   0.869642     0.972256  0.919519

--------------------------------Protein modules---------------------------------
DEBUG:vcontact2.modules: Filtered 0 edges according to the sig. threshold 1.0.
INFO:vcontact2.modules: Exporting the PC-network for MCL
DEBUG:vcontact2.modules: Saving network in file ./vcontact2/modules.ntwk (2292198 lines)
INFO:vcontact2.modules: Clustering the PC similarity-network
DEBUG:vcontact2.modules: MCL(5.0) results are saved in ./vcontact2/modules_mcl_5.0.clusters.
INFO:vcontact2.modules: Loading the clustering results
DEBUG:vcontact2.modules: Saving 622 modules containing 18958  protein clusters in ./vcontact2/modules_mcl_5.0_modules.pandas.
---------------------------Link modules and clusters----------------------------
INFO:vcontact2.modules: 2844 contigs-modules owning association, 50018 filtered (a contig must have 50% of the PCs to own a module).
INFO:vcontact2.modules: Linking 622 modules with 346 contigs clusters...
INFO:vcontact2.modules: Network done 346 clusters, 622 modules and 297 edges.


----------------------------Exporting results files-----------------------------
INFO:vcontact2: Identifying genomes that are not clustered (i.e. singletons, outliers and overlaps
There were 540 genomes (including refs) that were singleton, outlier or overlaps.
INFO:vcontact2: Building final summary table
INFO:vcontact2.exports.summaries: Reading edges for 2617 contigs
INFO:vcontact2.exports.summaries: Building PC array
INFO:vcontact2.exports.summaries: Calculating comparisons for back-calculations
...
INFO:vcontact2.exports.summaries: Writing viral cluster overview file...
INFO:vcontact2.exports.summaries: Examining each viral cluster and breaking it down into individual genomes...
INFO:vcontact2.exports.summaries: Writing the genome-by-genome overview file...

yesssssssssssssssss

4 Results

Further reading:
Supplementing and Colouring vConTACT2 Clusters
Applying vContact to Viral Sequences and Visualizing the Output
https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/
(2017). vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. PeerJ
Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method. Computational and Structural Biotechnology Journal. 2020

https://github.com/pandas-dev/pandas/issues/39520

if you are stuck on pandas==0.24.2 (don't ask); downgrading to numpy==1.19.5 works
THANK YOU
also works with pandas==0.25.3
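
The three debug rounds can be condensed into a small check that, given a numpy and a pandas version, names the error the pair is likely to trigger in the vcontact2 0.9.15 environment described in this post. This is only a sketch under stated assumptions: the numpy 1.20 threshold is inferred from the linked pandas issue rather than tested here, version_ge relies on GNU sort -V, and both function names are hypothetical.

```shell
# version_ge A B
# Succeed when dotted version A is greater than or equal to B (needs GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# classify_numpy_pandas NUMPY_VERSION PANDAS_VERSION
# Print "ix-removed", "dtype-error" or "ok" for the given pair.
classify_numpy_pandas() {
    if version_ge "$2" 1.0; then
        echo "ix-removed"     # pandas 1.0 and later no longer provide DataFrame.ix
    elif version_ge "$1" 1.20; then
        echo "dtype-error"    # older pandas with numpy >= 1.20 (assumed threshold)
    else
        echo "ok"             # e.g. numpy 1.19.5 with pandas 0.25.3, as in round 3
    fi
}

# Example call with the versions of the active environment (commented out):
# classify_numpy_pandas \
#     "$(python -c 'import numpy; print(numpy.__version__)')" \
#     "$(python -c 'import pandas; print(pandas.__version__)')"
```

For the combinations seen above, numpy 1.21.2 with pandas 0.25.3 is reported as dtype-error, numpy 1.21.2 with pandas 1.2.3 as ix-removed, and numpy 1.19.5 with pandas 0.25.3 as ok.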
