kraken2/bracken build db


Author: 胡童远 | Published 2023-08-22 09:50

kraken2 GitHub manual:
https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown

Environment

# Python modules required by the GTDB-to-Kraken taxonomy script (tax_gtdb_final.py, used below);
# everything is in the standard library except Biopython (Bio.Phylo), installed via conda.
import argparse
import gzip
import itertools
import os
import pathlib
import textwrap
import sys
from Bio import Phylo

# Set up the conda environment and install Biopython
source /hwfsxx1/ST_HN/P18Z10200N0423/huty/software/miniconda3_2/etc/profile.d/conda.sh
conda activate py38
conda install -c bioconda biopython

GTDB annotation

source /hwfssz1/**/hutongyuan/softwares/miniconda3/etc/profile.d/conda.sh
conda activate gtdbtk
# Configure the database path
echo "export GTDBTK_DATA_PATH=/hwfssz1/**/hutongyuan/databases/release207_v2/" > /hwfssz1/**/hutongyuan/softwares/miniconda3/envs/gtdbtk/etc/conda/activate.d/gtdbtk.sh
# Run the classification
gtdbtk classify_wf \
--cpus 32 \
-x fa \
--genome_dir bins_drep_comp70/dereplicated_genomes/ \
--out_dir anno_drep_comp70/

Processing the annotation files
Many potential novel species have no annotation at some ranks, so the MAG ID is used to fill them in.
Several MAGs can share the same species annotation, so the MAG ID is appended at the species level to keep every genome distinct.

cp /hwfssz5/**/anno_drep_comp70/classify/gtdbtk.bac120.summary.tsv ./
cp /hwfssz5/**/anno_drep_comp70/classify/gtdbtk.ar53.summary.tsv ./
# Extract and merge the bacterial and archaeal annotations
touch anno.bac.ar.tsv
cat gtdbtk.bac120.summary.tsv | sed '1d' | awk -F"\t" '{OFS="\t"}{print $1, $2}' >> anno.bac.ar.tsv
cat gtdbtk.ar53.summary.tsv | sed '1d' | awk -F"\t" '{OFS="\t"}{print $1, $2}' >> anno.bac.ar.tsv
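A quick sanity check: the merged file should contain one line per dereplicated MAG (120 in this run).
wc -l anno.bac.ar.tsv   # expect 120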
# Fill in the unannotated ranks with the MAG ID
#!/usr/bin/env python3
# Fill in unannotated GTDB ranks with the MAG ID and replace the species
# rank with the MAG ID so that every dereplicated genome stays distinct.
import re

with open("anno_edit.txt", 'w') as o:
    with open("anno.bac.ar.tsv", 'r') as infile:
        for line in infile:
            # column 0 is the MAG ID; columns 1-7 are the ranks d__ ... s__
            line = re.split(r'[\t;]', line.strip())
            # species always gets the MAG ID, even when GTDB assigned a species
            line[7] = "s__" + line[0]
            # higher ranks only get the MAG ID when GTDB left them empty
            if line[6] == "g__":
                line[6] = "g__" + line[0]
            if line[5] == "f__":
                line[5] = "f__" + line[0]
            if line[4] == "o__":
                line[4] = "o__" + line[0]
            if line[3] == "c__":
                line[3] = "c__" + line[0]
            o.write("{}\t{};{};{};{};{};{};{}\n".format(
                line[0], line[1], line[2], line[3],
                line[4], line[5], line[6], line[7]))

Requirements for the GTDB annotation file:
1 columns are separated by tabs
2 no blank line at the end of the file
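Both requirements can be verified before building (a minimal sketch against the anno_edit.txt produced above):

# every line must have exactly two tab-separated columns
awk -F'\t' 'NF != 2 {print "bad line " NR ": " $0}' anno_edit.txt
# the last line must not be blank
tail -n 1 anno_edit.txt | grep -q . || echo "WARNING: trailing blank line"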

Preparing the kraken2 database build

source /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/etc/profile.d/conda.sh
conda activate python3.7

bins="/hwfssz5/ST_HEALTH/P18Z10200N0423/hutongyuan/analysis/fish/02_bin/bins_drep_comp70/dereplicated_genomes"
python ./tax_gtdb_final.py \
--gtdb anno_edit.txt \
--assemblies $bins/ \
--nodes nodes.dmp \
--names names.dmp \
--kraken_dir kraken_library
bash ./sc_buildk2db.sh

Output

Loading GTDB taxonomy:
    336 taxa found

Assigning tax IDs:
    domains:       2
    phyla:        13
    classes:      19
    orders:       38
    families:     56
    genera:       88
    species:     120

Writing taxonomy nodes to nodes.dmp
Writing taxonomy names to names.dmp

Searching for an assembly for each accession:
    120 / 120 assemblies found

Making Kraken assembly directory:
    120 / 120 assemblies

kraken_library now contains the dereplicated MAGs.
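A quick count confirms that the library matches the 120 assemblies reported above:
ls kraken_library/*.fa | wc -l   # expect 120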

Building the database

conda activate kraken2
for file in kraken_library/*.fa
do
    kraken2-build --add-to-library $file --db kraken_database
done

mkdir kraken_database/taxonomy
mv names.dmp kraken_database/taxonomy
mv nodes.dmp kraken_database/taxonomy

# Either run the build directly:
## kraken2-build --build --db kraken_database
# or submit it as a cluster job; build_kk2.sh contains the following:
vi build_kk2.sh
source /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/etc/profile.d/conda.sh
conda activate kraken2
kraken2-build --build --db kraken_database --threads 4
qsub -cwd -l vf=10G,p=4 -q st.q -P P18Z10200N0423 -binding linear:4 build_kk2.sh

Output

Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.762s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 1491144408 bytes
Capacity estimation complete. [3m54.774s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 11 bits reserved for taxid.
Completed processing of 282544 sequences, 1399196925 bp
Writing data to disk...  complete.
Database files completed. [9m25.653s]
Database construction complete. [Total: 13m35.437s]
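Optionally, the finished database can be inspected to confirm that the custom GTDB taxa were indexed (kraken2-inspect ships with kraken2; this check is not part of the original workflow):
kraken2-inspect --db kraken_database --threads 4 | head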

Generate the bracken distribution

# Let's continue to generate bracken distribution for profile correction
cd kraken_database
kraken2 --db=./ --threads=4 <( find -L library \( -name "*.fna" -o -name "*.fasta" -o -name "*.fa" \) -exec cat {} + ) > database.kraken

Output

Loading database information... done.
41646 sequences (388.85 Mbp) processed in 37.970s (65.8 Kseq/m, 614.45 Mbp/m).
  41646 sequences classified (100.00%)
  0 sequences unclassified (0.00%)

readlen = 50, 100, 200, 300 (these distributions are what bracken may use later, chosen to match the read length of the samples)

conda activate bracken
kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database50mers.kraken -l 50 -t 1
kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database100mers.kraken -l 100 -t 1
kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database200mers.kraken -l 200 -t 1
kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database300mers.kraken -l 300 -t 1

generate_kmer_distribution.py -i database50mers.kraken -o database50mers.kmer_distrib
generate_kmer_distribution.py -i database100mers.kraken -o database100mers.kmer_distrib
generate_kmer_distribution.py -i database200mers.kraken -o database200mers.kmer_distrib
generate_kmer_distribution.py -i database300mers.kraken -o database300mers.kmer_distrib

# rm -r ../kraken_library/

Output

(bracken) [hutongyuan@cngb-xcompute-0-18 kraken_database]$ kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database50mers.kraken -l 50 -t 1
        >>STEP 0: PARSING COMMAND LINE ARGUMENTS
                Taxonomy nodes file: ./taxonomy/nodes.dmp
                Seqid file:          seqid2taxid.map
                Num Threads:         1
                Kmer Length:         31
                Read Length:         50
        >>STEP 1: READING SEQID2TAXID MAP
                282544 total sequences read
        >>STEP 2: READING NODES.DMP FILE
                1056 total nodes read
        >>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
                50mers, with a database built using 31mers
                282546 sequences converted
        Time Elaped: 1 minutes, 23 seconds, 0.00000 microseconds
        =============================
(bracken) [hutongyuan@cngb-xcompute-0-18 kraken_database]$ kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database100mers.kraken -l 100 -t 1
        >>STEP 0: PARSING COMMAND LINE ARGUMENTS
                Taxonomy nodes file: ./taxonomy/nodes.dmp
                Seqid file:          seqid2taxid.map
                Num Threads:         1
                Kmer Length:         31
                Read Length:         100
        >>STEP 1: READING SEQID2TAXID MAP
                282544 total sequences read
        >>STEP 2: READING NODES.DMP FILE
                1056 total nodes read
        >>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
                100mers, with a database built using 31mers
                282546 sequences converted
        Time Elaped: 2 minutes, 29 seconds, 0.00000 microseconds
        =============================
(bracken) [hutongyuan@cngb-xcompute-0-18 kraken_database]$ kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database200mers.kraken -l 200 -t 1
        >>STEP 0: PARSING COMMAND LINE ARGUMENTS
                Taxonomy nodes file: ./taxonomy/nodes.dmp
                Seqid file:          seqid2taxid.map
                Num Threads:         1
                Kmer Length:         31
                Read Length:         200
        >>STEP 1: READING SEQID2TAXID MAP
                282544 total sequences read
        >>STEP 2: READING NODES.DMP FILE
                1056 total nodes read
        >>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
                200mers, with a database built using 31mers
                282546 sequences converted
        Time Elaped: 4 minutes, 12 seconds, 0.00000 microseconds
        =============================
(bracken) [hutongyuan@cngb-xcompute-0-18 kraken_database]$ kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database300mers.kraken -l 300 -t 1
        >>STEP 0: PARSING COMMAND LINE ARGUMENTS
                Taxonomy nodes file: ./taxonomy/nodes.dmp
                Seqid file:          seqid2taxid.map
                Num Threads:         1
                Kmer Length:         31
                Read Length:         300
        >>STEP 1: READING SEQID2TAXID MAP
                282544 total sequences read
        >>STEP 2: READING NODES.DMP FILE
                1056 total nodes read
        >>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
                300mers, with a database built using 31mers
                282546 sequences converted
        Time Elaped: 5 minutes, 39 seconds, 0.00000 microseconds
        =============================
(bracken) [hutongyuan@cngb-login-0-8 kraken_database]$ generate_kmer_distribution.py -i database50mers.kraken -o database50mers.kmer_distrib
PROGRAM START TIME: 12-04-2022 14:10:33
...473 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 12-04-2022 14:10:35
(bracken) [hutongyuan@cngb-login-0-8 kraken_database]$ generate_kmer_distribution.py -i database100mers.kraken -o database100mers.kmer_distrib
PROGRAM START TIME: 12-04-2022 14:10:35
...473 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 12-04-2022 14:10:37
(bracken) [hutongyuan@cngb-login-0-8 kraken_database]$ generate_kmer_distribution.py -i database200mers.kraken -o database200mers.kmer_distrib
PROGRAM START TIME: 12-04-2022 14:10:37
...473 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 12-04-2022 14:10:38
(bracken) [hutongyuan@cngb-login-0-8 kraken_database]$ generate_kmer_distribution.py -i database300mers.kraken -o database300mers.kmer_distrib
PROGRAM START TIME: 12-04-2022 14:10:39
...473 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 12-04-2022 14:10:40
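With the database and the k-mer distributions in place, a typical downstream run would look like the following (a minimal sketch: sample_1.fq.gz / sample_2.fq.gz and the read length of 100 are placeholders, not part of this build; run from the directory containing kraken_database):

# classify reads against the custom GTDB database
kraken2 --db kraken_database --threads 4 --report sample.kreport \
    --paired sample_1.fq.gz sample_2.fq.gz > sample.kraken
# correct the abundance profile with bracken, using the distribution matching the read length
bracken -d kraken_database -i sample.kreport -o sample.bracken -r 100 -l S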
