Longest mRNA, CDS, and Protein (Python Implementation)

Author: 球果假水晶蓝 | Published 2022-03-25 21:40


#!/usr/bin/env python3
# -*- coding:utf-8 -*-
"""
@FileName: get_longest
@Time: 2022/3/25,19:12
@Motto: go go go 
"""
import argparse
from Bio import SeqIO  #   pip install  biopython


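def read_file(file):
    t = {}  # gene ID -> [length, sequence ID] of the longest sequence seen so far
    result = {}  # sequence ID -> sequence; holds the longest mRNA / CDS / protein per gene
    for seq_record in SeqIO.parse(file, "fasta"):  # parse the FASTA file with Biopython
        # Split once from the right on ".": aaaaa.01G000100.t1 -> aaaaa.01G000100
        gene_id = seq_record.id.rsplit(".", 1)[0]
        seq_len = len(seq_record.seq)

        if gene_id not in t:
            result[seq_record.id] = str(seq_record.seq)
            t[gene_id] = [seq_len, seq_record.id]
        elif seq_len > t[gene_id][0]:
            # Found a longer sequence for this gene: drop the old entry, keep the new one
            result.pop(t[gene_id][1])
            result[seq_record.id] = str(seq_record.seq)
            t[gene_id] = [seq_len, seq_record.id]
    return result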


def write(filename, res):
    with open(filename,'w') as f:
        for i, j in res.items():
            f.write(">" + i + "\n")
            f.write(j + "\n")


def main():
    parser = argparse.ArgumentParser(description='Keep the longest sequence for each gene')
    parser.add_argument("-i", "--input", required=True, help="input filename")
    parser.add_argument("-o", "--output", required=True, help="output filename")
    args = parser.parse_args()
    res_dict = read_file(args.input)
    write(args.output, res_dict)


if __name__ == '__main__':
    #res_dict = read_file(r'./cds.fa')
    #write(r'out_cds', res_dict)
    main()

[Figure: script usage]
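As a rough illustration of how the two options are read, the sketch below feeds an argument list straight to the parser instead of the command line. The file names cds.fa and out_cds are taken from the commented-out lines in the script; the helper name build_parser is made up for this example. Assuming the script is saved as get_longest.py (a guess based on the @FileName line), the equivalent call would be python get_longest.py -i cds.fa -o out_cds.

```python
import argparse


def build_parser():
    # Hypothetical helper: same two options as in main() of the script
    parser = argparse.ArgumentParser(description="Keep the longest sequence for each gene")
    parser.add_argument("-i", "--input", help="input filename")
    parser.add_argument("-o", "--output", help="output filename")
    return parser


# Simulate the command line: -i cds.fa -o out_cds
args = build_parser().parse_args(["-i", "cds.fa", "-o", "out_cds"])
print(args.input, args.output)
```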

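I have an mRNA file, and my analysis needs the longest mRNA of each gene. My approach uses two dictionaries: result holds the output, and t is a helper that records the sequence length and sequence name for each gene. Splitting the sequence name once from the right on "." gives the gene ID, so aaaaa.01G000100.t1 becomes aaaaa.01G000100. If the gene ID is not yet in t, the pair sequence name (aaaaa.01G000100.t1) : sequence is added to result, and gene ID : [sequence length, sequence name] is added to t. If the gene ID is already in t, the lengths are compared and the longer sequence is kept; on a tie, the sequence seen first stays.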
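The selection logic can be tried without a FASTA file or Biopython. The following is a minimal sketch, not the script itself: it uses a single dictionary rather than the two described above, and the record IDs and sequences are made-up toy data.

```python
# Toy records (hypothetical IDs and sequences, for illustration only)
records = [
    ("aaaaa.01G000100.t1", "ATGAAATAA"),
    ("aaaaa.01G000100.t2", "ATGAAACCCGGGTAA"),
    ("aaaaa.01G000200.t1", "ATGTTTTAA"),
]


def pick_longest(records):
    """Return {sequence ID: sequence}, keeping the longest entry per gene."""
    best = {}  # gene ID -> (sequence ID, sequence)
    for seq_id, seq in records:
        # Split once from the right on "." to drop the transcript suffix
        gene_id = seq_id.rsplit(".", 1)[0]
        # Strict ">" means the first sequence wins when lengths are equal
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (seq_id, seq)
    return dict(best.values())


longest = pick_longest(records)
for seq_id, seq in longest.items():
    print(seq_id, len(seq))
```

Here the gene aaaaa.01G000100 has two transcripts, and only the longer one (t2) survives. Note that rsplit(".", 1) assumes every sequence name ends in a ".suffix" that marks the transcript; names in other formats would need a different rule.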
