过滤序列比对长度较短的文件

作者: 徐诗芬 | 来源:发表于2022-03-16 16:40 被阅读0次

过滤序列比对长度较短的文件
在linux中构建基因进化树
TBtools基因家族分析详细教程(3)基因家族成员的进化分析1
基础知识复习之比对
用Python和C语言实现DNA双序列简单比对
2020-09-03 Papara 安装
基因组分析 K-mer 第3回估计1000bp的全基因长度
samtools
CHIP-Seq(3):比对及质控
2021-01-17

一个很有意思的小脚本，现在写脚本的思路越来越偏向高效化了。
两个小技巧：1. 这里用python写文件的话运行速度会慢很多，因此用subprocess调用shell脚本会更快。2. 使用了yield，它是一个迭代器，相当于读取一个然后处理一个。不用return，因为return相当于需要全部遍历完文件再进入循环，会慢很多。

#!/usr/bin/python
# -*- coding: utf-8 -*-
# conda activate python3
"""
    作者：徐诗芬
    内容：过滤一系列.mafft.pep.fa文件中序列比对长度低于50的文件，
    先读取判断文件序列长度，再用subprocess进行拷贝文件，因此只适用于Linux系统而不适用于window！
    日期：2022.3.12
"""
import os
import sys
import subprocess

def file_extract(path):
    # 提取基因蛋白序列文件，生成一个文件路径列表
    # file_list = []
    for root, dirs, files in os.walk(path):
        for f in files:
            file = os.path.join(root, f)
            yield file

def fasta_len(inf):
    # 按行读取序列
    # 输入fasta文件，返回名称，序列
    with open(inf) as f:
        global name
        dict = {}
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                name = line
                dict[name] = ''
                i = 0
            else:
                dict[name] += line
                i += len(dict[name])
        return i

def main():
    inpath = sys.argv[1]
    outpath1 = sys.argv[2]   # 文件没有用的，<50个氨基酸的序列
    outpath2 = sys.argv[3]   # 文件有用的，≥50个氨基酸的序列
    subprocess.call(['mkdir', outpath1])
    subprocess.call(['mkdir', outpath2])
    for file in file_extract(inpath):
        if fasta_len(file) < 50:
            subprocess.call(['cp', file, outpath1])
        else:
            subprocess.call(['cp', file, outpath2])

try:
    main()
except IndexError:
    usage()