新手写python脚本提取最长转录本ID以及最长转录本序列（此代码已过时，最新代码见主页）

原始蛋白序列长这个样子：
写脚本的具体思路：
1.创建转录本ID和序列的字典（键值对）
2.提取每个转录本的序列长度，形成三列，第一列是转录本ID，第二列是转录本长度，第三列是gene ID
3.这一步准备对上一步生成的文件按照基因ID和序列长度进行排序
4.创建新的字典（key是gene id，value是排序后的转录本ID），然后print（list_values），第一列就是最长转录本ID
5.linux系统下利用seqtk subseq工具提取最长转录本序列（命令如下：seqtk subseq test_all_pep.fas longest_pep_id.txt > longest_pep.fas）,得到最长转录本序列
注意：在第一步创建字典过程中，要让key变得美观一点，将>后面的header修改为>Ljchlorog3v0000020.1这种样子，即转录本后面的注释全部删掉（从第一个空格往后全部删除），利用正则表达式name=re.sub(" .*$","",nameS)
输入文件为test_all_pep.fas 其中第2步的结果文件名为gene_id_len.txt 第3步的结果文件名为gene_id_len_sorted.txt 第4步的结果文件名为longest_pep_id.txt
输入文件，输出文件，脚本在下方网盘链接

#!/usr/bin/env python
# -*- coding=utf-8 -*-
 
'''
提取序列文件中最长的转录本ID
需要修改######位置的参数 以及 open的目录
'''
 
import sys
import re
 
Fasta=open("D:/python/Code/jupyter-notebook/test_all_pep.fas","r")
Sequence={}
 
## 1.创建ID和序列的字典
for line in Fasta.readlines():
  content=line.strip()
  if content.startswith(">"):
    nameS=content[1:]
    name=re.sub(" .*$","",nameS)       ######修改一下序列的header
    Sequence[name]=''
  else:
    Sequence[name]+=content
Fasta.close()
#print(Sequence.keys())
 
 
## 2.提取每个转录本的序列长度，形成三列
Out=open("D:/python/Code/jupyter-notebook/gene_id_len.txt","w")
for i in Sequence.keys():
    #print(i+"\t"+str(len(Sequence[i]))+"\t"+i.split(".")[0]+"\n")
    Out.write(i+"\t"+str(len(Sequence[i]))+"\t"+i.split(".")[0]+"\n")   #第一列是转录本ID，第二列是转录本长度，第三列是gene ID
Out.close()
 
 
## 3.这一步准备对out文件进行排序，找到每个基因最长的转录本
from operator import itemgetter
gene_id_len=open("D:/python/Code/jupyter-notebook/gene_id_len.txt","r")
table = []
for line in gene_id_len:
    col = line.strip().split("\t")
    col[0] = str(col[0])
    col[1] = int(col[1])
    col[2] = str(col[2])
    table.append(col)                                              #生成嵌套式列表
#print(table)
table_sorted = sorted(table, key=itemgetter(2, 1),reverse=True)                 #先后按列索引2,1排序
#print(table_sorted)
output_file = open("D:/python/Code/jupyter-notebook/gene_id_len_sorted.txt","w")
for row in table_sorted:                                           #遍历读取排序后的嵌套列表
    row = [str(x) for x in row]                                    #列表要转换为字符串才可以写入文本
    output_file.write("\t".join(row) + '\n')
output_file.close()
 
 
## 4.找到最长转录本
input_file=open("D:/python/Code/jupyter-notebook/gene_id_len_sorted.txt","r")
dict2={}
for line in input_file.readlines():
    col = line.strip().split("\t")
    col[0] = str(col[0])
    col[1] = int(col[1])
    col[2] = str(col[2])
    if col[2] not in dict2:
        dict2[col[2]]=col[0]
    else:
        dict2[col[2]]+= '\t'+col[0]
#print(dict2)                     
#print(dict2.values())
list_values=list(dict2.values())
#print(list_values)
result_file = open("D:/python/Code/jupyter-notebook/longest_pep_id.txt","w")
for line in list_values:
    col = line.strip().split("\t")
    col[0] = str(col[0])
    #print(col[0])
    result_file.write(col[0]+"\n")
result_file.close()

链接：https://pan.baidu.com/s/1cCj4DiNVELH3ZAkC7SIgAw
提取码：ybao