Python: Translating DNA to Protein
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/19521905/
Translating DNA to Protein
Asked by Karl Eric Swanson
I am a biology graduate student, and over the past few months I have taught myself a very limited amount of Python to deal with some data I have. I am not asking for homework help; this is for a research project.
With this code I intend to take a portion of a string called sequence: the part between the start site of protein translation, i.e. the first occurrence of ATG (the biological term is start codon), and the first occurrence of TAA (a stop codon) after it.
Then the function translate_dna() should, for every three letters in the string, swap in the corresponding dictionary value. The variable cds is built properly, but the for loop or the if statement in my function is not working :(. Any suggestions? The input file is formatted as follows:
>gnl|GNOMON|230560476.m Model predicted by Gnomon on Homo sapiens unplaced genomic scaffold, alternate assembly HuRef DEGEN_1103279082069, whole genome shotgun sequence (NW_001841731.1)
CCCCAGTAGCTGGGATTACAGGTTATCCAAGGACATGGAAAAGCCAACACCATGGTAGCATTAATGAAAG
TTTACCAAGAGGAAGATGAAGCCTACCAGGAATTAGTTACCATGGCAACCATGTTTTTCCAGTACTTACT
GCAGCCATTTAGGGCTATGCGAGAAGTTGCAACTTTATGTAAGCTTGAT
>gnl|GNOMON|230560472.m Model predicted by Gnomon on Homo sapiens unplaced genomic scaffold, alternate assembly HuRef DEGEN_1103279082069, whole genome shotgun sequence (NW_001841731.1)
GCCGGCGTTTGACCGCGCTTGGGTGGCCTGGGACCCTGTGGGAGGCTTCCCCGGCGCCGAGAGCCCTGGC
TGACGGCTGATGGGGAGGAGCCGGCGGGCGGAGAAGGCCACGGGCTCCCCAGTACCCTCACCTGCGCGGG
ATCGCTGCGGGAAACCAGGGGGAGCTTCGGCAGGGCCTGCAGAGAGGACAAGCGAAGTTAAGAGCCTAGT
GTACTTGCCGCTGGGAGCTGGGCTAGGCCCCCAACCTTTGCCCTGAAGATGCTGGCAGAGCAGGATGTTG
TAACGGGAAATGTCAGAAATACTGCAAGCAAACTGAAAACAACCCATCCATGTAGGAAAGAATAACACGG
ACTACACACTATGAGGAAACCACAGGGGAGTTTCAGGCCAGTCAGCTTTTGATCTTCAACTTTATAACTT
TCACCTTAGGATATGACGAGCCCACCGGAGTTTCAAAAATGGTATCATTTTGTATCAGGCTTGTTTTTTA
CACTCTTGGTTTCTCACAGAGATAGGTGGTTTCTCCTTAAAATCGAACATTTATATGATGCATTTTACTG
TAGTTACTATCAGAAAAGTTAGTTTTCCCAAATTTAAGTTCACTCTGGGGTACTATAGCGTGAATGTAGT
TCATTCTGTTGAGCTAGTTGTTCATGTTAGTGTAGTTCACATATTTATCTGGAACTCAAAAATGAGGGGT
TGAGAGGGGAAGCTAAAATTCAAAACATGTCCAAATATATAATTTTAATATTTTACTTTATATTTAAAAT
AGAAAAGCAATTGATTCTAGAATTAGACTAATTGCTAGCATTGCTAGGATATATAAAATGAAGCTGAATG
TTTTAACTCTGGAATTTTTCTGAATAGTCTAAGAAATAAGGCTGAAGTGTATCACTTGCCTTAAGTTTAC
TTTTGCGTGTGTGTTTTAATTTTGTTCAGTGGGGCTTTCACTTAAAAAAAAAACCATAATATTATTACCT
GGATAAAAAATACAGCTGAAAGTAGATCACTTTATCTTTAAGCAGAAGGATGGAAATAGAAGAATTTTAA
GAATGTATTGGTTGAAAAACATCTATATTATTTTATTTTTATTTCTCTTCTTGTGGGAGTAAAATAATTT
CCAACCAAATCAGTCCACCTAGATTATACACTGTTCAGTTTGTTTTCTGCCCTGCAGCACAAGCAATAAC
CAGCAGAGACTGGAACCACAGCTGAGGCTCTGTAAATGAGTTGACTGCTAAGGACTTCATGGGGATATTA
ACCTGGGGCATTAAGAGAATCAACATGCTAAAGTACTTGGAGACAGCTCTGTAATGTTTTATGAGGTTTT
TTGTTTTTTTTTTTTGAGACAGAGTCTTGCACTGTCGCCCAGGCTGG
Code:
from sys import argv
script, filename = argv

def translate_dna(sequence):

    codontable = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
    proteinsequence = ''
    start = sequence.find('ATG')
    sequencestart = sequence[int(start):]
    stop = sequencestart.find('TAA')
    cds = str(sequencestart[:int(stop)+3])

    for n in range(0,len(cds),3):
        if cds[n:n+3] in codontable == True:
            proteinsequence += codontable[cds[n:n+3]]
            print proteinsequence
        sequence = ''

header = ''
sequence = ''
for line in open(filename):
    if line[0] == ">":
        if header != '':
            print header
            translate_dna(sequence)

        header = line.strip()
        sequence = ''
    else:
        sequence += line.strip()

print header
translate_dna(sequence)
Accepted answer by mdml
Your problem stems from the line
    if cds[n:n+3] in codontable == True:
This always evaluates to False, and thus you never append to proteinsequence. The reason is that Python chains comparison operators, so the condition is read as (cds[n:n+3] in codontable) and (codontable == True), and a dictionary is never equal to True. Just remove the == True portion like so
    if cds[n:n+3] in codontable:
and you will get the protein sequence. Also, make sure to return proteinsequence in translate_dna().
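The chained-comparison behaviour described above can be checked in isolation. This is a minimal sketch, assuming a hypothetical two-entry table in place of the full codon table; the names chained and plain are only for illustration.

```python
# Hypothetical two-entry table, for illustration only (not the full codon table)
codontable = {'ATG': 'M', 'TAA': '_'}
codon = 'ATG'

# Python reads this as: (codon in codontable) and (codontable == True)
chained = codon in codontable == True

# Plain membership test, which is what the loop actually needs
plain = codon in codontable

print(chained)  # False, because a dict is never equal to True
print(plain)    # True
```

The same applies to any pair of comparison operators written side by side, so a bare membership test is usually the safer form.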
Answer by zero323
There is one more problem in your code: when you use stop = sequencestart.find('TAA') you ignore the open reading frame, so the first TAA found may straddle two codons instead of being a codon read in frame from the ATG. In the code below I split the sequence into triplets and use itertools.takewhile to handle that, but it can be done using loops as well:
from itertools import takewhile

def translate_dna(sequence, codontable, stop_codons = ('TAA', 'TGA', 'TAG')):
    start = sequence.find('ATG')

    # Take sequence from the first start codon
    trimmed_sequence = sequence[start:]

    # Split it into triplets
    codons = [trimmed_sequence[i:i+3] for i in range(0, len(trimmed_sequence), 3)]
    print(len(codons))
    print(trimmed_sequence)
    print(codons)

    # Take all codons until first stop codon
    coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3 , codons)

    # Translate and join into string
    protein_sequence = ''.join([codontable[codon] for codon in coding_sequence])

    # This line assumes there is always stop codon in the sequence
    return "{0}_".format(protein_sequence)
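For comparison, the loop-based variant mentioned above might look like the following. This is only a sketch under stated assumptions: translate_dna_loop and the three-entry demo_table are hypothetical names introduced here, and unlike the function above it returns an empty string when no ATG is found and does not append "_" for the stop codon.

```python
def translate_dna_loop(sequence, codontable, stop_codons=('TAA', 'TGA', 'TAG')):
    """Translate from the first ATG, reading one in-frame codon at a time."""
    start = sequence.find('ATG')
    if start == -1:
        return ''  # assumption: no start codon means nothing to translate

    protein = []
    # Step through complete triplets only, staying in frame with the ATG
    for i in range(start, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        if codon in stop_codons:
            break
        # Raises KeyError for codons missing from the table (e.g. ones containing N)
        protein.append(codontable[codon])
    return ''.join(protein)


# Hypothetical subset of the codon table, enough for the example below
demo_table = {'ATG': 'M', 'TTA': 'L', 'ACC': 'T'}
demo_sequence = 'CCATGTTAACCTAG'

# str.find locates a TAA that straddles the codons TTA and ACC (out of frame)
print(demo_sequence.find('TAA'))                      # 6
print(translate_dna_loop(demo_sequence, demo_table))  # MLT
```

In this example the in-frame codons are ATG, TTA, ACC and TAG, so translation stops at TAG and the out-of-frame TAA at index 6 is skipped.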