Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/20580657/

Time: 2020-08-18 20:46:22  Source: igfitidea

How to read a FASTA file in Python?

Tags: python, fasta

Asked by user3098683

I'm trying to read a FASTA file, find a specific motif (string), and print out the sequence and the number of times the motif occurs. A FASTA file is just a series of sequences (strings), each starting with a header line; the marker for a header, and so for the start of a new sequence, is ">". The sequence of letters begins on the new line immediately after the header. I'm not done with the code, but so far I have this, and it gives me this error:

AttributeError: 'str' object has no attribute 'next'

I'm not sure what's wrong here.

import re

header=""
counts=0
newline=""

f1=open('fpprotein_fasta(2).txt','r')
f2=open('motifs.xls','w')
for line in f1:
    if line.startswith('>'):
        header=line
        #print header
        nextline=line.next()
        for i in nextline:
            motif="ML[A-Z][A-Z][IV]R"
            if re.findall(motif,nextline):
                counts+=1
                #print (header+'\t'+counts+'\t'+motif+'\n')
        fout.write(header+'\t'+counts+'\t'+motif+'\n')

f1.close()
f2.close()

Answered by Ray

I am not sure about the FASTA part, but I am pretty sure the mistake is here:

nextline=line.next()

line is simply a str, and a str has no next() method, so you can't call line.next().

Also, when working with files, it is recommended to use:

with open('fpprotein_fasta(2).txt','r') as f1:

This closes the file automatically when the block ends, even if an exception is raised inside it.
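
As a small illustration of that point, here is a minimal sketch showing that the handle is already closed once the with block ends. The temporary file and its contents are made up for the example; they are not the asker's data.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained (hypothetical data).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write(">seq1\nMLAAIR\n")

with open(path, "r") as f1:
    first_line = f1.readline()

# The handle is closed as soon as the with block ends; no f1.close() needed.
file_was_closed = f1.closed
os.remove(path)
```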

If you provide a sample FASTA file, I can try to correct the code.

Answered by iainmcgin

The error is likely coming from the line:

nextline=line.next()

line is the string you have already read; there is no next() method on it.

Part of the problem is that you're trying to mix two different ways of reading the file: you are iterating over the lines using for line in f1 and also calling <handle>.next().
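
If the intent was to fetch the line after each header, one possible sketch is to advance the file handle itself with the built-in next(), rather than calling a method on the string. The in-memory data below is hypothetical and stands in for the real file, and the approach assumes each sequence sits on a single line.

```python
import io

# Hypothetical in-memory FASTA data standing in for the real file.
handle = io.StringIO(">seq1\nMLAAIR\n>seq2\nGGGG\n")

pairs = []
for line in handle:
    if line.startswith(">"):
        header = line.rstrip()
        # Advance the same iterator by one line; the default "" covers
        # a header that happens to be the last line of the input.
        sequence = next(handle, "").rstrip()
        pairs.append((header, sequence))
```

Sequences wrapped over several lines would need a different approach, such as collecting lines until the next header.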

Also, if you are working with FASTA files, I recommend using Biopython: it makes working with collections of sequences much easier. Chapter 14, on motifs, will be of particular interest to you. This will likely require that you learn more about Python in order to achieve what you want, but if you're going to be doing a lot more bioinformatics than your example here shows, then it's definitely worth the investment of time.
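
For a rough idea of what this could look like, here is a minimal sketch that uses Biopython's SeqIO.parse to iterate over records and a plain regular expression to count matches. The helper names count_motif and scan_fasta are made up for this example, the motif is the one from the question, and Biopython must be installed for scan_fasta to run.

```python
import re

MOTIF = re.compile(r"ML[A-Z][A-Z][IV]R")  # motif taken from the question


def count_motif(sequence, pattern=MOTIF):
    """Count non-overlapping motif matches in one sequence string."""
    return len(pattern.findall(sequence))


def scan_fasta(path):
    """Yield (record id, match count) for each record in a FASTA file."""
    # Imported lazily so count_motif stays usable without Biopython installed.
    from Bio import SeqIO

    # SeqIO.parse joins wrapped sequence lines into one record for us.
    for record in SeqIO.parse(path, "fasta"):
        yield record.id, count_motif(str(record.seq))
```

Calling scan_fasta("fpprotein_fasta(2).txt") would then yield one (id, count) pair per record in the asker's file.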

Answered by Arnaud P

This might help point you in the right direction:

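import re


def parse(fasta, outfile):
    motif = "ML[A-Z][A-Z][IV]R"
    header = None
    count = 0
    with open(fasta, 'r') as fin, open(outfile, 'w') as fout:
        for line in fin:
            if line.startswith('>'):
                # A new record begins: write out the previous one first.
                if header is not None:
                    fout.write(header + '\t' + str(count) + '\t' + motif + '\n')
                # Strip the trailing newline so each record stays on one output line.
                header = line.rstrip()
                count = 0
            else:
                # Matches are counted per line, so a motif that spans a
                # line break in a wrapped sequence is not detected.
                matches = re.findall(motif, line)
                count += len(matches)
        # Write out the last record in the file.
        if header is not None:
            fout.write(header + '\t' + str(count) + '\t' + motif + '\n')


if __name__ == '__main__':
    parse("fpprotein_fasta(2).txt", "motifs.xls")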
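
Because the code above scans the file one line at a time, a motif split across two lines of a wrapped sequence would go uncounted. One possible variation, sketched here with a hypothetical read_fasta helper and made-up sample data, joins the lines of each record before searching:

```python
import io
import re


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


motif = re.compile(r"ML[A-Z][A-Z][IV]R")
# Hypothetical sample: in seq1 the motif MLGGVR is split across two lines.
sample = io.StringIO(">seq1\nAAML\nGGVRAA\n>seq2\nMLAAIRMLCCVR\n")
counts = {name: len(motif.findall(seq)) for name, seq in read_fasta(sample)}
```

The same generator accepts a real file handle, for example one opened with open("fpprotein_fasta(2).txt").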