Sequence length of FASTA file
Original question: http://stackoverflow.com/questions/23992646/
License note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by cucurbit
I have the following FASTA file:
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
My desired output:
>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.
This is my code:
awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa
The output I get with this code is:
>header1
60
57
>header2
3
>header3
7
I need a small modification in order to deal with multiple sequence lines.
I also need a way to get the total number of sequences and the total length. Any suggestions are welcome, in bash or awk, please. I know it is easy to do in Perl/BioPerl, and I actually have a script that does it that way.
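As a quick sanity check on the expected totals, the two summary numbers can also be computed with standard tools alone. This is only a sketch, not part of the original question: it assumes Unix line endings, one header line per record, and the sample above recreated in a scratch directory (an assumed location) so that the snippet is self-contained.

```shell
#!/bin/sh
# Recreate the sample from the question in a scratch directory (assumed location).
dir=$(mktemp -d)
cat > "$dir/file.fa" <<'EOF'
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
EOF

# Number of sequences: count the header lines.
nseq=$(grep -c '^>' "$dir/file.fa")

# Total length: drop headers, strip newlines, count the remaining bytes.
# The trailing tr removes the padding some wc implementations emit.
total=$(grep -v '^>' "$dir/file.fa" | tr -d '\n' | wc -c | tr -d ' ')

echo "# $nseq sequences, total length $total."
rm -rf "$dir"
```

On the sample data this prints the summary line from the desired output, which gives a reference value to compare any awk solution against.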
Answered by Juan Diego Godoy Robles
An awk/gawk solution can be composed of three stages:
Every time a header is found, these actions should be performed:
- Print the previous seqlen, if it exists.
- Print the tag.
- Initialize seqlen.
For the sequence lines, we just need to accumulate totals.
Finally, at the END stage, we print the remaining seqlen.
Commented code:
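awk '/^>/ { # header pattern detected
    if (seqlen) {
        # print previous seqlen if it exists
        print seqlen
    }
    # print the tag
    print
    # initialize sequence length
    seqlen = 0
    # skip further processing
    next
}
# accumulate sequence length
{
    seqlen += length($0)
}
# print the remaining seqlen, if any
END { if (seqlen) { print seqlen } }' file.fa
A one-liner:
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
For the totals:
awk '/^>/ { if (seqlen) {
        print seqlen
    }
    print
    seqtotal += seqlen
    seqlen = 0
    seq += 1
    next
}
{
    seqlen += length($0)
}
END { print seqlen
    print seq " sequences, total length " seqtotal+seqlen
}' file.fa
Answered by Nick S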
I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line. It's no longer a one-liner, so the downside is that you'll have to save it as a script file.
It also parses out the sequence id from the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then, you can select a specific sequence based on the id by setting the variable target, like so: $ awk -f seqlen.awk -v target=chrM seq.fa
BEGIN {
  OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
  if (seqlen) {
    # Only print info for this sequence if no target was given
    # or its id matches the target.
    if (! target || id == target) {
      print id, seqlen;
    }
  }
  # Get sequence id:
  # 1. Split header on whitespace (fields[1] is now ">id")
  split($0, fields);
  # 2. Get portion of first field after the starting ">"
  id = substr(fields[1], 2);
  seqlen = 0;
  next;
}
{
  seqlen = seqlen + length($0);
}
END {
  if (! target || id == target) {
    print id, seqlen;
  }
}
Answered by kvantour
A quick way that works with any awk would be this:
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta
You might also be interested in BioAwk, an adapted version of awk that is tuned to process FASTA files:
bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
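Putting the answers above together, a minimal sketch that reproduces the output format requested in the question (per-sequence lengths followed by a "# N sequences, total length M." summary) might look like the following. It assumes plain POSIX awk and Unix line endings; the function name fasta_lengths and the scratch file are illustrative only and do not come from any of the answers. Unlike the versions that test if (seqlen), it flushes a record based on the sequence counter, so a header with no sequence lines is reported with length 0.

```shell
#!/bin/sh
# Hypothetical helper: print each header, its sequence length, and a summary line.
fasta_lengths() {
    awk '
        /^>/ {                       # header line
            if (n) print len         # flush the previous record, even if its length is 0
            print                    # echo the header itself
            n++                      # count sequences
            len = 0                  # reset the per-sequence counter
            next
        }
        {
            len   += length($0)      # accumulate across wrapped sequence lines
            total += length($0)      # running total over all sequences
        }
        END {
            if (n) print len         # flush the last record
            printf "# %d sequences, total length %d.\n", n, total
        }
    ' "$1"
}

# Sample input from the question, written to a scratch file (assumed path).
sample=$(mktemp)
cat > "$sample" <<'EOF'
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
EOF

fasta_lengths "$sample"
rm -f "$sample"
```

Run on the sample, this prints 117, 3 and 7 under their headers, followed by the line "# 3 sequences, total length 127.".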