Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/23992646/

Date: 2020-09-18 10:35:47  Source: igfitidea

Sequence length of FASTA file

Tags: bash, awk, fasta

Asked by cucurbit

I have the following FASTA file:

>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT

My desired output:

>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.

This is my code:

awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa

The output I get with this code is:

>header1
60
57
>header2
3
>header3
7

I need a small modification in order to deal with multiple sequence lines.

I also need a way to get the total number of sequences and the total length. Any suggestion will be welcome... In bash or awk, please. I know it is easy to do in Perl/BioPerl; in fact, I already have a script that does it that way.

Answer by Juan Diego Godoy Robles

An awk/gawk solution can be built in three stages:

  1. Every time a header is found, these actions should be performed:

    • Print the previous seqlen, if it exists.
    • Print the tag.
    • Initialize seqlen.
  2. For the sequence lines, we just need to accumulate totals.
  3. Finally, at the END stage, we print the remnant seqlen.

Commented code:

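awk '/^>/ { # header pattern detected
        if (seqlen) {
            # print previous seqlen if it exists
            print seqlen
        }

        # print the tag
        print

        # initialize sequence length
        seqlen = 0

        # skip further processing
        next
    }

    # accumulate sequence length
    {
        seqlen += length($0)
    }

    # print the remnant seqlen if it exists
    END { if (seqlen) { print seqlen } }' file.fa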

A one-liner:

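awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa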

For the totals:

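awk '/^>/ { if (seqlen) {
              print seqlen
              }
            print

            seqtotal+=seqlen
            seqlen=0
            seq+=1
            next
            }
    {
    seqlen += length($0)
    }
    END{print seqlen
        print seq" sequences, total length " seqtotal+seqlen
    }' file.fa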
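
The totals variant can be adapted to reproduce the summary line exactly as it is written in the question ("# 3 sequences, total length 127."). The following is only a sketch under a few assumptions: the helper name fasta_lengths and the temporary file are illustrative and do not come from any answer, and a record counter (n) is used instead of testing seqlen, so that a header with no sequence lines would still report a length of 0.

```shell
# Sample input from the question, written to a temporary file (illustrative path).
fa=$(mktemp)
cat > "$fa" <<'EOF'
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
EOF

# fasta_lengths is a hypothetical helper name, not part of the original answers.
fasta_lengths() {
  awk '
    /^>/ {
      if (n) print seqlen      # flush the previous record, even if its length is 0
      print                    # echo the header line
      total += seqlen          # add the finished record to the running total
      seqlen = 0
      n++
      next
    }
    { seqlen += length($0) }   # sequence line: accumulate its length
    END {
      if (n) print seqlen      # flush the last record
      printf "# %d sequences, total length %d.\n", n, total + seqlen
    }
  ' "$1"
}

fasta_lengths "$fa"
```

Using printf rather than string concatenation keeps the "# " prefix and the trailing period in one place, which makes the summary format easy to change.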

Answer by Nick S

I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line. It's no longer a one-liner, so the downside is that you'll have to save it as a script file.

It also parses out the sequence id from the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then you can select a specific sequence by its id by setting the variable target, like so: $ awk -f seqlen.awk -v target=chrM seq.fa

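BEGIN {
  OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
  if (seqlen) {
    # Only print info for this sequence if no target was given
    # or its id matches the target.
    if (! target || id == target) {
      print id, seqlen;
    }
  }
  # Get sequence id:
  # 1. Split header on whitespace (fields[1] is now ">id")
  split($0, fields);
  # 2. Get portion of first field after the starting ">"
  id = substr(fields[1], 2);
  seqlen = 0;
  next;
}
{
  seqlen = seqlen + length($0);
}
END {
  if (! target || id == target) {
    print id, seqlen;
  }
}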
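
The id-parsing step described in this answer can be tried on its own. This is a minimal sketch: the helper name fasta_id is hypothetical, and the header is the example given in the answer. It relies on awk's default behaviour of splitting on whitespace when split is called without a separator.

```shell
# Example header line taken from the answer text.
header='>chrM gi|251831106|ref|NC_012920.1|'

# fasta_id is a hypothetical helper name used only for this illustration.
fasta_id() {
  # split() without a separator uses FS, i.e. whitespace by default,
  # so fields[1] is ">chrM"; substr(..., 2) drops the leading ">".
  printf '%s\n' "$1" | awk '{ split($0, fields); print substr(fields[1], 2) }'
}

fasta_id "$header"
```

The printed value is what the script compares against the target variable.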

Answer by kvantour

A quick way with any awk would be this:

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta

You might also be interested in BioAwk, an adapted version of awk that is tuned to process FASTA files:

bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta

Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.