bash: Sequence length of a FASTA file

Question

Asked by cucurbit

I have the following FASTA file:

>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT

My desired output:

>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.

This is my code:

awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa

The output I get with this code is:

>header1
60
57
>header2
3
>header3
7

I need a small modification in order to deal with multiple sequence lines.

I also need a way to get the total number of sequences and the total length. Any suggestion is welcome, in bash or awk, please. I know it is easy to do in Perl/BioPerl; in fact, I already have a script that does it that way.

Answer 1

Answered by Juan Diego Godoy Robles

An awk/gawk solution can be composed of three stages:

1. Every time a header is found, these actions should be performed:
   - Print the previous seqlen, if it exists.
   - Print the tag.
   - Initialize seqlen.
2. For the sequence lines, we just need to accumulate totals.
3. Finally, at the END stage, we print the remnant seqlen.

Commented code:

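awk '/^>/ { # header pattern detected
        if (seqlen){
         # print previous seqlen if exists
         print seqlen
         }

         # print the tag
         print

         # initialize sequence
         seqlen = 0

         # skip further processing
         next
      }

# accumulate sequence length
{
seqlen += length($0)
}
# remnant seqlen if exists
END{if(seqlen){print seqlen}}' file.fa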

A one-liner:

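awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa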

For the totals:

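awk '/^>/ { if (seqlen) {
              print seqlen
              }
            print

            seqtotal+=seqlen
            seqlen=0
            seq+=1
            next
            }
    {
    seqlen += length($0)
    }
    END{print seqlen
        print seq" sequences, total length " seqtotal+seqlen
    }' file.fa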
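
To check the result against the sample file from the question, here is a minimal, self-contained sketch that recreates that file and prints the summary line in the exact format the question asks for. It is a variant rather than the answer's exact code: the record counter n, the variable total, and the temporary file are names chosen here for illustration. Counting records instead of testing seqlen means a header with an empty sequence would still report 0.

```shell
# Sample input copied from the question; the temp file name is arbitrary.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
EOF

# n counts headers seen so far (hypothetical name, not from the answer).
result=$(awk '
/^>/ { if (n) print seqlen      # flush the previous record, if any
       print                    # echo the header line
       n++; seqlen = 0; next }
n    { seqlen += length($0); total += length($0) }   # sequence lines only
END  { if (n) print seqlen
       printf "# %d sequences, total length %d.\n", n, total }
' "$fasta")

printf '%s\n' "$result"
rm -f "$fasta"
```

The last line printed is "# 3 sequences, total length 127.", matching the desired output in the question.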

Answer 2

Answered by Nick S

I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line. It's no longer a one-liner, so the downside is that you'll have to save it as a script file.

It also parses out the sequence id from the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then, you can select a specific sequence based on the id by setting the variable target like so: $ awk -f seqlen.awk -v target=chrM seq.fa.

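BEGIN {
  OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
  if (seqlen) {
    # Only print info for this sequence if no target was given
    # or its id matches the target.
    if (! target || id == target) {
      print id, seqlen;
    }
  }
  # Get sequence id:
  # 1. Split header on whitespace (fields[1] is now ">id")
  split($0, fields);
  # 2. Get portion of first field after the starting ">"
  id = substr(fields[1], 2);
  seqlen = 0;
  next;
}
{
  seqlen = seqlen + length($0);
}
END {
  if (! target || id == target) {
    print id, seqlen;
  }
}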
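
The header-parsing step described above can be tried in isolation. This is only a small sketch: the shell variables header and id are illustrative names, and the header string is the example quoted in this answer.

```shell
# Example header taken from the answer text.
header='>chrM gi|251831106|ref|NC_012920.1|'

# split() with no separator argument splits on whitespace (the default FS),
# so fields[1] is ">chrM"; substr(..., 2) drops the leading ">".
id=$(printf '%s\n' "$header" | awk '{ split($0, fields); print substr(fields[1], 2) }')

printf '%s\n' "$id"    # prints: chrM
```

Because only the first whitespace-delimited word is kept, the value passed with -v target has to match that word exactly, without the ">".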

Answer 3

Answered by kvantour

A quick way with any awk would be this:

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta

You might also be interested in BioAwk, an adapted version of awk that is tuned to process FASTA files:

bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta

Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
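
If BioAwk is not installed, a similar one-record-per-sequence view can be approximated in plain awk by using ">" as the record separator. This is a sketch under the assumption that no header line contains a second ">" character; it is not a description of how BioAwk works internally, and the two short records are made up for the demonstration.

```shell
# Two short made-up records; real input would come from a file instead.
result=$(printf '>a x\nACGT\nAC\n>b\nGG\n' | awk '
BEGIN { RS = ">"; FS = "\n" }       # one record per sequence, one field per line
NR > 1 {                            # record 1 is the empty text before the first ">"
    len = 0
    for (i = 2; i <= NF; i++)       # field 1 is the header, the rest are sequence lines
        len += length($i)
    print ">" $1 ORS len
}')

printf '%s\n' "$result"
```

For this input the sketch prints ">a x", 6, ">b" and 2 on separate lines, the same layout as the other answers.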
