
Note: this page reproduces a popular Stack Overflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/28400470/


Awk & Sort-Output as Comma Delimited?

Tags: bash, shell, awk, comma, delimited

Asked by pooh80133

I am trying to get this to output as comma delimited. The current version doesn't work at all (I get a blank file as an output), and previous versions (where I keep the awk BEGIN statements but don't have the sort delimiter) will just output as tab delimited, not comma delimited. In the previous versions, without attempting to get the comma delimiters, I do get the expected answer (with the complicated filters, etc), so I'm not asking for help with that portion of it. I realize this is a very ugly way to filter and the numbers are also ugly/very large.

The background of the question: find the regions in the file lamina.bed that overlap with the region chr12:5000000-6000000, sort them in descending order by column 4, and output the result as comma delimited. Chromosome is the first column, start position of the region is column 2, end position is column 3, and value is column 4. We are supposed to use awk (in a Unix bash shell). Thank you in advance for your help!

awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000)' /vol1/opt/data/lamina.bed | awk 'BEGIN{FS=","; OFS=","} ($1 == "chr12") ' | sort -t$"," -k4rn > ~/MOLB7621/PS_2/results/2015_02_05/PS2_p3_n1.csv
cat ~/MOLB7621/PS_2/results/2015_02_05/PS2_p3_n1.csv

Sample lines of input (tab delimited, including the lines on chr12 that should match):

#chrom  start   end value
chr1    11323785    11617177    0.86217008797654
chr1    12645605    13926923    0.934891485809683
chr1    14750216    15119039    0.945945945945946
chr12   3306736 5048326 0.913561847988077
chr12   5294045 5393088 0.923076923076923
chr12   5505370 6006665 0.791318864774624
chr12   7214638 7827375 0.8562874251497
chr12   8139885 10173149    0.884353741496599

Accepted answer by John1024

To get comma-separated output, use the following:

$ awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000) {$1=$1;print}' file | awk 'BEGIN{FS=","; OFS=","} ($1 == "chr12") ' | sort -t$"," -k4rn
chr12,5294045,5393088,0.923076923076923
chr12,3306736,5048326,0.913561847988077
chr12,5505370,6006665,0.791318864774624

The only change above is the addition of the action:

{$1=$1;print}

awk will only reformat a line with a new field separator if one or more of the fields on the line have been changed in some way. $1=$1 is sufficient to indicate that field 1 has been changed. Consequently, the new field separators are inserted.

Also, the two calls to awk can be combined into a single call:

awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000) {$1=$1; if($1 == "chr12") print}' file | sort -t$"," -k4rn
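As a hedged aside that is not part of the original answer: if every row has start <= end (an assumption about lamina.bed), the four overlap cases appear to reduce to a single test, because two intervals overlap exactly when each one starts no later than the other ends. The sketch below moves the chr12 check into the pattern and uses that shorter condition. The temporary file, the `sample_bed` and `result` variables, and the `overlap_csv` function name are hypothetical and exist only for illustration.

```shell
# Hypothetical stand-in for lamina.bed: a few tab-delimited rows from the question.
sample_bed="$(mktemp)"
printf '%s\t%s\t%s\t%s\n' \
  '#chrom' start end value \
  chr1  11323785 11617177 0.86217008797654 \
  chr12 3306736  5048326  0.913561847988077 \
  chr12 5294045  5393088  0.923076923076923 \
  chr12 5505370  6006665  0.791318864774624 \
  chr12 7214638  7827375  0.8562874251497 \
  > "$sample_bed"

# Assumes start <= end on every row. Under that assumption a region overlaps
# 5000000-6000000 when start <= 6000000 and end >= 5000000.
# $1=$1 forces awk to rebuild the record with OFS before printing.
overlap_csv() {
  awk 'BEGIN{FS="\t"; OFS=","}
       $1 == "chr12" && $2 <= 6000000 && $3 >= 5000000 {$1=$1; print}' "$1" |
    LC_ALL=C sort -t, -k4,4rn
}

result="$(overlap_csv "$sample_bed")"
printf '%s\n' "$result"
rm -f "$sample_bed"
```

On these sample rows this should print the same three chr12 lines as the answer's output. Writing the sort key as `-k4,4rn` limits the key to column 4 only, and `LC_ALL=C` is there so the decimal point is parsed the same way regardless of locale.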

Simpler Example

In the following, the input is tab-separated and the output field separator, OFS, is set to a comma. In this first example, the awk command print is used:

$ echo $'a\tb\tc' | awk -v OFS=, '{print}'
a       b       c

Despite the setting OFS=, the output retains the tab separator.

Now, we add the simple statement $1=$1 and observe the output:

$ echo $'a\tb\tc' | awk -v OFS=, '{$1=$1;print}'
a,b,c

The output is now comma-separated. Again, that is because awk only reformats a line with the new OFS if it thinks that a field on the line has been changed in some way. The assignment of $1 to itself is sufficient to trigger that reformat.

Note that it is not sufficient to make a change that affects the line as a whole. For example, the following does not trigger a reformat:

$ echo $'a\tb\tc' | awk -v OFS=, '{$0=$0;print}'
a       b       c

It is necessary to change one or more fields of the line individually. In the following, sub operates on $0 as a whole and, consequently, no reformat is triggered:

$ echo $'a\tb\tc' | awk -v OFS=, '{sub($1,"NEW");print}'
NEW     b       c

In the example below, however, sub operates specifically on field $1 and hence triggers a reformat:

$ echo $'a\tb\tc' | awk -v OFS=, '{sub($1,"NEW", $1);print}'
NEW,b,c
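A related sketch, not taken from the original answer: when print is given an explicit comma-separated list of expressions, awk joins them with OFS whether or not the record was modified, so no field assignment is needed. The field list here assumes exactly three columns, and the `out_fields` variable is a hypothetical name used only for illustration.

```shell
# print with a comma-separated argument list joins its arguments with OFS,
# so the record does not have to be rebuilt first.
# Assumes exactly three input columns; printf stands in for echo $'...'.
out_fields="$(printf 'a\tb\tc\n' | awk -v OFS=, '{print $1, $2, $3}')"
printf '%s\n' "$out_fields"
```

The trade-off is that every field must be listed by hand, whereas $1=$1 followed by a bare print works for any number of columns.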