Note: this page is based on a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/22669572/

How to efficiently sum two columns in a file with 270,000+ rows in bash

Tags: bash, unix, solaris, performance

Asked by Emil

I have two columns in a file, and I want to automate summing the two values in each row. For example:

read write
5    6
read write
10   2
read write
23   44

I then want to sum the "read" and "write" values of each row. After summing, I find the maximum sum and put that value in a file. I feel like I have to use grep -v to get rid of the column header before each row, which, as stated in the answers, makes the code inefficient because I'm grepping the entire file just to read one line.

I currently have this in a bash script (within a for loop where $x is the file name) to sum the columns line by line:

lines=`grep -v READ $x|wc -l | awk '{print $1}'`
line_num=1
arr_num=0


while [ $line_num -le $lines ]
do

    arr[$arr_num]=`grep -v READ $x |  sed $line_num'q;d' | awk '{print $1 + $2}'`
    echo $line_num
    line_num=$[$line_num+1]
    arr_num=$[$arr_num+1]

done

However, the file to be summed has 270,000+ rows. The script has been running for a few hours now, and it is nowhere near finished. Is there a more efficient way to write this so that it does not take so long?

Accepted answer by Juan Diego Godoy Robles

Use awk instead and take advantage of the modulus operator:

awk '!(NR%2){print $1+$2}' infile
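
To see what the modulus filter does, here is a small sketch that runs the same one-liner on the sample data from the question. It assumes headers sit on odd-numbered lines and data on even-numbered lines; the here-document and the variable name sums are only stand-ins for a real input file.

```shell
# NR%2 is 0 on even-numbered lines, so !(NR%2) selects only the data rows.
sums=$(awk '!(NR%2){print $1+$2}' <<'EOF'
read write
5    6
read write
10   2
read write
23   44
EOF
)
printf '%s\n' "$sums"
```

Each data row yields one sum, and awk reads the input only once regardless of how many rows it has.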

Answer by Digital Trauma

awk is probably faster, but the idiomatic bash way to do this is something like:

while read -a line; do      # read each line one-by-one, into an array
                            # use arithmetic expansion to add col 1 and 2
    echo "$(( ${line[0]} + ${line[1]} ))"
done < <(grep -v READ input.txt)

Note that the input file is only read once (by grep) and the number of externally forked programs is kept to a minimum (just grep, called only once for the whole input file). The rest of the commands are bash builtins.

The <( ) process substitution is used in case variables set in the while loop are needed outside the scope of the loop. Otherwise a | pipe could be used.
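
The scoping point can be sketched as follows: a pure-bash loop that tracks the maximum sum in a variable and still sees it after the loop ends. This is only an illustration under stated assumptions: the temporary file and its contents are hypothetical stand-ins for the real input, and the header pattern is lowercase to match the sample data.

```shell
#!/usr/bin/env bash
# Hypothetical sample input in the question's header/data layout.
input=$(mktemp)
printf '%s\n' 'read write' '5    6' 'read write' '10   2' 'read write' '23   44' > "$input"

max=0
# Process substitution keeps the loop in the current shell, so max survives it.
while read -r first second; do
    sum=$(( first + second ))
    (( sum > max )) && max=$sum
done < <(grep -v read "$input")

echo "$max"
rm -f "$input"
```

With grep ... | while ... instead, bash in its default configuration runs the loop in a subshell, so max would still be 0 after the loop.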

Answer by glenn jackman

Your question is pretty verbose, yet your goal is not clear. The way I read it, your numbers are on every second line, and you want only to find the maximum sum. Given that:

awk '
    NR%2 == 1 {next}
    NR == 2 {max = $1+$2; next}
    $1+$2 > max {max = $1+$2}
    END {print max}
' filename

Answer by Hallmarc

You could also use a pipeline with tools that implicitly loop over the input like so:

grep -v read INFILE | tr -s ' ' + | bc | sort -rn | head -1 > OUTFILE

This assumes there are spaces between your read and write data values.
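
To make that assumption concrete, the sketch below shows what tr hands to bc for a single data line. The two input strings are hypothetical; the second has a trailing space to show how the expression changes when the spacing is not exactly as assumed.

```shell
# tr -s translates each space to "+" and squeezes repeated "+" into one.
clean=$(printf '%s\n' '23   44' | tr -s ' ' '+')
# A trailing space becomes a dangling "+", which is not a complete expression.
trailing=$(printf '%s\n' '23   44 ' | tr -s ' ' '+')
printf '%s\n' "$clean" "$trailing"
```

The first result is an expression bc can evaluate; the second ends in an operator, so lines with trailing whitespace would need to be cleaned up first.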

Answer by Sammitch

Assuming that it's always one 'header' row followed by one 'data' row:

awk '
  BEGIN{ max = 0 }
  {
    if( NR%2 == 0 ){
      sum = $1 + $2;
      if( sum > max ) { max = sum }
    }
  }
  END{ print max }' input.txt

Or simply trim out all lines that do not conform to what you want:

# awk reads grep's filtered output from the pipe, so it takes no file argument
grep '^[0-9]\+\s\+[0-9]\+$' input.txt | awk '
  BEGIN{ max = 0 }
  {
    sum = $1 + $2;
    if( sum > max ) { max = sum }
  }
  END{ print max }'

Answer by Jonathan Leffler

Why not run:

awk 'NR==1 { print "sum"; next } { print $1 + $2 }'

You can afford to run it on the file while the other script is still running. It'll be complete in a few seconds at most (prediction). When you're confident it's right, you can kill the other process.

You can use Perl or Python instead of awk if you prefer.

Your code is running grep, sed and awk on each line of the input file; that's damnably expensive. And it isn't even writing the data to a file; it is creating an array in Bash's memory that'll need to be printed to the output file later.
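
As a rough sketch of the alternative, a single awk process can write every sum straight to a file and record the maximum in a second file, with no per-line grep or sed and no shell array. The temporary files, the variable names and the /^[0-9]/ header filter are assumptions made for this illustration, not part of the original answer.

```shell
# Hypothetical temp files standing in for the real input and outputs.
input=$(mktemp); sums=$(mktemp); maxfile=$(mktemp)
printf '%s\n' 'read write' '5    6' 'read write' '10   2' 'read write' '23   44' > "$input"

# One awk process reads the whole file: lines not starting with a digit are
# skipped, each sum goes to standard output, and the maximum goes to maxfile.
awk -v maxfile="$maxfile" '
    /^[0-9]/ {
        sum = $1 + $2
        print sum
        if (!seen || sum > max) { max = sum; seen = 1 }
    }
    END { if (seen) print max > maxfile }
' "$input" > "$sums"

all_sums=$(cat "$sums")
max_value=$(cat "$maxfile")
printf 'sums:\n%s\nmax: %s\n' "$all_sums" "$max_value"
rm -f "$input" "$sums" "$maxfile"
```

Because the file is read once and only one external program is started, the run time grows with the number of lines rather than with its square.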