如何在 bash 中有效地将具有 270,000 多行的文件中的两列相加
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/22669572/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
How to efficiently sum two columns in a file with 270,000+ rows in bash
提问by Emil
I have two columns in a file, and I want to automate summing both values per row
我在一个文件中有两列,我想自动对每行的两个值求和
for example
例如
read write
5 6
read write
10 2
read write
23 44
I want to then sum the "read" and "write" of each row. Eventually after summing, I'm finding the max sum and putting that max value in a file. I feel like I have to use grep -v to rid of the column headers per row, which like stated in the answers, makes the code inefficient since I'm grepping the entire file just to read a line.
然后我想对每行的“读取”和“写入”求和。最终求和后，我会找到最大的和并把该最大值写入一个文件。我觉得我必须使用 grep -v 来去掉每行的列标题，而正如答案中所述，这使代码效率低下，因为我为了读取一行而对整个文件进行 grep。
I currently have this in a bash script (within a for loop where $x is the file name) to sum the columns line by line
我目前在 bash 脚本中使用它(在 for 循环中,其中 $x 是文件名)逐行求和
lines=`grep -v READ $x | wc -l | awk '{print $1}'`
line_num=1
arr_num=0
while [ $line_num -le $lines ]
do
    arr[$arr_num]=`grep -v READ $x | sed $line_num'q;d' | awk '{print $1 + $2}'`
    echo $line_num
    line_num=$[$line_num+1]
    arr_num=$[$arr_num+1]
done
However, the file to be summed has 270,000+ rows. The script has been running for a few hours now, and it is nowhere near finished. Is there a more efficient way to write this so that it does not take so long?
但是,要求和的文件有 270,000 多行。脚本已经运行了几个小时,而且还远未完成。有没有更有效的方法来写这个,这样它就不会花这么长时间?
采纳答案by Juan Diego Godoy Robles
回答by Digital Trauma
awk is probably faster, but the idiomatic bash way to do this is something like:
awk 可能更快，但惯用的 bash 方式是这样的：
while read -a line; do # read each line one-by-one, into an array
# use arithmetic expansion to add col 1 and 2
echo "$(( ${line[0]} + ${line[1]} ))"
done < <(grep -v READ input.txt)
Note that the input file is only read once (by grep) and the number of externally forked programs is kept to a minimum (just grep, called only once for the whole input file). The rest of the commands are bash builtins.
请注意，输入文件仅被读取一次（由 grep 读取），并且外部派生程序的数量保持在最低限度（只有 grep，并且只对整个输入文件调用一次）。其余命令都是 bash 内置命令。
The <( ) process substitution is used in case variables set in the while loop are required outside the scope of the while loop. Otherwise a | pipe could be used.
使用 <( ) 进程替换，是为了应对 while 循环中设置的变量需要在循环作用域之外使用的情况。否则可以使用 | 管道。
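As a sketch of why the process substitution matters, the loop can keep a running maximum in a variable that remains visible after the loop ends (the file path and sample values below are made up to mirror the question's data; with a | pipe, max would be lost in the subshell):

```shell
#!/usr/bin/env bash
# Because process substitution feeds the loop without a subshell,
# the max variable set inside the loop survives past "done".
printf 'read write\n5 6\nread write\n10 2\nread write\n23 44\n' > /tmp/io_stats.txt

max=0
while read -r -a line; do
    sum=$(( line[0] + line[1] ))       # add the two columns arithmetically
    (( sum > max )) && max=$sum        # track the running maximum
done < <(grep -v read /tmp/io_stats.txt)

echo "max = $max"                      # max is still in scope here
```

The sums are 11, 12, and 67, so this prints `max = 67`.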
回答by glenn jackman
Your question is pretty verbose, yet your goal is not clear. The way I read it, your numbers are on every second line, and you want only to find the maximum sum. Given that:
你的问题很冗长，但你的目标并不明确。按我的理解，你的数字出现在每隔一行上，而你只想找到最大的总和。鉴于此：
awk '
NR%2 == 1 {next}
NR == 2 {max = $1 + $2; next}
$1 + $2 > max {max = $1 + $2}
END {print max}
' filename
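As a quick check, this even-lines-only approach can be run on the question's sample data (the file name sample.txt is assumed here):

```shell
# Build the sample input from the question and find the maximum row sum.
printf 'read write\n5 6\nread write\n10 2\nread write\n23 44\n' > sample.txt
result=$(awk '
NR%2 == 1 {next}                      # odd lines are "read write" headers
NR == 2 {max = $1 + $2; next}         # first data line seeds the maximum
$1 + $2 > max {max = $1 + $2}         # later data lines may raise it
END {print max}
' sample.txt)
echo "$result"
```

The row sums are 5+6=11, 10+2=12, and 23+44=67, so this prints 67.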
回答by Hallmarc
You could also use a pipeline with tools that implicitly loop over the input like so:
您还可以使用带有工具的管道,这些工具隐式循环输入,如下所示:
grep -v read INFILE | tr -s ' ' + | bc | sort -rn | head -1 > OUTFILE
This assumes there are spaces between your read and write data values.
这假设您的读取和写入数据值之间存在空格。
回答by Sammitch
Assuming that it's always one 'header' row followed by one 'data' row:
假设它总是一个“标题”行,后跟一个“数据”行:
awk '
BEGIN{ max = 0 }
{
if( NR%2 == 0 ){
sum = $1 + $2;
if( sum > max ) { max = sum }
}
}
END{ print max }' input.txt
Or simply trim out all lines that do not conform to what you want:
或者干脆删除所有不符合你想要的行:
grep '^[0-9]\+\s\+[0-9]\+$' input.txt | awk '
BEGIN{ max = 0 }
{
sum = $1 + $2;
if( sum > max ) { max = sum }
}
END{ print max }'
回答by Jonathan Leffler
Why not run:
为什么不运行:
awk 'NR==1 { print "sum"; next } { print $1 + $2 }'
You can afford to run it on the file while the other script it still running. It'll be complete in a few seconds at most (prediction). When you're confident it's right, you can kill the other process.
您可以在文件上运行它,而其他脚本仍在运行。它最多会在几秒钟内完成(预测)。当您确信它是正确的时,您可以终止其他进程。
You can use Perl or Python instead of awk if you prefer.
如果您愿意，可以使用 Perl 或 Python 来代替 awk。
Your code is running grep, sed and awk on each line of the input file; that's damnably expensive. And it isn't even writing the data to a file; it is creating an array in Bash's memory that'll need to be printed to the output file later.
您的代码在输入文件的每一行上都运行 grep、sed 和 awk；那代价高得吓人。而且它甚至没有把数据写入文件；它是在 Bash 的内存中创建一个数组，稍后还需要把它打印到输出文件。