如何在 bash 中有效地将具有 270,000 多行的文件中的两列相加
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/22669572/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
How to efficiently sum two columns in a file with 270,000+ rows in bash
提问by Emil
I have two columns in a file, and I want to automate summing both values per row
我在一个文件中有两列,我想自动对每行的两个值求和
for example
例如
read write
5 6
read write
10 2
read write
23 44
I want to then sum the "read" and "write" of each row. Eventually after summing, I'm finding the max sum and putting that max value in a file. I feel like I have to use grep -v to rid of the column headers per row, which like stated in the answers, makes the code inefficient since I'm grepping the entire file just to read a line.
然后我想对每行的“读取”和“写入”求和。最终求和后，我会找到最大的和并把该最大值写入一个文件。我觉得我必须使用 grep -v 来去掉每行的列标题，而正如答案中所述，这使代码效率低下，因为我为了读取一行而对整个文件进行 grep。
I currently have this in a bash script (within a for loop where $x is the file name) to sum the columns line by line
我目前在 bash 脚本中使用它(在 for 循环中,其中 $x 是文件名)逐行求和
lines=`grep -v READ $x | wc -l | awk '{print $1}'`
line_num=1
arr_num=0
while [ $line_num -le $lines ]
do
    arr[$arr_num]=`grep -v READ $x | sed $line_num'q;d' | awk '{print $1 + $2}'`
    echo $line_num
    line_num=$[$line_num+1]
    arr_num=$[$arr_num+1]
done
However, the file to be summed has 270,000+ rows. The script has been running for a few hours now, and it is nowhere near finished. Is there a more efficient way to write this so that it does not take so long?
但是,要求和的文件有 270,000 多行。脚本已经运行了几个小时,而且还远未完成。有没有更有效的方法来写这个,这样它就不会花这么长时间?
采纳答案by Juan Diego Godoy Robles
回答by Digital Trauma
awk is probably faster, but the idiomatic bash way to do this is something like:
awk 可能更快，但惯用的 bash 方式是这样的：
while read -a line; do # read each line one-by-one, into an array
# use arithmetic expansion to add col 1 and 2
echo "$(( ${line[0]} + ${line[1]} ))"
done < <(grep -v READ input.txt)
Note that the input file is only read once (by grep) and the number of externally forked programs is kept to a minimum (just grep, called only once for the whole input file). The rest of the commands are bash builtins.
请注意，输入文件仅被读取一次（由 grep 读取），并且外部派生程序的数量保持在最低限度（只有 grep，并且只对整个输入文件调用一次）。其余命令都是 bash 内置命令。
The <( ) process substitution is used in case variables set in the while loop are required outside the scope of the while loop. Otherwise a | pipe could be used.
使用 <( ) 进程替换，是为了应对 while 循环中设置的变量需要在循环作用域之外使用的情况。否则可以使用 | 管道。
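As a sketch of why the process substitution matters, the loop can keep a running maximum in a variable that remains visible after the loop ends (the file path and sample values below are made up to mirror the question's data; with a | pipe, max would be lost in the subshell):

```shell
#!/usr/bin/env bash
# Because process substitution feeds the loop without a subshell,
# the max variable set inside the loop survives past "done".
printf 'read write\n5 6\nread write\n10 2\nread write\n23 44\n' > /tmp/io_stats.txt

max=0
while read -r -a line; do
    sum=$(( line[0] + line[1] ))       # add the two columns arithmetically
    (( sum > max )) && max=$sum        # track the running maximum
done < <(grep -v read /tmp/io_stats.txt)

echo "max = $max"                      # max is still in scope here
```

The sums are 11, 12, and 67, so this prints `max = 67`.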
回答by glenn jackman
Your question is pretty verbose, yet your goal is not clear. The way I read it, your numbers are on every second line, and you want only to find the maximum sum. Given that:
你的问题很冗长，但你的目标并不明确。按我的理解，你的数字出现在每隔一行上，而你只想找到最大的总和。鉴于此：
awk '
NR%2 == 1 {next}
NR == 2 {max = $1 + $2; next}
$1 + $2 > max {max = $1 + $2}
END {print max}
' filename
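As a quick check, this even-lines-only approach can be run on the question's sample data (the file name sample.txt is assumed here):

```shell
# Build the sample input from the question and find the maximum row sum.
printf 'read write\n5 6\nread write\n10 2\nread write\n23 44\n' > sample.txt
result=$(awk '
NR%2 == 1 {next}                      # odd lines are "read write" headers
NR == 2 {max = $1 + $2; next}         # first data line seeds the maximum
$1 + $2 > max {max = $1 + $2}         # later data lines may raise it
END {print max}
' sample.txt)
echo "$result"
```

The row sums are 5+6=11, 10+2=12, and 23+44=67, so this prints 67.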
回答by Hallmarc
You could also use a pipeline with tools that implicitly loop over the input like so:
您还可以使用带有工具的管道,这些工具隐式循环输入,如下所示:
grep -v read INFILE | tr -s ' ' + | bc | sort -rn | head -1 > OUTFILE
This assumes there are spaces between your read and write data values.
这假设您的读取和写入数据值之间存在空格。
回答by Sammitch
Assuming that it's always one 'header' row followed by one 'data' row:
假设它总是一个“标题”行,后跟一个“数据”行:
awk '
BEGIN{ max = 0 }
{
if( NR%2 == 0 ){
sum = $1 + $2;
if( sum > max ) { max = sum }
}
}
END{ print max }' input.txt
Or simply trim out all lines that do not conform to what you want:
或者干脆删除所有不符合你想要的行:
grep '^[0-9]\+\s\+[0-9]\+$' input.txt | awk '
BEGIN{ max = 0 }
{
sum = $1 + $2;
if( sum > max ) { max = sum }
}
END{ print max }'
回答by Jonathan Leffler
Why not run:
为什么不运行:
awk 'NR==1 { print "sum"; next } { print $1 + $2 }'
You can afford to run it on the file while the other script it still running. It'll be complete in a few seconds at most (prediction). When you're confident it's right, you can kill the other process.
您可以在文件上运行它,而其他脚本仍在运行。它最多会在几秒钟内完成(预测)。当您确信它是正确的时,您可以终止其他进程。
You can use Perl or Python instead of awk if you prefer.
如果您愿意，可以使用 Perl 或 Python 来代替 awk。
Your code is running grep, sed and awk on each line of the input file; that's damnably expensive. And it isn't even writing the data to a file; it is creating an array in Bash's memory that'll need to be printed to the output file later.
您的代码在输入文件的每一行上都运行 grep、sed 和 awk；那代价高得吓人。而且它甚至没有把数据写入文件；它是在 Bash 的内存中创建一个数组，稍后还需要把它打印到输出文件。