Awk, tail, sed or others - which one is faster for big files?
Original question: http://stackoverflow.com/questions/27057231/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by onur
I have scripts that process big log files. I can go through every line and do something with it using either tail or awk.
Tail:
tail -n +$startline $LOG
Awk:
awk 'NR>='"$startline"' {print}' $LOG
I timed both: tail took 6 minutes 39 seconds and awk took 6 minutes 42 seconds. So the two commands do the same thing in essentially the same time.
I don't know how to do this with sed. Can sed be faster than tail and awk? Or maybe some other command?
Second question: I use $startline so that each run continues from where the previous run stopped. For example:
I use the script like this:
10:00AM -> ./script -> $startline=1 and do something -> write line number to save file(for ex. 25),
10:05AM -> ./script -> $startline=26(read save file +1) and do something -> write line number save file(55),
10:10AM -> ./script -> $startline=56(read save file +1) and do something ....
But every time the script runs, it still reads through all the lines from the beginning and only starts doing something once it reaches $startline. That is a little slow because the files are huge.
Any suggestions to make it faster?
Script example:
lastline=$(tail -1 "line.save")
startline=$(($lastline + 1))
tail -n +$startline $LOG | while read -r
do
....
done
linecount=$(wc -l "$LOG" | awk '{print $1}')
echo $linecount >> line.save
Accepted answer by fedorqui 'SO stop harming'
tail and head are tools created specifically for this purpose, so the intuitive expectation is that they are well optimized for it. On the other hand, awk and sed can do it perfectly well because they are like a Swiss Army knife, but this is not supposed to be their best "skill" among the many others they have.
In "Efficient way to print lines from a massive file using awk, sed, or something else?" there is a nice comparison of methods, and head / tail is seen as the best approach.
Hence, I would go for tail + head.
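As a minimal sketch of the tail + head combination, assuming you want a range of lines from the middle of a file: the sample file and the names start and end below are made up for illustration. tail skips to the first wanted line, and head stops after the range length, which normally ends the pipeline without tail reading the rest of the file.

```shell
# Hypothetical sample data: a 100-line file standing in for a big log.
log=$(mktemp)
seq 1 100 > "$log"

start=10   # first line wanted (illustrative name)
end=20     # last line wanted (illustrative name)

# tail jumps to line $start; head keeps only the range length and then exits,
# closing the pipe so tail usually stops early as well.
range=$(tail -n +"$start" "$log" | head -n "$((end - start + 1))")
printf '%s\n' "$range"

rm -f "$log"
```

For the "everything from $startline to the end" case in the question, the head stage is not needed and tail -n +N alone is enough.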
Note also that if you want not just the last lines but a range of lines within the text, in awk (or in sed) you have the option to exit after the last line you want. This way, you avoid having the script read the file all the way to the last line.
So this:
awk '{if (NR>=10 && NR<20) print} NR==20 {print; exit}'
is faster than
awk 'NR>=10 && NR<=20'
if your input happens to contain more than 20 lines.
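The same early-exit idea can be sketched in sed, assuming the same 10 to 20 range. The sample file is hypothetical; the q command makes sed quit once line 20 has been printed, so the remaining lines are never read.

```shell
# Hypothetical sample data: 1000 numbered lines.
sample=$(mktemp)
seq 1 1000 > "$sample"

# -n suppresses automatic printing; 10,20p prints lines 10 through 20;
# 20q quits at line 20 instead of scanning the other 980 lines.
sed_out=$(sed -n '10,20p;20q' "$sample")

# The awk version with an explicit exit, for comparison.
awk_out=$(awk 'NR>=10 && NR<20 {print} NR==20 {print; exit}' "$sample")

# Both commands should select the same 11 lines.
[ "$sed_out" = "$awk_out" ] && echo "same output"

rm -f "$sample"
```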
Regarding your expression:
awk 'NR>='"$startline"' {print}' $LOG
note that it is more straightforward to write:
awk -v start="$startline" 'NR>=start' $LOG
There is no need to say print because it is implicit.
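A small sketch to check that the two forms behave the same, using a made-up 50-line file in place of the real log and an arbitrary startline value:

```shell
# Hypothetical stand-in for the real log file.
LOG=$(mktemp)
seq 1 50 > "$LOG"
startline=46   # arbitrary value for illustration

# Original form: the shell variable is spliced into the awk program text.
spliced=$(awk 'NR>='"$startline"' {print}' "$LOG")

# -v form: the value is passed as an awk variable; a pattern with no
# action block prints the matching line by default.
passed=$(awk -v start="$startline" 'NR>=start' "$LOG")

printf '%s\n' "$passed"

rm -f "$LOG"
```

Passing the value with -v keeps the awk program in a single quoted string, which tends to be easier to read and avoids quoting mistakes.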