bash: Performance issue with parsing large log files (~5gb) using awk, grep, sed

Disclaimer: This page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/7197500/



Tags: bash, unix, awk, grep, logging

Asked by Albert

I am currently dealing with log files of approximately 5gb in size. I'm quite new to parsing log files and using UNIX bash, so I'll try to be as precise as possible. While searching through log files, I do the following: provide the request number to look for, then optionally provide the action as a secondary filter. A typical command looks like this:


fgrep '2064351200' example.log | fgrep 'action: example'

This is fine for smaller files, but with a log file that is 5gb, it's unbearably slow. I've read online that using sed or awk (or possibly even a combination of both) can improve performance, but I'm not sure how this is accomplished. For example, using awk, I have a typical command:


awk '/2064351200/ {print}' example.log

Basically, my ultimate goal is to efficiently print/return the records (or line numbers) in a log file that contain the strings to match (there could be up to 4-5 of them, and I've read that piping is bad).

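For illustration, a minimal sketch of that goal (the patterns are just the ones from this question): grep -n prints each match with its line number, and a single awk pass can require several patterns at once without any piping.

# Print matching records with their line numbers
grep -n '2064351200' example.log | grep 'action: example'

# Or match both strings in a single awk pass, avoiding the pipe entirely
awk '/2064351200/ && /action: example/ { print FNR ": " $0 }' example.log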

On a side note, in bash shell, if I want to use awk and do some processing, how is that achieved? For example:


BEGIN { print "File\tOwner" }
{ print $8, "\t", \
$3}
END { print " - DONE -" }

That is a pretty simple awk script, and I would assume there's a way to put this into a one-liner bash command? But I'm not sure what the structure is.


Thanks in advance for the help. Cheers.


Answered by Gordon Davisson

You need to perform some tests to find out where your bottlenecks are, and how fast your various tools perform. Try some tests like this:


time fgrep '2064351200' example.log >/dev/null
time egrep '2064351200' example.log >/dev/null
time sed -e '/2064351200/!d' example.log >/dev/null
time awk '/2064351200/ {print}' example.log >/dev/null

Traditionally, egrep should be the fastest of the bunch (yes, faster than fgrep), but some modern implementations are adaptive and automatically switch to the most appropriate searching algorithm. If you have bmgrep (which uses the Boyer-Moore search algorithm), try that. Generally, sed and awk will be slower because they're designed as more general-purpose text manipulation tools rather than being tuned for the specific job of searching. But it really depends on the implementation, and the correct way to find out is to run tests. Run them each several times so you don't get messed up by things like caching and competing processes.

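For example, a quick sketch of repeating one of the timings a few times (substitute any of the commands above):

# Repeat a timing so caching and competing processes average out
for i in 1 2 3; do
    time fgrep '2064351200' example.log >/dev/null
done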

As @Ron pointed out, your search process may be disk I/O bound. If you will be searching the same log file a number of times, it may be faster to compress the log file first; this makes it faster to read off disk, but it then requires more CPU time to process because it has to be decompressed first. Try something like this:


compress -c example2.log >example2.log.Z
time zgrep '2064351200' example2.log.Z >/dev/null
gzip -c example2.log >example2.log.gz
time zgrep '2064351200' example2.log.gz >/dev/null
bzip2 -k example.log
time bzgrep '2064351200' example.log.bz2 >/dev/null

I just ran a quick test with a fairly compressible text file, and found that bzip2 compressed best, but then took far more CPU time to decompress, so the gzip option wound up being fastest overall. Your computer will have different disk and CPU performance than mine, so your results may be different. If you have any other compressors lying around, try them as well, and/or try different levels of gzip compression, etc.

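For instance, a sketch of trying a few gzip levels (6 is gzip's default; higher levels compress more but cost more CPU):

# Compare a few gzip compression levels on search time
for level in 1 6 9; do
    gzip -c -$level example.log > example.log.$level.gz
    time zgrep '2064351200' example.log.$level.gz >/dev/null
done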

Speaking of preprocessing: if you're searching the same log over and over, is there a way to preselect out just the log lines that you might be interested in? If so, grep them out into a smaller (maybe compressed) file, then search that instead of the whole thing. As with compression, you spend some extra time up front, but then each individual search runs faster.

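A sketch of that idea; the 'action:' pattern and the example-actions.log.gz name are just stand-ins for whatever preselection makes sense for your logs:

# Pre-filter once into a smaller, compressed file
grep 'action:' example.log | gzip -c > example-actions.log.gz

# Later searches only have to scan the much smaller file
zgrep '2064351200' example-actions.log.gz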

A note about piping: other things being equal, piping a huge file through multiple commands will be slower than having a single command do all the work. But all things are not equal here, and if using multiple commands in a pipe (which is what zgrep and bzgrep do) buys you better overall performance, go for it. Also, consider whether you're actually passing all of the data through the entire pipe. In the example you gave, fgrep '2064351200' example.log | fgrep 'action: example', the first fgrep will discard most of the file; the pipe and second command only have to process the small fraction of the log that contains '2064351200', so the slowdown will likely be negligible.

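If in doubt, time both forms; a sketch comparing the question's two-stage pipe against a single-pass equivalent:

# Two-stage pipe from the question
time { fgrep '2064351200' example.log | fgrep 'action: example' >/dev/null; }

# Single command doing both matches in one pass
time awk '/2064351200/ && /action: example/' example.log >/dev/null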

tl;dr TEST ALL THE THINGS!


EDIT: if the log file is "live" (i.e. new entries are being added), but the bulk of it is static, you may be able to use a partial preprocessing approach: compress (and maybe prescan) the log, then when scanning use the compressed (and/or prescanned) version plus a tail of the part of the log added since you did the prescan. Something like this:


# Precompress:
gzip -v -c example.log >example.log.gz
compressedsize=$(gzip -l example.log.gz | awk '{if(NR==2) print $2}')

# Search the compressed file + recent additions:
{ gzip -cdfq example.log.gz; tail -c +$compressedsize example.log; } | egrep '2064351200'

If you're going to be doing several related searches (e.g. a particular request, then specific actions with that request), you can save prescanned versions:


# Prescan for a particular request (repeat for each request you'll be working with):
gzip -cdfq example.log.gz | egrep '2064351200' > prescan-2064351200.log

# Search the prescanned file + recent additions:
{ cat prescan-2064351200.log; tail -c +$compressedsize example.log | egrep '2064351200'; } | egrep 'action: example'

Answered by glenn jackman

If you don't know the sequence of your strings, then:


awk '/str1/ && /str2/ && /str3/ && /str4/' filename

If you know that they will appear one following another in the line:


grep 'str1.*str2.*str3.*str4' filename

(Note for awk: {print} is the default action block, so it can be omitted when only the condition is given.)

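For illustration, these two commands are therefore equivalent; the second relies on the default action:

awk '/str1/ && /str2/ { print }' filename
awk '/str1/ && /str2/' filename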

Dealing with files that large is going to be slow no matter how you slice it.


Answered by hemflit

As to multi-line programs on the command line,


$ awk 'BEGIN { print "File\tOwner" }
> { print $8, "\t", \
> $3}
> END { print " - DONE -" }' infile > outfile

Note the single quotes.

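If you'd prefer a true one-liner, the same program can also be written on a single line (a sketch; $8 and $3 are just the placeholder fields from the example above):

awk 'BEGIN { print "File\tOwner" } { print $8, "\t", $3 } END { print " - DONE -" }' infile > outfile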

Answered by tripleee

If you process the same file multiple times, it might be faster to read it into a database, and perhaps even create an index.

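A rough sketch of that idea using sqlite3, assuming it is installed and that the request number lives in a known tab-delimited field (field 3 below is purely hypothetical, so the awk extraction and schema would need adjusting for the real log format):

# Build a two-column file: the request number as a key, then the whole line (tabs flattened)
awk -F'\t' '{ key = $3; gsub(/\t/, " "); print key "\t" $0 }' example.log > keyed.tsv

# Load it into SQLite and index the key column
sqlite3 logs.db <<'EOF'
CREATE TABLE log(request TEXT, line TEXT);
.mode tabs
.import keyed.tsv log
CREATE INDEX idx_request ON log(request);
EOF

# Indexed lookups then avoid rescanning all 5gb of text
sqlite3 logs.db "SELECT line FROM log WHERE request = '2064351200';"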