Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/2114958/
Processing apache logs quickly
Asked by konr
I'm currently running an awk script to process a large (8.1GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14MB of the (1000 +- 500)MB I expect it to write, and I wonder if I can process it much faster somehow.
Here is the awk script:
#!/bin/bash
awk '{t=$4" "$5; gsub("[\[\]\/]"," ",t); sub(":"," ",t);printf("%s,",$1);system("date -d \""t"\" +%s");}'
EDIT:
For non-awkers: the script reads each line, extracts the date information, rewrites it into a format the date utility recognizes, and calls date to express it as the number of seconds since 1970. It then returns the result as one line of a .csv file, along with the IP.
Example input: 189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
Returned output: 189.5.56.113,124237889
Answered by ghostdog74
@OP, your script is slow mainly because it calls the external date command once for every line in the file, and it's a big file as well (gigabytes). If you have gawk, use its built-in mktime() function to do the date-to-epoch-seconds conversion:
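awk 'BEGIN{
    # Map month abbreviations to two-digit numbers: date["Jan"]="01", ...
    m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
    for(o=1;o<=m;o++){
        date[d[o]]=sprintf("%02d",o)
    }
}
{
    # $4 is "[22/Jan/2010:05:54:55"; drop the bracket and make ":" a "/" so one split works
    gsub(/\[/,"",$4); gsub(":","/",$4); gsub(/\]/,"",$5)
    n=split($4, DATE,"/")
    day=DATE[1]
    mth=DATE[2]
    year=DATE[3]
    hr=DATE[4]
    min=DATE[5]
    sec=DATE[6]
    # mktime() takes "YYYY MM DD HH MM SS" and interprets it as local time
    MKTIME= mktime(year" "date[mth]" "day" "hr" "min" "sec)
    print $1,MKTIME
}' file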
Output:
$ more file
189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
$ ./shell.sh
189.5.56.113 1264110895
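For comparison, the same idea (a month lookup table plus in-process arithmetic, with no external date process per line) might look roughly like this in Python. It is a sketch rather than a port of the awk script: line_to_csv is a made-up helper name, and unlike gawk's mktime(), which reads the fields as local time, it applies the logged +0100 offset, so its result differs from the sample output above.

```python
import calendar

# Month abbreviation -> month number, the same lookup idea as the awk BEGIN block.
MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def line_to_csv(line):
    """Turn one access-log line into 'ip,epoch_seconds' (hypothetical helper)."""
    ip = line.split(" ", 1)[0]
    # Text between the brackets, e.g. "22/Jan/2010:05:54:55 +0100".
    stamp = line[line.index("[") + 1:line.index("]")]
    when, offset = stamp.split(" ")
    date_part, hh, mm, ss = when.split(":")
    day, mon, year = date_part.split("/")
    # Treat the fields as UTC first, then remove the logged offset.
    epoch = calendar.timegm((int(year), MONTHS[mon], int(day),
                             int(hh), int(mm), int(ss), 0, 0, 0))
    sign = -1 if offset[0] == "-" else 1
    epoch -= sign * (int(offset[1:3]) * 3600 + int(offset[3:5]) * 60)
    return f"{ip},{epoch}"


print(line_to_csv('189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"'))
# prints 189.5.56.113,1264136095
```

Avoiding strptime here is a deliberate choice: plain splitting and a dictionary lookup do very little work per line, which matters when the input is several gigabytes.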
Answered by Dietrich Epp
If you really, really need it to be faster, you can do what I did. I rewrote an Apache log file analyzer using Ragel. Ragel allows you to mix regular expressions with C code. The regular expressions get transformed into very efficient C code and then compiled. Unfortunately, this requires that you are very comfortable writing code in C. I no longer have this analyzer. It processed 1 GB of Apache access logs in 1 or 2 seconds.
You may have limited success removing unnecessary printfs from your awk statement and replacing them with something simpler.
Answered by Paused until further notice.
If you are using gawk, you can massage your date and time into a format that mktime (a gawk function) understands. It will give you the same timestamp you're using now and save you the overhead of repeated system() calls.
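As a rough illustration of the "massaging" step, the sketch below rewrites the Apache timestamp into the space-separated "YYYY MM DD HH MM SS" layout that gawk's mktime accepts. It is written in Python only to keep the string handling easy to follow; to_datespec is a hypothetical helper name, not something from the answer.

```python
def to_datespec(stamp):
    """'22/Jan/2010:05:54:55' -> '2010 01 22 05 54 55'."""
    months = "JanFebMarAprMayJunJulAugSepOctNovDec"
    date_part, hh, mm, ss = stamp.split(":")
    day, mon, year = date_part.split("/")
    # Each abbreviation is 3 characters long, so its offset gives the month number.
    month_no = months.index(mon) // 3 + 1
    return f"{year} {month_no:02d} {day} {hh} {mm} {ss}"


print(to_datespec("22/Jan/2010:05:54:55"))
# prints 2010 01 22 05 54 55
```

Note that the string carries no time zone, so whatever consumes it has to decide how to interpret the fields; gawk's mktime treats them as local time.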
Answered by Max Shawabkeh
This little Python script handles ~400 MB worth of copies of your example line in about 3 minutes on my machine, producing ~200 MB of output (keep in mind your sample line was quite short, so that's a handicap):
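import time

# Read the access log and write one "ip,epoch" line per request.
with open('x.log', 'r') as src, open('x.csv', 'w') as dest:
    for line in src:
        ip = line[:line.index(' ')]
        # Text between the brackets, minus the trailing " +0100" offset.
        date = line[line.index('[') + 1:line.index(']') - 6]
        # mktime() treats the parsed fields as local time.
        t = time.mktime(time.strptime(date, '%d/%b/%Y:%X'))
        dest.write(ip)
        dest.write(',')
        dest.write(str(int(t)))
        dest.write('\n')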
A minor problem is that it doesn't handle time zones (a strptime() problem), but you could either hardcode the offset or add a little extra code to take care of it.
But to be honest, something as simple as that should be just as easy to rewrite in C.
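The time zone gap mentioned above can be closed in current Python 3 without hardcoding anything, because datetime.strptime accepts a %z directive for offsets such as +0100. The sketch below is one possible way to do it, not part of the original answer; apache_epoch is a hypothetical name, and %b assumes an English (or C) locale for the month abbreviation.

```python
from datetime import datetime


def apache_epoch(stamp):
    """Convert '22/Jan/2010:05:54:55 +0100' to epoch seconds, honoring the offset."""
    # %z consumes the "+0100" part, so the parsed datetime is timezone-aware.
    parsed = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return int(parsed.timestamp())


print(apache_epoch("22/Jan/2010:05:54:55 +0100"))
# prints 1264136095
```

With this variant the slice in the script would keep the offset (drop the "- 6") and pass the whole bracketed text to the helper.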
Answered by Ryan Liu
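gawk '{
    dt=substr($4,2,11);          # "22/Jan/2010" (date only, no time of day)
    gsub(/\//," ",dt);           # -> "22 Jan 2010"
    cmd="date -d \""dt"\" +%s";
    cmd|getline ts;              # read the epoch seconds printed by date
    close(cmd);                  # close the pipe so the same command can run again
    print $1, ts
}' yourfile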

