Fast processing of Apache logs

Question

Asked by konr

I'm currently running an awk script to process a large (8.1 GB) access-log file, and it's taking forever to finish. In 20 minutes it wrote 14 MB of the (1000 ± 500) MB I expect it to write, and I wonder if I can process the file much faster somehow.

Here is the awk script:

#!/bin/bash

awk '{t=$4" "$5; gsub("[\[\]\/]"," ",t); sub(":"," ",t); printf("%s,",$1); system("date -d \""t"\" +%s");}'

EDIT:

For non-awkers: the script reads each line, extracts the date information, rewrites it into a format the date utility recognizes, and calls date to express it as the number of seconds since 1970. The result is written as one line of a .csv file, together with the IP.

Example input: 189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"

Returned output: 189.5.56.113,124237889
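For readers who don't use awk, the string rewriting the script performs before it calls date can be sketched in Python. This only illustrates the gsub/sub steps, assuming the timestamp occupies the fourth and fifth whitespace-separated fields as in the example input; the function name to_date_arg is made up for this sketch.

```python
import re

def to_date_arg(line):
    """Mimic the awk script's rewriting of fields 4 and 5 (hypothetical helper)."""
    fields = line.split()
    t = fields[3] + " " + fields[4]      # "[22/Jan/2010:05:54:55 +0100]"
    t = re.sub(r"[\[\]/]", " ", t)       # brackets and slashes become spaces
    t = t.replace(":", " ", 1)           # only the first colon (date/time separator)
    return fields[0], t

ip, arg = to_date_arg('189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"')
print(ip, repr(arg))
```

The resulting string is what gets handed to date -d, once per log line.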

Answer 1

Answered by ghostdog74

@OP, your script is slow mainly because it runs the external date command once for every line in the file, and the file is big as well (in the GB range). If you have gawk, use its built-in mktime() function to convert the date to epoch seconds:

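awk 'BEGIN{
    # Map month abbreviations to two-digit numbers: date["Jan"]="01", ...
    m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
    for(o=1;o<=m;o++){
        date[d[o]]=sprintf("%02d",o)
    }
}
{
    # $4 is "[22/Jan/2010:05:54:55" and $5 is "+0100]"
    gsub(/\[/,"",$4); gsub(":","/",$4); gsub(/\]/,"",$5)
    n=split($4, DATE,"/")    # day, month, year, hour, minute, second
    day=DATE[1]
    mth=DATE[2]
    year=DATE[3]
    hr=DATE[4]
    min=DATE[5]
    sec=DATE[6]
    # mktime() takes "YYYY MM DD HH MM SS" and interprets it as local time
    MKTIME= mktime(year" "date[mth]" "day" "hr" "min" "sec)
    print $1,MKTIME
}' file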

Output:

$ more file
189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
$ ./shell.sh    
189.5.56.113 1264110895
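One thing to watch with this approach: gawk's mktime() interprets its argument as local time, and the script discards the +0100 offset, so the number printed depends on the timezone of the machine running it. A rough Python sketch of the arithmetic involved, assuming the offset always has the form +HHMM or -HHMM (both helper names are invented for illustration):

```python
import calendar

def offset_seconds(tz):
    """Turn "+0100" or "-0530" into a signed number of seconds."""
    sign = -1 if tz[0] == "-" else 1
    return sign * (int(tz[1:3]) * 3600 + int(tz[3:5]) * 60)

def epoch(year, month, day, hour, minute, second, tz):
    # timegm() treats the fields as UTC, so subtract the offset to get true UTC.
    naive = calendar.timegm((year, month, day, hour, minute, second, 0, 0, 0))
    return naive - offset_seconds(tz)

print(epoch(2010, 1, 22, 5, 54, 55, "+0100"))  # honours the offset in the log line
print(epoch(2010, 1, 22, 5, 54, 55, "+0800"))  # same fields read as UTC+8 local time
```

The second call yields 1264110895, the value in the sample output, which is what one would expect if that machine's local timezone was UTC+8. The first yields 1264136095, the epoch for the offset actually present in the log line.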

Answer 2

Answered by Dietrich Epp

If you really, really need it to be faster, you can do what I did. I rewrote an Apache log file analyzer using Ragel. Ragel allows you to mix regular expressions with C code. The regular expressions get transformed into very efficient C code and then compiled. Unfortunately, this requires that you are very comfortable writing code in C. I no longer have this analyzer, but it processed 1 GB of Apache access logs in 1 or 2 seconds.

You may have limited success removing unnecessary printfs from your awk statement and replacing them with something simpler.

Answer 3

Answered by Paused until further notice.

If you are using gawk, you can massage your date and time into a format that mktime (a gawk function) understands. It will give you the same timestamp you're using now and save you the overhead of repeated system() calls.
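gawk's mktime() expects a string of the form "YYYY MM DD HH MM SS". The reshaping itself is small; here is a sketch in Python, purely to show the target format, using a hand-built month table and a hypothetical function name:

```python
# Month abbreviation -> two-digit month number, e.g. "Jan" -> "01".
MONTHS = {name: "%02d" % (i + 1) for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split())}

def to_datespec(stamp):
    """Reshape "22/Jan/2010:05:54:55" into "2010 01 22 05 54 55"."""
    date_part, hh, mm, ss = stamp.split(":")
    day, mon, year = date_part.split("/")
    return " ".join([year, MONTHS[mon], day, hh, mm, ss])

print(to_datespec("22/Jan/2010:05:54:55"))
```

The same split-and-reorder logic carries over to gawk with split() and an associative array for the month names.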

Answer 4

Answered by Max Shawabkeh

This little Python script handles about 400 MB worth of copies of your example line in about 3 minutes on my machine, producing about 200 MB of output (keep in mind that your sample line is quite short, so that's a handicap):

import time

# Convert each access-log line to "ip,epoch_seconds".
with open('x.log', 'r') as src, open('x.csv', 'w') as dest:
    for line in src:
        ip = line[:line.index(' ')]
        # Text between the brackets, minus the trailing " +0100" offset.
        date = line[line.index('[') + 1:line.index(']') - 6]
        # The offset is dropped, so the time is interpreted as local time.
        t = time.mktime(time.strptime(date, '%d/%b/%Y:%H:%M:%S'))
        dest.write('{},{}\n'.format(ip, int(t)))

A minor problem is that it doesn't handle timezones (a strptime() limitation), but you could either hardcode the offset or add a little extra code to take care of it.
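As one possible form of that extra code: on Python 3, datetime.strptime() accepts a %z directive that parses offsets such as +0100, so the offset does not have to be sliced off. A sketch, assuming Python 3 and a locale with English month abbreviations (the function name is made up):

```python
from datetime import datetime

def log_epoch(line):
    """Return (ip, epoch seconds) for one access-log line, honouring its UTC offset."""
    ip = line[:line.index(" ")]
    stamp = line[line.index("[") + 1:line.index("]")]   # "22/Jan/2010:05:54:55 +0100"
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return ip, int(when.timestamp())

print(log_epoch('189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"'))
```

Because the parsed datetime carries its own offset, the result no longer depends on the timezone of the machine doing the conversion.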

But to be honest, something this simple should be just as easy to rewrite in C.

Answer 5

Answered by Ryan Liu

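gawk '{
    dt=substr($4,2,11);          # "22/Jan/2010" out of "[22/Jan/2010:05:54:55"
    gsub(/\//," ",dt);           # -> "22 Jan 2010"
    cmd="date -d \""dt"\" +%s";
    cmd|getline ts;
    close(cmd);                  # do not leave one pipe open per distinct date
    print $1, ts
}' yourfile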
