Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/11687054/
Split access.log file by dates using command line tools
Asked by mr.b
I have an Apache access.log file, around 35 GB in size. Grepping through it is no longer an option without a long wait.
I want to split it into many small files, using the date as the splitting criterion.
Dates are in the format [15/Oct/2011:12:02:02 +0000]. Any idea how I could do this using only bash scripting, standard text-manipulation programs (grep, awk, sed, and the like), piping, and redirection?
The input file name is access.log. I'd like the output files to have names such as access.apache.15_Oct_2011.log (that would do the trick, although it isn't nice for sorting).
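For reference, here is a minimal sketch (the log line below is made up) showing where the date sits in a combined-format line; several answers below split on the same `[`, `]`, `/` and `:` delimiters:

```shell
# A made-up combined-format log line; the bracketed timestamp is the 4th
# whitespace-separated field, i.e. $4 under awk's default field splitting.
line='1.2.3.4 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 1234'

# Split on "]", "/", ":" and "[" to pull out day, month and year.
echo "$line" | awk -F'[]/:[]' '{ printf "%s %s %s\n", $2, $3, $4 }'
# prints: 15 Oct 2011
```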
Answered by Theodore R. Smith
One way using awk:
awk 'BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = sprintf("%02d", a)
}
{
    split($4, array, "[:/]")
    year = array[3]
    month = m[array[2]]
    print > FILENAME"-"year"_"month".txt"
}' incendiary.ws-2009
This will output files like:
incendiary.ws-2010-2010_04.txt
incendiary.ws-2010-2010_05.txt
incendiary.ws-2010-2010_06.txt
incendiary.ws-2010-2010_07.txt
Against a 150 MB log file, the answer by chepner took 70 seconds on a 3.4 GHz 8-core Xeon E31270, while this method took 5 seconds.
Original inspiration: "How to split existing apache logfile by month?"
Answered by chepner
Pure bash, making one pass through the access log:
while read; do
    [[ $REPLY =~ \[(..)/(...)/(....): ]]
    d=${BASH_REMATCH[1]}
    m=${BASH_REMATCH[2]}
    y=${BASH_REMATCH[3]}
    #printf -v fname "access.apache.%s_%s_%s.log" ${BASH_REMATCH[@]:1:3}
    printf -v fname "access.apache.%s_%s_%s.log" $y $m $d
    echo "$REPLY" >> $fname
done < access.log
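To see what the `=~` match above captures, a quick sketch against a made-up log line (same regex as in the read loop):

```shell
# Made-up log line; the regex is the one used in the read loop above.
line='1.2.3.4 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 1234'
[[ $line =~ \[(..)/(...)/(....): ]]
echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]} ${BASH_REMATCH[3]}"
# prints: 15 Oct 2011
```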
Answered by Thor
Here is an awk version that outputs lexically sortable log files.
Some efficiency enhancements: everything is done in one pass; fname is only regenerated when it differs from the previous one; and fname is closed when switching to a new file (otherwise you might run out of file descriptors).
awk -F"[]/:[]" '
BEGIN {
  m2n["Jan"] =  1;  m2n["Feb"] =  2; m2n["Mar"] =  3; m2n["Apr"] =  4;
  m2n["May"] =  5;  m2n["Jun"] =  6; m2n["Jul"] =  7; m2n["Aug"] =  8;
  m2n["Sep"] =  9;  m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
  if($4 != pyear || $3 != pmonth || $2 != pday) {
    pyear  = $4
    pmonth = $3
    pday   = $2
    if(fname != "")
      close(fname)
    fname  = sprintf("access_%04d_%02d_%02d.log", $4, m2n[$3], $2)
  }
  print > fname
}' access-log
Answered by mr.b
Perl came to the rescue:
cat access.log | perl -n -e 'm@\[(\d{1,2})/(\w{3})/(\d{4}):@; open(LOG, ">>access.apache.$3_$2_$1.log"); print LOG $_;'
Well, it's not exactly a "standard" manipulation program, but it's made for text manipulation nevertheless.
I've also changed the order of the fields in the file name, so that files are named like access.apache.yyyy_mon_dd.log for easier sorting.
Answered by jwadsack
I combined Theodore's and Thor's solutions to get Thor's efficiency improvements and daily files, while retaining the original support for IPv6 addresses in combined-format log files.
awk '
BEGIN {
  m2n["Jan"] =  1;  m2n["Feb"] =  2; m2n["Mar"] =  3; m2n["Apr"] =  4;
  m2n["May"] =  5;  m2n["Jun"] =  6; m2n["Jul"] =  7; m2n["Aug"] =  8;
  m2n["Sep"] =  9;  m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
  split($4, a, "[]/:[]")
  if(a[4] != pyear || a[3] != pmonth || a[2] != pday) {
    pyear  = a[4]
    pmonth = a[3]
    pday   = a[2]
    if(fname != "")
      close(fname)
    fname  = sprintf("access_%04d-%02d-%02d.log", a[4], m2n[a[3]], a[2])
  }
  print >> fname
}'
Answered by ncultra
Kind of ugly, that's bash for you:
for year in 2010 2011 2012; do
    for month in jan feb mar apr may jun jul aug sep oct nov dec; do
        for day in 1 2 3 4 5 6 7 8 9 10 ... 31 ; do
            cat access.log | grep -i $day/$month/$year > $day-$month-$year.log
        done
    done
done
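One caveat with the loop above: an unanchored `grep -i 1/oct/2011` also matches `11/Oct/2011`, `21/Oct/2011` and `31/Oct/2011`, so lines land in several files. A hedged sketch for a single month (the file names here are assumptions) that zero-pads the day and anchors on the opening bracket:

```shell
# Assumes access.log exists; handles October 2011 only, as an illustration.
# seq -w zero-pads the day, and the fixed-string "[DD/Oct/2011:" match
# cannot accidentally hit a different day.
for day in $(seq -w 1 31); do
    grep -F "[$day/Oct/2011:" access.log > "access.apache.2011_Oct_${day}.log"
done
```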
Answered by simon
I made a slight improvement to Theodore's answer so I could see progress when processing a very large log file.
#!/usr/bin/awk -f
BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = a
}
{
    split($4, array, "[:/]")
    year = array[3]
    month = sprintf("%02d", m[array[2]])
    current = year "-" month
    if (last != current)
        print current
    last = current
    print >> FILENAME "-" year "-" month ".txt"
}
Also, I found that I needed to use gawk (brew install gawk if you don't have it) for this to work on Mac OS X.

