Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/11687054/
Split access.log file by dates using command line tools
Asked by mr.b
I have an Apache access.log file, around 35 GB in size. Grepping through it is no longer an option without a long wait.
I want to split it into many small files, using the date as the splitting criterion.
Dates are in the format [15/Oct/2011:12:02:02 +0000]. Any idea how I could do this using only bash scripting, standard text-manipulation programs (grep, awk, sed, and the like), piping, and redirection?
The input file name is access.log. I'd like the output files to have names such as access.apache.15_Oct_2011.log (that would do the trick, although it isn't nice for sorting).
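For reference, here is a minimal sketch (the log line below is made up) showing where the date sits in a combined-format line; several answers below split on the same `[`, `]`, `/` and `:` delimiters:

```shell
# A made-up combined-format log line; the bracketed timestamp is the 4th
# whitespace-separated field, i.e. $4 under awk's default field splitting.
line='1.2.3.4 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 1234'

# Split on "]", "/", ":" and "[" to pull out day, month and year.
echo "$line" | awk -F'[]/:[]' '{ printf "%s %s %s\n", $2, $3, $4 }'
# prints: 15 Oct 2011
```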
Answered by Theodore R. Smith
One way using awk:
awk 'BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = sprintf("%02d", a)
}
{
    split($4, array, "[:/]")
    year = array[3]
    month = m[array[2]]
    print > FILENAME"-"year"_"month".txt"
}' incendiary.ws-2009
This will output files like:
incendiary.ws-2010-2010_04.txt
incendiary.ws-2010-2010_05.txt
incendiary.ws-2010-2010_06.txt
incendiary.ws-2010-2010_07.txt
Against a 150 MB log file, the answer by chepner took 70 seconds on a 3.4 GHz 8-core Xeon E31270, while this method took 5 seconds.
Original inspiration: "How to split existing apache logfile by month?"
Answered by chepner
Pure bash, making one pass through the access log:
while read; do
    [[ $REPLY =~ \[(..)/(...)/(....): ]]
    d=${BASH_REMATCH[1]}
    m=${BASH_REMATCH[2]}
    y=${BASH_REMATCH[3]}
    #printf -v fname "access.apache.%s_%s_%s.log" ${BASH_REMATCH[@]:1:3}
    printf -v fname "access.apache.%s_%s_%s.log" $y $m $d
    echo "$REPLY" >> $fname
done < access.log
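To see what the `=~` match above captures, a quick sketch against a made-up log line (same regex as in the read loop):

```shell
# Made-up log line; the regex is the one used in the read loop above.
line='1.2.3.4 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 1234'
[[ $line =~ \[(..)/(...)/(....): ]]
echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]} ${BASH_REMATCH[3]}"
# prints: 15 Oct 2011
```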
Answered by Thor
Here is an awk version that outputs lexically sortable log files.
Some efficiency enhancements: everything is done in one pass; fname is only regenerated when it differs from the previous one; and fname is closed when switching to a new file (otherwise you might run out of file descriptors).
awk -F"[]/:[]" '
BEGIN {
  m2n["Jan"] =  1;  m2n["Feb"] =  2; m2n["Mar"] =  3; m2n["Apr"] =  4;
  m2n["May"] =  5;  m2n["Jun"] =  6; m2n["Jul"] =  7; m2n["Aug"] =  8;
  m2n["Sep"] =  9;  m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
  if($4 != pyear || $3 != pmonth || $2 != pday) {
    pyear  = $4
    pmonth = $3
    pday   = $2
    if(fname != "")
      close(fname)
    fname  = sprintf("access_%04d_%02d_%02d.log", $4, m2n[$3], $2)
  }
  print > fname
}' access-log
Answered by mr.b
Perl came to the rescue:
cat access.log | perl -n -e 'm@\[(\d{1,2})/(\w{3})/(\d{4}):@; open(LOG, ">>access.apache.$3_$2_$1.log"); print LOG $_;'
Well, it's not exactly a "standard" manipulation program, but it's made for text manipulation nevertheless.
I've also changed the order of the fields in the file name, so that files are named like access.apache.yyyy_mon_dd.log for easier sorting.
Answered by jwadsack
I combined Theodore's and Thor's solutions to get Thor's efficiency improvements and daily files, while retaining the original support for IPv6 addresses in combined-format log files.
awk '
BEGIN {
  m2n["Jan"] =  1;  m2n["Feb"] =  2; m2n["Mar"] =  3; m2n["Apr"] =  4;
  m2n["May"] =  5;  m2n["Jun"] =  6; m2n["Jul"] =  7; m2n["Aug"] =  8;
  m2n["Sep"] =  9;  m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
  split($4, a, "[]/:[]")
  if(a[4] != pyear || a[3] != pmonth || a[2] != pday) {
    pyear  = a[4]
    pmonth = a[3]
    pday   = a[2]
    if(fname != "")
      close(fname)
    fname  = sprintf("access_%04d-%02d-%02d.log", a[4], m2n[a[3]], a[2])
  }
  print >> fname
}'
Answered by ncultra
Kind of ugly, that's bash for you:
for year in 2010 2011 2012; do
    for month in jan feb mar apr may jun jul aug sep oct nov dec; do
        for day in 1 2 3 4 5 6 7 8 9 10 ... 31 ; do
            cat access.log | grep -i $day/$month/$year > $day-$month-$year.log
        done
    done
done
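One caveat with the loop above: an unanchored `grep -i 1/oct/2011` also matches `11/Oct/2011`, `21/Oct/2011` and `31/Oct/2011`, so lines land in several files. A hedged sketch for a single month (the file names here are assumptions) that zero-pads the day and anchors on the opening bracket:

```shell
# Assumes access.log exists; handles October 2011 only, as an illustration.
# seq -w zero-pads the day, and the fixed-string "[DD/Oct/2011:" match
# cannot accidentally hit a different day.
for day in $(seq -w 1 31); do
    grep -F "[$day/Oct/2011:" access.log > "access.apache.2011_Oct_${day}.log"
done
```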
Answered by simon
I made a slight improvement to Theodore's answer so I could see progress when processing a very large log file.
#!/usr/bin/awk -f
BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = a
}
{
    split($4, array, "[:/]")
    year = array[3]
    month = sprintf("%02d", m[array[2]])
    current = year "-" month
    if (last != current)
        print current
    last = current
    print >> FILENAME "-" year "-" month ".txt"
}
Also, I found that I needed to use gawk (brew install gawk if you don't have it) for this to work on Mac OS X.

