How can I use bash (grep/sed/etc) to grab a section of a logfile between 2 timestamps?

Note: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/827930/



Tags: bash, parsing, logfiles, timestamp

Asked by Brent

I have a set of mail logs: mail.log mail.log.0 mail.log.1.gz mail.log.2.gz


each of these files contain chronologically sorted lines that begin with timestamps like:


May 3 13:21:12 ...


How can I easily grab every log entry after a certain date/time and before another date/time using bash (and related command line tools) without comparing every single line? Keep in mind that my before and after dates may not exactly match any entries in the logfiles.


It seems to me that I need to determine the offset of the first line whose timestamp is greater than the starting timestamp, and the offset of the last line whose timestamp is less than the ending timestamp, and cut that section out somehow.


Accepted answer by Brent

Here is one basic idea of how to do it:


  1. Examine the datestamp on the file to see if it is irrelevant
  2. If it could be relevant, unzip if necessary and examine the first and last lines of the file to see if it contains the start or finish time.
  3. If it does, use a recursive function to determine if it contains the start time in the first or second half of the file. Using a recursive function I think you could find any date in a million-line logfile with around 20 comparisons (see the sketch after this list).
  4. echo the logfile(s) in order from the offset of the first entry to the offset of the last entry (no more comparisons)
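For step 3, a rough bash sketch of that bisection is below. It assumes GNU date, an uncompressed file, a fixed example year of 2009, and a hypothetical helper name find_line. One honest caveat: sed still scans from the top of the file to reach line mid, so this mainly saves date parsing, not I/O; a true O(log n) lookup would need to bisect byte offsets instead.

# find_line FILE TARGET_EPOCH: print the number of the first line whose
# timestamp is >= TARGET_EPOCH, using about log2(n) timestamp parses
find_line() {
    local file=$1 target=$2
    local lo=1 hi mid stamp
    hi=$(wc -l < "$file")
    while (( lo < hi )); do
        mid=$(( (lo + hi) / 2 ))
        # pull out just line $mid; sed quits as soon as it prints it
        stamp=$(sed -n "${mid}{p;q}" "$file" | awk '{print $1, $2, $3}')
        if (( $(date --date="$stamp 2009" +%s) < target )); then
            lo=$(( mid + 1 ))   # start time lies in the second half
        else
            hi=$mid             # start time lies in the first half
        fi
    done
    echo "$lo"
}

# usage sketch: print everything from 13:00 to 14:00 on May 3
#   first=$(find_line mail.log "$(date --date='May 3 13:00:00 2009' +%s)")
#   last=$(( $(find_line mail.log "$(( $(date --date='May 3 14:00:00 2009' +%s) + 1 ))") - 1 ))
#   sed -n "${first},${last}p" mail.log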

What I don't know is: how to best read the nth line of a file (how efficient is it to use tail -n +n | head -1?)

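For the record, two common ways to read just the nth line are shown below (n is a hypothetical shell variable); both still read the file from the top, so each lookup costs O(n):

sed -n "${n}{p;q}" mail.log          # print line n, then quit immediately
tail -n "+$n" mail.log | head -n 1   # start output at line n, keep one line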

Any help?


Answered by Dylan

Convert your min/max dates into "seconds since epoch",


MIN=`date --date="$1" +%s`
MAX=`date --date="$2" +%s`

Convert the first n words in each log line to the same,


L_DATE=`echo $LINE | awk '{print $1, $2, ..., $n}'`
L_DATE=`date --date="$L_DATE" +%s`

Compare and throw away lines until you reach MIN,


if (( $MIN > $L_DATE )) ; then continue ; fi

Compare and print lines until you reach MAX,


if (( $L_DATE <= $MAX )) ; then echo $LINE ; fi

Exit when you exceed MAX.


if (( $L_DATE > $MAX )) ; then exit 0 ; fi

The whole script minmaxlog.sh looks like this,


#!/usr/bin/env bash

# start and end dates come in as the first two command-line arguments
MIN=`date --date="$1" +%s`
MAX=`date --date="$2" +%s`

while true ; do
    read LINE
    if [ "$LINE" = "" ] ; then break ; fi

    # the first four fields hold the timestamp, e.g. "May 5 12:23:45 2009"
    L_DATE=`echo $LINE | awk '{print $1 " " $2 " " $3 " " $4}'`
    L_DATE=`date --date="$L_DATE" +%s`

    if (( $MIN > $L_DATE  )) ; then continue ; fi
    if (( $L_DATE <= $MAX )) ; then echo $LINE ; fi
    if (( $L_DATE >  $MAX )) ; then break ; fi
done

I ran it on this file minmaxlog.input,


May 5 12:23:45 2009 first line
May 6 12:23:45 2009 second line
May 7 12:23:45 2009 third line
May 9 12:23:45 2009 fourth line
June 1 12:23:45 2009 fifth line
June 3 12:23:45 2009 sixth line

like this,


./minmaxlog.sh "May 6" "May 8" < minmaxlog.input

Answered by paxdiablo

You have to look at every single line in the range you want (to tell if it's in the range you want), so I'm guessing you mean not every line in the file. At a bare minimum, you will have to look at every line in the file up to and including the first one outside your range (I'm assuming the lines are in date/time order).


This is a fairly simple pattern:


state = preprint
for every line in file:
    if line.date >= startdate:
        state = print
    if line.date > enddate:
        exit for loop
    if state == print:
        print line

You can write this in awk, Perl, Python, even COBOL if you must, but the logic is always the same.

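For instance, a minimal gawk rendering of that pattern might look like the sketch below. It assumes GNU awk (for mktime), the syslog-style "May 3 13:21:12" prefix from the question, and a hard-coded year; the start/end bounds here are made up for the example.

gawk -v year=2009 -v start="May 3 13:00:00" -v end="May 3 14:00:00" '
function epoch(mon, day, hms,    t) {   # ("May", "3", "13:21:12") -> seconds
    split(hms, t, ":")
    return mktime(year " " m[mon] " " day " " t[1] " " t[2] " " t[3])
}
BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
    for (i = 1; i <= 12; i++) m[names[i]] = i
    split(start, s, " "); lo = epoch(s[1], s[2], s[3])
    split(end,   e, " "); hi = epoch(e[1], e[2], e[3])
}
{
    ts = epoch($1, $2, $3)
    if (ts > hi) exit      # past the end of the range: stop reading
    if (ts >= lo) print    # inside the range: print the line
}' mail.log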

Locating the line numbers first (with say grep) and then just blindly printing out that line range won't help, since grep also has to look at all the lines (all of them, not just up to the first outside the range, and most likely twice: once for the first line and once for the last).


If this is something you're going to do quite often, you may want to consider shifting the effort from 'every time you do it' to 'once, when the file is stabilized'. An example would be to load up the log file lines into a database, indexed by the date/time.


That takes a while to get set up but will result in your queries becoming a lot faster. I'm not necessarily advocating a database - you could probably achieve the same effect by splitting the log files into hourly logs thus:


2009/
  01/
    01/
      0000.log
      0100.log
      : :
      2300.log
    02/
    : :
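A hypothetical one-off pass to produce that layout from a flat logfile might look like this (gawk again, with the year hard-coded and a naive mkdir per line, which is slow but fine for a one-time conversion):

gawk -v year=2009 '
BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
    for (i = 1; i <= 12; i++) m[names[i]] = i
}
{
    dir  = sprintf("%04d/%02d/%02d", year, m[$1], $2)
    hour = substr($3, 1, 2)      # "13:21:12" -> "13"
    system("mkdir -p " dir)
    print >> (dir "/" hour "00.log")
}' mail.log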

Then for a given time, you know exactly where to start and stop looking. The range 2009/01/01-15:22 through 2009/01/05-09:07 would result in:


  • some (the last bit) of the file 2009/01/01/1500.log.
  • all of the files 2009/01/01/1[6-9]*.log.
  • all of the files 2009/01/01/2*.log.
  • all of the files 2009/01/0[2-4]/*.log.
  • all of the files 2009/01/05/0[0-8]*.log.
  • some (the first bit) of the file 2009/01/05/0900.log.

Of course, I'd write a script to return those lines rather than trying to do it manually each time.


Answered by uzsolt

Maybe you can try this:


sed -n "/BEGIN_DATE/,/END_DATE/p" logfile
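One caveat: sed matches BEGIN_DATE and END_DATE as literal patterns, so both must actually occur in the file. Against the sample minmaxlog.input from the earlier answer, for example:

# prints the "May 6" and "May 7" lines (the range closes at the first "May 7" match)
sed -n '/^May 6/,/^May 7/p' minmaxlog.input

# no line starts with "May 8", so the range never closes and sed
# prints everything from the first "May 6" line through the end of the file
sed -n '/^May 6/,/^May 8/p' minmaxlog.input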

Answered by simchuck

I know this thread is old, but I just stumbled upon it after recently finding a one line solution for my needs:


awk -v ts_start="2018-11-01" -v ts_end="2018-11-15" -F, '$1>=ts_start && $1<ts_end' myfile

In this case, my file has records with comma-separated values and the timestamp in the first field. You can use any valid timestamp format for the start and end timestamps, and replace these with shell variables if desired.


If you want to write to a new file, just use normal output redirection (> newfile) appended to the end of the command above.

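Putting the pieces together, a sketch with shell variables and the redirection might look like this (the variable and file names are placeholders); note that the plain string comparison works because ISO-style timestamps sort chronologically:

start="2018-11-01"
end="2018-11-15"
awk -v ts_start="$start" -v ts_end="$end" -F, '$1>=ts_start && $1<ts_end' myfile > newfile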

Answered by Joseph Pecoraro

It may be possible in a Bash environment but you should really take advantage of tools that have more built-in support for working with Strings and Dates. For instance Ruby seems to have the built in ability to parse your Date format. It can then convert it to an easily comparable Unix Timestamp (a positive integer representing the seconds since the epoch).


irb> require 'time'
# => true

irb> Time.parse("May 3 13:21:12").to_i
# => 1241371272  

You can then easily write a Ruby script:


  • Provide a start and end date. Convert those to this Unix Timestamp Number.
  • Scan the log files line by line, converting the Date into its Unix Timestamp and check if that is in the range of the start and end dates.

Note: Converting to a Unix Timestamp integer first is nice because comparing integers is very easy and efficient to do.


You mentioned "without comparing every single line." It's going to be hard to "guess" at where in the log file the entries start being too old or too new without checking all the values in between. However, if there is indeed a monotonically increasing trend, then you know immediately when to stop parsing lines, because as soon as the next entry is too new (or old, depending on the layout of the data) you know you can stop searching. Still, there is the problem of finding the first line in your desired range.




I just noticed your edit. Here is what I would say:


If you are really worried about efficiently finding that start and end entry, then you could do a binary search for each. Or, if that seems like overkill or too difficult with bash tools, you could have a heuristic of reading only 5% of the lines (1 in every 20) to quickly get a close-to-exact answer and then refining that if desired. These are just some suggestions for performance improvements.
