bash - How to zgrep the last line of a gz file without tail

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me). Original source: StackOverflow, http://stackoverflow.com/questions/22533060/

How to zgrep the last line of a gz file without tail

Tags: bash, shell, logging, grep

Asked by Rodrigo Gurgel

Here is my problem: I have a set of big gz log files, and the very first piece of information on each line is a datetime text, e.g.: 2014-03-20 05:32:00.

I need to check which set of log files holds specific data. For the initial record, I simply do:

zgrep -m 1 '^20140320-04' 20140320-0{3,4}*gz

BUT HOW do I do the same for the last line, without processing the whole file as zcat would (too heavy):

zcat foo.gz | tail -1

Additional info: those logs are created with the datetime of their initial record, so if I want to query logs at 14:00:00 I also have to search in files created BEFORE 14:00:00, since a file could be created at 13:50:00 and closed at 14:10:00.

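For reference, the brute-force version of this check (with hypothetical file names) decompresses every candidate file completely, which is exactly what needs to be avoided here:

for f in 20140320-1{3,4}*gz; do
    # whole-file decompression just to reach the last record - too heavy
    echo "$f: $(zcat "$f" | tail -1)"
done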

Answered by Adam Katz

The easiest solution would be to alter your log rotation to create smaller files.

The second easiest solution would be to use a compression tool that supports random access.

Projects like dictzip, BGZF, and csio each add sync flush points at various intervals within gzip-compressed data, which let a program that is aware of that extra information seek within the file. While the standard allows such markers, vanilla gzip does not add them, either by default or by option.

Files compressed by these random-access-friendly utilities are slightly larger (by perhaps 2-20%) due to the markers themselves, but fully support decompression with gzip or another utility that is unaware of these markers.

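For example, if htslib's bgzip happens to be installed (it writes BGZF, one of the block-based variants mentioned above), the result stays readable by ordinary gzip tools; the lines below are only a sketch under that assumption:

bgzip -c big.log > big.log.bgz     # assumes htslib's bgzip; output carries a sync point per block
gunzip -c big.log.bgz | head -1    # ordinary gzip tools can still decompress it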

You can learn more at this question about random access in various compression formats.

There's also a "Blasted Bioinformatics" blog by Peter Cock with several posts on this topic.

Experiments with xz

xz (an LZMA compression format) actually has random access support on a per-block level, but you will only get a single block with the defaults.

xzLZMA压缩格式)实际上在每个块级别上具有随机访问支持,但是您将只能获得具有默认值的单个块。

File creation

xz can concatenate multiple archives together, in which case each archive would have its own block. GNU split can do this easily:

split -b 50M --filter 'xz -c' big.log > big.log.sp.xz

This tells split to break big.log into 50MB chunks (before compression) and run each one through xz -c, which outputs the compressed chunk to standard output. We then collect that standard output into a single file named big.log.sp.xz.

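As a quick sanity check (optional, and assuming the file name above), xz --list on the result should report one stream per chunk rather than a single stream:

xz --list big.log.sp.xz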

To do this without GNU split, you'd need a loop:

split -b 50M big.log big.log-part
for p in big.log-part*; do xz -c "$p"; done > big.log.sp.xz
rm big.log-part*

Parsing

You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ). Fetch that block using tail -c and pipe it through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:

SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
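
Wrapped up as a small helper (just a sketch reusing the two commands above; the function name is mine and it assumes the .sp.xz layout created earlier):

last_line_xz() {
    # compressed size of the last block (column 5) plus ~36 bytes of overhead
    local size
    size=$(xz --verbose --list "$1" | awk 'END { print $5 + 36 }')
    tail -c "$size" "$1" | unxz -c | tail -n1
}

last_line_xz big.log.sp.xz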

Side note

Version 5.1.1 introduced support for the --block-size flag:

xz --block-size=50M big.log

However, I have not been able to extract a specific block since it doesn't include full headers between blocks. I suspect this is nontrivial to do from the command line.

Experiments with gzip

gzip also supports concatenation. I (briefly) tried mimicking this process for gzip without any luck. gzip --verbose --list doesn't give enough information and it appears the headers are too variable to find.

This would require adding sync flush points, and since their size varies with the size of the last buffer in the previous compression, that's too hard to do on the command line (use dictzip or another of the previously discussed tools).

I did apt-get install dictzip and played with dictzip, but just a little. It doesn't work without arguments, creating a (massive!) .dz archive that neither dictunzip nor gunzip could understand.

Experiments with bzip2

bzip2 has headers we can find. This is still a bit messy, but it works.

Creation

This is just like the xz procedure above:

split -b 50M --filter 'bzip2 -c' big.log > big.log.sp.bz2

I should note that this is considerably slower than xz (48 min for bzip2 vs 17 min for xz vs 1 min for xz -0) as well as considerably larger (97M for bzip2 vs 25M for xz -0 vs 15M for xz), at least for my test log file.

Parsing

This is a little harder because we don't have the nice index. We have to guess at where to go, and we have to err on the side of scanning too much, but with a massive file, we'd still save I/O.

My guess for this test was 50000000 (out of the original 52428800, a pessimistic guess that isn't pessimistic enough for e.g. an H.264 movie.)

GUESS=50000000
LAST=$(tail -c$GUESS big.log.sp.bz2 \
         |grep -abo 'BZh91AY&SY' |awk -F: 'END { print '$GUESS'-$1 }')
tail -c $LAST big.log.sp.bz2 |bunzip2 -c |tail -n1

This takes just the last 50 million bytes, finds the binary offset of the last BZIP2 header, subtracts that from the guess size, and pulls that many bytes off of the end of the file. Just that part is decompressed and thrown into tail.
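The same guess-and-scan steps can be wrapped into a helper for reuse (a sketch; the function name and the hard-coded 50000000-byte guess are mine, mirroring the commands above):

last_line_bz2() {
    # guess how far back the last bzip2 block header can be, then scan for it
    local guess=50000000 last
    last=$(tail -c"$guess" "$1" | grep -abo 'BZh91AY&SY' \
           | awk -F: -v g="$guess" 'END { print g - $1 }')
    tail -c "$last" "$1" | bunzip2 -c | tail -n1
}

last_line_bz2 big.log.sp.bz2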

Because this has to query the compressed file twice and has an extra scan (the grep call seeking the header, which examines the whole guessed space), this is a suboptimal solution. See also the section below on how slow bzip2 really is.


Perspective

Given how fast xz is, it's easily the best bet; using its fastest option (xz -0) is quite fast to compress or decompress and creates a smaller file than gzip or bzip2 on the log file I was testing with. Other tests (as well as various sources online) suggest that xz -0 is preferable to bzip2 in all scenarios.

            ————— No Random Access ——————     ——————— Random Access ———————
FORMAT       SIZE    RATIO   WRITE   READ      SIZE    RATIO   WRITE   SEEK
—————————   —————————————————————————————     —————————————————————————————
(original)  7211M   1.0000       -   0:06     7211M   1.0000       -   0:00
bzip2         96M   0.0133   48:31   3:15       97M   0.0134   47:39   0:00
gzip          79M   0.0109    0:59   0:22                                  
dictzip                                        605M   0.0839    1:36  (fail)
xz -0         25M   0.0034    1:14   0:12       25M   0.0035    1:08   0:00
xz            14M   0.0019   16:32   0:11       14M   0.0020   16:44   0:00

Timing tests were not comprehensive; I did not average anything, and disk caching was in use. Still, they look correct; there is a very small amount of overhead from split plus launching 145 compression instances rather than just one (this may even be a net gain if it allows an otherwise non-multithreaded utility to consume multiple threads).

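On the multithreading point: newer xz releases (5.2 and later, if that's what you have) can use multiple threads natively, which removes that particular advantage of the split approach; note, though, that the blocks it creates live inside one stream, so the tail-based extraction above still needs the separate split chunks:

xz -T0 big.log     # native multithreaded compression (assumes xz >= 5.2)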

Answered by circulosmeos

Well, you can randomly access a gzipped file if you previously create an index for each file...

I've developed a command line tool which creates indexes for gzip files which allow for very quick random access inside them: https://github.com/circulosmeos/gztool

The tool has two options that may be of interest to you:

  • The -S option supervises a still-growing file and creates an index for it as it grows - this can be useful for gzipped rsyslog files, since in practice it reduces index-creation time to zero.
  • The -t option tails a gzip file: this way you can do: $ gztool -t foo.gz | tail -1 (see the sketch after this list). Please note that if the index doesn't exist yet, this will take as long as a complete decompression; but since the index is reusable, subsequent searches will be much faster!
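A minimal usage sketch based only on the two options described above (exact index-file handling and defaults depend on the gztool version you build):

gztool -S still-growing.gz            # build and keep updating an index while the log grows
gztool -t still-growing.gz | tail -1  # later: jump straight to the end using that index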

This tool is based on the zran.c demonstration code from the original zlib, so there's no out-of-the-rules magic!
