Disclaimer: this content is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37683758/
grep - how to output progress bar or status
Asked by Bob
Sometimes I'm grep-ing thousands of files and it'd be nice to see some kind of progress (bar or status).
I know this is not trivial because grep outputs the search results to STDOUT, and my default workflow is that I output the results to a file and would like the progress bar/status to be output to STDOUT or STDERR.
Would this require modifying the source code of grep?
Ideal command is:
grep -e "STRING" --results="FILE.txt"
and the progress:
[curr file being searched], number x/total number of files
written to STDOUT or STDERR
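Grep has no such --results flag today, but the behaviour can be approximated with a small wrapper script. A minimal sketch, assuming bash 4.4+ for mapfile -d and reusing the question's "STRING"/"FILE.txt" as placeholders:

```shell
#!/usr/bin/env bash
# Sketch: emulate the wished-for `grep --results=FILE.txt` with a wrapper.
pattern="STRING"   # placeholder pattern from the question
out="FILE.txt"     # placeholder results file from the question

# Collect the file list up front so the total is known (bash 4.4+ mapfile -d).
mapfile -d '' -t files < <(find . -type f -print0)
total=${#files[@]}

i=0
: > "$out"
for f in "${files[@]}"; do
    i=$((i + 1))
    printf '%s, %d/%d\n' "$f" "$i" "$total" >&2       # progress on STDERR
    grep -H -e "$pattern" -- "$f" >> "$out" || true   # matches into the file
done
```

This scans each file individually, so it is slower than one big grep invocation, but it produces exactly the "[curr file], x/total" report asked for.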
Answered by rici
This wouldn't necessarily require modifying grep, although you could probably get a more accurate progress bar with such a modification.
If you are grepping "thousands of files" with a single invocation of grep, it is most likely that you are using the -r option to recursively search a directory structure. In that case, it is not even clear that grep knows how many files it will examine, because I believe it starts examining files before it explores the entire directory structure. Exploring the directory structure first would probably increase the total scan time (and, indeed, there is always a cost to producing progress reports, which is why few traditional Unix utilities do this).
In any case, a simple but slightly inaccurate progress bar could be obtained by constructing the complete list of files to be scanned and then feeding them to grep in batches of some size, maybe 100, or maybe based on the total size of the batch. Small batches would allow for more accurate progress reports, but they would also increase overhead since they would require an additional grep process start-up for each batch, and the process start-up time can exceed the time needed to grep a small file. The progress report would be updated for each batch of files, so you would want to choose a batch size that gives you regular updates without increasing overhead too much. Basing the batch size on the total size of the files (using, for example, stat to get the filesize) would make the progress report more exact but adds an additional cost to process start-up.
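As a sketch of that size-based variant (the 10 MB threshold and the "STRING" pattern are arbitrary placeholders; GNU stat -c %s is assumed, while BSD/macOS would need stat -f %z):

```shell
# Sketch: group files into batches of roughly equal total size before grepping.
pattern="STRING"                  # placeholder
threshold=$((10 * 1024 * 1024))   # ~10 MB per batch, arbitrary
batch=()
batch_bytes=0

run_batch() {
    if ((${#batch[@]} == 0)); then return 0; fi
    grep -d skip -e "$pattern" "${batch[@]}" >> results.txt || true
    batch=()
    batch_bytes=0
}

for f in *; do
    if [[ -f $f ]]; then
        size=$(stat -c %s -- "$f")   # GNU stat; BSD/macOS: stat -f %z
        batch+=("$f")
        batch_bytes=$((batch_bytes + size))
        if ((batch_bytes >= threshold)); then
            run_batch
        fi
    fi
done
run_batch   # flush the final, partial batch
```

A progress line could be printed inside run_batch, e.g. bytes processed so far out of the grand total.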
One advantage of this strategy is that you could also run two or more greps in parallel, which might speed the process up a bit.
In broad terms, here is a simple script (which divides the files by count, not by size, and which doesn't attempt to parallelize):
# Requires bash 4 and GNU grep
shopt -s globstar
files=(**)
total=${#files[@]}
for ((i=0; i<total; i+=100)); do
    echo "$i/$total" >&2
    grep -d skip -e "$pattern" "${files[@]:i:100}" >>results.txt
done
For simplicity, I use a globstar (**) to safely put all the files in an array. If your version of bash is too old, then you can do it by looping over the output of find, but that's not very efficient if you have lots of files. Unfortunately, there is no way that I know of to write a globstar expression which only matches files. (**/ only matches directories.) Fortunately, GNU grep provides the -d skip option, which silently skips directories. That means that the file count will be slightly inaccurate, since directories will be counted, but it probably doesn't make much difference.
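If the exact count matters, one option is to filter the directories out of the array first; a small sketch:

```shell
# Sketch: get an exact file count by filtering directories (and other
# non-regular entries) out of the globstar expansion before grepping.
shopt -s globstar
all=(**)
files=()
for f in "${all[@]}"; do
    if [[ -f $f ]]; then files+=("$f"); fi   # keep regular files only
done
printf 'exact total: %d\n' "${#files[@]}"
```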
You probably will want to make the progress report cleaner by using some console codes. The above is just to get you started.
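For instance, a bare carriage return is enough to keep the report on a single line; a toy sketch of the idea:

```shell
# Sketch: one-line progress report using a carriage return (the simplest
# "console code"); \r moves the cursor back to column 1 without a newline.
total=250   # placeholder count
for ((i = 1; i <= total; i++)); do
    printf '\rscanning %d/%d' "$i" "$total" >&2
done
printf '\n' >&2
```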
The simplest way to divide that into different processes would be to just divide the list into X different segments and run X different for loops, each with a different starting point. However, they probably won't all finish at the same time so that is sub-optimal. A better solution is GNU parallel. You might do something like this:
find . -type f -print0 |
    parallel -0 --progress -L 100 -m -j 4 grep -e "$pattern" > results.txt
(Here -L 100 specifies that up to 100 files should be given to each grep instance, -0 tells parallel to read the null-delimited list produced by find -print0, and -j 4 specifies four parallel processes. I just pulled those numbers out of the air; you'll probably want to adjust them.)
Answered by RTLinuxSW
Try the parallel program
find * -name \*.[ch] | parallel -j5 --bar '(grep grep-string {})' > output-file
Though I found this to be slower than a simple
find * -name \*.[ch] | xargs grep grep-string > output-file
Answered by mountrix
This command shows the progress (speed and offset), but not the total amount. That could, however, be estimated manually.
dd if=/input/file bs=1c skip=<offset> | pv | grep -aob "<string>"
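If pv is told the total size up front with -s, it can also render a percentage and an ETA. A sketch using a scratch file (pv must be installed; GNU stat -c %s is assumed, on BSD/macOS it would be stat -f %z):

```shell
# Sketch: feed pv the input size so it can show percent-done and an ETA.
printf 'needle at start\nno match here\nneedle again\n' > demo.txt
size=$(stat -c %s -- demo.txt)   # total bytes pv should expect to read
pv -s "$size" -- demo.txt | grep -aob "needle"
```

With a real file, replace demo.txt and the pattern; the percentage then reflects how far through the file grep has read.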
Answered by cb0
I'm pretty sure you would need to alter the grep source code, and those changes would be huge.
Currently grep does not know how many lines a file has until it has finished parsing the whole file. For your requirement it would need to parse the file twice, or at least determine the full line count some other way.
The first pass would determine the line count for the progress bar. The second pass would actually do the work and search for your pattern.
This would not only increase the runtime but also violate one of the main UNIX philosophies.
- Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features". (source)
There might be other tools out there for your need, but AFAIK grep won't fit here.
Answered by Samuel Kirschner
I normally use something like this:
grep | tee "FILE.txt" | cat -n | sed 's/^/match: /;s/$/ /' | tr '\n' '\r' 1>&2
It is not perfect, as it only displays the matches, and if they are too long or differ too much in length there are display glitches, but it should give you the general idea.
Or simple dots:
grep | tee "FILE.txt" | sed 's/.*//' | tr '\n' '.' 1>&2