原文地址: http://stackoverflow.com/questions/8363718/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Grep to multiple output files
Asked by mefju

I have one huge file (over 6 GB) and about 1000 patterns. I want to extract the lines matching each pattern into a separate file. For example, my patterns are:
1
2
my file:
a|1
b|2
c|3
d|123
As output I would like to have 2 files:
1:
a|1
d|123
2:
b|2
d|123
I can do it by grepping the file multiple times, but that is inefficient for 1000 patterns and a huge file. I also tried something like this:
grep -f pattern_file huge_file
but it makes only one output file. I can't sort my huge file - it takes too much time. Maybe AWK can do it?
Answered by Dimitre Radoulov
awk -F\| 'NR == FNR {
  patt[$0]; next
  }
{
  for (p in patt)
    if ($0 ~ p) print > p
  }' patterns huge_file
With some awk implementations you may hit the max number of open files limit. Let me know if that's the case so I can post an alternative solution.
P.S.: This version will keep only one file open at a time:

awk -F\| 'NR == FNR {
  patt[$0]; next
  }
{
  for (p in patt) {
    if ($0 ~ p) print >> p
    close(p)
    }
  }' patterns huge_file
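The awk approach can be checked against the question's own sample data. A minimal sketch (file names `patterns` and `huge_file` follow the answer above):

```shell
# Recreate the sample data from the question.
printf '%s\n' 'a|1' 'b|2' 'c|3' 'd|123' > huge_file
printf '%s\n' 1 2 > patterns

# Pass 1 (NR == FNR): load the patterns into an array.
# Pass 2: for every line of huge_file, write it to a file named
# after each pattern it matches.
awk -F\| 'NR == FNR { patt[$0]; next }
{ for (p in patt) if ($0 ~ p) print > p }' patterns huge_file

cat 1   # a|1 and d|123
cat 2   # b|2 and d|123
```

Note that `d|123` lands in both output files, matching the expected output in the question.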
Answered by michael
You can accomplish this (if I understand the problem) using bash "process substitution", e.g., consider the following sample data:

$ cal -h
   September 2013
Su Mo Tu We Th Fr Sa
 1  2  3  4  5  6  7
 8  9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30

Then selective lines can be grep'd to different output files in a single command as:

$ cal -h \
  | tee >( egrep '1' > f1.txt ) \
  | tee >( egrep '2' > f2.txt ) \
  | tee >( egrep 'Sept' > f3.txt )

Resulting in:

$ more f?.txt
::::::::::::::
f1.txt
::::::::::::::
   September 2013
 1  2  3  4  5  6  7
 8  9 10 11 12 13 14
15 16 17 18 19 20 21
::::::::::::::
f2.txt
::::::::::::::
   September 2013
 1  2  3  4  5  6  7
 8  9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30
::::::::::::::
f3.txt
::::::::::::::
   September 2013

In this case, each grep is processing the entire data stream (which may or may not be what you want: this may not save a lot of time vs. just running concurrent grep processes).
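For the asker's ~1000 patterns, writing the tee pipeline by hand is impractical, but it can be generated from the pattern file. A sketch under two assumptions not stated in the answer: each pattern is on its own line, and the pattern text is safe to embed in a command and use as a file name:

```shell
# Recreate the question's data.
printf '%s\n' 'a|1' 'b|2' 'c|3' 'd|123' > huge_file
printf '%s\n' 1 2 > patterns

# Emit one "| tee >( grep ... )" stage per pattern, then run the
# generated pipeline with bash (process substitution is a bashism).
{
  printf 'cat huge_file'
  while IFS= read -r p; do
    printf " | tee >( grep -e '%s' > '%s.out' )" "$p" "$p"
  done < patterns
  printf ' > /dev/null\n'
} > fanout.sh

bash fanout.sh
sleep 1   # the >( ... ) writers run asynchronously; give them time to finish
```

For patterns `1` and `2` this generates `cat huge_file | tee >( grep -e '1' > '1.out' ) | tee >( grep -e '2' > '2.out' ) > /dev/null`, so the big file is still read only once.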
Answered by potong

This might work for you (although sed might not be the quickest tool!). First, turn the pattern file into a sed script:

sed 's,.*,/&/w &_file,' pattern_file > sed_file

Then run this file against the source:

sed -nf sed_file huge_file

I did a cursory test, and the GNU sed version 4.1.5 I was using easily opened 1000 files OK; however, your unix system may well have smaller limits.
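Concretely, for the two patterns in the question, the generated script consists of one `/pattern/w file` command per pattern (sed's `w` command appends each matching line to the named file). A sketch:

```shell
# Recreate the question's data.
printf '%s\n' 'a|1' 'b|2' 'c|3' 'd|123' > huge_file
printf '%s\n' 1 2 > pattern_file

# Rewrite each pattern P into the sed command "/P/w P_file".
sed 's,.*,/&/w &_file,' pattern_file > sed_file
cat sed_file
# /1/w 1_file
# /2/w 2_file

# -n suppresses normal output, so only the w commands produce anything;
# the huge file is scanned once, with all pattern files held open.
sed -nf sed_file huge_file
```

As with the awk answer, `d|123` matches both `/1/` and `/2/` and is written to both `1_file` and `2_file`.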
Answered by Llamageddon

Grep cannot output matches of different patterns to different files. Tee is able to redirect its input into multiple destinations, but I don't think this is what you want.
Either use multiple grep commands or write a program to do it in Python or whatever other language you fancy.
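The multiple-grep-commands route is the simplest to write, at the cost of one full scan of the big file per pattern, which is exactly what the asker wanted to avoid at 1000 patterns. A sketch on the question's sample data (output names `out_<pattern>` are illustrative):

```shell
# Recreate the question's data.
printf '%s\n' 'a|1' 'b|2' 'c|3' 'd|123' > huge_file
printf '%s\n' 1 2 > patterns

# One grep, and one full read of huge_file, per pattern.
while IFS= read -r p; do
  grep -e "$p" huge_file > "out_$p"
done < patterns
```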
Answered by Steve Summit

I had this need, so I added the capability to my own copy of grep.c that I happened to have lying around. But it just occurred to me: if the primary goal is to avoid multiple passes over a huge input, you could run egrep once over the huge input to search for any of your patterns (which, I know, is not what you want), redirect its output to an intermediate file, and then make multiple passes over that intermediate file, once per individual pattern, redirecting to a different final output file each time.
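The two-stage idea above can be sketched as follows: one combined pass shrinks the input, then the per-pattern passes run over the (hopefully much smaller) intermediate file. File names `intermediate` and `match_<pattern>` are illustrative:

```shell
# Recreate the question's data.
printf '%s\n' 'a|1' 'b|2' 'c|3' 'd|123' > huge_file
printf '%s\n' 1 2 > patterns

# Pass 1: a single scan of the huge file keeps only lines
# matching ANY of the patterns.
grep -f patterns huge_file > intermediate

# Pass 2: one scan of the intermediate file per pattern.
while IFS= read -r p; do
  grep -e "$p" intermediate > "match_$p"
done < patterns
```

Here `c|3` is dropped in pass 1, so the 1000 per-pattern greps only reread lines that matched something.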

