bash: Faster grep function for big (27GB) files

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.

Original URL: http://stackoverflow.com/questions/14602963/
Asked by fabioln79
I have a 5MB file containing specific strings, and I have to grep those same strings (and other information) out of a big 27GB file. To speed up the analysis, I split the 27GB file into 1GB files and then applied the following script (with the help of some people here). However, it is not very efficient: producing a 180KB output file takes 30 hours!
Here's the script. Is there a more appropriate tool than grep? Or a more efficient way to use grep?
#!/bin/bash
# NB: the awk field numbers below were lost from the original page;
# $1 (the id column) and $2" "$3" "$4 are assumed placeholders.
NR_CPUS=4
count=0
for z in `echo {a..z}` ;
do
    for x in `echo {a..z}` ;
    do
        for y in `echo {a..z}` ;
        do
            for ids in $(cat input.sam|awk '{print $1}');
            do
                grep $ids sample_"$z""$x""$y" | awk '{print $2" "$3" "$4}' >> output.txt &
                let count+=1
                [[ $((count%NR_CPUS)) -eq 0 ]] && wait
            done
        done
    done
done
Answered by dogbane
A few things you can try:
1) You are reading input.sam multiple times. It only needs to be read once, before your first loop starts. Save the ids to a temporary file which will be read by grep.
2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.
3) Use fgrep because you're searching for a fixed string, not a regular expression.
4) Use -f to make grep read patterns from a file, rather than using a loop.
5) Don't write to the output file from multiple processes as you may end up with lines interleaving and a corrupt file.
After making those changes, this is what your script would become:
# NB: the awk field numbers in this answer were lost from the original page;
# $1 (the id column) and $2,$3,$4 are assumed placeholders.
awk '{print $1}' input.sam > idsFile.txt
for z in {a..z}
do
    for x in {a..z}
    do
        for y in {a..z}
        do
            LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $2,$3,$4}'
        done >> output.txt
    done
done
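Going one step further: since fgrep accepts multiple file arguments, the three nested loops could in principle collapse into a single invocation over a glob (a sketch, assuming all the sample_??? files sit in the current directory; -h suppresses the filename prefix that grep prints when given multiple files):

LC_ALL=C fgrep -h -f idsFile.txt sample_??? > output.txt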
Also, check out GNU Parallel, which is designed to help you run jobs in parallel.
Answered by Brian Agnew
My initial thoughts are that you're repeatedly spawning grep. Spawning processes is very expensive (relatively speaking), and I think you'd be better off with some sort of scripted solution (e.g. Perl) that doesn't require the continual process creation.
e.g. for each inner loop you're kicking off cat and awk (you won't need cat, since awk can read files; and in fact, doesn't this cat/awk combination return the same thing each time?) and then grep. Then you wait for 4 greps to finish and you go around again.
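A rough way to see the cost of process creation (a sketch; absolute numbers will vary by machine) is to compare many short-lived greps against one grep doing the same amount of matching:

# ~10,000 fork/execs: most of the time goes to process creation, not matching
time for i in {1..10000}; do echo foo | grep -q foo; done

# the same matching work done by a single grep process
time yes foo | head -n 10000 | grep -c foo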
If you have to use grep, you can use
grep -f filename
to specify a file containing the set of patterns to match, rather than a single pattern on the command line. I suspect from the above that you can pre-generate such a list.
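For instance (a sketch, again assuming the ids live in the first column of input.sam):

awk '{print $1}' input.sam > patterns.txt   # generate the pattern list once
grep -F -f patterns.txt sample_aaa          # -F matches fixed strings, like fgrep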
Answered by peteches
OK, I have a test file containing 4-character strings, i.e. aaaa, aaab, aaac, etc.
ls -lh test.txt
-rw-r--r-- 1 root pete 1.9G Jan 30 11:55 test.txt
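(The answer doesn't show how test.txt was built; a file of this shape could be generated with brace expansion, e.g. the sketch below, which assumes one 4-character string per line and doubles the file up to roughly the same size.)

printf '%s\n' {a..z}{a..z}{a..z}{a..z} > test.txt    # 26^4 = 456,976 lines, ~2.2MB
for i in {1..10}; do cat test.txt test.txt > tmp && mv tmp test.txt; done   # ~2.3GB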
time grep -e aaa -e bbb test.txt
<output>
real 0m19.250s
user 0m8.578s
sys 0m1.254s
time grep --mmap -e aaa -e bbb test.txt
<output>
real 0m18.087s
user 0m8.709s
sys 0m1.198s
So using the mmap option shows a clear improvement on a 2GB file with two search patterns. If you take @BrianAgnew's advice and use a single invocation of grep, try the --mmap option.
Though it should be noted that mmap can be a bit quirky if the source file changes during the search. From man grep:
--mmap
If possible, use the mmap(2) system call to read input, instead of the default read(2) system call. In some situations, --mmap yields better performance. However, --mmap can cause undefined behavior (including core dumps) if an input file shrinks while grep is operating, or if an I/O error occurs.
Answered by Ole Tange
Using GNU Parallel it would look like this:
# NB: as above, $1 for the id column and $2,$3,$4 for the output fields are
# assumed placeholders for values lost from the original page.
awk '{print $1}' input.sam > idsFile.txt
doit() {
    LC_ALL=C fgrep -f idsFile.txt sample_"$1" | awk '{print $2,$3,$4}'
}
export -f doit
parallel doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt
If the order of the lines is not important, this will be a bit faster:
parallel --line-buffer doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt
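Here {1}{2}{3} concatenates one value from each ::: input source into a single argument, which doit receives as $1 (hence sample_"$1" above). To preview the generated commands without running anything, GNU Parallel's --dry-run flag helps; shown on a reduced input set:

parallel --dry-run doit {1}{2}{3} ::: {a..b} ::: {a..b} ::: {a..b}
# doit aaa
# doit aab
# ... (8 combinations in total)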

