Original question: http://stackoverflow.com/questions/16856308/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Segmentation fault in bash script using find, grep, sed
Asked by T.Kaukoranta
I have a script that searches through a very large number of files, and uses sed to substitute a multiple line pattern. The script is iterative, and it works fine on some iterations but sometimes it causes a segmentation fault.
This is what the script is doing:
- Search for files that DON'T contain the string X
- Out of these files, search the ones that CONTAIN the string Y
- Iterate the returned file list with a for-loop
- If the file contents match pattern A, replace pattern A with A_TAG
- The same for patterns B,C,D (a file can contain only one of A,B,C,D)
Patterns A,B,C,D are multiline, and they are replaced with two lines. X and Y are single line.
Here's the script. I apologise for the long lines, but I decided not to edit them since they're regex. I did, however, shorten the regex by replacing strings with "Pattern". The replaced contents are NOT the same in every regex, but they don't contain any special characters, so I don't think the actual contents are relevant to this question. Besides, the regex has been shown to work, so you probably don't need to fully understand it.
#!/bin/sh
STRING_A="Pattern(\n|.)*Pattern\.\""
A_TAG="$STRING:A$"
STRING_B="(Pattern(\n|.)*)?(Pattern(\n|.)*)?Pattern(\n|.)*Pattern(\n|.)*Pattern\.((\n|.)*will be met\: http\:\/\/www.foo\.org\/example\/temp\.html\.\n)?"
B_TAG="$STRING:B$"
STRING_C="(Pattern(\n|.)*)?Pattern(\n|.)*http\:\/\/www\.foo\.org\/bar\/old-foobar\/file\-2\.1\.html\.((\n|.)*Pattern.*Pattern)?"
C_TAG="$STRING:C$"
STRING_D="(Pattern(\n|.)*)?(Pattern(\n|.)*http\:\/\/www\.foo\.org\/bar\/old-foobar\/file\-2\.1\.html.*|Pattern(\n|.)*Pattern)((\n|.)*http\:\/\/www\.some-site\.org/\.)?"
D_TAG="$STRING:D$"
## params: #1 file, #2 PATTERN, #3 TAG
multil_sed()
{
echo "In multil_sed"
# -n = silent, -r = extended regex, -i = inline changes
sed -nr '
# Sed has a hold buffer that we can use to "keep text in memory".
# Here we copy the line to the buffer if it is the first line of the file,
# or append it if it is not
1h
1!H
# We must first save all lines until the nth line to the hold buffer,
# then we can search for our pattern
60 {
# Then we must use the pattern buffer. Pattern buffer holds text that
# is up for modification. With g we can copy the hold buffer into the pattern space
g
# Now we can just use the substitution command as we normally would. Use @ as a delimiter
s@([ \t:#*;/".\-]*)'""'@'""'\
$QT_END_LICENSE$@Ig
# Finally print what we did
p
}
' > .foo;
echo "Done"
}
for p in $(find . -type f -not -iwholename '*.git*' -exec grep -iL '.*STRING_X.*' {} \; | xargs grep -il -E '.*STRING_Y.*')
do
echo
echo "####################"
echo "Working on file" $p
#Find A
if pcregrep -qiM "$STRING_A" "$p";
then
echo "A"
multil_sed "$p" "$STRING_A" "$A_TAG"
#Find B
elif pcregrep -qiM "$STRING_B" "$p";
then
echo "B"
multil_sed "$p" "$STRING_B" "$B_TAG"
#Find C
elif pcregrep -qiM "$STRING_C" "$p";
then
echo "C"
multil_sed "$p" "$STRING_C" "$C_TAG"
#Find D
elif pcregrep -qiM "$STRING_D" "$p";
then
echo "D"
multil_sed "$p" "$STRING_D" "$D_TAG"
else
echo "No match found"
fi
echo "####################"
done
I should probably note that C is essentially a longer version of D, that has some extra contents before the common part.
What happens is that for some iterations this works OK:
####################
Working on file ./src/listing.txt
A
In multil_sed
Done
####################
and sometimes it doesn't.
####################
Working on file ./src/web/page.html
/home/tekaukor/code/project/tag_adder.sh: line 54: 16904 Segmentation fault (core dumped) pcregrep -qiM "$STRING_A" "$p"
No match found
####################
It's not dependent on which pattern is being searched.
####################
Working on file ./src/test/formatter_test.cpp
/home/tekaukor/code/project/tag_adder.sh: line 54: 18051 Segmentation fault (core dumped) pcregrep -qiM "$STRING_B" "$p"
/home/tekaukor/code/project/tag_adder.sh: line 54: 18053 Segmentation fault (core dumped) pcregrep -qiM "$STRING_C" "$p"
/home/tekaukor/code/project/tag_adder.sh: line 54: 18055 Segmentation fault (core dumped) pcregrep -qiM "$STRING_D" "$p"
No match found
####################
Line 54 points to the line "for p in $(find . -type f -not -iwholename '*.git*' -exec grep...".
My guess is that sed is causing a buffer overflow, but I haven't found a way to ascertain or fix this.
Accepted answer by T.Kaukoranta
UPDATE #2: So apparently sed doesn't support non-greedy matching, which makes part of my answer invalid. There are ways around this, but I will not include them here as they are far removed from the original question. The answer to this question is using the --disable-stack-for-recursion flag as described below.
The answer by msw helped me in the right direction.
First I changed the regex to be lazy instead of greedy. By default regex is greedy, which (as msw stated) means that a multiline expression with "PATTERN(.|\n)*TEXT" will search through the whole file. By adding "?" after quantifiers (* -> *?) I made the regex lazy, which means that the "(.|\n)*?" in "PATTERN(.|\n)*?TEXT" will stop expanding at the first TEXT.
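A small sketch of the difference, using a made-up one-line sample rather than the patterns from the question. sed only offers greedy quantifiers, so it is used for the greedy case. The lazy case uses `grep -P`, which is assumed to be available only when GNU grep was built with PCRE support, so that part is guarded.

```shell
#!/bin/sh
# Hypothetical sample text; PATTERN and TEXT stand in for the real strings.
sample='PATTERN one TEXT two TEXT three'

# Greedy: .* runs to the LAST occurrence of TEXT on the line.
greedy=$(printf '%s\n' "$sample" | sed -E 's/PATTERN.*TEXT/[&]/')
printf 'greedy: %s\n' "$greedy"

# Lazy: .*? stops at the FIRST occurrence of TEXT (needs a PCRE-capable grep).
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
    lazy=$(printf '%s\n' "$sample" | grep -oP 'PATTERN.*?TEXT')
    printf 'lazy:   %s\n' "$lazy"
else
    lazy=''
    echo 'grep -P not available; skipping the lazy example'
fi
```

The greedy substitution brackets everything up to the second TEXT, while the lazy match ends at the first one.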
I also made the optional parts lazy (? -> ??), though I'm not sure if this was necessary.
However this was not enough. I also had to configure pcregrep to use heap instead of stack memory. I downloaded pcre and configured using the flag --disable-stack-for-recursion. Note that using heap is much slower, so you shouldn't do this if you don't have to.
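For readers who cannot rebuild PCRE, two things may be worth trying first. Both are assumptions about your environment, not part of the fix described here. The PCRE documentation on stack usage discusses raising the process stack limit with `ulimit -s`, and pcregrep in the 8.x series has a `--recursion-limit` option that should make a runaway match fail with an error instead of crashing. The helper name `with_stack_kb` and the numbers below are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command in a subshell with a different soft
# stack limit (in KB), so the calling shell's own limit is left unchanged.
# Note: ulimit -s is a common shell extension, not required by POSIX.
with_stack_kb() {
    kb=$1
    shift
    (
        ulimit -S -s "$kb" 2>/dev/null ||
            echo "could not set stack limit to $kb KB" >&2
        "$@"
    )
}

# Show the effect: the child shell reports the limit it was started with.
with_stack_kb 4096 sh -c 'ulimit -s'

# Possible use in the script from the question (values are arbitrary
# illustrations, not recommendations):
# with_stack_kb 65536 pcregrep -qiM --recursion-limit=10000 "$STRING_A" "$p"
```

Whether a larger stack is enough depends on the file size and the pattern, so this is a stopgap rather than a replacement for the heap-based build.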
I'm including a step-by-step guide in case anyone wanders here with the same problem. Note that I'm still a Linux newbie and there's a high chance that I did something unnecessary and/or stupid. The instructions are based on http://www.mail-archive.com/[email protected]/msg00817.html and http://www.linuxfromscratch.org/blfs/view/svn/general/pcre.html
- Download pcre from http://downloads.sourceforge.net/pcre/pcre-8.33.tar.bz2
- tar jxf pcre-8.33.tar.bz2
- cd pcre-8.33
- ./configure --prefix=/usr --docdir=/usr/share/doc/pcre-8.33 --enable-utf --enable-unicode-properties --enable-pcregrep-libz2 --disable-static --disable-stack-for-recursion
- make
- sudo make install
There are some additional steps in the provided guide, but I didn't have to do them.
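The steps above can also be wrapped in a small script. This is only a sketch under stated assumptions: it installs into a private directory (the `PREFIX` default is hypothetical) so that the system-wide PCRE is left untouched, it drops the optional flags that are not needed for the heap change, and it expects the tarball to be in the current directory already. It runs in dry-run mode unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical wrapper around the steps above. PREFIX is an assumption: a
# private directory is used so the system-wide PCRE stays untouched.
PREFIX=${PREFIX:-$HOME/pcre-heap}
VERSION=8.33

# Print each command; execute it only when DRY_RUN is set to something
# other than 1 (dry run is the default).
run() {
    printf '+ %s\n' "$*"
    [ "${DRY_RUN:-1}" = 1 ] || "$@"
}

build_pcre_heap() {
    run tar jxf "pcre-$VERSION.tar.bz2" &&
    run cd "pcre-$VERSION" &&
    run ./configure --prefix="$PREFIX" --enable-utf \
        --enable-unicode-properties --disable-stack-for-recursion &&
    run make &&
    run make install
}

# Dry run by default: only lists the commands that would be executed.
build_pcre_heap
```

After a real run, the rebuilt binary would be expected at `$PREFIX/bin/pcregrep`, and the script from the question would have to call it by that path.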
UPDATE: Making the optional elements lazy (? -> ??) is a mistake, as then they will not be included in the matched pattern if possible.
Answer by msw
Bash isn't great at locating the source of a fault in a compound statement, so
Line 54 points to the line
for p in $(find . -type f ....
is misleading, as the error could be anywhere in that for statement block. The error message
Segmentation fault (core dumped) pcregrep -qiM "$STRING_D" "$p"
is much more accurate. The likely cause of the fault is the -M flag combined with unbounded patterns like (.|\n)*. As the pcregrep man page notes:
-M, --multiline: Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for any one match may consist of more than one line. When this option is set, the PCRE library is called in "multiline" mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions.
with emphasis mine. The single pattern fragment .* or (.|\n)* can literally match an entire file, so yes, it will fill up its lookahead buffer not just to the next literal (e.g. http) but until it finds the last such literal, because by default regular expressions seek the longest conforming match.
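On the point about locating the fault: in the script from the question, a crashed pcregrep is handled by the if/elif chain exactly like a non-match, which is why the log ends with "No match found". A minimal sketch of telling the cases apart by exit status follows. The helper name `classify_status` is hypothetical. It relies on the usual conventions that 0 means a match, 1 means no match, and that a shell reports 128 plus the signal number for a process killed by a signal (SIGSEGV is 11 on Linux).

```shell
#!/bin/sh
# Hypothetical helper: turn an exit status into a label so that a crash is
# not silently treated as "no match" by an if/elif chain.
classify_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo 'match'
    elif [ "$status" -eq 1 ]; then
        echo 'no match'
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "error (exit status $status)"
    fi
}

# Demonstration with plain grep, which shares the 0 = match / 1 = no match
# convention, so pcregrep does not have to be installed.
rc=0; printf 'hello\n' | grep -q 'hello' || rc=$?
classify_status "$rc"
rc=0; printf 'hello\n' | grep -q 'absent' || rc=$?
classify_status "$rc"

# 139 = 128 + 11 (SIGSEGV), the status a shell reports after a segfault.
classify_status 139
```

In the loop, the same idea would mean running pcregrep once per pattern, saving `$?`, and stopping or logging when the label is neither "match" nor "no match".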

