Original question: http://stackoverflow.com/questions/33396629/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How can I find repeated words in a file using grep/egrep?
Asked by Mouse
I need to find repeated words in a file using egrep (or grep -E) on Unix (bash).
I tried:
egrep "(\<[a-zA-Z]+\>) \1" file.txt
and
egrep "(\b[a-zA-Z]+\b) \1" file.txt
but for some reason these treat things as repeats that aren't! For example, they consider the string "word words" to meet the criteria despite the word boundary condition \> or \b.
Accepted answer by rici
\1 matches whatever string was matched by the first capture. That is not the same as matching the same pattern as was matched by the first capture. So the fact that the first capture matched on a word boundary is no longer relevant, even though the \b is inside the capture parentheses.
If you want the second instance to also be on a word boundary, you need to say so:
egrep "(\b[a-zA-Z]+) \1\b" file.txt
That is no different from:
egrep "\b([a-zA-Z]+) \1\b" file.txt
The space in the pattern forces a word boundary on either side of it, so I removed the redundant \b's. If you wanted to be more explicit, you could put them in:
egrep "\<([a-zA-Z]+)\> \<\1\>" file.txt
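The difference between the original pattern and the corrected one can be checked on a couple of throwaway lines. This is a minimal sketch, assuming GNU grep (where `grep -E` is the current spelling of `egrep` and accepts back-references and `\b` in extended patterns); the sample strings are made up for illustration.

```shell
# Sample input (hypothetical): one false positive, one real repeat.
sample='word words
the the cat'

# Original pattern: \1 only repeats the text "word", so "word words" matches too.
loose=$(printf '%s\n' "$sample" | grep -E '(\b[a-zA-Z]+\b) \1')

# Corrected pattern: the trailing \b forces the repeat to end at a word boundary.
strict=$(printf '%s\n' "$sample" | grep -E '\b([a-zA-Z]+) \1\b')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

With the stricter pattern only the line containing a true repeat should be reported.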
Answer by fedorqui 'SO stop harming'
This is the expected behaviour. See what man grep says:
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word. The symbol \w is a synonym for [_[:alnum:]] and \W is a synonym for [^_[:alnum:]].
and then in another place we see what "word" is:
Matching Control
Word-constituent characters are letters, digits, and the underscore.
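The definition of word-constituent characters can be checked directly against a boundary-anchored pattern. Here is a small sketch, assuming GNU grep; the input lines are invented for illustration. Because `_` and digits count as word characters, there is no boundary between `foo` and `_bar` or between `foo` and `2`, while `-` does create one.

```shell
# Hypothetical input: the second "foo" is followed by '_', a digit, and '-'.
matches=$(printf '%s\n' 'foo foo_bar' 'foo foo2' 'foo foo-bar' |
  grep -E '\b([a-zA-Z]+) \1\b')

# Only the line where a non-word character follows the repeat should be printed.
printf '%s\n' "$matches"
```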
So this is what the patterns from the question produce:
$ cat a
hello bye
hello and and bye
words words
this are words words
"words words"
$ egrep "(\b[a-zA-Z]+\b) \1" a
hello and and bye
words words
this are words words
"words words"
$ egrep "(\<[a-zA-Z]+\>) \1" a
hello and and bye
words words
this are words words
"words words"
Answer by Martin Thoma
I use
pcregrep -M '(\b[a-zA-Z]+)\s+\1\b' *
to check my documents for such errors. This also works if there is a line break between the duplicated words.
Explanation:
-M, --multiline: run in multiline mode (important if a line break is between the duplicated words).
[a-zA-Z]+: match words.
\b: word boundary.
(\b[a-zA-Z]+): group it.
\s+: match at least one (but as many more as necessary) whitespace characters. This includes newline.
\1: match whatever was in the first group.
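Where pcregrep is not installed, repeats that straddle a line break can also be found with a different technique from the one in this answer: squeeze all whitespace, including newlines, into single spaces and then apply the single-line pattern. This is a rough sketch assuming GNU grep; the file name `doc.txt` and its contents are hypothetical, and unlike `pcregrep -M` it reports only the repeated words, not the lines they came from.

```shell
# Create a throwaway file with a repeat split across a line break (hypothetical content).
tmpdir=$(mktemp -d)
printf 'this is a\na test\n' > "$tmpdir/doc.txt"

# tr -s turns every run of whitespace (newlines included) into one space,
# then grep -o prints only the matched "word word" pairs.
dupes=$(tr -s '[:space:]' ' ' < "$tmpdir/doc.txt" | grep -oE '\b([a-zA-Z]+) \1\b')

printf '%s\n' "$dupes"
rm -r "$tmpdir"
```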
Answer by Mouse
egrep "(\<[a-zA-Z]+\>) \<\1\>" file.txt
fixes the problem.
Basically, you have to tell \1 that it needs to stay within word boundaries too.