bash: How to find repeated words in a file using grep/egrep?

Question

Asked by Mouse

I need to find repeated words in a file using egrep (or grep -E) in Unix (bash).

I tried:

egrep "(\<[a-zA-Z]+\>) \1" file.txt

and

egrep "(\b[a-zA-Z]+\b) \1" file.txt

But for some reason these report repeats that aren't! For example, the string "word words" is treated as a match despite the word boundary condition \> or \b.

Answer 1

Accepted answer by rici

\1 matches whatever string was matched by the first capture. That is not the same as matching the same pattern as was matched by the first capture. So the fact that the first capture matched on a word boundary is no longer relevant, even though the \b is inside the capture parentheses.

If you want the second instance to also be on a word boundary, you need to say so:

egrep "(\b[a-zA-Z]+) \1\b" file.txt

That is no different from:

egrep "\b([a-zA-Z]+) \1\b" file.txt

The space in the pattern forces a word boundary, so I removed the redundant \bs. If you wanted to be more explicit, you could put them in:

egrep "\<([a-zA-Z]+)\> \<\1\>" file.txt
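
To see the difference concretely, here is a minimal sketch that runs the question's pattern and the fully bounded pattern against a small sample. The sample lines and the variable names `sample`, `loose` and `strict` are made up for illustration, and the sketch assumes GNU grep, since `\<`, `\>` and back-references under `-E` are GNU extensions.

```shell
#!/usr/bin/env bash
# Hypothetical sample input (not from the original post).
sample=$(printf '%s\n' 'word words' 'the the cat' 'hello bye')

# No boundary after \1: "word words" is reported, because
# \1 ("word") matches the first four letters of "words".
loose=$(printf '%s\n' "$sample" | grep -E '(\<[a-zA-Z]+\>) \1')

# \< and \> around \1: only a genuine repeated word matches.
strict=$(printf '%s\n' "$sample" | grep -E '\<([a-zA-Z]+)\> \<\1\>')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern reports both "word words" and "the the cat", while the strict one reports only "the the cat".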

Answer 2

Answer by fedorqui 'SO stop harming'

This is the expected behaviour. See what man grep says:

The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word. The symbol \w is a synonym for [_[:alnum:]] and \W is a synonym for [^_[:alnum:]].

and then in another place we see what "word" is:

Matching Control
Word-constituent characters are letters, digits, and the underscore.

So this is what it will produce:

$ cat a
hello bye
hello and and bye
words words
this are words words
"words words"
$ egrep "(\b[a-zA-Z]+\b) \1" a
hello and and bye
words words
this are words words
"words words"
$ egrep "(\<[a-zA-Z]+\>) \1" a
hello and and bye
words words
this are words words
"words words"
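
The man page's definition of a word also decides where \b can match after the back-reference. The sketch below illustrates that with made-up test lines (the names `lines` and `matches` are hypothetical), again assuming GNU grep.

```shell
#!/usr/bin/env bash
# Hypothetical test lines (not from the original post).
lines=$(printf '%s\n' 'foo foo_bar' 'foo foo2' 'foo foo-bar' 'foo foo.')

# "_" and digits are word constituents, so there is no word edge
# after the second "foo" on the first two lines. "-" and "." are
# not word constituents, so the last two lines count as repeats.
matches=$(printf '%s\n' "$lines" | grep -E '\b([a-zA-Z]+) \1\b')

printf '%s\n' "$matches"
```

Only "foo foo-bar" and "foo foo." are printed, which may or may not be what you want when checking prose for duplicated words.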

Answer 3

Answer by Martin Thoma

I use

pcregrep -M '(\b[a-zA-Z]+)\s+\1\b' *

to check my documents for such errors. This also works if there is a line break between the duplicated words.

Explanation:

-M, --multiline: run in multiline mode (important if a line break is between the duplicated words).
[a-zA-Z]+: match a word
\b: word boundary
(\b[a-zA-Z]+): group it
\s+: match at least one (but as many more as necessary) whitespace characters. This includes newline.
\1: match whatever was in the first group
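
If pcregrep is not available, a rough approximation of the same cross-line check can be sketched with awk. This is a different technique from the regex above, not an equivalent: it compares whitespace-separated tokens, so a word followed by punctuation ("the.") is not treated as "the". The function name `find_cross_line_repeats` and the sample text are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# find_cross_line_repeats FILE
# Print "LINE: WORD" for each purely alphabetic token that is
# identical to the token just before it, even across a line break.
find_cross_line_repeats() {
  awk '{
    for (i = 1; i <= NF; i++) {
      # Only tokens made of letters count, mirroring [a-zA-Z]+.
      if ($i ~ /^[a-zA-Z]+$/ && $i == prev) print NR ": " $i
      prev = $i   # carried over to the next line on purpose
    }
  }' "$1"
}

# Hypothetical sample: one repeat within a line, one across lines.
tmp=$(mktemp)
printf '%s\n' 'this is is fine' 'a line ending with the' 'the next line' > "$tmp"
result=$(find_cross_line_repeats "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
```

The reported line number is the line on which the second copy of the word appears.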

Answer 4

Answer by Mouse

egrep "(\<[a-zA-Z]+\>) \<\1\>" file.txt

fixes the problem.

Basically, you have to tell \1 that it needs to stay within word boundaries too.
