C# regular expression to find and remove duplicate words

Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/1058783/

Time: 2020-08-06 07:12:42  Source: igfitidea

Regular expression to find and remove duplicate words

c# regex string

Asked by triniMahn

Using regular expressions in C#, is there any way to find and remove duplicate words or symbols in a string containing a variety of words and symbols?

Ex.

Initial string of words:

"I like the environment. The environment is good."

Desired string:

"I like the environment. is good"

Duplicates removed: "the", "environment", "."

Accepted answer by Per Erik Stendahl

As said by others, you need more than a regex to keep track of words:

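// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
var words = new HashSet<string>();
string text = "I like the environment. The environment is good.";
// HashSet.Add returns false for a word already seen, so repeats are replaced with "".
// The pattern must be a verbatim string (@"...") so that \w is not treated as a C# escape.
text = Regex.Replace(text, @"\w+", m =>
                     words.Add(m.Value.ToUpperInvariant())
                         ? m.Value
                         : String.Empty);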
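
The snippet above only tracks words, so the repeated "." and the leftover spaces remain in the result. As a rough sketch of one way to get closer to the desired string in the question, the variant below also treats each punctuation character as a token and collapses the whitespace afterwards. The class and method names are hypothetical, and the extra `[^\w\s]` alternative and the whitespace cleanup are assumptions that are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DuplicateTokenRemover
{
    // Hypothetical helper: keeps the first occurrence of every word or
    // punctuation character and drops later ones, ignoring case.
    public static string RemoveDuplicateTokens(string text)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // \w+ matches a word; [^\w\s] matches one symbol such as '.'.
        string result = Regex.Replace(text, @"\w+|[^\w\s]", m =>
            seen.Add(m.Value) ? m.Value : string.Empty);

        // Removing tokens leaves runs of spaces behind, so collapse them.
        return Regex.Replace(result, @"\s+", " ").Trim();
    }
}
```

Calling `DuplicateTokenRemover.RemoveDuplicateTokens("I like the environment. The environment is good.")` should give `I like the environment. is good`, which matches the desired string from the question.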

Answer by chaos

Well, Jeff has shown me how to use the magic of in-expression backreferences and the global modifier to make this one happen, so my original answer is inoperative. You should all go vote for Jeff's answer. However, for posterity I'll note that there's a tricky little regex engine sensitivity issue in this one, and if you were using Perl-flavored regex, you would need to do this:

\b(\S+)\b(?=.*\b\1\b.*)

instead of Jeff's answer, because C#'s regex will effectively capture the \b in \1, but PCRE will not.

Answer by tanascius

Have a look at backreferences:
http://msdn.microsoft.com/en-us/library/thwdfzxy(VS.71).aspx

This is a regex that will find doubled words. But it will match only one word per match, so you have to use it more than once.

new Regex( @"(.*)\b(\w+)\b(.*)(\2)(.*)", RegexOptions.IgnoreCase );

Of course this is not the best solution (see other answers, which propose not to use a regex at all). But you asked for a regex - here is one. Maybe just the idea helps you ...
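
To illustrate the "use it more than once" idea, here is a minimal sketch that applies a backreference pattern in a loop until the text stops changing. It deliberately uses a tightened pattern of my own, `\b(\w+)\b(.*?)\b\1\b`, rather than the exact regex above, so that the backreference cannot match inside a longer word; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class RepeatedPassRemover
{
    // Assumed variant: a word, the shortest possible gap, then the same word again.
    private static readonly Regex Repeated = new Regex(
        @"\b(\w+)\b(.*?)\b\1\b",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string RemoveRepeats(string text)
    {
        string previous;
        do
        {
            previous = text;
            // Keep the first occurrence and the gap; drop the later occurrence.
            text = Repeated.Replace(text, "$1$2");
        } while (text != previous); // every successful pass shortens the text
        return text;
    }
}
```

Each pass removes only some of the repeats, which is why the loop is needed. Punctuation and the spaces left behind by removed words are not handled here and would need a separate cleanup step.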

Answer by Tobias Hertkorn

Regex is not suited for everything. Something like your problem does fall into that category. I would advise you to use a parser instead.

Answer by arnsholt

As others have pointed out, this is doable with backreferences. See http://msdn.microsoft.com/nb-no/library/thwdfzxy(en-us).aspx for the details on how to use backreferences in .Net.

Your particular problem to remove punctuation as well makes it a bit more complicated, but I think code along these lines (whitespace is not significant in that regex) should do the trick:

(\b\w+(?:\s+\w+)*)\s+\1

I've not tested the regex at all, but that should match one or more words separated by whitespace that are repeated. You'll have to add some more logic to allow for punctuation and so on.
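
As a small sketch of how a pattern of this shape could be used in C#, the helper below collapses a run of words that is immediately repeated into a single copy. It assumes the pattern ends with the backreference `\1`, adds a trailing `\b` of my own so the repeat must end on a word boundary, and uses hypothetical class and method names.

```csharp
using System.Text.RegularExpressions;

public static class AdjacentRepeatRemover
{
    public static string CollapseAdjacentRepeats(string text)
    {
        // Group 1 is a run of one or more words; \1 requires the same run
        // to follow immediately after whitespace.
        return Regex.Replace(
            text,
            @"\b(\w+(?:\s+\w+)*)\s+\1\b",
            "$1",
            RegexOptions.IgnoreCase);
    }
}
```

For example, `CollapseAdjacentRepeats("one two one two three")` should return `one two three`. This only handles repeats that are directly adjacent, and a word repeated three or more times in a row may need more than one pass.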

Answer by Matt Bridges

You won't be able to use regular expressions for this problem, because regex only matches regular languages. The pattern you are trying to match is context-sensitive, and therefore not "regular."

Fortunately, it is easy enough to write a parser. Have a look at Per Erik Stendahl's code.

Answer by user7116

Regular expressions would be a poor choice of "tools" to solve this problem. Perhaps the following could work:

HashSet<string> corpus = new HashSet<string>();
char[] split = new char[] { ' ', '\t', '\r', '\n', '.', ';', ',', ':', ... };

foreach (string line in inputLines)
{
    string[] parts = line.Split(split, StringSplitOptions.RemoveEmptyEntries);
    foreach (string part in parts)
    {
        corpus.Add(part.ToUpperInvariant());
    }
}

// 'corpus' now contains all of the unique tokens

EDIT: This is me making a big assumption that you're "lexing" for some sort of analysis like searching.

Answer by Jeff Atwood

This seems to work for me

(\b\S+\b)(?=.*\1)

Matches like so

apple apple orange  
orange red blue green orange green blue  
pirates ninjas cowboys ninjas pirates  
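
To make the matching behaviour concrete, here is a minimal sketch that runs a lookahead pattern of this kind over a line. It assumes the lookahead contains the backreference `\1`, so a token matches only when the same text appears again later on the line; the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadDuplicates
{
    // Assumes the lookahead contains the backreference \1.
    private const string Pattern = @"(\b\S+\b)(?=.*\1)";

    // Returns every token that occurs again later on the same line.
    public static List<string> FindEarlierDuplicates(string line)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(line, Pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    // Removes those earlier occurrences, so the last copy of each token survives.
    public static string RemoveEarlierDuplicates(string line)
    {
        return Regex.Replace(line, Pattern, string.Empty);
    }
}
```

On the second sample line, `FindEarlierDuplicates` should return `orange`, `blue`, `green`. Note that this approach keeps the last occurrence of each token rather than the first, and that `\1` in the lookahead is not anchored to word boundaries, so it can also be satisfied by a longer word that merely contains the token.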

Answer by Ian Ringrose

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

See When not to use Regex in C# (or Java, C++ etc)

Of course, using a regex to split the string into words may be a useful first step; however, String.Split() is clear and is likely to do everything you need.
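
A minimal sketch of the String.Split() route, assuming words are separated by single spaces and that case should be ignored; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitDeduplicator
{
    public static string RemoveDuplicateWords(string text)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        string[] words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // HashSet.Add returns false for a word already seen,
        // so Where keeps only the first occurrence of each word.
        return string.Join(" ", words.Where(word => seen.Add(word)));
    }
}
```

With this sketch, `RemoveDuplicateWords("pirates ninjas cowboys ninjas pirates")` should return `pirates ninjas cowboys`. Punctuation stays attached to its word, so "environment." and "environment" would count as different tokens unless the separators are extended.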
