C# regex to match all words except a given list

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/242698/

Regex to match all words except a given list

c# .net regex

Asked by John

I am trying to write a replacement regular expression to surround all words in quotes except the words AND, OR and NOT.

I have tried the following for the match part of the expression:

(?i)(?<word>[a-z0-9]+)(?<!and|not|or)

and

(?i)(?<word>[a-z0-9]+)(?!and|not|or)

but neither works. The replacement expression is simple and currently surrounds all words.

"${word}"

So

This and This not That

becomes

"This" and "This" not "That"

Accepted answer by Tomalak

This is a little dirty, but it works:

(?<!\b(?:and| or|not))\b(?!(?:and|or|not)\b)

In plain English, this matches any word boundary that is neither preceded nor followed by "and", "or", or "not". It considers whole words only: for example, the position after the word "sand" is not rejected merely because it is preceded by the letters "and".

The space in front of the "or" in the zero-width look-behind assertion is necessary to make it a fixed-length look-behind. See whether that already solves your problem.

EDIT: Applied to the string "except the words AND, OR and NOT." as a global replace with single quotes, this returns:

'except' 'the' 'words' AND, OR and NOT.

Answer by Marc Gravell

Call me crazy, but I'm not a fan of fighting regex; I limit my patterns to simple things I can understand, and often cheat for the rest - for example via a MatchEvaluator:

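    // Requires: using System; using System.Text.RegularExpressions;
    // Words to leave unquoted (compared case-insensitively).
    string[] whitelist = new string[] { "and", "not", "or" };
    string input = "foo and bar or blop";
    // IgnoreCase lets [a-z0-9]+ match upper-case letters too, e.g. "This".
    string result = Regex.Replace(input, @"([a-z0-9]+)",
        delegate(Match match) {
            string word = match.Groups[1].Value;
            // Listed words are returned unchanged; all others get double quotes.
            return Array.Exists(whitelist,
                    w => string.Equals(w, word, StringComparison.OrdinalIgnoreCase))
                ? word : ("\"" + word + "\"");
        }, RegexOptions.IgnoreCase);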

(edited for more terse layout)

Answer by Markus Jarderot

Based on Tomalak's answer:

(?<!and|or|not)\b(?!and|or|not)

This regex has two problems:

  1. (?<! ) only works for fixed-length look-behind.

  2. The previous regex only looked at the ending/beginning of the surrounding words, not the whole word.

(?<!\band)(?<!\bor)(?<!\bnot)\b(?!(?:and|or|not)\b)

This regex fixes both of the above problems: first, by splitting the look-behind into three separate ones; second, by adding word boundaries (\b) inside the look-arounds.
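
As a rough illustration of how such a boundary-only pattern can be applied, here is a minimal C# sketch. The class and method names are made up for this example, and RegexOptions.IgnoreCase is an assumption added so that upper-case AND/OR/NOT are skipped as well. Note that .NET itself accepts variable-length look-behind, so the split mainly matters for engines that do not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryQuoteDemo
{
    // Pattern from the answer above. IgnoreCase is an assumption of this sketch.
    private static readonly Regex Boundary = new Regex(
        @"(?<!\band)(?<!\bor)(?<!\bnot)\b(?!(?:and|or|not)\b)",
        RegexOptions.IgnoreCase);

    // Inserts a single quote at every word boundary that does not
    // belong to one of the excluded words.
    public static string QuoteWords(string input)
    {
        return Boundary.Replace(input, "'");
    }

    public static void Main()
    {
        Console.WriteLine(QuoteWords("except the words AND, OR and NOT."));
        // => 'except' 'the' 'words' AND, OR and NOT.

        Console.WriteLine(QuoteWords("sand or android"));
        // => 'sand' or 'android'
    }
}
```

The second call shows the whole-word behaviour: "sand" and "android" contain "and" but are still quoted, because the \b inside each look-around ties the exclusion to complete words.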

Answer by Jan Goyvaerts

John,

The regex in your question is almost correct. The only problem is that you put the lookahead at the end of the regex instead of at the start. Also, you need to add word boundaries to force the regex to match whole words. Otherwise, it will match "nd" in "and", "r" in "or", etc, because "nd" and "r" are not in your negative lookahead.

(?i)\b(?!and|not|or)(?<word>[a-z0-9]+)\b
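
A small C# sketch of this approach, using the "${word}" replacement from the question. The class name, method name and the second pattern are assumptions made for illustration: the variant adds \b inside the lookahead, because the pattern as written also skips longer words that merely start with "and", "not" or "or".

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadQuoteDemo
{
    // Pattern from the answer above, with the named group "word".
    public const string AnswerPattern =
        @"(?i)\b(?!and|not|or)(?<word>[a-z0-9]+)\b";

    // Hypothetical variant: \b inside the lookahead, so only the
    // complete words and/not/or are excluded.
    public const string WholeWordPattern =
        @"(?i)\b(?!(?:and|not|or)\b)(?<word>[a-z0-9]+)\b";

    // Wraps every match in double quotes via the named group.
    public static string Quote(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "\"${word}\"");
    }

    public static void Main()
    {
        Console.WriteLine(Quote("This and This not That", AnswerPattern));
        // => "This" and "This" not "That"

        Console.WriteLine(Quote("android or notice", AnswerPattern));
        // => android or notice

        Console.WriteLine(Quote("android or notice", WholeWordPattern));
        // => "android" or "notice"
    }
}
```

Whether the stricter variant is wanted depends on the input; for the example sentence in the question both patterns give the same result.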

Answer by Jan Goyvaerts

(?!\bnot\b|\band\b|\bor\b|\b\"[^"]+\"\b)((?<=\s|\-|\(|^)[^\"\s\()]+(?=\s|\*|\)|$))

I use this regex to find all words that are neither within double quotes nor one of the words "not", "and", or "or".

Answer by Wiktor Stribiżew

To match any "word" that is a combination of letters, digits or underscores (including any other word chars defined in the \w shorthand character class), you may use word boundaries, as in

\b(?!(?:word1|word2|word3)\b)\w+

If the "word" is a chunk of non-whitespace characters with start/end of string or whitespace on both ends, use whitespace boundaries, as in

(?<!\S)(?!(?:word1|word2|word3)(?!\S))\S+

Here, the two expressions will look like

\b(?!(?:and|not|or)\b)\w+
(?<!\S)(?!(?:and|not|or)(?!\S))\S+

See the regex demo (or a popular regex101 demo, but please note that the PCRE \w meaning is different from the .NET \w meaning).
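
To make the difference between the two boundary styles concrete, here is a minimal C# sketch. The class name, method name and sample input are assumptions chosen for illustration; the two patterns are the ones shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryStyleDemo
{
    // Word-boundary version: a "word" is a run of \w characters.
    public const string WordBoundary =
        @"\b(?!(?:and|not|or)\b)\w+";

    // Whitespace-boundary version: a "word" is a run of non-whitespace.
    public const string WhitespaceBoundary =
        @"(?<!\S)(?!(?:and|not|or)(?!\S))\S+";

    // $& in the replacement stands for the whole match.
    public static string Quote(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "\"$&\"");
    }

    public static void Main()
    {
        const string input = "and/or this and that";

        Console.WriteLine(Quote(input, WordBoundary));
        // => and/or "this" and "that"

        Console.WriteLine(Quote(input, WhitespaceBoundary));
        // => "and/or" "this" and "that"
    }
}
```

With word boundaries, "and/or" is seen as two excluded words separated by a slash; with whitespace boundaries it is a single chunk that is not on the list, so it gets quoted.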

Pattern explanation

  • \b - word boundary
  • (?<!\S) - a negative lookbehind that matches a location that is not immediately preceded by a character other than whitespace; it requires the start of the string or a whitespace char right before the current location
  • (?!(?:word1|word2|word3)\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a word1, word2 or word3 char sequence followed by a word boundary (or, if the (?!\S) whitespace right-hand boundary is used, followed by whitespace or the end of the string)
  • \w+ - 1+ word chars
  • \S+ - 1+ chars other than whitespace

In C#, as in any other programming language, you may build the pattern dynamically by joining array/list items with a pipe character; see the demo below:

var exceptions = new[] { "and", "not", "or" };
var result = Regex.Replace("This and This not That", 
        $@"\b(?!(?:{string.Join("|", exceptions)})\b)\w+",
        "\"$&\"");
Console.WriteLine(result); // => "This" and "This" not "That"

If your "words" may contain special characters, the whitespace boundaries approach is more suitable, and make sure to escape the "words" with, say, exceptions.Select(Regex.Escape):

var pattern = $@"(?<!\S)(?!(?:{string.Join("|", exceptions.Select(Regex.Escape))})(?!\S))\S+";
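
A possible end-to-end sketch of the escaped, whitespace-boundary variant follows. The class name, method name and sample exception list are hypothetical; the pattern is the one built in the line above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapedExceptionsDemo
{
    // Quotes every whitespace-delimited chunk that is not in the exceptions list.
    public static string QuoteExcept(string input, string[] exceptions)
    {
        // Regex.Escape makes characters such as + and . match literally.
        var alternation = string.Join("|", exceptions.Select(Regex.Escape));
        var pattern = $@"(?<!\S)(?!(?:{alternation})(?!\S))\S+";
        return Regex.Replace(input, pattern, "\"$&\"");
    }

    public static void Main()
    {
        var exceptions = new[] { "c++", "a.b" };
        Console.WriteLine(QuoteExcept("c++ and a.b axb", exceptions));
        // => c++ "and" a.b "axb"
    }
}
```

Without the escaping step, "c++" would not be a valid pattern fragment and "a.b" would also exclude "axb", since an unescaped dot matches any character.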

NOTE: If there are too many words to search for, it might be a better idea to build a regex trie out of them.