Finding quoted strings with escaped quotes in C# using a regular expression
Original question: http://stackoverflow.com/questions/2148587/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by Joshua Lowry
I'm trying to find all of the quoted text on a single line.
Example:
"Some Text"
"Some more Text"
"Even more text about \"this text\""
I need to get:
"Some Text"
"Some more Text"
"Even more text about \"this text\""
\"[^\"\r]*\" gives me everything except for the last one, because of the escaped quotes.
I have read about \"[^\"\\]*(?:\\.[^\"\\]*)*\" working, but I get an error at run time:
parsing ""[^"\]*(?:\.[^"\]*)*"" - Unterminated [] set.
How do I fix this?
Accepted answer by Alan Moore
What you've got there is an example of Friedl's "unrolled loop" technique, but you seem to have some confusion about how to express it as a string literal. Here's how it should look to the regex compiler:
"[^"\\]*(?:\\.[^"\\]*)*"
The initial "[^"\\]* matches a quotation mark followed by zero or more of any characters other than quotation marks or backslashes. That part alone, along with the final ", will match a simple quoted string with no embedded escape sequences, like "this" or "".
If it does encounter a backslash, \\. consumes the backslash and whatever follows it, and [^"\\]* (again) consumes everything up to the next backslash or quotation mark. That part gets repeated as many times as necessary until an unescaped quotation mark turns up (or it reaches the end of the string and the match attempt fails).
Note that this will match "foo\"-" in \"foo\"-"bar". That may seem to expose a flaw in the regex, but it doesn't; it's the input that's invalid. The goal was to match quoted strings, optionally containing backslash-escaped quotes, embedded in other text--why would there be escaped quotes outside of quoted strings? If you really need to support that, you have a much more complex problem, requiring a very different approach.
As I said, the above is how the regex should look to the regex compiler. But you're writing it in the form of a string literal, and those tend to treat certain characters specially--i.e., backslashes and quotation marks. Fortunately, C#'s verbatim strings save you the hassle of having to double-escape backslashes; you just have to escape each quotation mark with another quotation mark:
Regex r = new Regex(@"""[^""\\]*(?:\\.[^""\\]*)*""");
So the rule is double quotation marks for the C# compiler and double backslashes for the regex compiler--nice and easy. This particular regex may look a little awkward, with the three quotation marks at either end, but consider the alternative:
Regex r = new Regex("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");
In Java, you always have to write them that way. :-(
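As a quick check of the two literal forms above, here is a minimal sketch showing that the verbatim literal and the regular literal produce the same regex text, and that the pattern picks up a string with escaped quotes. The class name QuotedStringDemo and the helper FindQuoted are hypothetical names chosen for illustration; they are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedStringDemo
{
    // Verbatim literal: backslashes are taken literally, each " is written as "".
    public const string VerbatimPattern = @"""[^""\\]*(?:\\.[^""\\]*)*""";

    // Regular literal: each regex backslash is doubled again, each " is written as \".
    public const string RegularPattern = "\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"";

    // Hypothetical helper: returns every quoted string found in the input.
    public static List<string> FindQuoted(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, VerbatimPattern))
            results.Add(m.Value);
        return results;
    }

    public static void Main()
    {
        // Both literals describe the same regex text.
        Console.WriteLine(VerbatimPattern == RegularPattern);
        Console.WriteLine(VerbatimPattern);

        // Input as the regex sees it: "Some Text" and "Even more text about \"this text\""
        string line = "\"Some Text\" and \"Even more text about \\\"this text\\\"\"";
        foreach (string s in FindQuoted(line))
            Console.WriteLine(s);
    }
}
```

Printing the pattern before compiling it is a cheap way to see what the regex compiler actually receives.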
Answer by Krill
I know this isn't the cleanest method, but with your example I would check the character before the " to see if it's a \. If it is, I would ignore the quote.
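A hand-written scanner in the spirit of this suggestion might look like the sketch below. It is a variation rather than a literal implementation: instead of looking back one character, it skips whatever follows a backslash inside a string, an assumption made here so that an escaped backslash right before the closing quote is still handled. QuoteScanner and Scan are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class QuoteScanner
{
    // Hypothetical helper: collects quoted substrings, treating \x as an escape inside quotes.
    public static List<string> Scan(string input)
    {
        var results = new List<string>();
        int start = -1; // index of the opening quote, or -1 while outside a string

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (start >= 0 && c == '\\')
            {
                i++; // skip the escaped character, whatever it is
            }
            else if (c == '"')
            {
                if (start < 0)
                {
                    start = i; // opening quote
                }
                else
                {
                    results.Add(input.Substring(start, i - start + 1));
                    start = -1; // closing quote
                }
            }
        }
        return results;
    }

    public static void Main()
    {
        // Input: "Some Text" and "Even more text about \"this text\""
        foreach (string s in Scan("\"Some Text\" and \"Even more text about \\\"this text\\\"\""))
            Console.WriteLine(s);
    }
}
```

Backslashes outside of quotes are not treated as escapes here, so the sketch shares the regex's behavior on input such as \"foo\"-"bar".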
Answer by Fried Hoeben
Any chance you need to do: \"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"
Answer by Tim Pietzcker
"(\\"|\\\\|[^"\\])*"
should work. Match either an escaped quote, an escaped backslash, or any other character except a quote or backslash character. Repeat.
In C#:
// Requires: using System.Collections.Specialized; using System.Text.RegularExpressions;
StringCollection resultList = new StringCollection();
Regex regexObj = new Regex(@"""(\\""|\\\\|[^""\\])*""");
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
    resultList.Add(matchResult.Value);
    matchResult = matchResult.NextMatch();
}
Edit: Added escaped backslash to the list to correctly handle "This is a test\\".
Explanation:
First match a quote character.
Then the alternatives are evaluated from left to right. The engine first tries to match an escaped quote. If that doesn't match, it tries an escaped backslash. That way, it can distinguish between "Hello \" string continues" and "String ends here \\".
If neither matches, then anything else is allowed except for a quote or backslash character. Then repeat.
Finally, match the closing quote.
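The two cases from the explanation can be tried out with a short sketch like the one below. AlternationDemo and FirstQuoted are hypothetical names used only for this illustration; the pattern is the one given in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Regex text seen by the engine: "(\\"|\\\\|[^"\\])*"
    private static readonly Regex Quoted = new Regex(@"""(\\""|\\\\|[^""\\])*""");

    // Hypothetical helper: returns the first quoted string in the input, or null if none is found.
    public static string FirstQuoted(string input)
    {
        Match m = Quoted.Match(input);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // The backslash escapes the quote, so the string continues past it.
        Console.WriteLine(FirstQuoted("\"Hello \\\" string continues\" tail"));

        // The backslash is itself escaped, so the quote after it closes the string.
        Console.WriteLine(FirstQuoted("\"String ends here \\\\\" tail"));
    }
}
```

Because the alternation only lists an escaped quote and an escaped backslash, a backslash followed by any other character inside the quotes makes the match attempt fail at that starting quote.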
Answer by Jason
I recommend getting RegexBuddy. It lets you play around with it until you make sure everything in your test set matches.
As for your problem, I would try four \'s instead of two:
\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"
Answer by Kamarey
The regular expression
(?<!\\)".*?(?<!\\)"
will also handle text that starts with an escaped quote:
\"Some Text\" Some Text "Some Text", and "Some more Text" an""d "Even more text about \"this text\""
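A small sketch of how this lookbehind pattern behaves on the sample line above, assuming the default regex options. LookbehindDemo and Find are hypothetical names. The second call in Main illustrates a caveat: because the lookbehind only checks a single preceding character, a closing quote that follows an escaped backslash is not recognized as the end of the string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Regex text seen by the engine: (?<!\\)".*?(?<!\\)"
    public const string Pattern = @"(?<!\\)"".*?(?<!\\)""";

    // Hypothetical helper: returns all matches of the pattern as strings.
    public static string[] Find(string input) =>
        Regex.Matches(input, Pattern).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // The sample line from the answer, written as a regular C# literal.
        string sample = "\\\"Some Text\\\" Some Text \"Some Text\", and \"Some more Text\" an\"\"d \"Even more text about \\\"this text\\\"\"";
        foreach (string s in Find(sample))
            Console.WriteLine(s);

        // Caveat: in "Hello\\" the quote after the escaped backslash is skipped,
        // so the match runs on to the next quote that has no backslash before it.
        foreach (string s in Find("\"Hello\\\\\" and \"x\""))
            Console.WriteLine(s);
    }
}
```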
Answer by Ricardo Nolde
Regex for capturing strings (with \ for character escaping), for the .NET engine:
(?>(?(STR)(?(ESC).(?<-ESC>)|\\(?<ESC>))|(?!))|(?(STR)"(?<-STR>)|"(?<STR>))|(?(STR).|(?!)))+
Here is a "friendly" version:
(?>                            | specify nonbacktracking
   (?(STR)                     | if (STRING MODE) then
         (?(ESC)               |     if (ESCAPE MODE) then
               .(?<-ESC>)      |          match any char and exit escape mode (pop ESC)
               |               |     else
               \\(?<ESC>)      |          match '\' and enter escape mode (push ESC)
         )                     |     endif
         |                     | else
         (?!)                  |     always fails, so this alternative is skipped
   )                           | endif
   |                           | -- OR
   (?(STR)                     | if (STRING MODE) then
         "(?<-STR>)            |     match '"' and exit string mode (pop STR)
         |                     | else
         "(?<STR>)             |     match '"' and enter string mode (push STR)
   )                           | endif
   |                           | -- OR
   (?(STR)                     | if (STRING MODE) then
         .                     |     match any character
         |                     | else
         (?!)                  |     always fails, so this alternative is skipped
   )                           | endif
)+                             | REPEATS FOR EVERY CHARACTER
Based on the http://tomkaminski.com/conditional-constructs-net-regular-expressions examples. It relies on quote balancing. I use it with great success. Use it with the Singleline flag.
To play around with regexes, I recommend Rad Software Regular Expression Designer, which has a nice "Language Elements" tab with quick access to some basic instructions. It's based on .NET's regex engine.
Answer by Emre
Similar to RegexBuddy posted by @Blankasaurus, RegexMagic helps too.
Answer by Piotr Zierhoffer
A simple answer, without the use of ?, is
"([^\\"]*(\\")*)*\"
or, as a verbatim string
@"^""([^\\""]*(\\"")*(\\[^""])*)*"""
It just means:
- find the first "
- find any number of characters that are not \ or "
- find any number of escaped quotes \"
- find any number of escaped characters that are not quotes
- repeat the last three steps until you find "
I believe it works as well as @Alan Moore's answer, but, for me, is easier to understand. It accepts unmatched ("unbalanced") quotes as well.
Answer by Alex
Well, Alan Moore's answer is good, but I would modify it a bit to make it more compact. For the regex compiler:
"([^"\\]*(\\.)*)*"
Compare with Alan Moore's expression:
"[^"\\]*(\\.[^"\\]*)*"
The explanation is very similar to Alan Moore's:
The first part " matches a quotation mark.
The second part [^"\\]* matches zero or more of any characters other than quotation marks or backslashes.
And the last part (\\.)* matches a backslash and whatever single character follows it. Note the *, which makes this group optional.
The parts described, along with the final " (i.e. "[^"\\]*(\\.)*"), will match: "Some Text" and "Even more Text\"", but will not match: "Even more text about \"this text\"".
To make that possible, the part [^"\\]*(\\.)* needs to be repeated as many times as necessary until an unescaped quotation mark turns up (or it reaches the end of the string and the match attempt fails). So I wrapped that part in parentheses and added an asterisk. Now it matches: "Some Text", "Even more Text\"", "Even more text about \"this text\"" and "Hello\\".
In C# code it will look like:
var r = new Regex("\"([^\"\\\\]*(\\\\.)*)*\"");
BTW, the order of the two main parts, [^"\\]* and (\\.)*, does not matter. You can write:
"([^"\\]*(\\.)*)*"
or
"((\\.)*[^"\\]*)*"
The result will be the same.
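The claim about ordering can be spot-checked on a sample line with a sketch like this one. OrderDemo and Find are hypothetical names, and the check covers only this one input rather than proving equivalence in general.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OrderDemo
{
    public const string ClassFirst  = @"""([^""\\]*(\\.)*)*""";  // "([^"\\]*(\\.)*)*"
    public const string EscapeFirst = @"""((\\.)*[^""\\]*)*""";  // "((\\.)*[^"\\]*)*"

    // Hypothetical helper: returns all matches of the given pattern as strings.
    public static string[] Find(string pattern, string input) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // Input: "Some Text" "Even more Text\"" "Even more text about \"this text\"" "Hello\\"
        string input = "\"Some Text\" \"Even more Text\\\"\" \"Even more text about \\\"this text\\\"\" \"Hello\\\\\"";

        string[] a = Find(ClassFirst, input);
        string[] b = Find(EscapeFirst, input);

        // True when both orderings return the same matches in the same order.
        Console.WriteLine(a.SequenceEqual(b));
        foreach (string s in a)
            Console.WriteLine(s);
    }
}
```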
Now we need to solve another problem: \"foo\"-"bar". The current expression will match "foo\"-", but we want it to match "bar". I don't know
why would there be escaped quotes outside of quoted strings
but we can implement it easily by adding the following part to the beginning: (\G|[^\\]). It says that we want the match to start at the point where the previous match ended or after any character except a backslash. Why do we need \G? It is for cases such as "a""b".
Note that (\G|[^\\])"([^"\\]*(\\.)*)*" matches -"bar" in \"foo\"-"bar". So, to get only "bar", we need to specify the group and optionally give it a name, for example "MyGroup". Then the C# code will look like:
// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
[TestMethod]
public void RegExTest()
{
    //Regex compiler: (?:\G|[^\\])(?<MyGroup>"(?:[^"\\]*(?:\\.)*)*")
    string pattern = "(?:\\G|[^\\\\])(?<MyGroup>\"(?:[^\"\\\\]*(?:\\\\.)*)*\")";
    var r = new Regex(pattern, RegexOptions.IgnoreCase);
    //Human readable form: "Some Text" and "Even more Text\"" "Even more text about \"this text\"" "Hello\\" \"foo\"-"bar" "a""b"c"d"
    string inputWithQuotedText = "\"Some Text\" and \"Even more Text\\\"\" \"Even more text about \\\"this text\\\"\" \"Hello\\\\\" \\\"foo\\\"-\"bar\" \"a\"\"b\"c\"d\"";
    var quotedList = new List<string>();
    // Collect only the named group, which excludes the character consumed by [^\\].
    for (Match m = r.Match(inputWithQuotedText); m.Success; m = m.NextMatch())
        quotedList.Add(m.Groups["MyGroup"].Value);
    Assert.AreEqual(8, quotedList.Count);
    Assert.AreEqual("\"Some Text\"", quotedList[0]);
    Assert.AreEqual("\"Even more Text\\\"\"", quotedList[1]);
    Assert.AreEqual("\"Even more text about \\\"this text\\\"\"", quotedList[2]);
    Assert.AreEqual("\"Hello\\\\\"", quotedList[3]);
    Assert.AreEqual("\"bar\"", quotedList[4]);
    Assert.AreEqual("\"a\"", quotedList[5]);
    Assert.AreEqual("\"b\"", quotedList[6]);
    Assert.AreEqual("\"d\"", quotedList[7]);
}