Original source: http://stackoverflow.com/questions/289440/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Cannot get regular expression to work correctly with multiline
Asked by Biri
I have a fairly large XML output from an application. I need to process it with my program and then feed it back to the original program. There are pieces in this XML that need to be filled out or replaced. The interesting part looks like this:
<sys:customtag sys:sid="1" sys:type="Processtart" />
<sys:tag>value</sys:tag>
here are some other tags
<sys:tag>value</sys:tag>
<sys:customtag sys:sid="1" sys:type="Procesend" />
The document contains several pieces like this.
I need to get all the XML pieces between these tags so that I can modify them. I wrote a regular expression to get those pieces, but it does not work:
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(@"output.xml");
Regex regExp = new Regex(@"<sys:customtag(.*?)Processtart(.*?)/>(.*?)<sys:customtag (.*?)Procesend(.*?)/>", RegexOptions.Multiline & RegexOptions.IgnorePatternWhitespace & RegexOptions.CultureInvariant);
MatchCollection matches = regExp.Matches(xmlDoc.InnerXml);
If I put everything on one line and call this regexp without the multiline option, it finds every occurrence. If I leave the file as it is and set the multiline option, it does not work. What is the problem, and what should I change? Or is there an easier way to get the XML parts between these tags without a regexp?
Accepted answer by Owen
I believe the option to use is RegexOptions.Singleline instead of RegexOptions.Multiline (src). Allowing the dot (.) to match newlines should work in your case.
...the mode where the dot also matches newlines is called "single-line mode". This is a bit unfortunate, because it is easy to mix up this term with "multi-line mode". Multi-line mode only affects anchors, and single-line mode only affects the dot ... When using the regex classes of the .NET framework, you activate this mode by specifying RegexOptions.Singleline, such as in Regex.Match("string", "regex", RegexOptions.Singleline).
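To make the difference concrete, here is a minimal sketch that runs the pattern from the question against a shortened copy of the sample XML. The class name SinglelineDemo and the three-line input are made up for illustration; only the pattern and the two RegexOptions values come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Shortened, hypothetical sample: the two marker tags with one line between them.
    public const string Xml =
        "<sys:customtag sys:sid=\"1\" sys:type=\"Processtart\" />\n" +
        "<sys:tag>value</sys:tag>\n" +
        "<sys:customtag sys:sid=\"1\" sys:type=\"Procesend\" />";

    // The pattern exactly as written in the question.
    public const string Pattern =
        @"<sys:customtag(.*?)Processtart(.*?)/>(.*?)<sys:customtag (.*?)Procesend(.*?)/>";

    // Counts how many times the pattern matches the sample under the given options.
    public static int CountMatches(RegexOptions options)
    {
        return Regex.Matches(Xml, Pattern, options).Count;
    }

    public static void Main()
    {
        // Multiline only changes ^ and $, so the dot still stops at '\n'.
        Console.WriteLine(CountMatches(RegexOptions.Multiline));   // 0
        // Singleline lets the dot cross line breaks, so the block is found.
        Console.WriteLine(CountMatches(RegexOptions.Singleline));  // 1
    }
}
```

On real input with several blocks, the lazy (.*?) groups should keep each match from running into the next block, though that is worth checking against the actual file.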
Answer by Marc Gravell
Regex is a poor tool for XML... could you not just load it into an XDocument / XmlDocument and use XPath? If you clarify the modifications you want to make, I expect we can fill in the blanks... Namespaces are probably the main thing that makes it complex in this case, so we just need to use an XmlNamespaceManager.
Here's an example that is, granted, more complex than just a regex - however, I would expect it to cope a lot better with the nuances of xml:
string xml = @"<foo xmlns:sys=""foobar""><bar/><bar><sys:customtag sys:sid=""1"" sys:type=""Processtart"" />
<sys:tag>value</sys:tag>
here are some other tags
<sys:tag>value</sys:tag>
<sys:customtag sys:sid=""1"" sys:type=""Procesend"" /></bar><bar/></foo>";
XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
XmlNamespaceManager mgr = new XmlNamespaceManager(new NameTable());
mgr.AddNamespace("sys", "foobar");
var matches = doc.SelectNodes("//sys:customtag[@sys:type='Processtart']", mgr);
foreach (XmlElement start in matches)
{
    XmlElement end = (XmlElement) start.SelectSingleNode("following-sibling::sys:customtag[@sys:type='Procesend'][1]", mgr);
    XmlNode node = start.NextSibling;
    while (node != null && node != end)
    {
        Console.WriteLine(node.OuterXml);
        node = node.NextSibling;
    }
}
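Since XDocument was mentioned as an alternative, here is a rough LINQ to XML sketch of the same idea. It is not part of the original answer: the class LinqToXmlSketch and the helpers IsMarker and GetBlocks are hypothetical names, and "foobar" is just the placeholder namespace URI from the example above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LinqToXmlSketch
{
    // "foobar" is the placeholder namespace URI used in the XmlDocument example.
    static readonly XNamespace Sys = "foobar";

    // True if the node is a <sys:customtag> element whose sys:type equals the given value.
    static bool IsMarker(XNode node, string type)
    {
        var element = node as XElement;
        return element != null
            && element.Name == Sys + "customtag"
            && (string)element.Attribute(Sys + "type") == type;
    }

    // For each start marker, collects the sibling nodes that follow it,
    // up to (but not including) the next end marker.
    public static List<List<XNode>> GetBlocks(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants(Sys + "customtag")
                  .Where(e => IsMarker(e, "Processtart"))
                  .Select(start => start.NodesAfterSelf()
                                        .TakeWhile(n => !IsMarker(n, "Procesend"))
                                        .ToList())
                  .ToList();
    }

    public static void Main()
    {
        string xml = @"<foo xmlns:sys=""foobar""><bar><sys:customtag sys:sid=""1"" sys:type=""Processtart"" />
<sys:tag>value</sys:tag>
here are some other tags
<sys:tag>value</sys:tag>
<sys:customtag sys:sid=""1"" sys:type=""Procesend"" /></bar></foo>";

        // Print every node found between a start marker and its end marker.
        foreach (List<XNode> block in GetBlocks(xml))
            foreach (XNode node in block)
                Console.WriteLine(node);
    }
}
```

If an end marker is missing, this sketch simply collects every remaining sibling, so real code would probably want to check for that case.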
Answer by user19871
The regex character "." does not match a newline by default, and setting the MultiLine option does not change that. Instead, you should use [\s\S] or another combination that matches anything.
The MultiLine option only modifies the behaviour of ^ (beginning of line instead of beginning of string) and $ (end of line instead of end of string).
BTW: Indeed, regex is not the right way to scan HTML...
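A small sketch of both points, using a made-up two-line string (the class name DotAndAnchorsDemo and the sample text are assumptions for illustration, not from the thread):

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotAndAnchorsDemo
{
    // Hypothetical two-line input.
    public const string Text = "first\nsecond";

    public static void Main()
    {
        // The dot does not cross the line break, even with Multiline set.
        Console.WriteLine(Regex.IsMatch(Text, @"first.second", RegexOptions.Multiline)); // False

        // [\s\S] matches any character, including '\n', with no options at all.
        Console.WriteLine(Regex.IsMatch(Text, @"first[\s\S]second"));                    // True

        // Without Multiline, ^ and $ anchor to the whole string: no single word spans it.
        Console.WriteLine(Regex.Matches(Text, @"^\w+$").Count);                          // 0

        // With Multiline, ^ and $ anchor to each line: both words match.
        Console.WriteLine(Regex.Matches(Text, @"^\w+$", RegexOptions.Multiline).Count);  // 2
    }
}
```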
Answer by Charles
If you're still having problems with this, it may be because you are using AND with your RegexOptions instead of OR.
This code is wrong and will pass zero as the second parameter to the constructor:
Regex regExp = new Regex(@"<sys:customtag(.*?)Processtart(.*?)/>(.*?)<sys:customtag (.*?)Procesend(.*?)/>",
RegexOptions.Multiline & RegexOptions.IgnorePatternWhitespace & RegexOptions.CultureInvariant);
This code is correct (as far as combining multiple RegexOptions flags is concerned):
Regex regExp = new Regex(@"<sys:customtag(.*?)Processtart(.*?)/>(.*?)<sys:customtag (.*?)Procesend(.*?)/>",
RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant);
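To see why AND yields zero, here is a small sketch that evaluates both expressions. Each of these flags is a distinct bit, so ANDing them leaves no bits set. The class and field names (FlagsDemo, Anded, Ored) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlagsDemo
{
    // Bitwise AND of distinct single-bit flags: no bit is common to all three.
    public static readonly RegexOptions Anded =
        RegexOptions.Multiline & RegexOptions.IgnorePatternWhitespace & RegexOptions.CultureInvariant;

    // Bitwise OR keeps every flag.
    public static readonly RegexOptions Ored =
        RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant;

    public static void Main()
    {
        Console.WriteLine(Anded == RegexOptions.None);                         // True
        Console.WriteLine(Ored.HasFlag(RegexOptions.Multiline));               // True
        Console.WriteLine(Ored.HasFlag(RegexOptions.IgnorePatternWhitespace)); // True
        Console.WriteLine(Ored.HasFlag(RegexOptions.CultureInvariant));        // True
    }
}
```

Note that combining the flags correctly does not by itself make the dot match newlines. Per the accepted answer, RegexOptions.Singleline is the option for that.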