C#: Regular expression to remove XML tags and their content

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/121656/

Time: 2020-08-03 14:33:07  Source: igfitidea

Regular expression to remove XML tags and their content

Asked by Vincent

I have the following string and I would like to remove <bpt *>*</bpt> and <ept *>*</ept> (notice the additional tag content inside them that also needs to be removed) without using an XML parser (the overhead is too large for tiny strings).

The big <bpt i="1" x="1" type="bold"><b></bpt>black<ept i="1"></b></ept> <bpt i="2" x="2" type="ulined"><u></bpt>cat<ept i="2"></u></ept> sleeps.

Any regex in VB.NET or C# will do.

Accepted answer by tyshock

If you just want to remove all the tags from the string, use this (C#):

try {
    yourstring = Regex.Replace(yourstring, "(<[be]pt[^>]+>.+?</[be]pt>)", "");
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

EDIT:

I decided to add on to my solution with a better option. The previous option would not work if there were embedded tags. This new solution should strip all <**pt*> tags, embedded or not. In addition, this solution uses a back reference to the original [be] match so that the exact matching end tag is found. This solution also creates a reusable Regex object for improved performance so that each iteration does not have to recompile the Regex:

bool FoundMatch = false;

try {
    Regex regex = new Regex(@"<([be])pt[^>]+>.+?</\1pt>");
    while(regex.IsMatch(yourstring) ) {
        yourstring = regex.Replace(yourstring, "");
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
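
To make the difference between the two patterns concrete, here is a minimal sketch that runs both over an input with an embedded tag. The class name, method names and the input string are made up for illustration; only the two patterns come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // First pattern from the answer: the closing tag may be either </bpt> or </ept>.
    private static readonly Regex Simple = new Regex("(<[be]pt[^>]+>.+?</[be]pt>)");

    // Second pattern: \1 forces the closing tag to repeat the captured 'b' or 'e'.
    private static readonly Regex WithBackreference = new Regex(@"<([be])pt[^>]+>.+?</\1pt>");

    // Repeats the replacement until the pattern no longer matches.
    private static string StripAll(Regex regex, string input)
    {
        while (regex.IsMatch(input))
        {
            input = regex.Replace(input, "");
        }
        return input;
    }

    public static string StripSimple(string input) => StripAll(Simple, input);

    public static string StripWithBackreference(string input) => StripAll(WithBackreference, input);

    public static void Main()
    {
        // Hypothetical input: an <ept> element embedded inside a <bpt> element.
        string input = "start <bpt i=\"1\">a<ept i=\"2\">b</ept>c</bpt>end";

        Console.WriteLine(StripSimple(input));            // start c</bpt>end
        Console.WriteLine(StripWithBackreference(input)); // start end
    }
}
```

The first pattern stops at the embedded </ept> and leaves a stray </bpt> behind, while the backreference keeps expanding until it reaches the closing tag that matches the opening one.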

ADDITIONAL NOTES:

In the comments, a user expressed worry that the '.' pattern matcher would be CPU intensive. While this is true of a standalone greedy '.', the non-greedy modifier '?' makes the regex engine look ahead only until it finds the first match of the next character in the pattern, whereas a greedy '.' requires the engine to look ahead all the way to the end of the string. I use RegexBuddy as a regex development tool, and it includes a debugger which lets you see the relative performance of different regex patterns. It also auto-comments your regexes if desired, so I decided to include those comments here to explain the regex used above:

// <([be])pt[^>]+>.+?</\1pt>
//
// Match the character "<" literally «<»
// Match the regular expression below and capture its match into backreference number 1 «([be])»
//    Match a single character present in the list "be" «[be]»
// Match the characters "pt" literally «pt»
// Match any character that is not a ">" «[^>]+»
//    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the character ">" literally «>»
// Match any single character that is not a line break character «.+?»
//    Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
// Match the characters "</" literally «</»
// Match the same text as most recently matched by backreference number 1 «\1»
// Match the characters "pt>" literally «pt>»
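
Apart from the performance question, the lazy quantifier also changes what gets matched. The sketch below is an illustration only: the class and method names are invented, and the greedy variant is not part of the answer. It runs both variants over the string from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVersusGreedy
{
    // The sample string from the question.
    public const string Sample =
        "The big <bpt i=\"1\" x=\"1\" type=\"bold\"><b></bpt>black<ept i=\"1\"></b></ept> " +
        "<bpt i=\"2\" x=\"2\" type=\"ulined\"><u></bpt>cat<ept i=\"2\"></u></ept> sleeps.";

    // Lazy ".+?": stops at the first closing tag that matches the backreference.
    public static string StripLazy(string input)
    {
        return Regex.Replace(input, @"<([be])pt[^>]+>.+?</\1pt>", "");
    }

    // Greedy ".+" (for comparison only): runs on to the last matching closing tag.
    public static string StripGreedy(string input)
    {
        return Regex.Replace(input, @"<([be])pt[^>]+>.+</\1pt>", "");
    }

    public static void Main()
    {
        Console.WriteLine(StripLazy(Sample));   // The big black cat sleeps.
        Console.WriteLine(StripGreedy(Sample)); // The big cat sleeps.
    }
}
```

With the greedy variant the first match runs from the first <bpt ...> to the last </bpt>, so the word "black" is swallowed along with the tags.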

Answer by davenpcj

I presume you want to drop the tag entirely?

(<bpt .*?>.*?</bpt>)|(<ept .*?>.*?</ept>)

The ? after the * makes it non-greedy, so it will try to match as few characters as possible.

One problem you'll have is nested tags: when an element is nested inside another element of the same name, the match ends at the first closing tag, so the second closing tag is never seen.
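
The nesting limitation can be reproduced with a short sketch. The class name and the nested input are hypothetical; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedTagDemo
{
    // Pattern from this answer: lazy match up to the first closing tag of the same name.
    private const string Pattern = "(<bpt .*?>.*?</bpt>)|(<ept .*?>.*?</ept>)";

    public static string Strip(string input)
    {
        return Regex.Replace(input, Pattern, "");
    }

    public static void Main()
    {
        // Hypothetical input: a <bpt> element nested inside another <bpt> element.
        string nested = "<bpt i=\"1\">outer <bpt i=\"2\">inner</bpt> tail</bpt>";

        // The match ends at the inner </bpt>, so the outer closing tag is left behind.
        Console.WriteLine("[" + Strip(nested) + "]"); // [ tail</bpt>]
    }
}
```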

Answer by Torsten Marek

Does the .NET regex engine support negative lookaheads? If yes, then you can use

(<([eb])pt[^>]+>((?!</\2pt>).)+</\2pt>)

This turns the string above into "The big black cat sleeps." if you remove all matches. However, keep in mind that it will not work if you have nested bpt/ept elements. You might also want to add \s in some places to allow for extra whitespace in closing elements, etc.
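
.NET does support negative lookaheads. As a rough sketch of the whitespace suggestion, the variant below adds \s* before the closing ">". Where exactly to allow whitespace is an assumption made for this example, as are the class name and the input string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Pattern from this answer, with \s* added in the closing tag (an assumption).
    // \2 refers to the second group, the captured 'e' or 'b'.
    private const string Pattern = @"(<([eb])pt[^>]+>((?!</\2pt\s*>).)+</\2pt\s*>)";

    public static string Strip(string input)
    {
        return Regex.Replace(input, Pattern, "");
    }

    public static void Main()
    {
        // Hypothetical input: note the extra space in the first closing tag.
        string input = "The big <bpt i=\"1\"><b></bpt >black<ept i=\"1\"></b></ept> cat sleeps.";

        Console.WriteLine(Strip(input)); // The big black cat sleeps.
    }
}
```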

Answer by Andy Lester

Why do you say the overhead is too large? Did you measure it? Or are you guessing?

Using a regex instead of a proper parser is a shortcut that may trip you up when someone comes along with something like <bpt foo="bar>">.
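
For comparison, a parser-based version might look like the sketch below, which uses LINQ to XML. It assumes the fragment is well-formed XML, so the inner <b> markup has to be escaped as &lt;b&gt;, and it wraps the fragment in a hypothetical <root> element so that it can be parsed. The class name and input are invented for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ParserStrip
{
    public static string Strip(string fragment)
    {
        // Wrap the fragment in a made-up <root> element so mixed content parses,
        // and keep whitespace-only text between elements.
        XElement root = XElement.Parse("<root>" + fragment + "</root>", LoadOptions.PreserveWhitespace);

        // Remove every bpt and ept element together with its content.
        root.Descendants()
            .Where(e => e.Name == "bpt" || e.Name == "ept")
            .Remove();

        // Value concatenates the remaining text nodes.
        return root.Value;
    }

    public static void Main()
    {
        // Hypothetical input with a '>' inside an attribute value.
        string input = "The big <bpt foo=\"bar>\">&lt;b&gt;</bpt>black<ept i=\"1\">&lt;/b&gt;</ept> cat sleeps.";

        Console.WriteLine(Strip(input)); // The big black cat sleeps.
    }
}
```

The parser reads the attribute value as a whole, so the ">" inside the quotes needs no special handling.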

Answer by Robert Rossney

If you're going to use a regex to remove XML elements, you'd better be sure that your input XML doesn't use elements from different namespaces, or contain CDATA sections whose content you don't want to modify.

The proper (i.e. both performant and correct) way to do this is with XSLT. An XSLT transform that copies everything except a specific element to the output is a trivial extension of the identity transform. Once the transform is compiled it will execute extremely quickly. And it won't contain any hidden defects.
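
A minimal sketch of that idea, assuming the input is a well-formed XML document and that the elements to drop are named bpt and ept, could look like this. The stylesheet, class name and <seg> wrapper are illustrative and not taken from the answer.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltStrip
{
    // Identity transform plus an empty template that swallows bpt and ept elements.
    private const string Stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' omit-xml-declaration='yes'/>
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
  <xsl:template match='bpt|ept'/>
</xsl:stylesheet>";

    // Compiled once and reused for every call.
    private static readonly XslCompiledTransform Compiled = Compile();

    private static XslCompiledTransform Compile()
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader reader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(reader);
        }
        return xslt;
    }

    public static string Strip(string xml)
    {
        using (XmlReader input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            Compiled.Transform(input, null, output);
            return output.ToString();
        }
    }

    public static void Main()
    {
        // Hypothetical well-formed input: the inner markup is escaped.
        string xml = "<seg>The big <bpt i=\"1\">&lt;b&gt;</bpt>black<ept i=\"1\">&lt;/b&gt;</ept> cat sleeps.</seg>";
        Console.WriteLine(Strip(xml));
    }
}
```

The more specific bpt|ept template takes priority over the identity template, so those elements and their content are simply not copied to the output.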

Answer by Robert Rossney

Is there any way to get a general regex pattern for XML-type text? That way I could get rid of the Replace function and use only the regex. The trouble is analyzing whether the < and > come in order or not, and also replacing reserved characters such as ' and &. Here is the code:

'handling special chars functions
Friend Function ReplaceSpecChars(ByVal str As String) As String
  Dim arrLessThan As New Collection
  Dim arrGreaterThan As New Collection
  If Not IsDBNull(str) Then
    str = CStr(str)
    If Len(str) > 0 Then
      str = Replace(str, "&", "&amp;")
      str = Replace(str, "'", "&apos;")
      str = Replace(str, """", "&quot;")
      arrLessThan = FindLocationOfChar("<", str)
      arrGreaterThan = FindLocationOfChar(">", str)
      str = ChangeGreaterLess(arrLessThan, arrGreaterThan, str)
      str = Replace(str, Chr(13), "chr(13)")
      str = Replace(str, Chr(10), "chr(10)")
    End If
    Return str
  Else
    Return ""
  End If
End Function

Friend Function ChangeGreaterLess(ByVal lh As Collection, ByVal gr As Collection, ByVal str As String) As String
  For i As Integer = 0 To lh.Count
    If CInt(lh.Item(i)) > CInt(gr.Item(i)) Then
      str = Replace(str, "<", "<") ' ///////// problems ////
    End If
  Next

  str = Replace(str, ">", "&gt;")
End Function

Friend Function FindLocationOfChar(ByVal chr As Char, ByVal str As String) As Collection
  Dim arr As New Collection
  For i As Integer = 1 To str.Length() - 1
    If str.ToCharArray(i, 1) = chr Then
      arr.Add(i)
    End If
  Next
  Return arr
End Function

I'm having trouble at the line marked "problems".

It's standard XML with different tags that I want to analyse.

Answer by Eamon Nerbonne

Have you measured this? I have run into performance issues using .NET's regex engine, but by contrast have parsed XML files of around 40GB without issue using the XML parser (you will need to use XmlReader for larger strings, however).

Please post an actual code sample and mention your performance requirements: I doubt the Regex class is the best solution here if performance matters.
