Regular expressions run very slowly on large documents

Date: 2020-03-06 15:02:07  Source: igfitidea

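I need to convert inline CSS style attributes into their HTML tag equivalents. The solution I have works, but it runs very slowly with the Microsoft .NET Regex namespace on long documents (about 40 pages of HTML). I have tried a few variations without any useful results. There is a little wrapping around the execution of the expressions, but in the end it just calls the built-in Regex Replace method.

I'm sure I'm abusing the greediness of the regex, but I'm not sure how to get what I want with a single regex.

I want to be able to run the following unit tests: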

[Test]
public void TestCleanReplacesFontWeightWithB()
{
    string html = "<font style=\"font-weight:bold\">Bold Text</font>";
    html = Q4.PrWorkflow.Helper.CleanFormatting(html);
    Assert.AreEqual("<b>Bold Text</b>", html);
}
[Test]
public void TestCleanReplacesMultipleAttributesFontWeightWithB()
{
    string html = "<font style=\"font-weight:bold; color: blue; \">Bold Text</font>";
    html = Q4.PrWorkflow.Helper.CleanFormatting(html);
    Assert.AreEqual("<b>Bold Text</b>", html);
}
[Test]
public void TestCleanReplaceAttributesBoldAndUnderlineWithHtml()
{
    string html = "<span style=\"font-weight:bold; color: blue; text-decoration: underline; \">Bold Text</span>";
    html = Q4.PrWorkflow.Helper.CleanFormatting(html);
    Assert.AreEqual("<u><b>Bold Text</b></u>", html);
}
[Test]
public void TestCleanReplaceAttributesBoldUnderlineAndItalicWithHtml()
{
    string html = "<span style=\"font-weight:bold; color: blue; font-style: italic; text-decoration: underline; \">Bold Text</span>";
    html = Q4.PrWorkflow.Helper.CleanFormatting(html);
    Assert.AreEqual("<u><b><i>Bold Text</i></b></u>", html);
}
[Test]
public void TestCleanReplacesFontWeightWithSpaceWithB()
{
    string html = "<font size=\"10\" style=\"font-weight: bold\">Bold Text</font>";
    html = Q4.PrWorkflow.Helper.CleanFormatting(html);
    Assert.AreEqual("<b>Bold Text</b>", html);
}

The regular expressions I use to implement this logic work, but they are very slow. The regexes in the C# code look like this:

// In each pattern: group 1 is the opening tag, group 2 the tag name, group 3 the inner content.
public static IReplacePattern IncludeInlineItalicToITag(ICleanUpHtmlFactory factory)
{
    return factory.CreateReplacePattern("(<(span|font) .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</\\2>", "$1<i>$3</i></$2>");
}
public static IReplacePattern IncludeInlineBoldToBTag(ICleanUpHtmlFactory factory)
{
    return factory.CreateReplacePattern("(<(span|font) .*?style=\".*?font-weight:\\s*bold[^>]*>)(.*?)</\\2>", "$1<b>$3</b></$2>");
}
public static IReplacePattern IncludeInlineUnderlineToUTag(ICleanUpHtmlFactory factory)
{
    return factory.CreateReplacePattern("(<(span|font) .*?style=\".*?text-decoration:\\s*underline[^>]*>)(.*?)</\\2>", "$1<u>$3</u></$2>");
}

Solutions

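I believe the problem is that when the regex finds a span or font tag with no style attribute defined, the ".*?" lets it keep searching all the way to the end of the document. I haven't tested it, but changing it to "[^>]*?" might improve performance.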

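Edit: make sure you apply the change to every ".*?" you have, even the one that captures the content between the tags (use "[^<]*?" there), because if a file is not well-formed it will capture up to the next closing tag.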
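
As a rough illustration of that suggestion, here is a minimal sketch of the bold pattern with every ".*?" swapped for a negated character class. The class and method names (BoundedPatternSketch, WrapBold) are made up for this example and are not part of the question's ICleanUpHtmlFactory code, and the replacement string assumes the opening tag is kept and the content is wrapped in <b>.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code base.
public static class BoundedPatternSketch
{
    // Each ".*?" from the question is replaced by a negated class:
    //   [^>]*?   cannot leave the opening tag,
    //   [^">]*?  cannot leave the style attribute value,
    //   [^<]*?   cannot run past the next tag.
    private static readonly Regex BoldPattern = new Regex(
        "(<(span|font)\\s[^>]*?style=\"[^\">]*?font-weight:\\s*bold[^>]*>)([^<]*?)</\\2>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Wraps the content of matching tags in <b>...</b> and keeps the original tag.
    public static string WrapBold(string html)
    {
        return BoldPattern.Replace(html, "$1<b>$3</b></$2>");
    }
}
```

With this pattern a tag that has no style attribute fails at its own closing ">" instead of scanning the rest of the document. The trade-off is that "[^<]*?" will not match content that itself contains markup, so nested tags are simply left alone.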

.NET regular expressions do not support recursive constructs. PCRE does, but that doesn't really matter here.

Consider:

<font style="font-weight: bold;"> text1 <font color="blue"> text2 </font> text3 </font>

It would be converted to:

<b> text1 <font color="blue"> text2 </b> text3 </font>

My suggestion would be to use a proper markup parser, and perhaps use a regexp only on the values of the style attributes.
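
To show what "regexp only on the style value" could look like, here is a small sketch. It assumes some markup parser (not shown, and not chosen here) has already handed over the style attribute value and the element's inner HTML; StyleValueSketch and Wrap are hypothetical names for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the parser that extracts styleValue and innerHtml is assumed, not shown.
public static class StyleValueSketch
{
    private static bool Has(string styleValue, string declarationPattern)
    {
        // The input is only the short attribute value, so backtracking stays cheap.
        return Regex.IsMatch(styleValue, declarationPattern, RegexOptions.IgnoreCase);
    }

    // Wraps innerHtml in <i>, <b> and <u> (innermost to outermost) according to the style value.
    public static string Wrap(string styleValue, string innerHtml)
    {
        string result = innerHtml;
        if (Has(styleValue, @"font-style:\s*italic")) result = "<i>" + result + "</i>";
        if (Has(styleValue, @"font-weight:\s*bold")) result = "<b>" + result + "</b>";
        if (Has(styleValue, @"text-decoration:\s*underline")) result = "<u>" + result + "</u>";
        return result;
    }
}
```

The nesting order (u outside b outside i) is picked to agree with the expected strings in the question's unit tests; finding the matching closing tag is left entirely to the parser.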

Edit: scratch that. .NET does seem to have a construct for balanced, recursive patterns, although it is not as powerful as the ones in PCRE/Perl.

(?<N>content) would push N onto a stack if content matches
(?<-N>content) would pop N from the stack, if content matches.
(?(N)yes|no) would match "yes" if N is on the stack, otherwise "no".
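
Put together, the three constructs above can match a font element up to its own closing tag rather than the first </font> it meets. The following is only a sketch under a few assumptions: it handles just font tags, it expects lowercase tag names, and the names BalancedFontSketch and WrapOuterBold are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating .NET balancing groups on nested <font> tags.
public static class BalancedFontSketch
{
    private const string Pattern =
        "<font\\b[^>]*font-weight:\\s*bold[^>]*>" + // outer opening tag with bold style
        "(?<body>(?:" +
            "(?<open><font\\b[^>]*>)" +             // nested opening tag: push onto 'open'
            "|(?<-open></font>)" +                  // nested closing tag: pop from 'open'
            "|(?!</?font\\b)." +                    // any character that does not start a font tag
        ")*)" +
        "(?(open)(?!))" +                           // fail if a nested tag is still open
        "</font>";                                  // the closing tag of the outer element

    private static readonly Regex Balanced = new Regex(Pattern, RegexOptions.Singleline);

    // Replaces the outer bold font element with <b>...</b>, keeping nested font tags intact.
    public static string WrapOuterBold(string html)
    {
        return Balanced.Replace(html, m => "<b>" + m.Groups["body"].Value + "</b>");
    }
}
```

Applied to the nested example from this answer, the outer element becomes <b> ... </b> and the inner <font color="blue"> element stays paired with its own closing tag.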

For details, see http://weblogs.asp.net/whaggard/archive/2005/02/20/377025.aspx.

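Wild guess: I believe the cost comes from the alternation (span|font) and the corresponding backreference match.
You might want to try replacing:

"(<(span|font) .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</\\2>", "$1<i>$3</i></$2>"

with two separate expressions:

"(<span .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</span>", "$1<i>$2</i></span>"
"(<font .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</font>", "$1<i>$2</i></font>"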

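Of course, this doubles the number of passes over the file, but because each regex is simpler and has fewer backreferences, it may turn out faster in practice. It isn't very nice (repeated code), but as long as it works...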
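
If the repeated code is the main objection, the per-tag patterns could be generated in a loop. This is just a sketch: PerTagPatternSketch and WrapItalic are hypothetical names, the patterns are otherwise the same as in this answer, and whether it is actually faster would have to be measured on the real documents.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: one simple pattern per tag name instead of (span|font) plus a backreference.
public static class PerTagPatternSketch
{
    private static readonly string[] TagNames = { "span", "font" };

    // Runs one pass per tag name and wraps italic-styled content in <i>...</i>.
    public static string WrapItalic(string html)
    {
        foreach (string tag in TagNames)
        {
            // Group 1: opening tag, group 2: inner content; the closing tag is a literal.
            string pattern = "(<" + tag + " .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</" + tag + ">";
            html = Regex.Replace(html, pattern, "$1<i>$2</i></" + tag + ">");
        }
        return html;
    }
}
```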

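Funnily enough, I did something similar (I don't have the code at hand) to clean up HTML generated by a tool, simplifying it so that JavaHelp could understand it. That is a case where running regexes against HTML is OK, because the HTML is not written by a human who makes mistakes or changes small things, but produced by a process with well-defined patterns.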

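During testing I found some strange behavior: when the regexp runs in a separate thread, it runs much faster.
I have a SQL script that I split into sections, from GO to GO, using a regexp.
Processing this script without a separate thread takes about 2 minutes, but with multithreading it takes only a few seconds.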
