Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original address: http://stackoverflow.com/questions/4285083/

Time: 2020-10-30 05:41:52  Source: igfitidea

Fastest way to perform a lot of strings replace in Java

java regex string

Asked by Averroes

I have to write some sort of parser that gets a String and replaces certain sets of characters with others. The code looks like this:

noHTMLString = noHTMLString.replaceAll("</p>", "\n");
noHTMLString = noHTMLString.replaceAll("<br/>", "\n\n");
noHTMLString = noHTMLString.replaceAll("<br />", "\n\n");
//here goes A LOT of lines like these ones

The function is very long and performs a lot of string replacements. The issue is that it takes a lot of time because the method is called many times, which slows down the application.

I have read some threads here about using StringBuilder as an alternative, but it lacks a replaceAll method. Also, as noted in "Does string.replaceAll() performance suffer from string immutability?", the replaceAll method in the String class works with Pattern and Matcher, and Matcher.replaceAll() uses a StringBuilder to store the eventually returned value, so I don't know whether switching to StringBuilder will really reduce the time needed to perform the substitutions.

Do you know a fast way to do a lot of String replacements? Do you have any advice for this problem?

Thanks.

EDIT: I have to create a report that has a few fields containing HTML text. For each row I call the method that replaces all the HTML tags and special characters inside these strings. For a full report it takes more than 3 minutes to parse all the text. The problem is that I have to invoke the method very often.

Answered by Mat B.

I found that org.apache.commons.lang.StringUtils is the fastest if you don't want to bother with the StringBuffer.

You can use it like this:
noHTMLString = StringUtils.replace(noHTMLString, "</p>", "\n");

I did some performance testing, and it was faster than my custom StringBuffer solution, which was similar to the one @extraneon proposed.

Answered by Martijn Verburg

It looks like you're parsing HTML there. Have you thought about using a 3rd party library instead of re-inventing the wheel?

Answered by Allanrbo

I agree with Martijn about using a ready-built solution instead of parsing it yourself - there's loads of stuff built into Java in the javax.xml package. A neat solution would be to use an XSLT transformation to do the replacing; this looks like an ideal use case for it. However, it is complicated.

To answer the question, have you considered using the regular expression libraries? It looks like you have many different things you want to match and replace with the same thing (\n or the empty string). Using regular expressions, you could use an expression like "<br>|<br/>|<br />", or even more cleverly "<br.*?>", to create a Matcher object on which you can call replaceAll.
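A minimal sketch of that idea, assuming only the JDK's java.util.regex package. The class and method names are made up for illustration, and the pattern "<br\s*/?>" with case-insensitive matching is my own slightly tighter variant of "<br.*?>", not something taken from the question:

```java
import java.util.regex.Pattern;

final class BreakTagReplacer {
    // Compiled once and reused on every call; matches <br>, <br/> and <br />.
    // Case-insensitive matching is an assumption about the input.
    private static final Pattern BR =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);

    // Replaces every break tag with two newlines, as in the question's code.
    static String replaceBreaks(String input) {
        return BR.matcher(input).replaceAll("\n\n");
    }

    public static void main(String[] args) {
        System.out.println(replaceBreaks("one<br>two<br/>three<br />four"));
    }
}
```

One regex pass here stands in for three separate replaceAll calls, each of which would otherwise compile its own pattern and scan the whole string again.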

Answered by extraneon

I fully agree with Martijn here. Pick the right tool for the job.

If, however, your file is not HTML but only contains some HTML tokens, there are a few ways you can speed things up.

First, if some amount of the input does not contain replaceable elements, consider starting with something like:

// String.contains() takes a CharSequence, not a char, so use indexOf here
if (input.indexOf('<') == -1) {
    return input;
}

Second, consider a regex:

Pattern p = Pattern.compile( your_regex );

Don't make a pattern for every single replaceAll line; instead, try to combine them (regex has an OR operator) and let Pattern optimize the regex. Do reuse the compiled pattern and don't compile it on every call, because compiling is fairly expensive.
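One way this could look, as a sketch only: the tokens are joined into a single alternation that is compiled once, and each match is looked up in a table to find its replacement. The class name, the table contents and the "&amp;" entry are hypothetical; the real list would come from the original method.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CombinedReplacer {
    // Hypothetical token table (token -> replacement).
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    private static final Pattern COMBINED;

    static {
        REPLACEMENTS.put("</p>", "\n");
        REPLACEMENTS.put("<br/>", "\n\n");
        REPLACEMENTS.put("<br />", "\n\n");
        REPLACEMENTS.put("&amp;", "&");

        // Build "tokenA|tokenB|..." with every token quoted as a literal.
        StringBuilder alternation = new StringBuilder();
        for (String token : REPLACEMENTS.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(token));
        }
        COMBINED = Pattern.compile(alternation.toString()); // compiled once
    }

    // Scans the input once and substitutes each matched token.
    static String replaceAllTokens(String input) {
        Matcher m = COMBINED.matcher(input);
        StringBuffer out = new StringBuffer(input.length());
        while (m.find()) {
            String replacement = REPLACEMENTS.get(m.group());
            // quoteReplacement keeps '$' and '\' in a replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling CombinedReplacer.replaceAllTokens(text) then replaces the whole chain of replaceAll calls. A StringBuffer is used because the StringBuilder overload of appendReplacement only exists from Java 9 onwards.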

If regexes are a bit too complex, you can also implement a faster (but potentially less readable) replacement engine yourself:

StringBuilder result = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++) {
  char c = input.charAt(i);

  if (c != '<') {
    result.append(c); // ordinary character: copy it through
    continue;
  }

  int closePos = input.indexOf('>', i);
  if (closePos == -1) { // no closing '>': copy the rest unchanged
    result.append(input, i, input.length());
    return result.toString();
  }
  String token = input.substring(i + 1, closePos); // text between '<' and '>'
  i = closePos; // the loop increment then moves past the '>'
  if (token.equals("/p")) {
    result.append("\n");
  } else if (token.equals(...)) {
  } else if (...) {
  }
}
return result.toString();

This may have some errors :)

The advantage is you have to iterate through the input only once. The big disadvantage is that it is not all that easy to understand. You could also write a state machine, analyzing per character what the new state should be, and that would probably be faster and even more work.
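For reference, a self-contained sketch of the single-pass approach might look like the following. The class name and the tag table are hypothetical, and keeping unknown tags unchanged is an assumption on my part rather than something the question specifies:

```java
import java.util.HashMap;
import java.util.Map;

final class SinglePassReplacer {
    // Hypothetical tag table, keyed by the text between '<' and '>'.
    private static final Map<String, String> TAGS = new HashMap<>();
    static {
        TAGS.put("/p", "\n");
        TAGS.put("br/", "\n\n");
        TAGS.put("br /", "\n\n");
    }

    static String replaceTags(String input) {
        if (input.indexOf('<') == -1) {
            return input; // fast path: nothing that could be a tag
        }
        StringBuilder result = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int closePos = (c == '<') ? input.indexOf('>', i) : -1;
            if (closePos == -1) {
                result.append(c); // ordinary character or unterminated '<'
                i++;
                continue;
            }
            String replacement = TAGS.get(input.substring(i + 1, closePos));
            if (replacement == null) {
                // Unknown tag: keep it as-is (assumption).
                result.append(input, i, closePos + 1);
            } else {
                result.append(replacement);
            }
            i = closePos + 1; // continue after the '>'
        }
        return result.toString();
    }
}
```

Note that this simple scanner treats everything from a '<' to the next '>' as one tag, so a stray '<' in ordinary text can swallow a real tag that follows it. A proper state machine or an HTML library would handle that case better.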