Java中最快的子串搜索方法是什么
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/18340097/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
what is the fastest substring search method in Java
提问by Joey
I need to implement a way to search substring (needles) in a list of string (haystack) using Java.
我需要实现一种使用 Java 在字符串(haystack)列表中搜索子字符串(针)的方法。
More specifically, my app has a list of user profiles. If I type some letters, for example, "Ja", and then search, then all the users whose name contains "ja" should show up. For instance, the result could be "Hyman", "Hymanson", "Jason", "Dijafu".
更具体地说,我的应用程序有一个用户配置文件列表。如果我输入一些字母,例如“Ja”,然后搜索,那么所有名称包含“ja”的用户都会出现。例如,结果可能是“Hyman”、“Hymanson”、“Jason”、“Dijafu”。
In Java, as I know, there are 3 build-in method to see search substring in a string.
在 Java 中,据我所知,有 3 种内置方法可以查看字符串中的搜索子字符串。
string.contains()
string.indexOf()
regular expression. it is something like string.matches("ja"))
string.contains()
string.indexOf()
正则表达式。它类似于 string.matches("ja"))
My question is:What are the runtimes of each method above? which one is the fastest or most efficient or most popular way to check if the list of string contains a given substring.
我的问题是:上述每种方法的运行时间是多少?哪一种是检查字符串列表是否包含给定子字符串的最快、最有效或最流行的方法。
I know there exists some algorithms that do the same thing, such as Boyer–Moore string search algorithm, Knuth–Morris–Pratt algorithm and so on. I do not want to use them because I just have a small list of strings, and I think using them is kind of overkill for me right now. Also I have to type a lot of extra coding for such a non-build-in algorithm. If you think my thoughts is not correct, please feel free to correct me.
我知道有一些算法可以做同样的事情,比如 Boyer-Moore 字符串搜索算法、Knuth-Morris-Pratt 算法等等。我不想使用它们,因为我只有一小部分字符串,我认为现在使用它们对我来说有点矫枉过正。此外,我必须为这种非内置算法键入大量额外的代码。如果您认为我的想法不正确,请随时纠正我。
采纳答案by Brinnis
String[] names = new String[]{"Hyman", "Hymanson", "jason", "dijafu"};
long start = 0;
long stop = 0;
//Contains
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
names[i].contains("ja");
}
stop = System.nanoTime();
System.out.println("Contains: " + (stop-start));
//IndexOf
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
names[i].indexOf("ja");
}
stop = System.nanoTime();
System.out.println("IndexOf: " + (stop-start));
//Matches
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
names[i].matches("ja");
}
stop = System.nanoTime();
System.out.println("Matches: " + (stop-start));
Output:
输出:
Contains: 16677
IndexOf: 4491
Matches: 864018
回答by chrylis -cautiouslyoptimistic-
As far as the three you asked about, a regular expression is going to be much slower because it requires putting together a full state machine when you have a much simpler target. For contains
vs indexOf
...
至于你问的三个,正则表达式会慢得多,因为当你有一个更简单的目标时,它需要组合一个完整的状态机。对于contains
比indexOf
...
2114 public boolean contains(CharSequence s) {
2115 return indexOf(s.toString()) > -1;
2116 }
(i.e., contains
just calls indexOf
, but you might incur an extra String
creation on each invocation. This is just one implementation of contains
, but since the contract of contains
is a simplification of indexOf
, this is probably how every implementation will work.)
(即,contains
只调用indexOf
,但您可能会String
在每次调用时产生额外的创建。这只是 的一个实现contains
,但由于 的契约contains
是 的简化indexOf
,这可能是每个实现的工作方式。)
回答by Veronica Cornejo
This is dependent on specific JRE (and even JDK) make/version. It also depends / could depend on factors as string length, probability of being contained, in what position, etc. The only way of obtaining precise performance data requires setting up your exact context.
这取决于特定的 JRE(甚至 JDK)品牌/版本。它还取决于/可能取决于字符串长度、被包含的概率、在什么位置等因素。获得精确性能数据的唯一方法需要设置您的确切上下文。
However, in general aString.contains()
and aString.indexOf()
should be exactly the same. And even if a regular expression was superbly optimized, it wouldn't exceed the performance of the first two.
然而,在一般情况aString.contains()
和aString.indexOf()
应该是完全一样的。即使一个正则表达式得到了极好的优化,它也不会超过前两个的性能。
No, Java doesn't use extremely specialized algorithms either.
不,Java 也不使用非常专业的算法。
回答by FrankPl
From the example in your question, I assume you want to do case insensitive comparisons. Those slow down the process considerably. Hence, if you can live with some inaccuracies - which might depend on the locale in which you need to do the comparison, and your long text is searched again and again, it might make sense to convert the long text one time to uppercase, and the search string as well, and then search case-insensitive.
从您问题中的示例中,我假设您想要进行不区分大小写的比较。那些大大减慢了这个过程。因此,如果您可以忍受一些不准确之处——这可能取决于您需要进行比较的区域设置,并且一次又一次地搜索您的长文本,将长文本一次转换为大写可能是有意义的,并且搜索字符串,然后搜索不区分大小写。
回答by Skylion
If you are searching a large amount of Strings I've read the Aho-Corasickalgorithm is pretty fast, but it's a natively implemented in Java. It's the same algorithm used by GREP in Unix-based systems if that helps and it's pretty efficient. Hereis a Java implementation courtesy of Berkley.
如果您正在搜索大量字符串,我读过Aho-Corasick算法非常快,但它是用 Java 本地实现的。如果有帮助的话,它与 GREP 在基于 Unix 的系统中使用的算法相同,并且非常有效。这是 Berkley 提供的 Java 实现。
回答by CoronA
The accepted answer is not correct and not complete.
接受的答案不正确且不完整。
indexOf()
does a naive string search using backtracking on mismatches. This is quite fast on small patterns/textsbut shows very poor performance on large textscontains("ja")
should be comparable to indexOf (because it delegates to it)matches("ja")
will not deliver the correct result, because it searches for an exact match (only the string"ja"
will match exactly)Pattern p = Pattern.compile("ja"); Matcher m = p.matcher("Hyman"); m.find();
would be the correct way to find texts with regular expressions. In practice (using large texts) it will be the most efficientway using only the java api. This is because a constant pattern (like"ja"
) will not be processed by the regex engine (which is slow) but by an Boyer-Moore-Algorithm (which is fast)
indexOf()
使用不匹配的回溯进行简单的字符串搜索。这在小图案/文本上非常快,但在大文本上表现非常差contains("ja")
应该与 indexOf 相当(因为它委托给它)matches("ja")
不会提供正确的结果,因为它搜索完全匹配(只有字符串"ja"
会完全匹配)Pattern p = Pattern.compile("ja"); Matcher m = p.matcher("Hyman"); m.find();
将是使用正则表达式查找文本的正确方法。在实践中(使用大文本)这将是仅使用 java api的最有效方法。这是因为常量模式(如"ja"
)不会由正则表达式引擎(速度慢)处理,而是由 Boyer-Moore-Algorithm(速度快)处理
回答by android developer
A benchmark in Kotlin (which uses Java anyway, so results are about the same), on Android, using similar approach as stated above, shows that indeed contains
is similar to indexOf
, but for some reason faster, even though it uses it.
Kotlin 中的基准测试(无论如何都使用 Java,所以结果大致相同),在 Android 上,使用与上述类似的方法,表明确实contains
类似于indexOf
,但由于某种原因更快,即使它使用它。
As for regular expressions, because it creates real objects, and allows to go further, it's slower.
至于正则表达式,因为它创建了真实的对象,并且允许走得更远,所以它更慢。
Sample results (time in ms) :
示例结果(以毫秒为单位的时间):
Contains: 0
IndexOf: 5
Matches: 45
Code:
代码:
class MainActivity : AppCompatActivity() {
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
setContentView(R.layout.activity_main)
AsyncTask.execute {
val itemsCount = 1000
val minStringLength = 1000
val maxStringLength = 1000
val list = ArrayList<String>(itemsCount)
val r = Random()
val stringToSearchFor = prepareFakeString(r, 5, 10, ALPHABET_LOWERCASE + ALPHABET_UPPERCASE + DIGITS)
for (i in 0 until itemsCount)
list.add(prepareFakeString(r, minStringLength, maxStringLength, ALPHABET_LOWERCASE + ALPHABET_UPPERCASE + DIGITS))
val resultsContains = ArrayList<Boolean>(itemsCount)
val resultsIndexOf = ArrayList<Boolean>(itemsCount)
val resultsRegEx = ArrayList<Boolean>(itemsCount)
//Contains
var start: Long = System.currentTimeMillis()
var stop: Long = System.currentTimeMillis()
for (str in list) {
resultsContains.add(str.contains(stringToSearchFor))
}
Log.d("AppLog", "Contains: " + (stop - start))
//IndexOf
start = System.currentTimeMillis()
for (str in list) {
resultsIndexOf.add(str.indexOf(stringToSearchFor) >= 0)
}
stop = System.currentTimeMillis()
Log.d("AppLog", "IndexOf: " + (stop - start))
//Matches
val regex = stringToSearchFor.toRegex()
start = System.currentTimeMillis()
for (str in list) {
resultsRegEx.add(regex.find(str) != null)
}
stop = System.currentTimeMillis()
Log.d("AppLog", "Matches: " + (stop - start))
Log.d("AppLog", "checking results...")
var foundIssue = false
for (i in 0 until itemsCount) {
if (resultsContains[i] != resultsIndexOf[i] || resultsContains[i] != resultsRegEx[i]) {
foundIssue = true
break
}
}
Log.d("AppLog", "done. All results are the same?${!foundIssue}")
}
}
companion object {
const val ALPHABET_LOWERCASE = "qwertyuiopasdfghjklzxcvbnm"
const val ALPHABET_UPPERCASE = "QWERTYUIOPASDFGHJKLZXCVBNM"
const val DIGITS = "1234567890"
fun prepareFakeString(r: Random, minLength: Int, maxLength: Int, charactersToChooseFrom: String): String {
val length = if (maxLength == minLength) maxLength else r.nextInt(maxLength - minLength) + minLength
val sb = StringBuilder(length)
for (i in 0 until length)
sb.append(charactersToChooseFrom[r.nextInt(charactersToChooseFrom.length)])
return sb.toString()
}
}
}