Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/736654/

Time: 2020-08-11 18:52:34  Source: igfitidea

Java's Scanner vs String.split() vs StringTokenizer; which should I use?

java regex split java.util.scanner

Asked by

I am currently using split() to scan through a file where each line has a number of strings delimited by '~'. I read somewhere that Scanner could do a better job with a long file, performance-wise, so I thought about checking it out.

My question is: Would I have to create two instances of Scanner? That is, one to read a line and another one based on the line to get tokens for a delimiter? If I have to do so, I doubt if I would get any advantage from using it. Maybe I am missing something here?

Answer by CookieOfFortune

I would say split() is fastest, and probably good enough for what you're doing. It is less flexible than Scanner, though. StringTokenizer is not formally deprecated, but it is a legacy class kept for compatibility and its Javadoc discourages use in new code, so don't use it.

EDIT: You could always test both implementations to see which one is faster. I'm curious myself whether Scanner could be faster than split(). split() might be faster than Scanner for a given input size, but I can't be certain of that.
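One rough way to run such a comparison is sketched below. The class name `SplitTiming`, the sample line, and the iteration counts are all made up for illustration, and timing with `System.nanoTime()` is only a crude indication; a proper harness such as JMH would give more trustworthy numbers.

```java
import java.util.Arrays;
import java.util.Scanner;
import java.util.regex.Pattern;

public class SplitTiming {
    // Counts tokens using String.split with a string delimiter.
    static long countWithSplit(String[] lines) {
        long n = 0;
        for (String line : lines) {
            n += line.split("~").length;
        }
        return n;
    }

    // Counts tokens using a Pattern compiled once by the caller.
    static long countWithPattern(String[] lines, Pattern delimiter) {
        long n = 0;
        for (String line : lines) {
            n += delimiter.split(line).length;
        }
        return n;
    }

    // Counts tokens using one Scanner per line with '~' as the delimiter.
    static long countWithScanner(String[] lines) {
        long n = 0;
        for (String line : lines) {
            try (Scanner s = new Scanner(line).useDelimiter("~")) {
                while (s.hasNext()) {
                    s.next();
                    n++;
                }
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical input: 100,000 identical lines of four tokens each.
        String[] lines = new String[100_000];
        Arrays.fill(lines, "alpha~beta~gamma~delta");
        Pattern tilde = Pattern.compile("~");

        // Warm-up passes so the JIT has a chance to compile the hot loops.
        for (int i = 0; i < 5; i++) {
            countWithSplit(lines);
            countWithPattern(lines, tilde);
            countWithScanner(lines);
        }

        long t0 = System.nanoTime();
        long a = countWithSplit(lines);
        long t1 = System.nanoTime();
        long b = countWithPattern(lines, tilde);
        long t2 = System.nanoTime();
        long c = countWithScanner(lines);
        long t3 = System.nanoTime();

        System.out.printf("split:   %d tokens, %.1f ms%n", a, (t1 - t0) / 1e6);
        System.out.printf("Pattern: %d tokens, %.1f ms%n", b, (t2 - t1) / 1e6);
        System.out.printf("Scanner: %d tokens, %.1f ms%n", c, (t3 - t2) / 1e6);
    }
}
```

The results will vary with the JVM version, the line length, and the number of tokens per line, so it is worth running this against data shaped like the real file.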

Answer by Jerrish Varghese

To process the lines you can use Scanner, and to get the tokens from each line you can use split().

Scanner scanner = new Scanner(new File(loc));
try {
    while ( scanner.hasNextLine() ){
        String[] tokens = scanner.nextLine().split("~");
        // do the processing for tokens here
    }
}
finally {
    scanner.close();
}

Answer by Alan Moore

You can use the useDelimiter("~") method to let you iterate through the tokens on each line with hasNext()/next(), while still using hasNextLine()/nextLine() to iterate through the lines themselves.
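A minimal sketch of tokenizing with useDelimiter("~") follows. It is not necessarily the exact arrangement described above: it uses a second, short-lived Scanner for each line (the setup the question asks about), because with "~" as the only delimiter next() does not stop at line breaks. The class name `TildeScanner` and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class TildeScanner {
    // Collects the '~'-separated tokens of a single line.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        try (Scanner tokenScanner = new Scanner(line).useDelimiter("~")) {
            while (tokenScanner.hasNext()) {
                out.add(tokenScanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for file contents; a real program would pass a File to the outer Scanner.
        String text = "a~b~c\nd~e";
        try (Scanner lineScanner = new Scanner(text)) {
            while (lineScanner.hasNextLine()) {
                System.out.println(tokens(lineScanner.nextLine()));
            }
        }
    }
}
```

The outer Scanner keeps its default delimiter and is used only for hasNextLine()/nextLine(), so line boundaries are preserved.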

EDIT: If you're going to do a performance comparison, you should pre-compile the regex when you do the split() test:

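import java.util.regex.Pattern;

// bufferedReader is assumed to be an already-open java.io.BufferedReader
Pattern splitRegex = Pattern.compile("~");  // compile once, outside the loop
String line;
while ((line = bufferedReader.readLine()) != null)
{
  String[] tokens = splitRegex.split(line);
  // etc.
}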

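If you use String#split(String regex), the regex will generally be recompiled on every call (although in OpenJDK 7 and later, a single-character delimiter that is not a regex metacharacter, such as "~", takes a fast path that skips the regex engine). Scanner automatically caches all regexes the first time it compiles them. If you pre-compile the pattern, I wouldn't expect to see much difference in performance.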

Answer by BeeOnRope

You don't actually need a regex here, because you are splitting on a fixed string. StringUtils.split from Apache Commons Lang splits on plain strings.

For high volume splits, where the splitting is the bottleneck, rather than say file IO, I've found this to be up to 10 times faster than String.split(). However, I did not test it against a compiled regex.

Guava also has a splitter, implemented in a more OO way, but I found it was significantly slower than StringUtils for high volume splits.
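As a library-free illustration of splitting on a fixed delimiter without any regex, here is a small sketch built on indexOf() and substring(). It is not the StringUtils or Guava implementation; the class and method names (`PlainSplit`, `splitOn`) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class PlainSplit {
    // Splits a line on a single delimiter character, keeping empty tokens.
    static List<String> splitOn(String line, char delimiter) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        int end;
        // Walk from delimiter to delimiter, copying the text in between.
        while ((end = line.indexOf(delimiter, start)) >= 0) {
            tokens.add(line.substring(start, end));
            start = end + 1;
        }
        // Whatever follows the last delimiter is the final token.
        tokens.add(line.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitOn("a~b~~c", '~')); // prints [a, b, , c]
    }
}
```

Unlike String.split(String), this version keeps trailing empty tokens, so "a~" yields two tokens; whether that is the desired behavior depends on the file format.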

Answer by Sreesankar

I ran some measurements on these in a single-threaded setup, and here are the results I got.

~~~~~~~~~~~~~~~~~~Time Metrics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Tokenizer  |   String.Split()   |    while+SubString  |    Scanner    |    ScannerWithCompiledPattern    ~
~   4.0 ms   |      5.1 ms        |        1.2 ms       |     0.5 ms    |                0.1 ms            ~
~   4.4 ms   |      4.8 ms        |        1.1 ms       |     0.1 ms    |                0.1 ms            ~
~   3.5 ms   |      4.7 ms        |        1.2 ms       |     0.1 ms    |                0.1 ms            ~
~   3.5 ms   |      4.7 ms        |        1.1 ms       |     0.1 ms    |                0.1 ms            ~
~   3.5 ms   |      4.7 ms        |        1.1 ms       |     0.1 ms    |                0.1 ms            ~
____________________________________________________________________________________________________________

The outcome is that Scanner gives the best performance. The same now needs to be evaluated in a multithreaded mode! One of my seniors says that StringTokenizer causes a CPU spike and String.split() does not.
