Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/19356021/

Time: 2020-08-12 16:25:11  Source: igfitidea

StringTokenizer - reading lines with integers

Tags: java, stringtokenizer

Asked by Smajl

I have a question about optimizing my code (which works but is too slow...). I am reading input in the form

X1 Y1
X2 Y2
etc

where Xi, Yi are integers. I am using BufferedReader for reading lines and then StringTokenizer for processing those numbers like this:

StringTokenizer st = new StringTokenizer(line, " ");

int x = Integer.parseInt(st.nextToken());
int y = Integer.parseInt(st.nextToken());
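
For context, the complete reading loop presumably looks something like the sketch below. The class name `PairReader`, the `sumPairs` helper, and the sample input are assumptions made for illustration; the real program may do something else with each pair.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.StringTokenizer;

class PairReader {

    /** Reads "X Y" lines from the given reader and returns the sum of all X and Y values. */
    static long sumPairs(Reader source) {
        long sum = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip empty lines rather than fail on them
                }
                StringTokenizer st = new StringTokenizer(line, " ");
                int x = Integer.parseInt(st.nextToken());
                int y = Integer.parseInt(st.nextToken());
                sum += (long) x + y;
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers of this sketch need no throws clause.
            throw new UncheckedIOException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the real data set.
        String input = "1 2\n3 4\n10 20\n";
        System.out.println(sumPairs(new StringReader(input))); // prints 40
    }
}
```

Summing the values is only a stand-in for whatever work is done per pair; it also keeps the parsed numbers in use so the loop has an observable result.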

The problem is that this approach seems slow when dealing with large data sets. Could you suggest a simple improvement that would speed it up (I have heard that a hand-written integer parser or a regex can be used)? Thanks for any tips.

EDIT: Perhaps I misjudged the problem and the improvements have to be made elsewhere in the code...

Accepted answer by tom

(updated answer)

Whatever is slowing your program down, the choice of tokenizer is not it. After an initial run of each method to even out initialisation quirks, I can parse 1000000 rows of "12 34" in milliseconds. You could switch to indexOf if you like, but I really think you need to look at other parts of the code for the bottleneck rather than at this micro-optimisation. Split was a surprise for me: it's really, really slow compared to the other methods. I've also added a Guava Splitter test; it's faster than String.split but slightly slower than StringTokenizer.

  • Split: 371ms
  • IndexOf: 48ms
  • StringTokenizer: 92ms
  • Guava Splitter.split(): 108ms
  • CsvMapper (build a CSV doc and parse it into POJOs): 237ms (or 175ms if you build the lines into one doc!)

The difference here is pretty negligible even over millions of rows.

There's now a write-up of this on my blog: http://demeranville.com/battle-of-the-tokenizers-delimited-text-parser-performance/

Code I ran was:

import java.util.StringTokenizer;
import org.junit.Test;

public class TestSplitter {

    private static final String line = "12 34";
    private static final int RUNS = 1000000;

    // String.split: builds a String[] for every line.
    public final void testSplit() {
        long start = System.currentTimeMillis();
        for (int i = 0; i < RUNS; i++) {
            String[] st = line.split(" ");
            int x = Integer.parseInt(st[0]);
            int y = Integer.parseInt(st[1]);
        }
        System.out.println("Split: " + (System.currentTimeMillis() - start) + "ms");
    }

    // indexOf + substring: finds the single space and cuts the line in two.
    public final void testIndexOf() {
        long start = System.currentTimeMillis();
        for (int i = 0; i < RUNS; i++) {
            int index = line.indexOf(' ');
            int x = Integer.parseInt(line.substring(0, index));
            int y = Integer.parseInt(line.substring(index + 1));
        }
        System.out.println("IndexOf: " + (System.currentTimeMillis() - start) + "ms");
    }

    // StringTokenizer: the approach from the question.
    public final void testTokenizer() {
        long start = System.currentTimeMillis();
        for (int i = 0; i < RUNS; i++) {
            StringTokenizer st = new StringTokenizer(line, " ");
            int x = Integer.parseInt(st.nextToken());
            int y = Integer.parseInt(st.nextToken());
        }
        System.out.println("StringTokenizer: " + (System.currentTimeMillis() - start) + "ms");
    }

    // Each method runs twice: the first pass evens out initialisation quirks,
    // the second pass gives the timings.
    @Test
    public final void testAll() {
        this.testSplit();
        this.testIndexOf();
        this.testTokenizer();
        this.testSplit();
        this.testIndexOf();
        this.testTokenizer();
    }

}
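
The warm-up idea can also be separated from the parsing code. The sketch below is not part of the measured test class: `WarmupTimer`, `parseWithIndexOf`, and `runLoop` are names made up for illustration. It assumes one untimed pass is enough warm-up, and it accumulates a checksum so that the parsed values are actually used.

```java
import java.util.function.IntSupplier;

class WarmupTimer {

    static final int RUNS = 1_000_000;

    /** Parses "X Y" with indexOf/substring and returns x + y. */
    static int parseWithIndexOf(String line) {
        int index = line.indexOf(' ');
        int x = Integer.parseInt(line.substring(0, index));
        int y = Integer.parseInt(line.substring(index + 1));
        return x + y;
    }

    /** Runs the task RUNS times and returns the accumulated results. */
    static long runLoop(IntSupplier task) {
        long total = 0;
        for (int i = 0; i < RUNS; i++) {
            total += task.getAsInt();
        }
        return total;
    }

    public static void main(String[] args) {
        String line = "12 34";
        IntSupplier task = () -> parseWithIndexOf(line);

        runLoop(task); // warm-up pass: result and timing are ignored

        long start = System.nanoTime();
        long total = runLoop(task); // timed pass
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("IndexOf: " + elapsedMs + "ms (checksum " + total + ")");
    }
}
```

Swapping in another parser only means passing a different `IntSupplier` to `runLoop`. The absolute numbers will differ from machine to machine, so only the relative ordering is worth comparing.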

ETA: here's the Guava code:

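// Needs these imports at the top of the file:
// import java.util.Iterator;
// import com.google.common.base.Splitter;
public final void testGuavaSplit() {
    long start = System.currentTimeMillis();
    // The Splitter is created once and reused for every line.
    Splitter split = Splitter.on(" ");
    for (int i = 0; i < RUNS; i++) {
        Iterator<String> it = split.split(line).iterator();
        int x = Integer.parseInt(it.next());
        int y = Integer.parseInt(it.next());
    }
    System.out.println("GuavaSplit: " + (System.currentTimeMillis() - start) + "ms");
}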

update

I've added in a CsvMapper test too:

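// Needs these imports at the top of the file (Jackson core, databind and dataformat-csv):
// import java.io.IOException;
// import com.fasterxml.jackson.core.JsonProcessingException;
// import com.fasterxml.jackson.databind.MappingIterator;
// import com.fasterxml.jackson.dataformat.csv.CsvMapper;
// import com.fasterxml.jackson.dataformat.csv.CsvSchema;
// import com.fasterxml.jackson.dataformat.csv.CsvSchema.ColumnType;

// POJO that each "x y" row is mapped to.
public static class CSV {
    public int x;
    public int y;
}

public final void testJacksonSplit() throws JsonProcessingException, IOException {
    CsvMapper mapper = new CsvMapper();
    CsvSchema schema = CsvSchema.builder().addColumn("x", ColumnType.NUMBER).addColumn("y", ColumnType.NUMBER).setColumnSeparator(' ').build();

    long start = System.currentTimeMillis();
    // Build all lines into one document, then parse it in a single pass.
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < RUNS; i++) {
        builder.append(line);
        builder.append('\n');
    }
    String input = builder.toString();
    // Newer Jackson versions deprecate reader(Class) in favour of readerFor(Class).
    MappingIterator<CSV> it = mapper.reader(CSV.class).with(schema).readValues(input);
    while (it.hasNext()) {
        CSV csv = it.next();
    }
    System.out.println("CsvMapperSplit: " + (System.currentTimeMillis() - start) + "ms");
}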

Answer by Jhanvi

You could use regex to check if the value is numerical and then convert to integer:

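// Read the token once: calling nextToken() twice would check one token and parse the next.
String token = st.nextToken();
if (token.matches("^[0-9]+$")) {
    int x = Integer.parseInt(token);
}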
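
If the check runs once per token over a large input, note that String.matches compiles the regex on every call. A possible variant, sketched below, compiles the pattern once and reuses it. The class name `NumericTokens` and the `parseInts` helper are assumptions for illustration. The pattern accepts unsigned digits only, so negative numbers are skipped, and a digit run too long for an int would still make Integer.parseInt throw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

class NumericTokens {

    // Compiled once; String.matches() would recompile the pattern on every call.
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    /** Returns the integers found in the line, skipping any token that is not all digits. */
    static List<Integer> parseInts(String line) {
        List<Integer> values = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, " ");
        while (st.hasMoreTokens()) {
            String token = st.nextToken(); // read once, then validate and parse the same token
            if (DIGITS.matcher(token).matches()) {
                values.add(Integer.parseInt(token));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parseInts("12 abc 34")); // prints [12, 34]
    }
}
```

Whether the validation is worth its cost depends on how clean the input is; for data known to be well formed, calling Integer.parseInt directly avoids the extra pass over each token.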