Disclaimer: this page is a translation mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, credit the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36341059/
Reading a CSV file with millions of rows via Java as fast as possible
Asked by Joe Leffrey
I want to read a CSV file containing millions of rows and use the attributes for my decision tree algorithm. My code is below:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

String csvFile = "myfile.csv";
List<String[]> rowList = new ArrayList<>();
String line = "";
String cvsSplitBy = ",";
String encoding = "UTF-8";
BufferedReader br2 = null;
try {
    int counterRow = 0;
    br2 = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), encoding));
    while ((line = br2.readLine()) != null) {
        line = line.replaceAll(",,", ",NA,");
        String[] object = line.split(cvsSplitBy);
        rowList.add(object);
        counterRow++;
    }
    System.out.println("counterRow is: " + counterRow);
    for (int i = 1; i < rowList.size(); i++) {
        try {
            // this method includes many if-elses only.
            ImplementDecisionTreeRulesFor2012(rowList.get(i)[0], rowList.get(i)[1], rowList.get(i)[2],
                    rowList.get(i)[3], rowList.get(i)[4], rowList.get(i)[5], rowList.get(i)[6]);
        } catch (Exception ex) {
            System.out.println("Exception occurred");
        }
    }
} catch (Exception ex) {
    System.out.println("fix" + ex);
}
It works fine when the CSV file is not large. However, my file is indeed large, so I need a faster way to read the CSV. Is there any advice? Appreciated, thanks.
Accepted answer by laune
In this snippet I see two issues which will slow you down considerably:
while ((line = br2.readLine()) != null) {
    line = line.replaceAll(",,", ",NA,");
    String[] object = line.split(cvsSplitBy);
    rowList.add(object);
    counterRow++;
}
First, rowList starts with the default capacity and will have to be grown many times, each time copying the old underlying array into the new one.
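If the row count is roughly known up front, those repeated grow-and-copy cycles can be avoided by pre-sizing the list. A minimal sketch (the 10-million estimate is a made-up figure):

// Pre-sizing the backing array means add() never has to grow and copy it.
// 10_000_000 here is a hypothetical estimate of the expected row count.
List<String[]> rowList = new ArrayList<>(10_000_000);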
Worse, however, is the excessive blow-up of the data into String[] objects. You need the columns/cells only when you call ImplementDecisionTreeRulesFor2012 for that row, not the whole time you are reading the file and processing all the other rows. Move the split (or something better, as suggested in the comments) into the second loop, as shown in the sketch below.
(Creating many objects is bad, even if you can afford the memory.)
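A minimal sketch of that change, reusing the question's variable names: keep only the raw lines while reading, and split each one just before it is consumed:

List<String> rowList = new ArrayList<>();
while ((line = br2.readLine()) != null) {
    rowList.add(line); // store the raw line; no replaceAll/split yet
}
for (int i = 1; i < rowList.size(); i++) {
    // blow the row up into cells only at the moment it is needed
    String[] object = rowList.get(i).replaceAll(",,", ",NA,").split(cvsSplitBy);
    ImplementDecisionTreeRulesFor2012(object[0], object[1], object[2],
            object[3], object[4], object[5], object[6]);
}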
Perhaps it would be better to call ImplementDecisionTreeRulesFor2012 while you read the "millions"? It would avoid the rowList ArrayList altogether.
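That variant could look like this (a sketch assuming each row can be processed independently; the header is skipped because the original loop started at index 1):

br2.readLine(); // skip the header row
while ((line = br2.readLine()) != null) {
    String[] object = line.replaceAll(",,", ",NA,").split(cvsSplitBy);
    // process immediately; nothing is retained between rows
    ImplementDecisionTreeRulesFor2012(object[0], object[1], object[2],
            object[3], object[4], object[5], object[6]);
}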
Later: Postponing the split reduces the execution time for 10 million rows from 1m8.262s (when the program ran out of heap space) to 13.067s.
If you aren't forced to read all rows before you can call Impl...2012, the time reduces to 4.902s.
Finally, writing the split and replace by hand:
String[] object = new String[7];
//...read...
String x = line + ",";
int iPos = 0;
int iStr = 0;
int iNext = -1;
while ((iNext = x.indexOf(',', iPos)) != -1 && iStr < 7) {
    if (iNext == iPos) {
        object[iStr++] = "NA";
    } else {
        object[iStr++] = x.substring(iPos, iNext);
    }
    iPos = iNext + 1;
}
// add more "NA" if rows can have less than 7 cells
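The padding hinted at in that final comment could be done like this (a short sketch):

// pad trailing cells when a row had fewer than 7 columns
while (iStr < 7) {
    object[iStr++] = "NA";
}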
reduces the time to 1.983s. This is about 30 times faster than the original code, which runs into an OutOfMemoryError anyway.
Answered by Jeronimo Backes
Just use uniVocity-parsers' CSV parser instead of trying to build your own custom parser. Your implementation will probably not be fast or flexible enough to handle all corner cases.
It is extremely memory efficient and you can parse a million rows in less than a second. This link has a performance comparison of many Java CSV libraries, and univocity-parsers comes out on top.
Here's a simple example of how to use it:
CsvParserSettings settings = new CsvParserSettings(); // you'll find many options here, check the tutorial.
CsvParser parser = new CsvParser(settings);
// parses all rows in one go (you should probably use a RowProcessor or iterate row by row if there are many rows)
List<String[]> allRows = parser.parseAll(new File("/path/to/your.csv"));
BUT, that loads everything into memory. To stream all rows, you can do this:
String[] row;
parser.beginParsing(csvFile);
while ((row = parser.parseNext()) != null) {
    // process row here.
}
The faster approach is to use a RowProcessor, which also gives you more flexibility:
settings.setRowProcessor(myChosenRowProcessor);
CsvParser parser = new CsvParser(settings);
parser.parse(csvFile);
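For illustration, myChosenRowProcessor could be built on the library's AbstractRowProcessor base class (a sketch assuming univocity-parsers 2.x; the rule call inside is the question's own method):

import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.AbstractRowProcessor;

// The parser invokes this callback once per row, so no List of rows is built.
AbstractRowProcessor myChosenRowProcessor = new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        // e.g. ImplementDecisionTreeRulesFor2012(row[0], row[1], ..., row[6]);
    }
};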
Lastly, it has built-in routines that use the parser to perform some common tasks (iterating java beans, dumping ResultSets, etc.).
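For example, the bean-iteration routine could be used like this (a sketch assuming univocity-parsers 2.x; the Row bean and its columns are hypothetical):

import com.univocity.parsers.annotations.Parsed;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.CsvRoutines;
import java.io.File;

public class IterateBeansExample {
    // Hypothetical bean: @Parsed binds each field to a CSV column by position.
    public static class Row {
        @Parsed(index = 0) public String year;
        @Parsed(index = 1) public String value;
    }

    public static void main(String[] args) {
        CsvRoutines routines = new CsvRoutines(new CsvParserSettings());
        // Rows are materialized one at a time while the file is streamed.
        for (Row row : routines.iterate(Row.class, new File("/path/to/your.csv"), "UTF-8")) {
            System.out.println(row.year + " -> " + row.value);
        }
    }
}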
This should cover the basics; check the documentation to find the best approach for your case.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
Answered by user3996996
On top of the aforementioned univocity, it's worth checking:
- https://github.com/FasterXML/jackson-dataformat-csv
- http://simpleflatmapper.org/0101-getting-started-csv.html, which also has a low-level API that bypasses String creation.
At the time of this comment, those three would be the fastest CSV parsers.
Chances are that writing your own parser would be slower and buggier.
Answered by ThomasRS
If you're aiming for objects (i.e. data-binding), I've written a high-performance library, sesseltjonna-csv, which you might find interesting. A benchmark comparison with SimpleFlatMapper and uniVocity is here.