Good and effective CSV/TSV Reader for Java
Source: http://stackoverflow.com/questions/13879967/
This page reproduces a Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by Robin
I am trying to read big CSV and TSV (tab-separated) files with about 1000000 rows or more. I tried to read a TSV containing ~2500000 lines with opencsv, but it throws a java.lang.NullPointerException. It works with smaller TSV files of ~250000 lines. So I was wondering whether there are any other libraries that support reading huge CSV and TSV files. Do you have any ideas?
For anyone interested in my code (I shortened it, so the try/catch is obviously invalid):
InputStreamReader in = null;
CSVReader reader = null;
try {
    in = this.replaceBackSlashes();
    reader = new CSVReader(in, this.seperator, '\"', this.offset);
    ret = reader.readAll();
} finally {
    try {
        reader.close();
    }
    // catch block omitted by the asker
}
Edit: This is the method where I construct the InputStreamReader:
private InputStreamReader replaceBackSlashes() throws Exception {
    FileInputStream fis = null;
    Scanner in = null;
    try {
        fis = new FileInputStream(this.csvFile);
        in = new Scanner(fis, this.encoding);
        // the whole cleaned-up file is buffered in memory here
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (in.hasNext()) {
            // replace every backslash with a forward slash
            String nextLine = in.nextLine().replace("\\", "/");
            // nextLine = nextLine.replaceAll(" ", "");
            nextLine = nextLine.replaceAll("'", "");
            out.write(nextLine.getBytes());
            out.write("\n".getBytes());
        }
        return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
    } catch (Exception e) {
        in.close();
        fis.close();
        this.logger.error("Problem at replaceBackSlashes", e);
    }
    throw new Exception();
}
Accepted answer by RuntimeException
I have not tried it, but I had investigated superCSV earlier.
http://sourceforge.net/projects/supercsv/
http://supercsv.sourceforge.net/
Check whether it works for your 2.5 million lines.
Answer by Jeronimo Backes
Do not use a CSV parser to parse TSV inputs. It will break if the TSV has fields with a quote character, for example.
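To make the point about quote characters concrete, here is a minimal sketch. The splitCsvStyle method is a deliberately simplified stand-in for quote-aware parsing (it is not opencsv's actual implementation), and the class name, method names, and sample line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QuoteInTsvDemo {

    // Simplified CSV-style splitting (an illustration, not any library's real code):
    // a double quote toggles "quoted" mode, and delimiters inside quotes are ignored.
    public static List<String> splitCsvStyle(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == delimiter && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // TSV-style splitting: a tab always ends a field and quotes are ordinary characters.
    public static List<String> splitTsvStyle(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    public static void main(String[] args) {
        // Hypothetical row: the second field contains a lone quote (5" as in inches).
        String line = "42\t5\" floppy disk\tin stock";
        System.out.println(splitCsvStyle(line, '\t').size() + " fields with CSV-style quoting");
        System.out.println(splitTsvStyle(line).size() + " fields with plain tab splitting");
    }
}
```

With quote-aware splitting, the lone quote opens a quoted section that never closes, so the following tab is swallowed into the field. A real CSV parser reading from a stream would typically keep consuming subsequent lines while looking for the closing quote, which is one way large TSV files end up with merged or missing rows.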
uniVocity-parsers comes with a TSV parser. You can parse a billion rows without problems.
Example to parse a TSV input:
TsvParserSettings settings = new TsvParserSettings();
TsvParser parser = new TsvParser(settings);
// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));
If your input is so big it can't be kept in memory, do this:
TsvParserSettings settings = new TsvParserSettings();

// all rows parsed from your input will be sent to this processor
ObjectRowProcessor rowProcessor = new ObjectRowProcessor() {
    @Override
    public void rowProcessed(Object[] row, ParsingContext context) {
        // here is the row. Let's just print it.
        System.out.println(Arrays.toString(row));
    }
};

// the ObjectRowProcessor supports conversions from String to whatever you need:
// converts values in columns 2 and 5 to BigDecimal
rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(2, 5);

// converts the values in columns "Description" and "Model". Applies trim and to lowercase to the values in these columns.
rowProcessor.convertFields(Conversions.trim(), Conversions.toLowerCase()).set("Description", "Model");

// configures to use the RowProcessor
settings.setRowProcessor(rowProcessor);

TsvParser parser = new TsvParser(settings);
// parses everything. All rows will be pumped into your RowProcessor.
parser.parse(new FileReader(yourFile));
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
Answer by Sri Harsha Chilakapati
Try switching libraries as suggested by Satish. If that doesn't help, you have to split the whole file into tokens and process them.
Assuming that your CSV doesn't have any escape characters for commas:
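// r is the BufferedReader pointed at your file
String line;
StringBuilder file = new StringBuilder();

// load each line and append it to file
while ((line = r.readLine()) != null) {
    // readLine() strips the line break, so add a comma to keep the last
    // token of this line separate from the first token of the next one
    file.append(line).append(',');
}

// turn them into an array
String[] tokens = file.toString().split(",");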
Then you can process it. Don't forget to trim the token before using it.
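For files with millions of rows, buffering everything in a StringBuilder may not fit in memory. As a hedged alternative under the same assumption (no quoting or escaping in the data), the sketch below processes one line at a time. The class name StreamingDelimitedSketch, the forEachRow method, and the sample data are hypothetical names chosen for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import java.util.regex.Pattern;

public class StreamingDelimitedSketch {

    // Reads the input line by line and hands each row to the handler,
    // so only the current line is held in memory. Returns the number of rows seen.
    // Assumes the delimiter never appears inside a field (no quoting or escaping).
    public static long forEachRow(Reader source, char delimiter, Consumer<String[]> handler)
            throws IOException {
        // Quote the delimiter so characters like '|' or '.' are not treated as regex syntax.
        Pattern splitter = Pattern.compile(Pattern.quote(String.valueOf(delimiter)));
        long rows = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                // limit -1 keeps trailing empty fields instead of dropping them
                handler.accept(splitter.split(line, -1));
                rows++;
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // In real use, pass a FileReader (or an InputStreamReader with an explicit charset).
        Reader sample = new StringReader("id\tname\n1\tfoo\n2\t\n");
        long count = forEachRow(sample, '\t', row -> System.out.println(String.join(" | ", row)));
        System.out.println(count + " rows");
    }
}
```

Because each row is handed to the callback and then discarded, memory use stays roughly constant regardless of file size, at the cost of not having all rows available at once.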
Answer by Konrad Höffner
I don't know whether this question is still active, but here is the reader I use successfully. It may still need to implement more interfaces, such as Stream or Iterable:
import java.io.Closeable;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

/** Reader for the tab separated values format (a basic table format without escapings or anything, where the columns are separated by tabulators). */
public class TSVReader implements Closeable
{
    final Scanner in;
    String peekLine = null;

    /** Constructs a new TSVReader which produces values scanned from the specified input stream. */
    public TSVReader(InputStream stream) throws FileNotFoundException
    {
        in = new Scanner(stream);
    }

    /** Constructs a new TSVReader which produces values scanned from the specified file. */
    public TSVReader(File f) throws FileNotFoundException {in = new Scanner(f);}

    /** Skips blank lines and buffers the next non-empty line, if there is one. */
    public boolean hasNextTokens()
    {
        if(peekLine!=null) return true;
        if(!in.hasNextLine()) {return false;}
        String line = in.nextLine().trim();
        if(line.isEmpty()) {return hasNextTokens();}
        this.peekLine = line;
        return true;
    }

    /** Returns the tokens of the next non-empty line, or null at the end of the input. */
    public String[] nextTokens()
    {
        if(!hasNextTokens()) return null;
        // note: this splits on any run of whitespace, not only on tabs
        String[] tokens = peekLine.split("[\\s\\t]+");
        // System.out.println(Arrays.toString(tokens));
        peekLine=null;
        return tokens;
    }

    @Override public void close() throws IOException {in.close();}
}