Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/13879967/

Date: 2020-10-31 14:18:58  Source: igfitidea

Good and effective CSV/TSV Reader for Java

Tags: java, csv, large-files, opencsv

Asked by Robin

I am trying to read big CSV and TSV (tab-separated) files with about 1000000 rows or more. I tried to read a TSV containing ~2500000 lines with opencsv, but it throws a java.lang.NullPointerException. It works with smaller TSV files of ~250000 lines. So I was wondering whether there are any other libraries that support reading huge CSV and TSV files. Do you have any ideas?

For everybody who is interested in my code (I shortened it, so the try-catch is obviously invalid):

InputStreamReader in = null;
CSVReader reader = null;
try {
    in = this.replaceBackSlashes();
    reader = new CSVReader(in, this.seperator, '\"', this.offset);
    ret = reader.readAll();
} finally {
    try {
        reader.close();
    } 
}

Edit: This is the Method where I construct the InputStreamReader:

private InputStreamReader replaceBackSlashes() throws Exception {
        FileInputStream fis = null;
        Scanner in = null;
        try {
            fis = new FileInputStream(this.csvFile);
            in = new Scanner(fis, this.encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();

            while (in.hasNext()) {
                String nextLine = in.nextLine().replace("\\", "/");
                // nextLine = nextLine.replaceAll(" ", "");
                nextLine = nextLine.replaceAll("'", "");
                out.write(nextLine.getBytes());
                out.write("\n".getBytes());
            }

            return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
        } catch (Exception e) {
            in.close();
            fis.close();
            this.logger.error("Problem at replaceBackSlashes", e);
        }
        throw new Exception();
    }

Accepted answer by RuntimeException

I have not tried it, but I had investigated superCSV earlier.

http://sourceforge.net/projects/supercsv/

http://supercsv.sourceforge.net/

Check whether it can handle your 2.5 million lines.

Answer by Jeronimo Backes

Do not use a CSV parser to parse TSV inputs. It will break if the TSV has fields with a quote character, for example.
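To illustrate the quoting problem: in a plain TSV file a double quote is ordinary data, whereas a CSV parser treats it as the start of a quoted field. The following is only a minimal sketch using the standard library, assuming fields never contain tabs or line breaks; `PlainTsvSplit` is a hypothetical helper name, not part of any library mentioned here.

```java
// Hypothetical helper for illustration only (not part of opencsv or uniVocity-parsers).
class PlainTsvSplit {

    // Splits one TSV line on tab characters only.
    // Quote characters are left untouched because they are plain data in TSV.
    // The -1 limit keeps trailing empty fields instead of silently dropping them.
    static String[] split(String line) {
        return line.split("\t", -1);
    }
}
```

For example, the line `5" disk<TAB>blue<TAB>` yields three fields: `5" disk`, `blue`, and an empty string. A dedicated TSV parser is still the safer choice for real data, since it also handles escape sequences.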

uniVocity-parsers comes with a TSV parser. You can parse a billion rows without problems.

Example to parse a TSV input:

TsvParserSettings settings = new TsvParserSettings();
TsvParser parser = new TsvParser(settings);

// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));

If your input is so big it can't be kept in memory, do this:

TsvParserSettings settings = new TsvParserSettings();

// all rows parsed from your input will be sent to this processor
ObjectRowProcessor rowProcessor = new ObjectRowProcessor() {
    @Override
    public void rowProcessed(Object[] row, ParsingContext context) {
        //here is the row. Let's just print it.
        System.out.println(Arrays.toString(row));
    }
};
// the ObjectRowProcessor supports conversions from String to whatever you need:
// converts values in columns 2 and 5 to BigDecimal
rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(2, 5);

// converts the values in columns "Description" and "Model". Applies trim and to lowercase to the values in these columns.
rowProcessor.convertFields(Conversions.trim(), Conversions.toLowerCase()).set("Description", "Model");

//configures to use the RowProcessor
settings.setRowProcessor(rowProcessor);

TsvParser parser = new TsvParser(settings);
//parses everything. All rows will be pumped into your RowProcessor.
parser.parse(new FileReader(yourFile));

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Answer by Sri Harsha Chilakapati

Try switching libraries as suggested by Satish. If that doesn't help, you have to split the whole file into tokens and process them.

Assuming your CSV doesn't contain any escaped commas:

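// r is a BufferedReader opened on your file
String line;
StringBuilder file = new StringBuilder();
// Load each line and append it to the buffer. A comma is added after each line
// so the last token of one line is not merged with the first token of the next.
while ((line = r.readLine()) != null) {
    file.append(line).append(',');
}
// Split the buffer into an array of tokens
String[] tokens = file.toString().split(",");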

Then you can process it. Don't forget to trim the token before using it.
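For very large files, the same idea can be applied one line at a time so that the whole file never has to sit in memory. This is only a sketch under the same assumption as above (no escaped or quoted commas); `LineTokenizer` and its method names are hypothetical.

```java
import java.io.BufferedReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; assumes commas never appear inside a field.
class LineTokenizer {

    // Splits a single line on commas and trims every token.
    static List<String> trimmedTokens(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.split(",", -1)) {
            tokens.add(token.trim());
        }
        return tokens;
    }

    // Streams the reader line by line and counts the tokens,
    // holding only one line in memory at a time.
    static long countTokens(BufferedReader reader) {
        return reader.lines()
                .mapToLong(line -> trimmedTokens(line).size())
                .sum();
    }
}
```

Replacing the counting step with your own per-row processing keeps memory use roughly constant regardless of file size.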

Answer by Konrad Höffner

I don't know whether this question is still active, but here is the reader I use successfully. It may still need to implement more interfaces, such as Stream or Iterable, however:

import java.io.Closeable;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

/** Reader for the tab-separated values format (a basic table format without escaping, where the values in each row are separated by tabs). **/
public class TSVReader implements Closeable 
{
    final Scanner in;
    String peekLine = null;

    public TSVReader(InputStream stream) throws FileNotFoundException
    {
        in = new Scanner(stream);
    }

    /**Constructs a new TSVReader which produces values scanned from the specified input stream.*/
    public TSVReader(File f) throws FileNotFoundException {in = new Scanner(f);}

    public boolean hasNextTokens()
    {
        if(peekLine!=null) return true;
        if(!in.hasNextLine()) {return false;}
        String line = in.nextLine().trim();
        if(line.isEmpty())  {return hasNextTokens();}
        this.peekLine = line;       
        return true;        
    }

    public String[] nextTokens()
    {
        if(!hasNextTokens()) return null;       
        String[] tokens = peekLine.split("[\\s\\t]+");
//      System.out.println(Arrays.toString(tokens));
        peekLine=null;      
        return tokens;
    }

    @Override public void close() throws IOException {in.close();}
}
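As a rough sketch of the Stream-style access mentioned above, the rows can also be exposed lazily with the standard library. This is not part of the class above: `TsvRows` is a hypothetical name, and unlike `TSVReader` it is assumed here that fields are split on tab characters only rather than on any whitespace.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.stream.Stream;

// Hypothetical companion sketch; not part of the TSVReader class above.
class TsvRows {

    // Returns a lazy stream of rows. Lines are trimmed and blank lines skipped,
    // mirroring hasNextTokens(); fields are split on tabs only (an assumption).
    // Closing the returned stream does not close the underlying reader.
    static Stream<String[]> rows(Reader source) {
        return new BufferedReader(source).lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(line -> line.split("\t"));
    }
}
```

Because the stream is lazy, rows are read only as they are consumed, which suits files that do not fit in memory. The caller remains responsible for closing the underlying reader.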