Notice: This page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/34639928/

Date: 2020-11-02 23:09:10  Source: igfitidea

Parsing a CSV file for a unique row using the new Java 8 Streams API

Tags: java, csv, java-8, java-stream

Asked by johnco3

I am trying to use the new Java 8 Streams API (with which I am a complete newbie) to find a particular row (the one with 'Neda' in the name column) in a CSV file. Using the following article for motivation, I modified it and fixed some errors so that I could parse a file containing 3 columns: 'name', 'age' and 'height'.

name,age,height
Marianne,12,61
Julie,13,73
Neda,14,66
Julia,15,62
Maryam,18,70

The parsing code is as follows:

@Override
public void init() throws Exception {
    Map<String, String> params = getParameters().getNamed();
    if (params.containsKey("csvfile")) {
        Path path = Paths.get(params.get("csvfile"));
        if (Files.exists(path)){
            // use the new java 8 streams api to read the CSV column headings
            Stream<String> lines = Files.lines(path);
            List<String> columns = lines
                .findFirst()
                .map((line) -> Arrays.asList(line.split(",")))
                .get();
            columns.forEach((l)->System.out.println(l));
            // find the relevant sections from the CSV file
            // we are only interested in the row with Neda's name
            int nameIndex = columns.indexOf("name");
            int ageIndex = columns.indexOf("age");
            int heightIndex = columns.indexOf("height");
            // we need to know the index positions of the 
            // have to re-read the csv file to extract the values
            lines = Files.lines(path);
            List<List<String>> values = lines
                .skip(1)
                .map((line) -> Arrays.asList(line.split(",")))
                .collect(Collectors.toList());
            values.forEach((l)->System.out.println(l));
        }
    }        
}

Is there any way to avoid re-reading the file following the extraction of the header line? Although this is a very small example file, I will be applying this logic to a large CSV file.

Is there a technique for using the streams API to create a map from the extracted column names (found in the first scan of the file) to the values in the remaining rows?

How can I return just one row in the form of a List<String> (instead of a List<List<String>> containing all the rows)? I would prefer to get the row as a mapping between the column names and their corresponding values (a bit like a result set in JDBC). I see a Collectors.mapMerger function that might be helpful here, but I have no idea how to use it.

Answer by Holger

Use a BufferedReader explicitly:

List<String> columns;
List<List<String>> values;
try(BufferedReader br=Files.newBufferedReader(path)) {
    String firstLine=br.readLine();
    if(firstLine==null) throw new IOException("empty file");
    columns=Arrays.asList(firstLine.split(","));
    values = br.lines()
        .map(line -> Arrays.asList(line.split(",")))
        .collect(Collectors.toList());
}

Files.lines(…) also resorts to BufferedReader.lines(…). The only difference is that Files.lines will configure the stream so that closing the stream will close the reader, which we don't need here, as the explicit try(…) statement already ensures the closing of the BufferedReader.

Note that there is no guarantee about the state of the reader after the stream returned by lines() has been processed, but we can safely read lines before performing the stream operation.
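
As a rough sketch of how the same approach could also cover the question about mapping column names to values, the header can be read first and the first matching line can then be paired with it in a Map. The class name CsvRowLookup and the method findRow are made up for illustration, and the plain split(",") assumes a simple file without quoted fields or embedded commas.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of the original answer.
class CsvRowLookup {

    // Reads the header line, then returns the first row whose value in
    // keyColumn equals keyValue, as a column-name -> value map.
    // Returns an empty Optional if the input is empty or nothing matches.
    static Optional<Map<String, String>> findRow(BufferedReader br, String keyColumn, String keyValue) {
        String firstLine;
        try {
            firstLine = br.readLine();
        } catch (IOException e) {
            // BufferedReader.lines() reports I/O errors the same way.
            throw new UncheckedIOException(e);
        }
        if (firstLine == null) {
            return Optional.empty();
        }
        List<String> columns = Arrays.asList(firstLine.split(","));
        int keyIndex = columns.indexOf(keyColumn);
        if (keyIndex < 0) {
            throw new IllegalArgumentException("no such column: " + keyColumn);
        }
        return br.lines()
                .map(line -> line.split(","))
                .filter(fields -> fields.length > keyIndex && fields[keyIndex].equals(keyValue))
                .findFirst() // stops reading as soon as a match is found
                .map(fields -> {
                    // LinkedHashMap keeps the columns in file order.
                    Map<String, String> row = new LinkedHashMap<>();
                    for (int i = 0; i < columns.size() && i < fields.length; i++) {
                        row.put(columns.get(i), fields[i]);
                    }
                    return row;
                });
    }

    public static void main(String[] args) throws IOException {
        String csv = "name,age,height\nMarianne,12,61\nNeda,14,66\nJulia,15,62\n";
        try (BufferedReader br = new BufferedReader(new StringReader(csv))) {
            System.out.println(findRow(br, "name", "Neda"));
        }
    }
}
```

For a real file, the reader would come from Files.newBufferedReader(path) inside the same kind of try(…) statement as above.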

Answer by Tunaki

First, your concern that this code reads the file twice is unfounded. Files.lines returns a lazily populated Stream of the lines. So the first part of the code reads only the first line, and the second part reads the rest (it does read the first line a second time, though, even if that line is then skipped). Quoting its documentation:

Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.
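
The laziness can be observed with a small experiment. This is only a sketch: it uses BufferedReader.lines() over an in-memory string as a stand-in for Files.lines, and the class name LazyLinesDemo and the method linesPulledForFirst are invented for the example. Note that the reader itself may still buffer more characters than a single line; what the counter shows is how many lines travel through the stream pipeline.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not taken from the answer.
class LazyLinesDemo {

    // Counts how many lines the pipeline pulled before findFirst() was satisfied.
    static int linesPulledForFirst(String text) {
        AtomicInteger pulled = new AtomicInteger();
        // An in-memory reader needs no closing; a file-backed one would.
        BufferedReader br = new BufferedReader(new StringReader(text));
        Optional<String> first = br.lines()
                .peek(line -> pulled.incrementAndGet()) // runs once per line actually consumed
                .findFirst();                           // short-circuits after the first line
        return first.isPresent() ? pulled.get() : 0;
    }

    public static void main(String[] args) {
        String csv = "name,age,height\nMarianne,12,61\nJulie,13,73\nNeda,14,66\n";
        System.out.println("lines pulled: " + linesPulledForFirst(csv));
    }
}
```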

On to your second concern about returning just a single row. In functional programming, what you are trying to do is called filtering. The Stream API provides this through the Stream.filter method. This method takes a Predicate as its argument, which is a function that returns true for all the items that should be kept, and false otherwise.

In this case, we want a Predicate that returns true when the name is equal to "Neda". This could be written as the lambda expression s -> s.equals("Neda").

So in the second part of your code, you could have:

lines = Files.lines(path);
List<List<String>> values = lines
            .skip(1)
            .map(line -> Arrays.asList(line.split(",")))
            .filter(list -> list.get(0).equals("Neda")) // keep only items where the name is "Neda"
            .collect(Collectors.toList());

Note, however, that this does not ensure that there is only a single item where the name is "Neda"; it collects all matching items into a List<List<String>>. You could add some logic to find the first item or throw an exception if no items are found, depending on your business requirement.
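
One possible shape for that extra logic, as a minimal sketch: findFirst() stops at the first match and orElseThrow(…) turns a missing row into an exception. The class name FirstMatch, the method firstRowNamed and the choice of NoSuchElementException are assumptions made for the example, and the name is still assumed to be in the first column.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.stream.Stream;

// Hypothetical helper, not part of the original answer.
class FirstMatch {

    // Returns the first data row whose first field equals the given name,
    // or throws if no such row exists. The first line is treated as the header.
    static List<String> firstRowNamed(Stream<String> lines, String name) {
        return lines
                .skip(1)
                .map(line -> Arrays.asList(line.split(",")))
                .filter(list -> list.get(0).equals(name))
                .findFirst()
                .orElseThrow(() -> new NoSuchElementException("no row named " + name));
    }

    public static void main(String[] args) {
        Stream<String> lines = Stream.of(
                "name,age,height",
                "Julie,13,73",
                "Neda,14,66",
                "Julia,15,62");
        System.out.println(firstRowNamed(lines, "Neda"));
    }
}
```

With a real file, the lines argument would be Files.lines(path), ideally opened in a try-with-resources statement so that the underlying file is closed.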

Note also that calling Files.lines(path) twice can be avoided by using a BufferedReader directly, as in @Holger's answer.

Answer by Ismail Ferdous

I know I'm responding very late, but maybe this will help someone in the future.

I've made a CSV parser/writer that is easy to use thanks to its builder pattern.

For your case, you can filter the lines you want to parse using

csvLineFilter(Predicate<String>) 

Hope you find it handy. Here is the source code: https://github.com/i7paradise/CsvUtils-Java8/

I've included a main class, Demo.java, to show how it works.

Answer by Basil Bourque

Using a CSV-processing library

Other Answers are good. But I recommend using a CSV-processing library to read your input files. As others noted, the CSV format is not as simple as it may seem. To begin with, the values may or may not be nested in quote-marks. And there are many variations of CSV, such as those used in Postgres, MySQL, Mongo, Microsoft Excel, and so on.
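
To illustrate the quoting problem, here is a small sketch of what a plain split(",") does to a quoted field that contains a comma. The sample line and the class name NaiveSplitDemo are made up for the example; they do not come from the question's file.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class showing why naive splitting is fragile.
class NaiveSplitDemo {

    // Splits on every comma, ignoring any quoting rules.
    static List<String> naiveSplit(String line) {
        return Arrays.asList(line.split(","));
    }

    public static void main(String[] args) {
        // Three logical fields: a quoted name containing a comma, an age, a height.
        String line = "\"Smith, John\",42,70";
        List<String> fields = naiveSplit(line);
        // The quoted name is cut in two, so four pieces come back instead of three.
        System.out.println(fields.size() + " pieces: " + fields);
    }
}
```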

The Java ecosystem offers several such libraries. I use Apache Commons CSV.

The Apache Commons CSV library does not make use of streams. But you have no need for streams for your work if you use a library to do the scut work. The library makes easy work of looping over the rows of the file without loading a large file into memory.

create a map between the extracted column names (in the first scan of the file) to the values in the remaining rows?

Apache Commons CSV does this automatically when you call withHeader.

return just one row in the form of List

Yes, easy to do.

As you requested, we can fill a List with each of the 3 field values for one particular row. This List acts as a tuple.

List < String > tuple = List.of();  // Our goal is to fill this list of values from a single row. Initialize to an empty nonmodifiable list.

We specify the format we expect of our input file: standard CSV (RFC 4180), with the first row populated by column names.

CSVFormat format =  CSVFormat.RFC4180.withHeader() ;

We specify the file path where to find our input file.

Path path = Path.of("/Users/basilbourque/people.csv");

We use try-with-resources syntax (see Tutorial) to automatically close our parser.

As we read each row, we check whether the name is Neda. If so, we fill our tuple List with that row's field values and break out of the loop. We use List.of to conveniently return a List object of some unknown concrete class that is unmodifiable, meaning you can neither add elements to the list nor remove them.

try (
        CSVParser parser = CSVParser.parse( path , StandardCharsets.UTF_8 , format ) ;
)
{
    for ( CSVRecord record : parser )
    {
        if ( record.get( "name" ).equals( "Neda" ) )
        {
            tuple = List.of( record.get( "name" ) , record.get( "age" ) , record.get( "height" ) );
            break ;
        }
    }
}
catch ( FileNotFoundException e )
{
    e.printStackTrace();
}
catch ( IOException e )
{
    e.printStackTrace();
}

If the search succeeded, we should see some items in our List.

if ( tuple.isEmpty() )
{
    System.out.println( "Bummer. Failed to report a row for `Neda` name." );
} else
{
    System.out.println( "Success. Found this row for name of `Neda`:" );
    System.out.println( tuple.toString() );
}

When run:

Success. Found this row for name of `Neda`:
[Neda, 14, 66]

Instead of using a List as a tuple, I suggest you define a Person class to represent this data with proper data types. Our code here would then return a Person instance rather than a List<String>.
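
Such a class might look roughly like the following sketch. The field types are assumptions (age and height as int, with the unit of height left unspecified because the sample data does not state it), and the factory method fromFields is a hypothetical convenience for converting the three text fields of a row.

```java
import java.util.Objects;

// Hypothetical value class; names and types are assumptions for illustration.
final class Person {
    private final String name;
    private final int age;
    private final int height; // unit not specified in the sample data

    Person(String name, int age, int height) {
        this.name = Objects.requireNonNull(name, "name");
        this.age = age;
        this.height = height;
    }

    // Builds a Person from the raw text fields of one CSV row.
    // Throws NumberFormatException if age or height is not an integer.
    static Person fromFields(String name, String age, String height) {
        return new Person(name, Integer.parseInt(age.trim()), Integer.parseInt(height.trim()));
    }

    String getName() { return name; }
    int getAge() { return age; }
    int getHeight() { return height; }

    @Override
    public String toString() {
        return "Person{name=" + name + ", age=" + age + ", height=" + height + "}";
    }

    public static void main(String[] args) {
        // With Apache Commons CSV, the three arguments would come from
        // record.get( "name" ), record.get( "age" ) and record.get( "height" ).
        System.out.println(Person.fromFields("Neda", "14", "66"));
    }
}
```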