How to read all lines of a file in parallel in Java 8

Question

Asked by user3001

I want to read all lines of a 1 GB file into a Stream<String> as fast as possible. Currently I'm using Files.lines(path) for that. After parsing the file, I do some computations (map()/filter()).

At first I thought this was already done in parallel, but it seems I was wrong: reading the file as it is takes about 50 seconds on my dual CPU laptop. However, if I split the file using bash commands and then process the parts in parallel, it only takes about 30 seconds.

I tried the following combinations:

single file, non-parallel lines() stream ~ 50 seconds
single file, Files.lines(..).parallel().[...] ~ 50 seconds
two files, non-parallel lines() stream ~ 30 seconds
two files, Files.lines(..).parallel().[...] ~ 30 seconds

I ran each of these 4 combinations multiple times with roughly the same results (within 1 or 2 seconds). The [...] is a chain of map and filter only, with a toArray(...) at the end to trigger the evaluation.
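
For illustration, the pipeline described above could look roughly like the following sketch. The class name LinesPipeline, the method process, and the trim/isEmpty steps are hypothetical stand-ins; the actual map and filter functions are not shown in the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class LinesPipeline {

    // Reads all lines of the file and runs a map/filter chain over them.
    // The trim/isEmpty steps are placeholders for the real computations.
    static String[] process(Path path, boolean parallel) throws IOException {
        // Files.lines keeps the file open, so the stream must be closed.
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            Stream<String> source = parallel ? lines.parallel() : lines;
            return source
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .toArray(String[]::new); // terminal operation triggers evaluation
        }
    }

    public static void main(String[] args) throws IOException {
        String[] result = process(java.nio.file.Paths.get(args[0]), true);
        System.out.println(result.length + " lines kept");
    }
}
```

Because the stream is ordered, toArray returns the lines in file order in both the sequential and the parallel variant.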

The conclusion is that using lines().parallel() makes no difference. Since reading two files in parallel takes less time, there is a performance gain from splitting the file. A single file, however, seems to be read entirely serially.

Edit:
I want to point out that I use an SSD, so there is practically no seek time. The file has 1658652 (relatively short) lines in total. Splitting the file in bash takes about 1.5 seconds:

   time split -l 829326 file # 829326 = 1658652 / 2
   split -l 829326 file  0,14s user 1,41s system 16% cpu 9,560 total

So my question is, is there any class or function in the Java 8 JDK which can parallelize reading all lines without having to split it first? For example, if I have two CPU cores, the first line reader should start at the first line and a second one at line (totalLines/2)+1.

Answer 1

Accepted answer by matthewmatician

You might find some help from this post. Trying to parallelize the actual reading of a file is probably barking up the wrong tree, as the biggest slowdown will be your file system (even on an SSD).

If you set up a file channel in memory, you should be able to process the data in parallel from there with great speed, but chances are you won't even need the parallelism, as reading through the channel alone should already give you a huge speed increase.
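
One possible reading of "a file channel in memory" is a memory-mapped file; that interpretation is an assumption, as is everything named in the sketch below (the class MappedLines, the method readAndProcess, and the placeholder trim/isEmpty steps). It maps the file with FileChannel.map, decodes it as UTF-8, and only then processes the lines in parallel.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.regex.Pattern;

public class MappedLines {

    private static final Pattern LINE_BREAK = Pattern.compile("\r?\n");

    // Maps the whole file into memory, decodes it, and processes the lines
    // in parallel. Assumes UTF-8 content and a file smaller than 2 GB,
    // since a single mapping is limited to Integer.MAX_VALUE bytes.
    static String[] readAndProcess(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // Decoding creates a full in-memory copy of the text.
            String content = StandardCharsets.UTF_8.decode(buffer).toString();
            // An array-backed stream splits evenly across worker threads.
            return Arrays.stream(LINE_BREAK.split(content))
                    .parallel()
                    .map(String::trim)                 // placeholder computation
                    .filter(line -> !line.isEmpty())   // placeholder computation
                    .toArray(String[]::new);
        }
    }
}
```

Decoding a 1 GB file into a single String needs a correspondingly large heap, so this trades memory for speed. Whether it beats Files.lines on a given machine would have to be measured.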
