How to read all lines of a file in parallel in Java 8
Original source: http://stackoverflow.com/questions/25711616/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by user3001
I want to read all lines of a 1 GB file into a Stream<String> as fast as possible. Currently I'm using Files.lines(path) for that. After parsing the file, I do some computations (map()/filter()).
At first I thought this was already done in parallel, but it seems I was wrong: reading the file as is takes about 50 seconds on my dual-CPU laptop. However, if I split the file using bash commands and then process the pieces in parallel, it only takes about 30 seconds.
I tried the following combinations:
- single file, non-parallel lines() stream: ~50 seconds
- single file, Files.lines(..).parallel().[...]: ~50 seconds
- two files, non-parallel lines() stream: ~30 seconds
- two files, Files.lines(..).parallel().[...]: ~30 seconds
I ran each of these four configurations several times with roughly the same results (within 1 or 2 seconds). The [...] is a chain of map and filter calls only, with a toArray(...) at the end to trigger the evaluation.
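For reference, the single-file measurement described above could be reproduced with something like the following. This is only a sketch: the class name LinesBenchmark and the trim/non-empty steps are hypothetical stand-ins for the real map/filter chain, which the question does not show.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

final class LinesBenchmark {

    // Hypothetical stand-in for the question's map/filter chain.
    static String[] process(Path path, boolean parallel) throws IOException {
        // Files.lines reads lazily; try-with-resources closes the underlying file.
        try (Stream<String> lines = Files.lines(path)) {
            Stream<String> source = parallel ? lines.parallel() : lines;
            return source
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .toArray(String[]::new); // terminal operation triggers evaluation
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args[0]);
        for (boolean parallel : new boolean[] {false, true}) {
            long start = System.nanoTime();
            String[] result = process(path, parallel);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println((parallel ? "parallel" : "sequential")
                    + ": " + result.length + " lines in " + millis + " ms");
        }
    }
}
```

Run it with the file path as the only argument. A single run per mode is a rough measurement at best, so repeating it a few times, as done above, is advisable.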
The conclusion is that using lines().parallel() makes no difference. Since reading two files in parallel takes less time, there is a performance gain from splitting the file. However, it seems that a single file is always read serially.
Edit:
I want to point out that I use an SSD, so there is practically no seeking time. The file has 1658652 (relatively short) lines in total.
Splitting the file in bash takes about 1.5 seconds:
time split -l 829326 file # 829326 = 1658652 / 2
split -l 829326 file 0,14s user 1,41s system 16% cpu 9,560 total
So my question is: is there any class or function in the Java 8 JDK that can parallelize reading all lines without having to split the file first? For example, if I have two CPU cores, the first line reader should start at the first line and a second one at line (totalLines/2)+1.
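To make the idea concrete, here is a rough sketch of two readers working on one file. It is not a JDK facility. It assumes splitting at a byte offset near the middle rather than at line (totalLines/2)+1, since finding a line number would itself require a scan. The class TwoHalvesReader and its methods are hypothetical names. Each half is loaded into memory, so each half must be under 2 GB and the heap must be large enough.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class TwoHalvesReader {

    // Offset just past the first '\n' found at or after the midpoint (or the file size).
    static long findSplitOffset(Path path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "r")) {
            long size = file.length();
            file.seek(size / 2);
            int b;
            while ((b = file.read()) != -1) {
                if (b == '\n') {
                    return file.getFilePointer();
                }
            }
            return size;
        }
    }

    // Reads bytes [start, end) and splits them into lines; assumes UTF-8 and '\n' or "\r\n".
    static List<String> readRange(Path path, long start, long end) {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate((int) (end - start)); // range must be < 2 GB
            long position = start;
            while (buffer.hasRemaining()) {
                int n = channel.read(buffer, position);
                if (n < 0) {
                    break;
                }
                position += n;
            }
            buffer.flip();
            String text = StandardCharsets.UTF_8.decode(buffer).toString();
            if (text.isEmpty()) {
                return Collections.emptyList();
            }
            String[] parts = text.split("\r?\n", -1);
            // A trailing newline yields one empty trailing element; drop it.
            int count = text.endsWith("\n") ? parts.length - 1 : parts.length;
            return Arrays.asList(parts).subList(0, count);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads both halves concurrently and concatenates them in file order.
    static List<String> readAllLines(Path path) throws IOException {
        long size = Files.size(path);
        long split = findSplitOffset(path);
        CompletableFuture<List<String>> first =
                CompletableFuture.supplyAsync(() -> readRange(path, 0, split));
        CompletableFuture<List<String>> second =
                CompletableFuture.supplyAsync(() -> readRange(path, split, size));
        List<String> all = new ArrayList<>(first.join());
        all.addAll(second.join());
        return all;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAllLines(Paths.get(args[0])).size() + " lines read");
    }
}
```

Splitting on the byte '\n' is safe for UTF-8 because that byte never occurs inside a multi-byte character. Whether two readers actually beat one depends on the disk and the file system.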
Accepted answer by matthewmatician
You might find some help from this post. Trying to parallelize the actual reading of a file is probably barking up the wrong tree, as the biggest slowdown will be your file system (even on an SSD).
If you set up a file channel in memory, you should be able to process the data in parallel from there with great speed, but chances are you won't need it as you'll see a huge speed increase.
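One possible reading of "a file channel in memory" is a memory-mapped file via FileChannel.map. That interpretation is my assumption, not something the answer spells out. The sketch below maps the file, decodes it as UTF-8, and runs a placeholder map/filter chain in parallel over the lines. The class name MappedLines and the trim/non-empty steps are hypothetical.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class MappedLines {

    static String[] process(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes (just under 2 GB).
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // Decoding copies the whole file onto the heap as a String.
            String text = StandardCharsets.UTF_8.decode(mapped).toString();
            return Arrays.stream(text.split("\r?\n"))
                    .parallel() // array-backed streams split evenly across threads
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .toArray(String[]::new);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(process(Paths.get(args[0])).length + " lines kept");
    }
}
```

For a 1 GB file the decoded String alone needs roughly 2 GB of heap on Java 8, so this trades memory for simplicity.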