Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/11867348/
Concurrent reading of a File (java preferred)
Asked by user1132593
I have a large file that takes multiple hours to process. So I am thinking of trying to estimate chunks and read the chunks in parallel.
Is it possible to concurrently read a single file? I have looked at both RandomAccessFile as well as nio.FileChannel, but based on other posts I am not sure whether this approach would work.
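For reference, nio.FileChannel does allow safe concurrent reads of one file: the positional read(ByteBuffer, long) overload does not move the channel's current position, so a single channel can be shared between threads. A minimal single-threaded sketch of a positional read (the temp file and offset are purely illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalRead {
    public static void main(String[] args) throws Exception {
        // Illustrative demo file; substitute your own large file.
        Path path = Files.createTempFile("demo", ".bin");
        Files.write(path, "Hello, concurrent world!".getBytes("UTF-8"));

        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            // Positional read: does not touch the channel's position,
            // so several threads may issue these on the same channel.
            ByteBuffer buf = ByteBuffer.allocate(10);
            long offset = 7;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, offset + buf.position());
                if (n < 0) break; // end of file
            }
            System.out.println(new String(buf.array(), 0, buf.position(), "UTF-8")); // prints "concurrent"
        }
        Files.delete(path);
    }
}
```

Each worker thread would simply call read with its own offset; no locking around the channel is required for positional reads.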
Answered by Petr Pudlák
The most important question here is what is the bottleneck in your case.
If the bottleneck is your disk IO, then there isn't much you can do at the software part. Parallelizing the computation will only make things worse, because reading the file from different parts simultaneously will degrade disk performance.
If the bottleneck is processing power, and you have multiple CPU cores, then you can take advantage of starting multiple threads to work on different parts of the file. You can safely create several InputStreams or Readers to read different parts of the file in parallel (as long as you don't go over your operating system's limit on the number of open files). You could separate the work into tasks and run them in parallel, like in this example:
import java.io.*;
import java.util.*;
import java.util.concurrent.*;

public class Split {
    private File file;

    public Split(File file) {
        this.file = file;
    }

    // Processes the given portion of the file.
    // Called simultaneously from several threads.
    // Use your custom return type as needed, I used String just to give an example.
    public String processPart(long start, long end) throws Exception {
        InputStream is = new FileInputStream(file);
        is.skip(start);
        // do a computation using the input stream,
        // checking that we don't read more than (end - start) bytes
        System.out.println("Computing the part from " + start + " to " + end);
        Thread.sleep(1000);
        System.out.println("Finished the part from " + start + " to " + end);
        is.close();
        return "Some result";
    }

    // Creates a task that will process the given portion of the file,
    // when executed.
    public Callable<String> processPartTask(final long start, final long end) {
        return new Callable<String>() {
            public String call() throws Exception {
                return processPart(start, end);
            }
        };
    }

    // Splits the computation into chunks of the given size,
    // creates appropriate tasks and runs them using a
    // given number of threads.
    public void processAll(int noOfThreads, int chunkSize) throws Exception {
        int count = (int) ((file.length() + chunkSize - 1) / chunkSize);
        List<Callable<String>> tasks = new ArrayList<Callable<String>>(count);
        for (int i = 0; i < count; i++)
            tasks.add(processPartTask(i * chunkSize, Math.min(file.length(), (i + 1) * chunkSize)));

        ExecutorService es = Executors.newFixedThreadPool(noOfThreads);
        List<Future<String>> results = es.invokeAll(tasks);
        es.shutdown();

        // use the results for something
        for (Future<String> result : results)
            System.out.println(result.get());
    }

    public static void main(String[] argv) throws Exception {
        Split s = new Split(new File(argv[0]));
        s.processAll(8, 1000);
    }
}
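The processPart method above only simulates work with a sleep. The bounded read its comment hints at could be sketched as follows (a minimal, hypothetical helper, not part of the original answer). Note that InputStream.skip() may skip fewer bytes than requested, so it has to be called in a loop:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

public class BoundedRead {
    // Reads exactly the bytes in [start, end) of the file and returns
    // how many bytes were actually available in that range.
    static long readRange(File file, long start, long end) throws IOException {
        try (InputStream is = new BufferedInputStream(new FileInputStream(file))) {
            // skip() may skip fewer bytes than requested, so loop until done.
            long toSkip = start;
            while (toSkip > 0) {
                long skipped = is.skip(toSkip);
                if (skipped <= 0) break;
                toSkip -= skipped;
            }
            long remaining = end - start;
            byte[] buf = new byte[8192];
            long total = 0;
            while (remaining > 0) {
                int n = is.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) break; // hit end of file early
                // ... process buf[0..n) here ...
                total += n;
                remaining -= n;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("chunk", ".bin");
        Files.write(f.toPath(), new byte[100]); // 100-byte demo file
        System.out.println(readRange(f, 10, 60)); // prints 50
        f.delete();
    }
}
```

The BufferedInputStream wrapper keeps the many small read() calls from each hitting the disk individually.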
Answered by Peter Lawrey
You can parallelise reading a large file provided you have multiple independent spindles. E.g. if you have a RAID 0 + 1 striped file system, you can see a performance improvement by triggering multiple concurrent reads of the same file.
If however you have a combined file system like RAID 5 or 6, or a plain single disk, it is highly likely that reading the file sequentially is the fastest way to read from that disk. Note: the OS is smart enough to pre-fetch reads when it sees you are reading sequentially, so using an additional thread to do this is unlikely to help.
i.e. using multiple threads will not make your disk any faster.
If you want to read from disk faster, use a faster drive. A typical SATA HDD can read about 60 MB/second and perform 120 IOPS. A typical SATA SSD drive can read at about 400 MB/s and perform 80,000 IOPS and a typical PCI SSD can read at 900 MB/s and perform 230,000 IOPS.
Answered by Brad
You can process in parallel; however, your hard drive can only read one piece of data at a time. If you read in the file with a single thread, you can then process the data with several threads.
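One common shape for this (a sketch I am adding, not part of the original answer) is a single reader thread feeding chunks to several workers through a BlockingQueue; here the reader loop is simulated with fixed-size dummy chunks:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleReaderPipeline {
    private static final byte[] POISON = new byte[0]; // end-of-stream marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(16);
        int workers = 4;
        AtomicInteger processed = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    for (byte[] chunk; (chunk = queue.take()) != POISON; ) {
                        processed.addAndGet(chunk.length); // CPU-bound work goes here
                    }
                    queue.put(POISON); // pass the marker on so other workers stop too
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Reader: in real code this loop would fill chunks from one InputStream.
        for (int i = 0; i < 10; i++) {
            queue.put(new byte[1024]);
        }
        queue.put(POISON);

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(processed.get()); // prints 10240
    }
}
```

The disk then sees one sequential reader, while the CPU-heavy work is spread over the pool; the bounded queue also applies back-pressure so the reader cannot run far ahead of the workers.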
Answered by Buhb
If you're reading a file from a hard drive, then the fastest way to get the data is to read the file from start to end, that is, not concurrently.
Now if it's the processing that takes time, then that might benefit from having several threads processing different chunks of data concurrently, but that has nothing to do with how you're reading the file.