How to split a CSV file into multiple chunks and read those chunks in parallel in Java code

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/11098873/
Asked by JuliaLi
I have a very big CSV file (1GB+) with 100,000 lines.
I need to write a Java program that parses each line from the CSV file and creates the body of an HTTP request to send out.
In other words, I need to send out 100,000 HTTP requests, one for each line in the CSV file. Doing this in a single thread would take very long.
I'd like to create 1,000 threads that each i) read a line from the CSV file, ii) create an HTTP request whose body contains that line's content, and iii) send the HTTP request out and receive the response.
To do this, I need to split the CSV file into 1,000 chunks, and those chunks must not have any overlapping lines.
What's the best way to do such a split?
Answered by dasblinkenlight
Reading a single file at multiple positions concurrently wouldn't let you go any faster (but it could slow you down considerably).
Instead of reading the file from multiple threads, read the file from a single thread, and parallelize the processing of these lines. A single thread should read your CSV line-by-line, and put each line in a queue. Multiple worker threads should then take the next line from the queue, parse it, convert it to a request, and process the request concurrently as needed. The splitting of the work is then done by a single thread, ensuring that there are no missing lines or overlaps.
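A minimal sketch of this producer/consumer layout, assuming a bounded BlockingQueue, an illustrative worker count, and a hypothetical sendRequest(String) method standing in for the actual HTTP call:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QueuedCsvSender {
    // Distinct sentinel object telling the workers that the reader is done.
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws IOException, InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1_000); // bounded, so the reader can't run far ahead
        int workers = 50; // illustrative; tune to what the target server can handle
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Worker threads: take lines off the queue and process them concurrently.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String line = queue.take();
                        if (line == POISON) {
                            break;
                        }
                        sendRequest(line); // parse the line, build the request body, send it
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Single reader thread (here, the main thread) feeds the queue line by line.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                queue.put(line);
            }
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON); // one stop marker per worker
        }
        pool.shutdown();
    }

    private static void sendRequest(String csvLine) {
        // placeholder: build the HTTP request for this line and send it
    }
}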
Answered by Peter Lawrey
You can have a thread which reads the lines of the CSV and builds a List of the lines read. When this reaches some limit, e.g. 100 lines, pass it to a fixed-size thread pool to send as requests.
I suspect that unless your server has 1000 cores, you might find that using 10-100 concurrent requests is faster.
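A minimal sketch of the batching idea, assuming a hypothetical sendBatch(List&lt;String&gt;) helper; the batch size and pool size are illustrative:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BatchingCsvSender {
    public static void main(String[] args) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(20); // 10-100 workers is usually plenty
        int batchSize = 100;

        try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.csv"), StandardCharsets.UTF_8)) {
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() >= batchSize) {
                    List<String> toSend = batch;        // hand the full batch to the pool
                    pool.submit(() -> sendBatch(toSend));
                    batch = new ArrayList<>(batchSize); // start a fresh batch
                }
            }
            if (!batch.isEmpty()) {
                List<String> toSend = batch;
                pool.submit(() -> sendBatch(toSend));   // don't lose the last partial batch
            }
        }
        pool.shutdown();
    }

    private static void sendBatch(List<String> lines) {
        // placeholder: build and send the HTTP request(s) for this batch of lines
    }
}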
Answered by amicngh
Read the CSV file in a single thread; once you get a line, delegate it to one of the threads available in the pool by constructing an object of your Runnable task and passing it to the executor's submit(), which will execute it asynchronously.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public static void main(String[] args) throws IOException {
    String fName = "C:\\Amit\\abc.csv";
    String thisLine;
    ExecutorService pool = Executors.newFixedThreadPool(1000);
    int count = 0; // crude throttle on the number of concurrent requests to the server

    try (BufferedReader myInput = new BufferedReader(new FileReader(fName))) {
        while ((thisLine = myInput.readLine()) != null) {
            if (count > 150) {
                try {
                    Thread.sleep(100); // pause briefly so the pool can drain
                    count = 0;
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            pool.submit(new MyTask(thisLine)); // each line becomes an asynchronous task
            count++;
        }
    }
    pool.shutdown();
}
Here is your task:
class MyTask implements Runnable {
    private final String lLine;

    public MyTask(String line) {
        this.lLine = line;
    }

    public void run() {
        // 1) Create the HTTP request body from lLine
        // 2) Send the HTTP request out and receive the response
    }
}
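For illustration, one possible way to fill in run() with a plain HttpURLConnection POST; the endpoint URL is a made-up placeholder and not part of the original answer:

// Additional imports this sketch needs:
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public void run() {
    try {
        // Hypothetical endpoint; in practice the URL comes from your own configuration.
        HttpURLConnection conn = (HttpURLConnection) new URL("http://example.com/send").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(lLine.getBytes(StandardCharsets.UTF_8)); // request body is the raw CSV line
        }
        int status = conn.getResponseCode(); // sends the request and blocks until the response arrives
        conn.disconnect();
    } catch (IOException e) {
        e.printStackTrace();
    }
}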
Answered by biziclop
Have one thread reading the file line by line, and for every line read, post a task into an ExecutorService to perform the HTTP request for it.
Reading the file from multiple threads isn't going to work, because in order to read the nth line, you have to read all the others first. (It could work in theory if your file contained fixed-width records, but CSV isn't a fixed-width format.)
Answered by ThomasRS
If you're looking to unzip and parse in the same operation, have a look at https://github.com/skjolber/unzip-csv.
Answered by xeno
Java 8, which is scheduled for release this month, will have improved support for this through parallel streams and lambdas. Oracle's tutorial on parallel streams might be a good starting point.
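As a rough sketch of what that could look like, assuming a hypothetical sendRequest(String) method; note that a parallel stream runs on the common ForkJoinPool, so the degree of parallelism roughly follows the number of cores rather than anything you pick per request:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ParallelStreamSender {
    public static void main(String[] args) throws IOException {
        // Files.lines gives a lazy stream over the file; .parallel() lets the common
        // ForkJoinPool process lines concurrently.
        try (Stream<String> lines = Files.lines(Paths.get("input.csv"), StandardCharsets.UTF_8)) {
            lines.parallel().forEach(line -> sendRequest(line));
        }
    }

    private static void sendRequest(String line) {
        // placeholder: build the HTTP request body from this line and send it
    }
}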
Note that a pitfall here is too much parallelism. For the example of retrieving URLs, it is likely a good idea to have a low number of parallel calls. Too much parallelism not only affects bandwidth and the web site you are connecting to; you also risk running out of file descriptors, which are a strictly limited resource in most environments where Java runs.
Some frameworks that may help you are Netflix's RxJava and Akka. Be aware that these frameworks are not trivial and will take some effort to learn.