Original question: http://stackoverflow.com/questions/17927398/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
how to split a large text file into smaller chunks using java multithread
Asked by user2630648
I'm trying to develop a multithreaded Java program to split a large text file into smaller text files. Each smaller file must have a predetermined number of lines. For example: if the input file has 100 lines and the input number is 10, my program splits the input file into 10 files. I've already developed a single-threaded version of my program:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class TextFileSingleThreaded {

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("Invalid Input!");
            return; // nothing can be done without both arguments
        }

        // first argument is the file path
        File file = new File(args[0]);

        // second argument is the number of lines per chunk:
        // each smaller file will have numLinesPerChunk lines
        int numLinesPerChunk = Integer.parseInt(args[1]);

        long start = System.currentTimeMillis();

        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line = reader.readLine();
            for (int i = 1; line != null; i++) {
                // one writer per chunk, closed before the next chunk is opened
                try (PrintWriter writer = new PrintWriter(new FileWriter(args[0] + "_part" + i + ".txt"))) {
                    for (int j = 0; j < numLinesPerChunk && line != null; j++) {
                        writer.println(line);
                        line = reader.readLine();
                    }
                }
            }
        } catch (IOException e) {
            // also covers FileNotFoundException when the input file is missing
            e.printStackTrace();
        }

        long end = System.currentTimeMillis();
        System.out.println("Taken time[sec]:");
        System.out.println((end - start) / 1000);
    }
}
I want to write a multithreaded version of this program but I don't know how to read a file beginning from a specified line. Help me please. :(
Answered by Gray
I want to write a multithreaded version of this program but I don't know how to read a file beginning from a specified line. Help me please. :(
I would not, as this implies, have each thread read from the beginning of the file and ignore lines until it reaches its portion of the input file. This is highly inefficient. As you imply, a reader has to read all of the prior lines if the file is going to be divided into chunks by line. This means a whole bunch of duplicate read IO, which will result in a much slower application.
You could instead have 1 reader and N writers. The reader would add the lines to be written to some sort of BlockingQueue per writer. The problem with this is that chances are you won't get any concurrency. Most likely only one writer will be working at a time while the rest of the writers wait for the reader to reach their part of the input file. Also, if the reader is faster than the writers (which is likely), then you could easily run out of memory queueing up all of the lines if the file to be divided is large. You could use a size-limited blocking queue, which means the reader may block waiting for the writers, but again, multiple writers will most likely not be running at the same time.
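To make the 1-reader/N-writers idea concrete, here is a minimal sketch, not a recommended design. The class name QueueSplitter, the Item helper type, the END sentinel, the queue capacity of 1024, the round-robin assignment of chunks to writers, and the use of UTF-8 are all assumptions made for illustration; only the "_part<i>.txt" naming comes from the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSplitter {

    /** One input line tagged with the chunk it belongs to (hypothetical helper type). */
    static final class Item {
        final int chunk;
        final String line;
        Item(int chunk, String line) { this.chunk = chunk; this.line = line; }
    }

    // Sentinel that tells a writer there is no more input.
    static final Item END = new Item(-1, null);

    public static void split(Path input, int linesPerChunk, int numWriters)
            throws IOException, InterruptedException {
        List<BlockingQueue<Item>> queues = new ArrayList<>();
        List<Thread> writers = new ArrayList<>();
        for (int w = 0; w < numWriters; w++) {
            // Bounded queue: the reader blocks instead of buffering the whole file.
            BlockingQueue<Item> queue = new ArrayBlockingQueue<>(1024);
            queues.add(queue);
            Thread t = new Thread(() -> writeChunks(input, queue));
            writers.add(t);
            t.start();
        }
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            for (long n = 0; (line = reader.readLine()) != null; n++) {
                int chunk = (int) (n / linesPerChunk);
                // Chunks are dealt out to the writers round-robin (an assumption).
                queues.get(chunk % numWriters).put(new Item(chunk, line));
            }
        } finally {
            for (BlockingQueue<Item> queue : queues) {
                queue.put(END);
            }
        }
        for (Thread t : writers) {
            t.join();
        }
    }

    private static void writeChunks(Path input, BlockingQueue<Item> queue) {
        BufferedWriter out = null;
        int current = -1;
        boolean failed = false;
        try {
            for (Item item = queue.take(); item != END; item = queue.take()) {
                if (failed) {
                    continue; // keep draining so the reader is never blocked forever
                }
                try {
                    if (item.chunk != current) {
                        if (out != null) {
                            out.close();
                        }
                        current = item.chunk;
                        out = Files.newBufferedWriter(
                                Paths.get(input + "_part" + (current + 1) + ".txt"),
                                StandardCharsets.UTF_8);
                    }
                    out.write(item.line);
                    out.newLine();
                } catch (IOException e) {
                    e.printStackTrace();
                    failed = true;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            if (out != null) {
                try { out.close(); } catch (IOException ignored) { }
            }
        }
    }
}
```

A call such as QueueSplitter.split(Paths.get("big.txt"), 10, 4) would produce big.txt_part1.txt, big.txt_part2.txt, and so on. Because the reader fills one chunk at a time, the writers mostly take turns, which is the lack of concurrency described above.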
As mentioned in the comments, the most efficient way of doing this is single-threaded because of these restrictions. If you are doing this as an exercise, then it sounds like you will need to read through the file once, note the start and end positions in the file for each of the output files, and then fork the threads with those locations so they can re-read the file and write it into their separate output files in parallel without a lot of line buffering.
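The two-pass approach described here could look roughly like the following sketch. It is one possible reading of the idea, not Gray's own code: the class name OffsetSplitter and its method names are hypothetical, and it assumes lines end in "\n" or "\r\n" in an ASCII-compatible encoding such as UTF-8, so that chunk boundaries can be found by counting newline bytes. The byte ranges are then copied with FileChannel.transferTo, which avoids buffering lines at all.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OffsetSplitter {

    /** First pass: byte offset where each chunk starts, with the file length as the last entry. */
    public static List<Long> chunkOffsets(Path input, int linesPerChunk) throws IOException {
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);
        long position = 0;
        long linesInChunk = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(input))) {
            int b;
            while ((b = in.read()) != -1) {
                position++;
                if (b == '\n' && ++linesInChunk == linesPerChunk) {
                    offsets.add(position); // the next chunk starts right after this newline
                    linesInChunk = 0;
                }
            }
        }
        // Close the final, possibly shorter, chunk unless the file ended exactly on a boundary.
        if (offsets.get(offsets.size() - 1) != position) {
            offsets.add(position);
        }
        return offsets;
    }

    /** Second pass: each chunk is copied by its own task straight from its byte range. */
    public static void split(Path input, int linesPerChunk, int threads) throws Exception {
        List<Long> offsets = chunkOffsets(input, linesPerChunk);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> results = new ArrayList<>();
            for (int i = 0; i + 1 < offsets.size(); i++) {
                long start = offsets.get(i);
                long end = offsets.get(i + 1);
                Path output = Paths.get(input + "_part" + (i + 1) + ".txt");
                Callable<Void> task = () -> {
                    copyRange(input, output, start, end);
                    return null;
                };
                results.add(pool.submit(task));
            }
            for (Future<Void> result : results) {
                result.get(); // rethrows any failure from the worker
            }
        } finally {
            pool.shutdown();
        }
    }

    /** Copies the bytes in [start, end) of input into a fresh output file. */
    public static void copyRange(Path input, Path output, long start, long end)
            throws IOException {
        try (FileChannel in = FileChannel.open(input, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(output, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = start;
            while (position < end) {
                // transferTo may move fewer bytes than requested, so loop until done
                position += in.transferTo(position, end - position, out);
            }
        }
    }
}
```

One difference from the single-threaded version in the question: bytes are copied verbatim, so if the input does not end with a newline, the last output file will not end with one either. Whether the parallel copy is actually faster than a single thread depends on the disk, which is the caveat raised above.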
Answered by Paddle
You only need to read your file once and store it in a List:
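import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

List<String> list = new ArrayList<String>();
// try-with-resources closes the reader even if reading fails
try (BufferedReader br = new BufferedReader(new FileReader(new File("yourfile")))) {
    String line;
    // for each line of your file
    while ((line = br.readLine()) != null) {
        list.add(line);
    }
}

// then you can split your list into different parts
int linesPerPart = 10;
List<List<String>> parts = new ArrayList<List<String>>();
for (int i = 0; i < list.size(); i += linesPerPart) {
    // copy at most linesPerPart lines; the last part may be shorter
    int end = Math.min(i + linesPerPart, list.size());
    parts.add(new ArrayList<String>(list.subList(i, end)));
}
// for a 100-line file you now have 10 lists which each contain 10 lines
// you still need to create a thread pool, where each thread writes one list to a file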
For more information about thread pools, follow the link in the original answer.
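The remaining thread-pool step might be sketched as follows. This is an illustration under assumptions rather than part of the original answer: the class name PartWriter, the writeParts method, the prefix parameter, and the UTF-8 encoding are all hypothetical choices, and parts stands for the list of lists built above.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartWriter {

    /** Writes parts.get(i) to "<prefix>_part<i+1>.txt", one pool task per part. */
    public static void writeParts(List<List<String>> parts, String prefix, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> results = new ArrayList<>();
            for (int i = 0; i < parts.size(); i++) {
                List<String> part = parts.get(i);
                Path target = Paths.get(prefix + "_part" + (i + 1) + ".txt");
                // Files.write returns the path it wrote, so the task is a Callable<Path>
                Callable<Path> task = () -> Files.write(target, part, StandardCharsets.UTF_8);
                results.add(pool.submit(task));
            }
            for (Future<Path> result : results) {
                result.get(); // wait for the task and surface any IOException
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, PartWriter.writeParts(parts, "yourfile", 4) would write yourfile_part1.txt, yourfile_part2.txt, and so on using four threads. Keep in mind that this approach holds the entire input file in memory, so it only suits files that fit comfortably in the heap.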