Read large txt file multithreaded?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, note the original address, and attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/17188357/

Asked by obdgy
I have a large txt file with 100,000 lines. I need to start n threads and give every thread a unique line from this file.
What is the best way to do this? I think I need to read the file line by line, and the iterator must be global so I can lock it. Loading the text file into a list would be time-consuming, and I could get an OutOfMemoryException. Any ideas?
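For reference, a minimal sketch of that locked-iterator idea (an editorial illustration, not from the post; the file name, thread count, and per-line work are all assumptions):

using System;
using System.IO;
using System.Threading;

class LockedReaderSketch
{
    static readonly object Gate = new object();

    static void Main()
    {
        int n = 4; // assumed thread count
        using (var reader = new StreamReader("file.txt"))
        {
            var threads = new Thread[n];
            for (int i = 0; i < n; i++)
            {
                threads[i] = new Thread(() =>
                {
                    string line;
                    while ((line = NextLine(reader)) != null)
                    {
                        // each thread receives a unique line here
                    }
                });
                threads[i].Start();
            }
            foreach (var t in threads) t.Join();
        }
    }

    // Serializes access to the shared reader so every line is handed out exactly once.
    static string NextLine(StreamReader reader)
    {
        lock (Gate) return reader.ReadLine();
    }
}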
Answered by dasblinkenlight
Read the file on one thread, adding its lines to a blocking queue. Start N tasks reading from that queue. Set a maximum size on the queue to prevent out-of-memory errors.
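A minimal sketch of this pattern (not from the original answer; the queue bound of 1000, the consumer count of 4, the file name, and the Process method are illustrative assumptions):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class BoundedQueueExample
{
    static void Main()
    {
        int n = 4; // number of consumer tasks
        // A bounded capacity makes the reader block when consumers fall
        // behind, which caps memory use.
        var queue = new BlockingCollection<string>(boundedCapacity: 1000);

        var producer = Task.Run(() =>
        {
            foreach (var line in File.ReadLines("file.txt"))
                queue.Add(line);     // blocks while the queue is full
            queue.CompleteAdding();  // signal consumers that no more lines are coming
        });

        var consumers = Enumerable.Range(0, n)
            .Select(_ => Task.Run(() =>
            {
                foreach (var line in queue.GetConsumingEnumerable())
                    Process(line);   // each line goes to exactly one consumer
            }));

        Task.WaitAll(new[] { producer }.Concat(consumers).ToArray());
    }

    static void Process(string line) { /* hypothetical per-line work */ }
}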
Answered by dtb
You can use the File.ReadLines method to read the file line by line without loading the whole file into memory at once, and the Parallel.ForEach method to process the lines on multiple threads in parallel:
Parallel.ForEach(File.ReadLines("file.txt"), (line, _, lineNumber) =>
{
// your code here
});
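Since the question asks for a fixed number of threads, a hedged variant of the same call (an editorial addition, not part of the original answer) can cap concurrency through ParallelOptions:

// MaxDegreeOfParallelism caps the number of concurrent workers; 4 is an arbitrary example value.
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

Parallel.ForEach(File.ReadLines("file.txt"), options, line =>
{
    // your code here
});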
Answered by Daan Timmer
Something like:
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public class ParallelReadExample
{
    public static IEnumerable<string> LineGenerator(StreamReader sr)
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            yield return line;
        }
    }

    static void Main()
    {
        using (StreamReader sr = new StreamReader("yourfile.txt"))
        {
            Parallel.ForEach(LineGenerator(sr), currentLine =>
            {
                // Do your thing with currentLine here...
            });
        }
    }
}
I think it would work (I have no C# compiler/IDE here to verify).
Answered by Matthew Watson
If you want to limit the number of threads to n, the easiest way is to use AsParallel() along with WithDegreeOfParallelism(n) to limit the thread count:
string filename = @"C:\TEST\TEST.DATA";
int n = 5;
foreach (var line in File.ReadLines(filename).AsParallel().WithDegreeOfParallelism(n))
{
// Process line.
}
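One caveat (an editorial note, not from the original answer): with a plain foreach, the loop body itself runs on the single consuming thread, since PLINQ parallelizes only the query operators. If the per-line processing should itself run on up to n threads, ForAll is the usual way to express that:

File.ReadLines(filename)
    .AsParallel()
    .WithDegreeOfParallelism(n)
    .ForAll(line =>
    {
        // Process line; this body runs on up to n worker threads.
    });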
Answered by Matthew Watson
As @dtb mentioned above, the fastest way to read a file and then process its individual lines is to: 1) read the whole file into an array with File.ReadAllLines(), and 2) use a Parallel.For loop to iterate over the array.
You can read more performance benchmarks here.
The basic gist of the code you would have to write is:
string[] AllLines = File.ReadAllLines(fileName);
Parallel.For(0, AllLines.Length, x =>
{
DoStuff(AllLines[x]);
//whatever you need to do
});
With the introduction of bigger array sizes in .NET 4, as long as you have plenty of memory, this shouldn't be an issue.
Answered by Jake Drew
After performing my own benchmarks, loading 61,277,203 lines into memory and shoving the values into a Dictionary / ConcurrentDictionary(), the results seem to support @dtb's answer above: the following approach is the fastest:
Parallel.ForEach(File.ReadLines(catalogPath), line =>
{
});
My tests also showed the following:
- File.ReadAllLines() and File.ReadAllLines().AsParallel() appear to run at almost exactly the same speed on a file of this size. Looking at my CPU activity, it appears they both use only two of my 8 cores?
- Reading all the data first using File.ReadAllLines() appears to be much slower than using File.ReadLines() in a Parallel.ForEach() loop.
- I also tried a producer/consumer or MapReduce-style pattern, where one thread was used to read the data and a second thread was used to process it. This did not seem to outperform the simple pattern above either.
I have included an example of this pattern for reference, since it is not included on this page:
var inputLines = new BlockingCollection<string>();
ConcurrentDictionary<int, int> catalog = new ConcurrentDictionary<int, int>();

// Producer: reads the file and feeds lines into the blocking collection.
var readLines = Task.Factory.StartNew(() =>
{
    foreach (var line in File.ReadLines(catalogPath))
        inputLines.Add(line);

    inputLines.CompleteAdding();
});

// Consumer: processes lines in parallel as they become available.
var processLines = Task.Factory.StartNew(() =>
{
    Parallel.ForEach(inputLines.GetConsumingEnumerable(), line =>
    {
        string[] lineFields = line.Split('\t');
        int genomicId = int.Parse(lineFields[3]);
        int taxId = int.Parse(lineFields[0]);
        catalog.TryAdd(genomicId, taxId);
    });
});

Task.WaitAll(readLines, processLines);
Here are my benchmarks:

[benchmark chart image from the original answer]
I suspect that under certain processing conditions the producer/consumer pattern might outperform the simple Parallel.ForEach(File.ReadLines()) pattern. However, it did not in this situation.