Read large txt file multithreaded?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, note the original address, and attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/17188357/

Asked by obdgy
I have a large txt file with 100,000 lines. I need to start n threads and give every thread a unique line from this file.
What is the best way to do this? I think I need to read the file line by line, and the iterator must be global so I can lock it. Loading the text file into a list would be time-consuming, and I could get an OutOfMemoryException. Any ideas?
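For reference, a minimal sketch of that locked-iterator idea (an editorial illustration, not from the post; the file name, thread count, and per-line work are all assumptions):

using System;
using System.IO;
using System.Threading;

class LockedReaderSketch
{
    static readonly object Gate = new object();

    static void Main()
    {
        int n = 4; // assumed thread count
        using (var reader = new StreamReader("file.txt"))
        {
            var threads = new Thread[n];
            for (int i = 0; i < n; i++)
            {
                threads[i] = new Thread(() =>
                {
                    string line;
                    while ((line = NextLine(reader)) != null)
                    {
                        // each thread receives a unique line here
                    }
                });
                threads[i].Start();
            }
            foreach (var t in threads) t.Join();
        }
    }

    // Serializes access to the shared reader so every line is handed out exactly once.
    static string NextLine(StreamReader reader)
    {
        lock (Gate) return reader.ReadLine();
    }
}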
Answered by dasblinkenlight
Read the file on one thread, adding its lines to a blocking queue. Start N tasks reading from that queue. Set a maximum size on the queue to prevent out-of-memory errors.
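A minimal sketch of this pattern (not from the original answer; the queue bound of 1000, the consumer count of 4, the file name, and the Process method are illustrative assumptions):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class BoundedQueueExample
{
    static void Main()
    {
        int n = 4; // number of consumer tasks
        // A bounded capacity makes the reader block when consumers fall
        // behind, which caps memory use.
        var queue = new BlockingCollection<string>(boundedCapacity: 1000);

        var producer = Task.Run(() =>
        {
            foreach (var line in File.ReadLines("file.txt"))
                queue.Add(line);     // blocks while the queue is full
            queue.CompleteAdding();  // signal consumers that no more lines are coming
        });

        var consumers = Enumerable.Range(0, n)
            .Select(_ => Task.Run(() =>
            {
                foreach (var line in queue.GetConsumingEnumerable())
                    Process(line);   // each line goes to exactly one consumer
            }));

        Task.WaitAll(new[] { producer }.Concat(consumers).ToArray());
    }

    static void Process(string line) { /* hypothetical per-line work */ }
}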
Answered by dtb
You can use the File.ReadLines method to read the file line by line without loading the whole file into memory at once, and the Parallel.ForEach method to process the lines on multiple threads in parallel:
Parallel.ForEach(File.ReadLines("file.txt"), (line, _, lineNumber) =>
{
// your code here
});
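Since the question asks for a fixed number of threads, a hedged variant of the same call (an editorial addition, not part of the original answer) can cap concurrency through ParallelOptions:

// MaxDegreeOfParallelism caps the number of concurrent workers; 4 is an arbitrary example value.
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

Parallel.ForEach(File.ReadLines("file.txt"), options, line =>
{
    // your code here
});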
Answered by Daan Timmer
Something like:
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public class ParallelReadExample
{
    public static IEnumerable<string> LineGenerator(StreamReader sr)
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            yield return line;
        }
    }

    static void Main()
    {
        using (StreamReader sr = new StreamReader("yourfile.txt"))
        {
            Parallel.ForEach(LineGenerator(sr), currentLine =>
            {
                // Do your thing with currentLine here...
            });
        }
    }
}
I think it would work (I have no C# compiler/IDE here to verify).
Answered by Matthew Watson
If you want to limit the number of threads to n, the easiest way is to use AsParallel() along with WithDegreeOfParallelism(n) to limit the thread count:
string filename = @"C:\TEST\TEST.DATA";
int n = 5;
foreach (var line in File.ReadLines(filename).AsParallel().WithDegreeOfParallelism(n))
{
// Process line.
}
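One caveat (an editorial note, not from the original answer): with a plain foreach, the loop body itself runs on the single consuming thread, since PLINQ parallelizes only the query operators. If the per-line processing should itself run on up to n threads, ForAll is the usual way to express that:

File.ReadLines(filename)
    .AsParallel()
    .WithDegreeOfParallelism(n)
    .ForAll(line =>
    {
        // Process line; this body runs on up to n worker threads.
    });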
Answered by Matthew Watson
As @dtb mentioned above, the fastest way to read a file and then process its individual lines is to: 1) read the whole file into an array with File.ReadAllLines(), and 2) use a Parallel.For loop to iterate over the array.
You can read more performance benchmarks here.
The basic gist of the code you would have to write is:
string[] AllLines = File.ReadAllLines(fileName);
Parallel.For(0, AllLines.Length, x =>
{
DoStuff(AllLines[x]);
//whatever you need to do
});
With the introduction of bigger array sizes in .NET 4, as long as you have plenty of memory, this shouldn't be an issue.
Answered by Jake Drew
After performing my own benchmarks, loading 61,277,203 lines into memory and shoving the values into a Dictionary / ConcurrentDictionary(), the results seem to support @dtb's answer above: the following approach is the fastest:
Parallel.ForEach(File.ReadLines(catalogPath), line =>
{
});
My tests also showed the following:
- File.ReadAllLines() and File.ReadAllLines().AsParallel() appear to run at almost exactly the same speed on a file of this size. Looking at my CPU activity, it appears they both use only two of my 8 cores?
- Reading all the data first using File.ReadAllLines() appears to be much slower than using File.ReadLines() in a Parallel.ForEach() loop.
- I also tried a producer/consumer or MapReduce-style pattern, where one thread was used to read the data and a second thread was used to process it. This did not seem to outperform the simple pattern above either.
I have included an example of this pattern for reference, since it is not included on this page:
var inputLines = new BlockingCollection<string>();
ConcurrentDictionary<int, int> catalog = new ConcurrentDictionary<int, int>();

// Producer: reads the file and feeds lines into the blocking collection.
var readLines = Task.Factory.StartNew(() =>
{
    foreach (var line in File.ReadLines(catalogPath))
        inputLines.Add(line);

    inputLines.CompleteAdding();
});

// Consumer: processes lines in parallel as they become available.
var processLines = Task.Factory.StartNew(() =>
{
    Parallel.ForEach(inputLines.GetConsumingEnumerable(), line =>
    {
        string[] lineFields = line.Split('\t');
        int genomicId = int.Parse(lineFields[3]);
        int taxId = int.Parse(lineFields[0]);
        catalog.TryAdd(genomicId, taxId);
    });
});

Task.WaitAll(readLines, processLines);
Here are my benchmarks:

[benchmark chart image from the original answer]
I suspect that under certain processing conditions the producer/consumer pattern might outperform the simple Parallel.ForEach(File.ReadLines()) pattern. However, it did not in this situation.