.NET's Multi-threading vs Multi-processing: Awful Parallel.ForEach Performance

Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/13671053/
Asked by Saeed Shahrivari
I have coded a very simple "Word Count" program that reads a file and counts each word's occurrence in the file. Here is a part of the code:
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class Alaki
{
    private static List<string> input = new List<string>();

    private static void exec(int threadCount)
    {
        ParallelOptions options = new ParallelOptions();
        options.MaxDegreeOfParallelism = threadCount;
        Parallel.ForEach(Partitioner.Create(0, input.Count), options, (range) =>
        {
            // Each task counts words into its own private dictionary.
            var dic = new Dictionary<string, List<int>>();
            for (int i = range.Item1; i < range.Item2; i++)
            {
                // make some delay!
                //for (int x = 0; x < 400000; x++) ;

                var tokens = input[i].Split();
                foreach (var token in tokens)
                {
                    if (!dic.ContainsKey(token))
                        dic[token] = new List<int>();
                    dic[token].Add(1);
                }
            }
        });
    }

    public static void Main(string[] args)
    {
        // Load the whole file into memory, one line per list entry.
        StreamReader reader = new StreamReader(@"c:\txt-set\agg.txt");
        while (true)
        {
            var line = reader.ReadLine();
            if (line == null)
                break;
            input.Add(line);
        }

        DateTime t0 = DateTime.Now;
        exec(Environment.ProcessorCount);
        Console.WriteLine("Parallel: " + (DateTime.Now - t0));

        t0 = DateTime.Now;
        exec(1);
        Console.WriteLine("Serial: " + (DateTime.Now - t0));
    }
}
It is simple and straightforward. I use a dictionary to count each word's occurrences. The style is roughly based on the MapReduce programming model. As you can see, each task uses its own private dictionary. So, there are NO shared variables; just a bunch of tasks that count words by themselves. Here is the output when the code is run on a quad-core i7 CPU:
Parallel: 00:00:01.6220927
Serial: 00:00:02.0471171
The speedup is only about 1.25, which is a tragedy! But when I add some delay to the processing of each line, I can reach speedup values of about 4.
In the original parallel execution with no delay, CPU utilization hardly reaches 30%, so the speedup is not promising. But when we add some delay, CPU utilization reaches 97%.
At first, I thought the cause was the IO-bound nature of the program (but I think inserting into a dictionary is to some extent CPU intensive), and that seems logical because all of the threads are reading data from a shared memory bus. However, the surprising point is that when I run 4 instances of the serial program (with no delays) simultaneously, CPU utilization rises to about 97% and all four instances finish in about 2.3 seconds!
This means that when the code is run in a multiprocessing configuration, it reaches a speedup of about 3.5, but when it is run in a multithreading configuration, the speedup is only about 1.25.
What do you think? Is there anything wrong with my code? Since I believe there is no shared data at all, the code should not experience any contention. Is there a flaw in .NET's runtime?
Thanks in advance.
Accepted answer by Bradley Grainger
Parallel.For doesn't divide the input into n pieces (where n is the MaxDegreeOfParallelism); instead it creates many small batches and makes sure that at most n are being processed concurrently. (This is so that if one batch takes a very long time to process, Parallel.For can still run work on other threads. See Parallelism in .NET - Part 5, Partitioning of Work for more details.)
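If fewer, larger batches are wanted, Partitioner.Create also accepts an explicit range size. A minimal sketch, reusing the options and input variables from the question (the rangeSize value is an illustrative assumption, not a recommendation):

// Sketch: ask for roughly one large range per core instead of many small batches.
int rangeSize = Math.Max(1, input.Count / Environment.ProcessorCount); // illustrative choice
Parallel.ForEach(Partitioner.Create(0, input.Count, rangeSize), options, (range) =>
{
    // ... same per-range counting loop as in the question ...
});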
Due to this design, your code is creating and throwing away dozens of Dictionary objects, hundreds of List objects, and thousands of String objects. This is putting enormous pressure on the garbage collector.
Running PerfMonitor on my computer reports that 43% of the total run time is spent in GC. If you rewrite your code to use fewer temporary objects, you should see the desired 4x speedup. Some excerpts from the PerfMonitor report follow:
在我的计算机上运行PerfMonitor报告总运行时间的 43% 花费在 GC 上。如果您重写代码以使用更少的临时对象,您应该会看到所需的 4 倍加速。PerfMonitor 报告的一些摘录如下:
Over 10% of the total CPU time was spent in the garbage collector. Most well tuned applications are in the 0-10% range. This is typically caused by an allocation pattern that allows objects to live just long enough to require an expensive Gen 2 collection.
This program had a peak GC heap allocation rate of over 10 MB/sec. This is quite high. It is not uncommon that this is simply a performance bug.
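One possible rewrite along those lines (my sketch, not code from the answer) uses the Parallel.ForEach overload with thread-local state: each worker task reuses a single dictionary and counts with plain ints instead of allocating a List<int> per word, and the per-task results are merged once at the end:

// Sketch: one Dictionary per worker task instead of one per batch.
var total = new ConcurrentDictionary<string, int>();
Parallel.ForEach(
    Partitioner.Create(0, input.Count),
    options,
    () => new Dictionary<string, int>(),            // localInit: fresh per-task state
    (range, loopState, dic) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            foreach (var token in input[i].Split())
            {
                int n;
                dic[token] = dic.TryGetValue(token, out n) ? n + 1 : 1;
            }
        }
        return dic;
    },
    dic =>                                          // localFinally: merge once per task
    {
        foreach (var pair in dic)
            total.AddOrUpdate(pair.Key, pair.Value, (key, old) => old + pair.Value);
    });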
Edit: As per your comment, I will attempt to explain the timings you reported. On my computer, with PerfMonitor, I measured between 43% and 52% of the time spent in GC. For simplicity, let's assume that 50% of the CPU time is work and 50% is GC. Thus, if we make the work 4× faster (through multi-threading) but keep the amount of GC the same (this will happen because the number of batches being processed happened to be the same in the parallel and serial configurations), the best improvement we could get is 62.5% of the original time, or 1.6×.
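Making that arithmetic explicit (the 50/50 split is the simplifying assumption above):

// Amdahl-style bound: half the time is GC (unchanged), half is work (4x faster).
double parallelTime = 0.5 / 4 + 0.5;     // = 0.625 of the original run time
double bestSpeedup = 1.0 / parallelTime; // = 1.6x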
However, we only see a 1.25× speedup because GC isn't multithreaded by default (in workstation GC). As per Fundamentals of Garbage Collection, all managed threads are paused during a Gen 0 or Gen 1 collection. (Concurrent and background GC, in .NET 4 and .NET 4.5, can collect Gen 2 on a background thread.) Your program experiences only a 1.25× speedup (and you see 30% CPU usage overall) because the threads spend most of their time being paused for GC (because the memory allocation pattern of this test program is very poor).
If you enable server GC, it will perform garbage collection on multiple threads. If I do this, the program runs 2× faster (with almost 100% CPU usage).
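For reference, on the .NET Framework server GC is switched on per process via the application's App.config; a minimal sketch:

<configuration>
  <runtime>
    <!-- Use the multi-threaded server GC instead of the default workstation GC. -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>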
When you run four instances of the program simultaneously, each has its own managed heap, and the garbage collection for the four processes can execute in parallel. This is why you see 100% CPU usage (each process is using 100% of one CPU). The slightly longer overall time (2.3s for all vs 2.05s for one) is possibly due to inaccuracies in measurement, contention for the disk, time taken to load the file, having to initialise the threadpool, overhead of context switching, or some other environment factor.
Answered by Henk Holterman
An attempt to explain the results:
- a quick run in the VS profiler shows it's barely reaching 40% CPU utilization.
- String.Split is the main hotspot.
- so a shared something must be blocking the CPU.
- that something is most likely memory allocation. Your bottlenecks are
var dic = new Dictionary<string, List<int>>();
...
dic[token].Add(1);
I replaced this with
var dic = new Dictionary<string, int>();
...
if (!dic.ContainsKey(token)) dic[token] = 1;
else dic[token] += 1;
and the result is closer to a 2x speedup.
But my counter question would be: does it matter? Your code is very artificial and incomplete. The parallel version ends up creating multiple dictionaries without merging them. This is not even close to a real situation. And as you can see, little details do matter.
Your sample code is too complex to make broad statements about Parallel.ForEach().
It is too simple to solve/analyze a real problem.
Answered by Slai
Just for fun, here is a shorter PLINQ version:
File.ReadAllText("big.txt").Split().AsParallel().GroupBy(t => t)
.ToDictionary(g => g.Key, g => g.Count());
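A hypothetical usage sketch (assuming System.IO and System.Linq are imported; "big.txt" is the sample path from the answer), printing the ten most frequent words from the resulting dictionary:

var counts = File.ReadAllText("big.txt").Split().AsParallel().GroupBy(t => t)
    .ToDictionary(g => g.Key, g => g.Count());
foreach (var pair in counts.OrderByDescending(p => p.Value).Take(10))
    Console.WriteLine(pair.Key + ": " + pair.Value);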

