C# 使用并行 FOR 循环节省时间

Question

提问by tro

I have a question concerning parallel for loops. I have the following code:

我有一个关于并行 for 循环的问题。我有以下代码：

    public static void MultiplicateArray(double[] array, double factor)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] = array[i] * factor;
        }
    }

    public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
    {
        for (int i = 0; i < arrayToChange.Length; i++)
        {
            arrayToChange[i] = arrayToChange[i] * multiplication[i];
        }
    }

    public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
    {
        for (int i = 0; i < arrayToChange.Length; i++)
        {
            arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
        }
    }

Now I try to add parallel function:

现在我尝试添加并行功能：

    public static void MultiplicateArray(double[] array, double factor)
    {
        Parallel.For(0, array.Length, i =>
            {
                array[i] = array[i] * factor;
            });
    }

    public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
    {
        Parallel.For(0, arrayToChange.Length, i =>
        {
            arrayToChange[i] = arrayToChange[i] * multiplication[i];
        });
    }

    public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
    {
        Parallel.For(0, arrayToChange.Length, i =>
        {
            arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
        });
    }

The issue is, that I want to save time, not to waste it. With the standard for loop it computes about 2 minutes, but with the parallel for loop it takes 3 min. Why?

问题是，我想节省时间，而不是浪费时间。使用标准 for 循环计算大约 2 分钟，但使用并行 for 循环需要 3 分钟。为什么？

Answer 1

采纳答案by svick

Parallel.For()can improve performance a lot by parallelizing your code, but it also has overhead (synchronization between threads, invoking the delegate on each iteration). And since in your code, each iteration is veryshort (basically, just a few CPU instructions), this overhead can become prominent.

Parallel.For()可以通过并行化代码来大大提高性能，但它也有开销（线程之间的同步，在每次迭代时调用委托）。由于在您的代码中，每次迭代都很短（基本上只有几条 CPU 指令），因此这种开销会变得很突出。

Because of this, I thought using Parallel.For()is not the right solution for you. Instead, if you parallelize your code manually (which is very simple in this case), you may see the performance improve.

因此，我认为 usingParallel.For()不是适合您的解决方案。相反，如果您手动并行化您的代码（在这种情况下非常简单），您可能会看到性能提高。

To verify this, I performed some measurements: I ran different implementations of MultiplicateArray()on an array of 200 000 000 items (the code I used is below). On my machine, the serial version consistently took 0.21 s and Parallel.For()usually took something around 0.45 s, but from time to time, it spiked to 8–9 s!

为了验证这一点，我进行了一些测量：我在MultiplicateArray()包含 200 000 000 个项目的数组上运行了不同的实现（我使用的代码如下）。在我的机器上，串行版本始终需要 0.21 秒，Parallel.For()通常需要大约 0.45 秒，但有时会飙升至 8-9 秒！

First, I'll try to improve the common case and I'll come to those spikes later. We want to process the array by NCPUs, so we split it into Nequally sized parts and process each part separately. The result? 0.35 s. That's still worse than the serial version. But forloop over each item in an array is one of the most optimized constructs. Can't we do something to help the compiler? Extracting computing the bound of the loop could help. It turns out it does: 0.18 s. That's better than the serial version, but not by much. And, interestingly, changing the degree of parallelism from 4 to 2 on my 4-core machine (no HyperThreading) doesn't change the result: still 0.18 s. This makes me conclude that the CPU is not the bottleneck here, memory bandwidth is.

首先，我将尝试改进常见情况，稍后再讨论这些峰值。我们想用N 个CPU处理数组，所以我们将它分成N 个大小相等的部分，并分别处理每个部分。结果？0.35 秒。这仍然比串行版本更糟糕。但是for循环遍历数组中的每一项是最优化的构造之一。我们不能做点什么来帮助编译器吗？提取计算循环的界限可能会有所帮助。事实证明确实如此：0.18 秒。这比串行版本好，但不是很多。而且，有趣的是，在我的 4 核机器（没有超线程）上将并行度从 4 更改为 2 并没有改变结果：仍然是 0.18 秒。这让我得出结论，CPU 不是这里的瓶颈，内存带宽才是。

Now, back to the spikes: my custom parallelization doesn't have them, but Parallel.For()does, why? Parallel.For()does use range partitioning, which means each thread processes its own part of the array. But, if one thread finishes early, it will try to help processing the range of another thread that hasn't finished yet. If that happens, you will get a lot of false sharing, which could slow down the code a lot. And my own test with forcing false sharing seems to indicate this could indeed be the problem. Forcing the degree of parallelism of the Parallel.For()seems to help with the spikes a little.

现在，回到峰值：我的自定义并行化没有它们，但是Parallel.For()有，为什么？Parallel.For()确实使用范围分区，这意味着每个线程处理它自己的数组部分。但是，如果一个线程提前完成，它将尝试帮助处理尚未完成的另一个线程的范围。如果发生这种情况，您将获得大量错误共享，这可能会大大减慢代码速度。我自己的强制错误共享测试似乎表明这确实可能是问题所在。强制并行度Parallel.For()似乎对峰值有所帮助。

Of course, all those measurements are specific to the hardware on my computer and will be different for you, so you should make your own measurements.

当然，所有这些测量都特定于我计算机上的硬件，并且因您而异，因此您应该自己进行测量。

The code I used:

我使用的代码：

static void Main()
{
    double[] array = new double[200 * 1000 * 1000];

    for (int i = 0; i < array.Length; i++)
        array[i] = 1;

    for (int i = 0; i < 5; i++)
    {
        Stopwatch sw = Stopwatch.StartNew();
        Serial(array, 2);
        Console.WriteLine("Serial: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelFor(array, 2);
        Console.WriteLine("Parallel.For: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelForDegreeOfParallelism(array, 2);
        Console.WriteLine("Parallel.For (degree of parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallel(array, 2);
        Console.WriteLine("Custom parallel: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMax(array, 2);
        Console.WriteLine("Custom parallel (extracted max): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMaxHalfParallelism(array, 2);
        Console.WriteLine("Custom parallel (extracted max, half parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelFalseSharing(array, 2);
        Console.WriteLine("Custom parallel (false sharing): {0:f2} s", sw.Elapsed.TotalSeconds);
    }
}

static void Serial(double[] array, double factor)
{
    for (int i = 0; i < array.Length; i++)
    {
        array[i] = array[i] * factor;
    }
}

static void ParallelFor(double[] array, double factor)
{
    Parallel.For(
        0, array.Length, i => { array[i] = array[i] * factor; });
}

static void ParallelForDegreeOfParallelism(double[] array, double factor)
{
    Parallel.For(
        0, array.Length, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        i => { array[i] = array[i] * factor; });
}

static void CustomParallel(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelExtractedMax(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelExtractedMaxHalfParallelism(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount / 2;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelFalseSharing(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    int i = -1;

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                int j = Interlocked.Increment(ref i);
                while (j < array.Length)
                {
                    array[j] = array[j] * factor;
                    j = Interlocked.Increment(ref i);
                }
            });
    }

    Task.WaitAll(tasks);
}

Example output:

示例输出：

Serial: 0,20 s
Parallel.For: 0,50 s
Parallel.For (degree of parallelism): 8,90 s
Custom parallel: 0,33 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,53 s
Serial: 0,21 s
Parallel.For: 0,52 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,19 s
Custom parallel (false sharing): 7,59 s
Serial: 0,21 s
Parallel.For: 11,21 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,32 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,76 s
Serial: 0,21 s
Parallel.For: 0,46 s
Parallel.For (degree of parallelism): 0,35 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Serial: 0,21 s
Parallel.For: 0,45 s
Parallel.For (degree of parallelism): 0,40 s
Custom parallel: 0,38 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s

Answer 2

回答by Jordi

Parallel.For involves more complex memory management. That result could vary depending on cpu specs, like #cores, L1 & L2 cache...

Parallel.For 涉及更复杂的内存管理。该结果可能因 CPU 规格而异，例如 #cores、L1 和 L2 缓存...

Please take a look to this interesting article:

请看这篇有趣的文章：

http://msdn.microsoft.com/en-us/magazine/cc872851.aspx

Answer 3

回答by Lesto

from http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.aspxand http://msdn.microsoft.com/en-us/library/dd537608.aspx

来自http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.aspx和http://msdn.microsoft.com/en-us/library/dd537608.aspx

you are not creating three thread/process that execute your for, but the iteration of the for is tryed to be executet in parallel, so even with only one for you are using multiple thread/process.

您不是在创建三个执行 for 的线程/进程，而是尝试并行执行 for 的迭代，因此即使只有一个 for 您也在使用多个线程/进程。

this mean that interation with index = 0 and index = 1 may be executed at the same time.

这意味着可以同时执行与 index = 0 和 index = 1 的交互。

Probabily you are forcing to use too much thread/process, and the overhead for the creation/execution of them is bigger that the speed gain.

可能您被迫使用太多线程/进程，并且创建/执行它们的开销比速度增益更大。

Try to use three normal for but in three different thread/process, if your sistem is multicore (3x at least) it should take less than one minute

尝试使用三个正常但在三个不同的线程/进程中，如果您的系统是多核（至少 3 倍），它应该需要不到一分钟

Answer 4

回答by Kris Vandermotten

See Custom Partitioners for PLINQ and TPL:

请参阅PLINQ 和 TPL 的自定义分区器：

In a Forloop, the body of the loop is provided to the method as a delegate. The cost of invoking that delegate is about the same as a virtual method call. In some scenarios, the body of a parallel loop might be small enough that the cost of the delegate invocation on each loop iteration becomes significant. In such situations, you can use one of the Createoverloads to create an IEnumerable<T>of range partitions over the source elements. Then, you can pass this collection of ranges to a ForEachmethod whose body consists of a regular for loop. The benefit of this approach is that the delegate invocation cost is incurred only once per range, rather than once per element.

在For循环中，循环体作为委托提供给方法。调用该委托的成本与虚拟方法调用大致相同。在某些情况下，并行循环的主体可能足够小，以至于每次循环迭代上的委托调用成本变得很大。在这种情况下，您可以使用Create重载之一IEnumerable<T>在源元素上创建范围分区。然后，您可以将此范围集合传递给ForEach其主体由常规 for 循环组成的方法。这种方法的好处是委托调用成本每个范围只发生一次，而不是每个元素一次。

In your loop body, you are performing a single multiplication, and the overhead of the delegate call will be very noticeable.

在您的循环体中，您正在执行单个乘法，并且委托调用的开销将非常明显。

Try this:

尝试这个：

public static void MultiplicateArray(double[] array, double factor)
{
    var rangePartitioner = Partitioner.Create(0, array.Length);

    Parallel.ForEach(rangePartitioner, range =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            array[i] = array[i] * factor;
        }
    });
}

See also: Parallel.ForEachdocumentationand Partitioner.Createdocumentation.

另请参阅：Parallel.ForEach文档和Partitioner.Create文档。

Answer 5

回答by Roman Reiner

Svick already provided a great answer but I'd like to emphasize that the key point is not to "parallelize your code manually" instead of using Parallel.For()but that you have to process larger chunks of data.

Svick 已经提供了一个很好的答案，但我想强调的是，关键不是“手动并行化您的代码”而不是使用Parallel.For()而是您必须处理更大的数据块。

This can still be done using Parallel.For()like this:

这仍然可以Parallel.For()像这样使用：

static void My(double[] array, double factor)
{
    int degreeOfParallelism = Environment.ProcessorCount;

    Parallel.For(0, degreeOfParallelism, workerId =>
    {
        var max = array.Length * (workerId + 1) / degreeOfParallelism;
        for (int i = array.Length * workerId / degreeOfParallelism; i < max; i++)
            array[i] = array[i] * factor;
    });
}

which does the same thing as svicks CustomParallelExtractedMax()but is shorter, simpler and (on my machine) performs even slightly faster:

它与 svicks 做同样的事情，CustomParallelExtractedMax()但更短、更简单并且（在我的机器上）执行得更快：

Serial: 3,94 s
Parallel.For: 9,28 s
Parallel.For (degree of parallelism): 9,58 s
Custom parallel: 2,05 s
Custom parallel (extracted max): 1,19 s
Custom parallel (extracted max, half parallelism): 1,49 s
Custom parallel (false sharing): 27,88 s
My: 0,95 s

Btw, the keyword for this which is missing from all the other answers is granularity.

顺便说一句，所有其他答案中都缺少的关键字是粒度。

C# 使用并行 FOR 循环节省时间

提问by tro

采纳答案by svick

回答by Jordi

回答by Lesto

回答by Kris Vandermotten

回答by Roman Reiner

相关推荐

最近更新

标签

C# 使用并行 FOR 循环节省时间

提问by tro

采纳答案by svick

回答by Jordi

回答by Lesto

回答by Kris Vandermotten

回答by Roman Reiner

相关推荐

C# IList<T> 和 List<T> 的区别

C# 数据库“DBName”的日志尾部尚未备份

C# 如何将变量作为 CommandParameter 传递

C# 如何在 HttpClient 的 HttpRequestMessage 上设置 cookie

相关推荐

最近更新

标签