How fast is D compared to C++?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/5142366/

Tags: c++, performance, runtime, d

Asked by Lars

I like some features of D, but would be interested if they come with a runtime penalty?

To compare, I implemented a simple program that computes scalar products of many short vectors both in C++ and in D. The result is surprising:

  • D: 18.9 s [see below for final runtime]
  • C++: 3.8 s

Is C++ really almost five times as fast or did I make a mistake in the D program?

I compiled the C++ version with g++ -O3 (gcc-snapshot 2011-02-19) and the D version with dmd -O (dmd 2.052) on a moderately recent Linux desktop. The results are reproducible over several runs, and the standard deviations are negligible.

Here is the C++ program:

#include <iostream>
#include <random>
#include <chrono>
#include <string>

#include <vector>
#include <array>

typedef std::chrono::duration<long, std::ratio<1, 1000>> millisecs;
template <typename _T>
long time_since(std::chrono::time_point<_T>& time) {
  long tm = std::chrono::duration_cast<millisecs>(std::chrono::system_clock::now() - time).count();
  time = std::chrono::system_clock::now();
  return tm;
}

const long N = 20000;
const int size = 10;

typedef int value_type;
typedef long long result_type;
typedef std::vector<value_type> vector_t;
typedef typename vector_t::size_type size_type;

inline value_type scalar_product(const vector_t& x, const vector_t& y) {
  value_type res = 0;
  size_type siz = x.size();
  for (size_type i = 0; i < siz; ++i)
    res += x[i] * y[i];
  return res;
}

int main() {
  auto tm_before = std::chrono::system_clock::now();

  // 1. allocate and fill randomly many short vectors
  vector_t* xs = new vector_t [N];
  for (int i = 0; i < N; ++i) {
    xs[i] = vector_t(size);
  }
  std::cerr << "allocation: " << time_since(tm_before) << " ms" << std::endl;

  std::mt19937 rnd_engine;
  std::uniform_int_distribution<value_type> runif_gen(-1000, 1000);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < size; ++j)
      xs[i][j] = runif_gen(rnd_engine);
  std::cerr << "random generation: " << time_since(tm_before) << " ms" << std::endl;

  // 2. compute all pairwise scalar products:
  time_since(tm_before);
  result_type avg = 0;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) 
      avg += scalar_product(xs[i], xs[j]);
  avg = avg / N*N;
  auto time = time_since(tm_before);
  std::cout << "result: " << avg << std::endl;
  std::cout << "time: " << time << " ms" << std::endl;
}

And here is the D version:

import std.stdio;
import std.datetime;
import std.random;

const long N = 20000;
const int size = 10;

alias int value_type;
alias long result_type;
alias value_type[] vector_t;
alias uint size_type;

value_type scalar_product(const ref vector_t x, const ref vector_t y) {
  value_type res = 0;
  size_type siz = x.length;
  for (size_type i = 0; i < siz; ++i)
    res += x[i] * y[i];
  return res;
}

int main() {   
  auto tm_before = Clock.currTime();

  // 1. allocate and fill randomly many short vectors
  vector_t[] xs;
  xs.length = N;
  for (int i = 0; i < N; ++i) {
    xs[i].length = size;
  }
  writefln("allocation: %i ", (Clock.currTime() - tm_before));
  tm_before = Clock.currTime();

  for (int i = 0; i < N; ++i)
    for (int j = 0; j < size; ++j)
      xs[i][j] = uniform(-1000, 1000);
  writefln("random: %i ", (Clock.currTime() - tm_before));
  tm_before = Clock.currTime();

  // 2. compute all pairwise scalar products:
  result_type avg = cast(result_type) 0;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) 
      avg += scalar_product(xs[i], xs[j]);
  avg = avg / N*N;
  writefln("result: %d", avg);
  auto time = Clock.currTime() - tm_before;
  writefln("scalar products: %i ", time);

  return 0;
}

Accepted answer by Vladimir Panteleev

To enable all optimizations and disable all safety checks, compile your D program with the following DMD flags:

-O -inline -release -noboundscheck

EDIT: I've tried your programs with g++, dmd and gdc. dmd does lag behind, but gdc achieves performance very close to g++. The command line I used was gdmd -O -release -inline (gdmd is a wrapper around gdc which accepts dmd options).

Looking at the assembler listing, it looks like neither dmd nor gdc inlined scalar_product, but g++/gdc did emit MMX instructions, so they might be auto-vectorizing the loop.

Answer by dsimcha

One big thing that slows D down is a subpar garbage collection implementation. Benchmarks that don't heavily stress the GC will show very similar performance to C and C++ code compiled with the same compiler backend. Benchmarks that do heavily stress the GC will show that D performs abysmally. Rest assured, though, this is a single (albeit severe) quality-of-implementation issue, not a baked-in guarantee of slowness. Also, D gives you the ability to opt out of GC and tune memory management in performance-critical bits, while still using it in the less performance-critical 95% of your code.

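To make the "opt out" idea concrete, here is a minimal sketch (not code from this thread; the function name is made up) of pausing the collector around a hot section and managing one buffer manually through the C heap, using the standard druntime modules core.memory and core.stdc.stdlib:

import core.memory : GC;
import core.stdc.stdlib : free, malloc;

void hotPath()
{
    GC.disable();             // no collections will run inside this section
    scope(exit) GC.enable();  // re-enable the collector on the way out

    // Manually managed buffer, never scanned or freed by the GC.
    auto buf = cast(int*) malloc(1024 * int.sizeof);
    assert(buf !is null);
    scope(exit) free(buf);

    foreach (i; 0 .. 1024)
        buf[i] = i;
}

void main()
{
    hotPath();
}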

I've put some effort into improving GC performance lately, and the results have been rather dramatic, at least on synthetic benchmarks. Hopefully these changes will be integrated into one of the next few releases and will mitigate the issue.

Answer by Andrei Alexandrescu

This is a very instructive thread, thanks for all the work to the OP and helpers.

One note - this test is not assessing the general question of abstraction/feature penalty or even that of backend quality. It focuses on virtually one optimization (loop optimization). I think it's fair to say that gcc's backend is somewhat more refined than dmd's, but it would be a mistake to assume that the gap between them is as large for all tasks.

Answer by Erich Gubler

Definitely seems like a quality-of-implementation issue.

I ran some tests with the OP's code and made some changes. I actually got D going faster for LDC/clang++, operating on the assumption that arrays must be allocated dynamically (xs and associated scalars). See below for some numbers.

Questions for the OP

Is it intentional that the same seed be used for each iteration of C++, while not so for D?

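For reference, pinning D's generator to a fixed seed (so it behaves like the default-constructed std::mt19937 in the C++ version) would look roughly like the sketch below; this is only an illustration and is not part of the benchmarked code:

import std.random : Mt19937, uniform;
import std.stdio : writeln;

void main()
{
    auto gen = Mt19937(42);              // fixed seed -> same sequence every run
    writeln(uniform(-1000, 1000, gen));  // pass the engine explicitly
    // Without an explicit engine, uniform() draws from the thread-local rndGen,
    // which is seeded unpredictably, so each run fills the vectors differently.
}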

Setup

I have tweaked the original D source (dubbed scalar.d) to make it portable between platforms. This only involved changing the type of the numbers used to access and modify the size of arrays.

After this, I made the following changes:

  • Used uninitializedArray to avoid default inits for scalars in xs (probably made the biggest difference). This is important because D normally default-inits everything silently, which C++ does not.

  • Factored out the printing code and replaced writefln with writeln

  • Changed imports to be selective
  • Used the pow operator (^^) instead of manual multiplication for the final step of calculating the average
  • Removed size_type and replaced it appropriately with the new index_type alias

...thus resulting in scalar2.d (pastebin):

    import std.stdio : writeln;
    import std.datetime : Clock, Duration;
    import std.array : uninitializedArray;
    import std.random : uniform;

    alias result_type = long;
    alias value_type = int;
    alias vector_t = value_type[];
    alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint

    immutable long N = 20000;
    immutable int size = 10;

    // Replaced for loops with appropriate foreach versions
    value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
      value_type res = 0;
      for(index_type i = 0; i < size; ++i)
        res += x[i] * y[i];
      return res;
    }

    int main() {
      auto tm_before = Clock.currTime;
      auto countElapsed(in string taskName) { // Factor out printing code
        writeln(taskName, ": ", Clock.currTime - tm_before);
        tm_before = Clock.currTime;
      }

      // 1. allocate and fill randomly many short vectors
      vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
      for(index_type i = 0; i < N; ++i)
        xs[i] = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
      countElapsed("allocation");

      for(index_type i = 0; i < N; ++i)
        for(index_type j = 0; j < size; ++j)
          xs[i][j] = uniform(-1000, 1000);
      countElapsed("random");

      // 2. compute all pairwise scalar products:
      result_type avg = 0;
      for(index_type i = 0; i < N; ++i)
        for(index_type j = 0; j < N; ++j)
          avg += scalar_product(xs[i], xs[j]);
      avg /= N ^^ 2;// Replace manual multiplication with pow operator
      writeln("result: ", avg);
      countElapsed("scalar products");

      return 0;
    }

After testing scalar2.d (which prioritized optimization for speed), out of curiosity I replaced the loops in main with foreach equivalents, and called it scalar3.d (pastebin):

    import std.stdio : writeln;
    import std.datetime : Clock, Duration;
    import std.array : uninitializedArray;
    import std.random : uniform;

    alias result_type = long;
    alias value_type = int;
    alias vector_t = value_type[];
    alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint

    immutable long N = 20000;
    immutable int size = 10;

    // Replaced for loops with appropriate foreach versions
    value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
      value_type res = 0;
      for(index_type i = 0; i < size; ++i)
        res += x[i] * y[i];
      return res;
    }

    int main() {
      auto tm_before = Clock.currTime;
      auto countElapsed(in string taskName) { // Factor out printing code
        writeln(taskName, ": ", Clock.currTime - tm_before);
        tm_before = Clock.currTime;
      }

      // 1. allocate and fill randomly many short vectors
      vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
      foreach(ref x; xs)
        x = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
      countElapsed("allocation");

      foreach(ref x; xs)
        foreach(ref val; x)
          val = uniform(-1000, 1000);
      countElapsed("random");

      // 2. compute all pairwise scalar products:
      result_type avg = 0;
      foreach(const ref x; xs)
        foreach(const ref y; xs)
          avg += scalar_product(x, y);
      avg /= N ^^ 2;// Replace manual multiplication with pow operator
      writeln("result: ", avg);
      countElapsed("scalar products");

      return 0;
    }

I compiled each of these tests using an LLVM-based compiler, since LDC seems to be the best option for D compilation in terms of performance. On my x86_64 Arch Linux installation I used the following packages:

  • clang 3.6.0-3
  • ldc 1:0.15.1-4
  • dtools 2.067.0-2

I used the following commands to compile each:

  • C++: clang++ scalar.cpp -o"scalar.cpp.exe" -std=c++11 -O3
  • D: rdmd --compiler=ldc2 -O3 -boundscheck=off <sourcefile>

Results

The results (screenshots of the raw console output) of each version of the source are as follows:

  1. scalar.cpp (original C++):

    allocation: 2 ms
    
    random generation: 12 ms
    
    result: 29248300000
    
    time: 2582 ms
    

    C++ sets the standard at 2582 ms.

  2. scalar.d (modified OP source):

    allocation: 5 ms, 293 μs, and 5 hnsecs 
    
    random: 10 ms, 866 μs, and 4 hnsecs 
    
    result: 53237080000
    
    scalar products: 2 secs, 956 ms, 513 μs, and 7 hnsecs 
    

    This ran for ~2957 ms. Slower than the C++ implementation, but not too much.

  3. scalar2.d (index/length type change and uninitializedArray optimization):

    allocation: 2 ms, 464 μs, and 2 hnsecs
    
    random: 5 ms, 792 μs, and 6 hnsecs
    
    result: 59
    
    scalar products: 1 sec, 859 ms, 942 μs, and 9 hnsecs
    

    In other words, ~1860 ms. So far this is in the lead.

  4. scalar3.d (foreaches):

    allocation: 2 ms, 911 μs, and 3 hnsecs
    
    random: 7 ms, 567 μs, and 8 hnsecs
    
    result: 189
    
    scalar products: 2 secs, 182 ms, and 366 μs
    

    ~2182 ms is slower than scalar2.d, but faster than the C++ version.

Conclusion

With the correct optimizations, the D implementation actually went faster than its equivalent C++ implementation using the LLVM-based compilers available. The current gap between D and C++ for most applications seems only to be based on limitations of current implementations.

Answer by Trass3r

dmd is the reference implementation of the language and thus most work is put into the frontend to fix bugs rather than optimizing the backend.

"in" is faster in your case cause you are using dynamic arrays which are reference types. With ref you introduce another level of indirection (which is normally used to alter the array itself and not only the contents).

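To illustrate (this sketch is not from the answer; the identifiers are invented): a D dynamic array is itself just a (pointer, length) pair, so "in" passes those two words directly, while "const ref" hands the callee a pointer to the slice and introduces that extra level of indirection:

alias vector_t = int[];

// "in" (const, by value): the two-word slice is copied; the elements are not.
int sumByValue(in vector_t x)
{
    int s = 0;
    foreach (v; x)
        s += v;
    return s;
}

// "const ref": the callee gets a pointer to the slice itself - one more
// indirection, only needed if it must rebind or resize the caller's array.
int sumByRef(const ref vector_t x)
{
    int s = 0;
    foreach (v; x)
        s += v;
    return s;
}

void main()
{
    vector_t xs = [1, 2, 3];
    assert(sumByValue(xs) == sumByRef(xs));
}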

Vectors are usually implemented with structs, where const ref makes perfect sense. See smallptD vs. smallpt for a real-world example featuring loads of vector operations and randomness.

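By contrast, a struct-based vector is a plain value type, which is the case where const ref pays off; a minimal illustration (the type and function below are invented, not taken from smallptD):

struct Vec3
{
    double x, y, z;
}

// const ref avoids copying the whole struct on every call.
double dot(const ref Vec3 a, const ref Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

void main()
{
    auto a = Vec3(1, 2, 3);
    auto b = Vec3(4, 5, 6);
    assert(dot(a, b) == 32);
}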

Note that 64-Bit can also make a difference. I once missed that on x64 gcc compiles 64-Bit code while dmd still defaults to 32 (will change when the 64-Bit codegen matures). There was a remarkable speedup with "dmd -m64 ...".

Answer by Jonathan M Davis

Whether C++ or D is faster is likely to be highly dependent on what you're doing. I would think that when comparing well-written C++ to well-written D code, they would generally either be of similar speed, or C++ would be faster, but what the particular compiler manages to optimize could have a big effect completely aside from the language itself.

However, there are a few cases where D stands a good chance of beating C++ for speed. The main one which comes to mind would be string processing. Thanks to D's array slicing capabilities, strings (and arrays in general) can be processed much faster than you can readily do in C++. For D1, Tango's XML processor is extremely fast, thanks primarily to D's array slicing capabilities (and hopefully D2 will have a similarly fast XML parser once the one that's currently being worked on for Phobos has been completed). So, ultimately whether D or C++ is going to be faster is going to be very dependent on what you're doing.

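As a small illustration of why slicing helps (a sketch of the general idea, not code from the answer): a slice is just a window over the original buffer, so splitting a string into pieces costs no allocation and no copying:

import std.stdio : writeln;
import std.string : indexOf;

void main()
{
    string line = "key=value";
    auto eq = line.indexOf('=');
    string key = line[0 .. eq];        // no copy - still points into `line`
    string val = line[eq + 1 .. $];    // no copy either
    writeln(key, " -> ", val);
    assert(key.ptr is line.ptr);       // same underlying memory
}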

Now, I am surprised that you're seeing such a difference in speed in this particular case, but it is the sort of thing that I would expect to improve as dmd improves. Using gdc might yield better results and would likely be a closer comparison of the language itself (rather than the backend), given that it's gcc-based. But it wouldn't surprise me at all if there are a number of things which could be done to speed up the code that dmd generates. I don't think that there's much question that gcc is more mature than dmd at this point. And code optimizations are one of the prime fruits of code maturity.

Ultimately, what matters is how well dmd performs for your particular application, but I do agree that it would definitely be nice to know how well C++ and D compare in general. In theory, they should be pretty much the same, but it really depends on the implementation. I think that a comprehensive set of benchmarks would be required to really test how well the two presently compare however.

Answer by BCS

You can write C code in D, so as far as which is faster, it will depend on a lot of things:

  • What compiler you use
  • What feature you use
  • How aggressively you optimize

Differences in the first aren't fair to drag in. The second might give C++ an advantage as it, if anything, has fewer heavy features. The third is the fun one: D code in some ways is easier to optimize because in general it is easier to understand. It also has the ability to do a large degree of generative programming, allowing things like verbose and repetitive but fast code to be written in shorter forms.

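As one hedged example of that generative angle (it needs a much newer compiler than the dmd 2.052 discussed here, since it uses static foreach, and the names are illustrative): a template can expand a fixed-length dot product into straight-line code at compile time, so the verbose-but-fast unrolled version never has to be written by hand:

// Expands into n unrolled multiply-adds at compile time.
int dotUnrolled(size_t n)(const int[n] x, const int[n] y)
{
    int res = 0;
    static foreach (i; 0 .. n)
    {
        res += x[i] * y[i];
    }
    return res;
}

void main()
{
    int[10] a = 1;   // all ten elements set to 1
    int[10] b = 2;   // all ten elements set to 2
    assert(dotUnrolled!10(a, b) == 20);
}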

Answer by GManNickG

Seems like a quality of implementation issue. For example, here's what I've been testing with:

import std.datetime, std.stdio, std.random;

version = ManualInline;

immutable N = 20000;
immutable Size = 10;

alias int value_type;
alias long result_type;
alias value_type[] vector_type;

result_type scalar_product(in vector_type x, in vector_type y)
in
{
    assert(x.length == y.length);
}
body
{
    result_type result = 0;

    foreach(i; 0 .. x.length)
        result += x[i] * y[i];

    return result;
}

void main()
{   
    auto startTime = Clock.currTime();

    // 1. allocate vectors
    vector_type[] vectors = new vector_type[N];
    foreach(ref vec; vectors)
        vec = new value_type[Size];

    auto time = Clock.currTime() - startTime;
    writefln("allocation: %s ", time);
    startTime = Clock.currTime();

    // 2. randomize vectors
    foreach(ref vec; vectors)
        foreach(ref e; vec)
            e = uniform(-1000, 1000);

    time = Clock.currTime() - startTime;
    writefln("random: %s ", time);
    startTime = Clock.currTime();

    // 3. compute all pairwise scalar products
    result_type avg = 0;

    foreach(vecA; vectors)
        foreach(vecB; vectors)
        {
            version(ManualInline)
            {
                result_type result = 0;

                foreach(i; 0 .. vecA.length)
                    result += vecA[i] * vecB[i];

                avg += result;
            }
            else
            {
                avg += scalar_product(vecA, vecB);
            }
        }

    avg = avg / (N * N);

    time = Clock.currTime() - startTime;
    writefln("scalar products: %s ", time);
    writefln("result: %s", avg);
}

With ManualInline defined I get 28 seconds, but without it I get 32. So the compiler isn't even inlining this simple function, which I think it's clear it should be.

(My command line is dmd -O -noboundscheck -inline -release ....)

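A hedged follow-up, not part of the original answer: D frontends released well after the dmd 2.052 used in this thread allow inlining to be requested per function with pragma(inline, true), which at least makes the intent explicit when benchmarking. A minimal sketch:

alias value_type = int;
alias vector_type = value_type[];

pragma(inline, true)  // ask the compiler to inline this function
long scalar_product(in vector_type x, in vector_type y)
{
    long result = 0;
    foreach (i; 0 .. x.length)
        result += x[i] * y[i];
    return result;
}

void main()
{
    assert(scalar_product([1, 2, 3], [4, 5, 6]) == 32);
}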