C++ qsort 与 std::sort 的性能?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/4708105/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-28 16:13:54  来源:igfitidea点击:

Performance of qsort vs std::sort?

c++performancesortingstl

提问by Chan

According Scott Meyers, in his Effective STL book - item 46. He claimed that std::sortis about 670% faster than std::qsortdue to the fact of inline. I tested myself, and I saw that qsort is faster :( ! Could anyone help me to explain this strange behavior?

根据 Scott Meyers 的说法,在他的 Effective STL 书中 - 第 46 项。他声称这std::sortstd::qsort由于内联的事实快约 670% 。我测试了自己,我发现 qsort 更快 :( !谁能帮我解释一下这种奇怪的行为?

#include <iostream>
#include <vector>
#include <algorithm>

#include <cstdlib>
#include <ctime>
#include <cstdio>

const size_t LARGE_SIZE = 100000;

struct rnd {
    int operator()() {
        return rand() % LARGE_SIZE;
    }
};

int comp( const void* a, const void* b ) {
    return ( *( int* )a - *( int* )b );
}

int main() {
    int ary[LARGE_SIZE];
    int ary_copy[LARGE_SIZE];
    // generate random data
    std::generate( ary, ary + LARGE_SIZE, rnd() );
    std::copy( ary, ary + LARGE_SIZE, ary_copy );
    // get time
    std::time_t start = std::clock();
    // perform quick sort C using function pointer
    std::qsort( ary, LARGE_SIZE, sizeof( int ), comp );
    std::cout << "C quick-sort time elapsed: " << static_cast<double>( clock() - start ) / CLOCKS_PER_SEC << "\n";
    // get time again
    start = std::clock();
    // perform quick sort C++ using function object
    std::sort( ary_copy, ary_copy + LARGE_SIZE );
    std::cout << "C++ quick-sort time elapsed: " << static_cast<double>( clock() - start ) / CLOCKS_PER_SEC << "\n";
}

This is my result:

这是我的结果:

C quick-sort time elapsed: 0.061
C++ quick-sort time elapsed: 0.086
Press any key to continue . . .

Update

更新

Effective STL 3rd Edition ( 2001 )
Chapter 7 Programming with STL
Item 46: Consider function objects instead of functions as algorithm parameters.

Effective STL 第 3 版 (2001)
第 7 章使用 STL 编程
第 46 项:将函数对象而不是函数视为算法参数。

Best regards,

此致,

回答by Puppy

std::clock() is not a viable timing clock. You should use a platform-specific higher resolution timer, like the Windows High Performance Timer. More than that, the way that you call clock() is that first, text is output to the console, which is included in the time. This definitely invalidates the test. In addition, make sure that you compiled with all optimizations.

std::clock() 不是一个可行的计时时钟。您应该使用特定于平台的更高分辨率计时器,例如 Windows 高性能计时器。更重要的是,您调用 clock() 的方式是首先将文本输出到控制台,该控制台包含在时间中。这肯定会使测试无效。此外,请确保您使用所有优化进行编译。

Finally, I copied and pasted your code, and got 0.016 for qsort and 0.008 for std::sort.

最后,我复制并粘贴了您的代码,qsort 为 0.016,std::sort 为 0.008。

回答by rasmus

I am surprised that no one mentions caches.

我很惊讶没有人提到缓存。

In your code, you start by touching aryand *ary_copy* so they are resident in the cache at the time of qsort. During qsort, *ary_copy* might get evicted. At the time of std::sort, the elements would have to be fetched from memory or a larger (read slower) cache level. This will of course depend on your cache sizes.

在您的代码中,您首先接触ary和 *ary_copy*,以便它们在qsort时驻留在缓存中。在qsort期间, *ary_copy* 可能会被驱逐。在std::sort 时,必须从内存或更大(读取较慢)缓存级别获取元素。这当然取决于您的缓存大小。

Try to reverse the test, i.e., start by running std::sort.

尝试反转测试,即从运行std::sort 开始

As some people have pointed out; making the array larger will make the test more fair. The reason is that a large array is less likely to fit in cache.

正如一些人指出的那样;使数组更大将使测试更公平。原因是大数组不太可能放入缓存。

回答by templatetypedef

The two sorting algorithms, without optimizations enabled, should have comparable performance. The reason that the C++ sorttends to appreciably beat qsortis that the compiler can inline the comparisons being made, since the compiler has type information about what function is being used to perform the comparison. Did you run these tests with optimization enabled? If not, try turning it on and running this test again.

这两种排序算法,在没有启用优化的情况下,应该具有相当的性能。C++sort明显优于C++ 的原因qsort是编译器可以内联进行的比较,因为编译器具有关于正在使用什么函数来执行比较的类型信息。您是否在启用优化的情况下运行这些测试?如果没有,请尝试将其打开并再次运行此测试。

回答by Zan Lynx

Another reason that qsort may perform much better than expected is that newer compilers can inline and optimize through the function pointer.

qsort 性能可能比预期好得多的另一个原因是,较新的编译器可以通过函数指针进行内联和优化。

If the C header defines an inline implementation of qsort instead of implementing it inside of a library and the compiler supports indirect function inlining, then qsort can be just as fast as std::sort.

如果 C 头文件定义了 qsort 的内联实现而不是在库内部实现它,并且编译器支持间接函数内联,那么 qsort 可以与 std::sort 一样快。

回答by 6502

On my machine adding some meat (making the array 10 million elements and moving it in the data section) and compiling with

在我的机器上添加一些肉(使数组有 1000 万个元素并将其移动到数据部分)并编译

g++ -Wall -O2 -osortspeed sortspeed.cpp

I get as result

我得到结果

C quick-sort time elapsed: 3.48
C++ quick-sort time elapsed: 1.26

Be also careful about modern "green" CPUs that may be configured to run at a variable speed depending on the load of the system. When benchmarking, this kind of behavior can drive you crazy (on my machine I've a small script that fixes CPU clock that I use when making speed tests).

还要注意现代“绿色”CPU,它们可能被配置为根据系统负载以可变速度运行。在进行基准测试时,这种行为可能会让您发疯(在我的机器上,我有一个小脚本来修复我在进行速度测试时使用的 CPU 时钟)。

回答by Mr Bingley

Writing accurate benchmarks is difficult, so let's get Noniusto do it for us! Let's test qsort, std::sortwith no inlining, and std::sortwith inlining (the default) on a vector of a million random integers.

编写准确的基准测试很困难,所以让Nonius为我们做吧!让我们在一个包含一百万个随机整数的向量上测试qsortstd::sort没有内联和std::sort内联(默认)。

// sort.cpp
#define NONIUS_RUNNER
#include <nonius.h++>
#include <random>
#include <algorithm>

// qsort
int comp(const void* a, const void* b) {
    const int arg1 = *static_cast<const int*>(a);
    const int arg2 = *static_cast<const int*>(b);

    // we can't simply return a - b, because that might under/overflow
    return (arg1 > arg2) - (arg1 < arg2);
}

// std::sort with no inlining
struct compare_noinline {
    __attribute__((noinline)) bool operator()(const int a, const int b) {
        return a < b;
    }
};

// std::sort with inlining
struct compare {
    // the compiler will automatically inline this
    bool operator()(const int a, const int b) {
        return a < b;
    }
};

std::vector<int> gen_random_vector(const size_t size) {

    std::random_device seed;
    std::default_random_engine engine{seed()};
    std::uniform_int_distribution<int> dist{std::numeric_limits<int>::min(), std::numeric_limits<int>::max()};

    std::vector<int> vec;
    for (size_t i = 0; i < size; i += 1) {
        const int rand_int = dist(engine);
        vec.push_back(rand_int);
    }

    return vec;
}

// generate a vector of a million random integers
constexpr size_t size = 1'000'000;
static const std::vector<int> rand_vec = gen_random_vector(size);

NONIUS_BENCHMARK("qsort", [](nonius::chronometer meter) {

    // Nonius does multiple runs of the benchmark, and each one needs a new
    // copy of the original vector, otherwise we'd just be sorting the same
    // one over and over
    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::qsort(current_vec.data(), current_vec.size(), sizeof(int), comp);

        return current_vec;
    });
});

NONIUS_BENCHMARK("std::sort noinline", [](nonius::chronometer meter) {

    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::sort(current_vec.begin(), current_vec.end(), compare_noinline{});

        return current_vec;

    });
});

NONIUS_BENCHMARK("std::sort inline", [](nonius::chronometer meter) {

    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::sort(current_vec.begin(), current_vec.end(), compare{});

        return current_vec;

    });
});

Compiling with Apple Clang 7.3.0,

使用 Apple Clang 7.3.0 编译,

$ clang++ -std=c++14 -stdlib=libc++ -O3 -march=native sort.cpp -o sort
$ ./sort

and running it on my 1.7 GHz i5 Macbook Air, we get

并在我的 1.7 GHz i5 Macbook Air 上运行它,我们得到

qsort                211 ms +/- 6 ms
std::sort noinline   127 ms +/- 5 ms
std::sort inline      87 ms +/- 4 ms

So std::sortwith no inlining is about 1.7x faster than qsort(perhaps due to different sorting algorithms), and inlining bumps that up to about 2.4x faster. Certainly an impressive speedup, but much less than 670%.

因此std::sort,没有内联的速度大约比qsort(可能是由于不同的排序算法)快 1.7 倍,而内联的颠簸速度大约快 2.4 倍。当然是令人印象深刻的加速,但远低于 670%。

回答by Dennis Beier

Don′t know how std::sort was implemented years ago. But std::sort can be much faster, because std::sort is quicksort with a heap sort fallback. Heapsort is a linearhitmic alghoritm, meaning if you have twice the sorting data, the sorting time doubles. Actually it more than doubles because it is not exactly linear, but however, qsort can be quadratic, so needing exponential more time for sorting twice the input.

不知道几年前 std::sort 是如何实现的。但是 std::sort 可以快得多,因为 std::sort 是带有堆排序回退的快速排序。Heapsort 是一个线性hitmic alghoritm,这意味着如果你有两倍的排序数据,排序时间就会加倍。实际上它增加了一倍多,因为它不是完全线性的,但是,qsort 可以是二次的,因此需要更多的指数时间来对两倍的输入进行排序。