In C++, is it still bad practice to return a vector from a function?

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) on StackOverflow. Original question: http://stackoverflow.com/questions/3134831/

Tags: c++, c++11, coding-style, return-value-optimization

Asked by Nate

Short version: It's common to return large objects, such as vectors/arrays, in many programming languages. Is this style now acceptable in C++0x if the class has a move constructor, or do C++ programmers consider it weird/ugly/abomination?

Long version: In C++0x, is this still considered bad form?

std::vector<std::string> BuildLargeVector();
...
std::vector<std::string> v = BuildLargeVector();

The traditional version would look like this:

void BuildLargeVector(std::vector<std::string>& result);
...
std::vector<std::string> v;
BuildLargeVector(v);

In the newer version, the value returned from BuildLargeVector is an rvalue, so v would be constructed using the move constructor of std::vector, assuming (N)RVO doesn't take place.

Even prior to C++0x the first form would often be "efficient" because of (N)RVO. However, (N)RVO is at the discretion of the compiler. Now that we have rvalue references it is guaranteed that no deep copy will take place.

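As an illustration, here is a minimal sketch of what that means at the call site; the function body is made up for the example, and only the return style matters:

#include <string>
#include <vector>

// Hypothetical factory: returns the container by value.
std::vector<std::string> BuildLargeVector()
{
    std::vector<std::string> result;
    result.reserve(1000);
    for (int i = 0; i < 1000; ++i)
        result.push_back("item");
    return result;   // NRVO may construct "result" directly in the caller's object;
                     // failing that, C++0x/C++11 falls back to std::vector's move constructor.
}

int main()
{
    // No deep copy of the elements: either the copy is elided entirely,
    // or the move constructor just steals the internal buffer.
    std::vector<std::string> v = BuildLargeVector();
    return static_cast<int>(v.size());
}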

Edit: The question is really not about optimization. Both forms shown have near-identical performance in real-world programs, whereas in the past the first form could have had order-of-magnitude worse performance. As a result, the first form was a major code smell in C++ programming for a long time. Not anymore, I hope?

Accepted answer by Peter Alexander

Dave Abrahams has a pretty comprehensive analysis of the speed of passing/returning values.

Short answer: if you need to return a value, then return a value. Don't use output references, because the compiler does it anyway. Of course there are caveats, so you should read that article.

Answered by Jerry Coffin

At least IMO, it's usually a poor idea, but not for efficiency reasons. It's a poor idea because the function in question should usually be written as a generic algorithm that produces its output via an iterator. Almost any code that accepts or returns a container instead of operating on iterators should be considered suspect.

Don't get me wrong: there are times it makes sense to pass around collection-like objects (e.g., strings) but for the example cited, I'd consider passing or returning the vector a poor idea.

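To make the iterator style concrete, here is a rough sketch; the generator and what it produces are hypothetical, the point is that the algorithm writes through an output iterator and the caller decides which container, if any, receives the results:

#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <vector>

// Hypothetical generic algorithm: produces n strings through an output iterator.
template <typename OutputIt>
OutputIt BuildLargeSequence(OutputIt out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        *out++ = std::to_string(i);
    return out;
}

int main()
{
    std::vector<std::string> v;
    BuildLargeSequence(std::back_inserter(v), 1000);   // fill a vector...

    std::list<std::string> l;
    BuildLargeSequence(std::back_inserter(l), 1000);   // ...or any other container
}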

Answered by peterchen

The gist is:

Copy elision and RVO can avoid the "scary copies" (the compiler is not required to implement these optimizations, and in some situations they cannot be applied).

C++0x rvalue references allow string/vector implementations that guarantee that.

If you can abandon older compilers / STL implementations, return vectors freely (and make sure your own objects support it, too). If your code base needs to support "lesser" compilers, stick to the old style.

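As for "make sure your own objects support it, too": for a plain aggregate this comes for free, and it only becomes an issue once you declare a destructor or copy operations yourself. A sketch with hypothetical types:

#include <string>
#include <vector>

// Members are movable, so the implicitly generated move operations are already cheap.
struct Record
{
    std::string name;
    std::vector<int> samples;
};

// A user-declared destructor suppresses the implicit move operations;
// defaulting them explicitly opts back in.
struct Legacy
{
    Legacy() = default;
    ~Legacy() { /* logging, cleanup, ... */ }

    Legacy(const Legacy&) = default;
    Legacy& operator=(const Legacy&) = default;
    Legacy(Legacy&&) = default;
    Legacy& operator=(Legacy&&) = default;

    std::vector<int> data;
};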

Unfortunately, that has a major influence on your interfaces. If C++0x is not an option and you need guarantees, you might instead use reference-counted or copy-on-write objects in some scenarios. They have downsides with multithreading, though.

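One such workaround is to hand out a cheap reference-counted handle instead of the container itself. A sketch using std::shared_ptr with hypothetical names; on pre-C++0x toolchains, boost::shared_ptr or std::tr1::shared_ptr plays the same role:

#include <memory>
#include <string>
#include <vector>

typedef std::shared_ptr<const std::vector<std::string> > StringsHandle;

// The vector is allocated once; only the small handle is copied on return.
StringsHandle BuildLargeVector()
{
    std::shared_ptr<std::vector<std::string> > v(new std::vector<std::string>());
    v->push_back("example");
    return v;   // converts to a handle-to-const; no element is copied
}

int main()
{
    StringsHandle h = BuildLargeVector();
    return static_cast<int>(h->size());
}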

(I wish just one answer in C++ would be simple and straightforward and without conditions).

Answered by Boris Dalstein

Indeed, since C++11, the cost of copying the std::vector is gone in most cases.

However, one should keep in mind that the cost of constructing the new vector (and then destructing it) still exists, and using output parameters instead of returning by value is still useful when you want to reuse the vector's capacity. This is documented as an exception in F.20 of the C++ Core Guidelines.

Let's compare:

std::vector<int> BuildLargeVector1(size_t vecSize) {
    return std::vector<int>(vecSize, 1);
}

with:

void BuildLargeVector2(/*out*/ std::vector<int>& v, size_t vecSize) {
    v.assign(vecSize, 1);
}

Now, suppose we need to call these methods numIter times in a tight loop, and perform some action. For example, let's compute the sum of all elements.

Using BuildLargeVector1, you would do:

size_t sum1 = 0;
for (size_t i = 0; i < numIter; ++i) {
    std::vector<int> v = BuildLargeVector1(vecSize);
    sum1 = std::accumulate(v.begin(), v.end(), sum1);
}

Using BuildLargeVector2, you would do:

size_t sum2 = 0;
std::vector<int> v;
for (size_t i = 0; i < numIter; ++i) {
    BuildLargeVector2(/*out*/ v, vecSize);
    sum2 = std::accumulate(v.begin(), v.end(), sum2);
}

In the first example, there are many unnecessary dynamic allocations/deallocations happening, which are prevented in the second example by using an output parameter the old way, reusing already allocated memory. Whether or not this optimization is worth doing depends on the relative cost of the allocation/deallocation compared to the cost of computing/mutating the values.

Benchmark

Let's play with the values of vecSize and numIter. We will keep vecSize * numIter constant, so that "in theory" it should take the same time (= there is the same number of assignments and additions, with the exact same values), and the time difference can only come from the cost of allocations, deallocations, and better use of cache.

More specifically, let's use vecSize*numIter = 2^31 = 2147483648, because I have 16GB of RAM and this number ensures that no more than 8GB is allocated (sizeof(int) = 4), ensuring that I am not swapping to disk (all other programs were closed, I had ~15GB available when running the test).

Here is the code:

#include <chrono>
#include <iomanip>
#include <iostream>
#include <numeric>
#include <vector>

class Timer {
    using clock = std::chrono::steady_clock;
    using seconds = std::chrono::duration<double>;
    clock::time_point t_;

public:
    void tic() { t_ = clock::now(); }
    double toc() const { return seconds(clock::now() - t_).count(); }
};

std::vector<int> BuildLargeVector1(size_t vecSize) {
    return std::vector<int>(vecSize, 1);
}

void BuildLargeVector2(/*out*/ std::vector<int>& v, size_t vecSize) {
    v.assign(vecSize, 1);
}

int main() {
    Timer t;

    size_t vecSize = size_t(1) << 31;
    size_t numIter = 1;

    std::cout << std::setw(10) << "vecSize" << ", "
              << std::setw(10) << "numIter" << ", "
              << std::setw(10) << "time1" << ", "
              << std::setw(10) << "time2" << ", "
              << std::setw(10) << "sum1" << ", "
              << std::setw(10) << "sum2" << "\n";

    while (vecSize > 0) {

        t.tic();
        size_t sum1 = 0;
        {
            for (size_t i = 0; i < numIter; ++i) { // size_t: numIter reaches 2^31, which would overflow an int
                std::vector<int> v = BuildLargeVector1(vecSize);
                sum1 = std::accumulate(v.begin(), v.end(), sum1);
            }
        }
        double time1 = t.toc();

        t.tic();
        size_t sum2 = 0;
        {
            std::vector<int> v;
            for (size_t i = 0; i < numIter; ++i) { // size_t: same reason as above
                BuildLargeVector2(/*out*/ v, vecSize);
                sum2 = std::accumulate(v.begin(), v.end(), sum2);
            }
        } // deallocate v
        double time2 = t.toc();

        std::cout << std::setw(10) << vecSize << ", "
                  << std::setw(10) << numIter << ", "
                  << std::setw(10) << std::fixed << time1 << ", "
                  << std::setw(10) << std::fixed << time2 << ", "
                  << std::setw(10) << sum1 << ", "
                  << std::setw(10) << sum2 << "\n";

        vecSize /= 2;
        numIter *= 2;
    }

    return 0;
}

And here is the result:

$ g++ -std=c++11 -O3 main.cpp && ./a.out
   vecSize,    numIter,      time1,      time2,       sum1,       sum2
2147483648,          1,   2.360384,   2.356355, 2147483648, 2147483648
1073741824,          2,   2.365807,   1.732609, 2147483648, 2147483648
 536870912,          4,   2.373231,   1.420104, 2147483648, 2147483648
 268435456,          8,   2.383480,   1.261789, 2147483648, 2147483648
 134217728,         16,   2.395904,   1.179340, 2147483648, 2147483648
  67108864,         32,   2.408513,   1.131662, 2147483648, 2147483648
  33554432,         64,   2.416114,   1.097719, 2147483648, 2147483648
  16777216,        128,   2.431061,   1.060238, 2147483648, 2147483648
   8388608,        256,   2.448200,   0.998743, 2147483648, 2147483648
   4194304,        512,   0.884540,   0.875196, 2147483648, 2147483648
   2097152,       1024,   0.712911,   0.716124, 2147483648, 2147483648
   1048576,       2048,   0.552157,   0.603028, 2147483648, 2147483648
    524288,       4096,   0.549749,   0.602881, 2147483648, 2147483648
    262144,       8192,   0.547767,   0.604248, 2147483648, 2147483648
    131072,      16384,   0.537548,   0.603802, 2147483648, 2147483648
     65536,      32768,   0.524037,   0.600768, 2147483648, 2147483648
     32768,      65536,   0.526727,   0.598521, 2147483648, 2147483648
     16384,     131072,   0.515227,   0.599254, 2147483648, 2147483648
      8192,     262144,   0.540541,   0.600642, 2147483648, 2147483648
      4096,     524288,   0.495638,   0.603396, 2147483648, 2147483648
      2048,    1048576,   0.512905,   0.609594, 2147483648, 2147483648
      1024,    2097152,   0.548257,   0.622393, 2147483648, 2147483648
       512,    4194304,   0.616906,   0.647442, 2147483648, 2147483648
       256,    8388608,   0.571628,   0.629563, 2147483648, 2147483648
       128,   16777216,   0.846666,   0.657051, 2147483648, 2147483648
        64,   33554432,   0.853286,   0.724897, 2147483648, 2147483648
        32,   67108864,   1.232520,   0.851337, 2147483648, 2147483648
        16,  134217728,   1.982755,   1.079628, 2147483648, 2147483648
         8,  268435456,   3.483588,   1.673199, 2147483648, 2147483648
         4,  536870912,   5.724022,   2.150334, 2147483648, 2147483648
         2, 1073741824,  10.285453,   3.583777, 2147483648, 2147483648
         1, 2147483648,  20.552860,   6.214054, 2147483648, 2147483648

Benchmark results

(Intel i7-7700K @ 4.20GHz; 16GB DDR4 2400MHz; Kubuntu 18.04)

Notation: mem(v) = v.size() * sizeof(int) = v.size() * 4 on my platform.

Not surprisingly, when numIter = 1 (i.e., mem(v) = 8GB), the times are perfectly identical. Indeed, in both cases we only allocate one huge 8GB vector in memory. This also proves that no copy happened when using BuildLargeVector1(): I wouldn't have enough RAM to do the copy!

When numIter = 2, reusing the vector capacity instead of re-allocating a second vector is 1.37x faster.

When numIter = 256, reusing the vector capacity (instead of allocating/deallocating a vector over and over again 256 times...) is 2.45x faster :)

We can notice that time1 is pretty much constant from numIter = 1 to numIter = 256, which means that allocating one huge vector of 8GB is pretty much as costly as allocating 256 vectors of 32MB. However, allocating one huge vector of 8GB is definitely more expensive than allocating a single vector of 32MB, so reusing the vector's capacity provides performance gains.

From numIter = 512 (mem(v) = 16MB) to numIter = 8M (mem(v) = 1kB) is the sweet spot: both methods are equally fast, and faster than all other combinations of numIter and vecSize. This probably has to do with the fact that the L3 cache size of my processor is 8MB, so that the vector pretty much fits completely in cache. I can't really explain why the sudden jump in time1 happens at mem(v) = 16MB; it would seem more logical for it to happen just after, at mem(v) = 8MB. Note that, surprisingly, in this sweet spot not re-using capacity is in fact slightly faster! I can't really explain this.

When numIter > 8M, things start to get ugly. Both methods get slower, but returning the vector by value gets even slower. In the worst case, with a vector containing only a single int, reusing capacity instead of returning by value is 3.3x faster. Presumably, this is due to the fixed costs of malloc(), which start to dominate.

Note how the curve for time2 is smoother than the curve for time1: not only is re-using vector capacity generally faster, but perhaps more importantly, it is more predictable.

Also note that in the sweet spot, we were able to perform 2 billion additions of 64-bit integers in ~0.5s, which is quite optimal on a 4.2GHz 64-bit processor. We could do better by parallelizing the computation in order to use all 8 cores (the test above only uses one core at a time, which I have verified by re-running the test while monitoring CPU usage). The best performance is achieved when mem(v) = 16kB, which is the order of magnitude of the L1 cache (the L1 data cache of the i7-7700K is 4x32kB).

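As a rough sketch of that parallelization idea (not part of the benchmark above, and the helper name is made up), the accumulation could be split across worker tasks with std::async once the vector has been built:

#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums v using nThreads asynchronous tasks (assumes nThreads >= 1).
size_t ParallelSum(const std::vector<int>& v, size_t nThreads)
{
    std::vector<std::future<size_t>> parts;
    const size_t chunk = v.size() / nThreads;
    for (size_t t = 0; t < nThreads; ++t) {
        auto first = v.begin() + t * chunk;
        auto last  = (t + 1 == nThreads) ? v.end() : first + chunk;
        parts.push_back(std::async(std::launch::async, [first, last] {
            return std::accumulate(first, last, size_t(0));
        }));
    }
    size_t sum = 0;
    for (auto& f : parts)
        sum += f.get();   // add up each worker's partial sum
    return sum;
}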

Of course, the differences become less and less relevant the more computation you actually have to do on the data. Below are the results if we replace sum = std::accumulate(v.begin(), v.end(), sum); with for (int k : v) sum += std::sqrt(2.0*k);:

Benchmark 2

Conclusions

  1. Using output parameters instead of returning by value may provide performance gains by re-using capacity.
  2. On a modern desktop computer, this seems only applicable to large vectors (>16MB) and small vectors (<1kB).
  3. Avoid allocating millions/billions of small vectors (< 1kB). If possible, re-use capacity, or better yet, design your architecture differently.

Results may differ on other platforms. As usual, if performance matters, write benchmarks for your specific use case.

Answered by stinky472

I still think it is a bad practice but it's worth noting that my team uses MSVC 2008 and GCC 4.1, so we're not using the latest compilers.

Previously a lot of the hotspots shown in vtune with MSVC 2008 came down to string copying. We had code like this:

String Something::id() const
{
    return valid() ? m_id: "";
}

... note that we used our own String type (this was required because we're providing a software development kit where plugin writers could be using different compilers and therefore different, incompatible implementations of std::string/std::wstring).

I made a simple change in response to a call-graph sampling profiling session which showed String::String(const String&) taking up a significant amount of time. Methods like the one in the above example were the greatest contributors (actually, the profiling session showed memory allocation and deallocation to be one of the biggest hotspots, with the String copy constructor being the primary contributor to the allocations).

The change I made was simple:

static String null_string;
const String& Something::id() const
{
    return valid() ? m_id: null_string;
}

Yet this made a world of difference! The hotspot went away in subsequent profiler sessions, and in addition to this we do a lot of thorough unit testing to keep track of our application performance. All kinds of performance test times dropped significantly after these simple changes.

Conclusion: we're not using the absolute latest compilers, but we still can't seem to depend on the compiler reliably optimizing away the copying when returning by value (at least not in all cases). That may not be the case for those using newer compilers like MSVC 2010. I'm looking forward to when we can use C++0x and simply use rvalue references, and never have to worry that we're pessimizing our code by returning complex classes by value.

[Edit] As Nate pointed out, RVO applies to returning temporaries created inside of a function. In my case, there were no such temporaries (except for the invalid branch where we construct an empty string) and thus RVO would not have been applicable.

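To make that distinction concrete, a small sketch (member and function names are illustrative):

#include <string>

class Something
{
    std::string m_id;

public:
    // NRVO candidate: "label" is a local object created inside the function,
    // so the copy/move on return can be elided entirely.
    std::string make_label() const
    {
        std::string label = "id:" + m_id;
        return label;
    }

    // Not an RVO candidate: m_id is a member, not a function-local object,
    // so returning it by value has to copy it (a deep copy before move semantics).
    std::string id() const { return m_id; }
};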

Answered by Nemanja Trifunovic

Just to nitpick a little: it is not common in many programming languages to return arrays from functions. In most of them, a reference to the array is returned. In C++, the closest analogy would be returning boost::shared_array.

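For illustration, such a handle might look like the sketch below; boost::shared_array itself is used the same way, and here std::shared_ptr with an array deleter (available since C++11) stands in for it:

#include <cstddef>
#include <memory>

// Hypothetical: the caller receives shared ownership of the buffer, not a copy of it.
std::shared_ptr<int> MakeBuffer(std::size_t n)
{
    // default_delete<int[]> makes delete[] the deleter for the array allocation.
    return std::shared_ptr<int>(new int[n](), std::default_delete<int[]>());
}

int main()
{
    std::shared_ptr<int> buf = MakeBuffer(1000);
    buf.get()[0] = 42;   // element access goes through the raw pointer
    return buf.get()[0];
}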

Answered by Motti

If performance is a real issue, you should realise that move semantics aren't always faster than copying. For example, if you have a string that uses the small string optimization, then for small strings a move constructor must do the exact same amount of work as a regular copy constructor.

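A deliberately simplified sketch (not how any real std::string is implemented) of why moving a small string costs as much as copying it:

#include <cstring>

// Toy string with a small-buffer optimization; illustrative only.
class SsoString
{
    char  small_[16];   // in-object buffer used for short strings
    char* data_;        // points into small_ or at a heap allocation
    // size, capacity, constructors from char*, etc. omitted

public:
    SsoString() : data_(small_) { small_[0] = '\0'; }

    SsoString(SsoString&& other) noexcept
    {
        if (other.data_ == other.small_) {
            // Small string: there is no heap buffer to steal, so the move
            // constructor copies the characters, exactly like a copy would.
            std::memcpy(small_, other.small_, sizeof(small_));
            data_ = small_;
        } else {
            // Long string: steal the heap buffer; this is the cheap case.
            data_ = other.data_;
            other.data_ = other.small_;
            other.small_[0] = '\0';
        }
    }
};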