C++11 中 COW std::string 实现的合法性

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/12199710/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-27 16:00:28  来源:igfitidea点击:

Legality of COW std::string implementation in C++11

c++stringc++11stdstringcopy-on-write

提问by acm

It had been my understanding that copy-on-write is not a viable way to implement a conforming std::stringin C++11, but when it came up in discussion recently I found myself unable to directly support that statement.

我的理解是写时复制不是std::string在 C++11 中实现合规的可行方法,但是当它最近在讨论中出现时,我发现自己无法直接支持该声明。

Am I correct that C++11 does not admit COW based implementations of std::string?

我说 C++11 不承认基于 COW 的实现是正确的std::string吗?

If so, is this restriction explicitly stated somewhere in the new standard (where)?

如果是这样,此限制是否在新标准(where)中的某处明确说明?

Or is this restriction implied, in the sense that it is the combined effect of the new requirements on std::stringthat precludes a COW based implementation of std::string. In this case, I'd be interested in a chapter and verse style derivation of 'C++11 effectively prohibits COW based std::stringimplementations'.

或者这个限制是否隐含,从某种意义上说,新要求的综合影响std::string排除了基于 COW 的std::string. 在这种情况下,我会对“C++11 有效禁止基于 COW 的std::string实现”的章节和诗句风格派生感兴趣。

回答by Dave S

It's not allowed, because as per the standard 21.4.1 p6, invalidation of iterators/references is only allowed for

这是不允许的,因为根据标准 21.4.1 p6,迭代器/引用的失效只允许

— as an argument to any standard library function taking a reference to non-const basic_string as an argument.

— Calling non-const member functions, except operator[], at, front, back, begin, rbegin, end, and rend.

— 作为任何标准库函数的参数,将非常量 basic_string 的引用作为参数。

— 调用非常量成员函数,operator[]、at、front、back、begin、rbegin、end 和 rend 除外。

For a COW string, calling non-const operator[]would require making a copy (and invalidating references), which is disallowed by the paragraph above. Hence, it's no longer legal to have a COW string in C++11.

对于 COW 字符串,调用非常量operator[]将需要进行复制(并使引用无效),这是上一段所不允许的。因此,在 C++11 中使用 COW 字符串不再合法。

回答by Jonathan Wakely

The answers by Dave Sand gbjbaanbare correct. (And Luc Danton's is correct too, although it's more a side-effect of forbidding COW strings rather than the original rule that forbids it.)

通过这些问题的答案戴夫小号gbjbaanb正确的。(而且 Luc Danton 的也是正确的,尽管它更多是禁止 COW 字符串的副作用,而不是禁止它的原始规则。)

But to clear up some confusion, I'm going to add some further exposition. Various comments link to a comment of mine on the GCC bugzillawhich gives the following example:

但是为了澄清一些混乱,我将添加一些进一步的说明。各种评论链接到我对 GCC bugzilla 的评论,其中给出了以下示例:

std::string s("str");
const char* p = s.data();
{
    std::string s2(s);
    (void) s[0];
}
std::cout << *p << '\n';  // p is dangling

The point of that example is to demonstrate why GCC's reference counted (COW) string is not valid in C++11. The C++11 standard requires this code to work correctly. Nothing in the code permits the pto be invalidated in C++11.

该示例的重点是演示为什么 GCC 的引用计数 (COW) 字符串在 C++11 中无效。C++11 标准要求此代码正常工作。代码中的p任何内容都不允许在 C++11 中无效。

Using GCC's old reference-counted std::stringimplementation, that code has undefined behaviour, because pisinvalidated, becoming a dangling pointer. (What happens is that when s2is constructed it shares the data with s, but obtaining a non-const reference via s[0]requires the data to be unshared, so sdoes a "copy on write" because the reference s[0]could potentially be used to write into s, then s2goes out of scope, destroying the array pointed to by p).

使用GCC的老引用计数的std::string实现,该代码具有不确定的行为,因为p无效的,成为一个悬摆指针。(发生的情况是,在s2构造时,它与 共享数据s,但是通过获取非常量引用s[0]需要取消共享数据,因此s“写时复制”也是如此,因为该引用s[0]可能用于写入s,然后s2转到超出范围,破坏由p)指向的数组。

The C++03 standard explicitly permits that behaviourin 21.3 [lib.basic.string] p5 where it says that subsequent to a call to data()the first call to operator[]()may invalidate pointers, references and iterators. So GCC's COW string was a valid C++03 implementation.

C++03 标准在 21.3 [lib.basic.string] p5 中明确允许这种行为,它说在调用data()第一次调用之后operator[]()可能会使指针、引用和迭代器无效。所以 GCC 的 COW 字符串是一个有效的 C++03 实现。

The C++11 standard no longer permitsthat behaviour, because no call to operator[]()may invalidate pointers, references or iterators, irrespective of whether they follow a call to data().

C++11 标准不再允许这种行为,因为没有调用operator[]()可能会使指针、引用或迭代器无效,无论它们是否跟在对 的调用之后data()

So the example above mustwork in C++11, but does notwork with libstdc++'s kind of COW string, therefore that kind of COW string is not permitted in C++11.

所以上面的例子必须在 C++11工作,但不适用于 libstdc++ 的那种 COW 字符串,因此这种 COW 字符串在 C++11 中是不允许的。

回答by gbjbaanb

It is, CoW is an acceptable mechanism for making faster strings... but...

确实,CoW 是一种可以接受的制作更快字符串的机制……但是……

it makes multithreading code slower (all that locking to check if you're the only one writing kills performance when using a lot of strings). This was the main reason CoW was killed off years ago.

它使多线程代码变慢(所有这些锁定来检查您是否是唯一一个在使用大量字符串时会降低性能的人)。这是 CoW 多年前被杀的主要原因。

The other reasons are that the []operator will return you the string data, without any protection for you to overwrite a string someone else expects to be unchanging. The same applies to c_str()and data().

另一个原因是[]操作员将返回字符串数据,而没有任何保护措施让您覆盖其他人希望保持不变的字符串。这同样适用于c_str()data()

Quick google says that the multithreading is basically the reason it was effectively disallowed(not explicitly).

Quick google 说多线程基本上是它被有效禁止(未明确)的原因。

The proposal says :

提案说:

Proposal

We propose to make all iterator and element access operations safely concurrently executable.

We are increasing the stability of operations even in sequential code.

This change effectively disallows copy-on-write implementations.

提议

我们建议使所有迭代器和元素访问操作安全地并发执行。

即使在顺序代码中,我们也在提高操作的稳定性。

此更改有效地禁止了写时复制实现。

followed by

其次是

The largest potential loss in performance due to a switch away from copy-on-write implementations is the increased consumption of memory for applications with very large read-mostly strings. However, we believe that for those applications ropes are a better technical solution, and recommend a rope proposal be considered for inclusion in Library TR2.

由于从写时复制实现的切换而导致的最大潜在性能损失是具有非常大的主要读取字符串的应用程序的内存消耗增加。但是,我们认为对于这些应用,绳索是更好的技术解决方案,并建议考虑将绳索提案纳入库 TR2。

Ropesare part of STLPort and SGIs STL.

绳索是 STLPort 和 SGIs STL 的一部分。

回答by Luc Danton

From 21.4.2 basic_string constructors and assignment operators [string.cons]

来自 21.4.2 basic_string 构造函数和赋值运算符 [string.cons]

basic_string(const basic_string<charT,traits,Allocator>& str);

[...]

2 Effects: Constructs an object of class basic_stringas indicated in Table 64. [...]

basic_string(const basic_string<charT,traits,Allocator>& str);

[...]

2效果:构造一个类的对象,basic_string如表 64 所示。 [...]

Table 64 helpfully documents that after construction of an object via this (copy) constructor, this->data()has as value:

表 64 有用地记录了在通过此(复制)构造函数构造对象后,this->data()具有以下值:

points at the ?rst element of an allocated copy of the array whose ?rst element is pointed at by str.data()

指向第一个元素由 str.data() 指向的数组的已分配副本的第一个元素

There are similar requirements for other similar constructors.

其他类似的构造函数也有类似的要求。

回答by Dirk Holsopple

Since it is now guaranteed that strings are stored contiguously and you are now allowed to take a pointer to the internal storage of a string, (i.e. &str[0] works like it would for an array), it's not possible to make a useful COW implementation. You would have to make a copy for way too many things. Even just using operator[]or begin()on a non-const string would require a copy.

由于现在可以保证字符串是连续存储的,并且您现在可以使用指向字符串内部存储的指针(即 &str[0] 的工作方式类似于数组),因此不可能制作有用的 COW执行。您将不得不为太多东西制作副本。即使只是在非常量字符串上使用operator[]begin()也需要一个副本。

回答by Cheers and hth. - Alf

Is COW basic_stringprohibited in C++11 and later?

basic_stringC++11 及更高版本中是否禁止使用COW ?

Regarding

关于

Am I correct that C++11 does not admit COW based implementations of std::string?

我说 C++11 不承认基于 COW 的实现是正确的std::string吗?

Yes.

是的。

Regarding

关于

If so, is this restriction explicitly stated somewhere in the new standard (where)?

如果是这样,这个限制是否在新标准(where)的某处明确说明?

Almostdirectly, by requirements of constant complexity for a number of operations that would require O(n) physical copying of the string data in a COW implementation.

几乎直接地,由于需要对COW 实现中的字符串数据进行 O(n) 物理复制的许多操作的恒定复杂性要求。

For example, for the member functions

例如,对于成员函数

auto operator[](size_type pos) const -> const_reference;
auto operator[](size_type pos) -> reference;

… which in a COW implementation would 1both trigger string data copying to un-share the string value, the C++11 standard requires

...在 COW 实现中 1both 会触发字符串数据复制以取消共享字符串值,C++11 标准要求

C++11 §21.4.5/4C++11 §21.4.5/4

Complexity:constant time.

复杂性:恒定时间。

… which rules out such data copying, and hence, COW.

……排除了这种数据复制,因此排除了 COW。

C++03 supported COW implementations by nothaving these constant complexity requirements, and by, under certain restrictive conditions, allowing calls to operator[](), at(), begin(), rbegin(), end(), or rend()to invalidate references, pointers and iterators referring to the string items, i.e. to possibly incur a COW data copying. This support was removed in C++11.

C++03 支持 COW 实现,因为没有这些恒定的复杂性要求,并且在某些限制条件下,允许调用operator[](), at(), begin(), rbegin(), end(), 或rend()使引用字符串项的引用、指针和迭代器无效,即可能导致COW 数据复制。C++11 中删除了此支持。



Is COW also prohibited via the C++11 invalidation rules?

C++11 失效规则是否也禁止了 COW?

In another answer which at the time of writing is selected as solution, and which is heavily upvoted and therefore apparently believed, it's asserted that

在撰写本文时被选为解决方案的另一个答案中,该答案得到了大量支持,因此显然相信,它断言

For a COW string, calling non-constoperator[]would require making a copy (and invalidating references), which is disallowed by the [quoted] paragraph above [C++11 §21.4.1/6]. Hence, it's no longer legal to have a COW string in C++11.

对于 COW 字符串,调用 non-constoperator[]将需要进行复制(并使引用无效),这是上面 [C++11 §21.4.1/6] 上的 [quoted] 段落所不允许的。因此,在 C++11 中使用 COW 字符串不再合法。

That assertion is incorrect and misleading in two main ways:

该断言在两个主要方面是不正确和具有误导性的:

  • It incorrectly indicates that only the non-constitem accessors need to trigger a COW data copying.
    But also the constitem accessors need to trigger data copying, because they allow client code to form references or pointers that (in C++11) it's not permitted to invalidate later via the operations that can trigger COW data copying.
  • It incorrectly assumes that COW data copying can cause reference invalidation.
    But in a correct implementation COW data copying, un-sharing the string value, is done at a point before there are any references that can be invalidated.
  • 它错误地表明只有非const项目访问者需要触发 COW 数据复制。
    但是const项目访问器也需要触发数据复制,因为它们允许客户端代码形成引用或指针(在 C++11 中)它不允许稍后通过可以触发 COW 数据复制的操作无效。
  • 它错误地假设 COW 数据复制会导致引用失效。
    但是在正确的实现中,COW 数据复制、取消共享字符串值是在存在任何可能无效的引用之前的某个时间点完成的。

To see how a correct C++11 COW implementation of basic_stringwould work, when the O(1) requirements that make this invalid are ignored, think of an implementation where a string can switch between ownership policies. A string instance starts out with policy Sharable. With this policy active there can be no external item references. The instance can transition to Unique policy, and it must do so when an item reference is potentially created such as with a call to .c_str()(at least if that produces a pointer to the internal buffer). In the general case of multiple instances sharing ownership of the value, this entails copying the string data. After that transition to Unique policy the instance can only transition back to Sharable by an operation that invalidates all references, such as assignment.

要了解正确的 C++11 COW 实现如何basic_string工作,当忽略使该无效的 O(1) 要求时,请考虑一个字符串可以在所有权策略之间切换的实现。字符串实例以策略 Shareable 开始。在此策略处于活动状态时,不能有外部项目引用。实例可以转换为 Unique 策略,并且它必须在可能创建项引用时这样做,例如调用.c_str()(至少如果这会产生指向内部缓冲区的指针)。在多个实例共享值所有权的一般情况下,这需要复制字符串数据。在转换到唯一策略之后,实例只能通过使所有引用无效的操作(例如赋值)转换回可共享。

So, while that answer's conclusion, that COW strings are ruled out, is correct, the reasoning offered is incorrect and strongly misleading.

因此,虽然排除 COW 字符串的答案的结论是正确的,但提供的推理是不正确的并且具有很强的误导性。

I suspect the cause of this misunderstanding is a non-normative note in C++11's annex C:

我怀疑造成这种误解的原因是 C++11 的附录 C 中的一个非规范性注释:

C++11 §C.2.11 [diff.cpp03.strings], about §21.3:C++11 §C.2.11 [diff.cpp03.strings],关于 §21.3:

Change: basic_stringrequirements no longer allow reference-counted strings
Rationale:Invalidation is subtly different with reference-counted strings. This change regularizes behavor (sic) for this International Standard.
Effect on original feature:Valid C ++ 2003 code may execute differently in this International Standard

更改basic_string要求不再允许引用计数的字符串
理由:无效与引用计数的字符串略有不同。这一变化规范了本国际标准的行为(原文如此)。
对原始特性的影响:有效的 C++ 2003 代码在本国际标准中可能以不同的方式执行

Here the rationaleexplains the primary whyone decided to remove the C++03 special COW support. This rationale, the why, is not howthe standard effectively disallows COW implementation. The standard disallows COW via the O(1) requirements.

这里的理由解释了主为什么一个决定删除C ++ 03特牛的支持。这个基本原理,原因,并不是标准如何有效地禁止 COW 实施。该标准通过 O(1) 要求不允许 COW。

In short, the C++11 invalidation rules don't rule out a COW implementation of std::basic_string. But they do rule out a reasonably efficient unrestricted C++03-style COW implementation like the one in at least one of g++'s standard library implementations. The special C++03 COW support allowed practical efficiency, in particular using constitem accessors, at the cost of subtle, complex rules for invalidation:

简而言之,C++11 失效规则不排除std::basic_string. 但是他们确实排除了一种合理有效的不受限制的 C++03 风格的 COW 实现,就像在至少一个 g++ 标准库实现中的实现一样。特殊的 C++03 COW 支持允许实际效率,特别是使用const项目访问器,以微妙的、复杂的无效规则为代价:

C++03 §21.3/5C++03 §21.3/5包括“第一次调用”COW 支持:

References, pointers, and iterators referring to the elements of a basic_stringsequence may be invalidated by the following uses of that basic_stringobject:
— As an argument to non-member functions swap()(21.3.7.8), operator>>()(21.3.7.9), and getline()(21.3.7.9).
— As an argument to basic_string::swap().
— Calling data()and c_str()member functions.
— Calling non-constmember functions, except operator[](), at(), begin(), rbegin(), end(), and rend().
— Subsequent to any of the above uses except the forms of insert()and erase()which return iterators, the first call to non-constmember functions operator[](), at(), begin(), rbegin(), end(), or rend().

引用basic_string序列元素的引用、指针和迭代器可能会因该basic_string对象的以下用途而失效:
— 作为非成员函数的参数swap()(21.3.7.8)、operator>>()(21.3.7.9) 和getline()(21.3. 7.9)。
— 作为 的论据basic_string::swap()
— 调用data()c_str()成员函数。
-调用非const成员函数,除了operator[]()at()begin()rbegin()end(),和rend()
— 在上述任何一种使用之后,除了返回迭代器的insert()和形式之外erase(),第一次调用非const成员函数operator[](), at(), begin(), rbegin(),end(),或rend()

These rules are so complex and subtle that I doubt many programmers, if any, could give a precise summary. I could not.

这些规则是如此复杂和微妙,以至于我怀疑许多程序员(如果有的话)能否给出准确的总结。我不能。



What if O(1) requirements are disregarded?

如果 O(1) 要求被忽视怎么办?

If the C++11 constant time requirements on e.g. operator[]are disregarded, then COW for basic_stringcould be technically feasible, but difficult to implement.

如果operator[]忽略对 eg 的 C++11 常量时间要求,那么 COW forbasic_string可能在技术上可行,但难以实现。

Operations which could access the contents of a string without incurring COW data copying include:

可以访问字符串内容而不引起 COW 数据复制的操作包括:

  • Concatenation via +.
  • Output via <<.
  • Using a basic_stringas argument to standard library functions.
  • 通过连接+
  • 通过 输出<<
  • 使用 abasic_string作为标准库函数的参数。

The latter because the standard library is permitted to rely on implementation specific knowledge and constructs.

后者是因为标准库被允许依赖于实现特定的知识和构造。

Additionally an implementation could offer various non-standard functions for accessing string contents without triggering COW data copying.

此外,一个实现可以提供各种非标准函数来访问字符串内容而不触发 COW 数据复制。

A main complicating factor is that in C++11 basic_stringitem access must trigger data copying (un-sharing the string data) but is required to not throw, e.g. C++11 §21.4.5/3 “Throws:Nothing.”. And so it can't use ordinary dynamic allocation to create a new buffer for COW data copying. One way around this is to use a special heap where memory can be reservedwithout being actually allocated, and then reserve the requisite amount for each logical reference to a string value. Reserving and un-reserving in such a heap can be constant time, O(1), and allocating the amount that one has already reserved, can be noexcept. In order to comply with the standard's requirements, with this approach it seems there would need to be one such special reservation-based heap per distinct allocator.

一个主要的复杂因素是,在 C++11 中,basic_string项访问必须触发数据复制(取消共享字符串数据)但要求不 throw,例如 C++11 §21.4.5/3 “ Throws:Nothing.”。所以它不能使用普通的动态分配来创建一个新的缓冲区用于 COW 数据复制。解决这个问题的一种方法是使用一个特殊的堆,可以在其中保留内存而无需实际分配,然后为每个对字符串值的逻辑引用保留必要的数量。在这样的堆中保留和取消保留可以是常数时间,O(1),并且分配已经保留的数量,可以是noexcept. 为了符合标准的要求,使用这种方法似乎每个不同的分配器都需要一个这样的基于保留的特殊堆。



Notes:
1 The constitem accessor triggers a COW data copying because it allows the client code to obtain a reference or pointer to the data, which it's not permitted to invalidate by a later data copying triggered by e.g. the non-constitem accessor.

注意:
1const项目访问器触发 COW 数据复制,因为它允许客户端代码获取数据的引用或指针,不允许由非const项目访问器触发的稍后数据复制使其无效。

回答by zzz777

I was always wondering about immutable cows: once cow is created I could be changed only through assignment from another cow, hence it will be compliant with the standard.

我一直想知道不可变的母牛:一旦创建了母牛,我只能通过另一头母牛的分配来更改,因此它将符合标准。

I had time to try it today for a simple comparison test: a map of size N keyed by string/cow with every node holding a set of all strings in the map (we have NxN number of objects).

我今天有时间尝试一个简单的比较测试:一个大小为 N 的映射,由 string/cow 键控,每个节点都持有映射中所有字符串的集合(我们有 NxN 个对象)。

With strings sized ~300 bytes and N=2000 cows are slightly faster and use almost order of magnitude less memory. See below, sizes are in kbs, run b is with cows.

字符串大小约为 300 字节且 N=2000 奶牛会稍微快一些,并且使用的内存几乎要少一个数量级。见下文,大小以 kbs 为单位,运行 b 是奶牛。

~/icow$ ./tst 2000
preparation a
run
done a: time-delta=6 mem-delta=1563276
preparation b
run
done a: time-delta=3 mem-delta=186384