C++: Initializing std::string from char* without copying

Question

Asked by Akusete

I have a situation where I need to process large amounts of data (many GB) as follows:

build a large string by appending many smaller (C char*) strings
trim the string
convert the string into a C++ const std::string for processing (read only)
repeat

The data in each iteration are independent.

My question: I'd like to minimise (and, if possible, eliminate) heap-allocated memory usage, as it is currently my largest performance problem.

Is there a way to convert a C string (char*) into an STL C++ string (std::string) without requiring std::string to internally allocate and copy the data?

Alternatively, could I use stringstreams or something similar to re-use a large buffer?

Edit: Thanks for the answers. For clarity, I think a revised question would be:

How can I efficiently build an STL C++ string via multiple appends? And if I perform this action in a loop, where each iteration is totally independent, how can I re-use the allocated space?

Answer 1

Accepted answer by e.James

Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.

See this link for more information on the reserve function.
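As a minimal sketch of this suggestion, assuming the fragments arrive as null-terminated C strings: the helper name `build_into` and the use of a `std::vector<const char*>` to carry the fragments are illustrative assumptions, not part of the original answer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: rebuilds `buffer` from C-string fragments.
// clear() resets the length to zero; the mainstream standard library
// implementations leave the capacity untouched, so later appends reuse
// the block obtained by an earlier reserve() call.
void build_into(std::string& buffer, const std::vector<const char*>& fragments) {
    buffer.clear();
    for (const char* fragment : fragments) {
        assert(fragment != nullptr);
        // Copies the bytes, but should not reallocate while the total
        // length still fits in buffer.capacity().
        buffer.append(fragment);
    }
}
```

The caller would create one `std::string` outside the loop, call `reserve` once with a generous upper bound, and pass the same object to `build_into` on every iteration. The bytes of each fragment are still copied; what this avoids is repeated heap allocation.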

Answer 2

Answered by puetzk

You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.

A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
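A rough illustration of the iterator-pair idea follows. The processing step here (counting whitespace-separated words) and the name `count_words` are stand-ins chosen for the example; the original question does not say what the real processing is.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical processing step: counts whitespace-separated words in
// the range [first, last). It only needs ++, * and != from the iterator.
template <typename Iterator>
std::size_t count_words(Iterator first, Iterator last) {
    std::size_t words = 0;
    bool in_word = false;
    for (; first != last; ++first) {
        const bool is_space =
            std::isspace(static_cast<unsigned char>(*first)) != 0;
        if (!is_space && !in_word) ++words;  // a new word starts here
        in_word = !is_space;
    }
    return words;
}

// Usage sketch: the same template accepts three kinds of range.
void usage_example() {
    const char raw[] = "one two  three";
    const std::string text = raw;
    const std::vector<char> chars(text.begin(), text.end());
    assert(count_words(raw, raw + std::strlen(raw)) == 3);        // raw pointers
    assert(count_words(text.begin(), text.end()) == 3);           // std::string
    assert(count_words(chars.begin(), chars.end()) == 3);         // vector of chars
}
```

Because the function never sees the container, the raw `char*` buffer can be processed in place and no `std::string` needs to be constructed at all.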

Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds: basically just an object with some standard typedefs, and operators ++, *, =, == and != overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hits the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That way you never have to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments you have and how large they are. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
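One possible shape for such an iterator is sketched below. The class name `fragment_iterator`, the choice to hold the fragments in a `std::vector<const char*>`, and the decision to skip every whitespace character (not only leading and trailing ones) are all assumptions made for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical forward iterator over a list of null-terminated fragments.
// It walks the fragments in order and skips whitespace, so the caller sees
// one logical character sequence without any contiguous copy being made.
class fragment_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = const char&;

    // index == 0 gives the begin iterator; index == fragments.size()
    // gives the end iterator.
    fragment_iterator(const std::vector<const char*>& fragments, std::size_t index)
        : fragments_(&fragments),
          index_(index),
          pos_(index < fragments.size() ? fragments[index] : nullptr) {
        settle();
    }

    reference operator*() const { return *pos_; }

    fragment_iterator& operator++() {
        ++pos_;
        settle();
        return *this;
    }

    fragment_iterator operator++(int) {
        fragment_iterator old = *this;
        ++*this;
        return old;
    }

    friend bool operator==(const fragment_iterator& a, const fragment_iterator& b) {
        return a.index_ == b.index_ && a.pos_ == b.pos_;
    }
    friend bool operator!=(const fragment_iterator& a, const fragment_iterator& b) {
        return !(a == b);
    }

private:
    // Advances until pos_ points at a non-whitespace character or every
    // fragment has been consumed (index_ == size, pos_ == nullptr).
    void settle() {
        while (index_ < fragments_->size()) {
            assert(pos_ != nullptr);
            if (*pos_ == '\0') {
                ++index_;  // end of this fragment: move to the next one
                pos_ = index_ < fragments_->size() ? (*fragments_)[index_] : nullptr;
            } else if (std::isspace(static_cast<unsigned char>(*pos_))) {
                ++pos_;    // skip whitespace inside the fragment
            } else {
                return;
            }
        }
    }

    const std::vector<const char*>* fragments_;
    std::size_t index_;
    const char* pos_;
};
```

A pair `fragment_iterator(fragments, 0)` and `fragment_iterator(fragments, fragments.size())` can then be handed to any processing function written against an iterator range. The per-character branch in `settle` has a cost, so this is only likely to pay off when avoiding the large contiguous copy matters more than raw iteration speed.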

UPDATE (since I still see occasional upvotes on this answer):

C++17 introduces another choice: std::string_view, which has replaced std::string in many function signatures, is a non-owning reference to character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying that std::string imposes.
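A small sketch of how this might look for the trim-then-process case, assuming a C++17 compiler. The helper names `view_of` and `trim` are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Wraps the first `length` bytes of an existing buffer without copying.
// The view is only valid while the underlying buffer stays alive.
std::string_view view_of(const char* buffer, std::size_t length) {
    assert(buffer != nullptr);
    return std::string_view(buffer, length);
}

// Hypothetical trim helper: returns a narrower view of the same bytes.
std::string_view trim(std::string_view text) {
    constexpr std::string_view whitespace = " \t\r\n";
    const std::size_t first = text.find_first_not_of(whitespace);
    if (first == std::string_view::npos) return {};  // all whitespace
    const std::size_t last = text.find_last_not_of(whitespace);
    return text.substr(first, last - first + 1);
}
```

Both functions return views into the caller's buffer, so trimming costs no allocation. The trade-off is lifetime: the processing code must finish with the view before the buffer is overwritten for the next iteration.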

Answer 3

Answered by Martin York

To help with really big strings, SGI has the class rope in its STL.
Non-standard, but may be useful.

http://www.sgi.com/tech/stl/Rope.html

Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)

Answer 4

Answered by Daniel Earwicker

This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...

Read-only processing of a std::string usually needs only a small, simple subset of std::string's features. Is there a possibility that you could do a search/replace on the code that performs all the processing on std::strings so that it takes some other type instead? Start with a blank class:

class lightweight_string { };

Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
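To make the idea concrete, here is one way the class might look after a few rounds of fixing compiler errors. The member set shown is purely an assumption; in practice the errors from the search/replace would dictate which operations are actually needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// A possible later stage of lightweight_string: a read-only, non-owning
// wrapper around an existing char buffer. It never allocates or copies.
class lightweight_string {
public:
    lightweight_string(const char* data, std::size_t size)
        : data_(data), size_(size) {}

    // Convenience constructor for null-terminated input.
    explicit lightweight_string(const char* c_str)
        : data_(c_str), size_(std::strlen(c_str)) {}

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }
    bool empty() const { return size_ == 0; }

    // Raw pointers serve as iterators, which is enough for range-for
    // and for the standard algorithms.
    const char* begin() const { return data_; }
    const char* end() const { return data_ + size_; }

    char operator[](std::size_t i) const {
        assert(i < size_);
        return data_[i];
    }

private:
    const char* data_;  // not owned; must outlive this object
    std::size_t size_;
};
```

Since the object only stores a pointer and a length, constructing one per iteration is cheap, but the underlying buffer has to stay valid for as long as the processing code holds the object.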

Answer 5

Answered by David Norman

Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.

Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be able to override it.
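The copy can be observed directly by comparing addresses. In this small sketch the function name `assign_copies` is made up for the illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-iteration step: copies `source` into a long-lived string.
// Returns true if the string's storage is distinct from the source buffer,
// which is always the case because std::string owns its characters.
bool assign_copies(std::string& target, const char* source) {
    assert(source != nullptr);
    target.assign(source);
    return target.data() != source;
}
```

When the same `target` is kept across iterations, common implementations reuse its existing allocation as long as the new contents fit within `capacity()`, so the copy remains but the repeated allocation does not. The standard does not spell out that guarantee for every case, so it is worth checking on the implementation in use.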

Answer 6

Answered by Alan

In this case, might it be better to process the char* directly, instead of assigning it to a std::string?
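For instance, the trim step could be done with plain pointers. This is only a sketch: the name `trimmed_bounds` is invented, and it assumes the buffer is null-terminated.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <utility>

// Hypothetical helper: returns the [first, last) bounds of `text` with
// leading and trailing whitespace excluded. Nothing is copied or allocated.
std::pair<const char*, const char*> trimmed_bounds(const char* text) {
    assert(text != nullptr);
    const char* first = text;
    const char* last = text + std::strlen(text);
    while (first != last && std::isspace(static_cast<unsigned char>(*first))) {
        ++first;
    }
    while (last != first && std::isspace(static_cast<unsigned char>(last[-1]))) {
        --last;
    }
    return {first, last};
}
```

The returned pair can be passed straight to processing code that accepts a begin/end pair of pointers, so the data never leaves the original buffer.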
