C++ boost::tokenizer vs boost::split
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/7930796/
boost::tokenizer vs boost::split
Asked by czchlong
I am trying to parse a C++ string into a vector of tokens, splitting on every '^' character. I have always used the boost::split method, but I am now writing performance-critical code and would like to know which one gives better performance.
For example:
string message = "A^B^C^D";
vector<string> tokens;
boost::split(tokens, message, boost::is_any_of("^"));
vs.
boost::char_separator<char> sep("^");
boost::tokenizer<boost::char_separator<char> > tokens(message, sep);
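
For reference, a minimal complete program exercising both snippets might look like this (a sketch; the includes and main() scaffolding are assumptions, since the question doesn't show them):

#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <boost/tokenizer.hpp>

int main()
{
    std::string message = "A^B^C^D";

    // split: eagerly copies every token into the container
    std::vector<std::string> tokens;
    boost::split(tokens, message, boost::is_any_of("^"));

    // tokenizer: lazily yields tokens as you iterate
    boost::char_separator<char> sep("^");
    boost::tokenizer<boost::char_separator<char> > tok(message, sep);
    for (auto const& t : tok)
        std::cout << t << '\n';
    return 0;
}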
Which one would give better performance and why?
Answered by brandx
The best choice depends on a few factors. If you only need to scan the tokens once, then boost::tokenizer is a good choice in both runtime and space performance (those vectors of tokens can take up a lot of space, depending on the input data).
If you're going to be scanning the tokens often, or need a vector with efficient random access, then boost::split into a vector may be the better option.
For example, in your "A^B^C^...^Z" input string where the tokens are 1 byte in length, the boost::split/vector<string> method will consume at least 2*N-1 bytes. With the way strings are stored in most STL implementations, you can figure on it taking more than 8x that count. Storing these strings in a vector is costly in terms of memory and time.
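
To get a feel for where that 8x figure can come from (an illustration of mine, not from the answer): on a typical 64-bit standard library, sizeof(std::string) alone is 24-32 bytes before any heap allocation, so a vector of one-character tokens pays that object overhead once per token:

#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> tokens(26, "A");   // 26 one-byte tokens
    std::cout << "sizeof(std::string): " << sizeof(std::string) << " bytes\n"
              << "vector object overhead: "
              << tokens.capacity() * sizeof(std::string)
              << " bytes for 26 bytes of payload\n";
    return 0;
}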
I ran a quick test on my machine and a similar pattern with 10 million tokens looked like this:
- boost::split = 2.5s and ~620MB
- boost::tokenizer = 0.9s and 0MB
If you're just doing a one-time scan of the tokens, then clearly the tokenizer is better. But, if you're shredding into a structure that you want to reuse during the lifetime of your application, then having a vector of tokens may be preferred.
If you want to go the vector route, then I'd recommend not using a vector<string>, but a vector of string::iterator pairs instead. Just shred into a pair of iterators and keep around your big string of tokens for reference. For example:
using namespace std;

// s is the big input string kept alive elsewhere; each token is stored
// as a (begin, end) pair of iterators into s rather than as a copy
vector<pair<string::const_iterator, string::const_iterator> > tokens;
boost::split(tokens, s, boost::is_any_of("^"));
for (auto beg = tokens.begin(); beg != tokens.end(); ++beg) {
    cout << string(beg->first, beg->second) << endl;
}
This improved version takes 1.6s and 390MB on the same server and test. And, best of all, the memory overhead of this vector is linear with the number of tokens -- not dependent in any way on the length of the tokens, whereas a std::vector<string> stores each token.
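
A closely related variant (my sketch, not from the answer) uses boost::iterator_range as the token type instead of a raw pair; the elements still refer back into the source string, and they stream directly to cout:

#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <boost/range/iterator_range.hpp>

int main()
{
    std::string const s = "A^B^C^D";
    // each element refers back into s; no per-token string is allocated
    std::vector<boost::iterator_range<std::string::const_iterator> > tokens;
    boost::split(tokens, s, boost::is_any_of("^"));
    for (auto const& tok : tokens)
        std::cout << tok << '\n';
    return 0;
}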
Answered by frobenius
I find rather different results using clang++ -O3 -std=c++11 -stdlib=libc++.
First I extracted a text file with ~470k words separated by commas with no newlines into a giant string, like so:
path const inputPath("input.txt");
filebuf buf;
buf.open(inputPath.string(), ios::in);
if (!buf.is_open())
    return cerr << "can't open" << endl, 1;
string str(filesystem::file_size(inputPath), '\0');
buf.sgetn(&str[0], str.size());
buf.close();

Then I ran various timed tests storing results into a pre-sized vector cleared between runs, for example,

void vectorStorage(string const& str)
{
    static size_t const expectedSize = 471785;
    vector<string> contents;
    contents.reserve(expectedSize + 1);
    ...
    {
        timed _("split is_any_of");
        split(contents, str, is_any_of(","));
    }
    if (expectedSize != contents.size()) throw runtime_error("bad size");
    contents.clear();
    ...
}
For reference, the timer is just this:

struct timed
{
    ~timed()
    {
        auto duration = chrono::duration_cast<chrono::duration<double, ratio<1,1000>>>(chrono::high_resolution_clock::now() - start_);
        cout << setw(40) << right << name_ << ": " << duration.count() << " ms" << endl;
    }
    timed(std::string name = "") :
        name_(name)
    {}
    chrono::high_resolution_clock::time_point const start_ = chrono::high_resolution_clock::now();
    string const name_;
};
I also clocked a single iteration (no vector). Here is that code:

{
    string word;
    word.reserve(128);

    timed _("tokenizer");
    boost::char_separator<char> sep(",");
    boost::tokenizer<boost::char_separator<char> > tokens(str, sep);
    for (auto range : tokens)
    {}
}

{
    string word;

    timed _("split iterator");
    for (auto it = make_split_iterator(str, token_finder(is_from_range(',', ',')));
         it != decltype(it)(); ++it)
    {
        word = move(copy_range<string>(*it));
    }
}

The tokenizer is so much slower than split that the one-iteration figure doesn't even include the string copy. Here are the results:

Vector:
                  hand-coded: 54.8777 ms
             split is_any_of: 67.7232 ms
         split is_from_range: 49.0215 ms
                   tokenizer: 119.37 ms
One iteration:
                   tokenizer: 97.2867 ms
              split iterator: 26.5444 ms
split iterator back_inserter: 57.7194 ms
    split iterator char copy: 34.8381 ms
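
The answer doesn't show the code for every variant it timed; a "split iterator back_inserter" version might have looked roughly like this (a hypothetical reconstruction, not the author's code):

#include <algorithm>
#include <iterator>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

// Hypothetical: materialize each token yielded by the split iterator
// into a std::string and append it to the vector via back_inserter.
void splitIteratorBackInserter(std::string const& str,
                               std::vector<std::string>& contents)
{
    using namespace boost;
    std::transform(make_split_iterator(str, token_finder(is_any_of(","))),
                   split_iterator<std::string::const_iterator>(),
                   std::back_inserter(contents),
                   [](iterator_range<std::string::const_iterator> const& r)
                   { return copy_range<std::string>(r); });
}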
Unambiguous conclusion: use split.
Answered by Bryan Donaldson
It might depend on your version of boost and how you're using the functionality.
We had a performance issue in some logic that was using boost::split (boost 1.41.0) to handle thousands or hundreds of thousands of smaller strings (expecting fewer than 10 tokens each). When I ran the code through a performance analyzer, we found that a surprising 39% of the time was spent in boost::split.
We tried some simple "fixes" that didn't affect performance materially, like "we know we won't have more than 10 items on each pass, so preset the vector to 10 items".
Since we didn't actually need the vector and could just iterate the tokens and accomplish the same job, we changed the code to boost::tokenizer, and the same section of code dropped to <1% of the runtime.
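
The answer doesn't include code; a sketch of that kind of change (illustrative names, assuming comma-separated input) might look like:

#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <boost/tokenizer.hpp>

// before: materialize all tokens into a vector, then walk it
void processWithSplit(std::string const& line)
{
    std::vector<std::string> fields;
    boost::split(fields, line, boost::is_any_of(","));
    for (auto const& f : fields) { (void)f; /* handle field */ }
}

// after: iterate the tokens directly; no vector, no per-token
// container growth
void processWithTokenizer(std::string const& line)
{
    boost::char_separator<char> sep(",");
    boost::tokenizer<boost::char_separator<char> > tok(line, sep);
    for (auto const& f : tok) { (void)f; /* handle field */ }
}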
Answered by Blake Booyah
Processing the tokens as you produce them is the key. I have a setup with a regex, and it seems to be as fast as boost::tokenizer. If I store the matches in a vector, it's at least 50 times slower.
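
The answer doesn't show its setup; one way to process regex matches as they are produced (a sketch using the standard library rather than whatever the author used) is sregex_token_iterator with -1, which yields the text between delimiter matches:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string const message = "A^B^C^D";
    std::regex const delim("\\^");
    // -1 selects the non-matched stretches, i.e. the tokens themselves
    for (std::sregex_token_iterator it(message.begin(), message.end(), delim, -1), end;
         it != end; ++it)
        std::cout << *it << '\n';   // consume each token here, no storage
    return 0;
}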