Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2880903/

Time: 2020-08-28 11:22:13  Source: igfitidea

Simple string parsing with C++

c++

Asked by Andreas Brinck

I've been using C++ for quite a long time now, but nevertheless I tend to fall back on scanf when I have to parse simple text files. For example, given a config like this (also assuming that the order of the fields could vary):

foo: [3 4 5]
baz: 3.0

I would write something like:

char line[SOME_SIZE];
while (fgets(line, SOME_SIZE, file)) {
    int x, y, z;
    if (3 == sscanf(line, "foo: [%d %d %d]", &x, &y, &z)) {
        continue;
    }
    float w;
    if (1 == sscanf(line, "baz: %f", &w)) {
        continue;
    }
}

What's the most concise way to achieve this in C++? Whenever I try I end up with a lot of scaffolding code.

Accepted answer by Nikko

This is an attempt using only standard C++.

Most of the time I use a combination of std::istringstream and std::getline (which can also be used to separate words) to get what I want. And if I can, I make my config files look like:

foo=1,2,3,4

which makes it easy.

The text file looks like this:

foo=1,2,3,4
bar=0


And you parse it like this:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream file( "sample.txt" );

    std::string line;
    while( std::getline( file, line ) )
    {
        std::istringstream iss( line );

        // Everything up to the '=' is the key.
        std::string result;
        if( std::getline( iss, result, '=' ) )
        {
            if( result == "foo" )
            {
                // The rest of the line is a comma-separated list.
                std::string token;
                while( std::getline( iss, token, ',' ) )
                {
                    std::cout << token << std::endl;
                }
            }
            if( result == "bar" )
            {
                //...
            }
        }
    }
}
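
The loop above only prints the tokens as strings. If you need the values as numbers, the same getline-based splitting can be wrapped in a small function. The following is just a sketch: parse_int_list is a made-up name (not part of the answer), and it assumes that every value after the '=' is a plain integer.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits "key=v1,v2,..." into the key and a list of
// integers. Returns false if the line has no '=' or a value is not an integer.
bool parse_int_list(const std::string& line, std::string& key, std::vector<int>& values)
{
    std::istringstream iss(line);

    // getline sets eof when it runs out of input before finding the '='.
    if (!std::getline(iss, key, '=') || iss.eof())
        return false;

    values.clear();
    std::string token;
    while (std::getline(iss, token, ','))
    {
        std::istringstream field(token);
        int value;
        char extra;
        // Require an integer, followed by nothing except optional whitespace.
        if (!(field >> value) || (field >> extra))
            return false;
        values.push_back(value);
    }
    return true;
}
```

Called on "foo=1,2,3,4" this should give the key "foo" and the values 1, 2, 3, 4, while a line such as "foo=1,x" is rejected.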

Answer by Nikko

The C++ String Toolkit Library (StrTk) has the following solution to your problem:

#include <string>
#include <deque>
#include "strtk.hpp"

int main()
{
   std::string file_name = "simple.txt";
   strtk::for_each_line(file_name,
                       [](const std::string& line)
                       {
                          std::deque<std::string> token_list;
                          strtk::parse(line,"[]: ",token_list);
                          if (token_list.empty()) return;

                          const std::string& key = token_list[0];

                          if (key == "foo")
                          {
                            //do 'foo' related thing with token_list[1] 
                            //and token_list[2]
                            return;
                          }

                          if (key == "bar")
                          {
                            //do 'bar' related thing with token_list[1]
                            return;
                          }

                       });

   return 0;
}

More examples can be found here.

Answer by Thomas Petit

Boost.Spirit is not reserved for parsing complicated structures. It is quite good at micro-parsing too, and almost matches the compactness of the C + scanf snippet:

#include <boost/spirit/include/qi.hpp>
#include <string>
#include <sstream>

using namespace boost::spirit::qi;


int main()
{
   std::string text = "foo: [3 4 5]\nbaz: 3.0";
   std::istringstream iss(text);

   std::string line;
   while (std::getline(iss, line))
   {
      int x, y, z;
      if(phrase_parse(line.begin(), line.end(), "foo: [" >> int_ >> int_ >> int_ >> "]", space, x, y, z))
         continue;
      float w;
      if(phrase_parse(line.begin(), line.end(), "baz: " >> float_, space, w))
         continue;
         continue;
   }
}

Why they didn't add a "container" version is beyond me. It would be much more convenient if we could just write:

if(phrase_parse(line, "foo: [" >> int_ >> int_ >> int_ >> "]", space, x, y, z))
   continue;

But it's true that:

  • It adds a lot of compile-time overhead.
  • Error messages are brutal. If you make a small mistake with scanf, you just run your program and immediately get a segfault or an absurd parsed value. Make a small mistake with Spirit and you will get hopelessly gigantic error messages from the compiler, and it takes a LOT of practice with Boost.Spirit to understand them.

So ultimately, for simple parsing I use scanf like everyone else...

Answer by Roger Dahl

Regular expressions can often be used for parsing strings. Use capture groups (parentheses) to get the various parts of the line being parsed.

For instance, to parse an expression like foo: [3 4 56], use the regular expression (.*): \[(\d+) (\d+) (\d+)\]. The first capture group will contain "foo", the second, third and fourth will contain the numbers 3, 4 and 56.
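
As a rough sketch of how that expression could be used, here is a version based on std::regex from the C++11 standard library, which I'm using here in place of a third-party regex library. The Triple struct and the parse_triple name are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for a line such as "foo: [3 4 56]".
struct Triple
{
    std::string key;
    int a, b, c;
};

// Hypothetical helper: returns true and fills `out` only if the whole line
// matches the expression from the text above.
bool parse_triple(const std::string& line, Triple& out)
{
    static const std::regex pattern(R"((.*): \[(\d+) (\d+) (\d+)\])");

    std::smatch m;
    if (!std::regex_match(line, m, pattern))  // regex_match requires a full match
        return false;

    out.key = m[1].str();
    // The groups are known to contain only digits here. std::stoi can still
    // throw std::out_of_range if a number does not fit in an int.
    out.a = std::stoi(m[2].str());
    out.b = std::stoi(m[3].str());
    out.c = std::stoi(m[4].str());
    return true;
}
```

With this, "foo: [3 4 56]" should yield the key "foo" and the numbers 3, 4 and 56, while "foo: [3 4 56x]" does not match at all.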

If there are several possible string formats that need to be parsed, as in the example given by the OP, either apply separate regular expressions one by one and see which one matches, or write a single regular expression that matches all the possible variations, typically using the | (alternation, i.e. set union) operator.

Regular expressions are very flexible, so the expression can be extended to allow more variations, for instance an arbitrary number of spaces and other whitespace after the : in the example, or to only allow the numbers to contain a certain number of digits.
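
Here is a sketch of the "apply separate expressions one by one" approach for the two line formats from the question, again using std::regex as a stand-in. The LineKind enum, the classify function and the exact patterns are my own assumptions. \s* allows any amount of whitespace after the colon, and the baz pattern only accepts plain decimal numbers such as 3 or 3.0.

```cpp
#include <regex>
#include <string>

enum class LineKind { Foo, Baz, Unknown };

// Hypothetical classifier: tries each expression in turn and reports which
// one matched, storing the parsed values in xyz or w.
LineKind classify(const std::string& line, int (&xyz)[3], float& w)
{
    static const std::regex foo_re(R"(foo:\s*\[(\d+) (\d+) (\d+)\])");
    static const std::regex baz_re(R"(baz:\s*(\d+(?:\.\d+)?))");

    std::smatch m;
    if (std::regex_match(line, m, foo_re))
    {
        for (int i = 0; i < 3; ++i)
            xyz[i] = std::stoi(m[i + 1].str());  // group 0 is the whole match
        return LineKind::Foo;
    }
    if (std::regex_match(line, m, baz_re))
    {
        w = std::stof(m[1].str());
        return LineKind::Baz;
    }
    return LineKind::Unknown;
}
```

Lines that match neither expression come back as LineKind::Unknown, so the caller can decide whether to skip them or report an error.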

As an added bonus, regular expressions provide implicit validation, since they require a perfect match. For instance, if the number 56 in the example above were replaced with 56x, the match would fail. This can also simplify code: in the example above, the groups containing the numbers can be safely converted to integers after a successful match, without any additional format checking.

Regular expressions usually perform well, and there are many good libraries to choose from, for instance Boost.Regex.

Answer by rcollyer

I feel your pain. I regularly deal with files that have fixed-width fields (output via Fortran77 code), so it is always entertaining to attempt to load them with the minimum of fuss. Personally, I'd like to see boost::format supply a scanf implementation. But, barring implementing it myself, I do something similar to @Nikko, using boost::tokenizer with offset separators and lexical cast for conversion. For example,

#include <boost/lexical_cast.hpp>
#include <boost/tokenizer.hpp>
#include <string>
#include <vector>

typedef boost::token_iterator_generator< 
                                boost::char_separator<char> >::type tokenizer;

boost::char_separator<char> sep("=,");

// file_istream is an already opened input stream; trimws is a
// whitespace-trimming helper that is not shown here.
std::string line;
std::getline( file_istream, line );
tokenizer tok = boost::make_token_iterator< std::string > (
                                line.begin(), line.end(), sep );

std::string var = *tok;  // need to check for tok.at_end() here
++tok;

std::vector< int > vals;
for(;!tok.at_end();++tok){
 vals.push_back( boost::lexical_cast< int >( trimws( *tok ) ) );
}

Note: boost::lexical_cast does not deal well with leading whitespace (it throws), so I recommend trimming the whitespace of anything you pass it.
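
The trimws helper used above is not shown in the answer. A minimal sketch of what such a helper might look like, using only the standard library and assuming that "whitespace" means spaces, tabs, carriage returns and newlines:

```cpp
#include <string>

// Hypothetical implementation of a trimws-style helper: returns a copy of
// `s` without leading and trailing spaces, tabs, CR or LF characters.
std::string trimws(const std::string& s)
{
    const char* const ws = " \t\r\n";

    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos)
        return std::string();  // empty or all whitespace

    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}
```

If you are already using Boost, boost::algorithm::trim_copy from <boost/algorithm/string/trim.hpp> should do the same job.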

Answer by rcollyer

I think Boost.Spirit is a good way to describe a grammar right in your C++ code. It takes some time to get used to Boost.Spirit, but after that it is quite easy to use. It might not be as concise as you probably want, but I think it is a handy way of handling simple grammars. Its performance might be a problem, so in situations where you need speed it might not be a good choice.