How can I read and parse CSV files in C++?

Question

Asked by User1

I need to load and use CSV file data in C++. At this point it can really just be a comma-delimited parser (i.e., don't worry about escaping newlines and commas). The main need is a line-by-line parser that will return a vector for the next line each time the method is called.

I found this article which looks quite promising: http://www.boost.org/doc/libs/1_35_0/libs/spirit/example/fundamental/list_parser.cpp

I've never used Boost's Spirit, but am willing to try it. But only if there isn't a more straightforward solution I'm overlooking.

Answer 1

Answered by Martin York

If you don't care about escaping commas and newlines,
AND you can't embed commas and newlines in quotes (if you can't escape them, then...)
then it's only about three lines of code (OK, 14, but it's only 15 to read the whole file).

std::vector<std::string> getNextLineAndSplitIntoTokens(std::istream& str)
{
    std::vector<std::string>   result;
    std::string                line;
    std::getline(str,line);

    std::stringstream          lineStream(line);
    std::string                cell;

    while(std::getline(lineStream,cell, ','))
    {
        result.push_back(cell);
    }
    // This checks for a trailing comma with no data after it.
    if (!lineStream && cell.empty())
    {
        // If there was a trailing comma then add an empty element.
        result.push_back("");
    }
    return result;
}
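
The "15 lines to read the whole file" remark could look roughly like the sketch below. It is only an illustration: splitLine and readAllRows are hypothetical names that are not part of the answer, and the splitting is the same plain comma split as above (no quoting, no escaping).

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one already-read line on commas (no quoting support).
std::vector<std::string> splitLine(const std::string& line)
{
    std::vector<std::string> result;
    std::stringstream        lineStream(line);
    std::string              cell;

    while (std::getline(lineStream, cell, ','))
    {
        result.push_back(cell);
    }
    // Same trailing-comma check as above: adds an empty last field when the
    // line ends with a comma (an empty line also yields one empty field).
    if (!lineStream && cell.empty())
    {
        result.push_back("");
    }
    return result;
}

// Hypothetical helper: reads every remaining line of the stream into a table.
std::vector<std::vector<std::string>> readAllRows(std::istream& in)
{
    std::vector<std::vector<std::string>> rows;
    std::string line;
    while (std::getline(in, line))   // stops cleanly at end of input
    {
        rows.push_back(splitLine(line));
    }
    return rows;
}
```

Calling readAllRows on an open std::ifstream should give one vector of cells per line. Rows may have different lengths, so check size() before indexing.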

I would just create a class representing a row.
Then stream into that object:

#include <iterator>
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>

class CSVRow
{
    public:
        std::string const& operator[](std::size_t index) const
        {
            return m_data[index];
        }
        std::size_t size() const
        {
            return m_data.size();
        }
        void readNextRow(std::istream& str)
        {
            std::string         line;
            std::getline(str, line);

            std::stringstream   lineStream(line);
            std::string         cell;

            m_data.clear();
            while(std::getline(lineStream, cell, ','))
            {
                m_data.push_back(cell);
            }
            // This checks for a trailing comma with no data after it.
            if (!lineStream && cell.empty())
            {
                // If there was a trailing comma then add an empty element.
                m_data.push_back("");
            }
        }
    private:
        std::vector<std::string>    m_data;
};

std::istream& operator>>(std::istream& str, CSVRow& data)
{
    data.readNextRow(str);
    return str;
}   
int main()
{
    std::ifstream       file("plop.csv");

    CSVRow              row;
    while(file >> row)
    {
        std::cout << "4th Element(" << row[3] << ")\n";
    }
}

But with a little work we could technically create an iterator:

class CSVIterator
{   
    public:
        typedef std::input_iterator_tag     iterator_category;
        typedef CSVRow                      value_type;
        typedef std::size_t                 difference_type;
        typedef CSVRow*                     pointer;
        typedef CSVRow&                     reference;

        CSVIterator(std::istream& str)  :m_str(str.good()?&str:NULL) { ++(*this); }
        CSVIterator()                   :m_str(NULL) {}

        // Pre Increment
        CSVIterator& operator++()               {if (m_str) { if (!((*m_str) >> m_row)){m_str = NULL;}}return *this;}
        // Post increment
        CSVIterator operator++(int)             {CSVIterator    tmp(*this);++(*this);return tmp;}
        CSVRow const& operator*()   const       {return m_row;}
        CSVRow const* operator->()  const       {return &m_row;}

        bool operator==(CSVIterator const& rhs) const {return ((this == &rhs) || ((this->m_str == NULL) && (rhs.m_str == NULL)));}
        bool operator!=(CSVIterator const& rhs) const {return !((*this) == rhs);}
    private:
        std::istream*       m_str;
        CSVRow              m_row;
};


int main()
{
    std::ifstream       file("plop.csv");

    for(CSVIterator loop(file); loop != CSVIterator(); ++loop)
    {
        std::cout << "4th Element(" << (*loop)[3] << ")\n";
    }
}
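
Once there is an iterator, a small range wrapper is enough to use a range-based for loop (C++11). The following is a sketch under assumptions: CSVRange and collectColumn are illustrative names, and CSVRow and CSVIterator are condensed copies of the classes above so that the block compiles on its own.

```cpp
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Condensed copy of CSVRow from above.
class CSVRow
{
    public:
        std::string const& operator[](std::size_t index) const { return m_data[index]; }
        std::size_t size() const { return m_data.size(); }
        void readNextRow(std::istream& str)
        {
            std::string line;
            std::getline(str, line);
            std::stringstream lineStream(line);
            std::string cell;
            m_data.clear();
            while (std::getline(lineStream, cell, ',')) { m_data.push_back(cell); }
            if (!lineStream && cell.empty()) { m_data.push_back(""); }
        }
    private:
        std::vector<std::string> m_data;
};

std::istream& operator>>(std::istream& str, CSVRow& data)
{
    data.readNextRow(str);
    return str;
}

// Condensed copy of CSVIterator from above.
class CSVIterator
{
    public:
        typedef std::input_iterator_tag iterator_category;
        typedef CSVRow                  value_type;
        typedef std::ptrdiff_t          difference_type;
        typedef CSVRow const*           pointer;
        typedef CSVRow const&           reference;

        explicit CSVIterator(std::istream& str) : m_str(str.good() ? &str : nullptr) { ++(*this); }
        CSVIterator() : m_str(nullptr) {}

        // Reads the next row; becomes the end iterator when the read fails.
        CSVIterator& operator++() { if (m_str && !((*m_str) >> m_row)) { m_str = nullptr; } return *this; }
        CSVRow const& operator*() const { return m_row; }

        bool operator==(CSVIterator const& rhs) const
        { return (this == &rhs) || (m_str == nullptr && rhs.m_str == nullptr); }
        bool operator!=(CSVIterator const& rhs) const { return !(*this == rhs); }
    private:
        std::istream* m_str;
        CSVRow        m_row;
};

// Hypothetical wrapper: exposes begin()/end() so range-based for works.
class CSVRange
{
    public:
        explicit CSVRange(std::istream& str) : m_stream(str) {}
        CSVIterator begin() const { return CSVIterator(m_stream); }
        CSVIterator end()   const { return CSVIterator(); }
    private:
        std::istream& m_stream;
};

// Hypothetical usage helper: gathers one column, skipping rows that are too short.
std::vector<std::string> collectColumn(std::istream& in, std::size_t index)
{
    std::vector<std::string> column;
    for (auto const& row : CSVRange(in))
    {
        if (index < row.size()) { column.push_back(row[index]); }
    }
    return column;
}
```

The size check in collectColumn is a deliberate difference from the main() above, which indexes row[3] without checking that the row has four fields.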

Answer 2

Answered by dtw

Solution using Boost Tokenizer:

std::vector<std::string> vec;
using namespace boost;
tokenizer<escaped_list_separator<char> > tk(
   line, escaped_list_separator<char>('\\', ',', '\"'));
for (tokenizer<escaped_list_separator<char> >::iterator i(tk.begin());
   i!=tk.end();++i) 
{
   vec.push_back(*i);
}

Answer 3

Answered by sastanin

My version is not using anything but the standard C++11 library. It copes well with Excel CSV quotation:

spam eggs,"foo,bar","""fizz buzz"""
1.23,4.567,-8.00E+09

The code is written as a finite-state machine and is consuming one character at a time. I think it's easier to reason about.

#include <istream>
#include <string>
#include <vector>

enum class CSVState {
    UnquotedField,
    QuotedField,
    QuotedQuote
};

std::vector<std::string> readCSVRow(const std::string &row) {
    CSVState state = CSVState::UnquotedField;
    std::vector<std::string> fields {""};
    size_t i = 0; // index of the current field
    for (char c : row) {
        switch (state) {
            case CSVState::UnquotedField:
                switch (c) {
                    case ',': // end of field
                              fields.push_back(""); i++;
                              break;
                    case '"': state = CSVState::QuotedField;
                              break;
                    default:  fields[i].push_back(c);
                              break; }
                break;
            case CSVState::QuotedField:
                switch (c) {
                    case '"': state = CSVState::QuotedQuote;
                              break;
                    default:  fields[i].push_back(c);
                              break; }
                break;
            case CSVState::QuotedQuote:
                switch (c) {
                    case ',': // , after closing quote
                              fields.push_back(""); i++;
                              state = CSVState::UnquotedField;
                              break;
                    case '"': // "" -> "
                              fields[i].push_back('"');
                              state = CSVState::QuotedField;
                              break;
                    default:  // end of quote
                              state = CSVState::UnquotedField;
                              break; }
                break;
        }
    }
    return fields;
}

/// Read CSV file, Excel dialect. Accept "quoted fields ""with quotes"""
std::vector<std::vector<std::string>> readCSV(std::istream &in) {
    std::vector<std::vector<std::string>> table;
    std::string row;
    // Let getline() drive the loop so a failed read never produces a row.
    while (std::getline(in, row)) {
        table.push_back(readCSVRow(row));
    }
    return table;
}
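
One limitation of the code above is that readCSV splits the input on newlines before the state machine runs, so a quoted field that contains a line break is cut in two. A possible variation, sketched here under my own assumptions (readCSVRecord and parseCSVText are hypothetical names, not part of the answer), runs the same kind of character-by-character logic directly on the stream:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one CSV record from `in`, letting quoted fields
// span several physical lines. Returns false when no more input is available.
bool readCSVRecord(std::istream& in, std::vector<std::string>& fields)
{
    fields.clear();
    std::string field;
    bool inQuotes = false;
    bool sawAnything = false;
    char c;
    while (in.get(c)) {
        sawAnything = true;
        if (inQuotes) {
            if (c == '"') {
                if (in.peek() == '"') { in.get(c); field.push_back('"'); } // "" -> "
                else { inQuotes = false; }                                 // closing quote
            } else {
                field.push_back(c);      // commas and newlines are data here
            }
        } else if (c == '"') {
            inQuotes = true;             // opening quote
        } else if (c == ',') {
            fields.push_back(field);     // end of field
            field.clear();
        } else if (c == '\n') {
            break;                       // end of record
        } else if (c != '\r') {          // drop the CR of CRLF line endings
            field.push_back(c);
        }
    }
    if (!sawAnything) return false;
    fields.push_back(field);             // last field of the record
    return true;
}

// Hypothetical convenience wrapper: parses a whole string into a table.
std::vector<std::vector<std::string>> parseCSVText(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::vector<std::string>> table;
    std::vector<std::string> fields;
    while (readCSVRecord(in, fields)) {
        table.push_back(fields);
    }
    return table;
}
```

Like the original, this accepts the Excel-style doubled quote inside quoted fields. It makes no attempt to report malformed input such as an unterminated quote.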

Answer 4

Answered by sastanin

The C++ String Toolkit Library (StrTk) has a token grid class that allows you to load data from text files, strings, or char buffers, and to parse/process them in a row-column fashion.

You can specify the row delimiters and column delimiters or just use the defaults.

void foo()
{
   std::string data = "1,2,3,4,5\n"
                      "0,2,4,6,8\n"
                      "1,3,5,7,9\n";

   strtk::token_grid grid(data,data.size(),",");

   for(std::size_t i = 0; i < grid.row_count(); ++i)
   {
      strtk::token_grid::row_type r = grid.row(i);
      for(std::size_t j = 0; j < r.size(); ++j)
      {
         std::cout << r.get<int>(j) << "\t";
      }
      std::cout << std::endl;
   }
   std::cout << std::endl;
}

More examples can be found Here

Answer 5

Answered by Joel de Guzman

It is not overkill to use Spirit for parsing CSVs. Spirit is well suited for micro-parsing tasks. For instance, with Spirit 2.1, it is as easy as:

bool r = phrase_parse(first, last,

    //  Begin grammar
    (
        double_ % ','
    )
    ,
    //  End grammar

    space, v);

The vector, v, gets stuffed with the values. There is a series of tutorials touching on this in the new Spirit 2.1 docs that have just been released with Boost 1.41.

The tutorial progresses from simple to complex. The CSV parsers are presented somewhere in the middle and touch on various techniques in using Spirit. The generated code is as tight as hand-written code. Check out the assembler generated!

Answer 6

Answered by stefanB

You can use Boost Tokenizer with escaped_list_separator.

escaped_list_separator parses a superset of the CSV format (see Boost::tokenizer).

This only uses Boost tokenizer header files, no linking to boost libraries required.

Here is an example (see Parse CSV File With Boost Tokenizer In C++ for details, or Boost::tokenizer):

#include <iostream>     // cout, endl
#include <fstream>      // ifstream
#include <vector>
#include <string>
#include <algorithm>    // copy
#include <iterator>     // ostream_iterator
#include <boost/tokenizer.hpp>

int main()
{
    using namespace std;
    using namespace boost;
    string data("data.csv");

    ifstream in(data.c_str());
    if (!in.is_open()) return 1;

    typedef tokenizer< escaped_list_separator<char> > Tokenizer;
    vector< string > vec;
    string line;

    while (getline(in,line))
    {
        Tokenizer tok(line);
        vec.assign(tok.begin(),tok.end());

        // vector now contains strings from one row, output to cout here
        copy(vec.begin(), vec.end(), ostream_iterator<string>(cout, "|"));

        cout << "\n----------------------" << endl;
    }
}

Answer 7

Answered by Michael

If you DO care about parsing CSV correctly, this will do it... relatively slowly, as it works one char at a time.

 void ParseCSV(const string& csvSource, vector<vector<string> >& lines)
    {
       bool inQuote(false);
       bool newLine(false);
       string field;
       lines.clear();
       vector<string> line;

       string::const_iterator aChar = csvSource.begin();
       while (aChar != csvSource.end())
       {
          switch (*aChar)
          {
          case '"':
             newLine = false;
             inQuote = !inQuote;
             break;

          case ',':
             newLine = false;
             if (inQuote == true)
             {
                field += *aChar;
             }
             else
             {
                line.push_back(field);
                field.clear();
             }
             break;

          case '\n':
          case '\r':
             if (inQuote == true)
             {
                field += *aChar;
             }
             else
             {
                if (newLine == false)
                {
                   line.push_back(field);
                   lines.push_back(line);
                   field.clear();
                   line.clear();
                   newLine = true;
                }
             }
             break;

          default:
             newLine = false;
             field.push_back(*aChar);
             break;
          }

          aChar++;
       }

       if (field.size())
          line.push_back(field);

       if (line.size())
          lines.push_back(line);
    }

Answer 8

Answered by Rolf Kristensen

When using the Boost Tokenizer escaped_list_separator for CSV files, one should be aware of the following:

It requires an escape-character (default back-slash - \)
It requires a splitter/separator-character (default comma - ,)
It requires a quote-character (default quote - ")

The CSV format specified by wiki states that data fields can contain separators in quotes (supported):

1997,Ford,E350,"Super, luxurious truck"

The CSV format specified by wiki states that a quote character inside a field should be written as two double-quotes (escaped_list_separator will strip away all quote characters):

1997,Ford,E350,"Super ""luxurious"" truck"

The CSV format doesn't specify that any back-slash characters should be stripped away (escaped_list_separator will strip away all escape characters).

A possible work-around to fix the default behavior of the boost escaped_list_separator:

First, replace all back-slash characters (\) with two back-slash characters (\\) so they are not stripped away.
Second, replace all doubled double-quotes ("") with a single back-slash character and a quote (\").

This work-around has the side effect that empty data fields represented by a pair of double-quotes ("") will be transformed into a token consisting of a single quote character ("). When iterating through the tokens, one must check whether a token is a lone quote character and treat it as an empty string.

Not pretty, but it works, as long as there are no newlines within the quotes.
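
The two replacement steps could be sketched with the standard library alone, as a preprocessing pass applied to each line before it is handed to the tokenizer. The names replaceAll and prepareForEscapedListSeparator are made up for this illustration and are not part of Boost.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces every occurrence of `from` in `text` with `to`,
// scanning left to right and never rescanning text it has just inserted.
std::string replaceAll(std::string text, const std::string& from, const std::string& to)
{
    if (from.empty()) return text;
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();
    }
    return text;
}

// Hypothetical helper: applies the two steps of the work-around to one line.
std::string prepareForEscapedListSeparator(const std::string& line)
{
    // Step 1: \ -> \\ so that existing back-slashes survive tokenizing.
    std::string escaped = replaceAll(line, "\\", "\\\\");
    // Step 2: "" -> \" so that doubled quotes become escaped quotes.
    // Side effect described above: an empty quoted field ("") also becomes \".
    return replaceAll(escaped, "\"\"", "\\\"");
}
```

The order of the two steps matters: doing step 2 first would let step 1 double the back-slashes that step 2 had just inserted.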

Answer 9

Answered by jxh

As all the CSV questions seem to get redirected here, I thought I'd post my answer here. This answer does not directly address the asker's question. I wanted to be able to read in a stream that is known to be in CSV format, and where the type of each field is already known. Of course, the method below could also be used to treat every field as a string.

As an example of how I wanted to be able to use a CSV input stream, consider the following input (taken from Wikipedia's page on CSV):

const char input[] =
"Year,Make,Model,Description,Price\n"
"1997,Ford,E350,\"ac, abs, moon\",3000.00\n"
"1999,Chevy,\"Venture \"\"Extended Edition\"\"\",\"\",4900.00\n"
"1999,Chevy,\"Venture \"\"Extended Edition, Very Large\"\"\",\"\",5000.00\n"
"1996,Jeep,Grand Cherokee,\"MUST SELL!\n\
air, moon roof, loaded\",4799.00\n"
;

Then, I wanted to be able to read in the data like this:

std::istringstream ss(input);
std::string title[5];
int year;
std::string make, model, desc;
float price;
csv_istream(ss)
    >> title[0] >> title[1] >> title[2] >> title[3] >> title[4];
while (csv_istream(ss)
       >> year >> make >> model >> desc >> price) {
    //...do something with the record...
}

This was the solution I ended up with.

struct csv_istream {
    std::istream &is_;
    csv_istream (std::istream &is) : is_(is) {}
    void scan_ws () const {
        while (is_.good()) {
            int c = is_.peek();
            if (c != ' ' && c != '\t') break;
            is_.get();
        }
    }
    void scan (std::string *s = 0) const {
        std::string ws;
        int c = is_.get();
        if (is_.good()) {
            do {
                if (c == ',' || c == '\n') break;
                if (s) {
                    ws += c;
                    if (c != ' ' && c != '\t') {
                        *s += ws;
                        ws.clear();
                    }
                }
                c = is_.get();
            } while (is_.good());
            if (is_.eof()) is_.clear();
        }
    }
    template <typename T, bool> struct set_value {
        void operator () (std::string in, T &v) const {
            std::istringstream(in) >> v;
        }
    };
    template <typename T> struct set_value<T, true> {
        template <bool SIGNED> void convert (std::string in, T &v) const {
            if (SIGNED) v = ::strtoll(in.c_str(), 0, 0);
            else v = ::strtoull(in.c_str(), 0, 0);
        }
        void operator () (std::string in, T &v) const {
            convert<is_signed_int<T>::val>(in, v);
        }
    };
    template <typename T> const csv_istream & operator >> (T &v) const {
        std::string tmp;
        scan(&tmp);
        set_value<T, is_int<T>::val>()(tmp, v);
        return *this;
    }
    const csv_istream & operator >> (std::string &v) const {
        v.clear();
        scan_ws();
        if (is_.peek() != '"') scan(&v);
        else {
            std::string tmp;
            is_.get();
            std::getline(is_, tmp, '"');
            while (is_.peek() == '"') {
                v += tmp;
                v += is_.get();
                std::getline(is_, tmp, '"');
            }
            v += tmp;
            scan();
        }
        return *this;
    }
    template <typename T>
    const csv_istream & operator >> (T &(*manip)(T &)) const {
        is_ >> manip;
        return *this;
    }
    operator bool () const { return !is_.fail(); }
};

With the following helpers that may be simplified by the new integral traits templates in C++11:

template <typename T> struct is_signed_int { enum { val = false }; };
template <> struct is_signed_int<short> { enum { val = true}; };
template <> struct is_signed_int<int> { enum { val = true}; };
template <> struct is_signed_int<long> { enum { val = true}; };
template <> struct is_signed_int<long long> { enum { val = true}; };

template <typename T> struct is_unsigned_int { enum { val = false }; };
template <> struct is_unsigned_int<unsigned short> { enum { val = true}; };
template <> struct is_unsigned_int<unsigned int> { enum { val = true}; };
template <> struct is_unsigned_int<unsigned long> { enum { val = true}; };
template <> struct is_unsigned_int<unsigned long long> { enum { val = true}; };

template <typename T> struct is_int {
    enum { val = (is_signed_int<T>::val || is_unsigned_int<T>::val) };
};
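
For comparison, here is roughly what those helpers might look like on top of the C++11 <type_traits> header. This is a sketch, not a drop-in replacement: std::is_integral is also true for bool and the character types, which the hand-written specializations above do not cover, so a char field would take the strtoll path instead of the stream extraction path.

```cpp
#include <type_traits>

// Hypothetical C++11 versions of the helpers above, built on <type_traits>.
// Caveat: unlike the hand-written traits, these also match bool, char,
// wchar_t and the other character types.
template <typename T> struct is_signed_int {
    static constexpr bool val = std::is_integral<T>::value && std::is_signed<T>::value;
};

template <typename T> struct is_unsigned_int {
    static constexpr bool val = std::is_integral<T>::value && std::is_unsigned<T>::value;
};

template <typename T> struct is_int {
    static constexpr bool val = std::is_integral<T>::value;
};
```

If the narrower behavior matters, the explicit specializations are the safer choice, or the traits can be combined with std::is_same checks to exclude the unwanted types.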

Answer 10

Answered by jxh

You might want to look at my FOSS project CSVfix, which is a CSV stream editor written in C++. The CSV parser is no prize, but it does the job, and the whole package may do what you need without you writing any code.

See alib/src/a_csv.cpp for the CSV parser, and csvlib/src/csved_ioman.cpp (IOManager::ReadCSV) for a usage example.
