Python 将多个 csv 文件读取到 HDF5 时出现 Pandas ParserError EOF 字符
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/18016037/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Pandas ParserError EOF character when reading multiple csv files to HDF5
提问by Matthijs
Using Python3, Pandas 0.12
使用 Python3,Pandas 0.12
I'm trying to write multiple csv files (total size is 7.9 GB) to an HDF5 store to process later on. The csv files contain around a million rows each and 15 columns; the data types are mostly strings, but some floats. However, when I'm trying to read the csv files I get the following error:
我正在尝试将多个 csv 文件(总大小为 7.9 GB)写入 HDF5 存储以供以后处理。csv 文件每个包含大约一百万行,15 列和数据类型主要是字符串,但也有一些浮点数。但是,当我尝试读取 csv 文件时,出现以下错误:
Traceback (most recent call last):
  File "filter-1.py", line 38, in <module>
    to_hdf()
  File "filter-1.py", line 31, in to_hdf
    for chunk in reader:
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 578, in __iter__
    yield self.read(self.chunksize)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7146)
  File "parser.pyx", line 781, in pandas.parser.TextReader._read_rows (pandas\parser.c:7568)
  File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
  File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: EOF inside string starting at line 754991
Closing remaining open files: ta_store.h5... done
Edit:
编辑:
I managed to find a file that produced this problem. I think it's reading an EOF character. However I have no clue to overcome this problem. Given the large size of the combined files I think it's too cumbersome to check each single character in each string. (Even then I would still not be sure what to do.) As far as I checked, there are no strange characters in the csv files that could raise the error.
I also tried passing error_bad_lines=False to pd.read_csv(), but the error persists.
我设法找到了产生此问题的文件。我认为它正在读取 EOF 字符。但是,我不知道如何克服这个问题。鉴于组合文件的大小,我认为检查每个字符串中的每个单个字符太麻烦了。(即便如此,我仍然不确定该怎么做。)据我检查,csv 文件中没有可能引发错误的奇怪字符。我也尝试把 error_bad_lines=False 传给 pd.read_csv(),但错误仍然存在。
My code is the following:
我的代码如下:
# -*- coding: utf-8 -*-
import pandas as pd
import os
from glob import glob

def list_files(path=os.getcwd()):
    ''' List all files in specified path '''
    list_of_files = [f for f in glob('2013-06*.csv')]
    return list_of_files

def to_hdf():
    """ Function that reads multiple csv files to HDF5 Store """
    # Defining path name
    path = 'ta_store.h5'
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)
    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        reader = pd.read_csv(f, chunksize=50000)
        # Looping over chunks and storing them in store file, node name 'ta_data'
        for chunk in reader:
            chunk.to_hdf(store, 'ta_data', mode='w', table=True)
    # Return store
    return store.select('ta_data')
    return 'Finished reading to HDF5 Store, continuing processing data.'

to_hdf()
Edit
编辑
If I go into the CSV file that raises the CParserError EOF... and manually delete all rows after the line that is causing the problem, the csv file is read properly. However all I'm deleting are blank rows anyway. The weird thing is that when I manually correct the erroneous csv files, they are loaded fine into the store individually. But when I again use a list of multiple files the 'false' files still return me errors.
如果我进入引发 CParserError EOF ... 的 CSV 文件,并手动删除导致问题的行之后的所有行,csv 文件就能被正确读取。但无论如何,我删除的都只是空白行。奇怪的是,当我手动更正出错的 csv 文件后,它们可以单独正常加载到存储中。但当我再次使用多个文件的列表时,这些"坏"文件仍然返回错误。
回答by Jeff
Making your inner loop like this will allow you to detect the 'bad' file (and investigate further):
像这样修改内部循环,可以让你检测出"坏"文件(并进一步调查):
from pandas.io import parser

def to_hdf():
    .....
    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        try:
            reader = pd.read_csv(f, chunksize=50000)
            # Looping over chunks and storing them in store file, node name 'ta_data'
            for chunk in reader:
                chunk.to_hdf(store, 'ta_data', table=True)
        except (parser.CParserError) as detail:
            print(f, detail)
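A side note for readers on newer pandas versions: the exception has since moved and is importable as pandas.errors.ParserError (the old CParserError name was kept as an alias for a while). A minimal, self-contained sketch of the same detect-the-bad-input pattern, using made-up in-memory data instead of files:

```python
import pandas as pd
from io import StringIO

# A CSV fragment with an opening quote that is never closed -- the C
# parser hits end-of-input while still inside the quoted field.
bad_csv = 'a,b,c\n1,2,"unclosed\n'

try:
    pd.read_csv(StringIO(bad_csv))
except pd.errors.ParserError as detail:
    # In a multi-file loop this is where you would log the file name
    print('bad input:', detail)
```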
回答by Selah
I had a similar problem. The line listed with the 'EOF inside string' had a string that contained within it a single quote mark. When I added the option quoting=csv.QUOTE_NONE it fixed my problem.
我有一个类似的问题。用“EOF inside string”列出的行有一个字符串,其中包含一个单引号。当我添加选项 quoting=csv.QUOTE_NONE 时,它解决了我的问题。
For example:
例如:
import csv
df = pd.read_csv(csvfile, header = None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
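To make the effect concrete, here is a small self-contained illustration (the data is made up): a field that begins with an unbalanced double quote trips the default quoting, while csv.QUOTE_NONE treats the quote as an ordinary character:

```python
import csv
from io import StringIO
import pandas as pd

# Tab-separated data where one field begins with a lone double quote
data = 'name\tval\n"unbalanced\t1\nplain\t2\n'

# Default quoting: the parser treats '"' as the start of a quoted field
# and reads until the next quote or end of input -> ParserError.
try:
    pd.read_csv(StringIO(data), delimiter='\t')
except pd.errors.ParserError as e:
    print('default quoting failed:', e)

# QUOTE_NONE: the quote is just another character, both rows parse.
df = pd.read_csv(StringIO(data), delimiter='\t', quoting=csv.QUOTE_NONE)
print(df.shape)  # (2, 2)
```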
回答by weefwefwqg3
I have the same problem, and after adding these two params to my code, the problem is gone.
我有同样的问题,将这两个参数添加到我的代码后,问题就消失了。
read_csv(...,
         quoting=3,
         error_bad_lines=False)
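For readers wondering where the magic number comes from: quoting=3 is simply the integer value of csv.QUOTE_NONE, so this is the same fix as the previous answer. (As a hedged aside for newer pandas: error_bad_lines was later deprecated in favour of on_bad_lines.) A small sketch with made-up data:

```python
import csv
from io import StringIO
import pandas as pd

# quoting=3 is just csv.QUOTE_NONE spelled as a number
print(csv.QUOTE_NONE)  # 3

# With quoting disabled, a stray quote no longer swallows the rest of the file
df = pd.read_csv(StringIO('a,b\n"x,1\ny,2\n'), quoting=3)
print(df.shape)  # (2, 2)
```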
回答by Guido
For me, the other solutions did not work and caused me quite a headache. error_bad_lines=False still gives the error C error: EOF inside string starting at line. Using a different quoting setting didn't give the desired results either, since I did not want to have quotes in my text.
对我来说,其他解决方案不起作用,让我很头疼。error_bad_lines=False 仍然给出错误 C error: EOF inside string starting at line。使用不同的 quoting 设置也没有得到预期的结果,因为我不想在我的文本中使用引号。
I realised that there was a bug in Pandas 0.20. Upgrading to version 0.21 completely solved my issue. More info about this bug, see: https://github.com/pandas-dev/pandas/issues/16559
我意识到 Pandas 0.20 中有一个错误。升级到 0.21 版完全解决了我的问题。有关此错误的更多信息,请参阅:https://github.com/pandas-dev/pandas/issues/16559
Note: this may be Windows-related as mentioned in the URL.
注意:这可能与 URL 中提到的 Windows 相关。
回答by Aman Singh
The solution is to use the parameter engine='python' in the read_csv function. The Pandas CSV parser can use two different "engines" to parse a CSV file – Python or C (C being the default).
解决方案是在 read_csv 函数中使用参数 engine='python'。Pandas CSV 解析器可以使用两种不同的"引擎"来解析 CSV 文件——Python 或 C(C 是默认值)。
pandas.read_csv(filepath, sep=',', delimiter=None,
                header='infer', names=None,
                index_col=None, usecols=None, squeeze=False,
                ..., engine=None, ...)
The Python engine is described to be “slower, but is more feature complete” in the Pandas documentation.
Python 引擎在Pandas 文档中被描述为“更慢,但功能更完整” 。
engine : {'c', 'python'}
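A quick illustration of switching engines (the data is made up): besides sometimes sidestepping C-parser errors, the pure-Python tokenizer also unlocks features the C engine lacks, such as multi-character separators:

```python
import pandas as pd
from io import StringIO

data = 'a;;b;;c\n1;;2;;3\n'

# Multi-character separators are only supported by the Python engine;
# engine='c' would reject sep=';;'.
df = pd.read_csv(StringIO(data), sep=';;', engine='python')
print(df.shape)  # (1, 3)
```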
回答by MJB
I realize this is an old question, but I wanted to share some more details on the root cause of this error and why the solution from @Selah works.
我意识到这是一个老问题,但我想分享更多有关此错误根本原因的详细信息以及@Selah 的解决方案为何有效。
From the csv.py docstring:
从 csv.py 的文档字符串:
* quoting - controls when quotes should be generated by the writer.
    It can take on any of the following module constants:

    csv.QUOTE_MINIMAL means only when required, for example, when a
        field contains either the quotechar or the delimiter
    csv.QUOTE_ALL means that quotes are always placed around fields.
    csv.QUOTE_NONNUMERIC means that quotes are always placed around
        fields which do not parse as integers or floating point
        numbers.
    csv.QUOTE_NONE means that quotes are never placed around fields.
csv.QUOTE_MINIMAL is the default value and " is the default quotechar. If somewhere in your csv file you have a quotechar it will be parsed as a string until another occurrence of the quotechar. If your file has an odd number of quotechars the last one will not be closed before reaching the EOF (end of file). Also be aware that anything between the quotechars will be parsed as a single string. Even if there are many line breaks (expected to be parsed as separate rows), it all goes into a single field of the table. So the line number that you get in the error can be misleading. To illustrate with an example, consider this:
csv.QUOTE_MINIMAL 是默认值," 是默认的 quotechar。如果您的 csv 文件中的某个地方有一个 quotechar,它将被解析为一个字符串,直到再次出现该 quotechar。如果您的文件有奇数个 quotechar,那么在到达 EOF(文件末尾)之前,最后一个不会被关闭。另请注意,quotechar 之间的任何内容都将被解析为单个字符串。即使有很多换行符(预计被解析为单独的行),它也会全部进入表的单个字段。因此,您在错误中获得的行号可能会产生误导。为了用一个例子来说明,考虑这个:
In[4]: import pandas as pd
...: from io import StringIO
...: test_csv = '''a,b,c
...: "d,e,f
...: g,h,i
...: "m,n,o
...: p,q,r
...: s,t,u
...: '''
...:
In[5]: test = StringIO(test_csv)
In[6]: pd.read_csv(test)
Out[6]:
a b c
0 d,e,f\ng,h,i\nm n o
1 p q r
2 s t u
In[7]: test_csv_2 = '''a,b,c
...: "d,e,f
...: g,h,i
...: "m,n,o
...: "p,q,r
...: s,t,u
...: '''
...: test_2 = StringIO(test_csv_2)
...:
In[8]: pd.read_csv(test_2)
Traceback (most recent call last):
...
...
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
The first string has 2 (even) quotechars. So each quotechar is closed and the csv is parsed without an error, although probably not what we expected. The other string has 3 (odd) quotechars. The last one is not closed and the EOF is reached, hence the error. But the line 2 that we get in the error message is misleading. We would expect 4, but since everything between the first and second quotechar is parsed as a string, our "p,q,r line is actually second.
第一个字符串有 2 个(偶数)quotechar,所以每个 quotechar 都被关闭,csv 解析没有错误,尽管可能不是我们所期望的。另一个字符串有 3 个(奇数)quotechar,最后一个没有关闭,到达了 EOF,因此报错。但我们在错误消息中得到的第 2 行具有误导性:我们期望是第 4 行,但由于第一个和第二个 quotechar 之间的所有内容都被解析为字符串,我们的 "p,q,r 行实际上是第二行。
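Tying this explanation back to the earlier quoting answers: if the unbalanced quotes carry no meaning in your data, disabling quoting entirely lets even the odd-quoted string parse. A sketch against the same test_csv_2 data:

```python
import csv
import pandas as pd
from io import StringIO

test_csv_2 = '''a,b,c
"d,e,f
g,h,i
"m,n,o
"p,q,r
s,t,u
'''

# With quoting disabled every '"' is an ordinary character, so all five
# data rows survive and no 'EOF inside string' error is raised.
df = pd.read_csv(StringIO(test_csv_2), quoting=csv.QUOTE_NONE)
print(df.shape)  # (5, 3)
```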
回答by Денис Кокорев
After looking up for a solution for hours, I have finally come up with a workaround.
在寻找解决方案数小时后,我终于想出了一个解决方法。
The best way to eliminate the C error: EOF inside string starting at line exception without reducing multiprocessing efficiency is to preprocess the input data (if you have such an opportunity).
在不降低多处理效率的情况下,消除 C error: EOF inside string starting at line 异常的最佳方法是预处理输入数据(如果您有这样的机会)。
Replace all of the '\n' entries in the input file with, for instance, ', ', or with any other unique symbol sequence (for example, 'aghr21*&'). Then you will be able to read_csv the data into your dataframe.
将输入文件中所有的 '\n' 替换为例如 ', ',或任何其他唯一的符号序列(例如 'aghr21*&')。然后您就可以用 read_csv 将数据读入您的数据框。
After you have read the data, you may want to replace all of your unique symbol sequences ('aghr21*&') back with '\n'.
读取数据后,您可能希望将所有唯一符号序列('aghr21*&')替换回 '\n'。
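A minimal sketch of the placeholder round-trip described above (the placeholder string and the column name are illustrative, not from the original answer):

```python
import pandas as pd

PLACEHOLDER = 'aghr21*&'  # any sequence guaranteed absent from the data

# Before the CSV is produced/read: hide the embedded newlines
raw = 'line one\nline two'
safe = raw.replace('\n', PLACEHOLDER)
assert '\n' not in safe

# After read_csv: put the newlines back, e.g. on a text column
df = pd.DataFrame({'text': [safe]})
df['text'] = df['text'].str.replace(PLACEHOLDER, '\n', regex=False)
assert df['text'][0] == raw
```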