pandas unable to read from large StringIO object

Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and credit the original authors (not me). Original question: http://stackoverflow.com/questions/24562869/

Tags: python, csv, pandas, stringio, cstringio

Asked by castle-bravo

I'm using pandas to manage a large array of 8-bit integers. These integers are included as space-delimited elements of a column in a comma-delimited CSV file, and the array size is about 10000x10000.

Pandas is able to quickly read the comma-delimited data from the first few columns as a DataFrame, and also quickly store the space-delimited strings in another DataFrame with minimal hassle. The trouble comes when I try to transform the table from a single column of space-delimited strings into a DataFrame of 8-bit integers.

I have tried the following:

intdata = pd.DataFrame(strdata.columnname.str.split().tolist(), dtype='uint8')

But the memory usage is unbearable - 10MB worth of integers consumes 2GB of memory. I'm told that it's a limitation of the language and there's nothing I can do about it in this case.

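An aside that is not in the original question: one memory-friendlier route is to let NumPy parse the space-delimited text in a single pass instead of building a per-token Python list with str.split().tolist(). A minimal sketch, assuming strdata holds one string column named columnname as above:

# Hypothetical sketch (not from the original post): parse the whole
# column with np.fromstring in text mode, which avoids creating
# millions of small Python string objects for the individual tokens.
flat = np.fromstring(' '.join(strdata.columnname), dtype='uint8', sep=' ')
intdata = pd.DataFrame(flat.reshape(len(strdata), -1))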

As a possible workaround, I was advised to save the string data to a CSV file and then reload the CSV file as a DataFrame of space-delimited integers. This works well, but to avoid the slowdown that comes from writing to disk, I tried writing to a StringIO object.

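For reference, the disk-based version of that workaround looks roughly like this (a sketch; the filename is a placeholder, and strdata is again assumed to hold one column of space-delimited strings):

# Sketch of the suggested disk round trip: dump the strings to a CSV
# file, then let read_csv split each line into uint8 columns.
strdata.to_csv('strings.csv', header=False, index=False)
intdata = pd.read_csv('strings.csv', sep=' ', header=None, dtype='uint8')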

Here's a minimal non-working example:

import numpy as np
import pandas as pd
from cStringIO import StringIO

a = np.random.randint(0,256,(10000,10000)).astype('uint8')
b = pd.DataFrame(a)
c = StringIO()
b.to_csv(c, delimiter=' ', header=False, index=False)
d = pd.io.parsers.read_csv(c, delimiter=' ', header=None, dtype='uint8')

Which yields the following error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 443, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 228, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 533, in __init__
    self._make_engine(self.engine)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 670, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1032, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "parser.pyx", line 486, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4494)
ValueError: No columns to parse from file

Which is puzzling, because if I run the exact same code with 'c.csv' instead of c, the code works perfectly. Also, if I use the following snippet:

file = open('c.csv', 'w')
file.write(c.getvalue())

The CSV file gets saved without any problems, so writing to the StringIO object is not the issue.

It is possible that I need to replace c with c.getvalue() in the read_csv line, but when I do that, the interpreter tries to print the contents of c in the terminal! Surely there is a way to work around this.

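(A side note that is not in the original post: read_csv expects a filename or a file-like object rather than a raw string, so one way to use getvalue() is to wrap the text in a fresh StringIO, as sketched below. The seek-based fix in the answer is simpler, though.)

# Sketch: wrap the accumulated text in a new read-only StringIO so that
# read_csv parses from the beginning. No sep=' ' here on purpose: as the
# answer explains, the text in c is actually comma-delimited because
# to_csv silently ignored delimiter=' '.
d = pd.read_csv(StringIO(c.getvalue()), header=None, dtype='uint8')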

Thanks in advance for the help.

Answered by DSM

There are two issues here, one fundamental and one you simply haven't come across yet. :^)

First, after you write to c, you're at the end of the (virtual) file. You need to seek back to the start. We'll use a smaller grid as an example:

>>> a = np.random.randint(0,256,(10,10)).astype('uint8')
>>> b = pd.DataFrame(a)
>>> c = StringIO()
>>> b.to_csv(c, delimiter=' ', header=False, index=False)
>>> next(c)
Traceback (most recent call last):
  File "<ipython-input-57-73b012f9653f>", line 1, in <module>
    next(c)
StopIteration

which generates the "no columns" error. If we seek first, though:

>>> c.seek(0)
>>> next(c)
'103,3,171,239,150,35,224,190,225,57\n'

But now you'll notice the second issue: commas? I thought we requested space delimiters? But to_csv only accepts sep, not delimiter. Seems to me it should either accept it or object that it doesn't, but silently ignoring it feels like a bug. Anyway, if we use sep (or delim_whitespace=True):

>>> a = np.random.randint(0,256,(10,10)).astype('uint8')
>>> b = pd.DataFrame(a)
>>> c = StringIO()
>>> b.to_csv(c, sep=' ', header=False, index=False)
>>> c.seek(0)
>>> d = pd.read_csv(c, sep=' ', header=None, dtype='uint8')
>>> d
     0    1    2    3    4    5    6    7    8    9
0  209   65  218  242  178  213  187   63  137  145
1  161  222   50   92  157   31   49   62  218   30
2  182  255  146  249  115   91  160   53  200  252
3  192  116   87   85  164   46  192  228  104  113
4   89  137  142  188  183  199  106  128  110    1
5  208  140  116   50   66  208  116   72  158  169
6   50  221   82  235   16   31  222    9   95  111
7   88   36  204   96  186  205  210  223   22  235
8  136  221   98  191   31  174   83  208  226  150
9   62   93  168  181   26  128  116   92   68  153
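
Putting the two fixes together (sep=' ' instead of delimiter, plus a seek(0) before reading), the questioner's full-size example becomes, roughly:

import numpy as np
import pandas as pd
from cStringIO import StringIO

a = np.random.randint(0, 256, (10000, 10000)).astype('uint8')
b = pd.DataFrame(a)

c = StringIO()
b.to_csv(c, sep=' ', header=False, index=False)  # sep, not delimiter
c.seek(0)                                        # rewind before reading
d = pd.read_csv(c, sep=' ', header=None, dtype='uint8')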