Python Pandas to_pickle cannot pickle large dataframes

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/29547522/
Asked by Joseph Roxas
I have a dataframe "DF" with 500,000 rows. Here are the data types per column:
ID int64
time datetime64[ns]
data object
Each entry in the "data" column is an array with size = [5, 500].
When I try to save this dataframe using
DF.to_pickle("my_filename.pkl")
it returns the following error:
12 """
13 with open(path, 'wb') as f:
---> 14 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
I also tried this method, but I get the same error:
import pickle
with open('my_filename.pkl', 'wb') as f:
pickle.dump(DF, f)
I tried saving only 10 rows of this dataframe:
DF.head(10).to_pickle('test_save.pkl')
and there is no error at all. So it can save a small DF, but not a large one.
I am using Python 3 with IPython Notebook 3 on a Mac.
Please help me solve this problem. I really need to save this DF to a pickle file. I cannot find a solution on the internet.
Accepted answer by Yupsiree
Probably not the answer you were hoping for, but this is what I did:
Split the dataframe into smaller chunks using np.array_split (NumPy functions are not guaranteed to work on dataframes, but this one does now, although there used to be a bug affecting it).
Then pickle the smaller dataframes.
When you unpickle them, use DataFrame.append or pandas.concat to glue everything back together.
I agree it is a fudge and suboptimal. If anyone can suggest a "proper" answer I'd be interested in seeing it, but I suspect it is as simple as dataframes not being meant to grow beyond a certain size.
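A minimal sketch of this approach (the chunk count and file names are arbitrary placeholders, and DF is the dataframe from the question):

import numpy as np
import pandas as pd

# Split DF into 10 roughly equal pieces and pickle each piece separately.
chunks = np.array_split(DF, 10)
for i, chunk in enumerate(chunks):
    chunk.to_pickle('my_filename_{}.pkl'.format(i))

# Later, read the pieces back and concatenate them into one dataframe.
DF = pd.concat(pd.read_pickle('my_filename_{}.pkl'.format(i)) for i in range(10))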
Answer by volodymyr
Until there is a fix somewhere on the pickle/pandas side of things, I'd say a better option is to use an alternative IO backend. HDF is suitable for large datasets (GBs), so you don't need to add extra split/combine logic.
df.to_hdf('my_filename.hdf', 'mydata', mode='w')
df = pd.read_hdf('my_filename.hdf', 'mydata')
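Note that pandas' HDF functions rely on the optional PyTables package; a quick, purely illustrative sanity check that it is installed:

import tables  # raises ImportError if PyTables is missing
print(tables.__version__)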
Answer by PGorshenin
Try to use compression. It worked for me.
data_df.to_pickle('data_df.pickle.gzde', compression='gzip')
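For completeness, the compressed pickle can be read back with pandas.read_pickle, e.g. using the same placeholder file name as above:

import pandas as pd

data_df = pd.read_pickle('data_df.pickle.gzde', compression='gzip')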
Answer by user3843986
I ran into this same issue and traced the cause to a memory problem. According to this resource, it is usually not actually caused by the memory itself, but by too many resources being moved into swap space. I was able to save the large pandas file by disabling swap altogether with the command (provided in that link):
swapoff -a

