Pandas dataframe pivot - Memory Error

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/39648991/



python pandas dataframe

Asked by Ulderique Demoitre

I have a dataframe df with the following structure:


        val          newidx    Code
Idx                             
0       1.0      1220121127    706
1       1.0      1220121030    706
2       1.0      1620120122    565

It has 1000000 rows. In total we have 600 unique Code values and 200000 unique newidx values.


If I perform the following operation


df.pivot_table(values='val', index='newidx', columns='Code', aggfunc='max')

I get a MemoryError, but this seems strange, as the size of the resulting dataframe should be manageable: 200000 x 600.

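For a rough sense of scale (assuming float64 values), the dense result alone is already close to a gigabyte, and the intermediate copies that pivot_table builds can multiply that several times over:

n_rows, n_cols = 200000, 600
bytes_needed = n_rows * n_cols * 8      # float64: 8 bytes per cell
print(bytes_needed / 1024**3)           # ~0.89 GiB for the values alone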

How much memory does such an operation require? Is there a way to fix this memory error?


Answered by Kartik

Try to see if this fits in your memory:


df.groupby(['newidx', 'Code'])['val'].max().unstack()

pivot_table is unfortunately very memory intensive, as it may make multiple copies of the data.

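As a quick sanity check, the two approaches produce the same result; a minimal sketch on synthetic data (column names as in the question):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'val': rng.random(1000),
    'newidx': rng.integers(0, 50, 1000),
    'Code': rng.integers(0, 10, 1000),
})

pivoted = df.pivot_table(values='val', index='newidx', columns='Code', aggfunc='max')
grouped = df.groupby(['newidx', 'Code'])['val'].max().unstack()
assert pivoted.equals(grouped)  # same values, built with fewer intermediate copies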



If the groupby does not work, you will have to split your DataFrame into smaller pieces. Try not to assign multiple times. For example, if reading from csv:


df = pd.read_csv('file.csv').groupby(['newidx', 'Code'])['val'].max().unstack()

avoids multiple assignments.

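If even a single assignment is not enough, the splitting can be done with chunked reading: the max of per-chunk maxima equals the global max, so each chunk can be reduced immediately and only small partial results are kept in memory. A sketch, assuming the same file.csv:

import pandas as pd

partials = []
for chunk in pd.read_csv('file.csv', chunksize=100000):
    # Reduce each chunk to its per-group maxima before keeping it around.
    partials.append(chunk.groupby(['newidx', 'Code'])['val'].max())

# Combine the partial maxima and reshape as before.
result = pd.concat(partials).groupby(level=['newidx', 'Code']).max().unstack()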

Answered by mplf

I had a very similar problem recently when carrying out a merge between 4 dataframes.


What worked for me was disabling the index during the groupby, then merging.


If @Kartik's answer doesn't work, try this before chunking the DataFrame.


df.groupby(['newidx', 'Code'], as_index=False)['val'].max() \
  .pivot(index='newidx', columns='Code', values='val')
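With as_index=False the grouped intermediate keeps newidx and Code as ordinary columns instead of building a MultiIndex up front, and the final pivot only has to reshape the already-reduced frame.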