Pandas dataframe pivot - Memory Error
Disclaimer: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/39648991/
Asked by Ulderique Demoitre
I have a dataframe df with the following structure:
val newidx Code
Idx
0 1.0 1220121127 706
1 1.0 1220121030 706
2 1.0 1620120122 565
It has 1000000 rows. In total we have 600 unique Code values and 200000 unique newidx values.
If I perform the following operation
df.pivot_table(values='val', index='newidx', columns='Code', aggfunc='max')
I get a MemoryError, but this sounds strange as the size of the resulting dataframe should be manageable: 200000x600.
How much memory does such an operation require? Is there a way to fix this memory error?
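For scale, a back-of-the-envelope estimate of the result's footprint (the intermediate copies pivot_table makes can multiply this several times over):

```python
# Rough size of the final 200000 x 600 float64 result alone.
rows, cols, bytes_per_float = 200_000, 600, 8
result_bytes = rows * cols * bytes_per_float
print(result_bytes / 1024**3)  # ~0.89 GiB for the final frame
```

So the output itself is under 1 GiB; it is the temporary copies made while pivoting that exhaust memory.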
Answered by Kartik
Try to see if this fits in your memory:
df.groupby(['newidx', 'Code'])['val'].max().unstack()
pivot_table is unfortunately very memory intensive, as it may make multiple copies of data.
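A minimal sketch (with made-up toy data) showing that the groupby/unstack route produces the same table as pivot_table:

```python
import pandas as pd

# Toy frame mimicking the question's columns; the values are invented.
df = pd.DataFrame({
    'newidx': [1220121127, 1220121030, 1620120122, 1220121127],
    'Code':   [706, 706, 565, 565],
    'val':    [1.0, 2.0, 1.0, 3.0],
})

pivoted = df.pivot_table(values='val', index='newidx', columns='Code', aggfunc='max')
grouped = df.groupby(['newidx', 'Code'])['val'].max().unstack()

print(pivoted.equals(grouped))  # the two results match cell for cell
```

The groupby version avoids pivot_table's extra bookkeeping, which is why it can succeed where pivot_table runs out of memory.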
If the groupby does not work, you will have to split your DataFrame into smaller pieces. Try not to assign multiple times. For example, if reading from csv:
df = pd.read_csv('file.csv').groupby(['newidx', 'Code'])['val'].max().unstack()
avoids multiple assignments.
Answered by mplf
I recently had a very similar problem when carrying out a merge between 4 dataframes.
What worked for me was disabling the index during the groupby, then merging.
If @Kartik's answer doesn't work, try this before chunking the DataFrame.
# as_index=False leaves (newidx, Code) as ordinary columns, so pivot the
# aggregated (much smaller) frame instead of calling unstack():
df.groupby(['newidx', 'Code'], as_index=False)['val'].max().pivot(index='newidx', columns='Code', values='val')