How to create a pivot table on extremely large dataframes in Pandas
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/29439589/
Asked by TraxusIV
I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.
How can I do a pivot on data this large with a limited amount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that sum(wawa) looks at, it doesn't show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os

pd.set_option('io.hdf.default_format','table')

# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid'] = [1,2,3,1,2,3,1,2,3]  # evenly distributing shipmentid values for testing purposes
    frame['qty'] = np.random.randint(1,5,9)    # random quantity is ok for this test
    frame['catid'] = np.random.randint(1,5,9)  # random category is ok for this test
    return frame

def pivotSegment(segmentNumber,passedFrame):
    segmentSize = 3  # take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)]  # slice the input DF

    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1,5+1)
    span['shipmentid'] = 1
    span['qty'] = 0

    frame = frame.append(span)

    return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \
                             aggfunc='sum',fill_value=0).reset_index()

def createStore():
    store = pd.HDFStore('testdata.h5')
    return store

segMin = 0
segMax = 4

store = createStore()
frame = loadFrame()

print('Printing Frame')
print(frame)
print(frame.info())

for i in range(segMin,segMax):
    segment = pivotSegment(i,frame)
    store.append('data',frame[(i*3):(i*3 + 3)])
    store.append('pivotedData',segment)

print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])

print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')

print('$$$')
for df in store.select('pivotedData',chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('$$$')

store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))

print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])

store.close()
os.remove('testdata.h5')
print('closed')
Accepted answer by Andy Hayden
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5')

for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
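For instance, a minimal self-contained sketch of this pattern might look like the following (the toy data, the 'cat_N' column names, and the chunk size are illustrative assumptions, not part of the original answer):

import numpy as np
import pandas as pd

# toy data standing in for the 60-million-row dataset
raw = pd.DataFrame({'shipmentid': np.repeat(np.arange(1, 7), 3),
                    'catid': np.tile(np.arange(1, 4), 6),
                    'qty': np.random.randint(1, 5, 18)})

all_cats = ['cat_%d' % c for c in range(1, 4)]  # every category column we expect

store = pd.HDFStore('store.h5')
chunksize = 6
for start in range(0, len(raw), chunksize):
    chunk = raw.iloc[start:start + chunksize]
    pivoted = chunk.pivot_table('qty', index='shipmentid',
                                columns='catid', aggfunc='sum', fill_value=0)
    pivoted.columns = ['cat_%d' % c for c in pivoted.columns]
    # give every appended chunk an identical column layout and dtype
    pivoted = pivoted.reindex(columns=all_cats, fill_value=0).astype('int64')
    store.append('df', pivoted)  # table format, accumulated on disk

Only one chunk's pivot is ever held in memory; the full pivoted table accumulates in the HDF5 file.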
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
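A hedged example, continuing the sketch above (shipmentid is the stored index of key 'df'; the names and the cutoff value are illustrative):

subset = store.select('df', where='index <= 3')  # only rows whose shipmentid is <= 3
for piece in store.select('df', chunksize=2):    # or stream the table back in pieces
    print(piece)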
Aside: You should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively, since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))
In Python 3 you must import reduce from functools.
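That is:

from functools import reduce  # built in on Python 2, moved to functools in Python 3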
Perhaps it's more pythonic/readable to write this as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks) # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor, or if there are a large number of new groups, then it may be preferable to start res as zeros of the correct size (by getting the unique group keys, e.g. by looping through the chunks), and then add in place.
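A rough sketch of that idea, continuing the snippets above (same open store, key 'df', shipmentid as the stored index; the names and chunksize are illustrative, and it makes two passes over the store):

# first pass: collect the unique group keys (shipmentids) and the result columns
keys, cols = set(), None
for df in store.select('df', chunksize=50000):
    summed = df.groupby(level=0).sum()
    keys.update(summed.index)
    cols = summed.columns

# second pass: accumulate into a preallocated frame of zeros, adding in place
res = pd.DataFrame(0, index=sorted(keys), columns=cols)
for df in store.select('df', chunksize=50000):
    summed = df.groupby(level=0).sum()
    res.loc[summed.index] += summed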

