Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/44729727/

Time: 2020-09-14 03:52:31  Source: igfitidea

Pandas - Slice Large Dataframe in Chunks

Tags: python, pandas, dataframe, slice

Asked by Walt Reed

I have a large dataframe (>3 million rows) that I'm trying to pass through a function (the one below is greatly simplified), and I keep getting a MemoryError.

I think I'm passing too large a dataframe into the function, so I'm trying to:

1) Slice the dataframe into smaller chunks (preferably sliced by AcctName)

2) Pass the dataframe into the function

3) Concatenate the dataframes back into one large dataframe

def trans_times_2(df):
    df['Double_Transaction'] = df['Transaction'] * 2

large_df 
AcctName   Timestamp    Transaction
ABC        12/1         12.12
ABC        12/2         20.89
ABC        12/3         51.93    
DEF        12/2         13.12
DEF        12/8          9.93
DEF        12/9         92.09
GHI        12/1         14.33
GHI        12/6         21.99
GHI        12/12        98.81

I know that my function works properly, since it will work on a smaller dataframe (e.g. 40,000 rows). I tried the following, but I was unsuccessful with concatenating the small dataframes back into one large dataframe.

def split_df(df):
    new_df = []
    AcctNames = df.AcctName.unique()
    DataFrameDict = {elem: pd.DataFrame for elem in AcctNames}
    key_list = [k for k in DataFrameDict.keys()]
    new_df = []
    for key in DataFrameDict.keys():
        DataFrameDict[key] = df[:][df.AcctNames == key]
        trans_times_2(DataFrameDict[key])
    rejoined_df = pd.concat(new_df)

How I envision the dataframes being split:

df1
AcctName   Timestamp    Transaction  Double_Transaction
ABC        12/1         12.12        24.24
ABC        12/2         20.89        41.78
ABC        12/3         51.93        103.86

df2
AcctName   Timestamp    Transaction  Double_Transaction
DEF        12/2         13.12        26.24
DEF        12/8          9.93        19.86
DEF        12/9         92.09        184.18

df3
AcctName   Timestamp    Transaction  Double_Transaction
GHI        12/1         14.33        28.66
GHI        12/6         21.99        43.98
GHI        12/12        98.81        197.62

Answered by Scott Boston

You can use a list comprehension to split your dataframe into smaller dataframes contained in a list.

n = 200000  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]
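
To make the slicing concrete, here is a minimal sketch on a toy dataframe. The data and the small chunk size are assumptions chosen only so the result is easy to inspect; the slicing expression is the same as above.

```python
import pandas as pd

# Toy stand-in for the large dataframe (hypothetical data).
df = pd.DataFrame({"Transaction": range(10)})

n = 4  # chunk row size, kept small here for illustration
list_df = [df[i:i + n] for i in range(0, df.shape[0], n)]

# The last chunk holds whatever rows are left over.
chunk_sizes = [len(chunk) for chunk in list_df]
print(chunk_sizes)  # [4, 4, 2]
```

Because the slices are positional, every row lands in exactly one chunk, and only the final chunk can be shorter than n.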

You can access the chunks with:

list_df[0]
list_df[1]
etc...

Then you can assemble the chunks back into one dataframe using pd.concat.
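
The full round trip (split, process each chunk, reassemble) might look like the sketch below. It assumes a variant of trans_times_2 that returns a new dataframe rather than modifying its argument in place; that change is not in the question, but it makes the processed chunks easy to collect in a list for pd.concat. The sample data is taken from the question.

```python
import pandas as pd

def trans_times_2(df):
    # Assumed variant of the question's function: returns a new frame
    # with the extra column instead of mutating the input.
    return df.assign(Double_Transaction=df["Transaction"] * 2)

large_df = pd.DataFrame({
    "AcctName": ["ABC", "ABC", "ABC", "DEF", "DEF", "DEF", "GHI", "GHI", "GHI"],
    "Timestamp": ["12/1", "12/2", "12/3", "12/2", "12/8", "12/9", "12/1", "12/6", "12/12"],
    "Transaction": [12.12, 20.89, 51.93, 13.12, 9.93, 92.09, 14.33, 21.99, 98.81],
})

n = 4  # chunk row size; a real run would use something like 200000
list_df = [large_df[i:i + n] for i in range(0, large_df.shape[0], n)]

# Process each chunk, keep the results, then stitch them together.
processed = [trans_times_2(chunk) for chunk in list_df]
rejoined_df = pd.concat(processed)
```

Each chunk keeps its original index labels, so rejoined_df has the same row order and index as large_df, plus the new column.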

By AcctName

list_df = []

# groupby yields (group key, sub-dataframe) pairs, one per AcctName
for name, group in df.groupby('AcctName'):
    list_df.append(group)
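
Put together with the processing step, the per-account version could be sketched as follows. As before, the returning variant of trans_times_2 and the toy data are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

def trans_times_2(df):
    # Assumed variant that returns a new frame instead of mutating the input.
    return df.assign(Double_Transaction=df["Transaction"] * 2)

# Hypothetical data with the accounts deliberately interleaved.
large_df = pd.DataFrame({
    "AcctName": ["ABC", "DEF", "ABC", "GHI", "DEF"],
    "Transaction": [12.12, 13.12, 20.89, 14.33, 9.93],
})

# One processed dataframe per account, in sorted key order.
list_df = [trans_times_2(group) for _, group in large_df.groupby("AcctName")]
rejoined_df = pd.concat(list_df)

# Rows come back grouped by account; sort_index restores the original order.
restored_df = rejoined_df.sort_index()
```

Note that groupby skips rows whose key is missing by default, so if AcctName can contain NaN those rows would not appear in any group.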