Efficient way to merge multiple large DataFrames in pandas

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/50886269/

Efficient way to merge multiple large DataFrames

Tags: python, pandas, dataframe, merge, out-of-memory

Asked by imperialgendarme

Suppose I have 4 small DataFrames

df1, df2, df3 and df4

import pandas as pd
from functools import reduce
import numpy as np

df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])  
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])   


df1.columns = ['name', 'id', 'price']
df2.columns = ['name', 'id', 'price']
df3.columns = ['name', 'id', 'price']    
df4.columns = ['name', 'id', 'price']   

df1 = df1.rename(columns={'price':'pricepart1'})
df2 = df2.rename(columns={'price':'pricepart2'})
df3 = df3.rename(columns={'price':'pricepart3'})
df4 = df4.rename(columns={'price':'pricepart4'})

Created above are the 4 DataFrames; what I would like to do is shown in the code below.

# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df4, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')

# Fill na values with 'missing'
df = df.fillna('missing')

So I have achieved this for 4 DataFrames that don't have many rows and columns.

Basically, I want to extend the above outer merge solution to MULTIPLE (48) DataFrames of size 62245 X 3:

So I came up with this solution, building on another StackOverflow answer that used reduce with a lambda:

from functools import reduce
import pandas as pd
import numpy as np
dfList = []

#To create the 48 DataFrames of size 62245 X 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)),
                               columns=['name', 'id', 'pricepart' + str(i + 1)]))

#The solution I came up with to extend the solution to more than 3 DataFrames
df_merged = reduce(lambda  left, right: pd.merge(left, right, left_on=['name', 'id'], right_on=['name', 'id'], how='outer'), dfList).fillna('missing')

This is causing a MemoryError.

I do not know what to do to stop the kernel from dying. I've been stuck on this for two days. Some code for the EXACT merge operation that I have performed that does not cause the MemoryError, or something that gives the same result, would be really appreciated.

Also, the 3 columns in the main DataFrame (NOT the reproducible 48 DataFrames in the example) are of type int64, int64 and float64, and I'd prefer them to stay that way because of the integers and floats they represent.

EDIT:

Instead of iteratively running the merge operations or using the reduce lambda function, I have done it in groups of 2! Also, I've changed the datatype of some columns; some did not need to be float64, so I brought them down to float16. It gets very far but still ends up throwing a MemoryError.

intermediatedfList = dfList

tempdfList = []

#Keep merging the 48 frames two at a time, until only 2 remain
while len(intermediatedfList) != 2:

    #If there is an even number of DataFrames
    if len(intermediatedfList) % 2 == 0:

        #Reset the auxiliary list for this pass (otherwise results accumulate across passes)
        tempdfList = []

        #Go in steps of two
        for i in range(0, len(intermediatedfList), 2):

            #Merge the DataFrames at index i, i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))

            #Append the merged result to the auxiliary list
            tempdfList.append(df1)

        #After merging the DataFrames in intermediatedfList two at a time into the auxiliary list tempdfList,
        #set intermediatedfList equal to tempdfList so the while loop can continue.
        intermediatedfList = tempdfList

    else:

        #If there is an odd number of DataFrames, keep the first DataFrame out
        tempdfList = [intermediatedfList[0]]

        #Go in steps of two starting from 1 instead of 0
        for i in range(1, len(intermediatedfList), 2):

            #Merge the DataFrames at index i, i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))
            tempdfList.append(df1)

        #After merging the DataFrames in intermediatedfList two at a time into the auxiliary list tempdfList,
        #set intermediatedfList equal to tempdfList so the while loop can continue.
        intermediatedfList = tempdfList
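
The float16 downcast mentioned in the edit might look roughly like this (a minimal sketch; looping over dfList and checking for float64 columns is my own illustration, not code from the question):

for df in dfList:
    for col in df.columns:
        # Cast float64 price columns down to float16 to cut memory usage
        if df[col].dtype == 'float64':
            df[col] = df[col].astype('float16')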

Is there any way I can optimize my code to avoid the MemoryError? I've even used AWS with 192 GB of RAM (I now owe them $7, which I could've given to one of y'all); that gets farther than what I get locally, but it still throws a MemoryError after reducing a list of 28 DataFrames to 4.

Accepted answer by cs95

You may get some benefit from performing index-aligned concatenation using pd.concat. This should hopefully be faster and more memory efficient than an outer merge as well.

df_list = [df1, df2, ...]
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)

df = pd.concat(df_list, axis=1) # join='inner'
df.reset_index(inplace=True)
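
One thing to note: the index-aligned concat only reproduces the outer merge if the ['name', 'id'] pairs are unique within each individual DataFrame; with duplicate key pairs, pd.concat along axis=1 may raise or misalign rows. A quick sanity check could look like this (a sketch, run after the set_index loop above):

for df in df_list:
    # Every (name, id) pair should appear at most once per frame
    assert not df.index.duplicated().any()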

Alternatively, you can replace the concat (the second step) with an iterative join:

from functools import reduce
df = reduce(lambda x, y: x.join(y), df_list)

This may or may not be better than the merge.

Answer by jpp

You can try a simple for loop. The only memory optimization I have applied is downcasting to the most optimal int type via pd.to_numeric.

I am also using a dictionary to store dataframes. This is good practice for holding a variable number of variables.

import pandas as pd

dfs = {}
dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])  
dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])   

df = dfs[1].copy()

for i in range(2, max(dfs)+1):
    df = pd.merge(df, dfs[i].rename(columns={2: i+1}),
                  left_on=[0, 1], right_on=[0, 1], how='outer').fillna(-1)
    df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, downcast='integer')

print(df)

   0  1   2   3   4   5
0  a  1  10  15  -1  -1
1  a  2  20  20  -1  -1
2  b  1   4  -1  -1  -1
3  c  1   2   2  -1  -1
4  e  2  10  -1  20  20
5  d  1  -1  -1  10  10
6  f  1  -1  -1   1  15

You should not, as a rule, combine strings such as "missing" with numeric types, as this will turn your entire series into object type series. Here we use -1, but you may wish to use NaN with float dtype instead.

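If you prefer the NaN/float route, a variant of the loop above could look like this (a sketch; it simply drops the fillna and downcasts to float instead of integer, keeping NaN for missing prices):

df = dfs[1].copy()

for i in range(2, max(dfs) + 1):
    df = pd.merge(df, dfs[i].rename(columns={2: i + 1}),
                  left_on=[0, 1], right_on=[0, 1], how='outer')
    # Keep NaN for missing values; downcast float64 to float32 where possible
    df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, downcast='float')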

Answer by user85779

Seems like part of what dask dataframes were designed to do (out-of-memory ops with dataframes). See Best way to join two large datasets in Pandas for example code. Sorry for not copying and pasting, but I don't want to seem like I am trying to take credit from the answerer in the linked entry.

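A rough sketch of what the dask route could look like for this problem (assuming dask is installed; dfList is the list of 48 frames from the question, and the partition count is arbitrary):

import dask.dataframe as dd

# Wrap each pandas DataFrame in a lazy dask DataFrame
ddfs = [dd.from_pandas(df, npartitions=4) for df in dfList]

# Chain the outer merges lazily, then materialize the result at the end
ddf = ddfs[0]
for other in ddfs[1:]:
    ddf = dd.merge(ddf, other, on=['name', 'id'], how='outer')

result = ddf.compute()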

Answer by theletz

So, you have 48 dfs with 3 columns each: name, id, and a different price column for every df.

You don't have to use merge....

Instead, if you concat all the dfs:

df = pd.concat([df1,df2,df3,df4])

You will receive:

Out[3]: 
   id name  pricepart1  pricepart2  pricepart3  pricepart4
0   1    a        10.0         NaN         NaN         NaN
1   2    a        20.0         NaN         NaN         NaN
2   1    b         4.0         NaN         NaN         NaN
3   1    c         2.0         NaN         NaN         NaN
4   2    e        10.0         NaN         NaN         NaN
0   1    a         NaN        15.0         NaN         NaN
1   2    a         NaN        20.0         NaN         NaN
2   1    c         NaN         2.0         NaN         NaN
0   1    d         NaN         NaN        10.0         NaN
1   2    e         NaN         NaN        20.0         NaN
2   1    f         NaN         NaN         1.0         NaN
0   1    d         NaN         NaN         NaN        10.0
1   2    e         NaN         NaN         NaN        20.0
2   1    f         NaN         NaN         NaN        15.0

Now you can group by name and id and take the sum:

df.groupby(['name','id']).sum().fillna('missing').reset_index()

If you try it with the 48 dfs you will see that it solves the MemoryError:

dfList = []
#To create the 48 DataFrames of size 62245 X 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)), columns=['name', 'id', 'pricepart' + str(i + 1)]))

df = pd.concat(dfList)
df.groupby(['name','id']).sum().fillna('missing').reset_index()