pandas 合并多个大型DataFrame的有效方法
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/50886269/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Efficient way to merge multiple large DataFrames
提问by imperialgendarme
Suppose I have 4 small DataFrames df1, df2, df3 and df4
假设我有 4 个小 DataFrame:df1、df2、df3 和 df4
import pandas as pd
from functools import reduce
import numpy as np
df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])
df1.columns = ['name', 'id', 'price']
df2.columns = ['name', 'id', 'price']
df3.columns = ['name', 'id', 'price']
df4.columns = ['name', 'id', 'price']
df1 = df1.rename(columns={'price':'pricepart1'})
df2 = df2.rename(columns={'price':'pricepart2'})
df3 = df3.rename(columns={'price':'pricepart3'})
df4 = df4.rename(columns={'price':'pricepart4'})
Created above are the 4 DataFrames; what I would like to do is shown in the code below.
上面创建的是那 4 个 DataFrame,我想要做的事情如下面的代码所示。
# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df4, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
# Fill na values with 'missing'
df = df.fillna('missing')
So I have achieved this for 4 DataFrames that don't have many rows and columns.
所以我已经为 4 个没有很多行和列的 DataFrame 实现了这一点。
Basically, I want to extend the above outer merge solution to MULTIPLE (48) DataFrames of size 62245 X 3:
基本上,我想把上面的外连接(outer merge)方案扩展到多个(48 个)大小为 62245 × 3 的 DataFrame:
So I came up with this solution by building on another StackOverflow answer that used a lambda with reduce:
于是我参考另一个使用 lambda 和 reduce 的 StackOverflow 答案,得出了下面这个解决方案:
from functools import reduce
import pandas as pd
import numpy as np
dfList = []

#To create the 48 DataFrames of size 62245 X 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)),
                               columns=['name', 'id', 'pricepart' + str(i + 1)]))
#The solution I came up with to extend the solution to more than 3 DataFrames
df_merged = reduce(lambda left, right: pd.merge(left, right, left_on=['name', 'id'], right_on=['name', 'id'], how='outer'), dfList).fillna('missing')
This is causing a MemoryError.
这会导致 MemoryError。
I do not know what to do to stop the kernel from dying. I've been stuck on this for two days. Some code for the EXACT merge operation that I have performed that does not cause the MemoryError, or something that gives the same result, would be really appreciated.
我不知道该怎么做才能阻止内核崩溃,已经在这个问题上卡了两天了。如果能提供与我所执行的合并操作完全等效、但不会引发 MemoryError 的代码,或者其他能得到相同结果的方案,我将不胜感激。
Also, the 3 columns in the main DataFrame (NOT the reproducible 48 DataFrames in the example) are of type int64, int64 and float64, and I'd prefer them to stay that way because of the integers and floats they represent.
另外,主 DataFrame(不是示例中可复现的那 48 个 DataFrame)的 3 列类型分别是 int64、int64 和 float64,由于它们分别表示整数和浮点数,我希望它们保持这些类型。
EDIT:
编辑:
Instead of iteratively trying to run the merge operations or using the reduce lambda functions, I have done it in groups of 2! Also, I've changed the datatype of some columns; some did not need to be float64, so I brought them down to float16. It gets very far but still ends up throwing a MemoryError.
我没有再反复地逐个运行合并操作,也没有使用 reduce 加 lambda 的方案,而是改成每次两个两个地合并!另外,我还更改了某些列的数据类型:有些列并不需要 float64,于是我把它们降到了 float16。这样能走得很远,但最终仍然抛出 MemoryError。
intermediatedfList = dfList
tempdfList = []

#Until I merge all the 48 frames two at a time, till it becomes size 2
while(len(intermediatedfList) != 2):

    #If there are an even number of DataFrames
    if len(intermediatedfList) % 2 == 0:

        #Go in steps of two
        for i in range(0, len(intermediatedfList), 2):
            #Merge the DataFrames at index i and i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))

            #Append it to this list
            tempdfList.append(df1)

        #After merging the DataFrames in intermediatedfList two at a time using an auxiliary list tempdfList,
        #set intermediatedfList equal to tempdfList, so the while loop can continue.
        intermediatedfList = tempdfList

    else:
        #If there are an odd number of DataFrames, keep the first DataFrame out
        tempdfList = [intermediatedfList[0]]

        #Go in steps of two starting from 1 instead of 0
        for i in range(1, len(intermediatedfList), 2):
            #Merge the DataFrames at index i and i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))
            tempdfList.append(df1)

        #After merging the DataFrames in intermediatedfList two at a time using an auxiliary list tempdfList,
        #set intermediatedfList equal to tempdfList, so the while loop can continue.
        intermediatedfList = tempdfList
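For the reproducible dfList above, the float16 downcast mentioned in the edit could look like the sketch below (the real data's column names and dtypes are not shown in the question, so this is only illustrative):
对于上面可复现的 dfList,编辑中提到的向 float16 的向下转换大致可以写成下面的示意代码(真实数据的列名和类型并未在问题中给出,因此这里仅作示意):

# Sketch only: downcast each pricepart column; the name/id key columns are left untouched.
for i, frame in enumerate(dfList):
    col = 'pricepart' + str(i + 1)
    # Cast the price column down to float16 (the question's real data had float64 columns).
    dfList[i][col] = frame[col].astype('float16')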
Is there any way I can optimize my code to avoid the MemoryError? I've even used an AWS instance with 192 GB of RAM (I now owe them $7 which I could've given one of y'all); that gets farther than what I've gotten locally, but it still throws a MemoryError after reducing a list of 28 DataFrames to 4.
有什么方法可以优化我的代码来避免 MemoryError 吗?我甚至用了一台 192GB 内存的 AWS 机器(我现在欠他们 7 美元,这钱本可以给你们中的某位),它比我本地能走得更远,但在把 28 个 DataFrame 的列表合并到只剩 4 个之后,仍然抛出了 MemoryError。
采纳答案by cs95
You may get some benefit from performing index-aligned concatenation using pd.concat. This should hopefully be faster and more memory efficient than an outer merge as well.
使用 pd.concat 进行基于索引对齐的拼接可能会带来一些好处。这应该也比外连接合并更快、更节省内存。
df_list = [df1, df2, ...]

# Move the join keys into the index so concat can align on them
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)

# Column-wise, index-aligned concatenation (outer join on the index by default)
df = pd.concat(df_list, axis=1)  # join='inner'
df.reset_index(inplace=True)
Alternatively, you can replace the concat (the second step) with an iterative join:
或者,您也可以把第二步的 concat 换成迭代式的 join:
from functools import reduce
df = reduce(lambda x, y: x.join(y), df_list)
This may or may not be better than the merge.
这可能比 merge 好,也可能不如 merge。
回答by jpp
You can try a simple for loop. The only memory optimization I have applied is downcasting to the most optimal int type via pd.to_numeric.
您可以尝试一个简单的 for 循环。我做的唯一内存优化,就是通过 pd.to_numeric 把数值列向下转换(downcast)为最合适的 int 类型。
I am also using a dictionary to store dataframes. This is good practice for holding a variable number of variables.
我还使用字典来存储数据帧。这是保存可变数量变量的好习惯。
import pandas as pd
dfs = {}
dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])
df = dfs[1].copy()
for i in range(2, max(dfs)+1):
    df = pd.merge(df, dfs[i].rename(columns={2: i+1}),
                  left_on=[0, 1], right_on=[0, 1], how='outer').fillna(-1)

df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, downcast='integer')
print(df)
   0  1   2   3   4   5
0  a  1  10  15  -1  -1
1  a  2  20  20  -1  -1
2  b  1   4  -1  -1  -1
3  c  1   2   2  -1  -1
4  e  2  10  -1  20  20
5  d  1  -1  -1  10  10
6  f  1  -1  -1   1  15
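As a quick check, one can confirm that the downcast took effect and inspect the resulting footprint (this check is illustrative, not part of the original answer):
作为快速验证,可以确认向下转换是否生效,并查看最终的内存占用(此检查仅作示意,并非原答案的一部分):

# Illustrative check: the pricepart columns should now be small integer dtypes,
# and the total in-memory size can be inspected directly.
print(df.dtypes)
print(df.memory_usage(deep=True).sum(), "bytes")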
You should not, as a rule, combine strings such as "missing" with numeric types, as this will turn your entire series into object type series. Here we use -1, but you may wish to use NaN with float dtype instead.
通常来说,您不应把 "missing" 这类字符串和数值类型混在一起,因为这会把整个 Series 变成 object 类型。这里我们用的是 -1,但您也可以改用 NaN 搭配 float dtype。
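A minimal sketch of that alternative, reusing the dfs dictionary from above and keeping the missing entries as NaN with a float dtype (this variant is an assumption, not code from the answer):
下面是该替代方案的一个最小示意,复用上面的 dfs 字典,把缺失值保留为 NaN 并使用 float dtype(此变体只是假设性的示例,并非原答案中的代码):

# Sketch: same iterative merge, but without fillna(-1), so gaps stay as NaN
# and the price columns end up as floats.
df_nan = dfs[1].copy()
for i in range(2, max(dfs) + 1):
    df_nan = pd.merge(df_nan, dfs[i].rename(columns={2: i + 1}),
                      left_on=[0, 1], right_on=[0, 1], how='outer')

# Optional: shrink float64 to float32 to save memory while keeping NaN support.
df_nan.iloc[:, 2:] = df_nan.iloc[:, 2:].apply(pd.to_numeric, downcast='float')
print(df_nan)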
回答by user85779
Seems like part of what dask dataframes were designed to do (out-of-memory ops with dataframes). See Best way to join two large datasets in Pandas for example code. Sorry, I'm not copying and pasting, but I don't want to seem like I am trying to take credit from the answerer in the linked entry.
这似乎正是 dask dataframe 的设计目标之一(对超出内存的 DataFrame 进行操作)。示例代码请参阅《Best way to join two large datasets in Pandas》。抱歉没有直接复制粘贴,因为我不想显得是在抢链接回答者的功劳。
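A rough sketch of what that might look like for this problem (my own assumption, not code from the linked answer; it assumes dask is installed and reuses dfList from the question, and whether it stays within memory will depend on the data):
下面是针对本问题的一个大致示意(这只是我的假设,并非链接答案中的代码;它假设已安装 dask 并复用问题中的 dfList,能否避免内存不足取决于具体数据):

# Sketch: let dask evaluate the chain of outer merges lazily,
# only materialising the final result at the end.
import dask.dataframe as dd
from functools import reduce

ddfs = [dd.from_pandas(frame, npartitions=4) for frame in dfList]
merged = reduce(lambda left, right: dd.merge(left, right, on=['name', 'id'], how='outer'),
                ddfs)
result = merged.compute()  # back to a regular pandas DataFrame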
回答by theletz
So, you have 48 dfs with 3 columns each: name, id, and a different price column for every df.
因此,您有 48 个 df,每个有 3 列:name、id,以及每个 df 各不相同的一列价格。
You don't have to use merge....
您不必非要使用 merge 不可……
Instead, if you concat all the dfs
相反,如果你连接所有的 dfs
df = pd.concat([df1,df2,df3,df4])
You will receive:
您将得到:
Out[3]:
   id name  pricepart1  pricepart2  pricepart3  pricepart4
0   1    a        10.0         NaN         NaN         NaN
1   2    a        20.0         NaN         NaN         NaN
2   1    b         4.0         NaN         NaN         NaN
3   1    c         2.0         NaN         NaN         NaN
4   2    e        10.0         NaN         NaN         NaN
0   1    a         NaN        15.0         NaN         NaN
1   2    a         NaN        20.0         NaN         NaN
2   1    c         NaN         2.0         NaN         NaN
0   1    d         NaN         NaN        10.0         NaN
1   2    e         NaN         NaN        20.0         NaN
2   1    f         NaN         NaN         1.0         NaN
0   1    d         NaN         NaN         NaN        10.0
1   2    e         NaN         NaN         NaN        20.0
2   1    f         NaN         NaN         NaN        15.0
Now you can group by name and id and take the sum:
现在您可以按名称和 ID 分组并计算总和:
df.groupby(['name','id']).sum().fillna('missing').reset_index()
If you try it with the 48 dfs, you will see that it solves the MemoryError:
如果您用那 48 个 df 来试,会看到它解决了 MemoryError 的问题:
dfList = []

#To create the 48 DataFrames of size 62245 X 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)),
                               columns=['name', 'id', 'pricepart' + str(i + 1)]))
df = pd.concat(dfList)
df.groupby(['name','id']).sum().fillna('missing').reset_index()