Original question: http://stackoverflow.com/questions/36412453/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
efficiently merge multiple dataframes in pandas
Asked by Cmdt.Ed
What is the most efficient way to merge multiple data frames (i.e., more than 2) in pandas? There are a few answers:
- pandas joining multiple dataframes on columns
- Pandas left outer join multiple dataframes on multiple columns
but these all involve multiple joins. If I have N data frames these would require N-1 joins.
If I weren't using pandas, another solution would be to put everything into a hash table keyed on the common index and build the final result from it. I believe this is essentially what a hash join does in SQL. Is there something like that in pandas?
If not, would it be more efficient to just create a new data frame with the common index and pass it the raw data from each data frame? It seems like that would at least prevent you from creating a new data frame in each of the N-1 joins.
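The hash-table idea described above can be sketched in plain Python. This only illustrates the approach under discussion and is not how pandas implements joins; the function name `hash_merge` and the dict-of-rows input layout are assumptions made for the example:

```python
from collections import defaultdict


def hash_merge(tables):
    """Merge rows from several tables on a shared key in a single pass.

    `tables` is a list of {key: {column: value}} mappings (hypothetical layout).
    """
    merged = defaultdict(dict)
    for table in tables:
        for key, row in table.items():
            # Every key maps to one bucket that accumulates columns
            # from each table in turn.
            merged[key].update(row)
    return dict(merged)


# Three small hypothetical tables sharing the keys 0 and 1.
t1 = {0: {"a": 5, "b": 2}, 1: {"a": 6, "b": 7}}
t2 = {0: {"c": 5, "d": 7}, 1: {"c": 5, "d": 5}}
t3 = {0: {"e": 8, "f": 1}, 1: {"e": 0, "f": 9}}

result = hash_merge([t1, t2, t3])
print(result[0])
```

Each table is visited once, so no intermediate merged table is built per pair of inputs.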
Thanks.
Answered by MaxU
If you can join your data frames by index, you can do it in a single chain:
df1.join(df2).join(df3).join(df4)
Example:
In [187]: df1
Out[187]:
   a  b
0  5  2
1  6  7
2  6  5
3  1  6
4  0  2
In [188]: df2
Out[188]:
   c  d
0  5  7
1  5  5
2  2  4
3  4  3
4  9  0
In [189]: df3
Out[189]:
   e  f
0  8  1
1  0  9
2  4  5
3  3  9
4  9  5
In [190]: df1.join(df2).join(df3)
Out[190]:
   a  b  c  d  e  f
0  5  2  5  7  8  1
1  6  7  5  5  0  9
2  6  5  2  4  4  5
3  1  6  4  3  3  9
4  0  2  9  0  9  5
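For an arbitrary number of frames, the chain can be written without spelling out each call. A minimal sketch, assuming every frame shares the same unique index and has distinct column names; the sample values reproduce df1 to df3 from the example, and the list name `frames` is introduced here for illustration:

```python
from functools import reduce

import pandas as pd

# Sample frames matching the example in the answer.
df1 = pd.DataFrame({"a": [5, 6, 6, 1, 0], "b": [2, 7, 5, 6, 2]})
df2 = pd.DataFrame({"c": [5, 5, 2, 4, 9], "d": [7, 5, 4, 3, 0]})
df3 = pd.DataFrame({"e": [8, 0, 4, 3, 9], "f": [1, 9, 5, 9, 5]})
frames = [df1, df2, df3]

# Fold the list into one frame; equivalent to df1.join(df2).join(df3).
chained = reduce(lambda left, right: left.join(right), frames)

# DataFrame.join also accepts a list of frames and joins them on the index.
joined = frames[0].join(frames[1:])

print(chained)
```

Both forms should give the same six-column frame here; which one is faster likely depends on the pandas version and the data, so it is worth measuring on your own frames.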
This should be fast and efficient.
Alternatively, you can concatenate them:
In [191]: pd.concat([df1,df2,df3], axis=1)
Out[191]:
   a  b  c  d  e  f
0  5  2  5  7  8  1
1  6  7  5  5  0  9
2  6  5  2  4  4  5
3  1  6  4  3  3  9
4  0  2  9  0  9  5
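With axis=1, pd.concat aligns rows on the index, so it also works when the frames do not cover exactly the same index labels. A small sketch with two hypothetical frames, `left` and `right`, whose indexes only partly overlap:

```python
import pandas as pd

# Hypothetical frames with partly overlapping index labels.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"c": [10, 20, 30]}, index=[1, 2, 3])

# Default join="outer": union of the index labels, gaps filled with NaN.
outer = pd.concat([left, right], axis=1)

# join="inner": keep only the labels present in every frame.
inner = pd.concat([left, right], axis=1, join="inner")

print(outer)
print(inner)
```

The frames in the answer share an identical index, so both settings would give the same result there.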
Time comparison for 3 DataFrames with 100K rows each:
In [198]: %timeit pd.concat([df1,df2,df3], axis=1)
100 loops, best of 3: 5.67 ms per loop
In [199]: %timeit df1.join(df2).join(df3)
100 loops, best of 3: 3.93 ms per loop
So, as you can see, join is a bit faster here (3.93 ms versus 5.67 ms per loop).
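The comparison can be rerun outside IPython with the standard timeit module. A sketch, assuming random integer frames with two columns each as stand-ins, since the original 100K-row data are not shown; absolute timings and even the ranking may differ across pandas versions and machines:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000  # matches the 100K rows used in the answer

# Stand-in data: the frames actually timed in the answer are not shown.
df1 = pd.DataFrame(rng.integers(0, 10, size=(n_rows, 2)), columns=["a", "b"])
df2 = pd.DataFrame(rng.integers(0, 10, size=(n_rows, 2)), columns=["c", "d"])
df3 = pd.DataFrame(rng.integers(0, 10, size=(n_rows, 2)), columns=["e", "f"])

runs = 20
t_concat = timeit.timeit(lambda: pd.concat([df1, df2, df3], axis=1), number=runs)
t_join = timeit.timeit(lambda: df1.join(df2).join(df3), number=runs)

print(f"concat: {t_concat / runs * 1e3:.2f} ms per call")
print(f"join:   {t_join / runs * 1e3:.2f} ms per call")
```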