pandas: Efficiently merge multiple dataframes in pandas

Note: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/36412453/

efficiently merge multiple dataframes in pandas

python, pandas

Asked by Cmdt.Ed

What is the most efficient way to merge multiple data frames (i.e., more than 2) in pandas? There are a few answers:

  1. pandas joining multiple dataframes on columns
  2. Pandas left outer join multiple dataframes on multiple columns

but these all involve multiple joins. If I have N data frames these would require N-1 joins.

If I weren't using pandas, another solution would be to just put everything into a hash table based on the common index as the key and build the final version. This is basically like a hash join in SQL I believe. Is there something like that in pandas?

If not, would it be more efficient to just create a new data frame with the common index and pass it the raw data from each data frame? It seems like that would at least prevent you from creating a new data frame in each of the N-1 joins.

Thanks.

Answered by MaxU

If you can join your data frames by index, you can do it in a single chained expression:

df1.join(df2).join(df3).join(df4)

Example:

In [187]: df1
Out[187]:
   a  b
0  5  2
1  6  7
2  6  5
3  1  6
4  0  2

In [188]: df2
Out[188]:
   c  d
0  5  7
1  5  5
2  2  4
3  4  3
4  9  0

In [189]: df3
Out[189]:
   e  f
0  8  1
1  0  9
2  4  5
3  3  9
4  9  5

In [190]: df1.join(df2).join(df3)
Out[190]:
   a  b  c  d  e  f
0  5  2  5  7  8  1
1  6  7  5  5  0  9
2  6  5  2  4  4  5
3  1  6  4  3  3  9
4  0  2  9  0  9  5
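
For an arbitrary number of frames, the chain does not have to be written out by hand. The following is a minimal sketch, not part of the original answer: the `frames` list and its sample values are made up for illustration. It folds the list with `functools.reduce`, and also uses the fact that `DataFrame.join` accepts a list of frames when joining on the index.

```python
from functools import reduce

import pandas as pd

# Hypothetical sample frames that share the same default integer index.
frames = [
    pd.DataFrame({"a": [5, 6, 6], "b": [2, 7, 5]}),
    pd.DataFrame({"c": [5, 5, 2], "d": [7, 5, 4]}),
    pd.DataFrame({"e": [8, 0, 4], "f": [1, 9, 5]}),
]

# Option 1: fold the list with pairwise index joins (still N-1 joins).
chained = reduce(lambda left, right: left.join(right), frames)

# Option 2: pass the remaining frames to join() as a list in one call.
joined_list = frames[0].join(frames[1:])

print(chained)
print(joined_list)
```

Both options join on the index with the default `how="left"`, so rows are taken from the first frame in the list.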

This should be fairly fast and efficient.

Alternatively, you can concatenate them:

In [191]: pd.concat([df1,df2,df3], axis=1)
Out[191]:
   a  b  c  d  e  f
0  5  2  5  7  8  1
1  6  7  5  5  0  9
2  6  5  2  4  4  5
3  1  6  4  3  3  9
4  0  2  9  0  9  5
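
The two approaches agree here because all three frames share the same index. As a sketch for checking this outside IPython, the snippet below rebuilds the frames from the values printed above and compares the results. The extra frame `df4` is hypothetical and is only there to show one difference worth knowing: `join` defaults to a left join, while `pd.concat(..., axis=1)` defaults to an outer join on the index.

```python
import pandas as pd

# Frames rebuilt from the values shown in the example output.
df1 = pd.DataFrame({"a": [5, 6, 6, 1, 0], "b": [2, 7, 5, 6, 2]})
df2 = pd.DataFrame({"c": [5, 5, 2, 4, 9], "d": [7, 5, 4, 3, 0]})
df3 = pd.DataFrame({"e": [8, 0, 4, 3, 9], "f": [1, 9, 5, 9, 5]})

via_join = df1.join(df2).join(df3)
via_concat = pd.concat([df1, df2, df3], axis=1)

# With identical indexes both routes produce the same frame.
same = via_join.equals(via_concat)
print(same)

# Hypothetical frame whose index only partly overlaps df1's index.
df4 = pd.DataFrame({"g": [1, 2, 3]}, index=[0, 1, 9])

# join() keeps only df1's index labels (left join).
left_rows = len(df1.join(df4))
# concat() keeps the union of both indexes (outer join).
outer_rows = len(pd.concat([df1, df4], axis=1))
print(left_rows, outer_rows)
```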

Time comparison for 3 DataFrames with 100K rows each:

In [198]: %timeit pd.concat([df1,df2,df3], axis=1)
100 loops, best of 3: 5.67 ms per loop

In [199]: %timeit df1.join(df2).join(df3)
100 loops, best of 3: 3.93 ms per loop
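
To repeat the comparison outside IPython, something like the following sketch could be used. The answer does not say how the 100K-row frames were built, so random single-digit integers, the `make_frame` helper, and the run count are all assumptions. Timings will vary with the pandas version and hardware, and the ordering may not match the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000


def make_frame(cols):
    """Build a frame of random single-digit integers (assumed test data)."""
    data = rng.integers(0, 10, size=(n_rows, len(cols)))
    return pd.DataFrame(data, columns=list(cols))


df1 = make_frame(["a", "b"])
df2 = make_frame(["c", "d"])
df3 = make_frame(["e", "f"])

n_runs = 20  # arbitrary; raise it for more stable numbers

# Average seconds per call for each approach.
t_concat = timeit.timeit(
    lambda: pd.concat([df1, df2, df3], axis=1), number=n_runs
) / n_runs
t_join = timeit.timeit(
    lambda: df1.join(df2).join(df3), number=n_runs
) / n_runs

print(f"concat: {t_concat * 1e3:.2f} ms per call")
print(f"join:   {t_join * 1e3:.2f} ms per call")
```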

So, as you can see, join is a bit faster in this test.