Disclaimer: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/19490064/

Merge DataFrames in Pandas using the mean

python, merge, pandas

Asked by Martin Preusse

I have a set of DataFrames with numeric values and partly overlapping indices. I would like to merge them and take the mean if an index occurs in more than one DataFrame.

import pandas as pd
import numpy as np

df1 = pd.DataFrame([1,2,3], columns=['col'], index=['a','b','c'])
df2 = pd.DataFrame([4,5,6], columns=['col'], index=['b','c','d'])

This gives me two DataFrames:

   col            col
a    1        b     4
b    2        c     5
c    3        d     6

Now I would like to merge the DataFrames and take the mean for each index (if applicable, i.e. if it occurs more than once).

Should look like this:

    col
a     1
b     3
c     4
d     6

Can I do this with some advanced merging/joining?

Accepted answer by Roman Pekar

Something like this:

df3 = pd.concat((df1, df2))
df3.groupby(df3.index).mean()

#    col
# a    1
# b    3
# c    4
# d    6

Or the other way around, as in @unutbu's answer:

pd.concat((df1, df2), axis=1).mean(axis=1)
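
For reference, here is a minimal sketch that runs both variants on the frames from the question and checks that they agree. The variable names (stacked, by_group, side_by_side, by_row) are only illustrative. It relies on mean() skipping NaN by default, which is what makes the side-by-side version work for labels that are missing from one frame.

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, 3], columns=['col'], index=['a', 'b', 'c'])
df2 = pd.DataFrame([4, 5, 6], columns=['col'], index=['b', 'c', 'd'])

# Variant 1: stack the rows, then average the rows that share an index label.
stacked = pd.concat((df1, df2))
by_group = stacked.groupby(stacked.index).mean()   # DataFrame with column 'col'

# Variant 2: align the frames side by side on the index. Labels missing
# from one frame become NaN, which mean() skips by default (skipna=True).
side_by_side = pd.concat((df1, df2), axis=1)
by_row = side_by_side.mean(axis=1)                 # Series, not a DataFrame

# Both variants yield the same values for a, b, c, d.
assert by_group['col'].tolist() == by_row.tolist() == [1.0, 3.0, 4.0, 6.0]
```

One difference to keep in mind: the first variant returns a DataFrame and averages each column separately, while the second returns a Series and averages across all columns of the aligned frame, so the two only coincide when each frame has a single column.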

Answer by unutbu

In [22]: pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)
Out[23]: 
a    1
b    3
c    4
d    6
dtype: float64


Regarding Roman's question, I find IPython's %timeit command a convenient way to benchmark code:

In [28]: %timeit df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()
1000 loops, best of 3: 617 μs per loop

In [29]: %timeit pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)
1000 loops, best of 3: 577 μs per loop

In [39]: %timeit pd.concat((df1, df2), axis=1).mean(axis=1)
1000 loops, best of 3: 524 μs per loop

In this case, pd.concat(...).mean(...) turns out to be a bit faster. But really we should test bigger dataframes to get a more meaningful benchmark.

By the way, if you do not want to install IPython, equivalent benchmarks can be run using Python's timeit module. It just takes a bit more setup. The docs have some examples showing how to do this.
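
As a rough sketch of what that setup might look like, the three statements benchmarked with %timeit can be passed as strings to timeit.repeat, with the imports and frames in a setup string. The loop count (number = 100) and the dictionary labels are arbitrary choices for illustration, not part of the original answer, and the timings will differ from machine to machine.

```python
import timeit

# Code that runs once before each timing run (not included in the timing).
setup = """
import pandas as pd
df1 = pd.DataFrame([1, 2, 3], columns=['col'], index=['a', 'b', 'c'])
df2 = pd.DataFrame([4, 5, 6], columns=['col'], index=['b', 'c', 'd'])
"""

statements = {
    'concat + groupby': "df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()",
    'merge + mean': "pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)",
    'concat(axis=1) + mean': "pd.concat((df1, df2), axis=1).mean(axis=1)",
}

number = 100  # loops per run (arbitrary, chosen to keep the example quick)
results = {}
for label, stmt in statements.items():
    # repeat() returns the total time of each run; keep the best of 3 runs.
    best = min(timeit.repeat(stmt, setup=setup, number=number, repeat=3))
    results[label] = best / number  # seconds per loop
    print(f"{label}: {results[label] * 1e6:.0f} µs per loop")
```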



Note that if df1 or df2 were to have duplicate entries in its index, for example like this:

N = 1000
df1 = pd.DataFrame([1,2,3]*N, columns=['col'], index=['a','b','c']*N)
df2 = pd.DataFrame([4,5,6]*N, columns=['col'], index=['b','c','d']*N)

then these three answers give different results:

In [56]: df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()
Out[56]: 
   col
a    1
b    3
c    4
d    6

pd.merge probably does not give the kind of answer you want:

In [58]: len(pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1))
Out[58]: 2002000
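
The row count follows from how a merge treats duplicate keys: every row with a given label in one frame is paired with every row carrying that label in the other. A small sketch with N = 3 instead of 1000 (chosen only so the result is easy to inspect) makes the pattern visible; with N = 1000 the same formula gives 2 * 1000 + 2 * 1000 * 1000 = 2002000.

```python
import pandas as pd

N = 3  # small value for illustration; the example above uses 1000
df1 = pd.DataFrame([1, 2, 3] * N, columns=['col'], index=['a', 'b', 'c'] * N)
df2 = pd.DataFrame([4, 5, 6] * N, columns=['col'], index=['b', 'c', 'd'] * N)

merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')

# 'a' and 'd' occur in one frame only: N rows each.
# 'b' and 'c' occur N times in both frames: N * N row pairs each.
rows_per_label = merged.groupby(level=0).size()
print(rows_per_label)

assert len(merged) == 2 * N + 2 * N * N
```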

While pd.concat((df1, df2), axis=1) raises a ValueError:

In [48]: pd.concat((df1, df2), axis=1)
ValueError: cannot reindex from a duplicate axis
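
Since the question mentions a whole set of DataFrames, one way to stay on the safe side is to use the concat-and-groupby variant throughout, as it is the only one of the three that copes with duplicate labels. The sketch below wraps it in a helper; the name mean_by_index is hypothetical and not part of pandas.

```python
import pandas as pd

def mean_by_index(frames):
    """Hypothetical helper: average all rows that share an index label."""
    stacked = pd.concat(frames)
    # level=0 groups by the index itself, so duplicate labels are fine.
    return stacked.groupby(level=0).mean()

N = 1000
df1 = pd.DataFrame([1, 2, 3] * N, columns=['col'], index=['a', 'b', 'c'] * N)
df2 = pd.DataFrame([4, 5, 6] * N, columns=['col'], index=['b', 'c', 'd'] * N)

print(df1.index.is_unique)  # False: each label is repeated N times
result = mean_by_index([df1, df2])
print(result)
```

Every row counts equally in this mean, so a frame that repeats a label more often than the others carries more weight for that label. If each frame should count once per label, the frames would need to be deduplicated or averaged individually first.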