Merge DataFrames in Pandas using the mean

Question

Asked by Martin Preusse

I have a set of DataFrames with numeric values and partly overlapping indices. I would like to merge them and take the mean if an index occurs in more than one DataFrame.

import pandas as pd
import numpy as np

df1 = pd.DataFrame([1,2,3], columns=['col'], index=['a','b','c'])
df2 = pd.DataFrame([4,5,6], columns=['col'], index=['b','c','d'])

This gives me two DataFrames:

   col            col
a    1        b     4
b    2        c     5
c    3        d     6

Now I would like to merge the DataFrames and take the mean for each index (where applicable, i.e. if it occurs more than once).

It should look like this:

   col
a    1
b    3
c    4
d    6

Can I do this with some advanced merging/joining?

Answer 1

Accepted answer by Roman Pekar

Something like this:

df3 = pd.concat((df1, df2))
df3.groupby(df3.index).mean()

#    col
# a    1
# b    3
# c    4
# d    6

Or the other way around, as in @unutbu's answer:

pd.concat((df1, df2), axis=1).mean(axis=1)
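
To make the difference between the two one-liners easier to see, here is a minimal self-contained sketch that runs both on the frames from the question. The variable names (stacked, by_group, side_by_side, by_row) are only illustrative. Note that the first approach returns a DataFrame, while the second returns a Series.

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, 3], columns=['col'], index=['a', 'b', 'c'])
df2 = pd.DataFrame([4, 5, 6], columns=['col'], index=['b', 'c', 'd'])

# Approach 1: stack the rows, then average the rows that share an index label.
stacked = pd.concat((df1, df2))
by_group = stacked.groupby(stacked.index).mean()  # DataFrame with column 'col'

# Approach 2: align the frames side by side on the index (missing entries
# become NaN), then average across the columns; mean() skips NaN by default.
side_by_side = pd.concat((df1, df2), axis=1)
by_row = side_by_side.mean(axis=1)  # Series

print(by_group['col'].tolist())  # [1.0, 3.0, 4.0, 6.0]
print(by_row.tolist())           # [1.0, 3.0, 4.0, 6.0]
```

The groupby version averages every column of the stacked frame separately, so it also carries over to frames with more than one column. The axis=1 version averages across all columns of the combined frame, so it only gives the wanted result for single-column frames like these.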

Answer 2

Answer by unutbu

In [22]: pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)
Out[23]: 
a    1
b    3
c    4
d    6
dtype: float64

Regarding Roman's question, I find IPython's %timeit command a convenient way to benchmark code:

In [28]: %timeit df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()
1000 loops, best of 3: 617 μs per loop

In [29]: %timeit pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)
1000 loops, best of 3: 577 μs per loop

In [39]: %timeit pd.concat((df1, df2), axis=1).mean(axis=1)
1000 loops, best of 3: 524 μs per loop

In this case, pd.concat(...).mean(...) turns out to be a bit faster. But really we should test bigger DataFrames to get a more meaningful benchmark.

By the way, if you do not want to install IPython, equivalent benchmarks can be run using Python's timeit module. It just takes a bit more setup. The docs have some examples showing how to do this.
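
As a rough sketch of what that setup could look like, the three variants can be wrapped in functions and passed to timeit.repeat. The frame size (10,000 rows with half of the labels shared), the random data, and the function names are assumptions made for illustration, not part of the original benchmark, and the timings will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger inputs: 10,000 rows each, half of the index labels shared.
n = 10_000
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'col': rng.random(n)}, index=np.arange(n))
df2 = pd.DataFrame({'col': rng.random(n)}, index=np.arange(n // 2, n // 2 + n))

def concat_groupby():
    df3 = pd.concat((df1, df2))
    return df3.groupby(df3.index).mean()

def merge_mean():
    merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
    return merged.mean(axis=1)

def concat_mean():
    return pd.concat((df1, df2), axis=1).mean(axis=1)

loops = 20
for func in (concat_groupby, merge_mean, concat_mean):
    # repeat() returns one total time per run; keep the best run and
    # divide by the loop count to get a per-call time.
    best = min(timeit.repeat(func, number=loops, repeat=3)) / loops
    print(f'{func.__name__}: {best * 1e3:.3f} ms per loop')
```

Taking the minimum over the repeats mirrors the "best of 3" figure that %timeit reported above.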

Note that if df1 or df2 were to have duplicate entries in its index, for example like this:

N = 1000
df1 = pd.DataFrame([1,2,3]*N, columns=['col'], index=['a','b','c']*N)
df2 = pd.DataFrame([4,5,6]*N, columns=['col'], index=['b','c','d']*N)

then these three answers give different results:

In [56]: df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()
Out[56]: 
   col
a    1
b    3
c    4
d    6

pd.merge probably does not give the kind of answer you want:

In [58]: len(pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1))
Out[58]: 2002000

And pd.concat((df1, df2), axis=1) raises a ValueError:

In [48]: pd.concat((df1, df2), axis=1)
ValueError: cannot reindex from a duplicate axis
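
If you still want the column-wise variant with duplicated labels, one possible workaround (a sketch, not something from the answers above) is to collapse the duplicates inside each frame first, so that every index is unique before the frames are aligned. The names u1, u2 and result are illustrative, and N is kept small here.

```python
import pandas as pd

N = 3  # small stand-in for the N = 1000 used above
df1 = pd.DataFrame([1, 2, 3] * N, columns=['col'], index=['a', 'b', 'c'] * N)
df2 = pd.DataFrame([4, 5, 6] * N, columns=['col'], index=['b', 'c', 'd'] * N)

# Collapse duplicate labels inside each frame so that every index is unique.
u1 = df1.groupby(level=0).mean()
u2 = df2.groupby(level=0).mean()

# With unique indices the frames can be aligned side by side again.
result = pd.concat((u1, u2), axis=1).mean(axis=1)
print(result.tolist())  # [1.0, 3.0, 4.0, 6.0]
```

Be aware that this computes a mean of per-frame means. It matches the concat-then-groupby result here because every label is repeated the same number of times in each frame; when the counts differ, the two can disagree, and concat followed by groupby gives the mean over all rows.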
