Python: How to compute the weighted sum of all elements in a row in pandas?
Original URL: http://stackoverflow.com/questions/18419962/
Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Asked by ask
I have a pandas DataFrame with multiple columns. I want to create a new column, weighted_sum, from the values in each row and another single-column DataFrame, weight.
weighted_sum should have the following value:
row[weighted_sum] = row[col0]*weight[0] + row[col1]*weight[1] + row[col2]*weight[2] + ...
I found the function sum(axis=1), but it doesn't let me multiply by weight.
Edit: I changed things a bit.
weight looks like this:
        0
col1  0.5
col2  0.3
col3  0.2
df looks like this:
col1  col2  col3
 1.0   2.2   3.5
 6.1   0.4   1.2
df*weight returns a DataFrame full of NaN values.
Accepted answer by Phillip Cloud
The problem is that you're multiplying a DataFrame by another DataFrame of a different shape whose row and column labels don't match, so nothing aligns and every cell becomes NaN. Here is a reproduction, followed by the solution:
# Assumes: from pandas import DataFrame, Series
In [121]: df = DataFrame([[1,2.2,3.5],[6.1,0.4,1.2]], columns=list('abc'))
In [122]: weight = DataFrame(Series([0.5, 0.3, 0.2], index=list('abc'), name=0))
In [123]: df
Out[123]:
      a     b     c
0  1.00  2.20  3.50
1  6.10  0.40  1.20
In [124]: weight
Out[124]:
      0
a  0.50
b  0.30
c  0.20
In [125]: df * weight
Out[125]:
     0    a    b    c
0  nan  nan  nan  nan
1  nan  nan  nan  nan
a  nan  nan  nan  nan
b  nan  nan  nan  nan
c  nan  nan  nan  nan
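To see why the frame-by-frame product has no matching cells, it can help to compare the labels directly. The following is a minimal sketch, not part of the original answer; the names shared_rows, shared_cols and shared_with_series are made up for illustration. It relies on the fact that arithmetic between two DataFrames aligns on both row and column labels, while arithmetic between a DataFrame and a Series aligns the Series index with the DataFrame's columns.

```python
import pandas as pd

df = pd.DataFrame([[1.0, 2.2, 3.5], [6.1, 0.4, 1.2]], columns=list("abc"))
weight = pd.DataFrame({0: [0.5, 0.3, 0.2]}, index=list("abc"))

# DataFrame * DataFrame aligns on row labels AND column labels.
# Here neither axis has a label in common, so no cell can be computed.
shared_rows = set(df.index) & set(weight.index)      # {0, 1} vs {'a', 'b', 'c'}
shared_cols = set(df.columns) & set(weight.columns)  # {'a', 'b', 'c'} vs {0}

# Selecting the single column gives a Series indexed by 'a', 'b', 'c'.
# DataFrame * Series aligns that index with df's columns, which do match.
weight_series = weight[0]
shared_with_series = set(df.columns) & set(weight_series.index)

print(shared_rows, shared_cols, shared_with_series)
```

Both intersections for the frame-by-frame case are empty, which is consistent with the all-NaN result shown above.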
You can either select the weight column, which gives a Series whose index lines up with the columns of df:
In [126]: df * weight[0]
Out[126]:
      a     b     c
0  0.50  0.66  0.70
1  3.05  0.12  0.24
In [128]: (df * weight[0]).sum(1)
Out[128]:
0    1.86
1    3.41
dtype: float64
Or use dot to get back another DataFrame:
In [127]: df.dot(weight)
Out[127]:
      0
0  1.86
1  3.41
To bring it all together:
In [130]: df['weighted_sum'] = df.dot(weight)
In [131]: df
Out[131]:
      a     b     c  weighted_sum
0  1.00  2.20  3.50          1.86
1  6.10  0.40  1.20          3.41
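The two approaches can also be written as a standalone script. This is a sketch under one assumption that differs from the session above: weight is kept as a Series rather than a one-column DataFrame, so that dot returns a Series that can be assigned as a column directly. The variable names by_multiply, by_dot and result are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.2, 3.5], [6.1, 0.4, 1.2]], columns=list("abc"))
# Assumption: weights stored as a Series indexed by column name.
weight = pd.Series([0.5, 0.3, 0.2], index=list("abc"))

# Method 1: element-wise multiply (aligned on column labels), then sum each row.
by_multiply = (df * weight).sum(axis=1)

# Method 2: matrix-vector product; DataFrame.dot with a Series returns a Series.
by_dot = df.dot(weight)

# Build the final frame without mutating df.
result = df.assign(weighted_sum=by_dot)
print(result)
```

Using assign leaves df unchanged, which avoids accidentally including weighted_sum in a later recomputation of the sum.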
Here are the timeit results for each method, using a larger DataFrame.
# Assumes: from numpy.random import randn
In [145]: df = DataFrame(randn(10000000, 3), columns=list('abc'))
In [146]: weight = DataFrame(Series([0.5, 0.3, 0.2], index=list('abc'), name=0))
In [147]: timeit df.dot(weight)
10 loops, best of 3: 57.5 ms per loop
In [148]: timeit (df * weight[0]).sum(1)
10 loops, best of 3: 125 ms per loop
For a wide DataFrame:
In [162]: df = DataFrame(randn(10000, 1000))
In [163]: weight = DataFrame(randn(1000, 1))
In [164]: timeit df.dot(weight)
100 loops, best of 3: 5.14 ms per loop
In [165]: timeit (df * weight[0]).sum(1)
10 loops, best of 3: 41.8 ms per loop
So, dot is faster and more readable.
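The comparison can be repeated outside IPython with the standard timeit module. The sketch below is not the original benchmark: the frame size, the random seed and the helper names with_dot and with_multiply_sum are assumptions chosen to keep the run short, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a smaller frame than in the answer so the script finishes quickly.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100_000, 3)), columns=list("abc"))
weight = pd.Series([0.5, 0.3, 0.2], index=list("abc"))


def with_dot():
    return df.dot(weight)


def with_multiply_sum():
    return (df * weight).sum(axis=1)


# Check that both approaches agree before comparing their speed.
same_result = np.allclose(with_dot(), with_multiply_sum())

# Best of 3 repeats, 10 calls per repeat.
dot_seconds = min(timeit.repeat(with_dot, number=10, repeat=3))
mul_seconds = min(timeit.repeat(with_multiply_sum, number=10, repeat=3))

print(f"same result: {same_result}")
print(f"dot:              {dot_seconds / 10 * 1e3:.2f} ms per call")
print(f"multiply and sum: {mul_seconds / 10 * 1e3:.2f} ms per call")
```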
NOTE: If any of your data contain NaNs, you should not use dot; use the multiply-and-sum method instead. dot cannot handle NaNs since it is just a thin wrapper around numpy.dot() (which doesn't handle NaNs): a single NaN in a row makes that row's result NaN, whereas sum(1) skips NaNs by default.
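The difference described in the note can be checked on a small frame with one missing value. This is an illustrative sketch; the data and the names via_dot, via_sum, strict and filled are made up for the example. Note that skipping a NaN treats the missing value as contributing zero, which may or may not be what you want.

```python
import numpy as np
import pandas as pd

# Row 0 has a missing value in column 'b'.
df = pd.DataFrame([[1.0, np.nan, 3.5], [6.1, 0.4, 1.2]], columns=list("abc"))
weight = pd.Series([0.5, 0.3, 0.2], index=list("abc"))

# dot propagates the NaN: the whole weighted sum for row 0 is NaN.
via_dot = df.dot(weight)

# sum skips NaN by default (skipna=True): row 0 is 1.0*0.5 + 3.5*0.2.
via_sum = (df * weight).sum(axis=1)

# Passing skipna=False makes multiply-and-sum propagate NaN as well.
strict = (df * weight).sum(axis=1, skipna=False)

# Filling missing values with 0 first lets dot match the default sum behaviour.
filled = df.fillna(0).dot(weight)

print(via_dot, via_sum, strict, filled, sep="\n")
```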
Answer by Andy Hayden
Assuming weights is a Series holding one weight per column, you can just multiply and take the sum:
In [11]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
In [12]: weights = pd.Series([7, 8, 9], index=['a', 'b', 'c'])
In [13]: (df * weights)
Out[13]:
    a   b   c
0   7  16  27
1  28  40  54
In [14]: (df * weights).sum(1)
Out[14]:
0     50
1    122
dtype: int64
The benefit of this approach is that it takes care of columns you don't want to weight: a column with no entry in weights becomes NaN after the multiplication, and sum(1) then skips it:
In [21]: weights = pd.Series([7, 8], index=['a', 'b'])
In [22]: (df * weights)
Out[22]:
    a   b   c
0   7  16 NaN
1  28  40 NaN
In [23]: (df * weights).sum(1)
Out[23]:
0    23
1    68
dtype: float64
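If you would rather keep an integer result, or use dot when only some columns carry a weight, one possible variation (not from the answer above) is to select the weighted columns explicitly before multiplying. The names skipping and selected are illustrative. As far as I know, calling df.dot(weights) directly raises an error here because the labels of weights do not cover all columns of df, so the sketch avoids that call.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
weights = pd.Series([7, 8], index=["a", "b"])  # no weight for column 'c'

# Approach from the answer: 'c' becomes NaN and is skipped by sum.
# The NaN forces the result to float64.
skipping = (df * weights).sum(axis=1)

# Variation: keep only the columns that have a weight, then use dot.
# No NaN is ever created, so the integer dtype is preserved.
selected = df[weights.index].dot(weights)

print(skipping)
print(selected)
```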