Python: How to compute the weighted sum of all elements in a row in pandas?

Notice: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/18419962/

Date: 2020-08-19 10:41:14  Source: igfitidea

How to compute weighted sum of all elements in a row in pandas?

Tags: python, pandas, dataframe, calculated-columns, weighted-average

Asked by ask

I have a pandas DataFrame with multiple columns. I want to create a new column, weighted_sum, from the values in each row and a second, single-column DataFrame, weight.

weighted_sum should have the following value:

row[weighted_sum] = row[col0]*weight[0] + row[col1]*weight[1] + row[col2]*weight[2] + ...

I found the function sum(axis=1), but it doesn't let me multiply by weight.

Edit: I changed things a bit.

weight looks like this:

     0
col1 0.5
col2 0.3
col3 0.2

df looks like this:

col1 col2 col3
1.0  2.2  3.5
6.1  0.4  1.2

df * weight returns a DataFrame full of NaN values.

Accepted answer by Phillip Cloud

The problem is that you're multiplying one frame by another frame of a different shape whose row index and column labels don't match, so no pair of labels aligns and every cell of the result is NaN. Here's the solution:

In [121]: df = DataFrame([[1,2.2,3.5],[6.1,0.4,1.2]], columns=list('abc'))

In [122]: weight = DataFrame(Series([0.5, 0.3, 0.2], index=list('abc'), name=0))

In [123]: df
Out[123]:
           a          b          c
0       1.00       2.20       3.50
1       6.10       0.40       1.20

In [124]: weight
Out[124]:
           0
a       0.50
b       0.30
c       0.20

In [125]: df * weight
Out[125]:
           0          a          b          c
0        nan        nan        nan        nan
1        nan        nan        nan        nan
a        nan        nan        nan        nan
b        nan        nan        nan        nan
c        nan        nan        nan        nan
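
To see why every cell comes back as NaN, it can help to check the label overlap directly. The following is a minimal sketch rebuilt from the session above with explicit imports; the variable names `row_overlap`, `col_overlap` and `product` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1, 2.2, 3.5], [6.1, 0.4, 1.2]], columns=list("abc"))
weight = pd.DataFrame(pd.Series([0.5, 0.3, 0.2], index=list("abc"), name=0))

# DataFrame * DataFrame aligns on BOTH the row labels and the column labels.
# Rows:    df has 0, 1          weight has 'a', 'b', 'c'
# Columns: df has 'a', 'b', 'c' weight has 0
row_overlap = df.index.intersection(weight.index)
col_overlap = df.columns.intersection(weight.columns)
print(len(row_overlap), len(col_overlap))

# With nothing in common, the result covers the union of all labels
# and every cell is missing.
product = df * weight
print(product.shape)
print(product.isna().all().all())
```

Depending on the pandas version, the multiplication may also emit a warning about sorting mixed integer and string labels; that does not change the result.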

You can either access the column:

In [126]: df * weight[0]
Out[126]:
           a          b          c
0       0.50       0.66       0.70
1       3.05       0.12       0.24

In [128]: (df * weight[0]).sum(1)
Out[128]:
0         1.86
1         3.41
dtype: float64
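
The multiply-and-sum approach works because a Series is matched against the frame's column labels. A minimal sketch of the same computation as a standalone script, using `DataFrame.mul` with an explicit axis in place of the `*` operator (the name `weights` is chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2.2, 3.5], [6.1, 0.4, 1.2]], columns=list("abc"))
weights = pd.Series([0.5, 0.3, 0.2], index=list("abc"))  # illustrative name

# Each column is scaled by the weight that carries the same label.
# This is the spelled-out form of df * weights.
scaled = df.mul(weights, axis="columns")

# Add up the scaled values across the columns of each row.
weighted_sum = scaled.sum(axis=1)
print(weighted_sum)
```

Spelling out `axis=1` in `sum` instead of passing a bare `1` is only a readability choice; both select the same axis.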

Or use dot to get back another DataFrame:

In [127]: df.dot(weight)
Out[127]:
           0
0       1.86
1       3.41

To bring it all together:

In [130]: df['weighted_sum'] = df.dot(weight)

In [131]: df
Out[131]:
           a          b          c  weighted_sum
0       1.00       2.20       3.50          1.86
1       6.10       0.40       1.20          3.41
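
For use outside an IPython session, the steps above could be put together roughly as follows. This is a sketch, not the answer's exact code: it passes the Series `weight[0]` to `dot` so the result is a Series, and the helper list `value_cols` is an assumption added so the script can be re-run after the new column exists.

```python
import pandas as pd

df = pd.DataFrame([[1, 2.2, 3.5], [6.1, 0.4, 1.2]], columns=list("abc"))
weight = pd.DataFrame(pd.Series([0.5, 0.3, 0.2], index=list("abc"), name=0))

# Columns that take part in the weighted sum (illustrative helper).
value_cols = list(weight.index)

# dot requires the frame's columns and the weights' index to hold the same
# labels, so select the value columns explicitly before multiplying.
df["weighted_sum"] = df[value_cols].dot(weight[0])
print(df)
```

Selecting `value_cols` first matters once `weighted_sum` has been added: calling `df.dot(weight)` on the whole frame would then raise a "matrices are not aligned" error because the labels no longer match.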

Here are the timeit results for each method, using a larger DataFrame.

In [145]: df = DataFrame(randn(10000000, 3), columns=list('abc'))

In [146]: weight = DataFrame(Series([0.5, 0.3, 0.2], index=list('abc'), name=0))

In [147]: timeit df.dot(weight)
10 loops, best of 3: 57.5 ms per loop

In [148]: timeit (df * weight[0]).sum(1)
10 loops, best of 3: 125 ms per loop

For a wide DataFrame:

In [162]: df = DataFrame(randn(10000, 1000))

In [163]: weight = DataFrame(randn(1000, 1))

In [164]: timeit df.dot(weight)
100 loops, best of 3: 5.14 ms per loop

In [165]: timeit (df * weight[0]).sum(1)
10 loops, best of 3: 41.8 ms per loop
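
The timings above come from IPython's timeit magic. A rough way to repeat the comparison in a plain script is sketched below with the standard-library `timeit` module; the frame is deliberately smaller, the seed and repeat count are arbitrary choices, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
df = pd.DataFrame(rng.standard_normal((100_000, 3)), columns=list("abc"))
weight = pd.DataFrame(pd.Series([0.5, 0.3, 0.2], index=list("abc"), name=0))

n_calls = 20  # arbitrary repeat count

# Total wall-clock time for n_calls runs of each method.
t_dot = timeit.timeit(lambda: df.dot(weight), number=n_calls)
t_mul = timeit.timeit(lambda: (df * weight[0]).sum(axis=1), number=n_calls)

print(f"dot:          {t_dot / n_calls * 1e3:.2f} ms per call")
print(f"multiply+sum: {t_mul / n_calls * 1e3:.2f} ms per call")
```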

So, dot is faster and more readable.

NOTE: If any of your data contain NaNs, you should not use dot; use the multiply-and-sum method instead. dot is just a thin wrapper around numpy.dot(), which does not skip NaNs, so a single NaN in a row makes that row's result NaN, whereas sum() skips NaNs by default.
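
The difference in NaN handling can be seen on a small frame. This is a minimal sketch with made-up values; the names `via_dot`, `via_sum` and `strict` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in the first row.
df = pd.DataFrame([[1.0, np.nan, 3.0], [4.0, 5.0, 6.0]], columns=list("abc"))
weights = pd.Series([0.5, 0.3, 0.2], index=list("abc"))

# dot: the NaN propagates, so the whole first row becomes NaN.
via_dot = df.dot(weights)

# multiply-and-sum: sum skips NaN by default (skipna=True),
# so the first row is computed from columns a and c only.
via_sum = (df * weights).sum(axis=1)

# Passing skipna=False makes sum propagate NaN as well.
strict = (df * weights).sum(axis=1, skipna=False)

print(via_dot)
print(via_sum)
print(strict)
```

Note that skipping a NaN is the same as treating that term as zero, which may or may not be what you want for a weighted sum.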

Answer by Andy Hayden

Assuming weights is a Series holding one weight per column, you can just multiply and then sum:

In [11]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

In [12]: weights = pd.Series([7, 8, 9], index=['a', 'b', 'c'])

In [13]: (df * weights)
Out[13]: 
    a   b   c
0   7  16  27
1  28  40  54

In [14]: (df * weights).sum(1)
Out[14]: 
0     50
1    122
dtype: int64

A benefit of this approach is that it takes care of columns you don't want to weight: they become NaN after the multiplication and are skipped by sum:

In [21]: weights = pd.Series([7, 8], index=['a', 'b'])

In [22]: (df * weights)
Out[22]: 
    a   b   c
0   7  16 NaN
1  28  40 NaN

In [23]: (df * weights).sum(1)
Out[23]: 
0    23
1    68
dtype: float64
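
The result above is float64 because the unweighted column passes through NaN. If that matters, one possible variation (an assumption on my part, not from the answer) is to give the missing columns an explicit weight of 0 with `Series.reindex`; the names `skipped`, `full_weights` and `explicit` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
weights = pd.Series([7, 8], index=["a", "b"])

# Column c has no weight: it becomes NaN after alignment and sum skips it.
skipped = (df * weights).sum(axis=1)

# Alternative: extend the weights to every column, using 0 where none is given.
full_weights = weights.reindex(df.columns, fill_value=0)
explicit = (df * full_weights).sum(axis=1)

print(skipped)
print(explicit)
```

Both give the same totals; the second keeps integer data as integers because no NaN is ever introduced.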