Python: summing two columns in a pandas DataFrame

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow address: http://stackoverflow.com/questions/22342285/

Date: 2020-08-19 00:45:53  Source: igfitidea

summing two columns in a pandas dataframe

Tags: python, pandas

Asked by yoshiserry

When I use this syntax, it creates a Series rather than adding a column to my new DataFrame (sum). Please help.

My code:

sum = data['variance'] = data.budget + data.actual

My data (in DataFrame df) currently has everything except the budget - actual column; I want to create a variance column:

    cluster     date    budget  actual          | budget - actual
0   a   2014-01-01 00:00:00     11000   10000       1000
1   a   2014-02-01 00:00:00     1200    1000
2   a   2014-03-01 00:00:00     200     100
3   b   2014-04-01 00:00:00     200     300
4   b   2014-05-01 00:00:00     400     450
5   c   2014-06-01 00:00:00     700     1000
6   c   2014-07-01 00:00:00     1200    1000
7   c   2014-08-01 00:00:00     200     100
8   c   2014-09-01 00:00:00     200     300

Accepted answer by Andy Hayden

I think you've misunderstood some Python syntax. The following does two assignments:

In [11]: a = b = 1

In [12]: a
Out[12]: 1

In [13]: b
Out[13]: 1
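
To make the mechanics a little more concrete, here is a minimal sketch (the names total and values are illustrative and not from the question). A chained assignment evaluates the right-hand side once and binds every target to that same object:

```python
# Chained assignment: the list is created once and bound to both names.
total = values = [1, 2, 3]

print(total is values)  # True: both names refer to the same list

# A change made through one name is therefore visible through the other.
values.append(4)
print(total)  # [1, 2, 3, 4]
```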

So in your code it was as if you were doing:

sum = df['budget'] + df['actual']  # a Series
# and
df['variance'] = df['budget'] + df['actual']  # assigned to a column

The latter creates a new column for df:

In [21]: df
Out[21]:
  cluster                 date  budget  actual
0       a  2014-01-01 00:00:00   11000   10000
1       a  2014-02-01 00:00:00    1200    1000
2       a  2014-03-01 00:00:00     200     100
3       b  2014-04-01 00:00:00     200     300
4       b  2014-05-01 00:00:00     400     450
5       c  2014-06-01 00:00:00     700    1000
6       c  2014-07-01 00:00:00    1200    1000
7       c  2014-08-01 00:00:00     200     100
8       c  2014-09-01 00:00:00     200     300

In [22]: df['variance'] = df['budget'] + df['actual']

In [23]: df
Out[23]:
  cluster                 date  budget  actual  variance
0       a  2014-01-01 00:00:00   11000   10000     21000
1       a  2014-02-01 00:00:00    1200    1000      2200
2       a  2014-03-01 00:00:00     200     100       300
3       b  2014-04-01 00:00:00     200     300       500
4       b  2014-05-01 00:00:00     400     450       850
5       c  2014-06-01 00:00:00     700    1000      1700
6       c  2014-07-01 00:00:00    1200    1000      2200
7       c  2014-08-01 00:00:00     200     100       300
8       c  2014-09-01 00:00:00     200     300       500
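
Note that the column above holds a sum, while the table header in the question asks for budget - actual. If a difference is what is wanted, a sketch along these lines may help; the three rows are retyped from the question's data purely for illustration:

```python
import pandas as pd

# First three rows of the question's data, retyped for illustration.
df = pd.DataFrame({
    "cluster": ["a", "a", "a"],
    "budget": [11000, 1200, 200],
    "actual": [10000, 1000, 100],
})

# Element-wise subtraction, assigned as a new column.
df["variance"] = df["budget"] - df["actual"]
print(df)
```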

As an aside, you shouldn't use sum as a variable name, as that overrides the built-in sum function.
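
A small sketch of what goes wrong, assuming the name is rebound in the current scope (the value 5 is arbitrary). The built-in stays reachable through the builtins module:

```python
import builtins

sum = 5  # rebinding the name hides the built-in function in this scope

try:
    sum([1, 2, 3])  # fails: an int is not callable
    call_failed = False
except TypeError:
    call_failed = True

print(call_failed)  # True

# The original function is still available via the builtins module.
total = builtins.sum([1, 2, 3])
print(total)  # 6

sum = builtins.sum  # restore the usual binding
```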

Answer by Rishi Bansal

The same thing can be done using a lambda function. Here I am reading the data from an xlsx file.

import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name=4)  # an integer sheet_name is a zero-based sheet position
print(df)

Output:

  cluster Unnamed: 1      date  budget  actual
0       a 2014-01-01  00:00:00   11000   10000
1       a 2014-02-01  00:00:00    1200    1000
2       a 2014-03-01  00:00:00     200     100
3       b 2014-04-01  00:00:00     200     300
4       b 2014-05-01  00:00:00     400     450
5       c 2014-06-01  00:00:00     700    1000
6       c 2014-07-01  00:00:00    1200    1000
7       c 2014-08-01  00:00:00     200     100
8       c 2014-09-01  00:00:00     200     300

Sum the two columns into a new third column.

# axis=1 passes each row to the lambda
df['variance'] = df.apply(lambda x: x['budget'] + x['actual'], axis=1)
print(df)

Output:

  cluster Unnamed: 1      date  budget  actual  variance
0       a 2014-01-01  00:00:00   11000   10000     21000
1       a 2014-02-01  00:00:00    1200    1000      2200
2       a 2014-03-01  00:00:00     200     100       300
3       b 2014-04-01  00:00:00     200     300       500
4       b 2014-05-01  00:00:00     400     450       850
5       c 2014-06-01  00:00:00     700    1000      1700
6       c 2014-07-01  00:00:00    1200    1000      2200
7       c 2014-08-01  00:00:00     200     100       300
8       c 2014-09-01  00:00:00     200     300       500
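
For comparison, a short sketch (using three rows retyped from the data above, as an illustration only) suggesting that the row-wise apply and the plain column addition from the accepted answer produce the same values. The plain addition avoids calling a Python function once per row, which usually makes it the faster choice on large frames:

```python
import pandas as pd

# Three rows retyped from the example data, for illustration.
df = pd.DataFrame({
    "budget": [11000, 1200, 200],
    "actual": [10000, 1000, 100],
})

# Row-wise: the lambda is called once per row.
row_wise = df.apply(lambda row: row["budget"] + row["actual"], axis=1)

# Column-wise: a single element-wise addition of two Series.
column_wise = df["budget"] + df["actual"]

print(row_wise.tolist())     # [21000, 2200, 300]
print(column_wise.tolist())  # [21000, 2200, 300]
```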

Answer by Archie

You could also use the .add() function:

df.loc[:, 'variance'] = df.loc[:, 'budget'].add(df.loc[:, 'actual'])
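
One practical difference from the + operator is that Series.add accepts a fill_value argument. The sketch below uses made-up values with gaps (not the question's data) to show the effect: a value missing on one side is treated as 0 instead of turning the result into NaN.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap in each column, for illustration.
df = pd.DataFrame({
    "budget": [200.0, np.nan, 400.0],
    "actual": [300.0, 450.0, np.nan],
})

# The + operator propagates NaN from either side.
plain = df["budget"] + df["actual"]

# With fill_value=0, a value missing on one side counts as 0.
filled = df["budget"].add(df["actual"], fill_value=0)

print(plain.isna().tolist())  # [False, True, True]
print(filled.tolist())        # [500.0, 450.0, 400.0]
```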

Answer by R. Cox

If "budget" has any NaN values but you don't want the sum to be NaN, then try:

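import math

import numpy as np

def fun(b, a):
    # A missing budget contributes nothing: return the actual value alone.
    if math.isnan(b):
        return a
    return b + a

# np.vectorize wraps fun so that it is applied element by element.
f = np.vectorize(fun, otypes=[float])

df['variance'] = f(df['budget'], df['actual'])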
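
As a possible alternative (a sketch, not part of the original answer), the same rule can be written with pandas alone by replacing missing budget values with 0 before adding. The sample values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up values with a gap in budget only, for illustration.
df = pd.DataFrame({
    "budget": [200.0, np.nan, 400.0],
    "actual": [300.0, 450.0, 100.0],
})

# A missing budget is treated as 0; a missing actual would still give NaN,
# matching the behaviour of the function above.
df["variance"] = df["budget"].fillna(0) + df["actual"]

print(df["variance"].tolist())  # [500.0, 450.0, 500.0]
```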

Answer by pylist

df['variance'] = df.loc[:,['budget','actual']].sum(axis=1)
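
One thing worth knowing about this form is that DataFrame.sum skips NaN values by default, so a row in which every value is missing sums to 0. The sketch below uses made-up values; passing min_count=1 is one way to get NaN for such rows instead:

```python
import numpy as np
import pandas as pd

# Made-up values; the last row is missing in both columns.
df = pd.DataFrame({
    "budget": [200.0, np.nan, np.nan],
    "actual": [300.0, 450.0, np.nan],
})

# Default: NaN values are skipped, so an all-NaN row sums to 0.
default_sum = df[["budget", "actual"]].sum(axis=1)

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN.
strict_sum = df[["budget", "actual"]].sum(axis=1, min_count=1)

print(default_sum.tolist())        # [500.0, 450.0, 0.0]
print(strict_sum.isna().tolist())  # [False, False, True]
```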