Python: summing two columns in a pandas DataFrame
Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/22342285/
summing two columns in a pandas dataframe
Asked by yoshiserry
When I use this syntax, it creates a Series rather than adding a column to my new DataFrame (sum). Please help.
My code:
sum = data['variance'] = data.budget + data.actual
My data (in DataFrame df) currently has everything except the "budget - actual" column; I want to create a variance column:
cluster date budget actual | budget - actual
0 a 2014-01-01 00:00:00 11000 10000 1000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
Accepted answer by Andy Hayden
I think you've misunderstood some Python syntax. The following does two assignments:
In [11]: a = b = 1
In [12]: a
Out[12]: 1
In [13]: b
Out[13]: 1
So in your code it was as if you were doing:
sum = df['budget'] + df['actual']  # a Series
# and
df['variance'] = df['budget'] + df['actual']  # assigned to a column
The latter creates a new column for df:
In [21]: df
Out[21]:
cluster date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
In [22]: df['variance'] = df['budget'] + df['actual']
In [23]: df
Out[23]:
cluster date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
As an aside, you shouldn't use sum as a variable name, as that overrides the built-in sum function.
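To make the chained assignment concrete, here is a minimal sketch on a small frame. The three rows are copied from the question's data, and the name `total` is an assumption, chosen only to avoid reusing the built-in name `sum`.

```python
import pandas as pd

# Three rows taken from the question's data
df = pd.DataFrame({"budget": [11000, 1200, 200], "actual": [10000, 1000, 100]})

# Chained assignment: the right-hand side is evaluated once and then bound
# to both targets, so `total` is a Series and df gains a 'variance' column.
total = df["variance"] = df["budget"] + df["actual"]

print(type(total).__name__)     # Series
print(list(df.columns))         # ['budget', 'actual', 'variance']
print(df["variance"].tolist())  # [21000, 2200, 300]
```

If only the new column is wanted, the plain `df['variance'] = ...` form is enough and the extra name can be dropped.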
Answer by Rishi Bansal
The same thing can be done using a lambda function. Here I am reading the data from an xlsx file.
import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name=4)
print(df)
Output:
cluster Unnamed: 1 date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
Sum the two columns into a new third column.
df['variance'] = df.apply(lambda x: x['budget'] + x['actual'], axis=1)
print(df)
Output:
cluster Unnamed: 1 date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
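For comparison, the row-wise `apply` call should produce the same values as the plain column addition from the accepted answer. The sketch below checks this on three rows copied from the data above; the names `row_wise` and `vectorized` are assumptions made for illustration.

```python
import pandas as pd

# Three rows taken from the data above
df = pd.DataFrame({"budget": [200, 400, 700], "actual": [300, 450, 1000]})

# Row-wise: the lambda is called once per row
row_wise = df.apply(lambda x: x["budget"] + x["actual"], axis=1)

# Column-wise: one vectorized addition over whole columns
vectorized = df["budget"] + df["actual"]

print(row_wise.tolist() == vectorized.tolist())
```

On large frames the column-wise form is generally the faster of the two, since it avoids a Python-level function call per row.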
Answer by Archie
Answer by R. Cox
If "budget" has any NaN values but you don't want the sum to be NaN, then try:
import math
import numpy as np

def fun(b, a):
    # If budget is NaN, fall back to actual alone
    if math.isnan(b):
        return a
    else:
        return b + a

f = np.vectorize(fun, otypes=[float])
df['variance'] = f(df['budget'], df['actual'])
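A possible alternative, not part of the original answer, is to fill the missing budget values with 0 before adding. This gives the same result as the function above, because a NaN budget then contributes nothing to the sum. The rows below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: the second budget value is missing
df = pd.DataFrame({
    "budget": [200.0, np.nan, 700.0],
    "actual": [300.0, 450.0, 1000.0],
})

# Treat a missing budget as 0, then add column-wise
df["variance"] = df["budget"].fillna(0) + df["actual"]

print(df["variance"].tolist())  # [500.0, 450.0, 1700.0]
```

As with the function above, a NaN in "actual" still gives NaN in the result, since only "budget" is filled.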
Answer by pylist
df['variance'] = df.loc[:,['budget','actual']].sum(axis=1)
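One thing worth knowing about this form is that `sum(axis=1)` skips NaN values by default, so a row with a single missing value still gets a number, and a row that is entirely NaN sums to 0. The sketch below uses made-up rows to show that behaviour, along with the `min_count=1` option for keeping NaN on all-missing rows. The names `skip_nan` and `keep_all_nan` are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows: one budget missing, then both values missing
df = pd.DataFrame({
    "budget": [200.0, np.nan, np.nan],
    "actual": [300.0, 450.0, np.nan],
})

# Default: NaN values are skipped, so an all-NaN row sums to 0.0
skip_nan = df.loc[:, ["budget", "actual"]].sum(axis=1)

# min_count=1 needs at least one non-NaN value, otherwise the result is NaN
keep_all_nan = df.loc[:, ["budget", "actual"]].sum(axis=1, min_count=1)

print(skip_nan.tolist())      # [500.0, 450.0, 0.0]
print(keep_all_nan.tolist())  # [500.0, 450.0, nan]
```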

