pandas: using an if statement in a DataFrame with lambda functions
原文地址: http://stackoverflow.com/questions/27845145/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Using an if statement in a dataframe with lambda functions
Asked by IcemanBerlin
I am trying to add a new column to a dataframe based on an if statement that depends on the values of two columns, i.e. if column x == None then column y, else column x.
Below is the script I have written, but it doesn't work. Any ideas?
dfCurrentReportResults['Retention'] = dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x)
I also got this error message: AttributeError: ("'Series' object has no attribute 'Retention_x'", u'occurred at index BUSINESSUNIT_NAME')
FYI: BUSINESSUNIT_NAME is the first column name.
Additional Info:
When printed, my data looks like this. I want to add a third column that takes the value from whichever column has one, and stays NaN otherwise.
   Retention_x  Retention_y
0            1          NaN
1          NaN     0.672183
2          NaN     1.035613
3          NaN     0.771469
4          NaN     0.916667
5          NaN          NaN
6          NaN          NaN
7          NaN          NaN
8          NaN          NaN
9          NaN          NaN
UPDATE: In the end, my problem was how to test for null values in my dataframe. The final line of code I used, which also includes axis=1, answered my question:
dfCurrentReportResults['RetentionLambda'] = dfCurrentReportResults.apply(lambda x : x['Retention_y'] if pd.isnull(x['Retention_x']) else x['Retention_x'], axis = 1)
Thanks @EdChum, @strim099 and @aus_lacy for all your input. As my data set gets larger I may switch to the np.where option if I notice performance issues.
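As a minimal sketch of why the null test mattered in the update above: a missing value in a numeric column is stored as NaN, which is a float and never compares equal to None, while pd.isnull recognises both. The small frame here is a hypothetical stand-in for the first rows of the printed data, not the asker's real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the shape of the printed data.
df = pd.DataFrame({
    "Retention_x": [1.0, np.nan, np.nan, np.nan],
    "Retention_y": [np.nan, 0.672183, 1.035613, np.nan],
})

# NaN is a float, so an equality test against None is False.
nan_equals_none = (np.nan == None)  # noqa: E711

# pd.isnull treats both NaN and None as missing.
nan_is_null = pd.isnull(np.nan)
none_is_null = pd.isnull(None)

# Row-wise version from the update: axis=1 passes each row to the lambda.
df["RetentionLambda"] = df.apply(
    lambda row: row["Retention_y"] if pd.isnull(row["Retention_x"]) else row["Retention_x"],
    axis=1,
)
```

Rows where both columns are missing stay NaN in the new column, which matches the behaviour asked for in the question.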
Answered by Jason Strimpel
Your lambda is operating along axis 0, which means it is applied to each column in turn. Simply add axis=1 to the apply argument list. This is clearly documented.
In [1]: import pandas
In [2]: dfCurrentReportResults = pandas.DataFrame([['a','b'],['c','d'],['e','f'],['g','h'],['i','j']], columns=['Retention_y', 'Retention_x'])
In [3]: dfCurrentReportResults['Retention_x'][1] = None
In [4]: dfCurrentReportResults['Retention_x'][3] = None
In [5]: dfCurrentReportResults
Out[5]:
  Retention_y Retention_x
0           a           b
1           c        None
2           e           f
3           g        None
4           i           j
In [6]: dfCurrentReportResults['Retention'] = dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x, axis=1)
In [7]: dfCurrentReportResults
Out[7]:
  Retention_y Retention_x Retention
0           a           b         b
1           c        None         c
2           e           f         f
3           g        None         g
4           i           j         j
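The AttributeError in the question follows directly from the axis choice. The sketch below, which uses hypothetical values for BUSINESSUNIT_NAME and the retention columns, checks what the lambda receives in each case: with axis=0 it gets a whole column, whose index holds the row labels, so there is no Retention_x attribute to look up; with axis=1 it gets a row, whose index holds the column names.

```python
import pandas as pd

# Hypothetical frame; only the column names come from the question.
df = pd.DataFrame({
    "BUSINESSUNIT_NAME": ["north", "south"],
    "Retention_x": [1.0, None],
    "Retention_y": [None, 0.5],
})

# axis=0 (the default): the function is called once per column.
# Each column Series is indexed by row labels 0 and 1.
has_attr_per_column = df.apply(lambda s: hasattr(s, "Retention_x"), axis=0)

# axis=1: the function is called once per row.
# Each row Series is indexed by the column names.
has_attr_per_row = df.apply(lambda s: hasattr(s, "Retention_x"), axis=1)
```

The first result is indexed by column name and is False everywhere, starting with BUSINESSUNIT_NAME, which is why the error message names that column.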
Answered by EdChum
Just use np.where:
dfCurrentReportResults['Retention'] = np.where(df.Retention_x == None, df.Retention_y, df.Retention_x)
np.where takes the test condition as its first parameter and sets the value to df.Retention_y where the condition is true, else df.Retention_x. (Here df is shorthand for dfCurrentReportResults.)
Also, avoid using apply where possible, as it just loops over the values; np.where is a vectorised method and will scale much better.
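A minimal sketch of the vectorised approach, assuming the missing values are NaN as in the question's printed data (the sample values are hypothetical). It builds the mask with isnull() rather than an equality test, and compares the result against the row-wise apply version to show that the two agree.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the question's column names.
df = pd.DataFrame({
    "Retention_x": [1.0, np.nan, np.nan],
    "Retention_y": [np.nan, 0.672183, np.nan],
})

# Boolean mask computed for the whole column at once.
null_mask = df["Retention_x"].isnull()

# Where the mask is True take Retention_y, otherwise take Retention_x.
df["Retention"] = np.where(null_mask, df["Retention_y"], df["Retention_x"])

# Row-wise equivalent, shown only for comparison.
via_apply = df.apply(
    lambda row: row["Retention_y"] if pd.isnull(row["Retention_x"]) else row["Retention_x"],
    axis=1,
)
```

np.where returns a plain NumPy array, which pandas aligns positionally when it is assigned to the new column.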
UPDATE
OK, there is no need to use np.where; just use the following simpler syntax:
dfCurrentReportResults['Retention'] = df.Retention_y.where(df.Retention_x == None, df.Retention_x)
Further update: an elementwise comparison with == None does not match NaN values, so test with isnull() instead:
dfCurrentReportResults['Retention'] = df.Retention_y.where(df.Retention_x.isnull(), df.Retention_x)
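For illustration, here is the Series.where form on a small hypothetical frame, next to two alternatives that are not part of the original answer: fillna with another Series and combine_first. Under the assumption that the goal is simply "use Retention_x, falling back to Retention_y when it is missing", all three should give the same column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the question's column names.
df = pd.DataFrame({
    "Retention_x": [1.0, np.nan, np.nan],
    "Retention_y": [np.nan, 0.672183, np.nan],
})

# Series.where keeps the caller's value where the condition is True
# and takes the value from the second argument everywhere else.
via_where = df["Retention_y"].where(df["Retention_x"].isnull(), df["Retention_x"])

# Alternatives (an assumption about intent, not from the answer):
# fill the gaps in Retention_x with the aligned values of Retention_y.
via_fillna = df["Retention_x"].fillna(df["Retention_y"])
via_combine_first = df["Retention_x"].combine_first(df["Retention_y"])
```

The fillna and combine_first forms read closer to the stated intent, while the where form mirrors the if/else structure of the original lambda.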

