Python/Pandas - creating new variable based on several variables and if/elif/else function
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/38798115/
Asked by John_Everyman
I am trying to create a new variable that is conditional on the values of several other variables. I'm writing here because I tried writing this as a nested ifelse() statement in R, but it had too many nested ifelse calls and threw an error, and I think there should be an easier way to sort this out in Python.
I have a dataframe (called df) that looks roughly like this (although in reality it's much bigger with many more month/year variables) that I've read in as a pandas DataFrame:
   ID  Sept_2015  Oct_2015  Nov_2015  Dec_2015  Jan_2016  Feb_2016  Mar_2016  grad_time
0   1          0         0         0         0         1         1         1        240
1   2          0         0         0         0         0         0         0        218
2   3          0         0         0         0         1         1         1        236
3   4          0         0         0         0         0         0         0          0
4   5          1         1         1         1         1         1         1        206
I'm trying to create a new variable that depends on values from all these variables, but values from "earlier" variables need to take precedence, so the if/elif/else condition would look something like this:
if df['Sept_2015'] > 0 & df['grad_time'] <= 236:
    return 236
elif df['Oct_2015'] > 0 & df['grad_time'] <= 237:
    return 237
elif df['Nov_2015'] > 0 & df['grad_time'] <= 238:
    return 238
elif df['Dec_2015'] > 0 & df['grad_time'] <= 239:
    return 239
elif df['Jan_2016'] > 0 & df['grad_time'] <= 240:
    return 240
elif df['Feb_2016'] > 0 & df['grad_time'] <= 241:
    return 241
elif df['Mar_2016'] > 0 & df['grad_time'] <= 242:
    return 242
else:
    return 0
And based on this, I'd like it to return a new variable that looks like this:
trisk
0 240
1 0
2 240
3 0
4 236
I've tried writing a function like this:
def test_func(df):
    """ Test Function for generating new value"""
    if df['Sept_2015'] > 0 & df['grad_time'] <= 236:
        return 236
    elif df['Oct_2015'] > 0 & df['grad_time'] <= 237:
        return 237
    ...
    else:
        return 0
and mapping it over the dataframe to create the new variable like this:
new_df = pd.DataFrame(map(test_func, df))
However, when I run it, I get the following TypeError:
Traceback (most recent call last):
  File "<ipython-input-83-19b45bcda45a>", line 1, in <module>
    new_df = pd.DataFrame(map(new_func, test_df))
  File "<ipython-input-82-a2eb6f9d7a3a>", line 3, in new_func
    if df['Sept_2015'] > 0 & df['grad_time'] <= 236:
TypeError: string indices must be integers, not str
So I can see it doesn't want the column name here. But I've tried this a number of other ways and can't get it to work. Also, I understand that mapping the function might not be the best way to write this, so I am open to other ways of generating the trisk variable. Thanks in advance, and apologies if I've left anything out.
Answered by piRSquared
Setup
import pandas as pd

# Sample data from the question, with ID as the index
df = pd.DataFrame([[0, 0, 0, 0, 1, 1, 1, 240],
                   [0, 0, 0, 0, 0, 0, 0, 218],
                   [0, 0, 0, 0, 1, 1, 1, 236],
                   [0, 0, 0, 0, 0, 0, 0, 0],
                   [1, 1, 1, 1, 1, 1, 1, 206]],
                  pd.Index(range(1, 6), name='ID'),
                  ['Sept_2015', 'Oct_2015', 'Nov_2015', 'Dec_2015',
                   'Jan_2016', 'Feb_2016', 'Mar_2016', 'grad_time'])
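I mostly used numpy for this:
import numpy as np

# Value returned for each month column, in column order
a = np.array([236, 237, 238, 239, 240, 241, 242])
# Month flags as booleans (every column except grad_time)
b = df.values[:, :-1] > 0
# g[i, j] is True where row i's grad_time <= a[j]
g = df.values[:, -1][:, None] <= a
# First column where both conditions hold; rows with no match become 0
a[(b & g).argmax(1)] * (b & g).any(1)
Assigning it to a new column:
df['trisk'] = a[(b & g).argmax(1)] * (b & g).any(1)
df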
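As a rough walk-through of the numpy approach above, the sketch below rebuilds the sample data and names each intermediate step. The names `months`, `thresholds`, `flags`, `in_time`, `mask`, and `first_hit` are my own labels for illustration, not part of the original answer, and `np.where` stands in for the multiply-by-`any` trick.

```python
import numpy as np
import pandas as pd

months = ['Sept_2015', 'Oct_2015', 'Nov_2015', 'Dec_2015',
          'Jan_2016', 'Feb_2016', 'Mar_2016']
df = pd.DataFrame([[0, 0, 0, 0, 1, 1, 1, 240],
                   [0, 0, 0, 0, 0, 0, 0, 218],
                   [0, 0, 0, 0, 1, 1, 1, 236],
                   [0, 0, 0, 0, 0, 0, 0, 0],
                   [1, 1, 1, 1, 1, 1, 1, 206]],
                  index=pd.Index(range(1, 6), name='ID'),
                  columns=months + ['grad_time'])

# One threshold per month column, in the same order as `months`
thresholds = np.array([236, 237, 238, 239, 240, 241, 242])

# flags[i, j]: month j is positive for row i
flags = df[months].to_numpy() > 0
# in_time[i, j]: row i's grad_time is <= threshold j (broadcast to 5 x 7)
in_time = df['grad_time'].to_numpy()[:, None] <= thresholds

# Both conditions must hold, as in each if/elif branch
mask = flags & in_time
# argmax returns the first True per row, so earlier months take precedence
first_hit = mask.argmax(axis=1)
# Rows with no True at all fall through to the else branch (0)
trisk = np.where(mask.any(axis=1), thresholds[first_hit], 0)

print(trisk.tolist())  # [240, 0, 240, 0, 236]
```

Because `argmax` returns 0 for a row with no match, the `mask.any(axis=1)` check is what separates "matched the first month" from "matched nothing".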
Answered by Alberto Garcia-Raboso
Without getting into streamlining your logic (which @piRSquared does): you can apply your test_func to the rows by calling .apply(test_func, axis=1) on your dataframe.
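To illustrate why `map` over the dataframe raised a TypeError while `.apply(..., axis=1)` works, here is a small sketch. The two-column `toy` frame and the helper `kind_of` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame, just large enough to show the behaviour
toy = pd.DataFrame({'Sept_2015': [0, 1], 'grad_time': [240, 206]})

# Iterating over a DataFrame yields its column labels (strings), not rows,
# so map(func, toy) calls func('Sept_2015'), func('grad_time'), ...
labels = list(toy)
print(labels)  # ['Sept_2015', 'grad_time']

# Indexing a string with a string is what triggers the TypeError
try:
    labels[0]['Sept_2015']
    raised = False
except TypeError:
    raised = True
print(raised)  # True

def kind_of(row):
    """Report the type of object that apply passes in."""
    return type(row).__name__

# With axis=1, apply passes each row as a Series indexed by column name
row_types = toy.apply(kind_of, axis=1).tolist()
print(row_types)  # ['Series', 'Series']
```

Since each row arrives as a Series, `row['Sept_2015']` inside the function is a scalar lookup, which is what the if/elif chain expects.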
import io
import pandas as pd

data = io.StringIO('''\
ID Sept_2015 Oct_2015 Nov_2015 Dec_2015 Jan_2016 Feb_2016 Mar_2016 grad_time
0 1 0 0 0 0 1 1 1 240
1 2 0 0 0 0 0 0 0 218
2 3 0 0 0 0 1 1 1 236
3 4 0 0 0 0 0 0 0 0
4 5 1 1 1 1 1 1 1 206
''')
# Whitespace-separated columns; the unnamed first column becomes the index
df = pd.read_csv(data, sep=r'\s+')

def test_func(row):
    """Return the threshold of the first month whose condition holds, else 0."""
    # Each comparison is parenthesized because & binds tighter than > and <=;
    # `and` is used since the values in a single row are scalars.
    if (row['Sept_2015'] > 0) and (row['grad_time'] <= 236):
        return 236
    elif (row['Oct_2015'] > 0) and (row['grad_time'] <= 237):
        return 237
    elif (row['Nov_2015'] > 0) and (row['grad_time'] <= 238):
        return 238
    elif (row['Dec_2015'] > 0) and (row['grad_time'] <= 239):
        return 239
    elif (row['Jan_2016'] > 0) and (row['grad_time'] <= 240):
        return 240
    elif (row['Feb_2016'] > 0) and (row['grad_time'] <= 241):
        return 241
    elif (row['Mar_2016'] > 0) and (row['grad_time'] <= 242):
        return 242
    else:
        return 0

# axis=1 passes each row to test_func as a Series
trisk = df.apply(test_func, axis=1)
trisk.name = 'trisk'
print(trisk)
Output:
0 240
1 0
2 240
3 0
4 236
Name: trisk, dtype: int64
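One detail worth checking in conditions written as `x > 0 & y <= 236`, as in the question: `&` binds more tightly than the comparison operators, so Python reads this as the chained comparison `x > (0 & y) <= 236`. The sketch below uses made-up scalar values (`sept`, `grad_time`) to show that the unparenthesized form can disagree with the intended logic. On the sample data the two forms happen to give the same result.

```python
# Hypothetical values: the month flag is set, but grad_time exceeds 236
sept, grad_time = 1, 250

# Parsed as: sept > (0 & grad_time) <= 236, i.e. (1 > 0) and (0 <= 236)
unparenthesized = sept > 0 & grad_time <= 236

# Intended logic: both comparisons evaluated first, then combined
parenthesized = (sept > 0) and (grad_time <= 236)

print(unparenthesized)  # True
print(parenthesized)    # False
```

For scalars inside a row-wise function, `and` with parentheses is the safer spelling. For whole columns, the comparisons need parentheses before combining them with `&`.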