Python/Pandas - Creating a new variable based on several variables and an if/elif/else function

Question

Asked by John_Everyman

I am trying to create a new variable whose value is conditional on the values of several other variables. I'm asking here because I tried writing this as a nested ifelse() statement in R, but it had too many nested ifelse() calls and threw an error, and I think there should be an easier way to do this in Python.

I have a dataframe (called df), read in as a pandas DataFrame, that looks roughly like this (in reality it is much bigger, with many more month/year variables):

   ID  Sept_2015  Oct_2015  Nov_2015  Dec_2015  Jan_2016  Feb_2016  Mar_2016  \
0   1          0         0         0         0         1         1         1   
1   2          0         0         0         0         0         0         0   
2   3          0         0         0         0         1         1         1   
3   4          0         0         0         0         0         0         0   
4   5          1         1         1         1         1         1         1   

   grad_time  
0        240  
1        218  
2        236  
3          0  
4        206

I'm trying to create a new variable that depends on values from all these variables, but values from "earlier" variables need to take precedence, so the if/elif/else condition would look something like this:

if df['Sept_2015'] > 0 & df['grad_time'] <= 236:
    return 236
elif df['Oct_2015'] > 0 & df['grad_time'] <= 237:
    return 237
elif df['Nov_2015'] > 0 & df['grad_time'] <= 238:
    return 238
elif df['Dec_2015'] > 0 & df['grad_time'] <= 239:
    return 239
elif df['Jan_2016'] > 0 & df['grad_time'] <= 240:
    return 240
elif df['Feb_2016'] > 0 & df['grad_time'] <= 241:
    return 241
elif df['Mar_2016'] > 0 & df['grad_time'] <= 242:
    return 242
else:
    return 0

And based on this, I'd like it to return a new variable that looks like this:

I've tried writing a function like this:

def test_func(df):
    """ Test Function for generating new value"""
    if df['Sept_2015'] > 0 & df['grad_time'] <= 236:
        return 236
    elif df['Oct_2015'] > 0 & df['grad_time'] <= 237:
        return 237
    ...
    else:
        return 0

and mapping it over the dataframe to create a new variable like this:

new_df = pd.DataFrame(map(test_func, df))

However, when I run it, I get the following TypeError:

 Traceback (most recent call last):

  File "<ipython-input-83-19b45bcda45a>", line 1, in <module>
     new_df = pd.DataFrame(map(new_func, test_df))

  File "<ipython-input-82-a2eb6f9d7a3a>", line 3, in new_func
     if df['Sept_2015'] > 0 & df['grad_time'] <= 236:

TypeError: string indices must be integers, not str

So I can see it doesn't want the column name here. But I've tried this a number of other ways and can't get it to work. Also, I understand that mapping the function might not be the best way to write this, so I am open to other ways of generating the trisk variable. Thanks in advance, and apologies if I have left anything out.

Answer 1

Answered by piRSquared

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0, 0, 0, 1, 1, 1, 240],
                   [0, 0, 0, 0, 0, 0, 0, 218],
                   [0, 0, 0, 0, 1, 1, 1, 236],
                   [0, 0, 0, 0, 0, 0, 0,   0],
                   [1, 1, 1, 1, 1, 1, 1, 206]],
                  pd.Index(range(1, 6), name='ID'),
                  ['Sept_2015', 'Oct_2015', 'Nov_2015', 'Dec_2015',
                   'Jan_2016', 'Feb_2016', 'Mar_2016', 'grad_time'])

I mostly used numpy for this:

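# Threshold paired with each month column, in column order
a = np.array([236, 237, 238, 239, 240, 241, 242])
# Month flags (every column except grad_time); assumed to hold only 0 or 1
b = df.values[:, :-1]
# g[i, j] is True when row i's grad_time is <= threshold j (via broadcasting)
g = df.values[:, -1][:, None] <= a

# argmax picks the first month where both conditions hold;
# multiplying by any(1) turns rows with no match into 0
a[(b & g).argmax(1)] * (b & g).any(1)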

Assigning it to a new column:

df['trisk'] = a[(b & g).argmax(1)] * (b & g).any(1)

df
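
The same idea can be spelled out step by step. The following is only a sketch of the approach above, not the answerer's exact code: the names months, thresholds, enrolled, in_time and hit are introduced here for illustration, and np.where stands in for the multiplication trick.

```python
import numpy as np
import pandas as pd

# Month columns in precedence order (earliest first); names taken from the question
months = ['Sept_2015', 'Oct_2015', 'Nov_2015', 'Dec_2015',
          'Jan_2016', 'Feb_2016', 'Mar_2016']

df = pd.DataFrame([[0, 0, 0, 0, 1, 1, 1, 240],
                   [0, 0, 0, 0, 0, 0, 0, 218],
                   [0, 0, 0, 0, 1, 1, 1, 236],
                   [0, 0, 0, 0, 0, 0, 0,   0],
                   [1, 1, 1, 1, 1, 1, 1, 206]],
                  index=pd.Index(range(1, 6), name='ID'),
                  columns=months + ['grad_time'])

# One threshold per month column, in the same order
thresholds = np.array([236, 237, 238, 239, 240, 241, 242])

# enrolled[i, j]: the month flag is positive for row i, month j
enrolled = df[months].to_numpy() > 0

# in_time[i, j]: row i's grad_time is <= threshold j
# (a (5, 1) column broadcast against a (7,) vector gives shape (5, 7))
in_time = df['grad_time'].to_numpy()[:, None] <= thresholds

# Both conditions must hold, as in each if/elif branch
hit = enrolled & in_time

# argmax returns the position of the first True in each row,
# which mirrors the "earlier months take precedence" rule
first = hit.argmax(axis=1)

# Rows with no True at all fall through to the else branch (0)
trisk = np.where(hit.any(axis=1), thresholds[first], 0)
print(trisk)
```

Comparing with > 0 rather than relying on the flags being exactly 0 or 1 keeps the mask boolean even if a month column were to hold larger counts.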

Answer 2

Answered by Alberto Garcia-Raboso

Without getting into streamlining your logic (which @piRSquared's answer does): you can apply your test_func to the rows by calling .apply(test_func, axis=1) on your dataframe.
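
As a rough illustration of why the map() attempt in the question fails, the sketch below uses a hypothetical two-row frame (not the asker's data). Iterating over a DataFrame yields its column labels, so a function mapped over it receives strings, whereas apply with axis=1 hands it one row at a time.

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only
df = pd.DataFrame({'Sept_2015': [0, 1], 'grad_time': [240, 206]})

# Iterating a DataFrame yields column labels, not rows
labels = list(df)
print(labels)

# A function mapped over df therefore receives a string such as 'Sept_2015',
# and indexing a string with another string raises TypeError
try:
    labels[0]['Sept_2015']
    raised = False
except TypeError:
    raised = True
print(raised)

# apply(..., axis=1) passes each row as a Series, which can be indexed by column name
row_values = df.apply(lambda row: row['grad_time'], axis=1).tolist()
print(row_values)
```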

import io
import pandas as pd

data = io.StringIO('''\
   ID  Sept_2015  Oct_2015  Nov_2015  Dec_2015  Jan_2016  Feb_2016  Mar_2016  grad_time
0   1          0         0         0         0         1         1         1        240
1   2          0         0         0         0         0         0         0        218
2   3          0         0         0         0         1         1         1        236
3   4          0         0         0         0         0         0         0          0
4   5          1         1         1         1         1         1         1        206
''')
# Split on runs of whitespace; the first (unnamed) column becomes the index
df = pd.read_csv(data, sep=r'\s+')

def test_func(row):
    """Return the threshold of the first month whose conditions hold, else 0."""
    # Each row is a Series, so the comparisons are scalar and `and` is appropriate.
    # Writing `x > 0 & y <= 236` would parse as `x > (0 & y) <= 236`.
    if row['Sept_2015'] > 0 and row['grad_time'] <= 236:
        return 236
    elif row['Oct_2015'] > 0 and row['grad_time'] <= 237:
        return 237
    elif row['Nov_2015'] > 0 and row['grad_time'] <= 238:
        return 238
    elif row['Dec_2015'] > 0 and row['grad_time'] <= 239:
        return 239
    elif row['Jan_2016'] > 0 and row['grad_time'] <= 240:
        return 240
    elif row['Feb_2016'] > 0 and row['grad_time'] <= 241:
        return 241
    elif row['Mar_2016'] > 0 and row['grad_time'] <= 242:
        return 242
    else:
        return 0

# axis=1 calls test_func once per row
trisk = df.apply(test_func, axis=1)
trisk.name = 'trisk'
print(trisk)

Output:

0    240
1      0
2    240
3      0
4    236
Name: trisk, dtype: int64
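
For larger frames, a vectorized alternative to the row-wise apply is np.select, which returns the choice for the first condition that is True in each row, much like an if/elif/else ladder. This is a sketch of a swapped-in technique rather than anything from the answers above, and the thresholds mapping is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

months = ['Sept_2015', 'Oct_2015', 'Nov_2015', 'Dec_2015',
          'Jan_2016', 'Feb_2016', 'Mar_2016']

df = pd.DataFrame([[0, 0, 0, 0, 1, 1, 1, 240],
                   [0, 0, 0, 0, 0, 0, 0, 218],
                   [0, 0, 0, 0, 1, 1, 1, 236],
                   [0, 0, 0, 0, 0, 0, 0,   0],
                   [1, 1, 1, 1, 1, 1, 1, 206]],
                  index=pd.Index(range(1, 6), name='ID'),
                  columns=months + ['grad_time'])

# Hypothetical mapping of month column -> threshold, in precedence order
thresholds = dict(zip(months, range(236, 243)))

# One boolean Series per branch; parentheses are needed because & binds
# more tightly than the comparison operators
conditions = [(df[month] > 0) & (df['grad_time'] <= limit)
              for month, limit in thresholds.items()]

# The first True condition wins for each row; default covers the else branch
df['trisk'] = np.select(conditions, list(thresholds.values()), default=0)
print(df['trisk'])
```

Because the conditions are built from the mapping, adding more month/year columns only means extending months and the threshold range.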
