Python: NumPy "where" with multiple conditions

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/39109045/

Time: 2020-08-19 21:53:08  Source: igfitidea

Numpy "where" with multiple conditions

Tags: python, pandas, numpy, dataframe

Asked by Poisson

I am trying to add a new column "energy_class" to a dataframe "df_energy". It should contain the string "high" if the "consumption_energy" value is above 400, "medium" if the value is between 200 and 400, and "low" if the value is under 200. I tried np.where from numpy, but numpy.where(condition[, x, y]) only handles two outcomes, not three as in my case.

Any ideas?

Thank you in advance.

Answered by Alexander

You can nest np.where calls, much like a chained ternary expression:

np.where(consumption_energy > 400, 'high', 
         (np.where(consumption_energy < 200, 'low', 'medium')))
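
To make the nesting concrete, here is a minimal sketch applied to a dataframe. The sample values are made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df_energy = pd.DataFrame({"consumption_energy": [459, 186, 250, 400, 200]})
c = df_energy["consumption_energy"]

# The outer np.where handles "high"; everything else falls through
# to the inner np.where, which splits "low" from "medium".
df_energy["energy_class"] = np.where(c > 400, "high",
                                     np.where(c < 200, "low", "medium"))
print(df_energy)
```

With these conditions the boundary values 200 and 400 both land in "medium", because neither `> 400` nor `< 200` matches them.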

Answered by Merlin

Try this, using the setup from @MaxU:

col         = 'consumption_energy'
conditions  = [ df2[col] >= 400, (df2[col] < 400) & (df2[col]> 200), df2[col] <= 200 ]
choices     = [ "high", 'medium', 'low' ]

df2["energy_class"] = np.select(conditions, choices, default=np.nan)


  consumption_energy energy_class
0                 459         high
1                 416         high
2                 186          low
3                 250       medium
4                 411         high
5                 210       medium
6                 343       medium
7                 328       medium
8                 208       medium
9                 223       medium
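
A variation on the same idea, as a sketch rather than the answer's exact code: np.select takes the first matching condition, so the conditions can be ordered instead of being made mutually exclusive. The sample data and the string default are assumptions of this sketch. A string default is used because mixing string choices with a float default such as np.nan may behave differently across NumPy versions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the question's thresholds.
df2 = pd.DataFrame({"consumption_energy": [459, 186, 250, 400, 200]})
col = "consumption_energy"

# Conditions are checked in order and the first match wins,
# so the second condition only sees values that are not > 400.
conditions = [df2[col] > 400, df2[col] >= 200]
choices = ["high", "medium"]

# Anything matching no condition (here: values under 200) gets the default.
df2["energy_class"] = np.select(conditions, choices, default="low")
print(df2)
```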

Answered by MaxU

I would use the cut() method here, which produces a very efficient, memory-saving category dtype:

In [124]: df
Out[124]:
   consumption_energy
0                 459
1                 416
2                 186
3                 250
4                 411
5                 210
6                 343
7                 328
8                 208
9                 223

In [125]: pd.cut(df.consumption_energy,
                 [0, 200, 400, np.inf],
                 labels=['low','medium','high']
          )
Out[125]:
0      high
1      high
2       low
3    medium
4      high
5    medium
6    medium
7    medium
8    medium
9    medium
Name: consumption_energy, dtype: category
Categories (3, object): [low < medium < high]
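
One detail worth checking with pd.cut is which side of each bin is closed. The sketch below, on made-up sample values, compares the default right-closed bins with `right=False`; the column names `energy_class` and `energy_class_left` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data including the boundary values 200 and 400.
df = pd.DataFrame({"consumption_energy": [459, 186, 250, 400, 200]})
bins = [0, 200, 400, np.inf]
labels = ["low", "medium", "high"]

# Default right=True: bins are (0, 200], (200, 400], (400, inf]
df["energy_class"] = pd.cut(df["consumption_energy"], bins, labels=labels)

# right=False: bins are [0, 200), [200, 400), [400, inf)
df["energy_class_left"] = pd.cut(df["consumption_energy"], bins,
                                 labels=labels, right=False)
print(df)
```

With the default, 200 is labelled "low" and 400 "medium"; with `right=False`, 200 becomes "medium" and 400 becomes "high".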

Answered by MaxU

I like to keep the code clean. That's why I prefer np.vectorize for such tasks.

def conditions(x):
    if x > 400:
        return "High"
    elif x > 200:
        return "Medium"
    else:
        return "Low"

func = np.vectorize(conditions)
energy_class = func(df_energy["consumption_energy"])

Then add the resulting NumPy array as a column of your dataframe:

df_energy["energy_class"] = energy_class

The advantage of this approach is that more complicated constraints on a column are easy to add later. Hope it helps.
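
As a self-contained sketch of the approach above: the sample values are made up, and passing `otypes=[object]` is an assumption of this sketch. It fixes the output dtype up front instead of letting np.vectorize infer it from the first element. The NumPy documentation describes np.vectorize as a convenience that is essentially a for loop, so the result should match a plain Python loop over the same function.

```python
import numpy as np

def conditions(x):
    # Plain Python branching, applied to one element at a time.
    if x > 400:
        return "High"
    elif x > 200:
        return "Medium"
    else:
        return "Low"

# Hypothetical sample values.
values = np.array([459, 186, 250, 400, 200])

# otypes=[object] is an assumption of this sketch: the output dtype
# is declared explicitly rather than inferred from the first result.
func = np.vectorize(conditions, otypes=[object])
vectorized = func(values)

# The same result computed with an explicit Python loop.
looped = [conditions(v) for v in values]
print(vectorized)
```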

Answered by wpmoradi

I second using np.vectorize. I find the code cleaner than nested np.where calls, and in my experience it also felt faster on larger data sets. Note, though, that the NumPy documentation describes np.vectorize as a convenience that is essentially a for loop, so measure before relying on any speed-up. You can also keep your conditions and their outcomes in a dictionary.

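# Vectorizing with numpy
import numpy as np

# maps each value expected in the source column to its label
row_dic = {'Condition1': 'high',
           'Condition2': 'medium',
           'Condition3': 'low',
           'Condition4': 'lowest'}

def Conditions(dfSeries_element, dictionary):
    '''
    dfSeries_element: a single element from the source column
    dictionary: the dictionary of your conditions with their outcome
    Returns None when the element is not a key of the dictionary.
    '''
    if dfSeries_element in dictionary:
        return dictionary[dfSeries_element]
    return None

def VectorizeConditions():
    # otypes=[object] lets unmatched elements stay None;
    # excluded={1} passes the dictionary through as-is instead of vectorizing over it
    func = np.vectorize(Conditions, otypes=[object], excluded={1})
    result_vector = func(df['Series'], row_dic)
    df['new_Series'] = result_vector

# running the function below applies the dictionary lookup to your df
# (assumes an existing dataframe df with a column named 'Series')
VectorizeConditions()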
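
When the conditions really are exact dictionary keys, the same lookup can also be written with pandas' own Series.map. This is a different technique from the np.vectorize answer above, shown only as a sketch on made-up data.

```python
import pandas as pd

row_dic = {"Condition1": "high",
           "Condition2": "medium",
           "Condition3": "low",
           "Condition4": "lowest"}

# Hypothetical dataframe; "Other" is deliberately not a key of row_dic.
df = pd.DataFrame({"Series": ["Condition1", "Condition3", "Other"]})

# Series.map looks each value up in the dictionary;
# values without a matching key become missing.
df["new_Series"] = df["Series"].map(row_dic)
print(df)
```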

Answered by Poudel

WARNING: if your data has missing values, np.where can be tricky to use and may silently give you the wrong result.

Consider this situation:

df['cons_ener_cat'] = np.where(df.consumption_energy > 400, 'high', 
         (np.where(df.consumption_energy < 200, 'low', 'medium')))

# if we do not use this second line, then
#  if consumption energy is missing it would be shown medium, which is WRONG.
df.loc[df.consumption_energy.isnull(), 'cons_ener_cat'] = np.nan
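
The reason for the extra line can be seen in a small sketch: comparisons against NaN are always False, so a missing value falls through both conditions and ends up in the last branch. The sample values are made up, and Series.mask is used here as one possible way to blank out the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
s = pd.Series([450.0, np.nan, 150.0, 300.0])

# NaN > 400 and NaN < 200 are both False, so the NaN row
# falls through to the final 'medium' branch.
labels = np.where(s > 400, "high", np.where(s < 200, "low", "medium"))
print(labels[1])

# Blank out the labels wherever the input was missing.
fixed = pd.Series(labels, index=s.index).mask(s.isna())
print(fixed)
```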

Alternatively, you can add one more nested np.where to separate medium from NaN, but that gets ugly.

IMHO the best way to go is pd.cut. It handles NaNs and is easy to use.
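
A minimal sketch of that behaviour, on made-up ages: a missing input simply stays missing in the pd.cut result, with no extra handling.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
ages = pd.Series([22.0, np.nan, 71.0, 5.0])

# Bins are (0, 20], (20, 60], (60, inf]; NaN belongs to no bin.
age_cat = pd.cut(ages, [0, 20, 60, np.inf], labels=["child", "medium", "old"])
print(age_cat)
```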

Examples:

import numpy as np
import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')

# pd.cut
df['age_cat'] = pd.cut(df.age, [0, 20, 60, np.inf], labels=['child','medium','old'])


# manually add another line for NaNs
df['age_cat2'] = np.where(df.age > 60, 'old', (np.where(df.age < 20, 'child', 'medium')))
df.loc[df.age.isnull(), 'age_cat2'] = np.nan

# multiple nested where
# note: inside np.where the np.nan ends up as the string 'nan', not a real missing value
df['age_cat3'] = np.where(df.age > 60, 'old',
                         (np.where(df.age < 20, 'child',
                                   np.where(df.age.isnull(), np.nan, 'medium'))))

# outputs
print(df[['age','age_cat','age_cat2','age_cat3']].head(7))
    age age_cat age_cat2 age_cat3
0  22.0  medium   medium   medium
1  38.0  medium   medium   medium
2  26.0  medium   medium   medium
3  35.0  medium   medium   medium
4  35.0  medium   medium   medium
5   NaN     NaN      NaN      nan
6  54.0  medium   medium   medium