pandas: How to create a new column in a df based on multiple conditions?

Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/31413286/

Date: 2020-09-13 23:37:15  Source: igfitidea

How to create new column in a df based on multiple conditions?

python numpy pandas

Asked by c1ez

I have a df with 3 columns, v1, v2, and v3, where

v1=[a,b,c,a] 
v2=[d,d,f,n] 
v3=[a,k,i,j] 

What I'd like to do is create new columns based on conditions on columns v1 to v3.

I can do a single condition:

df['v1_a']=np.where(df['v1']=='a',1,0)

It gives a new column named 'v1_a' with 1/0 values.
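For anyone who wants to reproduce this, here is a minimal sketch of the setup. It assumes the letters in v1 to v3 are plain one-character strings, which the question does not state explicitly.

```python
import numpy as np
import pandas as pd

# Assumption: the letters in the question are plain one-character strings.
df = pd.DataFrame({
    'v1': ['a', 'b', 'c', 'a'],
    'v2': ['d', 'd', 'f', 'n'],
    'v3': ['a', 'k', 'i', 'j'],
})

# Single condition: 1 where v1 equals 'a', otherwise 0.
df['v1_a'] = np.where(df['v1'] == 'a', 1, 0)
print(df)
```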

However, if I want to create a new column based on multiple conditions, this does not work:

df['v2_flag']=np.where(df['v2']=='f' or df['v2']=='h',1,0)

How can I accomplish this?

Answered by Dan

In Python, 'and' and 'or' evaluate the truth value of their operands and return a single operand; they cannot be overloaded by libraries to do something else, such as the element-wise (row-by-row) comparison you're trying to do.

You need to use the symbolic & (and) and | (or), which are normally used for bitwise operations. pandas has repurposed these to perform a row-by-row comparison, which makes sense as an analogue of bitwise comparison. That is more of a happy coincidence, though: they were chosen mainly because, unlike 'and' and 'or', they can be overloaded by libraries.

Because | and & have higher precedence than comparison operators such as ==, you need parentheses around each term; otherwise Python would evaluate the | before the ==, which isn't what you want. You can use something like this:

df['v2_flag']=np.where((df['v2']=='f')|(df['v2']=='h'),1,0)
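To make the precedence point concrete, here is a small sketch. The integer comparison is only an illustration chosen for this example, and the DataFrame reuses the v2 values from the question.

```python
import numpy as np
import pandas as pd

# Plain integers show the precedence rule without involving pandas:
# | binds tighter than ==, so the first line is read as 1 == (1 | 2) == 2.
without_parens = 1 == 1 | 2 == 2      # evaluated as 1 == 3 == 2
with_parens = (1 == 1) | (2 == 2)     # evaluated as True | True
print(without_parens, with_parens)

# Same v2 values as in the question, with each comparison parenthesized.
df = pd.DataFrame({'v2': ['d', 'd', 'f', 'n']})
df['v2_flag'] = np.where((df['v2'] == 'f') | (df['v2'] == 'h'), 1, 0)
print(df['v2_flag'].tolist())
```

Only the row where v2 is 'f' is flagged, since 'h' does not occur in the sample data.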

Answered by Kasramvd

If you combine the conditions with or, you'll get the following ValueError, because np.where() expects a single boolean array as its condition and or cannot combine two arrays element-wise:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

So in your case I suggest using np.logical_or:

df['v2_flag']=np.where(np.logical_or(df['v2']=='f',df['v2']=='h'),1,0)

See the following example too:

>>> a=np.array([2,2,2,5,7,8,1,4,2,3,4,5,6])
>>> np.where(np.logical_or(a==5,a==2),a,0)
array([2, 2, 2, 5, 0, 0, 0, 0, 2, 0, 0, 5, 0])
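If more than two conditions are needed, one possible extension (not part of the original answer) is np.logical_or.reduce, which ORs a list of boolean arrays together. In this sketch the third value 'n' is a hypothetical addition, used only to show a longer list of conditions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v2': ['d', 'd', 'f', 'n']})
v2 = df['v2'].to_numpy()

# Hypothetical third value 'n', added only to show more than two conditions.
conditions = [v2 == 'f', v2 == 'h', v2 == 'n']

# np.logical_or.reduce ORs the arrays in the list together element-wise.
df['v2_flag'] = np.where(np.logical_or.reduce(conditions), 1, 0)
print(df['v2_flag'].tolist())
```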

Answered by unutbu

df['v2']=='f' or df['v2']=='h' raises the ValueError before it gets to np.where. The or causes Python to evaluate df['v2']=='f' and df['v2']=='h' in a boolean context. But pandas Series, like NumPy arrays, refuse to be reduced to a single boolean value; they raise a ValueError instead.
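The failure can be reproduced without np.where at all. This is a minimal sketch using the v2 values from the question; the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'v2': ['d', 'd', 'f', 'n']})

raised = False
try:
    # `or` calls bool() on the left-hand Series, which pandas refuses to do.
    flag = df['v2'] == 'f' or df['v2'] == 'h'
except ValueError as exc:
    raised = True
    print('ValueError:', exc)
```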

To fix your code, you could use

df['v2_flag'] = np.where( (df['v2']=='f') | (df['v2']=='h'), 1, 0)

The | operator performs an element-wise bitwise-or over the two boolean-valued Series.

Other ways to define df['v2_flag'] include

df['v2_flag'] = ((df['v2']=='f') | (df['v2']=='h')).astype(int)

or

df['v2_flag'] = df['v2'].isin(['f', 'h']).astype(int)
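As a quick check, the three variants can be compared side by side. This sketch assumes the v2 values from the question; the names via_where, via_or, and via_isin are made up for the comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v2': ['d', 'd', 'f', 'n']})

# The three approaches from this answer, computed independently.
via_where = np.where((df['v2'] == 'f') | (df['v2'] == 'h'), 1, 0)
via_or = ((df['v2'] == 'f') | (df['v2'] == 'h')).astype(int)
via_isin = df['v2'].isin(['f', 'h']).astype(int)

print(via_where.tolist(), via_or.tolist(), via_isin.tolist())
```

All three should produce the same flags. The isin form is probably the easiest to extend when the list of accepted values grows.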