pandas: How to create a new column in a df based on multiple conditions?
Original question: http://stackoverflow.com/questions/31413286/
Notice: this content comes from StackOverFlow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by c1ez
I have a df with 3 columns, v1, v2, and v3, where
v1=[a,b,c,a]
v2=[d,d,f,n]
v3=[a,k,i,j]
What I'd like to do is create new columns based on conditions on columns v1 to v3.
I can do a single condition:
df['v1_a']=np.where(df['v1']=='a',1,0)
This gives a new column named 'v1_a' containing 1/0.
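For reference, a minimal runnable sketch of the setup and the single-condition case. It assumes the column values are strings (the question writes them unquoted) and that pandas and NumPy are imported as pd and np.

```python
import numpy as np
import pandas as pd

# Sample frame from the question; values are assumed to be strings.
df = pd.DataFrame({
    "v1": ["a", "b", "c", "a"],
    "v2": ["d", "d", "f", "n"],
    "v3": ["a", "k", "i", "j"],
})

# 1 where v1 equals 'a', otherwise 0.
df["v1_a"] = np.where(df["v1"] == "a", 1, 0)
print(df["v1_a"].tolist())  # [1, 0, 0, 1]
```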
However, if I want to create a new column based on multiple conditions, this does not work:
df['v2_flag']=np.where(df['v2']=='f' or df['v2']=='h',1,0)
How can I accomplish this?
Answered by Dan
In Python, the keywords and and or can only give a single result, and modules cannot override them for other purposes, such as the large row-by-row comparison you're trying to do.
You need to use the symbolic & (and) and | (or), which are normally used for bitwise operations. pandas has repurposed them to compare row by row, which actually makes sense as an analogue of a bitwise operation. That is more of a happy coincidence, though: these operators were chosen mainly because modules can override them.
Because & and | have higher precedence than ==, you need parentheses around each term; otherwise Python would evaluate the | before the ==, which isn't what you want. You can use something like this:
df['v2_flag']=np.where((df['v2']=='f')|(df['v2']=='h'),1,0)
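A short sketch of the precedence point, first with plain integers and then on the question's data. The sample frame and the variable names (either, both) are assumptions made for illustration, with the column values taken to be strings.

```python
import numpy as np
import pandas as pd

# Plain-Python illustration: | binds tighter than ==.
unparenthesized = 2 == 2 | 1          # parsed as 2 == (2 | 1), i.e. 2 == 3
parenthesized = (2 == 2) | (1 == 0)   # True | False
print(unparenthesized, parenthesized)  # False True

df = pd.DataFrame({
    "v1": ["a", "b", "c", "a"],
    "v2": ["d", "d", "f", "n"],
})

# Each comparison is parenthesized so it is evaluated before | or &.
either = np.where((df["v2"] == "f") | (df["v2"] == "h"), 1, 0)
both = np.where((df["v1"] == "a") & (df["v2"] == "d"), 1, 0)
print(either.tolist())  # [0, 0, 1, 0]
print(both.tolist())    # [1, 0, 0, 0]
```

The same pattern works with & when every condition must hold, as in the second np.where call.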
Answered by Kasramvd
If you combine multiple conditions with or, you'll get the following ValueError, because or cannot operate element-wise on the arrays passed to np.where():
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
So in your case I suggest using np.logical_or:
df['v2_flag']=np.where(np.logical_or(df['v2']=='f',df['v2']=='h'),1,0)
See the following example too:
>>> a=np.array([2,2,2,5,7,8,1,4,2,3,4,5,6])
>>> np.where(np.logical_or(a==5,a==2),a,0)
array([2, 2, 2, 5, 0, 0, 0, 0, 2, 0, 0, 5, 0])
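The same call can be tried on the question's column. np.logical_or takes exactly two input arrays, so for three or more conditions one option (an addition here, not part of the original answer) is np.logical_or.reduce over a list of conditions. The sample data and the extra 'n' condition are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"v2": ["d", "d", "f", "n"]})

# Two conditions: np.logical_or combines them element-wise.
two = np.where(np.logical_or(df["v2"] == "f", df["v2"] == "h"), 1, 0)
print(two.tolist())  # [0, 0, 1, 0]

# Three conditions: reduce applies logical_or across the whole list.
conditions = [df["v2"] == "f", df["v2"] == "h", df["v2"] == "n"]
three = np.where(np.logical_or.reduce(conditions), 1, 0)
print(three.tolist())  # [0, 0, 1, 1]
```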
Answered by unutbu
df['v2']=='f' or df['v2']=='h' raises the ValueError before it gets to np.where. The or causes Python to evaluate df['v2']=='f' and df['v2']=='h' in a boolean context. But pandas Series, like NumPy arrays, refuse to be reduced to a single boolean value; they raise a ValueError instead.
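A small sketch that reproduces the failure without calling np.where at all, which shows the error comes from or itself. The sample frame is assumed from the question, and the exact wording of the message may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"v2": ["d", "d", "f", "n"]})

message = None
try:
    # `or` asks the left-hand Series for a single truth value,
    # which pandas refuses to provide.
    flag = (df["v2"] == "f") or (df["v2"] == "h")
except ValueError as exc:
    message = str(exc)
    print("ValueError:", message)
```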
To fix your code, you could use
df['v2_flag'] = np.where( (df['v2']=='f') | (df['v2']=='h'), 1, 0)
The | performs an element-wise bitwise-or over the two boolean-valued Series.
Other ways to define df['v2_flag'] include
df['v2_flag'] = ((df['v2']=='f') | (df['v2']=='h')).astype(int)
or
df['v2_flag'] = df['v2'].isin(['f', 'h']).astype(int)
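As a quick check, the three variants can be compared side by side. This is a sketch on the question's sample data, assumed to hold strings; the names via_where, via_astype, and via_isin are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"v2": ["d", "d", "f", "n"]})

# Boolean mask shared by the first two variants.
mask = (df["v2"] == "f") | (df["v2"] == "h")

via_where = np.where(mask, 1, 0)                  # NumPy array of 1/0
via_astype = mask.astype(int)                     # Series of 1/0
via_isin = df["v2"].isin(["f", "h"]).astype(int)  # Series of 1/0

print(via_where.tolist())   # [0, 0, 1, 0]
print(via_astype.tolist())  # [0, 0, 1, 0]
print(via_isin.tolist())    # [0, 0, 1, 0]
```

The isin form tends to stay the most readable as the list of accepted values grows.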

