pandas: How do I create a new column in a df based on multiple conditions?

Question

Asked by c1ez

I have a df with 3 columns, v1, v2 and v3, where

v1=[a,b,c,a] 
v2=[d,d,f,n] 
v3=[a,k,i,j]

What I would like to do is create new columns based on conditions on columns v1 to v3.

I can handle a single condition:

df['v1_a']=np.where(df['v1']=='a',1,0)

This gives a new column named 'v1_a' containing 1/0.

However, if I want to create a new column based on multiple conditions, this does not work:

df['v2_flag']=np.where(df['v2']=='f' or df['v2']=='h',1,0)

How can I accomplish this?

Answer 1

Answered by Dan

In Python, the and and or keywords evaluate the truth value of their operands and cannot be overloaded by libraries, so they cannot perform the element-wise, row-by-row comparison you are attempting.

You need to use the operators & (and) and | (or), which are normally used for bitwise operations. pandas has repurposed them for element-wise comparison, which makes sense as an analogue of bitwise operations. That is more of a happy coincidence, though: they were chosen mainly because libraries can override them.

Because & and | have higher precedence than ==, you need parentheses around each comparison; otherwise Python would evaluate the | before the ==, which is not what you want. You can use something like this:

df['v2_flag']=np.where((df['v2']=='f')|(df['v2']=='h'),1,0)
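A minimal sketch of the precedence point, assuming a frame built from the values in the question. The integer comparison is only an illustration (it is not from the original answer) of how Python groups the expression when the parentheses are left out.

```python
import numpy as np
import pandas as pd

# Frame built from the values listed in the question
df = pd.DataFrame({
    "v1": ["a", "b", "c", "a"],
    "v2": ["d", "d", "f", "n"],
    "v3": ["a", "k", "i", "j"],
})

# Parenthesised: each == yields a boolean Series, then | combines them element-wise
df["v2_flag"] = np.where((df["v2"] == "f") | (df["v2"] == "h"), 1, 0)
print(df["v2_flag"].tolist())

# Plain integers show the grouping: | binds tighter than ==, so the line below
# is read as the chained comparison 1 == (1 | 2) == 2, i.e. 1 == 3 == 2
unparenthesised = 1 == 1 | 2 == 2
parenthesised = (1 == 1) | (2 == 2)
print(unparenthesised, parenthesised)
```

Only row 2 of v2 holds 'f' and no row holds 'h', so a single row is flagged.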

Answer 2

Answered by Kasramvd

If you combine the conditions with or, you'll get the following ValueError, because the expression passed to np.where() cannot be reduced to a single truth value:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

So in your case I suggest using np.logical_or:

df['v2_flag']=np.where(np.logical_or(df['v2']=='f',df['v2']=='h'),1,0)

See the following example too:

>>> a=np.array([2,2,2,5,7,8,1,4,2,3,4,5,6])
>>> np.where(np.logical_or(a==5,a==2),a,0)
array([2, 2, 2, 5, 0, 0, 0, 0, 2, 0, 0, 5, 0])
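np.logical_or takes exactly two input arrays. If there are more than two conditions, one option (an extension, not part of the original answer) is to reduce a list of conditions with np.logical_or.reduce. A sketch, reusing the array above and adding a third, made-up condition a==8:

```python
import numpy as np

a = np.array([2, 2, 2, 5, 7, 8, 1, 4, 2, 3, 4, 5, 6])

# Three boolean masks; the third one (a == 8) is added here only for illustration
conditions = [a == 5, a == 2, a == 8]

# reduce() ORs the masks together along the first axis
mask = np.logical_or.reduce(conditions)

result = np.where(mask, a, 0)
print(result)
```

The same result could be written with chained | operators; reduce is mainly convenient when the list of conditions is built dynamically.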

Answer 3

Answered by unutbu

df['v2']=='f' or df['v2']=='h' raises the ValueError before it gets to np.where. The or causes Python to evaluate df['v2']=='f' and df['v2']=='h' in a boolean context. But pandas Series, like NumPy arrays, refuse to be reduced to a single boolean value; they raise a ValueError instead.
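A small sketch that reproduces the error without np.where, assuming a Series with the same values as v2 in the question. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["d", "d", "f", "n"])  # same values as v2 in the question

message = None
try:
    # `or` asks for the truth value of the Series (s == "f"), which pandas refuses
    s == "f" or s == "h"
except ValueError as exc:
    message = str(exc)

print(message)
```

np.where is never called here, which shows that the exception comes from the or expression itself.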

To fix your code, you could use

df['v2_flag'] = np.where( (df['v2']=='f') | (df['v2']=='h'), 1, 0)

The | performs an element-wise bitwise OR over the two boolean-valued Series.

Other ways to define df['v2_flag'] include

df['v2_flag'] = ((df['v2']=='f') | (df['v2']=='h')).astype(int)

or

df['v2_flag'] = df['v2'].isin(['f', 'h']).astype(int)
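As a quick check, the three variants above can be run side by side on the v2 values from the question. This is a sketch; the names via_where, via_astype and via_isin are made up for the comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"v2": ["d", "d", "f", "n"]})  # v2 values from the question

# 1. np.where over the combined boolean mask
via_where = np.where((df["v2"] == "f") | (df["v2"] == "h"), 1, 0)

# 2. cast the combined boolean mask directly to int
via_astype = ((df["v2"] == "f") | (df["v2"] == "h")).astype(int)

# 3. membership test against a list of accepted values
via_isin = df["v2"].isin(["f", "h"]).astype(int)

print(via_where.tolist(), via_astype.tolist(), via_isin.tolist())
```

All three produce the same flags. The isin form tends to scale best as the list of accepted values grows, since adding a value does not require another comparison term.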
