Vectorize conditional assignment in a pandas DataFrame

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/28896769/

Date: 2020-08-19 03:52:21, source: igfitidea

Tags: python, numpy, pandas, vectorization

Asked by azuric

If I have a dataframe df with column x and want to create column y based on the values of x, following this pseudocode:

 if df['x'] <-2 then df['y'] = 1 
 else if df['x'] > 2 then df['y']= -1 
 else df['y'] = 0

How would I achieve this? I assume np.where is the best way to do this, but I am not sure how to code it correctly.

Accepted answer by EdChum

One simple method would be to assign the default value first and then perform two loc calls:

In [66]:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
Out[66]:
   x
0  0
1 -3
2  5
3 -1
4  1

In [69]:

df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
df
Out[69]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0
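
As a self-contained sketch of the two-step loc approach above, the snippet below wraps it in a function and adds the boundary values -2 and 2 to the sample. The helper name `assign_sign_loc` and the extra test values are illustrative and not part of the original answer. Because both comparisons are strict, rows where x equals -2 or 2 keep the default 0.

```python
import pandas as pd

def assign_sign_loc(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with column y derived from x."""
    out = df.copy()
    out['y'] = 0                       # default for -2 <= x <= 2
    out.loc[out['x'] < -2, 'y'] = 1    # strictly below -2
    out.loc[out['x'] > 2, 'y'] = -1    # strictly above 2
    return out

# Original sample values plus the two boundary values -2 and 2
result = assign_sign_loc(pd.DataFrame({'x': [0, -3, 5, -1, 1, -2, 2]}))
print(result)
```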

If you wanted to use np.where, you could do it with a nested np.where:

In [77]:

df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
df
Out[77]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0

Here the outer np.where tests the first condition: where x is less than -2, it returns 1. Everywhere else it falls through to the inner np.where, which returns -1 where x is greater than 2 and 0 otherwise.
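
When there are more than two branches, nesting can get hard to read. One alternative that the original answer does not mention is np.select, which takes a list of conditions and a matching list of values. The following is a minimal sketch on the same sample data; the variable names `conditions` and `choices` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [0, -3, 5, -1, 1]})

# Conditions are checked in order; the first one that is True wins.
conditions = [df['x'] < -2, df['x'] > 2]
choices = [1, -1]

# Rows matching no condition receive the default value.
df['y'] = np.select(conditions, choices, default=0)
print(df)
```

Adding another branch then only means appending one entry to each list rather than adding another level of nesting.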

Timings

In [79]:

%timeit df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))

1000 loops, best of 3: 1.79 ms per loop

In [81]:

%%timeit
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1

100 loops, best of 3: 3.27 ms per loop

So for this sample dataset the np.where method is roughly twice as fast (1.79 ms versus 3.27 ms per loop).
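
The %timeit magic only works inside IPython. To repeat the comparison in a plain script, something like the sketch below could be used. The frame size, the random data, the function names, and the number of runs are all assumptions made for illustration, and the measured times will vary by machine and library version, so no figures are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 100,000 random integers between -5 and 5
rng = np.random.default_rng(0)
big = pd.DataFrame({'x': rng.integers(-5, 6, size=100_000)})

def with_where(df):
    """Nested np.where, as in the answer above."""
    return np.where(df['x'] < -2, 1, np.where(df['x'] > 2, -1, 0))

def with_loc(df):
    """Default value plus two loc assignments (on a copy, which adds some overhead)."""
    out = df.copy()
    out['y'] = 0
    out.loc[out['x'] < -2, 'y'] = 1
    out.loc[out['x'] > 2, 'y'] = -1
    return out['y'].to_numpy()

# Check that both approaches agree before timing them
same = np.array_equal(with_where(big), with_loc(big))
t_where = timeit.timeit(lambda: with_where(big), number=20)
t_loc = timeit.timeit(lambda: with_loc(big), number=20)

print(f"same result: {same}")
print(f"np.where: {t_where:.4f} s, loc: {t_loc:.4f} s (20 runs each)")
```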

Answer by Erfan

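This is a good use case for pd.cut, where you define ranges and assign labels based on those ranges. Note that with right=False the bins are [-inf, -2), [-2, 2) and [2, inf), so a value of exactly 2 is labeled -1 rather than 0 as in the question's pseudocode, and the resulting column has a categorical dtype: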

df['y'] = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)

Output

   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0
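
To see how pd.cut with right=False treats the bin edges, the following sketch adds -2 and 2 to the sample and compares the result with the nested np.where from the accepted answer. The column names `y_cut` and `y_where` and the extra values are illustrative and not from the original answers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [0, -3, 5, -1, 1, -2, 2]})

# right=False makes each bin closed on the left: [-inf, -2), [-2, 2), [2, inf)
cut = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)

df['y_cut'] = cut.astype(int)   # convert the categorical labels to plain integers
df['y_where'] = np.where(df['x'] < -2, 1, np.where(df['x'] > 2, -1, 0))
print(df)
```

The two columns agree on every row except x == 2, where pd.cut gives -1 and np.where gives 0. If that edge case matters for your data, the np.where or loc versions follow the question's strict comparisons more directly.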