Vectorize conditional assignment in pandas dataframe
Original question: http://stackoverflow.com/questions/28896769/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by azuric
If I have a dataframe df with a column x and want to create a column y based on the values of x, following this pseudocode:
if df['x'] < -2 then df['y'] = 1
else if df['x'] > 2 then df['y'] = -1
else df['y'] = 0
How would I achieve this? I assume np.where is the best way to do this, but I am not sure how to code it correctly.
Accepted answer by EdChum
One simple method is to assign the default value first and then make two loc calls, one per condition:
In [66]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [0, -3, 5, -1, 1]})
df
Out[66]:
   x
0  0
1 -3
2  5
3 -1
4  1
In [69]:
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
df
Out[69]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0
If you want to use np.where, you can do it with a nested np.where:
In [77]:
df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
df
Out[77]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0
Here the outer np.where tests the first condition: where x is less than -2, it returns 1. Everywhere else it falls through to the inner np.where, which tests the second condition: where x is greater than 2 it returns -1, and otherwise it returns 0.
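Nesting becomes hard to read once there are more than two conditions. As a minimal sketch of an alternative that is not part of the original answer, np.select takes a list of conditions and a matching list of values. The names conditions and choices below are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [0, -3, 5, -1, 1]})

# Conditions are checked in order; the first one that matches wins.
conditions = [df['x'] < -2, df['x'] > 2]
choices = [1, -1]

# Rows that match no condition receive the default value 0.
df['y'] = np.select(conditions, choices, default=0)
print(df)
```

For this sample frame the resulting y column matches the nested np.where result shown above.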
Timings
In [79]:
%timeit df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
1000 loops, best of 3: 1.79 ms per loop
In [81]:
%%timeit
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
100 loops, best of 3: 3.27 ms per loop
So for this sample dataset the np.where method is roughly twice as fast (1.79 ms versus 3.27 ms per loop).
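The numbers above come from a very small frame and an interactive session, so they may not carry over to other data or library versions. The following is a rough sketch for repeating the comparison outside IPython with the standard timeit module. The frame big, its size, its value range, and the helper names are assumptions made for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame; the size and value range are arbitrary choices.
rng = np.random.default_rng(0)
big = pd.DataFrame({'x': rng.integers(-5, 6, size=100_000)})

def assign_with_where(frame):
    # Nested np.where, as in the answer above.
    frame['y'] = np.where(frame['x'] < -2, 1, np.where(frame['x'] > 2, -1, 0))

def assign_with_loc(frame):
    # Default value first, then one loc call per condition.
    frame['y'] = 0
    frame.loc[frame['x'] < -2, 'y'] = 1
    frame.loc[frame['x'] > 2, 'y'] = -1

# Check that both approaches agree before comparing their speed.
assign_with_where(big)
expected = big['y'].to_numpy().copy()
assign_with_loc(big)
assert (big['y'].to_numpy() == expected).all()

print('np.where:', timeit.timeit(lambda: assign_with_where(big), number=20))
print('loc     :', timeit.timeit(lambda: assign_with_loc(big), number=20))
```

The printed values are total seconds for 20 runs and will vary by machine, so no expected figures are given here.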
Answer by Erfan
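This is a good use case for pd.cut, where you define ranges and assign labels based on those ranges. Note that with right=False the bins are closed on the left, so a value of exactly 2 falls in the last bin and is labelled -1, whereas the pseudocode in the question would give 0: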
df['y'] = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)
Output
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0
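The sample data contains neither -2 nor 2, so the output above does not show how the bin edges are treated. The following is a small check, assuming a hypothetical frame named edges that includes both boundary values, comparing pd.cut against the strict comparisons from the question. It also casts the result to int, since pd.cut returns a categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame that includes the boundary values -2 and 2.
edges = pd.DataFrame({'x': [-3, -2, 0, 2, 3]})

# right=False closes each bin on the left: [-inf, -2), [-2, 2), [2, inf).
edges['y_cut'] = pd.cut(edges['x'], [-np.inf, -2, 2, np.inf],
                        labels=[1, 0, -1], right=False)

# Reference result using the strict comparisons from the question.
edges['y_where'] = np.where(edges['x'] < -2, 1,
                            np.where(edges['x'] > 2, -1, 0))

# pd.cut returns a categorical column; cast it to int before doing arithmetic.
edges['y_cut'] = edges['y_cut'].astype(int)
print(edges)
```

The two columns agree everywhere except at x == 2, where y_cut is -1 and y_where is 0. If the data can contain that boundary value, the np.where form follows the question's conditions exactly.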