New column based on conditional selection from the values of 2 other columns in a Pandas DataFrame
Original question: http://stackoverflow.com/questions/17774271/
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on Stack Overflow.
Asked by Uninvited Guest
I've got a DataFrame which contains stock values.
It looks like this:
>>> Data
             Open   High    Low  Close   Volume  Adj Close
Date
2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04
When I try to make a conditional new column with the following if statement:
Data['Test'] =Data['Close'] if Data['Close'] > Data['Open'] else Data['Open']
I get the following error:
Traceback (most recent call last):
  File "<pyshell#116>", line 1, in <module>
    Data[1]['Test'] =Data[1]['Close'] if Data[1]['Close'] > Data[1]['Open'] else Data[1]['Open']
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I then used a.all():
Data[1]['Test'] =Data[1]['Close'] if all(Data[1]['Close'] > Data[1]['Open']) else Data[1]['Open']
The result was that the entire ['Open'] column was selected. That is not the condition I wanted, which is to select, for each row, the larger of the ['Open'] and ['Close'] values.
Any help is appreciated.
Thanks.
Accepted answer by DSM
From a DataFrame like:
>>> df
         Date   Open   High    Low  Close   Volume  Adj Close
0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04
1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04
2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04
The simplest thing I can think of would be:
>>> df["Test"] = df[["Open", "Close"]].max(axis=1)
>>> df
         Date   Open   High    Low  Close   Volume  Adj Close   Test
0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04  77.04
1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04  77.04
2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04  93.23
df.loc[:, ["Open", "Close"]].max(axis=1) (written with .ix in older pandas, an indexer that has since been removed) might be a little faster, but I don't think it's as nice to look at.
Alternatively, you could use .apply on the rows:
>>> df["Test"] = df.apply(lambda row: max(row["Open"], row["Close"]), axis=1)
>>> df
         Date   Open   High    Low  Close   Volume  Adj Close   Test
0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04  77.04
1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04  77.04
2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04  93.23
Or fall back to numpy:
>>> df["Test"] = np.maximum(df["Open"], df["Close"])
>>> df
         Date   Open   High    Low  Close   Volume  Adj Close   Test
0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04  77.04
1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04  77.04
2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04  93.23
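For readers who want to run the three approaches above outside an interactive session, here is a minimal self-contained sketch. It rebuilds only the Open and Close columns of the example frame (the other columns are omitted for brevity), and the variable names are illustrative rather than part of the original answer.

```python
import numpy as np
import pandas as pd

# Open/Close values copied from the example frame above; other columns omitted.
df = pd.DataFrame({
    "Open": [76.91, 77.04, 72.87],
    "Close": [77.04, 72.87, 93.23],
})

# 1. Row-wise max over the two selected columns.
via_max = df[["Open", "Close"]].max(axis=1)

# 2. Python's built-in max applied to each row (typically the slowest on large frames).
via_apply = df.apply(lambda row: max(row["Open"], row["Close"]), axis=1)

# 3. Element-wise maximum of the two columns with NumPy.
via_numpy = np.maximum(df["Open"], df["Close"])

# Store one of the (identical) results as the new column.
df["Test"] = via_max
print(df)
```

All three produce the same values here; the choice between them is mostly about readability and speed.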
The basic problem is that if/else doesn't play nicely with arrays, because if (something) always coerces the something into a single bool. It's not equivalent to "for every element in the array something, if the condition holds" or anything like that.
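To make that point concrete, here is a small sketch, assuming a recent pandas and NumPy. It shows that the comparison yields one boolean per row, that using it in a plain if raises ValueError, and that np.where makes the per-element choice that if/else cannot. The values come from the example frame; np.where is just one option alongside the approaches already shown, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Open": [76.91, 77.04, 72.87],
    "Close": [77.04, 72.87, 93.23],
})

# The comparison is element-wise: one boolean per row, not a single bool.
is_up = df["Close"] > df["Open"]
print(is_up.tolist())  # [True, False, True]

# Using that Series where Python expects a single bool raises ValueError.
try:
    if is_up:
        pass
    raised = False
except ValueError:
    raised = True

# .all() collapses the Series to one bool, so a ternary would pick a whole column.
collapsed = bool(is_up.all())

# np.where chooses per element: Close where the condition holds, else Open.
df["Test"] = np.where(is_up, df["Close"], df["Open"])
```

Because collapsed is False for this data, the ternary from the question would have assigned the whole Open column, which matches what the asker observed.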
Answer by Jeff
In [7]: df = DataFrame(randn(10,2),columns=list('AB'))
In [8]: df
Out[8]: 
          A         B
0 -0.954317 -0.485977
1  0.364845 -0.193453
2  0.020029 -1.839100
3  0.778569  0.706864
4  0.033878  0.437513
5  0.362016  0.171303
6  2.880953  0.856434
7 -0.109541  0.624493
8  1.015952  0.395829
9 -0.337494  1.843267
This is a where conditional, saying give me the value for A if A > B, else give me B
# this is equivalent to the following, except that where() returns a new
# Series instead of modifying df in place:
# df.loc[~(df['A'] > df['B']), 'A'] = df['B']
In [9]: df['A'].where(df['A']>df['B'],df['B'])
Out[9]: 
0   -0.485977
1    0.364845
2    0.020029
3    0.778569
4    0.437513
5    0.362016
6    2.880953
7    0.624493
8    1.015952
9    1.843267
dtype: float64
In this case max is equivalent
In [10]: df.max(1)
Out[10]: 
0   -0.485977
1    0.364845
2    0.020029
3    0.778569
4    0.437513
5    0.362016
6    2.880953
7    0.624493
8    1.015952
9    1.843267
dtype: float64
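A reproducible version of the comparison above, as a sketch: the seed and the column name C are arbitrary illustrative choices, not part of the original answer. It stores the where result in a new column and checks that it matches the row-wise max. The two agree only when there are no missing values, since max skips NaN by default while a comparison against NaN is False.

```python
import numpy as np
import pandas as pd

# Seeded random data so the run is repeatable (the seed is arbitrary).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=list("AB"))

# Keep A where A > B, otherwise take B; where() returns a new Series.
larger = df["A"].where(df["A"] > df["B"], df["B"])

# Row-wise max over A and B, computed before the new column is added.
row_max = df[["A", "B"]].max(axis=1)

# Store the result as a new column and compare the two approaches.
df["C"] = larger
print((larger == row_max).all())
```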
Answer by Sajjan Singh
The issue is that you're asking Python to evaluate a condition (Data['Close'] > Data['Open']) which contains more than one boolean value. You do not want to use any or all either, since that will set Data['Test'] to either the whole of Data['Open'] or the whole of Data['Close'].
There might be a cleaner method, but one approach is to use a mask (boolean array):
mask = Data['Close'] > Data['Open']
Data['Test'] = pandas.concat([Data['Close'][mask].dropna(), Data['Open'][~mask].dropna()]).reindex_like(Data)
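The same mask can also drive the selection directly, without concatenating and reindexing. The sketch below assumes a small stand-in frame named Data with Open and Close columns (values taken from the example earlier on this page) and uses Series.where. It is one possible simplification, not the method the answer itself proposes.

```python
import pandas as pd

# Hypothetical stand-in for the question's Data frame.
Data = pd.DataFrame({
    "Open": [76.91, 77.04, 72.87],
    "Close": [77.04, 72.87, 93.23],
})

# True on rows where Close is the larger value.
mask = Data["Close"] > Data["Open"]

# Same idea as the concat-based version: take each part, then restore row order.
via_concat = pd.concat([Data["Close"][mask], Data["Open"][~mask]]).reindex(Data.index)

# Keep Close where mask is True, otherwise take Open.
Data["Test"] = Data["Close"].where(mask, Data["Open"])
```

Both routes give the same column; the where form avoids building and reordering an intermediate Series.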

