Boolean mask in a pandas Panel
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/14650341/
Asked by granders19
I am having some trouble masking a Panel in the same way that I would a DataFrame. What I want to do feels simple, but I have not found a way by looking at the docs and online forums. I have a simple example below:
import pandas
import numpy as np
import datetime
start_date = datetime.datetime(2009,3,1,6,29,59)
r = pandas.date_range(start_date, periods=12)
cols_1 = ['AAPL', 'AAPL', 'GOOG', 'GOOG', 'GS', 'GS']
cols_2 = ['close', 'rate', 'close', 'rate', 'close', 'rate']
dat = np.random.randn(12, 6)
dftst = pandas.DataFrame(dat, columns=pandas.MultiIndex.from_arrays([cols_1, cols_2], names=['ticker','field']), index=r)
pn = dftst.T.to_panel().transpose(2,0,1)
print(pn)
Out[14]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 12 (major_axis) x 3 (minor_axis)
Items axis: close to rate
Major_axis axis: 2009-03-01 06:29:59 to 2009-03-12 06:29:59
Minor_axis axis: AAPL to GS
I now have a Panel object. If I take a slice along the items axis, I get a DataFrame:
close_p = pn['close']
print(close_p)
Out[16]:
ticker AAPL GOOG GS
2009-03-01 06:29:59 -0.082203 -0.286354 1.227193
2009-03-02 06:29:59 0.340005 -0.688933 -1.505137
2009-03-03 06:29:59 -0.525567 0.321858 -0.035047
2009-03-04 06:29:59 -0.123549 -0.841781 -0.616523
2009-03-05 06:29:59 -0.407504 0.188372 1.311262
2009-03-06 06:29:59 0.272883 0.817179 0.584664
2009-03-07 06:29:59 -1.767227 1.168876 0.443096
2009-03-08 06:29:59 -0.685501 -0.534373 -0.063906
2009-03-09 06:29:59 0.851820 0.068740 0.566537
2009-03-10 06:29:59 0.390678 -0.012422 -0.152375
2009-03-11 06:29:59 -0.985585 -0.917705 -0.585091
2009-03-12 06:29:59 0.067498 -0.764343 0.497270
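For readers on a current pandas release: Panel was deprecated in pandas 0.20 and removed in 0.25, so the setup above only runs on older versions. Below is a minimal sketch of one possible substitute, assuming the data stays in a DataFrame with (ticker, field) MultiIndex columns. The names `df`, `close` and `rate` are illustrative and do not come from the original post; `xs` is used here to play the role of slicing along the items axis.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the Panel: one wide frame with (ticker, field) columns
rng = np.random.default_rng(0)
r = pd.date_range("2009-03-01 06:29:59", periods=12)
cols_1 = ["AAPL", "AAPL", "GOOG", "GOOG", "GS", "GS"]
cols_2 = ["close", "rate", "close", "rate", "close", "rate"]
columns = pd.MultiIndex.from_arrays([cols_1, cols_2], names=["ticker", "field"])
df = pd.DataFrame(rng.standard_normal((12, 6)), index=r, columns=columns)

# Rough equivalent of pn['close'] and pn['rate']: a 12 x 3 frame per field,
# with the 'field' level dropped so the columns are just the tickers
close = df.xs("close", axis=1, level="field")
rate = df.xs("rate", axis=1, level="field")
```

Each cross-section has the same dates as its index and one column per ticker, which is the shape the rest of the question works with.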
I can filter this data in two ways:
1) I create a mask and mask the data as follows:
msk = close_p > 0
close_p = close_p.mask(msk)
2) I can slice directly with the boolean condition used to build msk above:
close_p = close_p[close_p > 0]
Out[28]:
ticker AAPL GOOG GS
2009-03-01 06:29:59 NaN NaN 1.227193
2009-03-02 06:29:59 0.340005 NaN NaN
2009-03-03 06:29:59 NaN 0.321858 NaN
2009-03-04 06:29:59 NaN NaN NaN
2009-03-05 06:29:59 NaN 0.188372 1.311262
2009-03-06 06:29:59 0.272883 0.817179 0.584664
2009-03-07 06:29:59 NaN 1.168876 0.443096
2009-03-08 06:29:59 NaN NaN NaN
2009-03-09 06:29:59 0.851820 0.068740 0.566537
2009-03-10 06:29:59 0.390678 NaN NaN
2009-03-11 06:29:59 NaN NaN NaN
2009-03-12 06:29:59 0.067498 NaN 0.497270
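A small sketch may help keep the two ways apart. As far as I can tell, `DataFrame.mask(cond)` blanks the entries where `cond` is True, while boolean indexing and `DataFrame.where(cond)` blank the entries where it is False, so the two ways are complements of each other and the output above matches way 2. The tiny frame below uses made-up numbers, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, no zeros and no NaN)
close_p = pd.DataFrame(
    {"AAPL": [-0.08, 0.34, -0.53], "GOOG": [-0.29, -0.69, 0.32]}
)
msk = close_p > 0

kept_positive = close_p.where(msk)     # NaN where msk is False
kept_by_indexing = close_p[msk]        # same result as where(msk)
dropped_positive = close_p.mask(msk)   # NaN where msk is True

print(kept_positive)
print(dropped_positive)
```

If the goal is to keep only the positive values, `where(msk)` or `close_p[msk]` is the form to reach for; `mask(msk)` removes them instead.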
What I cannot figure out is how to filter all of my data based on a mask without a for loop. I can do the following:
msk = (pn['rate'] > 0) & (pn['close'] > 0)
def mask_panel(pan, msk):
    # apply the same 2-D mask to every item (a DataFrame) of the panel, in place
    for item in pan.items:
        pan[item] = pan[item].mask(msk)
    return pan
print(pn['close'])
Out[32]:
ticker AAPL GOOG GS
2009-03-01 06:29:59 -0.082203 -0.286354 1.227193
2009-03-02 06:29:59 0.340005 -0.688933 -1.505137
2009-03-03 06:29:59 -0.525567 0.321858 -0.035047
2009-03-04 06:29:59 -0.123549 -0.841781 -0.616523
2009-03-05 06:29:59 -0.407504 0.188372 1.311262
2009-03-06 06:29:59 0.272883 0.817179 0.584664
2009-03-07 06:29:59 -1.767227 1.168876 0.443096
2009-03-08 06:29:59 -0.685501 -0.534373 -0.063906
2009-03-09 06:29:59 0.851820 0.068740 0.566537
2009-03-10 06:29:59 0.390678 -0.012422 -0.152375
2009-03-11 06:29:59 -0.985585 -0.917705 -0.585091
2009-03-12 06:29:59 0.067498 -0.764343 0.497270
mask_panel(pn, msk)
print(pn['close'])
Out[34]:
ticker AAPL GOOG GS
2009-03-01 06:29:59 -0.082203 -0.286354 NaN
2009-03-02 06:29:59 NaN -0.688933 -1.505137
2009-03-03 06:29:59 -0.525567 NaN -0.035047
2009-03-04 06:29:59 -0.123549 -0.841781 -0.616523
2009-03-05 06:29:59 -0.407504 NaN NaN
2009-03-06 06:29:59 NaN NaN NaN
2009-03-07 06:29:59 -1.767227 NaN NaN
2009-03-08 06:29:59 -0.685501 -0.534373 -0.063906
2009-03-09 06:29:59 NaN NaN NaN
2009-03-10 06:29:59 NaN -0.012422 -0.152375
2009-03-11 06:29:59 -0.985585 -0.917705 -0.585091
2009-03-12 06:29:59 NaN -0.764343 NaN
So the above loop does the trick. I know there is a faster, vectorized way of doing this using the underlying ndarray, but I have not put that together yet. It also seems like this should be functionality that is built into the pandas library. If there is a way to do this that I am missing, any suggestions would be much appreciated.
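One loop-free way to get the same effect as mask_panel can be sketched without Panel, under the assumption that the data lives in a DataFrame with (ticker, field) MultiIndex columns. The names `df`, `full_mask` and `masked` are hypothetical. The idea is to repeat the 2-D mask's columns so that they line up with every (ticker, field) column, then mask the whole frame once.

```python
import numpy as np
import pandas as pd

# Assumed layout: one wide frame with (ticker, field) columns
rng = np.random.default_rng(0)
r = pd.date_range("2009-03-01 06:29:59", periods=12)
cols_1 = ["AAPL", "AAPL", "GOOG", "GOOG", "GS", "GS"]
cols_2 = ["close", "rate", "close", "rate", "close", "rate"]
columns = pd.MultiIndex.from_arrays([cols_1, cols_2], names=["ticker", "field"])
df = pd.DataFrame(rng.standard_normal((12, 6)), index=r, columns=columns)

close = df.xs("close", axis=1, level="field")
rate = df.xs("rate", axis=1, level="field")
msk = (rate > 0) & (close > 0)               # 12 x 3, one column per ticker

# Repeat each ticker's mask column once per field, in df's column order
tickers = df.columns.get_level_values("ticker")
full_mask = msk.loc[:, tickers].to_numpy()   # 12 x 6 boolean array

# Same semantics as the loop: blank every field wherever msk is True
masked = df.mask(full_mask)
```

Passing the mask as a plain array sidesteps label alignment between the single-level mask columns and the two-level data columns; the array only has to have the same shape as `df`.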
Answered by Jeff
I think this will work (and it is what Panel.where should do, but it's a bit non-trivial because it has to handle a bunch of cases):
# construct the mask in 2-d (a frame)
In [36]: mask = (pn['close']>0) & (pn['rate']>0)
In [37]: mask
Out[37]:
ticker AAPL GOOG GS
2009-03-01 06:29:59 False False False
2009-03-02 06:29:59 False False True
....
# here's the key, this broadcasts, setting the values which
# don't meet the condition to nan
In [38]: masked_values = np.where(mask,pn.values,np.nan)
# reconstruct the panel (_construct_axes_dict is an internal function that returns
# a dict of the axes, e.g. items -> the items, major_axis -> .....)
In [42]: x = pd.Panel(masked_values,**pn._construct_axes_dict())
Out[42]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 12 (major_axis) x 3 (minor_axis)
Items axis: close to rate
Major_axis axis: 2009-03-01 06:29:59 to 2009-03-12 06:29:59
Minor_axis axis: AAPL to GS
# the values
In [43]: x.values
Out[43]:
array([[[ nan, nan, nan],
[ nan, nan, 0.09575723],
[ nan, nan, nan],
[ nan, nan, nan],
[ nan, 2.07229823, 0.04347515],
[ nan, nan, nan],
[ nan, nan, 2.18342239],
[ nan, nan, 1.73674381],
[ nan, 2.01173087, nan],
[ 0.24109645, 0.94583072, nan],
[ 0.36953467, nan, 0.18044432],
[ 1.74164222, 1.02314752, 1.73736033]],
[[ nan, nan, nan],
[ nan, nan, 0.06960387],
[ nan, nan, nan],
[ nan, nan, nan],
[ nan, 0.63202199, 0.56724391],
[ nan, nan, nan],
[ nan, nan, 0.71964824],
[ nan, nan, 1.03482927],
[ nan, 0.18256148, nan],
[ 1.29451667, 0.49804327, nan],
[ 2.04726538, nan, 0.12883128],
[ 0.70647885, 0.7277734 , 0.77844475]]])
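The broadcasting step is the heart of this answer and can be tried with NumPy alone. The sketch below uses a random array as an assumed stand-in for `pn.values`, with shape (items, major_axis, minor_axis) = (2, 12, 3); the names `values`, `kept` and `blanked` are illustrative. A 2-D mask of shape (12, 3) is broadcast by `np.where` across the leading axis, so every item gets the same mask.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal((2, 12, 3))   # stand-in for pn.values
mask = (values[0] > 0) & (values[1] > 0)   # shape (12, 3)

# Keep entries where the condition holds, NaN elsewhere (what the answer does)
kept = np.where(mask, values, np.nan)      # mask broadcasts to (2, 12, 3)

# The complement, matching the .mask(msk) loop in the question:
# NaN where the condition holds, original values elsewhere
blanked = np.where(mask, np.nan, values)
```

Note that `kept` and `blanked` are complements: the answer keeps the entries that satisfy the condition, while the loop in the question, which calls `.mask(msk)`, removes them. Swapping the second and third arguments of `np.where` switches between the two.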

