Pandas drop function: Unalignable boolean Series
Original question: http://stackoverflow.com/questions/18149316/
This page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors and the original URL: Stack Overflow
Asked by fred
I have two DataFrames. The first df0:
Name       CHR  MAPINFO     PMG         APA 
cg13869341  1   15865   0.8954256   0.8409144
cg14008030  1   18827   0.5941512   0.712414
cg12045430  1   29407   0.1110794   0.1302404
cg20826792  1   29425   0.177532    0.1304049
cg00381604  1   29435   0.09003246  0.04180672
cg20253340  1   68849   0.4738799   0.444899
and the second, df1:
probe   Chromosome  Gstart  Gend
A_23_P11744     1   4363    39806
A_33_P3365932   1   4363    39806
A_32_P923011    1   24554   46081
I would like to iterate over df0["MAPINFO"], drop the rows that don't match a condition, and append the means to another DataFrame. My code is as follows:
for pos in df0['MAPINFO']:
    cond = (( pos < df1['Gstart']) & ( pos > df1['Gend']))
    print df0.drop(df0[cond].index.values).mean(axis=0, skipna=True, level=None)
which gives the following error message:
/usr/lib64/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/frame.py:2021: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
"DataFrame index.", UserWarning)
Traceback (most recent call last):
 File "/home/ferreirafm/bin/cpg_means.py", line 239, in <module>
main()
File "/home/ferreirafm/bin/cpg_means.py", line 231, in main
import2df(infprobe, infchrom)
File "/home/ferreirafm/bin/cpg_means.py", line 20, in import2df
df0.drop(df0[cond].index.values)#.mean(axis=0, skipna=True, level=None)
File "/usr/lib64/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1995, in __getitem__
return self._getitem_array(key)
File "/usr/lib64/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 2027, in _getitem_array
key = _check_bool_indexer(self.index, key)
File "/usr/lib64/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 1017, in _check_bool_indexer
raise IndexingError('Unalignable boolean Series key provided')
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
I'm almost sure that this piece of code used to work in a previous version of pandas. However, I can't figure out what's going wrong. Any help is appreciated.
Expected results: the last row of df0 should be dropped, because its MAPINFO (68849) falls outside every Gstart to Gend range in df1, whereas the MAPINFO of the first row (15865), for example, falls inside one. The result should then be the column means of the non-dropped rows of df0 (the means of PMG and APA). That is, the rows kept from df0 will be:
Name       CHR  MAPINFO     PMG         APA 
cg13869341  1   15865   0.8954256   0.8409144
cg14008030  1   18827   0.5941512   0.712414
cg12045430  1   29407   0.1110794   0.1302404
cg20826792  1   29425   0.177532    0.1304049
cg00381604  1   29435   0.09003246  0.04180672
The last row of df0, "cg20253340 1 68849 0.4738799 0.444899", is removed and the means by row are taken.
Accepted answer by lowtech
My solution would be to build a boolean index that implements the inclusion criteria, and then just use it:
import pandas as pd
df0 = pd.DataFrame.from_records([["cg13869341", 1, 15865, 0.8954256, 0.8409144],
                                 ["cg14008030", 1, 18827, 0.5941512, 0.712414],
                                 ["cg12045430", 1, 29407, 0.1110794, 0.1302404],
                                 ["cg20826792", 1, 29425, 0.177532, 0.1304049],
                                 ["cg00381604", 1, 29435, 0.09003246, 0.04180672],
                                 ["cg20253340", 1, 68849, 0.4738799, 0.444899]],
                                columns = ["Name", "CHR", "MAPINFO", "PMG", "APA"])
df1 = pd.DataFrame.from_records([["A_23_P11744", 1, 4363, 39806],
                                 ["A_33_P3365932", 1, 4363, 39806],
                                 ["A_32_P923011", 1, 24554, 46081]],
                                columns = ["probe", "Chromosome", "Gstart", "Gend"])
F = df0.MAPINFO.apply(lambda x: ((df1.Gstart <= x) & (x <= df1.Gend)).any())
print(df0[F])  # the rows you expected
# mean by rows
res = df0[F].copy()  # copy so that adding a column does not modify a view of df0
res['mean'] = res[['PMG', 'APA']].mean(axis=1)
print(res)
# mean by columns
print(df0[F][['PMG', 'APA']].mean(axis=0))
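A likely reason the question's loop fails is that its mask is built from df1's columns, so it carries df1's index (3 rows) and cannot be aligned with df0 (6 rows). Its condition also asks for a position below Gstart and above Gend at the same time, which cannot hold when Gstart < Gend. The sketch below illustrates both points and then builds the same inclusion mask as the answer with NumPy broadcasting instead of apply. It assumes a reasonably recent pandas with NumPy installed; the names pos_col, inside and keep are illustrative and not part of the original code.

```python
import pandas as pd

df0 = pd.DataFrame({
    "Name": ["cg13869341", "cg14008030", "cg12045430",
             "cg20826792", "cg00381604", "cg20253340"],
    "MAPINFO": [15865, 18827, 29407, 29425, 29435, 68849],
    "PMG": [0.8954256, 0.5941512, 0.1110794, 0.177532, 0.09003246, 0.4738799],
    "APA": [0.8409144, 0.712414, 0.1302404, 0.1304049, 0.04180672, 0.444899],
})
df1 = pd.DataFrame({
    "probe": ["A_23_P11744", "A_33_P3365932", "A_32_P923011"],
    "Gstart": [4363, 4363, 24554],
    "Gend": [39806, 39806, 46081],
})

# The question's mask: one scalar compared against df1's columns,
# so the result is indexed like df1, not like df0.
pos = df0["MAPINFO"].iloc[0]
cond = (pos < df1["Gstart"]) & (pos > df1["Gend"])
print(cond.index.equals(df0.index))  # False: 3 labels vs 6 labels
print(cond.any())                    # False: pos cannot be < Gstart and > Gend

# Broadcasting sketch: compare every MAPINFO (as a column vector)
# against every interval in df1 at once.
pos_col = df0["MAPINFO"].to_numpy()[:, None]            # shape (6, 1)
inside = ((df1["Gstart"].to_numpy() <= pos_col)
          & (pos_col <= df1["Gend"].to_numpy()))        # shape (6, 3)

# A row is kept if it lies inside at least one interval; the mask is
# given df0's index so that it aligns with df0.
keep = pd.Series(inside.any(axis=1), index=df0.index)

print(df0[keep])
print(df0[keep][["PMG", "APA"]].mean())  # column means of the kept rows
```

Because keep shares df0's index, df0[keep] selects rows without any reindexing. For large inputs the intermediate (len(df0), len(df1)) array may become memory-heavy, in which case the apply-based mask above is the more economical choice.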

