pandas 根据列值选择行

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/45642845/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 04:14:01  来源:igfitidea点击:

Selecting rows based on a column value

pandasslice

提问by Manoj Agrawal

I have a data frame something like this

我有一个像这样的数据框

data = {'ID': [1,2,3,4,5,6,7,8,9],
       'Doc':['Order','Order','Inv','Order','Order','Shp','Order', 'Order','Inv'],
       'Rep':[101,101,101,102,102,102,103,103,103]}
frame = pd.DataFrame(data)


    Doc     ID  Rep
0   Order   1   101
1   Order   2   101
2   Inv     3   101
3   Order   4   102
4   Order   5   102
5   Shp     6   102
6   Order   7   103
7   Order   8   103
8   Inv     9   103

Now I want to select rows for Rep that have Doc type as Inv only.

现在我想为 Rep 选择仅 Doc 类型为 Inv 的行。

I want a dataframe as

我想要一个数据框作为

    Doc     ID  Rep
0   Order   1   101
1   Order   2   101
2   Inv     3   101
6   Order   7   103
7   Order   8   103
8   Inv     9   103

All reps will have Doc type Orders so I was trying to do something like this

所有代表都会有 Doc 类型的订单,所以我试图做这样的事情

frame[frame.Rep == frame.Rep[frame.Doc == 'Inv']] 

but I get an error

但我收到一个错误

ValueError: Can only compare identically-labeled Series objects

ValueError:只能比较标记相同的系列对象

回答by jezrael

You can use twice boolean indexing- first get all Repby condition and then all rows by isin:

您可以使用两次boolean indexing- 首先Rep按条件获取所有行,然后按isin以下方式获取所有行:

a = frame.loc[frame['Doc'] == 'Inv', 'Rep']
print (a)
2    101
8    103
Name: Rep, dtype: int64

df = frame[frame['Rep'].isin(a)]
print (df)
     Doc  ID  Rep
0  Order   1  101
1  Order   2  101
2    Inv   3  101
6  Order   7  103
7  Order   8  103
8    Inv   9  103

Solution with query:

解决方案query

a = frame.query("Doc == 'Inv'")['Rep']
df = frame.query("Rep in @a")
print (df)
     Doc  ID  Rep
0  Order   1  101
1  Order   2  101
2    Inv   3  101
6  Order   7  103
7  Order   8  103
8    Inv   9  103

Timings:

时间

np.random.seed(123)
N = 1000000
L = ['Order','Shp','Inv']
frame = pd.DataFrame({'Doc': np.random.choice(L, N,  p=[0.49, 0.5, 0.01]),
                     'ID':np.arange(1,N+1),
                     'Rep':np.random.randint(1000, size=N)})
print (frame.head())

     Doc  ID  Rep
0    Shp   1   95
1  Order   2  147
2  Order   3  282
3    Shp   4   82
4    Shp   5  746

In [204]: %timeit (frame.groupby('Rep').filter(lambda x: 'Inv' in x['Doc'].values))
1 loop, best of 3: 250 ms per loop

In [205]: %timeit (frame[frame['Rep'].isin(frame.loc[frame['Doc'] == 'Inv', 'Rep'])])
100 loops, best of 3: 17.3 ms per loop

In [206]: %%timeit
     ...: a = frame.query("Doc == 'Inv'")['Rep']
     ...: frame.query("Rep in @a")
     ...: 
100 loops, best of 3: 14.5 ms per loop

EDIT:

编辑:

Thank you John Galtfor nice suggestion:

谢谢John Galt的好建议:

df = frame.query("Rep in %s" % frame.query("Doc == 'Inv'")['Rep'].tolist()) 
print (df)
     Doc  ID  Rep
0  Order   1  101
1  Order   2  101
2    Inv   3  101
6  Order   7  103
7  Order   8  103
8    Inv   9  103

回答by Dagdoba

import pandas as pd


frame_Filtered=frame[frame['Doc'].str.contains('Inv|Order')]

print(frame_Filtered)

Output I got

我得到的输出

     Doc  ID  Rep
     0  Order   1  101
     1  Order   2  101
     2    Inv   3  101
     3  Order   4  102
     4  Order   5  102
     6  Order   7  103
     7  Order   8  103
     8    Inv   9  103