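Pandas: Selecting rows based on value counts of a particular column
Notice: this page is a translated copy of a popular StackOverflow question, shared under the CC BY-SA 4.0 license. If you use it, you must also follow the CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/36166090/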
Asked by bigO6377
What's the simplest way to select all rows from a pandas DataFrame whose sym occurs exactly twice in the entire table? For example, in the table below, I would like to select all rows with sym in ['b','e'], since the value_counts for these symbols equal 2.
import numpy as np
import pandas as pd

df = pd.DataFrame({'sym': ['a', 'b', 'b', 'c', 'd', 'd', 'd', 'e', 'e'], 'price': np.random.randn(9)})
    price sym
0 -0.0129   a
1 -1.2940   b
2  1.8423   b
3 -0.7160   c
4 -2.3216   d
5 -0.0120   d
6 -0.5914   d
7  0.6280   e
8  0.5361   e
df.sym.value_counts()
Out[237]:
d    3
e    2
b    2
c    1
a    1
Answered by jezrael
I think you can use groupby on column sym and then filter to keep the groups with length == 2:
print(df.groupby("sym").filter(lambda x: len(x) == 2))
      price sym
1  0.400157   b
2  0.978738   b
7 -0.151357   e
8 -0.103219   e
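To make the behaviour of filter easier to verify, here is a minimal sketch that swaps the random prices for fixed, purely illustrative values (the numbers 1.0 to 9.0 are an assumption, not the question's data). filter calls the function once per group and keeps every row of the groups for which it returns True.

```python
import pandas as pd

# Fixed stand-in for the question's random prices (illustrative values only).
df = pd.DataFrame({
    "sym": ["a", "b", "b", "c", "d", "d", "d", "e", "e"],
    "price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

# Keep all rows belonging to groups that contain exactly two rows.
twice = df.groupby("sym").filter(lambda g: len(g) == 2)
print(twice)
```

Rows 1, 2, 7 and 8 survive, since only 'b' and 'e' form groups of size two.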
A second solution uses isin with boolean indexing:
s = df.sym.value_counts()
print(s[s == 2].index)
Index([u'e', u'b'], dtype='object')
print(df[df.sym.isin(s[s == 2].index)])
      price sym
1  0.400157   b
2  0.978738   b
7 -0.151357   e
8 -0.103219   e
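The isin approach generalizes naturally to any column and any count. The sketch below wraps it in a helper; the name rows_with_count and the fixed prices are hypothetical, introduced only for illustration.

```python
import pandas as pd

def rows_with_count(df, column, n):
    """Return the rows whose value in `column` occurs exactly `n` times.

    Hypothetical helper, not part of pandas.
    """
    counts = df[column].value_counts()   # occurrences of each distinct value
    wanted = counts[counts == n].index   # values that occur exactly n times
    return df[df[column].isin(wanted)]   # boolean indexing on membership

# Fixed stand-in for the question's random prices (illustrative values only).
df = pd.DataFrame({
    "sym": ["a", "b", "b", "c", "d", "d", "d", "e", "e"],
    "price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

result = rows_with_count(df, "sym", 2)
print(result)
```

If no value occurs n times, wanted is empty and the helper returns an empty DataFrame with the same columns.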
The fastest solution uses transform with boolean indexing:
print(df[df.groupby("sym")["sym"].transform('size') == 2])
    price sym
1 -1.2940   b
2  1.8423   b
7  0.6280   e
8  0.5361   e
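It may help to look at the intermediate result. transform('size') returns a Series aligned with the original index, where each row holds the size of its own group, so comparing it to 2 yields a boolean mask directly. A small sketch using only the sym column from the question:

```python
import pandas as pd

df = pd.DataFrame({"sym": ["a", "b", "b", "c", "d", "d", "d", "e", "e"]})

# One entry per original row: the size of the group that row belongs to.
counts = df.groupby("sym")["sym"].transform("size")
print(counts.tolist())  # [1, 2, 2, 1, 3, 3, 3, 2, 2]

# Comparing against 2 gives the mask used for boolean indexing.
mask = counts == 2
print(df[mask].index.tolist())  # [1, 2, 7, 8]
```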
Answered by Tim Cui
You can use map, which should be faster than using groupby and transform:
df[df['sym'].map(df['sym'].value_counts()) == 2]
For example:
%%timeit
df[df['sym'].map(df['sym'].value_counts()) == 2]
Out[1]:
1.83 ms ± 23.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df[df.groupby("sym")["sym"].transform('size') == 2]
Out[2]:
2.08 ms ± 41.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
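The timings above depend on the data and the pandas version, so it is worth re-measuring on your own data. The following is a rough sketch using the standard timeit module outside IPython; the DataFrame big, its size, and the function names via_map and via_transform are all assumptions made for illustration, and no particular outcome is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: 50,000 rows with symbols drawn from a pool
# of 50,000 labels, so a fair share of symbols occurs exactly twice.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "sym": rng.integers(0, 50_000, size=50_000).astype(str),
    "price": rng.standard_normal(50_000),
})

def via_map(df):
    # Look up each row's symbol in the value_counts Series.
    return df[df["sym"].map(df["sym"].value_counts()) == 2]

def via_transform(df):
    # Broadcast each group's size back onto its rows.
    return df[df.groupby("sym")["sym"].transform("size") == 2]

# Both approaches should select exactly the same rows.
same_rows = via_map(big).equals(via_transform(big))
print("same result:", same_rows)

for fn in (via_map, via_transform):
    seconds = timeit.timeit(lambda: fn(big), number=10)
    print(f"{fn.__name__}: {seconds / 10 * 1e3:.2f} ms per call")
```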