Pandas: select rows based on the value counts of a specific column

Question

Asked by bigO6377

What's the simplest way of selecting all rows from a pandas DataFrame whose sym occurs exactly twice in the entire table? For example, in the table below, I would like to select all rows with sym in ['b','e'], since the value_counts for these symbols equal 2.

import numpy as np
import pandas as pd
df = pd.DataFrame({'sym': ['a', 'b', 'b', 'c', 'd', 'd', 'd', 'e', 'e'], 'price': np.random.randn(9)})

                     price sym
    0              -0.0129   a
    1              -1.2940   b
    2               1.8423   b
    3              -0.7160   c
    4              -2.3216   d
    5              -0.0120   d
    6              -0.5914   d
    7               0.6280   e
    8               0.5361   e

df.sym.value_counts()
Out[237]: 
d    3
e    2
b    2
c    1
a    1

Answer 1

Answered by jezrael

I think you can use groupby on column sym and filter to keep the groups with length == 2:

print(df.groupby("sym").filter(lambda x: len(x) == 2))
      price sym
1  0.400157   b
2  0.978738   b
7 -0.151357   e
8 -0.103219   e

A second solution uses isin with boolean indexing:

s = df.sym.value_counts()

print(s[s == 2].index)
Index([u'e', u'b'], dtype='object')

print(df[df.sym.isin(s[s == 2].index)])
      price sym
1  0.400157   b
2  0.978738   b
7 -0.151357   e
8 -0.103219   e

The fastest of these solutions uses transform with boolean indexing:

print(df[df.groupby("sym")["sym"].transform('size') == 2])
    price sym
1 -1.2940   b
2  1.8423   b
7  0.6280   e
8  0.5361   e
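
As a quick sanity check, the three approaches above can be run side by side to confirm that they select the same rows. This is only a minimal sketch: the seeded random generator and the variable names (rng, by_filter, by_isin, by_transform) are my own assumptions for illustration, not part of the question.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption, only to make the prices reproducible).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sym": ["a", "b", "b", "c", "d", "d", "d", "e", "e"],
    "price": rng.standard_normal(9),
})

# 1) groupby + filter: keep every group that has exactly two rows.
by_filter = df.groupby("sym").filter(lambda g: len(g) == 2)

# 2) value_counts + isin: find the symbols with count 2, then mask the rows.
counts = df["sym"].value_counts()
by_isin = df[df["sym"].isin(counts[counts == 2].index)]

# 3) groupby + transform('size'): broadcast each group's size to its rows.
by_transform = df[df.groupby("sym")["sym"].transform("size") == 2]

# All three keep the original row order and index labels.
assert by_filter.equals(by_isin)
assert by_isin.equals(by_transform)
print(by_transform.index.tolist())  # [1, 2, 7, 8]
```

The filter version calls a Python lambda once per group, while the transform and isin versions build a single boolean mask, which is likely why they tend to be faster when there are many groups.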

Answer 2

Answered by Tim Cui

You can use map, which should be faster than using groupby and transform:

df[df['sym'].map(df['sym'].value_counts()) == 2]

For example, timing both approaches:

%%timeit
df[df['sym'].map(df['sym'].value_counts()) == 2]
Out[1]:
1.83 ms ± 23.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df[df.groupby("sym")["sym"].transform('size') == 2]
Out[2]:
2.08 ms ± 41.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
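
To see why the map version works, it may help to look at the intermediate Series. The sketch below uses a frame with only the sym column, and the names counts, per_row and selected are assumptions of mine for illustration.

```python
import pandas as pd

# Same symbols as in the question; the price column is omitted for brevity.
df = pd.DataFrame({"sym": ["a", "b", "b", "c", "d", "d", "d", "e", "e"]})

# value_counts returns a Series indexed by symbol, holding each symbol's count.
counts = df["sym"].value_counts()

# map looks each row's symbol up in that Series, giving a per-row count.
per_row = df["sym"].map(counts)
print(per_row.tolist())  # [1, 2, 2, 1, 3, 3, 3, 2, 2]

# Comparing the per-row counts with 2 yields the boolean mask used for selection.
selected = df[per_row == 2]
print(selected["sym"].tolist())  # ['b', 'b', 'e', 'e']
```

The timings above come from a single small frame, so the gap between map and transform may differ on larger data or other pandas versions.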
