Python 在每组熊猫数据框中对列进行排序并选择前 n 行

Question

提问by Shubham R

I have a dataframe like this:

我有一个这样的数据框：

mainid  pidx    pidy   score
  1      a        b      2
  1      a        c      5
  1      c        a      7
  1      c        b      2
  1      a        e      8
  2      x        y      1
  2      y        z      3
  2      z        y      5
  2      x        w      12
  2      x        v      1
  2      y        x      6

I want to groupby on column 'pidx'and then sort score in descending order in each groupi.e for each pidx

我想groupby on column 'pidx'然后sort score in descending order in each groupie 对于每个 pidx

and then select head(2)i.e top 2 from each group.

然后select head(2)即每组前 2 名。

The result I am looking for is like this:

我正在寻找的结果是这样的：

mainid   pidx    pidy    score
  1        a      e        8
  1        a      c        5
  1        c      a        7
  1        c      b        2
  2        x      w        12
  2        x      y        1
  2        y      x        6
  2        y      z        3
  2        z      y        5

What I tried was:

我尝试的是：

df.sort(['pidx','score'],ascending = False).groupby('pidx').head(2)

and this seems to work, but I dont know if it's the right approach if working on a huge dataset. What other best method can I use to get such result?

这似乎有效，但我不知道如果处理庞大的数据集，这是否是正确的方法。我可以使用什么其他最佳方法来获得这样的结果？

Answer 1

回答by jezrael

There are 2 solutions:

有2种解决方案：

1.sort_valuesand aggregate head:

1.sort_values和聚合head：

df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2)
print (df1)

    mainid pidx pidy  score
8        2    x    w     12
4        1    a    e      8
2        1    c    a      7
10       2    y    x      6
1        1    a    c      5
7        2    z    y      5
6        2    y    z      3
3        1    c    b      2
5        2    x    y      1

2.set_indexand aggregate nlargest:

2.set_index和聚合nlargest：

df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index() 
print (df)
  pidx  mainid pidy  score
0    a       1    e      8
1    a       1    c      5
2    c       1    a      7
3    c       1    b      2
4    x       2    w     12
5    x       2    y      1
6    y       2    x      6
7    y       2    z      3
8    z       2    y      5

Timings:

时间：

np.random.seed(123)
N = 1000000

L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'score':np.random.randint(1000, size=N)})
#print (df)

def epat(df):
    grouped = df.groupby('pidx')
    new_df = pd.DataFrame([], columns = df.columns)
    for key, values in grouped:
        new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], 0)
    return (new_df)

print (epat(df))

In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop

In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop

In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop

Answer 2

回答by epattaro

a simple solution would be:

一个简单的解决方案是：

grouped = DF.groupby('pidx')

new_df = pd.DataFrame([], columns = DF.columns)

for key, values in grouped:

    new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], 0)

hope it helps!

希望能帮助到你！

Python 在每组熊猫数据框中对列进行排序并选择前 n 行

提问by Shubham R

回答by jezrael

回答by epattaro

相关推荐

最近更新

标签

Python 在每组熊猫数据框中对列进行排序并选择前 n 行

提问by Shubham R

回答by jezrael

回答by epattaro

相关推荐

“i”在 Python .pyi 扩展名中代表什么？

Python 在 Pycharm 中的 Jupyter Notebook 身份验证令牌

Python 使用 webdriver 滚动到元素？

pytest 不能导入模块，而 python 可以

相关推荐

最近更新

标签