Python 在每组熊猫数据框中对列进行排序并选择前 n 行
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/41825978/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Sorting columns and selecting top n rows in each group pandas dataframe
提问by Shubham R
I have a dataframe like this:
我有一个这样的数据框:
mainid pidx pidy score
1 a b 2
1 a c 5
1 c a 7
1 c b 2
1 a e 8
2 x y 1
2 y z 3
2 z y 5
2 x w 12
2 x v 1
2 y x 6
I want to groupby on column 'pidx'
and then sort score in descending order in each group
i.e for each pidx
我想groupby on column 'pidx'
然后sort score in descending order in each group
ie 对于每个 pidx
and then select head(2)
i.e top 2 from each group.
然后select head(2)
即每组前 2 名。
The result I am looking for is like this:
我正在寻找的结果是这样的:
mainid pidx pidy score
1 a e 8
1 a c 5
1 c a 7
1 c b 2
2 x w 12
2 x y 1
2 y x 6
2 y z 3
2 z y 5
What I tried was:
我尝试的是:
df.sort(['pidx','score'],ascending = False).groupby('pidx').head(2)
and this seems to work, but I dont know if it's the right approach if working on a huge dataset. What other best method can I use to get such result?
这似乎有效,但我不知道如果处理庞大的数据集,这是否是正确的方法。我可以使用什么其他最佳方法来获得这样的结果?
回答by jezrael
There are 2 solutions:
有2种解决方案:
1.sort_values
and aggregate head
:
1.sort_values
和聚合head
:
df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2)
print (df1)
mainid pidx pidy score
8 2 x w 12
4 1 a e 8
2 1 c a 7
10 2 y x 6
1 1 a c 5
7 2 z y 5
6 2 y z 3
3 1 c b 2
5 2 x y 1
2.set_index
and aggregate nlargest
:
df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()
print (df)
pidx mainid pidy score
0 a 1 e 8
1 a 1 c 5
2 c 1 a 7
3 c 1 b 2
4 x 2 w 12
5 x 2 y 1
6 y 2 x 6
7 y 2 z 3
8 z 2 y 5
Timings:
时间:
np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid':np.random.randint(1000, size=N),
'pidx': np.random.randint(10000, size=N),
'pidy': np.random.choice(L2, N),
'score':np.random.randint(1000, size=N)})
#print (df)
def epat(df):
grouped = df.groupby('pidx')
new_df = pd.DataFrame([], columns = df.columns)
for key, values in grouped:
new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], 0)
return (new_df)
print (epat(df))
In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop
In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop
In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop
回答by epattaro
a simple solution would be:
一个简单的解决方案是:
grouped = DF.groupby('pidx')
new_df = pd.DataFrame([], columns = DF.columns)
for key, values in grouped:
new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], 0)
hope it helps!
希望能帮助到你!