Original question: http://stackoverflow.com/questions/25043639/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
pandas: How to find the max n values for each category in a column
Asked by robroc
I have a huge municipal library catalog dataset with book title, the library it's in, the library's borough, and the number of times it was loaned out.
I want to find the top 3 most loaned books for each neighbourhood.
Ideally, I'd get something like this:
Borough  Title  Total_loans
A        Book1  35615
A        Book2  34895
A        Book3  2548
B        Book1  6541
B        Book2  5425
etc.
This is the closest I was able to get, but the resulting data frame is not grouped by borough and is hard to read.
import pandas as pd
df = pd.DataFrame({"borough":["A", "B", "B", "A", "A"], "title":["Book2", "Book1", "Book2", "Book2", "Book1"], "total_loans":[4, 48, 46, 78, 15]})
top_boroughs = df.groupby(['borough','title'])
# DataFrame.sort was removed in later pandas releases; sort_values is the current equivalent
top_boroughs.aggregate('sum').sort_values(['total_loans','title'], ascending=False)
Thanks for your help.
Answered by FooBar
In short:
df.groupby(['borough', 'title']).sum().reset_index().sort_values(['borough', 'total_loans'], ascending=[True, False]).groupby('borough').head(3)
The steps:
- Do the correct grouping and sum
- Sort by borough (ascending) and by total loans (descending)
- Group by borough and take the first 3 rows of each group
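The steps above can be sketched one at a time. This is only an illustration using the sample frame from the question; the intermediate names `totals`, `ranked`, `top_n` and the variable `n` are assumptions added here for readability.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "borough": ["A", "B", "B", "A", "A"],
    "title": ["Book2", "Book1", "Book2", "Book2", "Book1"],
    "total_loans": [4, 48, 46, 78, 15],
})

# Step 1: group by borough and title, summing the loans of duplicate rows
totals = df.groupby(["borough", "title"], as_index=False)["total_loans"].sum()

# Step 2: sort by borough (ascending), then by loans (descending)
ranked = totals.sort_values(["borough", "total_loans"], ascending=[True, False])

# Step 3: keep the first n rows of each borough (n is a hypothetical parameter)
n = 3
top_n = ranked.groupby("borough").head(n)
print(top_n)
```

Because the frame is already sorted when `head(n)` runs, the first `n` rows of each borough are its `n` most-loaned titles.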
This is superior to the accepted answer for two reasons:
- readability (yes, it is one long line, but you could just as well split it up): all standard operations
- performance (standard optimized operations, compared to enlarging a dataframe iteratively with concat, which wastes memory)
My output (using head(1), since the test data has only 2 rows per group):
Out[484]:
  borough  title  total_loans
1       A  Book2           82
2       B  Book1           48
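For comparison, the sort-then-head step can also be expressed with pandas' groupby `nlargest`. This is an alternative sketch, not the method used in the answer above; the names `totals` and `top` are assumptions, and `n=1` is chosen only so the result lines up with the output shown.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "borough": ["A", "B", "B", "A", "A"],
    "title": ["Book2", "Book1", "Book2", "Book2", "Book1"],
    "total_loans": [4, 48, 46, 78, 15],
})

# Sum loans per (borough, title) pair
totals = df.groupby(["borough", "title"], as_index=False)["total_loans"].sum()

# Index by title so it survives the groupby, then take the n largest per borough
top = (
    totals.set_index("title")
    .groupby("borough")["total_loans"]
    .nlargest(1)
    .reset_index()
)
print(top)
```

Here `nlargest` handles the ordering itself, so no explicit `sort_values` call is needed before selecting rows.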
Answered by acushner
Something like this:
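t = df.groupby(['borough', 'title']).sum()
# sort_values returns a new frame, so assign it; descending puts the most-loaned titles first
t = t.sort_values('total_loans', ascending=False)
# group on the borough level only, otherwise every (borough, title) pair is its own group
t = t.groupby(level=0).head(3).reset_index()
t = t.sort_values(['borough', 'title'], ascending=[True, False]) #not sure if this is necessary, tough to tell with limited data, but just in case...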
Answered by locohamster
'''
Created on Jul 30, 2014
'''

class TopX():
    def __init__(self, top, sortFunction):
        self.topX = top
        self.sortFunction = sortFunction
        self.data = []

    def addNewItem(self, item):
        # Add the item, re-sort largest first, and keep only the top entries
        self.data.append(item)
        self.data.sort(key=self.sortFunction, reverse=True)
        self.data = self.data[:self.topX]

    def getMax(self):
        return self.data

def runMe():
    # Rank by the third field (the loan count), keeping the top 3 overall
    top = TopX(3, lambda x: int(x[2]))
    with open("lib.txt", "r") as f:
        string = f.readlines()
    for line in string:
        data = [x.strip() for x in line.split(' ')]
        top.addNewItem(data)
    print(top.getMax())

if __name__ == '__main__':
    runMe()
Works with input file in format:
A Book1 1
A Book2 10
A Book3 3
B Book1 7
B Book2 5
Giving results:
[['A', 'Book2', '10'], ['B', 'Book1', '7'], ['B', 'Book2', '5']]
You can specify the number of top books and the sort key if you need to adjust the criteria later on.
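Note that the result shown above is the top 3 across the whole file, not per borough. A minimal sketch of a per-borough variant in plain Python follows, assuming the same space-separated "borough title loans" line format; the function name `top_n_per_borough` and the use of `heapq.nlargest` are illustrative choices, not part of the answer's code.

```python
import heapq
from collections import defaultdict

def top_n_per_borough(lines, n=3):
    """Return {borough: [(title, loans), ...]} holding the n most-loaned titles."""
    by_borough = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        borough, title, loans = parts
        by_borough[borough].append((title, int(loans)))
    # Pick the n entries with the highest loan count within each borough
    return {
        borough: heapq.nlargest(n, books, key=lambda book: book[1])
        for borough, books in by_borough.items()
    }

# Same sample rows as the input file shown above
sample = ["A Book1 1", "A Book2 10", "A Book3 3", "B Book1 7", "B Book2 5"]
result = top_n_per_borough(sample, n=2)
print(result)
```

Keeping one list per borough is what makes the ranking per group; the `n` argument plays the same role as the `top` parameter of `TopX`.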

