pandas: How to find the top n values for each category in a column

Question

Asked by robroc

I have a huge municipal library catalog dataset with book title, the library it's in, the library's borough, and the number of times it was loaned out.

I want to find the 3 most-loaned books for each borough.

Ideally, I'd get something like this:

Borough    Title    Total_loans
A          Book1    35615 
A          Book2    34895
A          Book3    2548
B          Book1    6541
B          Book2    5425

etc.

This is the closest I was able to get, but the resulting data frame is not grouped by borough and is hard to read.

import pandas as pd

df = pd.DataFrame({"borough":["A", "B", "B", "A", "A"], "title":["Book2", "Book1", "Book2", "Book2", "Book1"], "total_loans":[4, 48, 46, 78, 15]})

top_boroughs = df.groupby(['borough','title'])
# DataFrame.sort() was removed from pandas; sort_values() is its replacement.
top_boroughs.aggregate('sum').sort_values(['total_loans','title'], ascending=False)

Thanks for your help.

Answer 1

Answered by FooBar

In short:

df.groupby(['borough', 'title']).sum().reset_index().sort_values(['borough', 'total_loans'], ascending=[True, False]).groupby('borough').head(3)

The steps:

Do the correct grouping and sum
Sort by borough (ascending) and total loans (descending)
Group by borough and take the first 3 rows of each group
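
The three steps above can be sketched as separate statements, which may be easier to read than the one-liner. This is a minimal sketch that assumes the sample data from the question; the helper name `top_n_per_borough` is hypothetical and only used for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "borough": ["A", "B", "B", "A", "A"],
    "title": ["Book2", "Book1", "Book2", "Book2", "Book1"],
    "total_loans": [4, 48, 46, 78, 15],
})


def top_n_per_borough(frame, n=3):
    """Return the n most-loaned titles in each borough (hypothetical helper)."""
    # Step 1: total the loans for every (borough, title) pair.
    totals = frame.groupby(["borough", "title"], as_index=False)["total_loans"].sum()
    # Step 2: sort boroughs alphabetically, highest totals first within each.
    ordered = totals.sort_values(["borough", "total_loans"], ascending=[True, False])
    # Step 3: keep the first n rows of every borough.
    return ordered.groupby("borough").head(n)


top = top_n_per_borough(df, n=1)
print(top)
```

Because `head(n)` keeps the first `n` rows of each group in their current order, the sort in step 2 has to happen before the final `groupby`.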

This is superior to the accepted answer in both:

readability (yes, it is one long line, but you could just as well split it up): all standard operations
performance (standard optimized operations, compared to enlarging a dataframe iteratively with concat, which wastes memory)

My output (using head(1), since the test data has only 2 rows per group):

Out[484]: 
  borough  title  total_loans
1       A  Book2           82
2       B  Book1           48

Answer 2

Answered by acushner

Something like this:

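t = df.groupby(['borough', 'title']).sum()
# sort_values() returns a new frame, so assign the result; highest totals first
t = t.sort_values('total_loans', ascending=False)
# group on the borough level only, otherwise every (borough, title) pair is its own group
t = t.groupby(level='borough').head(3).reset_index()
t = t.sort_values(['borough', 'total_loans'], ascending=[True, False]) #not sure if this is necessary, tough to tell with limited data, but just in case...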

Answer 3

Answered by locohamster

'''
Created on Jul 30, 2014
'''


class TopX:
    """Keep the top `top` items seen so far, ranked by `sortFunction`."""

    def __init__(self, top, sortFunction):
        self.topX = top
        self.sortFunction = sortFunction
        self.data = []

    def addNewItem(self, item):
        # Add the item, re-sort in descending order, and trim to the top X.
        self.data.append(item)
        self.data.sort(key=self.sortFunction, reverse=True)
        self.data = self.data[:self.topX]

    def getMax(self):
        return self.data


def runMe():
    # Rank rows by the third field (the loan count), keeping the top 3.
    top = TopX(3, lambda x: int(x[2]))
    with open("lib.txt", "r") as f:
        for line in f:
            data = [x.strip() for x in line.split(' ')]
            top.addNewItem(data)

    print(top.getMax())


if __name__ == '__main__':
    runMe()

Works with an input file in this format:

A Book1 1
A Book2 10
A Book3 3
B Book1 7
B Book2 5

Giving results:

[['A', 'Book2', '10'], ['B', 'Book1', '7'], ['B', 'Book2', '5']]

You may specify the number of top books and the sorting key if you need to adjust the criteria later on.
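
Note that the results shown are the top 3 rows across the whole file, not per borough. A per-borough variant of the same pure-Python idea could look like the sketch below. It is not part of the original answer: the function name `top_per_borough` is hypothetical, and it assumes each row has exactly three whitespace-separated fields (borough, title, loan count) as in the input format shown.

```python
import heapq
from collections import defaultdict


def top_per_borough(rows, n=3):
    """Group (borough, title, loans) rows by borough and keep the n largest."""
    by_borough = defaultdict(list)
    for borough, title, loans in rows:
        by_borough[borough].append((title, int(loans)))
    # heapq.nlargest returns the n entries with the highest loan counts,
    # ordered from largest to smallest.
    return {
        borough: heapq.nlargest(n, books, key=lambda book: book[1])
        for borough, books in by_borough.items()
    }


# Same rows as the sample input file, inlined so the sketch is self-contained.
lines = ["A Book1 1", "A Book2 10", "A Book3 3", "B Book1 7", "B Book2 5"]
rows = [line.split() for line in lines]
result = top_per_borough(rows, n=2)
print(result)
```

Both `n` and the `key` function play the same role as the `top` and `sortFunction` arguments of `TopX`, so the criteria can still be adjusted later.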