Original question: http://stackoverflow.com/questions/33019446/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas: filter unique values in groups
Asked by Andres
I have a dataframe with sales information in a supermarket. Each row in the dataframe represents an item, with several characteristics as columns. The original DataFrame is something like this:
In [1]: import pandas as pd
   ...: my_data = [{'TICKET_NUMBER': '001', 'ITEM': 'vegetable', 'TICKET_ROW': '1'},
   ...:            {'TICKET_NUMBER': '001', 'ITEM': 'vegetable', 'TICKET_ROW': '2'},
   ...:            {'TICKET_NUMBER': '001', 'ITEM': 'soup', 'TICKET_ROW': '3'},
   ...:            {'TICKET_NUMBER': '002', 'ITEM': 'soup', 'TICKET_ROW': '1'},
   ...:            {'TICKET_NUMBER': '002', 'ITEM': 'drink', 'TICKET_ROW': '2'},
   ...:            {'TICKET_NUMBER': '003', 'ITEM': 'meat', 'TICKET_ROW': '1'},
   ...:            {'TICKET_NUMBER': '003', 'ITEM': 'vegetable', 'TICKET_ROW': '2'},
   ...:            {'TICKET_NUMBER': '003', 'ITEM': 'meat', 'TICKET_ROW': '3'}]
   ...: df = pd.DataFrame(my_data)
In [2]: df
Out[2]:
   TICKET_NUMBER TICKET_ROW       ITEM
0            001          1  vegetable
1            001          2  vegetable
2            001          3       soup
3            002          1       soup
4            002          2      drink
5            003          1       meat
6            003          2  vegetable
7            003          3       meat
I want to filter out duplicated items that belong to the same ticket. For example, in the first ticket (TICKET_NUMBER==001), there are 2 vegetables, so I want to delete 1 of them. The same happens in ticket number 003 with meat.
So, the final dataset would look like this:
   TICKET_NUMBER TICKET_ROW       ITEM
0            001          1  vegetable
1            001          3       soup
2            002          1       soup
3            002          2      drink
4            003          1       meat
5            003          2  vegetable
My guess was to group by TICKET_NUMBER and then filter ITEM with unique(), e.g. df.groupby(['TICKET_NUMBER','TICKET_ROW'])['ITEM'].unique(). Once I have the unique values, I would like to turn those groups back into a DataFrame (a kind of "ungroupby"). Is that possible?
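For illustration, the "group, take unique(), then ungroup" idea can be sketched as follows. This is a minimal sketch, not part of the original post: it assumes pandas 0.25 or later (for Series.explode), groups by TICKET_NUMBER only, and uses the illustrative names unique_items and flat. Note that it keeps only TICKET_NUMBER and ITEM, so TICKET_ROW is lost.

```python
import pandas as pd

# Same data as in the question, written column-wise for brevity.
df = pd.DataFrame({
    "TICKET_NUMBER": ["001", "001", "001", "002", "002", "003", "003", "003"],
    "TICKET_ROW": ["1", "2", "3", "1", "2", "1", "2", "3"],
    "ITEM": ["vegetable", "vegetable", "soup", "soup", "drink",
             "meat", "vegetable", "meat"],
})

# unique() yields one array of distinct items per ticket,
# in order of first appearance.
unique_items = df.groupby("TICKET_NUMBER", sort=False)["ITEM"].unique()

# explode() expands each array back into one row per item, and
# reset_index() turns TICKET_NUMBER back into an ordinary column.
flat = unique_items.explode().reset_index()

print(flat)
```

Because TICKET_ROW is dropped along the way, this route only fits cases where the row number is not needed afterwards.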
I'm sure there are other ways of doing what I'm looking for. Please, help!
Thank you!
Accepted answer by DSM
I think you're close. It looks like taking the first TICKET_ROW in the case of duplicates would suffice, and we can use as_index=False to keep things looking like the original dataframe. So we can group by TICKET_NUMBER and ITEM and take the first TICKET_ROW:
df.groupby(["TICKET_NUMBER", "ITEM"], sort=False, as_index=False)["TICKET_ROW"].first()
which gives
In [46]: df.groupby(["TICKET_NUMBER", "ITEM"], sort=False, as_index=False)["TICKET_ROW"].first()
Out[46]:
   TICKET_NUMBER       ITEM TICKET_ROW
0            001  vegetable          1
1            001       soup          3
2            002       soup          1
3            002      drink          2
4            003       meat          1
5            003  vegetable          2
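As a self-contained check of the approach above, the sketch below rebuilds the sample data and compares the groupby result with DataFrame.drop_duplicates, which is a different technique from the one in the answer and is shown here only as a possible alternative. The names by_group and deduped are illustrative, not from the original post.

```python
import pandas as pd

# Same data as in the question, written column-wise for brevity.
df = pd.DataFrame({
    "TICKET_NUMBER": ["001", "001", "001", "002", "002", "003", "003", "003"],
    "TICKET_ROW": ["1", "2", "3", "1", "2", "1", "2", "3"],
    "ITEM": ["vegetable", "vegetable", "soup", "soup", "drink",
             "meat", "vegetable", "meat"],
})

# The answer's approach: one row per (ticket, item) pair,
# keeping the first TICKET_ROW seen for that pair.
by_group = df.groupby(["TICKET_NUMBER", "ITEM"], sort=False,
                      as_index=False)["TICKET_ROW"].first()

# Alternative: keep the first full row for each (ticket, item) pair,
# then renumber the index from 0.
deduped = (df.drop_duplicates(subset=["TICKET_NUMBER", "ITEM"], keep="first")
             .reset_index(drop=True))

print(by_group)
print(deduped)
```

On this data both routes keep the same six rows; drop_duplicates also preserves the original column order. One difference to keep in mind: first() takes the first non-null value in each column of a group, whereas drop_duplicates keeps the first row as a whole, so the two can differ when the data has missing values.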

