Pandas: Filter unique values by group

Question

Asked by Andres

I have a dataframe with sales information in a supermarket. Each row in the dataframe represents an item, with several characteristics as columns. The original DataFrame is something like this:

In [1]: import pandas as pd
        # Every record uses the same three keys: TICKET_NUMBER, ITEM, TICKET_ROW
        my_data = [{'TICKET_NUMBER': '001', 'ITEM': 'vegetable', 'TICKET_ROW': '1'},
                   {'TICKET_NUMBER': '001', 'ITEM': 'vegetable', 'TICKET_ROW': '2'},
                   {'TICKET_NUMBER': '001', 'ITEM': 'soup', 'TICKET_ROW': '3'},
                   {'TICKET_NUMBER': '002', 'ITEM': 'soup', 'TICKET_ROW': '1'},
                   {'TICKET_NUMBER': '002', 'ITEM': 'drink', 'TICKET_ROW': '2'},
                   {'TICKET_NUMBER': '003', 'ITEM': 'meat', 'TICKET_ROW': '1'},
                   {'TICKET_NUMBER': '003', 'ITEM': 'vegetable', 'TICKET_ROW': '2'},
                   {'TICKET_NUMBER': '003', 'ITEM': 'meat', 'TICKET_ROW': '3'}]
        # Fix the column order so the display matches the table below
        df = pd.DataFrame(my_data, columns=['TICKET_NUMBER', 'TICKET_ROW', 'ITEM'])

In [2]: df
Out[2]: 
  TICKET_NUMBER TICKET_ROW       ITEM
0           001          1  vegetable
1           001          2  vegetable
2           001          3       soup
3           002          1       soup
4           002          2      drink
5           003          1       meat
6           003          2  vegetable
7           003          3       meat

I want to filter out duplicated items that belong to the same ticket. For example, in the first ticket (TICKET_NUMBER==001), there are 2 vegetables, so I want to delete 1 of them. The same happens in ticket number 003 with meat.

So, the final dataset would look like this:

  TICKET_NUMBER TICKET_ROW       ITEM
0           001          1  vegetable
1           001          3       soup
2           002          1       soup
3           002          2      drink
4           003          1       meat
5           003          2  vegetable

My guess was to group by TICKET_NUMBER and then filter ITEM with unique(), i.e. df.groupby(['TICKET_NUMBER','TICKET_ROW'])['ITEM'].unique(). Once I have the unique values, I would like to turn those groups back into a DataFrame (a kind of "ungroupby"). Is that possible?

I'm sure there are other ways of doing what I'm looking for. Please, help!

Thank you!

Answer 1

Accepted answer by DSM

I think you're close. It looks like taking the first TICKET_ROW in the case of duplicates would suffice, and we can use as_index=False to keep the result looking like the original dataframe. So we can group by TICKET_NUMBER and ITEM and take the first TICKET_ROW:

df.groupby(["TICKET_NUMBER", "ITEM"], sort=False, as_index=False)["TICKET_ROW"].first()

which gives

In [46]: df.groupby(["TICKET_NUMBER", "ITEM"], sort=False, as_index=False)["TICKET_ROW"].first()
Out[46]: 
  TICKET_NUMBER       ITEM TICKET_ROW
0           001  vegetable          1
1           001       soup          3
2           002       soup          1
3           002      drink          2
4           003       meat          1
5           003  vegetable          2
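
As a side note that is not part of the original answer: for this particular data, DataFrame.drop_duplicates with a subset of columns appears to give the same rows as the groupby above. The following is a minimal sketch, assuming the sample data from the question; the variable name deduped is only illustrative.

```python
import pandas as pd

# Sample data from the question (assumed to use consistent column names).
my_data = [
    {"TICKET_NUMBER": "001", "ITEM": "vegetable", "TICKET_ROW": "1"},
    {"TICKET_NUMBER": "001", "ITEM": "vegetable", "TICKET_ROW": "2"},
    {"TICKET_NUMBER": "001", "ITEM": "soup", "TICKET_ROW": "3"},
    {"TICKET_NUMBER": "002", "ITEM": "soup", "TICKET_ROW": "1"},
    {"TICKET_NUMBER": "002", "ITEM": "drink", "TICKET_ROW": "2"},
    {"TICKET_NUMBER": "003", "ITEM": "meat", "TICKET_ROW": "1"},
    {"TICKET_NUMBER": "003", "ITEM": "vegetable", "TICKET_ROW": "2"},
    {"TICKET_NUMBER": "003", "ITEM": "meat", "TICKET_ROW": "3"},
]
df = pd.DataFrame(my_data)

# Keep only the first row seen for each (TICKET_NUMBER, ITEM) pair.
deduped = df.drop_duplicates(subset=["TICKET_NUMBER", "ITEM"], keep="first")

# Renumber the surviving rows 0..n-1, as in the desired output.
deduped = deduped.reset_index(drop=True)

print(deduped)
```

One difference worth keeping in mind: drop_duplicates keeps each surviving row whole, while GroupBy.first() takes the first non-null value of each column within a group, so the two could diverge on data with missing values.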
