Pandas - Group by id and drop duplicates with a threshold

Question

Asked by Mansumen

I have the following data:

userid itemid
  1       1
  1       1
  1       3
  1       4
  2       1
  2       2
  2       3

I want to drop every userid that has viewed the same itemid twice or more. For example, userid=1 has viewed itemid=1 twice, so I want to drop all rows for userid=1. Since userid=2 hasn't viewed any item twice, userid=2 should be left as it is.

So I want my data to be like the following:

userid itemid
  2       1
  2       2
  2       3

Can someone help me?

import pandas as pd    
df = pd.DataFrame({'userid':[1,1,1,1, 2,2,2],
                   'itemid':[1,1,3,4, 1,2,3] })

Answer 1

Answered by root

You can use duplicated to find the row-level duplicates, then perform a groupby on 'userid' to flag every row of a user who has at least one duplicate, then drop accordingly.

To drop without a threshold:

df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]

To drop with a threshold, use keep=False in duplicated, sum over the Boolean column, and compare against your threshold. For example, with a threshold of 3:

df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]

The resulting output for no threshold:

   userid  itemid
4       2       1
5       2       2
6       2       3
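
One thing to keep in mind with the thresholded version above: the sum counts all of a user's duplicated rows together, so a user who viewed two different items twice each would reach 4 even though no single item was viewed 3 times. If the threshold is meant per item, a minimal sketch like the following may be closer to the intent. The names THRESHOLD, pair_counts and max_per_user are illustrative only and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

THRESHOLD = 2  # assumed meaning: drop users with any item viewed this many times or more

# Number of rows in each (userid, itemid) pair, aligned with df's rows
pair_counts = df.groupby(['userid', 'itemid'])['itemid'].transform('count')
# Largest pair count per user, broadcast back to every row of that user
max_per_user = pair_counts.groupby(df['userid']).transform('max')

result = df[max_per_user < THRESHOLD]
print(result)
```

With THRESHOLD = 2 this keeps only the rows of userid=2, matching the output shown above.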

Answer 2

Answered by piRSquared

`filter` was made for this. You can pass a function that returns a boolean that determines whether the group passes the filter.

filter and value_counts
Most generalizable and intuitive

df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)

filter and is_unique
Special case for a threshold of 2 (every item viewed at most once)

df.groupby('userid').filter(lambda x: x.itemid.is_unique)

   userid  itemid
4       2       1
5       2       2
6       2       3
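
To make the threshold explicit, the value_counts version can be wrapped in a small helper. This is only a sketch: the function name drop_users_with_repeats and its threshold parameter are hypothetical, not something from pandas or from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

def drop_users_with_repeats(frame, threshold=2):
    """Keep only users whose most-viewed item was seen fewer than `threshold` times."""
    return frame.groupby('userid').filter(
        lambda g: g['itemid'].value_counts().max() < threshold
    )

strict = drop_users_with_repeats(df, threshold=2)   # userid=1 is dropped
lenient = drop_users_with_repeats(df, threshold=3)  # nobody reaches 3 views, all rows kept
print(strict)
print(len(lenient))
```

filter calls the function once per group, so on a very large number of users the transform-based approach from the first answer may be faster.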

Answer 3

Answered by DYZ

Group the dataframe by users and items:

views = df.groupby(['userid','itemid'])['itemid'].count()
#userid  itemid
#1       1         2 <=== The offending row
#        3         1
#        4         1
#2       1         1
#        2         1
#        3         1
#Name: itemid, dtype: int64

Find the users who did not view any item THRESHOLD or more times:

THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1    False
#2     True
#dtype: bool

Combine the results and keep the 'good' rows:

combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
#   userid  itemid
#4       2       1
#5       2       2
#6       2       3
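
As an alternative to the merge step, the per-user boolean Series can be looked up row by row with Series.map and used directly as a mask. This is a sketch of a swapped-in technique rather than the answer's own method, and the name kept is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

THRESHOLD = 2
views = df.groupby(['userid', 'itemid'])['itemid'].count()
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)

# Map each row's userid to that user's True/False flag, then filter with it
kept = df[df['userid'].map(viewed)]
print(kept)
```

This avoids the helper column named 0 that the merge introduces and preserves the original row index.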

Answer 4

Answered by Allen

# Group by userid and itemid and count the rows in each group
df2 = df.groupby(by=['userid', 'itemid']).size().reset_index(name='count')
# Keep only the users whose largest userid-itemid count is less than 2.
# (.ix was removed in pandas 1.0; .iloc selects the last column by position.)
df2 = df2[~df2.userid.isin(df2[df2.iloc[:, -1] > 1]['userid'])][df.columns]
print(df2)
   userid  itemid
3       2       1
4       2       2
5       2       3

If you want to drop at a certain threshold, just change the condition to

df2.iloc[:, -1] > threshold

Answer 5

Answered by arnold

I do not know whether there is a function available in Pandas to do this task. However, I tried to write a workaround for your problem.

Here is the full code.

import pandas as pd
dictionary = {'userid':[1,1,1,1,2,2,2],
              'itemid':[1,1,3,4,1,2,3]}

df = pd.DataFrame(dictionary, columns=['userid', 'itemid'])

selected_user = []

for user in df['userid'].drop_duplicates().tolist():

    items = df.loc[df['userid']==user]['itemid'].tolist()
    if len(items) != len(set(items)): continue
    else: selected_user.append(user)

result = df.loc[(df['userid'].isin(selected_user))]

This code produces the following result.

    userid  itemid
4   2       1
5   2       2
6   2       3

Hope it helps.
