Sample one record per unique value (pandas, python)

Question

Asked by Ruslan

I work with pandas dataframes in Python, and I have a large dataframe containing users and their data. Each user can have multiple rows. I want to sample one row per user. My current solution does not seem efficient:

import pandas as pd

df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
                    'C': ['C', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D', 'D1', 'D2', 'D3', 'D4', 'D5'],
                    'E': ['E', 'E1', 'E2', 'E3', 'E4', 'E5']},
                   index=[0, 1, 2, 3, 4, 5])

df1
>>  B   C   D   E   User
0   B   C   D   E   user1
1   B1  C1  D1  E1  user1
2   B2  C2  D2  E2  user2
3   B3  C3  D3  E3  user3
4   B4  C4  D4  E4  user2
5   B5  C5  D5  E5  user3

userList = list(df1.User.unique())
userList
> ['user1', 'user2', 'user3']

Then I loop over the list of unique users and sample one row per user, saving the samples to a different dataframe:

usersSample = pd.DataFrame()  # empty dataframe, to save samples
for i in userList:
    # DataFrame.append was removed in pandas 2.0; pd.concat grows the frame the same way
    usersSample = pd.concat([usersSample, df1[df1.User == i].sample(1)])

> usersSample   
B   C   D   E   User
0   B   C   D   E   user1
4   B4  C4  D4  E4  user2
3   B3  C3  D3  E3  user3

Is there a more efficient way of achieving that? I'd really like to: 1) avoid appending to the dataframe usersSample, since it is a gradually growing object and that seriously kills performance; and 2) avoid looping over users one at a time. Is there a way to sample one row per user more efficiently?

Answer 1

Answered by piRSquared

This is what you want:

df1.groupby('User').apply(lambda df: df.sample(1))

Without the extra index:

df1.groupby('User', group_keys=False).apply(lambda df: df.sample(1))
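
As a hedged aside: on pandas 1.1 or later, the grouped object has its own sample method, which avoids the Python-level lambda. A minimal sketch, assuming that pandas version; the variable name one_per_user and the random_state value are illustrative and not part of the original answer:

```python
import pandas as pd

df1 = pd.DataFrame({
    'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
})

# DataFrameGroupBy.sample (pandas >= 1.1) draws n rows from each group.
# The result keeps the original row index, so no extra index level appears.
# random_state is only set here to make the draw reproducible.
one_per_user = df1.groupby('User').sample(n=1, random_state=0)
print(one_per_user)
```

Dropping random_state gives a fresh random draw on every call.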

Answer 2

Answered by ayhan

Depending on the number of rows per user, this might be faster:

df1.sample(frac=1).drop_duplicates(['User'])
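
Why this works: sample(frac=1) returns every row in random order, and drop_duplicates keeps the first occurrence of each User by default, so the row kept for a user is a random one of that user's rows. A small sketch of the two steps spelled out; the names shuffled and one_per_user, the random_state value, and the final sort_index call are illustrative additions:

```python
import pandas as pd

df1 = pd.DataFrame({
    'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
})

# Step 1: shuffle all rows (frac=1 means "sample 100% of the rows").
shuffled = df1.sample(frac=1, random_state=0)

# Step 2: keep the first row seen for each user (keep='first' is the default).
one_per_user = shuffled.drop_duplicates(subset='User')

# Optional: put the surviving rows back in their original order.
one_per_user = one_per_user.sort_index()
print(one_per_user)
```

Note that this shuffles the whole frame, so whether it beats the groupby approach depends on the data, as the answer says.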

Answer 3

Answered by TED Zhao

df1_user_sample_one = df1.groupby('User').apply(lambda x:x.sample(1))

This uses DataFrame.groupby(...).apply with a lambda function to sample one row per user.
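
One detail worth knowing: with the default group_keys=True, the result of groupby(...).apply carries a two-level index (the User key plus the original row label). Also, to my knowledge, recent pandas releases (2.2 onward) warn when apply operates on the grouping column itself. A hedged sketch that sidesteps both by selecting the data columns explicitly and then moving the User level back into a column; data_cols, sampled and result are illustrative names:

```python
import pandas as pd

df1 = pd.DataFrame({
    'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
    'C': ['C', 'C1', 'C2', 'C3', 'C4', 'C5'],
})

# Columns to sample from; the grouping column is deliberately left out.
data_cols = ['B', 'C']

# Each group is a small DataFrame holding only data_cols; sample one row of it.
sampled = df1.groupby('User')[data_cols].apply(lambda g: g.sample(n=1))

# sampled has a MultiIndex (User, original row label).
# Move the User level back into an ordinary column.
result = sampled.reset_index(level='User')
print(result)
```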
