Original question: http://stackoverflow.com/questions/38390242/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Sampling one record per unique value (pandas, python)
Asked by Ruslan
I work with python-pandas dataframes, and I have a large dataframe containing users and their data. Each user can have multiple rows. I want to sample one row per user. My current solution does not seem efficient:
import pandas as pd

df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
                    'C': ['C', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D', 'D1', 'D2', 'D3', 'D4', 'D5'],
                    'E': ['E', 'E1', 'E2', 'E3', 'E4', 'E5']},
                   index=[0, 1, 2, 3, 4, 5])
df1
>>  B   C   D   E   User
0   B   C   D   E  user1
1  B1  C1  D1  E1  user1
2  B2  C2  D2  E2  user2
3  B3  C3  D3  E3  user3
4  B4  C4  D4  E4  user2
5  B5  C5  D5  E5  user3
userList = list(df1.User.unique())
userList
> ['user1', 'user2', 'user3']
Then I loop over the list of unique users and sample one row per user, saving the rows to a different dataframe:
usersSample = pd.DataFrame()  # empty dataframe, to save samples
for i in userList:
    # Note: DataFrame.append was removed in pandas 2.0, so this loop needs an older pandas version
    usersSample = usersSample.append(df1[df1.User == i].sample(1))
> usersSample
    B   C   D   E   User
0   B   C   D   E  user1
4  B4  C4  D4  E4  user2
3  B3  C3  D3  E3  user3
Is there a more efficient way of achieving this? I'd really like to: 1) avoid appending to the dataframe usersSample, since this gradually growing object seriously hurts performance; and 2) avoid looping over users one at a time. Is there a way to sample one row per user more efficiently?
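The first concern (the growing dataframe) can be handled separately from the second. A minimal sketch, not taken from the original post: collect the sampled rows in a plain Python list and call `pd.concat` once at the end. It still loops over users, so it only addresses point 1; the smaller two-column frame is used here purely for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5']})

# Sample one row from each user's group and keep the pieces in a list.
samples = [group.sample(1) for _, group in df1.groupby('User')]

# Build the result dataframe in a single concatenation.
usersSample = pd.concat(samples)
print(usersSample)
```

Appending to a list is cheap, whereas growing a dataframe row by row copies the accumulated data on every iteration.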
Answered by piRSquared
Answered by ayhan
Depending on the number of rows per user, this might be faster:
df1.sample(frac=1).drop_duplicates(['User'])
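The one-liner above works in two steps: `sample(frac=1)` returns every row in random order, and `drop_duplicates` then keeps the first row it meets for each user, which after shuffling is a uniformly random one of that user's rows. A minimal sketch on a smaller frame; the helper name `sample_one_per_user` and the `seed` parameter are illustrative additions, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5']})

def sample_one_per_user(df, seed=None):
    """Return one randomly chosen row per distinct value in the 'User' column."""
    # Step 1: shuffle all rows (frac=1 means "sample 100% of the rows").
    shuffled = df.sample(frac=1, random_state=seed)
    # Step 2: keep only the first row seen for each user.
    return shuffled.drop_duplicates(subset='User')

result = sample_one_per_user(df1, seed=0)
print(result.sort_index())
```

The result keeps the original index labels, so the sampled rows can be traced back to `df1`.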
Answered by TED Zhao
df1_user_sample_one = df1.groupby('User').apply(lambda x:x.sample(1))
This uses DataFrame.groupby(...).apply with a lambda function to sample one row per user.
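The `groupby(...).apply` approach above typically returns a frame indexed by both the user and the original row label. As a hedged alternative that is not part of the original answer, pandas 1.1 and later also offer a `sample` method directly on the grouped object, which draws rows per group and keeps the original index. The sketch below assumes such a pandas version; `random_state` is set only to make the run repeatable.

```python
import pandas as pd

df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5']})

# Draw n=1 random row from each user's group, without an explicit Python loop.
one_per_user = df1.groupby('User').sample(n=1, random_state=0)
print(one_per_user)
```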