How to “group by” multiple variables, eliminating duplicates, with Python pandas
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/24049604/
Asked by TNeykov
I have an input file with this sort of data:
**> Due to the large size of the input file, I need to keep only the unique userID-locationID pairs (some kind of preprocessing).**
userID locationID
1 loc1
1 loc2
1 loc3
2 loc1
3 loc4
3 loc3
3 loc1
I have to find how many distinct users have checked in at each location and get a new column with those values. I already tried this, but it is not what I need:
DataFrame({'count': df.groupby(["userID","locationID",]).size()}).reset_index()
Answered by Happy001
This should be what you are looking for, but I'm not sure if there's an easier way:
In [5]: df.groupby(['locID','userId']).last().groupby(level='locID').size()
Out[5]:
locID
loc1 3
loc2 1
loc3 2
loc4 1
dtype: int64
Taking the last row of each (locID, userId) group removes the duplicates.
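The same idea can also be written as an explicit deduplication step followed by a count. The following is only a minimal sketch: it assumes the column names from the question (`userID`, `locationID`) rather than the `userId`/`locID` names used in this answer, and it adds one duplicated row purely for illustration.

```python
import pandas as pd

# Sample data from the question, plus one duplicated (userID, locationID)
# row at the end (an assumption added here for illustration).
df = pd.DataFrame({
    "userID": [1, 1, 1, 2, 3, 3, 3, 3],
    "locationID": ["loc1", "loc2", "loc3", "loc1", "loc4", "loc3", "loc1", "loc1"],
})

# Keep one row per (userID, locationID) pair ...
unique_pairs = df.drop_duplicates(subset=["userID", "locationID"])

# ... then count the remaining rows for each location.
users_per_location = unique_pairs.groupby("locationID").size()

print(users_per_location)
```

Because the pairs are unique before counting, the duplicated row does not inflate the count for `loc1`.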
Answered by Andy Hayden
There's a Series (groupby) method just for this: nunique.
In [11]: df # Note the duplicated row I appended at the end
Out[11]:
userID locationID
0 1 loc1
1 1 loc2
2 1 loc3
3 2 loc1
4 3 loc4
5 3 loc3
6 3 loc1
7 3 loc1
In [12]: g = df.groupby('locationID')
In [13]: g['userID'].nunique()
Out[13]:
locationID
loc1 3
loc2 1
loc3 2
loc4 1
dtype: int64
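Since the question also asks for the counts as a new column, here is a hedged sketch of two ways that might be done with `nunique`. The column names `count` and `users_at_location` are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Same data as In [11] above, including the duplicated last row.
df = pd.DataFrame({
    "userID": [1, 1, 1, 2, 3, 3, 3, 3],
    "locationID": ["loc1", "loc2", "loc3", "loc1", "loc4", "loc3", "loc1", "loc1"],
})

# Option 1: one row per location, with the distinct-user count as a column.
counts = (
    df.groupby("locationID")["userID"]
    .nunique()
    .reset_index(name="count")  # "count" is a hypothetical column name
)

# Option 2: broadcast the per-location count back onto every original row.
df["users_at_location"] = df.groupby("locationID")["userID"].transform("nunique")

print(counts)
print(df)
```

Option 1 gives a summary table; option 2 keeps the original rows and is handy when the count is needed alongside each record.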
Answered by Aaron Hall
Solution:
df.groupby(['locationID']).size()
returns:
locationID
loc1 3
loc2 1
loc3 2
loc4 1
Try it for yourself:
import pandas
txt = '''userID locationID
1 loc1
1 loc2
1 loc3
2 loc1
3 loc4
3 loc3
3 loc1'''
listtxt = list(txt.splitlines())
columns = tuple(filter(None, listtxt.pop(0).split()))
vals = [tuple(filter(None, line.split())) for line in listtxt]
df = pandas.DataFrame(vals, columns=columns)
df now returns:
userID locationID
0 1 loc1
1 1 loc2
2 1 loc3
3 2 loc1
4 3 loc4
5 3 loc3
6 3 loc1
and
df.groupby(['locationID']).size()
returns:
locationID
loc1 3
loc2 1
loc3 2
loc4 1
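One caveat worth checking: `size()` counts rows, so it matches the number of distinct users only when the (userID, locationID) pairs are already unique, as they are in the sample above. The sketch below adds a duplicated row (an assumption for illustration, not part of this answer) to show where the two counts diverge.

```python
import pandas as pd

# Sample data plus one repeated (3, "loc1") row, added for illustration.
df = pd.DataFrame({
    "userID": [1, 1, 1, 2, 3, 3, 3, 3],
    "locationID": ["loc1", "loc2", "loc3", "loc1", "loc4", "loc3", "loc1", "loc1"],
})

# size() counts every row in the group, duplicates included.
rows_per_location = df.groupby("locationID").size()

# nunique() counts distinct userID values in the group.
users_per_location = df.groupby("locationID")["userID"].nunique()

print(rows_per_location["loc1"], users_per_location["loc1"])
```

If the input may contain repeated pairs, deduplicating first or using `nunique()` is the safer choice.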
Answered by alfasin
import pandas as pn
df = pn.DataFrame({'userId': pn.Series([1,1,1,2,3,3,3]),
                   'locID': pn.Series(['loc1', 'loc2', 'loc3', 'loc1', 'loc4', 'loc3','loc1'])})
# count() tallies the non-null userId values in each locID group
print(df.groupby(['locID']).count().userId)
Output:
locID
loc1 3
loc2 1
loc3 2
loc4 1

