Python: count unique values per group with pandas
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/38309729/
Count unique values with pandas per group
Asked by Arseniy Krupenin
I need to count the unique ID values in every domain. I have this data:
ID, domain
123, 'vk.com'
123, 'vk.com'
123, 'twitter.com'
456, 'vk.com'
456, 'facebook.com'
456, 'vk.com'
456, 'google.com'
789, 'twitter.com'
789, 'vk.com'
I tried df.groupby(['domain', 'ID']).count(), but I want to get:
domain, count
vk.com 3
twitter.com 2
facebook.com 1
google.com 1
Answered by jezrael
You need nunique:
df = df.groupby('domain')['ID'].nunique()
print (df)
domain
'facebook.com' 1
'google.com' 1
'twitter.com' 2
'vk.com' 3
Name: ID, dtype: int64
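For a self-contained check, the question's table can be rebuilt by hand. This is only a sketch: it assumes the domain strings carry no surrounding quote characters, and the variable name `counts` is illustrative.

```python
import pandas as pd

# Sample data copied from the question (quotes around the domains omitted by assumption).
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
})

# Number of distinct IDs seen in each domain.
counts = df.groupby("domain")["ID"].nunique()
print(counts)
```

Assigning the result to a new name, rather than back to `df`, keeps the original frame available for further queries.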
If you need to strip the ' characters:
df = df.ID.groupby([df.domain.str.strip("'")]).nunique()
print (df)
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
Name: ID, dtype: int64
Or, as Jon Clements commented:
df.groupby(df.domain.str.strip("'"))['ID'].nunique()
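The quote-stripping variant can be tried on a small frame as follows. The four rows here are a hypothetical subset made up for illustration, and `cleaned` and `counts` are assumed names, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample where each domain string includes literal single quotes.
df = pd.DataFrame({
    "ID": [123, 123, 456, 789],
    "domain": ["'vk.com'", "'twitter.com'", "'vk.com'", "'vk.com'"],
})

# Build the cleaned grouping key without modifying the original column.
cleaned = df["domain"].str.strip("'")

# Group by the cleaned key and count distinct IDs per group.
counts = df.groupby(cleaned)["ID"].nunique()
print(counts)
```

Passing a Series as the grouping key leaves `df` itself untouched, which may be preferable when the quoted values are needed elsewhere.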
You can retain the column name like this:
df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(df)
domain ID
0 fb 1
1 ggl 1
2 twitter 2
3 vk 3
The difference is that groupby(...)['ID'].nunique() returns a Series, while agg() with as_index=False returns a DataFrame.
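The Series versus DataFrame distinction can be checked directly. The snippet below is a sketch on a hypothetical four-row sample; the names `as_series`, `as_frame` and `converted` are assumptions for illustration, and the string alias "nunique" is used in place of pd.Series.nunique.

```python
import pandas as pd

# Hypothetical four-row sample.
df = pd.DataFrame({
    "ID": [123, 123, 456, 789],
    "domain": ["vk.com", "twitter.com", "vk.com", "vk.com"],
})

# Selecting one column and calling nunique() gives a Series indexed by domain.
as_series = df.groupby("domain")["ID"].nunique()

# as_index=False keeps 'domain' as a regular column, so the result is a DataFrame.
as_frame = df.groupby("domain", as_index=False).agg({"ID": "nunique"})

print(type(as_series).__name__)
print(type(as_frame).__name__)

# A Series result can be converted to the same two-column layout afterwards.
converted = as_series.reset_index()
print(converted)
```

If only the counts are needed for lookups, the Series form is usually enough; the DataFrame form is convenient when the result feeds a merge or an export.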
Answered by Psidom
Generally, to count how often each distinct value occurs in a single column, you can use Series.value_counts:
df.domain.value_counts()
#'vk.com' 5
#'twitter.com' 2
#'facebook.com' 1
#'google.com' 1
#Name: domain, dtype: int64
To see how many unique values a column has, use Series.nunique:
df.domain.nunique()
# 4
To get all these distinct values, you can use unique or drop_duplicates. The slight difference between the two is that unique returns a NumPy array, while drop_duplicates returns a pandas.Series:
df.domain.unique()
# array(["'vk.com'", "'twitter.com'", "'facebook.com'", "'google.com'"], dtype=object)
df.domain.drop_duplicates()
#0 'vk.com'
#2 'twitter.com'
#4 'facebook.com'
#6 'google.com'
#Name: domain, dtype: object
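The difference in return types can be verified with a short sketch. It assumes an object-dtype column, as in the output shown above (other dtypes may return a different array type from unique), and the sample values and variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Object dtype is set explicitly to mirror the dtype=object output in the answer.
domains = pd.Series(
    ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com"],
    dtype=object,
    name="domain",
)

# unique() returns only the values, in order of first appearance.
distinct_array = domains.unique()

# drop_duplicates() returns a Series that keeps the original index labels.
distinct_series = domains.drop_duplicates()

print(type(distinct_array).__name__)
print(list(distinct_series.index))
```

Keeping the index, as drop_duplicates does, is useful when the first occurrence of each value needs to be traced back to its row.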
As for this specific problem, since you'd like to count distinct values with respect to another variable, besides the groupby method provided by other answers here, you can also simply drop duplicates first and then call value_counts():
import pandas as pd
df.drop_duplicates().domain.value_counts()
# 'vk.com' 3
# 'twitter.com' 2
# 'facebook.com' 1
# 'google.com' 1
# Name: domain, dtype: int64
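As a cross-check, the drop-duplicates route and the groupby route can be compared on the question's data. This is a sketch under the assumption that the domains carry no quote characters; the `subset` argument is added here as a precaution for frames with extra columns and is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (quotes around the domains omitted by assumption).
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
})

# Keep one row per (ID, domain) pair, then count the remaining rows per domain.
via_value_counts = df.drop_duplicates(subset=["ID", "domain"])["domain"].value_counts()

# Same numbers via groupby, for comparison.
via_nunique = df.groupby("domain")["ID"].nunique()

# Sort both by domain label before comparing, since the two results are ordered differently.
same = bool((via_value_counts.sort_index() == via_nunique.sort_index()).all())
print(same)
```

Without `subset`, drop_duplicates compares whole rows, so any additional column with varying values would keep rows that should count as duplicates for this purpose.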
Answered by kamran kausar
>>> df.domain.value_counts()
vk.com 5
twitter.com 2
google.com 1
facebook.com 1
Name: domain, dtype: int64
Answered by ysearka
IIUC, you want the number of different IDs for every domain; then you can try this:
output = df.drop_duplicates()
output.groupby('domain').size()
Output:
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
dtype: int64
You could also use value_counts, which is slightly slower in the timings below. The fastest is jezrael's answer using nunique:
%timeit df.drop_duplicates().groupby('domain').size()
1000 loops, best of 3: 939 μs per loop
%timeit df.drop_duplicates().domain.value_counts()
1000 loops, best of 3: 1.1 ms per loop
%timeit df.groupby('domain')['ID'].nunique()
1000 loops, best of 3: 440 μs per loop
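Outside IPython, where %timeit is unavailable, a comparable measurement can be sketched with the standard-library timeit module. The sample frame, the `candidates` dictionary and the repeat count of 200 are assumptions for illustration; the absolute numbers will differ by machine, pandas version and data size.

```python
import timeit

import pandas as pd

# Sample data copied from the question (quotes around the domains omitted by assumption).
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
})

# The three approaches timed in the answer, wrapped as zero-argument callables.
candidates = {
    "drop_duplicates + groupby.size": lambda: df.drop_duplicates().groupby("domain").size(),
    "drop_duplicates + value_counts": lambda: df.drop_duplicates()["domain"].value_counts(),
    "groupby + nunique": lambda: df.groupby("domain")["ID"].nunique(),
}

runs = 200  # arbitrary repeat count; raise it for steadier averages
timings = {name: timeit.timeit(fn, number=runs) / runs for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.0f} µs per call")
```

On a nine-row frame the timings mostly reflect fixed overhead, so it may be worth repeating the measurement on data of realistic size before choosing an approach.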