Python: count unique values per group with pandas

Disclaimer: this page is a side-by-side Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/38309729/

Date: 2020-08-19 20:38:27  Source: igfitidea

Count unique values with pandas per groups

Tags: python, pandas, group-by, unique, pandas-groupby

Asked by Arseniy Krupenin

I need to count the unique ID values in every domain. I have this data:

ID, domain
123, 'vk.com'
123, 'vk.com'
123, 'twitter.com'
456, 'vk.com'
456, 'facebook.com'
456, 'vk.com'
456, 'google.com'
789, 'twitter.com'
789, 'vk.com'

I tried df.groupby(['domain', 'ID']).count(), but I want to get:

domain, count
vk.com   3
twitter.com   2
facebook.com   1
google.com   1

Answer by jezrael

You need nunique:


# Store the result under a new name so df stays available for later snippets.
counts = df.groupby('domain')['ID'].nunique()

print(counts)
domain
'facebook.com'    1
'google.com'      1
'twitter.com'     2
'vk.com'          3
Name: ID, dtype: int64
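
For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the question's sample data and applies nunique. The variable names (unique_ids, unique_ids_sorted) are my own, and the quotes around the domain values are left out for simplicity.

```python
import pandas as pd

# Sample data copied from the question (quote characters around domains omitted).
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
})

# Count the distinct IDs within each domain.
unique_ids = df.groupby("domain")["ID"].nunique()

# Optional: order by count, largest first, like the desired output in the question.
unique_ids_sorted = unique_ids.sort_values(ascending=False)
print(unique_ids_sorted)
```

The groupby result is indexed by domain in sorted order; the extra sort_values call is only needed if you want the largest groups first.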

If you need to strip the ' characters:

# Group the ID column by the domain with surrounding quotes removed.
counts = df.ID.groupby(df.domain.str.strip("'")).nunique()
print(counts)
domain
facebook.com    1
google.com      1
twitter.com     2
vk.com          3
Name: ID, dtype: int64

Or, as Jon Clements commented:

df.groupby(df.domain.str.strip("'"))['ID'].nunique()
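
A self-contained sketch of the quote-stripping variant follows. It assumes the data arrives as text exactly as posted in the question, so the domains still carry their single quotes after parsing; the names raw and stripped are mine.

```python
import io

import pandas as pd

# The question's data as raw text; the single quotes end up inside the values.
raw = """ID, domain
123, 'vk.com'
123, 'vk.com'
123, 'twitter.com'
456, 'vk.com'
456, 'facebook.com'
456, 'vk.com'
456, 'google.com'
789, 'twitter.com'
789, 'vk.com'
"""

# skipinitialspace drops the blank after each comma (header included).
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Strip the leading/trailing single quotes, then group by the cleaned values.
stripped = df["domain"].str.strip("'")
counts = df.groupby(stripped)["ID"].nunique()
print(counts)
```

Grouping by a Series instead of a column name leaves df itself unchanged, which is handy if you only need the cleaned labels for this one aggregation.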

You can retain the column name like this:


df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(df)
    domain  ID
0       fb   1
1      ggl   1
2  twitter   2
3       vk   3

The difference is that nunique() returns a Series and agg() returns a DataFrame.
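
To make the Series/DataFrame difference concrete, here is a small sketch on the question's data. The reset_index(name="count") step is my own addition, shown as one possible way to get the "domain, count" layout the question asks for.

```python
import pandas as pd

# Sample data copied from the question (quote characters omitted).
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
})

# Selecting one column and calling nunique() gives a Series indexed by domain.
as_series = df.groupby("domain")["ID"].nunique()

# agg() with as_index=False gives a DataFrame with 'domain' as a regular column.
as_frame = df.groupby("domain", as_index=False).agg({"ID": "nunique"})

# A Series can be turned into a DataFrame, naming the value column on the way.
renamed = as_series.reset_index(name="count")

print(type(as_series).__name__, type(as_frame).__name__)
print(renamed)
```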

Answer by Psidom

Generally, to count how often each distinct value occurs in a single column, you can use Series.value_counts:

df.domain.value_counts()

#'vk.com'          5
#'twitter.com'     2
#'facebook.com'    1
#'google.com'      1
#Name: domain, dtype: int64

To see how many unique values a column has, use Series.nunique:

df.domain.nunique()
# 4

To get all these distinct values, you can use unique or drop_duplicates. The slight difference between the two functions is that unique returns a numpy.array while drop_duplicates returns a pandas.Series:

df.domain.unique()
# array(["'vk.com'", "'twitter.com'", "'facebook.com'", "'google.com'"], dtype=object)

df.domain.drop_duplicates()
#0          'vk.com'
#2     'twitter.com'
#4    'facebook.com'
#6      'google.com'
#Name: domain, dtype: object
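
A quick way to check the return types yourself is sketched below. The Series is built with dtype=object on purpose (my assumption, to match the dtype shown above); for extension dtypes, unique can return a pandas ExtensionArray instead of a NumPy array.

```python
import numpy as np
import pandas as pd

# The domain column from the question, forced to object dtype (an assumption).
domain = pd.Series(
    ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
     "vk.com", "google.com", "twitter.com", "vk.com"],
    name="domain",
    dtype=object,
)

# unique() returns the distinct values in order of first appearance.
as_array = domain.unique()

# drop_duplicates() keeps the first occurrence of each value, with its original index label.
as_series = domain.drop_duplicates()

print(type(as_array).__name__)   # container type returned by unique()
print(list(as_series.index))     # row labels of the first occurrences
```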


As for this specific problem, since you'd like to count distinct values with respect to another variable, besides the groupby method provided by other answers here, you can also simply drop duplicates first and then call value_counts():

import pandas as pd
df.drop_duplicates().domain.value_counts()

# 'vk.com'          3
# 'twitter.com'     2
# 'facebook.com'    1
# 'google.com'      1
# Name: domain, dtype: int64
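
One caveat worth sketching: drop_duplicates() compares whole rows, so this approach only works as-is when the frame holds nothing but ID and domain. The visit column below is hypothetical (it is not in the question) and is there only to show how subset restores the intended result.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
    # Hypothetical extra column: it makes every row distinct.
    "visit": list(range(9)),
})

# Without subset nothing is dropped, so this counts rows per domain.
rows_per_domain = df.drop_duplicates()["domain"].value_counts()

# Restricting the comparison to (ID, domain) counts unique IDs per domain.
ids_per_domain = df.drop_duplicates(subset=["ID", "domain"])["domain"].value_counts()

print(rows_per_domain)
print(ids_per_domain)
```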

Answer by kamran kausar

>>> df.domain.value_counts()
vk.com          5
twitter.com     2
google.com      1
facebook.com    1
Name: domain, dtype: int64

Answer by ysearka

IIUC, you want the number of different IDs for every domain. In that case you can try this:

output = df.drop_duplicates()
output.groupby('domain').size()

output:


domain
facebook.com    1
google.com      1
twitter.com     2
vk.com          3
dtype: int64

You could also use value_counts, which is slightly less efficient. But the best is jezrael's answer using nunique:

%timeit df.drop_duplicates().groupby('domain').size()
1000 loops, best of 3: 939 μs per loop
%timeit df.drop_duplicates().domain.value_counts()
1000 loops, best of 3: 1.1 ms per loop
%timeit df.groupby('domain')['ID'].nunique()
1000 loops, best of 3: 440 μs per loop
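
If you are not in IPython, a rough equivalent of the comparison above can be sketched with the standard-library timeit module. The labels and the repeat/number settings are arbitrary choices of mine, and the absolute numbers will differ with pandas version, hardware and data size, so treat the output as indicative only.

```python
import timeit

import pandas as pd

# Sample data copied from the question (quote characters omitted).
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
})

# The three approaches timed above, wrapped as zero-argument callables.
candidates = {
    "drop_duplicates + groupby.size": lambda: df.drop_duplicates().groupby("domain").size(),
    "drop_duplicates + value_counts": lambda: df.drop_duplicates()["domain"].value_counts(),
    "groupby.nunique": lambda: df.groupby("domain")["ID"].nunique(),
}

# Best of 3 runs of 100 calls each, reported as seconds per call.
number = 100
timings = {
    name: min(timeit.repeat(fn, repeat=3, number=number)) / number
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.0f} µs per call")
```

On a nine-row frame the measurements are dominated by fixed overhead, so it is worth rerunning on data of realistic size before picking an approach for speed alone.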