Python Pandas: pivot table with aggfunc = count unique distinct

Notice: this page reproduces a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/12860421/

Date: 2020-08-18 12:01:56  Source: igfitidea

python pandas pivot-table

Asked by dmi

df2 = pd.DataFrame({'X' : ['X1', 'X1', 'X1', 'X1'], 'Y' : ['Y2','Y1','Y1','Y1'], 'Z' : ['Z3','Z1','Z1','Z2']})

    X   Y   Z
0  X1  Y2  Z3
1  X1  Y1  Z1
2  X1  Y1  Z1
3  X1  Y1  Z2

g=df2.groupby('X')

pd.pivot_table(g, values='X', rows='Y', cols='Z', margins=False, aggfunc='count')

Traceback (most recent call last): ... AttributeError: 'Index' object has no attribute 'index'

How do I get a pivot table with counts of the unique values of one DataFrame column for two other columns?
Is there an aggfunc for counting unique values? Should I be using np.bincount()?

NB. I am aware of Series.value_counts(); however, I need a pivot table.

EDIT: The output should be:

Z   Z1  Z2  Z3
Y             
Y1   1   1 NaN
Y2 NaN NaN   1

Accepted answer by Chang She

Do you mean something like this?

In [39]: df2.pivot_table(values='X', rows='Y', cols='Z', 
                         aggfunc=lambda x: len(x.unique()))
Out[39]: 
Z   Z1  Z2  Z3
Y             
Y1   1   1 NaN
Y2 NaN NaN   1

Note that using len assumes you don't have NAs in your DataFrame. Otherwise, you can use x.value_counts().count() or len(x.dropna().unique()).
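
As a rough sketch of the NA caveat above: unique() keeps NaN as a value, so len(x.unique()) counts it, while the two alternatives skip it. The Series `s` is made-up sample data, not from the question, and the pivot call uses the newer `index`/`columns` keyword names rather than `rows`/`cols`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value (not from the question).
s = pd.Series(['X1', 'X1', 'X2', np.nan])

with_na = len(s.unique())                    # NaN is counted as a distinct value
via_value_counts = s.value_counts().count()  # value_counts() drops NaN by default
via_dropna = len(s.dropna().unique())        # NaN is removed before counting

# The same NA-safe aggregation on the question's data,
# using the newer keyword names (index/columns).
df2 = pd.DataFrame({'X': ['X1', 'X1', 'X1', 'X1'],
                    'Y': ['Y2', 'Y1', 'Y1', 'Y1'],
                    'Z': ['Z3', 'Z1', 'Z1', 'Z2']})
table = df2.pivot_table(values='X', index='Y', columns='Z',
                        aggfunc=lambda x: len(x.dropna().unique()))
```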

Answer by Pablo Navarro

You can construct a pivot table for each distinct value of X. In this case,

for xval, xgroup in g:
    ptable = pd.pivot_table(xgroup, rows='Y', cols='Z', 
        margins=False, aggfunc=numpy.size)

will construct a pivot table for each value of X. You may want to index ptable using xval. With this code, I get (for X1):

     X        
Z   Z1  Z2  Z3
Y             
Y1   2   1 NaN
Y2 NaN NaN   1
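
One way to index the per-group tables by the value of X, as suggested above, is to collect them in a dict. This is only a sketch: the `tables` dict is a hypothetical name, the call uses the newer `index`/`columns` keywords, and `aggfunc='count'` is swapped in for numpy.size (it counts non-NA rows, which is the same thing here because the sample data has no missing values).

```python
import pandas as pd

df2 = pd.DataFrame({'X': ['X1', 'X1', 'X1', 'X1'],
                    'Y': ['Y2', 'Y1', 'Y1', 'Y1'],
                    'Z': ['Z3', 'Z1', 'Z1', 'Z2']})

# Hypothetical container: one pivot table per distinct value of X,
# keyed by that value.
tables = {}
for xval, xgroup in df2.groupby('X'):
    # Count the rows of this group for each (Y, Z) combination.
    tables[xval] = xgroup.pivot_table(values='X', index='Y', columns='Z',
                                      aggfunc='count')

ptable_x1 = tables['X1']
```

Note that this counts rows, not distinct values, which is why Y1/Z1 shows 2 in the output above.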

Answer by julian peng

This is a good way of counting entries within .pivot_table:

df2.pivot_table(values='X', index=['Y','Z'], columns='X', aggfunc='count')


        X1   X2
Y   Z
Y1  Z1   1    1
    Z2   1  NaN
Y2  Z3   1  NaN

Answer by Manavalan Gajapathy

aggfunc=pd.Series.nunique provides a distinct count.

Credit to @hume for this solution (see comment under the accepted answer). Adding as answer here for better discoverability.

Answer by Javier

Since at least pandas 0.16, pivot_table no longer accepts the "rows" parameter.

As of 0.23, the solution would be:

df2.pivot_table(values='X', index='Y', columns='Z', aggfunc=pd.Series.nunique)

which returns:

Z    Z1   Z2   Z3
Y                
Y1  1.0  1.0  NaN
Y2  NaN  NaN  1.0
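
If the NaN cells in the result above are unwanted, pivot_table's fill_value parameter can replace them. The sketch below assumes that 0 is an acceptable stand-in for "no rows for this combination", which depends on how the table is used afterwards.

```python
import pandas as pd

df2 = pd.DataFrame({'X': ['X1', 'X1', 'X1', 'X1'],
                    'Y': ['Y2', 'Y1', 'Y1', 'Y1'],
                    'Z': ['Z3', 'Z1', 'Z1', 'Z2']})

# Same distinct-count pivot, with empty (Y, Z) combinations filled with 0.
# Assumption: 0 is a sensible value for combinations with no rows.
table = df2.pivot_table(values='X', index='Y', columns='Z',
                        aggfunc=pd.Series.nunique, fill_value=0)
```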

Answer by Benoit Drogou

Since none of the answers are up to date with the latest version of pandas, I am writing another solution for this problem:

In [1]:
import pandas as pd

# Set up the example
df2 = pd.DataFrame({'X' : ['X1', 'X1', 'X1', 'X1'], 'Y' : ['Y2','Y1','Y1','Y1'], 'Z' : ['Z3','Z1','Z1','Z2']})

# Pivot
pd.crosstab(index=df2['Y'], columns=df2['Z'], values=df2['X'], aggfunc=pd.Series.nunique)

Out [1]:
Z    Z1   Z2   Z3
Y
Y1  1.0  1.0  NaN
Y2  NaN  NaN  1.0

Answer by grisaitis

For best performance, I recommend doing DataFrame.drop_duplicates followed by aggfunc='count'.

Others are correct that aggfunc=pd.Series.nunique will work. This can be slow, however, if the number of index groups you have is large (>1000).

So instead of (to quote @Javier)

df2.pivot_table('X', 'Y', 'Z', aggfunc=pd.Series.nunique)

I suggest

df2.drop_duplicates(['X', 'Y', 'Z']).pivot_table('X', 'Y', 'Z', aggfunc='count')

This works because it guarantees that every subgroup (each combination of ('Y', 'Z')) will have unique (non-duplicate) values of 'X'.
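
A quick way to convince yourself that the two approaches agree is to build both tables and compare them. This is a sketch on the question's small sample data, written with explicit keyword arguments. It checks equivalence only and says nothing about speed. The names `slow` and `fast` are illustrative labels, not measurements.

```python
import pandas as pd

df2 = pd.DataFrame({'X': ['X1', 'X1', 'X1', 'X1'],
                    'Y': ['Y2', 'Y1', 'Y1', 'Y1'],
                    'Z': ['Z3', 'Z1', 'Z1', 'Z2']})

# Distinct count computed per (Y, Z) group.
slow = df2.pivot_table(values='X', index='Y', columns='Z',
                       aggfunc=pd.Series.nunique)

# Drop duplicate (X, Y, Z) rows first, then a plain count per group.
fast = (df2.drop_duplicates(['X', 'Y', 'Z'])
           .pivot_table(values='X', index='Y', columns='Z', aggfunc='count'))

# Raises AssertionError if the two tables differ (dtypes are not compared).
pd.testing.assert_frame_equal(slow, fast, check_dtype=False)
```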
