Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/27002926/

Date: 2020-09-13 22:40:48  Source: igfitidea

Number of unique values per column by group

Tags: python, pandas

Asked by Amelio Vazquez-Reina

Consider the following dataframe:

      A      B  E
0   bar    one  1
1   bar  three  1
2  flux    six  1
3  flux  three  2
4   foo   five  2
5   foo    one  1
6   foo    two  1
7   foo    two  2
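
To follow along, the frame above can be rebuilt as below. This is a minimal sketch: the values are copied from the printed table, and the column dtypes (strings for A and B, integers for E) are an assumption.

```python
import pandas as pd

# Rebuild the example frame; values are copied from the printed table.
# Assumption: A and B hold strings, E holds integers.
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "flux", "foo", "foo", "foo", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two", "two"],
    "E": [1, 1, 1, 2, 2, 1, 1, 2],
})
print(df)
```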

I would like to find, for each value of A, the number of unique values in the other columns.

  1. I thought the following would do it:

    df.groupby('A').apply(lambda x: x.nunique())
    

    but I get an error:

    AttributeError: 'DataFrame' object has no attribute 'nunique'
    
  2. I also tried with:

    df.groupby('A').nunique()
    

    but I also got the error:

    AttributeError: 'DataFrameGroupBy' object has no attribute 'nunique'
    
  3. Finally I tried with:

    df.groupby('A').apply(lambda x: x.apply(lambda y: y.nunique()))
    

    which returns:

          A  B  E
    A            
    bar   1  2  1
    flux  1  2  2
    foo   1  3  2
    

    and seems to be correct. Strangely though, it also returns the column A in the result. Why?

Accepted answer by huu

In the pandas version used in the question, the DataFrame object doesn't have nunique; only Series does. You have to pick the column you want to apply nunique() to. You can do this with simple attribute (dot) access:

df.groupby('A').apply(lambda x: x.B.nunique())

will print:

A
bar     2
flux    2
foo     3

And doing:

df.groupby('A').apply(lambda x: x.E.nunique())

will print:

A
bar     1
flux    2
foo     2
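
The same per-column counts can probably be obtained without a lambda by selecting the column on the grouped object first, since a grouped Series exposes nunique directly. A minimal sketch, with the example frame rebuilt from the table in the question (the names b_counts and e_counts are illustrative):

```python
import pandas as pd

# Example frame rebuilt from the question's table.
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "flux", "foo", "foo", "foo", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two", "two"],
    "E": [1, 1, 1, 2, 2, 1, 1, 2],
})

# Select one column from the groupby, then count distinct values per group.
b_counts = df.groupby("A")["B"].nunique()
e_counts = df.groupby("A")["E"].nunique()

print(b_counts)
print(e_counts)
```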

Alternatively you can do this with one function call using:

df.groupby('A').aggregate({'B': lambda x: x.nunique(), 'E': lambda x: x.nunique()})

which will print:

      B  E
A
bar   2  1
flux  2  2
foo   3  2
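
As a shorter variant of the aggregate call, the lambdas can be replaced by the method name as a string. This is a sketch that assumes a pandas version whose agg accepts method-name strings in a dict; the name counts is illustrative.

```python
import pandas as pd

# Example frame rebuilt from the question's table.
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "flux", "foo", "foo", "foo", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two", "two"],
    "E": [1, 1, 1, 2, 2, 1, 1, 2],
})

# Map each column to the name of the Series method to apply per group.
counts = df.groupby("A").agg({"B": "nunique", "E": "nunique"})
print(counts)
```

Because only B and E are listed in the dict, the grouping column A appears only as the index of the result.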

To answer your question about why your nested lambda prints the A column as well: when you do a groupby/apply operation, you iterate through three DataFrame objects. Each DataFrame object is a sub-DataFrame of the original. Applying an operation to it applies the operation to each of its Series. Each DataFrame has three Series (A, B, and E) that you're applying nunique() to.

The first Series evaluated in each DataFrame is the A Series, and since you've done a groupby on A, you know that in each DataFrame there is only one unique value in the A Series. This explains why you're ultimately given an A result column with all 1's.
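
The explanation above can be checked by iterating over the groups by hand. The sketch below rebuilds the example frame from the question's table and applies nunique column by column to each sub-frame; the names sub, per_column, and a_counts are illustrative.

```python
import pandas as pd

# Example frame rebuilt from the question's table.
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "flux", "foo", "foo", "foo", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two", "two"],
    "E": [1, 1, 1, 2, 2, 1, 1, 2],
})

a_counts = {}
for key, sub in df.groupby("A"):
    # Each sub-frame still contains the grouping column A,
    # so the column-wise apply visits A as well as B and E.
    per_column = sub.apply(lambda col: col.nunique())
    a_counts[key] = per_column["A"]
    print(key, list(sub.columns), per_column.tolist())
```

Within a group every row shares the same value of A, so the count for A is 1 in each sub-frame.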

Answer by Aswitha Visvesvaran

I encountered the same problem. Upgrading pandas to the latest version solved the problem for me.

df.groupby('A').nunique()

The above code did not work for me in Pandas version 0.19.2. I upgraded it to Pandas version 0.21.1 and it worked.

You can check the version using the following code:

import pandas as pd
print('Pandas version ' + pd.__version__)
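
If the code has to run on both old and new installations, one option is to test for the method instead of comparing version strings. This is only a sketch: the fallback branch and the names grouped and counts are illustrative, and the explicit selection of B and E is an assumption made to keep the grouping column out of the result regardless of version.

```python
import pandas as pd

# Example frame rebuilt from the question's table.
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "flux", "foo", "foo", "foo", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two", "two"],
    "E": [1, 1, 1, 2, 2, 1, 1, 2],
})

# Select the value columns explicitly so A is only used as the group key.
grouped = df.groupby("A")[["B", "E"]]

if hasattr(grouped, "nunique"):
    # Newer pandas: the grouped frame exposes nunique directly.
    counts = grouped.nunique()
else:
    # Older pandas: fall back to aggregating each column with Series.nunique.
    counts = grouped.agg(lambda s: s.nunique())

print(counts)
```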