Python: When is it appropriate to use df.value_counts() vs df.groupby('...').count()?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/47487753/

Time: 2020-08-19 18:13:57  Source: igfitidea

When is it appropriate to use df.value_counts() vs df.groupby('...').count()?

Tags: python, pandas, dataframe, pandas-groupby

Asked by Ollie Khakwani

I've heard that in Pandas there are often multiple ways to do the same thing, but I was wondering:

If I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts()?

Accepted answer by jezrael

There is a difference. For value_counts, the documentation says of the return value:

The resulting object will be in descending order so that the first element is the most frequently-occurring element.

count does not do this; it sorts the output by the index (created from the column passed to groupby('col')).
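The ordering difference can be seen in a minimal sketch. The data below is made up for illustration (it is not the answer's sample), chosen so that the most frequent value does not come first alphabetically:

```python
import pandas as pd

# Hypothetical data: 'b' is the most frequent value but sorts after 'a'.
df = pd.DataFrame({'colA': ['c', 'b', 'b', 'a', 'b', 'c'],
                   'colB': range(6)})

by_frequency = df['colA'].value_counts()        # most frequent value first
by_index = df.groupby('colA')['colA'].count()   # sorted by group key

print(list(by_frequency.index))   # ['b', 'c', 'a']
print(list(by_index.index))       # ['a', 'b', 'c']

# Sorting the value_counts result by its index gives the same counts
# in the same order as the groupby version.
print(by_frequency.sort_index().tolist() == by_index.tolist())   # True
```

If only the ordering matters, calling sort_index() on the value_counts result is usually enough to make the two outputs line up.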



df.groupby('colA').count() 

aggregates all columns of df with the count function, so it counts the values in each column, excluding NaNs.

So if you need to count only one column, use:

df.groupby('colA')['colA'].count() 

Sample:


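import numpy as np
import pandas as pd

# colA is listed first so the column order matches the output below
df = pd.DataFrame({'colA':['c','c','b','a',np.nan,'b','b'],
                   'colB':list('abcdefg'),
                   'colC':[1,3,5,7,np.nan,np.nan,4],
                   'colD':[np.nan,3,6,9,2,4,np.nan]})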

print (df)
  colA colB  colC  colD
0    c    a   1.0   NaN
1    c    b   3.0   3.0
2    b    c   5.0   6.0
3    a    d   7.0   9.0
4  NaN    e   NaN   2.0
5    b    f   NaN   4.0
6    b    g   4.0   NaN

print (df['colA'].value_counts())
b    3
c    2
a    1
Name: colA, dtype: int64

print (df.groupby('colA').count())
      colB  colC  colD
colA                  
a        1     1     1
b        3     2     2
c        2     2     1

print (df.groupby('colA')['colA'].count())
colA
a    1
b    3
c    2
Name: colA, dtype: int64
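The sample above also shows that the row whose colA is NaN is dropped by both approaches. The following sketch, which reuses two columns of the sample, shows one way to keep it and how count() differs from size(). The dropna=False argument and the size() method are not part of the original answer; they are brought in here only for comparison:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': ['c', 'c', 'b', 'a', np.nan, 'b', 'b'],
                   'colC': [1, 3, 5, 7, np.nan, np.nan, 4]})

# value_counts skips NaN by default; dropna=False adds a NaN entry.
with_nan = df['colA'].value_counts(dropna=False)
print(with_nan.sum())              # 7, every row is counted

# size() counts rows per group; count() counts non-null values per column.
rows_per_group = df.groupby('colA').size()
non_null_colC = df.groupby('colA')['colC'].count()
print(rows_per_group['b'], non_null_colC['b'])   # 3 2
```

Group 'b' has three rows but only two non-null colC values, which is why the per-column counts in the groupby output can differ from the value_counts frequencies.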

Answer by Bharath

groupby and value_counts are totally different functions. As a Series method, value_counts cannot be applied to a whole DataFrame (pandas 1.1.0 later added a separate DataFrame.value_counts, which counts unique rows).

value_counts is limited to a single column or Series, and its sole purpose is to return a Series of the frequencies of the values.
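For counting combinations across several columns, a rough sketch follows. It assumes pandas 1.1 or later, where DataFrame.value_counts exists and counts unique rows. The data is made up for illustration:

```python
import pandas as pd

# Hypothetical data with one repeated (color, size) pair.
df = pd.DataFrame({'color': ["r", "r", "b", "g", "r"],
                   'size': [1, 1, 2, 1, 3]})

# Assumes pandas >= 1.1, where DataFrame.value_counts counts unique rows.
row_counts = df.value_counts()
# The groupby counterpart: number of rows per (color, size) group.
pair_counts = df.groupby(['color', 'size']).size()

print(row_counts.loc[('r', 1)])    # 2
print(pair_counts.loc[('r', 1)])   # 2
```

Both results are indexed by (color, size) pairs. As with the single-column case, the value_counts result is ordered by frequency and the groupby result by group key.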

groupby returns an object on which you can perform statistical computations. So df.groupby(col).count() returns the number of non-null values in each of the other columns, for each group of the column(s) passed to groupby.

When should value_counts be used and when should groupby.count be used? Let's take an example:

import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]})

Groupby count:


df.groupby('color').count()
       id  size
color          
b       2     2
g       2     2
r       3     3

groupby count is generally used to get the number of valid values present in all the columns with respect to one or more specified columns, so NaN (not a number) values are excluded.

To find the frequency using groupby, you need to aggregate against the specified column itself, as @jez did. (Perhaps value_counts was implemented to avoid this and make developers' lives easier.)
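As a minimal sketch of that point, the example DataFrame from this answer can be counted three ways. The size() variant is not mentioned in the answer and is added here only as another common option:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4],
                   'color': ["r", "r", "b", "b", "g", "g", "r"],
                   'size': [1, 2, 1, 2, 1, 3, 4]})

# Three ways to get the frequency of each color.
via_value_counts = df['color'].value_counts()
via_count = df.groupby('color')['color'].count()
via_size = df.groupby('color').size()

# Compare as plain dicts so that ordering and Series names do not matter.
same = via_value_counts.to_dict() == via_count.to_dict() == via_size.to_dict()
print(same)   # True
```

The three agree here because the color column has no missing values. With NaNs in other columns, only the per-column count() results would change.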

Value Counts:


df['color'].value_counts()

r    3
g    2
b    2
Name: color, dtype: int64

value_counts is generally used for finding the frequency of the values present in one particular column.

In conclusion:

.groupby(col).count() should be used when you want to find the number of valid (non-null) values present in the columns with respect to the specified col.

.value_counts() should be used to find the frequencies of the values in a Series.