When is it appropriate to use df.value_counts() vs df.groupby('...').count()?
Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/47487753/
Asked by Ollie Khakwani
I've heard that in Pandas there are often multiple ways to do the same thing, but I was wondering:
If I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts()?
Accepted answer by jezrael
There is a difference. According to its documentation, value_counts returns:
"The resulting object will be in descending order so that the first element is the most frequently-occurring element."
count does not do this; it sorts the output by the index (which is created from the column passed to groupby('col')).
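The ordering difference described above can be seen in a minimal sketch. The data here is hypothetical (not from the original answer) and is chosen so that the most frequent label is also the last one alphabetically.

```python
import pandas as pd

# Hypothetical data: 'c' is the most frequent value but sorts last by label.
df = pd.DataFrame({'colA': ['c', 'c', 'c', 'a', 'b', 'b']})

by_frequency = df['colA'].value_counts()          # ordered by count, descending
by_label = df.groupby('colA')['colA'].count()     # ordered by group label (the index)

print(list(by_frequency.index))  # ['c', 'b', 'a']
print(list(by_label.index))      # ['a', 'b', 'c']
```

If only the ordering matters, calling sort_index() on the value_counts result gives the same order as the groupby version.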
df.groupby('colA').count() aggregates all the other columns of df with the count function, so for each group it counts the values in every column, excluding NaNs.
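A small sketch of the NaN behaviour, using made-up data. It also uses two calls that the answer does not mention and that are included here only for contrast: GroupBy.size(), which counts rows per group regardless of NaNs, and the dropna=False option of Series.value_counts.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a NaN in the grouping column and in a value column.
df = pd.DataFrame({'colA': ['x', 'x', 'y', np.nan],
                   'colC': [1.0, np.nan, 3.0, 4.0]})

# count(): non-NaN values per column within each group.
# Group 'x' has two rows but only one non-NaN colC value.
counts = df.groupby('colA').count()
print(counts)

# size(): number of rows per group, NaNs in other columns included.
sizes = df.groupby('colA').size()
print(sizes)

# The row whose colA is NaN belongs to no group by default; value_counts
# skips it too unless dropna=False is passed.
without_nan = df['colA'].value_counts()
with_nan = df['colA'].value_counts(dropna=False)
print(without_nan.sum(), with_nan.sum())  # 3 4
```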
So if you need to count only one column, use:
df.groupby('colA')['colA'].count()
Sample:
import numpy as np
import pandas as pd

# colA is listed first so the column order matches the printed output
df = pd.DataFrame({'colA':['c','c','b','a',np.nan,'b','b'],
                   'colB':list('abcdefg'),
                   'colC':[1,3,5,7,np.nan,np.nan,4],
                   'colD':[np.nan,3,6,9,2,4,np.nan]})
print (df)
  colA colB  colC  colD
0    c    a   1.0   NaN
1    c    b   3.0   3.0
2    b    c   5.0   6.0
3    a    d   7.0   9.0
4  NaN    e   NaN   2.0
5    b    f   NaN   4.0
6    b    g   4.0   NaN

# frequencies of colA, most frequent first
print (df['colA'].value_counts())
b    3
c    2
a    1
Name: colA, dtype: int64

# non-NaN values of every other column, per group, sorted by index
print (df.groupby('colA').count())
      colB  colC  colD
colA
a        1     1     1
b        3     2     2
c        2     2     1

# count of colA only, sorted by index
print (df.groupby('colA')['colA'].count())
colA
a    1
b    3
c    2
Name: colA, dtype: int64
Answer by Bharath
groupby and value_counts are totally different functions. value_counts is primarily a Series method; when this answer was written it could not be called on a DataFrame (DataFrame.value_counts, which counts unique rows, was only added in pandas 1.1).
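For readers on pandas 1.1 or later, where DataFrame.value_counts exists, the sketch below shows what it does: it counts unique rows (combinations of column values) rather than the values of one column. The data is hypothetical and the version requirement is an assumption about the reader's environment.

```python
import pandas as pd

# Hypothetical data: the row ('r', 1) appears twice.
df = pd.DataFrame({'color': ['r', 'r', 'b'],
                   'size': [1, 1, 2]})

# Requires pandas >= 1.1. The result is a Series indexed by a MultiIndex
# of (color, size) combinations, most frequent combination first.
row_counts = df.value_counts()
print(row_counts)

# Frequency of a single column still goes through the Series method.
color_counts = df['color'].value_counts()
print(color_counts)
```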
value_counts is limited to a single column or Series, and its sole purpose is to return a Series containing the frequencies of the values.
groupby returns an object on which statistical computations can be performed. So df.groupby(col).count() returns the number of non-null values present in each of the remaining columns, for each group of the column(s) specified in groupby.
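To illustrate the point that groupby returns an object that supports other computations besides count, here is a minimal sketch on made-up data. The sum and mean calls are standard GroupBy aggregations chosen for illustration; they are not part of the original answer.

```python
import pandas as pd

# Hypothetical data, similar in shape to the example used in this answer.
df = pd.DataFrame({'color': ['r', 'r', 'b', 'b', 'g', 'g', 'r'],
                   'size': [1, 2, 1, 2, 1, 3, 4]})

grouped = df.groupby('color')   # a GroupBy object; nothing is computed yet

size_counts = grouped['size'].count()   # non-null values per group
size_sums = grouped['size'].sum()       # total per group
size_means = grouped['size'].mean()     # average per group

print(size_counts.to_dict())  # {'b': 2, 'g': 2, 'r': 3}
print(size_sums.to_dict())    # {'b': 3, 'g': 4, 'r': 7}
print(size_means.to_dict())
```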
When should value_counts be used, and when should groupby.count be used?
Let's take an example:
df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]})
Groupby count:
df.groupby('color').count()
       id  size
color
b       2     2
g       2     2
r       3     3
Groupby count is generally used to get the number of valid (non-null) values present in all the other columns, with respect to the one or more columns specified in groupby. So NaN (not a number) values are excluded.
To find the frequency using groupby, you need to aggregate against the specified column itself, as @jez did. (Perhaps value_counts was implemented to avoid this and make developers' lives easier.)
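As a sketch of aggregating against the grouped column itself, the block below reuses this answer's example data and checks that the result carries the same counts as value_counts. The variable names are illustrative. The comparison goes through dictionaries because the two results are ordered differently.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4],
                   'color': ["r", "r", "b", "b", "g", "g", "r"],
                   'size': [1, 2, 1, 2, 1, 3, 4]})

# Frequency via groupby: select the grouped column itself, then count.
freq_groupby = df.groupby('color')['color'].count()

# Frequency via value_counts: one call on the column.
freq_value_counts = df['color'].value_counts()

print(freq_groupby.to_dict())  # {'b': 2, 'g': 2, 'r': 3}

# Same counts per label, regardless of ordering.
same_counts = freq_groupby.to_dict() == freq_value_counts.to_dict()
print(same_counts)  # True
```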
Value Counts:
df['color'].value_counts()
r    3
g    2
b    2
Name: color, dtype: int64
value_counts is generally used to find the frequency of the values present in one particular column.
In conclusion:
.groupby(col).count() should be used when you want to find the number of valid (non-null) values present in the other columns with respect to the specified col.
.value_counts() should be used to find the frequencies of the values in a Series.