Original question: http://stackoverflow.com/questions/36067894/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow and keep the same license.
Pandas, Get count of a single value in a Column of a Dataframe
Asked by Randhawa
Using pandas, I would like to get the count of a specific value in a column. I know that df.somecolumn.ravel() will give me all the unique values and their counts, but how do I get the count of one specific value?
In[5]:df
Out[5]:
col
1
1
1
1
2
2
2
1
Desired:
To get the count of 1:
In[6]:df.somecalulation(1)
Out[6]: 5
To get the count of 2:
In[6]:df.somecalulation(2)
Out[6]: 3
Answered by jezrael
You can try value_counts:
# Store the result under a new name so that df itself stays unchanged
counts = df['col'].value_counts().reset_index()
counts.columns = ['col', 'count']
print(counts)
   col  count
0    1      5
1    2      3
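For a self-contained version of the above, the sketch below rebuilds the question's column (five 1s and three 2s) and calls value_counts on it. The names df and col_counts are illustrative only.

```python
import pandas as pd

# Rebuild the example column from the question: five 1s and three 2s.
df = pd.DataFrame({'col': [1, 1, 1, 1, 2, 2, 2, 1]})

# value_counts() returns a Series indexed by the distinct values,
# sorted by descending count.
col_counts = df['col'].value_counts()

print(col_counts)
```

Indexing the result, for example col_counts[1], then gives the count for a single value.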
EDIT:
To count a single value, compare the column with that value and sum the resulting boolean mask:
print((df['col'] == 1).sum())
5
Or:
def somecalulation(x):
    # Count the rows of df['col'] that are equal to x
    return (df['col'] == x).sum()
print(somecalulation(1))
5
print(somecalulation(2))
3
Or:
ser = df['col'].value_counts()
def somecalulation(s, x):
    # Look up the count of x in the value_counts() result
    return s[x]
print(somecalulation(ser, 1))
5
print(somecalulation(ser, 2))
3
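One caveat with the lookup s[x]: it raises a KeyError when x never occurs in the column. Below is a minimal sketch of a more forgiving variant, assuming you want 0 for absent values. The helper name count_value is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 1, 1, 1, 2, 2, 2, 1]})
ser = df['col'].value_counts()

def count_value(s, x):
    # Hypothetical helper: Series.get returns the default (0 here)
    # instead of raising KeyError when x is not in the index.
    return int(s.get(x, 0))

print(count_value(ser, 1))
print(count_value(ser, 2))
print(count_value(ser, 3))  # 3 does not occur in the column
```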
EDIT2:
If you need something really fast, use numpy.in1d (in recent NumPy releases, numpy.isin is the recommended equivalent):
import pandas as pd
import numpy as np
a = pd.Series([1, 1, 1, 1, 2, 2])
# for testing, repeat the Series so that len(a) = 6000
a = pd.concat([a]*1000).reset_index(drop=True)
print(np.in1d(a, 1).sum())
4000
print((a == 1).sum())
4000
print(np.sum(a == 1))
4000
Timings:
len(a)=6:
In [131]: %timeit np.in1d(a,1).sum()
The slowest run took 9.17 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 29.9 μs per loop
In [132]: %timeit np.sum(a == 1)
10000 loops, best of 3: 196 μs per loop
In [133]: %timeit (a == 1).sum()
1000 loops, best of 3: 180 μs per loop
len(a)=6000:
In [135]: %timeit np.in1d(a,1).sum()
The slowest run took 7.29 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 48.5 μs per loop
In [136]: %timeit np.sum(a == 1)
The slowest run took 5.23 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 273 μs per loop
In [137]: %timeit (a == 1).sum()
1000 loops, best of 3: 271 μs per loop
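The timings above come from an IPython session and will vary by machine and library version. To reproduce the comparison outside IPython, a rough sketch using the standard timeit module follows. It uses np.isin in place of np.in1d, which assumes NumPy 1.13 or later, and it implies no particular timings.

```python
import timeit

import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, 1, 2, 2])
a = pd.concat([a] * 1000).reset_index(drop=True)  # len(a) == 6000

# The three approaches compared in the answer, wrapped as callables.
candidates = {
    'np.isin(a, 1).sum()': lambda: np.isin(a, 1).sum(),
    'np.sum(a == 1)': lambda: np.sum(a == 1),
    '(a == 1).sum()': lambda: (a == 1).sum(),
}

results = {}
for label, func in candidates.items():
    results[label] = int(func())  # each approach should give the same count
    seconds = timeit.timeit(func, number=1000)
    print(f'{label}: {seconds / 1000 * 1e6:.1f} us per loop, count = {results[label]}')
```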
Answered by Ami Tavory
If you keep the result of value_counts, you can query it for multiple values:
import pandas as pd
a = pd.Series([1, 1, 1, 1, 2, 2])
counts = a.value_counts()
>>> counts[1], counts[2]
(4, 2)
However, to count only a single item, it would be faster to use:
import numpy as np
np.sum(a == 1)
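When querying several values at once, some of them may be missing from the Series. A small sketch, assuming you want a count of 0 for those, uses reindex with fill_value on the value_counts result. The list wanted is illustrative.

```python
import pandas as pd

a = pd.Series([1, 1, 1, 1, 2, 2])
counts = a.value_counts()

# Values to look up; 3 does not occur in the Series.
wanted = [1, 2, 3]

# reindex aligns the counts to the requested labels and fills
# labels that are absent with 0 instead of NaN.
selected = counts.reindex(wanted, fill_value=0)
print(selected)
```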
Answered by Kalpana
Get the total count (the number of non-null values) in a column:
column = df['specific_column']
column.count()
Get the count of values that satisfy a condition (here, values greater than 0):
column.loc[column > 0].count()
The condition refers to the column variable directly, so no quotes ('') are needed around its name.
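Note that count() skips missing values, and that counting one specific value can be done with an equality mask in the same way. A short sketch with made-up data follows; the column name specific_column comes from the answer, while the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'specific_column': [3, 0, 3, np.nan, 5]})
column = df['specific_column']

total = column.count()                     # non-null values only
positive = column.loc[column > 0].count()  # values greater than 0
threes = (column == 3).sum()               # occurrences of the value 3

print(total, positive, threes)
```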