Find and select the most frequent data of a column in a pandas DataFrame
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/21082671/
Asked by user1345283
I have a dataframe with the following column:
file['DirViento']
Fecha
2011-01-01 ENE
2011-01-02 ENE
2011-01-03 ENE
2011-01-04 NNE
2011-01-05 ENE
2011-01-06 ENE
2011-01-07 ENE
2011-01-08 ENE
2011-01-09 NNE
2011-01-10 ENE
2011-01-11 ENE
2011-01-12 ENE
2011-01-13 ESE
2011-01-14 ENE
2011-01-15 ENE
...
2011-12-17 ENE
2011-12-18 ENE
2011-12-19 ENE
2011-12-20 ENE
2011-12-21 ENE
2011-12-22 ENE
2011-12-23 ENE
2011-12-24 ENE
2011-12-25 ENE
2011-12-26 ESE
2011-12-27 ENE
2011-12-28 NE
2011-12-29 ENE
2011-12-30 NNE
2011-12-31 ENE
Name: DirViento, Length: 290, dtype: object
The column has daily records of wind direction for each month of the year. I'm trying to get the dominant direction for each month. To do this, I select the value repeated most often during each month:
file['DirViento'].groupby(lambda x: x.month).value_counts()
1   ENE    23
    NNE     6
    E       1
    ESE     1
2   ENE    21
    NNO     3
    NNE     2
    NE      1
3   ENE    21
    OSO     1
    ESE     1
    SSE     1
4   ENE    21
    NNE     2
    ESE     1
    NNO     1
6   ENE    15
    ESE     2
    SSE     2
    ONO     1
    E       1
7   ENE    22
    ONO     1
    OSO     1
    NE      1
    NNE     1
    NNO     1
8   ENE    23
    NNE     5
    NE      1
    ONO     1
    ESE     1
9   ENE    17
    NNE     7
    ONO     2
    NE      1
    E       1
    ESE     1
    NNO     1
10  ENE    16
    NNE     2
    ESE     2
    NNO     2
    ONO     1
    NE      1
    E       1
11  ENE    13
    NNE     2
    ESE     2
    ONO     1
12  ENE    26
    NNE     3
    NE      1
    ESE     1
Length: 54, dtype: int64
When I run the following line of code:
wind_moda=file['DirViento'].groupby(lambda x: x.month).agg(lambda x: stats.mode(x)[0][0])
I should get something like this:
1 ENE
2 ENE
3 ENE
4 ENE
6 ENE
7 ENE
8 ENE
9 ENE
10 ENE
11 ENE
12 ENE
But I get the following:
1 E
2 ENE
3 ENE
4 ENE
6 E
7 ENE
8 ENE
9 E
10 E
11 ENE
12 ENE
Why is the most frequent value not being picked in 4 of the 12 months?
Am I doing something wrong?
Any ideas on how to get the most common value for each month?
Answered by Dan Allan
This is not as straightforward as it could be (should be).
As you probably know, the statistics jargon for the most common value is the "mode". NumPy does not have a built-in function for this, but SciPy does. Import it like so:
from scipy.stats.mstats import mode
It does more than simply return the most common value, as you can read about in the docs, so it's convenient to define a function that uses mode to get just the most common value.
f = lambda x: mode(x, axis=None)[0]
And now, instead of value_counts(), use apply(f). Here is an example:
In [20]: DataFrame([1,1,2,2,2,3], index=[1,1,1,2,2,2]).groupby(level=0).apply(f)
Out[20]:
1 1.0
2 2.0
dtype: object
Update: SciPy's mode does not work with strings. For your string data, you'll need to define a more general mode function. This answer should do the trick.
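As a minimal sketch of what such a "more general mode function" could look like (not necessarily what the linked answer proposes), the example below uses only the standard library. The helper name most_common and the sample records are hypothetical and are not taken from the original data.

```python
from collections import Counter, defaultdict
from datetime import date


def most_common(values):
    """Return the most frequent item in an iterable of hashable values."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical sample records: (date, wind direction as a string)
records = [
    (date(2011, 1, 1), "ENE"),
    (date(2011, 1, 2), "ENE"),
    (date(2011, 1, 3), "E"),
    (date(2011, 1, 4), "NNE"),
    (date(2011, 2, 1), "NNO"),
    (date(2011, 2, 2), "ENE"),
    (date(2011, 2, 3), "NNO"),
]

# Group the directions by month number.
by_month = defaultdict(list)
for day, direction in records:
    by_month[day.month].append(direction)

# Pick the dominant direction for each month.
dominant = {month: most_common(dirs) for month, dirs in sorted(by_month.items())}
print(dominant)  # {1: 'ENE', 2: 'NNO'}
```

Because Counter only needs hashable items, this works for strings as well as numbers.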
Answered by mvbentes
Answered by Hrushikesh
For the whole dataframe, you can use:
dataframe.mode()
For a specific column:
dataframe.mode()['Column'][0]
The second case is more useful for imputing values.
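The following is a rough sketch of how the mode() approach might be applied, both for imputing a missing value and for the per-month question asked above. The data, the column name "Column", and the variable names are made up for illustration. It assumes a reasonably recent pandas, where Series.mode and DataFrame.mode accept string data.

```python
import pandas as pd

# Hypothetical frame with one missing value to impute.
df = pd.DataFrame({"Column": ["a", "b", None, "b"]})

# DataFrame.mode() returns a frame of modes; row 0 holds the first mode.
most_frequent = df.mode()["Column"][0]
filled = df["Column"].fillna(most_frequent)

# Hypothetical wind-direction series indexed by date.
idx = pd.to_datetime([
    "2011-01-01", "2011-01-02", "2011-01-03",
    "2011-02-01", "2011-02-02", "2011-02-03",
])
wind = pd.Series(["ENE", "E", "ENE", "NNO", "NNO", "ENE"],
                 index=idx, name="DirViento")

# Series.mode() can return several values on a tie; take the first one.
dominant = wind.groupby(wind.index.month).agg(lambda s: s.mode().iloc[0])
print(dominant)
```

On a tie, Series.mode() returns the tied values in sorted order, so taking .iloc[0] picks the first of them. Whether that is the right tie-break depends on the data.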

