Python: find and select the most frequent data in a column of a pandas DataFrame

Question

Asked by user1345283

I have a dataframe with the following column:

file['DirViento']

Fecha
2011-01-01    ENE
2011-01-02    ENE
2011-01-03    ENE
2011-01-04    NNE 
2011-01-05    ENE
2011-01-06    ENE
2011-01-07    ENE
2011-01-08    ENE
2011-01-09    NNE
2011-01-10    ENE
2011-01-11    ENE
2011-01-12    ENE
2011-01-13    ESE
2011-01-14    ENE
2011-01-15    ENE
... 
2011-12-17    ENE
2011-12-18    ENE
2011-12-19    ENE
2011-12-20    ENE
2011-12-21    ENE
2011-12-22    ENE
2011-12-23    ENE
2011-12-24    ENE
2011-12-25    ENE
2011-12-26    ESE
2011-12-27    ENE
2011-12-28     NE
2011-12-29    ENE
2011-12-30    NNE
2011-12-31    ENE
Name: DirViento, Length: 290, dtype: object

The column holds daily records of wind direction for each month of the year. I'm trying to get the dominant direction for each month, that is, the value that occurs most often during the month. Counting the values per month gives:

file['DirViento'].groupby(lambda x: x.month).value_counts()


1   ENE    23
    NNE     6
    E       1
    ESE     1
2   ENE    21
    NNO     3
    NNE     2
    NE      1
3   ENE    21
    OSO     1
    ESE     1
    SSE     1
4   ENE    21
    NNE     2
    ESE     1
    NNO     1
6   ENE    15
    ESE     2
    SSE     2
    ONO     1
    E       1
7   ENE    22
    ONO     1
    OSO     1
    NE      1
    NNE     1
    NNO     1
8   ENE    23
    NNE     5
    NE      1
    ONO     1
    ESE     1
9   ENE    17
    NNE     7
    ONO     2
    NE      1
    E       1
    ESE     1
    NNO     1
10  ENE    16
    NNE     2
    ESE     2
    NNO     2
    ONO     1
    NE      1
    E       1
11  ENE    13
    NNE     2
    ESE     2
    ONO     1
12  ENE    26
    NNE     3
    NE      1
    ESE     1
Length: 54, dtype: int64

When I run the following line of code:

wind_moda=file['DirViento'].groupby(lambda x: x.month).agg(lambda x: stats.mode(x)[0][0])

I expect to get something like this:

     1  ENE    
     2  ENE    
     3  ENE  
     4  ENE
     6  ENE
     7  ENE    
     8  ENE    
     9  ENE
    10  ENE  
    11  ENE
    12  ENE

But I get the following:

 1          E  
 2        ENE  
 3        ENE  
 4        ENE  
 6          E  
 7        ENE  
 8        ENE  
 9          E  
 10         E  
 11       ENE  
 12       ENE

Why is the most frequent value not returned for 4 of the months?

Am I doing something wrong?

Any ideas on how to get the most common value for each month?

Answer 1

Answered by Dan Allan

This is not as straightforward as it could be (should be).

As you probably know, the statistics jargon for the most common value is the "mode." Numpy does not have a built-in function for this, but scipy does. Import it like so:

from scipy.stats.mstats import mode

It does more than simply return the most common value, as you can read about in the docs, so it's convenient to define a function that uses mode to get just the most common value.

f = lambda x: mode(x, axis=None)[0]

And now, instead of value_counts(), use apply(f). Here is an example:

In [20]: DataFrame([1,1,2,2,2,3], index=[1,1,1,2,2,2]).groupby(level=0).apply(f)
Out[20]: 
1    1.0
2    2.0
dtype: object

Update: Scipy's mode does not work with strings. For your string data, you'll need to define a more general mode function. This answer should do the trick.
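
As a rough illustration of a mode function that handles strings, here is a minimal sketch that uses only pandas. It is not the approach from the linked answer, which is not reproduced here. The sample series and the helper name most_common are made up for the example; value_counts() and idxmax() are standard pandas methods. If two values tie for first place, this sketch makes no promise about which one is returned.

```python
import pandas as pd

# Hypothetical sample data standing in for file['DirViento'].
dates = pd.to_datetime([
    "2011-01-01", "2011-01-02", "2011-01-03",
    "2011-02-01", "2011-02-02", "2011-02-03",
])
wind = pd.Series(
    ["ENE", "ENE", "NNE", "NNO", "ENE", "NNO"],
    index=dates,
    name="DirViento",
)

def most_common(values):
    """Return the label with the highest count in a Series (works for strings)."""
    return values.value_counts().idxmax()

# Group by the month of the datetime index, then reduce each group to one value.
wind_moda = wind.groupby(wind.index.month).agg(most_common)
print(wind_moda)
```

Grouping on wind.index.month plays the same role as the lambda x: x.month key used in the question.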

Answer 2

Answered by mvbentes

Pandas 0.15.2 has a DataFrame.mode() method. It might be of use to someone looking for this, as I was.

Here are the docs.

Edit: To get the value:

DataFrame.mode()[0]
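
A small sketch of how mode() behaves on a single column may help here. The data is invented for illustration. The point it tries to show is that mode() returns a Series rather than a scalar, because several values can tie, so the first element has to be picked explicitly.

```python
import pandas as pd

# Hypothetical column of wind directions.
df = pd.DataFrame({"DirViento": ["ENE", "ENE", "NNE", "ESE", "ENE"]})

# Series.mode() returns a Series containing every value tied for most frequent.
modes = df["DirViento"].mode()

# Take the first entry to get a single scalar value.
most_common = modes.iloc[0]

# With a tie, all tied values are returned.
tied = pd.Series(["a", "b", "a", "b", "c"]).mode()
```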

Answer 3

Answered by Hrushikesh

For the whole dataframe, you can use:
```
dataframe.mode()
```
For a specific column:
```
dataframe.mode()['Column'][0]
```

The second case is more useful when imputing missing values.
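
To make the imputation use case concrete, here is a minimal sketch under assumed data. The frame and its missing entries are invented, and 'Column' simply reuses the placeholder name from the snippet above. It fills missing entries with the column's most frequent value.

```python
import pandas as pd

# Hypothetical column with missing entries.
df = pd.DataFrame({"Column": ["ENE", None, "ENE", "NNE", None]})

# mode() ignores missing values by default; [0] picks the first mode.
fill_value = df.mode()["Column"][0]

# Replace the missing entries with the most frequent value.
df["Column"] = df["Column"].fillna(fill_value)
```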
