Python: How to get the count of the most frequent value in a column?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/15138973/

Time: 2020-08-18 13:27:13  Source: igfitidea

How to get the number of the most frequent value in a column?

Tags: python, pandas, counter, frequency, series

Asked by Roman

I have a data frame and I would like to know how many times the most frequent value occurs in a given column.

I tried to do it in the following way:

items_counts = df['item'].value_counts()
max_item = items_counts.max()

As a result I get:

ValueError: cannot convert float NaN to integer

As far as I understand, the first line gives me a Series in which the values from the column are used as keys and the frequencies of these values are used as values. So I just need to find the largest value in the Series, but for some reason this does not work. Does anybody know how this problem can be solved?

Accepted answer by beardc

It looks like you may have some nulls in the column. You can drop them with df = df.dropna(subset=['item']). Then df['item'].value_counts().max() should give you the max count, and df['item'].value_counts().idxmax() should give you the most frequent value.
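
The steps in this answer can be put together as a minimal sketch. The data frame below is hypothetical (the question does not show its data), and the variable names max_count and most_frequent are only illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical data: an 'item' column that contains some missing values.
df = pd.DataFrame({"item": ["a", "b", "a", np.nan, "c", "a", np.nan]})

# Drop the rows where 'item' is null, as suggested in the answer.
df = df.dropna(subset=["item"])

items_counts = df["item"].value_counts()
max_count = items_counts.max()         # how often the most frequent value occurs
most_frequent = items_counts.idxmax()  # the most frequent value itself

print(most_frequent, max_count)
```

Here max is applied to the counts, while idxmax returns the index label that carries the largest count.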

Answer by jonathanrocher

You may also consider using scipy's mode function, which ignores NaN. A solution using it could look like:

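from numpy import nan
from pandas import DataFrame
from scipy.stats import mode

df = DataFrame({"a": [1,2,2,4,2], "b": [nan, nan, nan, 3, 3]})
# Newer SciPy versions may need mode(df, nan_policy="omit") to skip NaN
print(mode(df))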

The output would look like

(array([[ 2.,  3.]]), array([[ 3.,  2.]]))

meaning that the most common values are 2 for the first column and 3 for the second, with frequencies 3 and 2 respectively.

Answer by Anton Protopopov

To follow up on @jonathanrocher's answer, you could use mode on a pandas DataFrame. It gives the most frequent value or values (more than one when there is a tie) along the rows or columns:

import pandas as pd
import numpy as np
df = pd.DataFrame({"a": [1,2,2,4,2], "b": [np.nan, np.nan, np.nan, 3, 3]})

In [2]: df.mode()
Out[2]: 
   a    b
0  2  3.0
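
Since df.mode() returns only the values and not how often they occur, one possible way to recover the counts is sketched below. The names top_a, count_a, top_b, count_b and ties are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 2, 4, 2], "b": [np.nan, np.nan, np.nan, 3, 3]})

modes = df.mode()  # most frequent value(s) per column; NaN is skipped by default

# First mode of each column and the number of times it occurs.
top_a = modes["a"].iloc[0]
count_a = (df["a"] == top_a).sum()
top_b = modes["b"].iloc[0]
count_b = (df["b"] == top_b).sum()

# With a tie, mode returns every tied value.
ties = pd.Series([1, 1, 2, 2]).mode().tolist()

print(top_a, count_a, top_b, count_b, ties)
```

Comparing the column against its mode and summing the resulting booleans counts the matching rows; NaN never compares equal, so missing values do not inflate the count.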

Answer by jpp

Just take the first row of your items_counts series:

top = items_counts.head(1)  # or items_counts.iloc[[0]]
value, count = top.index[0], top.iat[0]

This works because pd.Series.value_counts has sort=True by default, so the result is already ordered by counts, highest count first. Extracting a value from an index by location has O(1) complexity, while pd.Series.idxmax has O(n) complexity, where n is the number of categories.

Specifying sort=False is still possible, and in that case idxmax is recommended:

items_counts = df['item'].value_counts(sort=False)
top = items_counts.loc[[items_counts.idxmax()]]
value, count = top.index[0], top.iat[0]

Notice that in this case you don't need to call max and idxmax separately; just extract the index via idxmax and feed it to the loc label-based indexer.
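
As a rough check that both variants agree, the sketch below runs them side by side on a small hypothetical data frame (the data and the *_sorted / *_unsorted names are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical data with a clear most frequent value.
df = pd.DataFrame({"item": ["x", "y", "x", "z", "x", "y"]})

# Default sort=True: the first entry already holds the highest count.
sorted_counts = df["item"].value_counts()
value_sorted, count_sorted = sorted_counts.index[0], sorted_counts.iat[0]

# sort=False: the order is not by count, so locate the maximum with idxmax.
unsorted_counts = df["item"].value_counts(sort=False)
label = unsorted_counts.idxmax()
value_unsorted, count_unsorted = label, unsorted_counts.loc[label]

print(value_sorted, count_sorted)
print(value_unsorted, count_unsorted)
```

Both paths should report the same value and count; the sorted variant simply avoids the extra scan over the categories.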

Answer by user9114146

Add this line of code to find the count of the most frequent value:

df["item"].value_counts().nlargest(n=1).values[0]

Answer by Ambati Vaishnavi

NaN values are omitted when calculating frequencies. Please check your code functionality here. Alternatively, you can use the code below for the same functionality.

**>> Code:**

    # Importing required modules
    from collections import Counter
    import pandas as pd

    # Creating a dataframe
    df = pd.DataFrame({ 'A':["jan","jan","jan","mar","mar","feb","jan","dec",
                             "mar","jan","dec"]  })
    # Creating a counter object
    count = Counter(df['A'])
    # Listing the three most common values with their counts
    print(count.most_common(3))

**>> Output:**

    [('jan', 5), ('mar', 3), ('dec', 2)]