Python: Finding the count of distinct elements in each column of a DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30503321/

Date: 2020-08-19 08:32:49  Source: igfitidea

Finding count of distinct elements in DataFrame in each column

python · numpy · pandas

Asked by ajknzhol

I am trying to find the count of distinct values in each column using Pandas. This is what I did.


import pandas as pd
import numpy as np

# Generate data.
NROW = 10000
NCOL = 100
df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                  columns=['col' + x for x in np.arange(NCOL).astype(str)])

I need to count the number of distinct elements for each column, like this:


col0    9538
col1    9505
col2    9524

What would be the most efficient way to do this, as this method will be applied to files which have size greater than 1.5GB?




Based upon the answers, df.apply(lambda x: len(x.unique())) is the fastest (notebook).


%timeit df.apply(lambda x: len(x.unique()))
10 loops, best of 3: 49.5 ms per loop
%timeit df.nunique()
10 loops, best of 3: 59.7 ms per loop
%timeit df.apply(pd.Series.nunique)
10 loops, best of 3: 60.3 ms per loop
%timeit df.T.apply(lambda x: x.nunique(), axis=1)
10 loops, best of 3: 60.5 ms per loop


Accepted answer by EdChum

As of pandas 0.20 we can use nunique directly on DataFrames, i.e.:


df.nunique()
a    4
b    5
c    1
dtype: int64
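For reference, nunique also accepts axis and dropna arguments; a small sketch using the toy frame from below (the values shown in comments follow from that data):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

# Distinct values per column (the default, axis=0).
per_col = df.nunique()
print(per_col.to_dict())        # {'a': 4, 'b': 5, 'c': 1}

# Distinct values per row instead.
per_row = df.nunique(axis=1)
print(per_row.tolist())         # [2, 2, 2, 3, 3]

# With dropna=False, NaN counts as a value of its own.
df2 = pd.DataFrame({'x': [1.0, None, 1.0]})
print(int(df2.nunique(dropna=False)['x']))   # 2  (1.0 and NaN)
```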

Other legacy options:


You could transpose the df and then use apply to call nunique row-wise:


In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

Out[205]:
   a  b  c
0  0  1  1
1  1  2  1
2  1  3  1
3  2  4  1
4  3  5  1

In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)

Out[206]:
a    4
b    5
c    1
dtype: int64

EDIT


As pointed out by @ajcr the transpose is unnecessary:


In [208]:
df.apply(pd.Series.nunique)

Out[208]:
a    4
b    5
c    1
dtype: int64

Answered by CaMaDuPe85

A Pandas.Series has a .value_counts() function that provides exactly what you want. Check out the documentation for the function.

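A short sketch of that approach: value_counts returns the frequency of each distinct value, so its length is the number of distinct values.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3, 3])

# value_counts gives the frequency of each distinct value;
# its length is therefore the number of distinct values.
counts = s.value_counts()
n_distinct = len(counts)
print(n_distinct)               # 3
```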

Answered by Wendao Liu

Recently, I had the same issue of counting the unique values of each column in a DataFrame, and I found another approach that runs faster than the apply function:


# Choose how you want to store the output; it could be a pd.DataFrame
# or a dict. A dict is used here to demonstrate:
col_uni_val = {}
for i in df.columns:
    col_uni_val[i] = len(df[i].unique())

# Use pprint to display the dict nicely:
import pprint
pprint.pprint(col_uni_val)

For me this runs almost twice as fast as df.apply(lambda x: len(x.unique())).

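A quick check on a small toy frame that the loop agrees with the built-in method (written here as a dict comprehension):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

# Same loop as above, written as a dict comprehension.
col_uni_val = {col: len(df[col].unique()) for col in df.columns}
print(col_uni_val)              # {'a': 4, 'b': 5, 'c': 1}

# It agrees with the built-in method.
assert col_uni_val == df.nunique().to_dict()
```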

Answered by Sander van den Oord

Already some great answers here :) but this one seems to be missing:


df.apply(lambda x: x.nunique())

As of pandas 0.20.0, DataFrame.nunique() is also available.


Answered by zehai

df.apply(lambda x: len(x.unique()))

Answered by Ayyasamy

To segregate only the columns with more than 20 unique values, across all the (object) columns in pandas:


col_with_morethan_20_unique_values_cat = []
for col in data.columns:
    if data[col].dtype == 'O':  # object-dtype (string) columns only
        if len(data[col].unique()) > 20:
            col_with_morethan_20_unique_values_cat.append(data[col].name)
        else:
            continue

print(col_with_morethan_20_unique_values_cat)
print('total number of columns with more than 20 unique values is',
      len(col_with_morethan_20_unique_values_cat))



 # The output will be like:
['CONTRACT NO', 'X2','X3',,,,,,,..]
total number of columns with more than 20 number of unique value is 25
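The same filtering can be written without an explicit loop; a sketch using select_dtypes and nunique (the data frame below is purely illustrative):

```python
import pandas as pd

# Illustrative frame: 'CONTRACT NO' has 30 unique strings, 'region' only 2.
data = pd.DataFrame({
    'CONTRACT NO': ['C%d' % i for i in range(30)],
    'region': ['north', 'south'] * 15,
    'amount': range(30),            # numeric column, excluded below
})

# Distinct counts for object-dtype (string) columns only.
uniq = data.select_dtypes(include='object').nunique()
high_card_cols = uniq[uniq > 20].index.tolist()
print(high_card_cols)               # ['CONTRACT NO']
```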

Answered by Preetham

Adding the example code for the answer given by @CaMaDuPe85


df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

# df
    a   b   c
0   0   1   1
1   1   2   1
2   1   3   1
3   2   4   1
4   3   5   1


for cs in df.columns:
    print(cs, df[cs].value_counts().count())
    # value_counts lists each distinct value; counting its rows
    # gives the number of unique values in the column

# Output

a 4
b 5
c 1

Answered by yami

I found:


df.agg(['nunique']).T

much faster.
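agg also lets you compute several summaries in one pass; a small sketch (the extra min/max aggregates are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5]})

# agg returns one row per aggregate; .T gives one row per column.
summary = df.agg(['nunique', 'min', 'max']).T
print(summary)
#    nunique  min  max
# a        4    0    3
# b        5    1    5
```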