Python: Finding the count of distinct elements in each column of a DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30503321/

Date: 2020-08-19 08:32:49  Source: igfitidea

Finding count of distinct elements in DataFrame in each column

python · numpy · pandas

Asked by ajknzhol

I am trying to find the count of distinct values in each column using Pandas. This is what I did.


import pandas as pd
import numpy as np

# Generate data.
NROW = 10000
NCOL = 100
df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                  columns=['col' + x for x in np.arange(NCOL).astype(str)])

I need to count the number of distinct elements for each column, like this:


col0    9538
col1    9505
col2    9524

What would be the most efficient way to do this, as this method will be applied to files which have size greater than 1.5GB?




Based upon the answers, df.apply(lambda x: len(x.unique())) is the fastest (notebook).


%timeit df.apply(lambda x: len(x.unique()))
10 loops, best of 3: 49.5 ms per loop
%timeit df.nunique()
10 loops, best of 3: 59.7 ms per loop
%timeit df.apply(pd.Series.nunique)
10 loops, best of 3: 60.3 ms per loop
%timeit df.T.apply(lambda x: x.nunique(), axis=1)
10 loops, best of 3: 60.5 ms per loop


Accepted answer by EdChum

As of pandas 0.20 we can use nunique directly on DataFrames, i.e.:


df.nunique()
a    4
b    5
c    1
dtype: int64
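For reference, nunique also accepts axis and dropna arguments; a small sketch using the toy frame from below (the values shown in comments follow from that data):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

# Distinct values per column (the default, axis=0).
per_col = df.nunique()
print(per_col.to_dict())        # {'a': 4, 'b': 5, 'c': 1}

# Distinct values per row instead.
per_row = df.nunique(axis=1)
print(per_row.tolist())         # [2, 2, 2, 3, 3]

# With dropna=False, NaN counts as a value of its own.
df2 = pd.DataFrame({'x': [1.0, None, 1.0]})
print(int(df2.nunique(dropna=False)['x']))   # 2  (1.0 and NaN)
```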

Other legacy options:


You could transpose the df and then use apply to call nunique row-wise:


In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

Out[205]:
   a  b  c
0  0  1  1
1  1  2  1
2  1  3  1
3  2  4  1
4  3  5  1

In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)

Out[206]:
a    4
b    5
c    1
dtype: int64

EDIT


As pointed out by @ajcr the transpose is unnecessary:


In [208]:
df.apply(pd.Series.nunique)

Out[208]:
a    4
b    5
c    1
dtype: int64

Answered by CaMaDuPe85

A Pandas.Series has a .value_counts() function that provides exactly what you want. Check out the documentation for the function.

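A short sketch of that approach: value_counts returns the frequency of each distinct value, so its length is the number of distinct values.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3, 3])

# value_counts gives the frequency of each distinct value;
# its length is therefore the number of distinct values.
counts = s.value_counts()
n_distinct = len(counts)
print(n_distinct)               # 3
```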

Answered by Wendao Liu

Recently, I had the same issue of counting the unique values of each column in a DataFrame, and I found another approach that runs faster than the apply function:


# Choose how you want to store the output; it could be a pd.DataFrame
# or a dict. A dict is used here to demonstrate:
col_uni_val = {}
for i in df.columns:
    col_uni_val[i] = len(df[i].unique())

# Use pprint to display the dict nicely:
import pprint
pprint.pprint(col_uni_val)

For me this runs almost twice as fast as df.apply(lambda x: len(x.unique())).

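A quick check on a small toy frame that the loop agrees with the built-in method (written here as a dict comprehension):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

# Same loop as above, written as a dict comprehension.
col_uni_val = {col: len(df[col].unique()) for col in df.columns}
print(col_uni_val)              # {'a': 4, 'b': 5, 'c': 1}

# It agrees with the built-in method.
assert col_uni_val == df.nunique().to_dict()
```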

Answered by Sander van den Oord

Already some great answers here :) but this one seems to be missing:


df.apply(lambda x: x.nunique())

As of pandas 0.20.0, DataFrame.nunique() is also available.


Answered by zehai

df.apply(lambda x: len(x.unique()))

Answered by Ayyasamy

To segregate only the columns with more than 20 unique values, across all the (object) columns in pandas:


col_with_morethan_20_unique_values_cat = []
for col in data.columns:
    if data[col].dtype == 'O':  # object-dtype (string) columns only
        if len(data[col].unique()) > 20:
            col_with_morethan_20_unique_values_cat.append(data[col].name)
        else:
            continue

print(col_with_morethan_20_unique_values_cat)
print('total number of columns with more than 20 unique values is',
      len(col_with_morethan_20_unique_values_cat))



 # The output will be like:
['CONTRACT NO', 'X2','X3',,,,,,,..]
total number of columns with more than 20 number of unique value is 25
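The same filtering can be written without an explicit loop; a sketch using select_dtypes and nunique (the data frame below is purely illustrative):

```python
import pandas as pd

# Illustrative frame: 'CONTRACT NO' has 30 unique strings, 'region' only 2.
data = pd.DataFrame({
    'CONTRACT NO': ['C%d' % i for i in range(30)],
    'region': ['north', 'south'] * 15,
    'amount': range(30),            # numeric column, excluded below
})

# Distinct counts for object-dtype (string) columns only.
uniq = data.select_dtypes(include='object').nunique()
high_card_cols = uniq[uniq > 20].index.tolist()
print(high_card_cols)               # ['CONTRACT NO']
```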

Answered by Preetham

Adding the example code for the answer given by @CaMaDuPe85


df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

# df
    a   b   c
0   0   1   1
1   1   2   1
2   1   3   1
3   2   4   1
4   3   5   1


for cs in df.columns:
    print(cs, df[cs].value_counts().count())
    # value_counts lists each distinct value; counting its rows
    # gives the number of unique values in the column

# Output

a 4
b 5
c 1

Answered by yami

I found:


df.agg(['nunique']).T

much faster.
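agg also lets you compute several summaries in one pass; a small sketch (the extra min/max aggregates are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5]})

# agg returns one row per aggregate; .T gives one row per column.
summary = df.agg(['nunique', 'min', 'max']).T
print(summary)
#    nunique  min  max
# a        4    0    3
# b        5    1    5
```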