Finding count of distinct elements in DataFrame in each column
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/30503321/
Asked by ajknzhol
I am trying to find the count of distinct values in each column using Pandas. This is what I did.
import pandas as pd
import numpy as np
# Generate data.
NROW = 10000
NCOL = 100
df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                  columns=['col' + x for x in np.arange(NCOL).astype(str)])
I need to count the number of distinct elements for each column, like this:
col0 9538
col1 9505
col2 9524
What would be the most efficient way to do this, as this method will be applied to files which have size greater than 1.5GB?
Based upon the answers, df.apply(lambda x: len(x.unique())) is the fastest (notebook).
%timeit df.apply(lambda x: len(x.unique()))
10 loops, best of 3: 49.5 ms per loop
%timeit df.nunique()
10 loops, best of 3: 59.7 ms per loop
%timeit df.apply(pd.Series.nunique)
10 loops, best of 3: 60.3 ms per loop
%timeit df.T.apply(lambda x: x.nunique(), axis=1)
10 loops, best of 3: 60.5 ms per loop
Accepted answer by EdChum
As of pandas 0.20 we can use nunique directly on DataFrames, i.e.:
df.nunique()
a 4
b 5
c 1
dtype: int64
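As an aside, DataFrame.nunique also accepts an axis argument (and a dropna flag), so the same call can count distinct values per row; a minimal sketch on the frame from this answer:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

# Distinct values per column (the default, axis=0):
per_col = df.nunique()        # a -> 4, b -> 5, c -> 1

# Distinct values per row:
per_row = df.nunique(axis=1)  # row 0 holds the values {0, 1} -> 2
```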
Other legacy options:
You could take a transpose of the df and then call nunique row-wise using apply:
In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df
Out[205]:
a b c
0 0 1 1
1 1 2 1
2 1 3 1
3 2 4 1
4 3 5 1
In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)
Out[206]:
a 4
b 5
c 1
dtype: int64
EDIT
As pointed out by @ajcr the transpose is unnecessary:
In [208]:
df.apply(pd.Series.nunique)
Out[208]:
a 4
b 5
c 1
dtype: int64
Answered by CaMaDuPe85
A Pandas.Series has a .value_counts() function that provides exactly what you want. Check out the documentation for the function.
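A minimal sketch of that idea: value_counts returns the frequency of every distinct value, so its length is the distinct count (the Series here is just an illustrative example):

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3, 3])

counts = s.value_counts()  # frequency table: 3 -> 3, 1 -> 2, 2 -> 1
n_distinct = len(counts)   # number of distinct values
```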
Answered by Wendao Liu
Recently, I had the same issue of counting the unique values of each column in a DataFrame, and I found some other functions that run faster than the apply function:
# Choose how to store the output -- it could be a pd.DataFrame or a dict; a dict is used here:
col_uni_val = {}
for i in df.columns:
    col_uni_val[i] = len(df[i].unique())

# Use pprint to display the dict nicely:
import pprint
pprint.pprint(col_uni_val)
This works for me almost twice as fast as df.apply(lambda x: len(x.unique())).
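The loop above can equivalently be written as a dict comprehension; a small sketch (the example frame is hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 100, (1000, 5)),
                  columns=['col' + str(i) for i in range(5)])

# One expression, same result as the for-loop above:
col_uni_val = {col: len(df[col].unique()) for col in df.columns}
```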
Answered by Sander van den Oord
Already some great answers here :) but this one seems to be missing:
df.apply(lambda x: x.nunique())
As of pandas 0.20.0, DataFrame.nunique() is also available.
Answered by zehai
df.apply(lambda x: len(x.unique()))
Answered by Ayyasamy
To segregate only the columns with more than 20 unique values, across all the columns, in pandas:
col_with_morethan_20_unique_values_cat = []
for col in data.columns:
    if data[col].dtype == 'O':
        if len(data[col].unique()) > 20:
            col_with_morethan_20_unique_values_cat.append(data[col].name)
    else:
        continue

print(col_with_morethan_20_unique_values_cat)
print('total number of columns with more than 20 unique values is',
      len(col_with_morethan_20_unique_values_cat))
# The output will be:
['CONTRACT NO', 'X2', 'X3', ...]
total number of columns with more than 20 unique values is 25
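The same filter can be sketched without an explicit loop, using select_dtypes and nunique; the frame below is a hypothetical stand-in for the `data` used above:

```python
import pandas as pd

# Hypothetical stand-in for `data`: two object columns and one numeric column.
data = pd.DataFrame({
    'CONTRACT NO': ['C%d' % i for i in range(30)],  # 30 unique strings
    'region': ['north', 'south'] * 15,              # only 2 unique strings
    'amount': range(30),                            # numeric: skipped by the dtype filter
})

nuniq = data.select_dtypes(include='object').nunique()
cols_over_20 = nuniq[nuniq > 20].index.tolist()
```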
Answered by Preetham
Adding the example code for the answer given by @CaMaDuPe85
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df
# df
a b c
0 0 1 1
1 1 2 1
2 1 3 1
3 2 4 1
4 3 5 1
for cs in df.columns:
    print(cs, df[cs].value_counts().count())
    # using value_counts in each column and counting its length
# Output
a 4
b 5
c 1
Answered by yami
I found:
df.agg(['nunique']).T
much faster.
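For reference, on the small frame from the accepted answer, df.agg(['nunique']).T returns the same counts as a one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

result = df.agg(['nunique']).T  # index: a, b, c; single column: 'nunique'
```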