Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/26924904/

Time: 2020-08-19 01:10:48  Source: igfitidea

Check if dataframe column is Categorical

python pandas

Asked by Marius

I can't seem to get a simple dtype check working with Pandas' improved Categoricals in v0.15+. Basically I just want something like is_categorical(column) -> True/False.

import pandas as pd
import numpy as np
import random

df = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'y': np.linspace(0, 20, 6),
    'cat_column': random.sample('abcdef', 6)
})
df['cat_column'] = pd.Categorical(df['cat_column'])

We can see that the dtype for the categorical column is 'category':

df.cat_column.dtype
Out[20]: category

And normally we can do a dtype check by just comparing to the name of the dtype:

df.x.dtype == 'float64'
Out[21]: True

But this doesn't seem to work when trying to check if the x column is categorical:

df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-94d2608815c4> in <module>()
----> 1 df.x.dtype == 'category'

TypeError: data type "category" not understood

Is there any way to do these types of checks in pandas v0.15+?

Accepted answer by Jeff Tratner

Use the name property to do the comparison instead; it should always work because it's just a string:

>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> arr.dtype.name
'int64'

>>> import pandas as pd
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.dtype.name
'category'

So, to sum up, you can end up with a simple, straightforward function:

def is_categorical(array_like):
    return array_like.dtype.name == 'category'
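
As a rough illustration of how that function might be used, the sketch below applies it to every column of a small frame. The frame `demo` and the dict `categorical_flags` are made-up names for this example, not part of the original answer.

```python
import pandas as pd

def is_categorical(array_like):
    # dtype.name is a plain string, so this comparison never raises.
    return array_like.dtype.name == 'category'

# Hypothetical demo frame: one float column and one categorical column.
demo = pd.DataFrame({
    'x': [0.0, 10.0, 20.0],
    'cat_column': pd.Categorical(['a', 'b', 'c']),
})

# Map each column name to the result of the check.
categorical_flags = {col: is_categorical(demo[col]) for col in demo.columns}
print(categorical_flags)
```

Because only strings are compared, the same call works on a Series, a pd.Categorical, or a NumPy array.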

Answer by joris

First, the string representation of the dtype is 'category' and not 'categorical', so this works:

In [41]: df.cat_column.dtype == 'category'
Out[41]: True

But indeed, as you noticed, this comparison gives a TypeError for other dtypes, so you would have to wrap it with a try .. except .. block.
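
A minimal sketch of that try/except wrapping could look like the following. The function name `is_categorical_safe` is hypothetical, and whether the comparison actually raises for a non-categorical dtype may depend on the installed NumPy and pandas versions, so the sketch handles both outcomes.

```python
import pandas as pd

def is_categorical_safe(column):
    # Hypothetical wrapper: treat a TypeError from the comparison
    # as "not categorical" instead of letting it propagate.
    try:
        return bool(column.dtype == 'category')
    except TypeError:
        return False

floats = pd.Series([0.0, 10.0, 20.0])
cats = pd.Series(['a', 'b', 'c'], dtype='category')

print(is_categorical_safe(floats))
print(is_categorical_safe(cats))
```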

Other ways to check using pandas internals:

In [42]: isinstance(df.cat_column.dtype, pd.api.types.CategoricalDtype)
Out[42]: True

In [43]: pd.api.types.is_categorical_dtype(df.cat_column)
Out[43]: True

For non-categorical columns, those statements will return False instead of raising an error. For example:

In [44]: pd.api.types.is_categorical_dtype(df.x)
Out[44]: False
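
To show the isinstance variant applied to a whole frame, here is a small sketch. The helper `categorical_columns` and the frame `demo` are invented for illustration. As far as I know, recent pandas releases (2.1 and later) emit a deprecation warning for is_categorical_dtype, so the sketch sticks to the isinstance form.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

def categorical_columns(frame):
    # Hypothetical helper: collect the names of columns whose dtype
    # is an instance of CategoricalDtype.
    return [col for col in frame.columns
            if isinstance(frame[col].dtype, CategoricalDtype)]

demo = pd.DataFrame({
    'x': [0.0, 10.0, 20.0],
    'cat_column': pd.Categorical(['a', 'b', 'c']),
})

print(categorical_columns(demo))
```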

For much older versions of pandas, replace pd.api.types in the above snippets with pd.core.common.

Answer by jorijnsmit

Just putting this here because pandas.DataFrame.select_dtypes() is what I was actually looking for:

df['column'].name in df.select_dtypes(include='category').columns

Thanks to @Jeff.
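
For completeness, a self-contained sketch of the select_dtypes approach, using a made-up frame `demo`, might look like this. It lists all categorical columns first and then tests membership for a single column.

```python
import pandas as pd

# Hypothetical demo frame with float, integer and categorical columns.
demo = pd.DataFrame({
    'x': [0.0, 10.0, 20.0],
    'y': [1, 2, 3],
    'cat_column': pd.Categorical(['a', 'b', 'c']),
})

# Names of every column with the 'category' dtype.
categorical_cols = demo.select_dtypes(include='category').columns.tolist()

# Membership test for individual columns.
cat_is_categorical = 'cat_column' in categorical_cols
x_is_categorical = 'x' in categorical_cols
print(categorical_cols, cat_is_categorical, x_is_categorical)
```

This is handy when you want all categorical columns at once; for a single column, the dtype checks from the other answers avoid building the intermediate frame.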

Answer by DieterDP

In my pandas version (v1.0.3), a shorter version of joris' answer is available.

df = pd.DataFrame({'noncat': [1, 2, 3], 'categ': pd.Categorical(['A', 'B', 'C'])})

print(isinstance(df.noncat.dtype, pd.CategoricalDtype))  # False
print(isinstance(df.categ.dtype, pd.CategoricalDtype))   # True

print(pd.CategoricalDtype.is_dtype(df.noncat)) # False
print(pd.CategoricalDtype.is_dtype(df.categ))  # True