
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/35826912/

Date: 2020-09-14 00:49:19  Source: igfitidea

What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

Tags: python · pandas · scikit-learn

Asked by Randy Olson

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.


Now, let's assume that we're provided a pandas.DataFrame and have no other information about the data in the DataFrame. What is a good heuristic to use to determine whether a column in the pandas.DataFrame is categorical?


My initial thoughts are:


1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data


2) If some percentage of the values in the column is unique (e.g., >=20%), then the column very likely contains continuous data

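These two heuristics can be sketched roughly as follows (the column names and the 20% threshold here are made up purely for illustration):

```python
import pandas as pd

def guess_categorical(col, unique_frac_threshold=0.2):
    """Heuristic 1): object dtype (strings) -> categorical.
    Heuristic 2): low fraction of unique values -> categorical."""
    if col.dtype == object:
        return True
    return col.nunique() / len(col) < unique_frac_threshold

df = pd.DataFrame({
    "color": ["red", "blue", "red", "blue", "red"],   # strings -> categorical
    "height": [1.2, 3.4, 5.6, 7.8, 9.0],              # all unique -> continuous
})
guesses = {c: guess_categorical(df[c]) for c in df.columns}
```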

I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?


Edit: Someone requested that I explain why 2) didn't work well. There were some test cases where a column held continuous values but had few unique values; the heuristic in 2) obviously failed there. There were also cases where a categorical column had very many unique values, e.g., passenger names in the Titanic data set, leading to the same column-type misclassification problem.


Answered by Rishabh Srivastava

Here are a couple of approaches:


  1. Find the ratio of the number of unique values to the total number of values. Something like the following:

    likely_cat = {}
    for var in df.columns:
        likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold
    
  2. Check whether the top n unique values account for more than a certain proportion of all values:

    top_n = 10 
    likely_cat = {}
    for var in df.columns:
        likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold
    

Approach 1) has generally worked better for me than approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of category values occur with high frequency while a large number of values occur with low frequency.
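To see where the two approaches disagree, here is a toy long-tailed column (values invented for illustration): approach 1 misses it at a 0.05 threshold, while approach 2 catches it.

```python
import pandas as pd

# A long-tailed column: one value dominates, plus many rare values.
vals = ["common"] * 90 + [f"rare_{i}" for i in range(10)]
s = pd.Series(vals)

# Approach 1: 11 unique / 100 total = 0.11, above a 0.05 cutoff,
# so the column is NOT flagged as categorical.
approach1 = s.nunique() / s.count() < 0.05

# Approach 2: the top 10 values cover 99% of rows, above a 0.8 cutoff,
# so the column IS flagged as categorical.
approach2 = s.value_counts(normalize=True).head(10).sum() > 0.8
```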


Answered by Diego

There are many places where you could "steal" definitions of formats that can be cast to a number; ##,#e-# would be one such format, just to illustrate. Maybe you'll be able to find a library that does this. I try to cast everything to numbers first, and whatever is left over can only be kept as categorical.
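As a rough sketch of this cast-everything-first idea (the column names are invented), pandas' to_numeric with errors='coerce' can stand in for the format definitions:

```python
import pandas as pd

df = pd.DataFrame({
    "amount": ["1.5", "2e-3", "42", "7"],     # numeric-looking strings
    "city": ["Oslo", "Lima", "Oslo", "Pune"], # cannot be parsed as numbers
})

numeric_like, categorical = [], []
for col in df.columns:
    # errors="coerce" turns non-parsable entries into NaN
    parsed = pd.to_numeric(df[col], errors="coerce")
    if parsed.notna().all():
        numeric_like.append(col)
    else:
        categorical.append(col)
```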


Answered by rd11

I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.


If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.


If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.


But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?


Answered by Karl Rosaen

I've been thinking about a similar problem, and the more I consider it, the more it seems that this is itself a classification problem that could benefit from training a model.


I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:


  • % floats: percentage of values that are float
  • % int: percentage of values that are whole numbers
  • % string: percentage of values that are strings
  • % unique string: number of unique string values / total number
  • % unique integers: number of unique integer values / total number
  • mean numerical value (non numerical values considered 0 for this)
  • std deviation of numerical values

and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.
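As a rough sketch of extracting a few of those features from a pandas.Series (the function name and exact computations are mine, not from the answer):

```python
import pandas as pd

def column_features(s):
    """Extract a few of the proposed per-column features."""
    n = len(s)
    is_str = s.map(lambda v: isinstance(v, str))
    # non-numeric values are treated as 0, as the answer suggests
    numeric = pd.to_numeric(s, errors="coerce").fillna(0)
    return {
        "frac_string": is_str.sum() / n,
        "frac_unique": s.nunique() / n,
        "mean_numeric": numeric.mean(),
        "std_numeric": numeric.std(),
    }

feats = column_features(pd.Series([1, 8, 22, 8, 9, 8]))
```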


Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs. ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative, right? The preprocessing steps would encode the ordinal values numerically anyway, without one-hot encoding.


A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g., in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.
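One rough test for that (a sketch; the column names are invented): a group of columns plausibly forms a one-hot encoding if every value in the group is 0 or 1 and each row contains exactly one 1:

```python
import pandas as pd

def looks_one_hot(df, cols):
    """True if every value in `cols` is 0/1 and each row has exactly one 1."""
    block = df[list(cols)]
    all_binary = block.isin([0, 1]).all().all()
    one_per_row = (block.sum(axis=1) == 1).all()
    return bool(all_binary and one_per_row)

df = pd.DataFrame({
    "soil_1": [1, 0, 0],
    "soil_2": [0, 1, 0],
    "soil_3": [0, 0, 1],
    "elevation": [2596, 2590, 2804],
})
```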


Answered by VicKat

You could define which datatypes count as numeric and then exclude the corresponding variables.


If the initial dataframe is df:


numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
dataframe = df.select_dtypes(exclude=numerics)
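As a side note, select_dtypes also accepts the 'number' shorthand, which covers all numeric dtypes without listing them:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})
non_numeric = df.select_dtypes(exclude="number")  # drops int and float columns
```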

Answered by Jan Katins

IMO the opposite strategy, identifying categoricals, is better, because it depends on what the data is about. Technically, address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.


For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which would probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*", ...) or int values in the 0-8 range + NA.
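A minimal sketch of such a detector (the word list and the 2-8 distinct-value range are assumptions, hardcoded for illustration):

```python
import pandas as pd

# Hardcoded Likert-ish words -- an assumption for illustration only.
LIKERT_WORDS = {"agree", "disagree", "neutral", "good", "bad"}

def looks_like_likert(s):
    """Rough check: a handful of distinct values that are either
    Likert-ish words or small integers in the 0-8 range."""
    vals = s.dropna().unique()
    if not 2 <= len(vals) <= 8:
        return False
    if all(isinstance(v, str) for v in vals):
        return all(any(w in v.lower() for w in LIKERT_WORDS) for v in vals)
    return all(not isinstance(v, str) and float(v) == int(v) and 0 <= v <= 8
               for v in vals)
```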


Countries and such things might also be identifiable...


Age groups (".-.") might also work.


Answered by FChm

I've been looking at this and thought it may be useful to share what I have. This builds on @Rishabh Srivastava's answer.


import pandas as pd

def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
    """Removes categorical features using a given method.
       X: pd.DataFrame, dataframe to remove categorical features from."""

    if method == 'fraction_unique':
        # Keep columns whose fraction of unique values exceeds the threshold.
        unique_fraction = X.apply(lambda col: len(pd.unique(col)) / len(col))
        reduced_X = X.loc[:, unique_fraction > min_fraction_unique]
    elif method == 'named_columns':
        # Keep every column not explicitly named as categorical.
        non_cat_cols = [col not in cat_cols for col in X.columns]
        reduced_X = X.loc[:, non_cat_cols]
    else:
        raise ValueError(f"Unknown method: {method!r}")

    return reduced_X

You can then call this function, giving a pandas df as X, and either remove named categorical columns or remove columns with a low fraction of unique values (controlled by min_fraction_unique).
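For instance, on a made-up DataFrame (restating the fraction_unique branch inline so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["m", "f"] * 50,                  # 2/100 unique -> categorical
    "fare": [float(i) for i in range(100)],  # 100/100 unique -> continuous
})

# Same logic as method='fraction_unique' above:
unique_fraction = df.apply(lambda col: len(pd.unique(col)) / len(col))
reduced = df.loc[:, unique_fraction > 0.05]
```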
