XGBoost Categorical Variables: Dummification vs Encoding

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not the translator): StackOverflow.

Original question: http://stackoverflow.com/questions/34265102/
Asked by ishido

When using XGBoost, we need to convert categorical variables into numeric ones.
Would there be any difference in performance/evaluation metrics between the methods of:

- dummifying your categorical variables
- encoding your categorical variables from e.g. (a,b,c) to (1,2,3)
ALSO:

Would there be any reasons not to go with method 2 by using, for example, LabelEncoder?
Answered by T. Scharf

xgboost only deals with numeric columns.
If you have a feature [a,b,b,c] which describes a categorical variable (i.e. no numeric relationship), then using LabelEncoder you will simply get this:

array([0, 1, 1, 2])
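As a minimal sketch of that mapping (assuming scikit-learn is installed; the feature values are just illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Encode the categorical feature [a, b, b, c] as integers.
feature = ["a", "b", "b", "c"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(feature)

print(encoded)           # integer codes, assigned in sorted-category order
print(encoder.classes_)  # the categories the codes refer back to
```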
Xgboost will wrongly interpret this feature as having a numeric relationship! This just maps each string ('a','b','c') to an integer, nothing more.
Proper way
Using OneHotEncoder you will eventually get to this:
array([[ 1., 0., 0.],
       [ 0., 1., 0.],
       [ 0., 1., 0.],
       [ 0., 0., 1.]])
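A short sketch of producing that matrix with scikit-learn's OneHotEncoder (string input requires a reasonably recent scikit-learn; `.toarray()` densifies the sparse result):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The input must be 2-D: one feature column, four samples.
feature = np.array(["a", "b", "b", "c"]).reshape(-1, 1)

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(feature).toarray()  # densify the sparse matrix
print(one_hot)
```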
This is the proper representation of a categorical variable for xgboost or any other machine learning tool.
Pandas get_dummies is a nice tool for creating dummy variables (and easier to use, in my opinion).
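For example, a quick sketch with get_dummies on a toy column (the column name is just illustrative):

```python
import pandas as pd

# One categorical column with three levels.
df = pd.DataFrame({"feature": ["a", "b", "b", "c"]})

# One indicator column per level, prefixed with the source column name.
dummies = pd.get_dummies(df["feature"], prefix="feature")
print(dummies)
```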
Method #2 in the above question will not represent the data properly.
Answered by mamafoku

I want to answer this question not just in terms of XGBoost but in terms of any problem dealing with categorical data. While "dummification" creates a very sparse setup, especially if you have multiple categorical columns with different levels, label encoding is often biased, as the mathematical representation does not reflect the relationship between levels.
For Binary Classification problems, an ingenious yet under-explored approach, highly leveraged in traditional credit scoring models, is to use Weight of Evidence to replace the categorical levels. Basically, every categorical level is replaced by the proportion of goods / proportion of bads for that level.
You can read more about it here.

Python library here.
This method allows you to capture the "levels" under one column and avoid the sparsity or induced bias that would occur through dummifying or encoding.
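A rough sketch of Weight of Evidence encoding in plain pandas (the toy data and column names are made up; the WoE of a level is ln(share of all goods in that level / share of all bads in that level)):

```python
import numpy as np
import pandas as pd

# Toy data: a categorical column and a binary target (1 = good, 0 = bad).
df = pd.DataFrame({
    "grade":  ["a", "a", "a", "b", "b", "b", "c", "c"],
    "target": [ 1,   1,   0,   1,   0,   0,   1,   0 ],
})

# Each level's share of all goods and of all bads.
goods = df[df["target"] == 1].groupby("grade").size() / (df["target"] == 1).sum()
bads  = df[df["target"] == 0].groupby("grade").size() / (df["target"] == 0).sum()

# Weight of Evidence per level, then mapped back onto the rows.
woe = np.log(goods / bads)
df["grade_woe"] = df["grade"].map(woe)
print(woe)
```

Note that a level containing only goods or only bads yields an undefined WoE; real implementations smooth the counts to avoid this.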
Hope this helps!
Answered by Roei Bahumi

Here is a code example of adding one-hot encoding columns to a Pandas DataFrame with categorical columns:
import pandas as pd

# Assumes `df` is an existing DataFrame containing the categorical columns below.
ONE_HOT_COLS = ["categorical_col1", "categorical_col2", "categorical_col3"]

print("Starting DF shape: %d, %d" % df.shape)

for col in ONE_HOT_COLS:
    s = df[col].unique()

    # Create a One Hot Dataframe with 1 row for each unique value
    one_hot_df = pd.get_dummies(s, prefix='%s_' % col)
    one_hot_df[col] = s

    print("Adding One Hot values for %s (the column has %d unique values)" % (col, len(s)))
    pre_len = len(df)

    # Merge the one hot columns
    df = df.merge(one_hot_df, on=[col], how="left")
    assert len(df) == pre_len

print(df.shape)