XGBoost Categorical Variables: Dummification vs Encoding

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not the translator): StackOverflow.

Original question: http://stackoverflow.com/questions/34265102/
Asked by ishido

When using XGBoost, we need to convert categorical variables into numeric ones.
Would there be any difference in performance/evaluation metrics between the methods of:

- dummifying your categorical variables
- encoding your categorical variables from e.g. (a,b,c) to (1,2,3)
ALSO:

Would there be any reasons not to go with method 2 by using, for example, LabelEncoder?
Answered by T. Scharf

xgboost only deals with numeric columns.
If you have a feature [a,b,b,c] which describes a categorical variable (i.e. no numeric relationship), then using LabelEncoder you will simply get this:

array([0, 1, 1, 2])
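As a minimal sketch of that mapping (assuming scikit-learn is installed; the feature values are just illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Encode the categorical feature [a, b, b, c] as integers.
feature = ["a", "b", "b", "c"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(feature)

print(encoded)           # integer codes, assigned in sorted-category order
print(encoder.classes_)  # the categories the codes refer back to
```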
Xgboost will wrongly interpret this feature as having a numeric relationship! This just maps each string ('a','b','c') to an integer, nothing more.
Proper way
Using OneHotEncoder you will eventually get to this:
array([[ 1., 0., 0.],
       [ 0., 1., 0.],
       [ 0., 1., 0.],
       [ 0., 0., 1.]])
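A short sketch of producing that matrix with scikit-learn's OneHotEncoder (string input requires a reasonably recent scikit-learn; `.toarray()` densifies the sparse result):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The input must be 2-D: one feature column, four samples.
feature = np.array(["a", "b", "b", "c"]).reshape(-1, 1)

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(feature).toarray()  # densify the sparse matrix
print(one_hot)
```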
This is the proper representation of a categorical variable for xgboost or any other machine learning tool.
Pandas get_dummies is a nice tool for creating dummy variables (and easier to use, in my opinion).
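For example, a quick sketch with get_dummies on a toy column (the column name is just illustrative):

```python
import pandas as pd

# One categorical column with three levels.
df = pd.DataFrame({"feature": ["a", "b", "b", "c"]})

# One indicator column per level, prefixed with the source column name.
dummies = pd.get_dummies(df["feature"], prefix="feature")
print(dummies)
```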
Method #2 in the above question will not represent the data properly.
Answered by mamafoku

I want to answer this question not just in terms of XGBoost but in terms of any problem dealing with categorical data. While "dummification" creates a very sparse setup, especially if you have multiple categorical columns with different levels, label encoding is often biased, as the mathematical representation does not reflect the relationship between levels.
For Binary Classification problems, an ingenious yet under-explored approach, highly leveraged in traditional credit scoring models, is to use Weight of Evidence to replace the categorical levels. Basically, every categorical level is replaced by the proportion of goods / proportion of bads for that level.
You can read more about it here.

Python library here.
This method allows you to capture the "levels" under one column and avoid the sparsity or induced bias that would occur through dummifying or encoding.
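A rough sketch of Weight of Evidence encoding in plain pandas (the toy data and column names are made up; the WoE of a level is ln(share of all goods in that level / share of all bads in that level)):

```python
import numpy as np
import pandas as pd

# Toy data: a categorical column and a binary target (1 = good, 0 = bad).
df = pd.DataFrame({
    "grade":  ["a", "a", "a", "b", "b", "b", "c", "c"],
    "target": [ 1,   1,   0,   1,   0,   0,   1,   0 ],
})

# Each level's share of all goods and of all bads.
goods = df[df["target"] == 1].groupby("grade").size() / (df["target"] == 1).sum()
bads  = df[df["target"] == 0].groupby("grade").size() / (df["target"] == 0).sum()

# Weight of Evidence per level, then mapped back onto the rows.
woe = np.log(goods / bads)
df["grade_woe"] = df["grade"].map(woe)
print(woe)
```

Note that a level containing only goods or only bads yields an undefined WoE; real implementations smooth the counts to avoid this.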
Hope this helps!
Answered by Roei Bahumi

Here is a code example of adding one-hot encoding columns to a Pandas DataFrame with categorical columns:
import pandas as pd

# Assumes `df` is an existing DataFrame containing the categorical columns below.
ONE_HOT_COLS = ["categorical_col1", "categorical_col2", "categorical_col3"]

print("Starting DF shape: %d, %d" % df.shape)

for col in ONE_HOT_COLS:
    s = df[col].unique()

    # Create a One Hot Dataframe with 1 row for each unique value
    one_hot_df = pd.get_dummies(s, prefix='%s_' % col)
    one_hot_df[col] = s

    print("Adding One Hot values for %s (the column has %d unique values)" % (col, len(s)))
    pre_len = len(df)

    # Merge the one hot columns
    df = df.merge(one_hot_df, on=[col], how="left")
    assert len(df) == pre_len

print(df.shape)