Python: how to encode a categorical variable in sklearn?
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, link to the original question, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/15021521/
Asked by tonicebrian
I'm trying to use the car evaluation dataset from the UCI repository, and I wonder whether there is a convenient way to binarize categorical variables in sklearn. One approach would be to use DictVectorizer or LabelBinarizer, but here I'm getting k different features, whereas you should have just k-1 in order to avoid collinearity. I guess I could write my own function and drop one column, but this bookkeeping is tedious. Is there an easy way to perform such transformations and get a sparse matrix as a result?
Accepted answer by Peter Prettenhofer
DictVectorizer is the recommended way to generate a one-hot encoding of categorical variables; you can use the sparse argument to create a sparse CSR matrix instead of a dense numpy array. I usually don't care about multicollinearity, and I haven't noticed a problem with the approaches that I tend to use (i.e. LinearSVC, SGDClassifier, tree-based methods).
It shouldn't be a problem to patch the DictVectorizer to drop one column per categorical feature - you simply need to remove one term from DictVectorizer.vocabulary_ at the end of the fit method. (Pull requests are always welcome!)
Answer by NetSmoothMF
The basic method is:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

def one_hot_dataframe(data, cols, replace=False):
    # One-hot encode the given columns via DictVectorizer.
    vec = DictVectorizer()
    mkdict = lambda row: dict((col, row[col]) for col in cols)
    vecData = pd.DataFrame(vec.fit_transform(data[cols].apply(mkdict, axis=1)).toarray())
    # get_feature_names() in scikit-learn releases before 1.0
    vecData.columns = vec.get_feature_names_out()
    vecData.index = data.index
    if replace:
        # Swap the original categorical columns for their dummy columns.
        data = data.drop(cols, axis=1)
        data = data.join(vecData)
    return (data, vecData, vec)

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df2, _, _ = one_hot_dataframe(df, ['state'], replace=True)
print(df2)
Here is how to do it in sparse format:
import numpy as np
import pandas as pd
import scipy.sparse as sps

def one_hot_column(df, cols, vocabs):
    mats = []
    # Keep the non-categorical columns as-is, in sparse form.
    df2 = df.drop(cols, axis=1)
    mats.append(sps.lil_matrix(np.array(df2)))
    for i, col in enumerate(cols):
        # One column per vocabulary entry; set the matching cell to 1.
        mat = sps.lil_matrix((len(df), len(vocabs[i])))
        for j, val in enumerate(np.array(df[col])):
            mat[j, vocabs[i][val]] = 1.
        mats.append(mat)
    res = sps.hstack(mats)
    return res

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': ['2000', '2001', '2002', '2001', '2002'],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
print(df)

vocabs = []
vals = ['Ohio', 'Nevada']
vocabs.append(dict(zip(vals, range(len(vals)))))
vals = ['2000', '2001', '2002']
vocabs.append(dict(zip(vals, range(len(vals)))))
print(vocabs)
print(one_hot_column(df, ['state', 'year'], vocabs).todense())
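For reference, modern scikit-learn (0.21+) makes this hand-rolled bookkeeping unnecessary: OneHotEncoder accepts drop='first', which removes one level per feature (the k-1 encoding the question asks for) and returns a sparse matrix by default. A minimal sketch on the same toy frame:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': ['2000', '2001', '2002', '2001', '2002']})

# drop='first' emits k-1 columns per feature; the result is sparse.
enc = OneHotEncoder(drop='first')
X = enc.fit_transform(df[['state', 'year']])
print(X.shape)  # 1 column for 'state' (2 levels), 2 for 'year' (3 levels)
```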
Answer by rezakhorshidi
If your data is a pandas DataFrame, then you can simply call get_dummies. Assume that your data frame is df, and you want to have one binary variable per level of the variable 'key'. You can simply call:
pd.get_dummies(df['key'])
and then delete one of the dummy variables, to avoid the multicollinearity problem. I hope this helps ...
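In recent pandas versions you don't even need to delete the column by hand: get_dummies takes drop_first=True and yields the k-1 encoding directly. A sketch, using an illustrative 'key' column:

```python
import pandas as pd

# Illustrative frame; 'key' is a hypothetical categorical column.
df = pd.DataFrame({"key": ["a", "b", "b", "c", "a"]})

# drop_first=True drops the first level ('a'), leaving k-1 dummy columns
# and avoiding perfect collinearity among the dummies.
dummies = pd.get_dummies(df["key"], drop_first=True)
print(dummies.columns.tolist())  # ['b', 'c']
```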

