Label encoding across multiple columns in scikit-learn
Original question: http://stackoverflow.com/questions/24458645/
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors on Stack Overflow, citing the original URL.
Asked by Bryan
I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather just have one big LabelEncoder object that works across all my columns of data.
Throwing the entire DataFrame into LabelEncoder creates the error below. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.
import pandas
from sklearn import preprocessing
df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 'New_York']
})
le = preprocessing.LabelEncoder()
le.fit(df)
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)
Any thoughts on how to get around this problem?
Answered by Fred Foo
No, LabelEncoder does not do this. It takes 1-d arrays of class labels and produces 1-d arrays. It's designed to handle class labels in classification problems, not arbitrary data, and any attempt to force it into other uses will require code to transform the actual problem to the problem it solves (and the solution back to the original space).
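To make the 1-d contract concrete, here is a minimal sketch that feeds a single column of the question's dummy data to LabelEncoder. The variable names `pets` and `codes` are illustrative and not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One column of the question's dummy data, as a 1-d Series.
pets = pd.Series(['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'])

le = LabelEncoder()
codes = le.fit_transform(pets)       # 1-d in, 1-d out

print(list(le.classes_))             # sorted unique labels seen during fit
print(codes.tolist())                # one integer code per row
print(le.inverse_transform(codes).tolist())  # back to the original strings
```

Passing the whole three-column DataFrame instead of one column is what triggers the "bad input shape (6, 3)" error shown in the question.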
Answered by TehTechGuy
Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder() object that can be used to represent your columns, all you have to do is:
le.fit(df.columns)
In the above code you will have a unique number corresponding to each column. More precisely, you will have a 1:1 mapping of df.columns to le.transform(df.columns.values). To get a column's encoding, simply pass it to le.transform(...). As an example, the following will get the encoding for each column:
le.transform(df.columns.values)
Assuming you want to create a sklearn.preprocessing.LabelEncoder() object for all of your row labels, you can do the following:
le.fit([y for x in df.values for y in x])
In this case, you most likely have non-unique row labels (as shown in your question). To see what classes the encoder created, you can inspect le.classes_. You'll note that this should have the same elements as set(y for x in df.values for y in x). Once again, to convert a row label to an encoded label use le.transform(...). As an example, if you want to retrieve the label for the first column in the df.columns array and the first row, you could do this:
le.transform([df.at[0, df.columns[0]]])
The question you had in your comment is a bit more complicated, but can still be accomplished:
le.fit([str(z) for z in set((x[0], y) for x in df.items() for y in x[1])])
The above code does the following:
- Makes a unique combination of all of the pairs of (column, row).
- Represents each pair as a string version of the tuple. This is a workaround for the LabelEncoder class not supporting tuples as a class name.
- Fits the new items to the LabelEncoder.
Using this new model is a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:
le.transform([str((df.columns[0], df.at[0, df.columns[0]]))])
Remember that each lookup is now a string representation of a tuple that contains the (column, row).
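As a rough sketch of the "one encoder for every cell" idea from this answer, the snippet below fits a single LabelEncoder on all values of the question's DataFrame and then applies it column by column. The names `shared_le`, `encoded`, and `decoded` are illustrative; it assumes every cell holds a string, so labels that appear in several columns would share one code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# Flatten all cells into one 1-d array so a single encoder sees every label.
shared_le = LabelEncoder()
shared_le.fit(df.to_numpy().ravel())

# Apply the shared encoder to each column; no column is referenced by name.
encoded = df.apply(shared_le.transform)

# The same encoder maps the integer codes back to the original strings.
decoded = encoded.apply(shared_le.inverse_transform)

print(list(shared_le.classes_))
print(encoded)
```

Because the codes come from one shared vocabulary, they are not contiguous within a column, which may or may not matter for the downstream model.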
Answered by PriceHardman
As mentioned by larsmans, LabelEncoder() only takes a 1-d array as an argument. That said, it is quite easy to roll your own label encoder that operates on multiple columns of your choosing and returns a transformed dataframe. My code here is based in part on Zac Stewart's excellent blog post.
Creating a custom encoder involves simply creating a class that responds to the fit(), transform(), and fit_transform() methods. In your case, a good start might be something like this:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

# Create some toy data in a Pandas dataframe
fruit_data = pd.DataFrame({
    'fruit': ['apple','orange','pear','orange'],
    'color': ['red','orange','green','green'],
    'weight': [5,6,3,4]
})

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)
Suppose we want to encode our two categorical attributes (fruit and color), while leaving the numeric attribute weight alone. We could do this as follows:
MultiColumnLabelEncoder(columns = ['fruit','color']).fit_transform(fruit_data)
This transforms our fruit_data dataset by replacing the strings in the fruit and color columns with integer codes, while leaving weight unchanged.
Passing it a dataframe consisting entirely of categorical variables and omitting the columns parameter will result in every column being encoded (which I believe is what you were originally looking for):
MultiColumnLabelEncoder().fit_transform(fruit_data.drop('weight',axis=1))
This encodes both remaining columns, fruit and color.
Note that it'll probably choke when it tries to encode attributes that are already numeric (add some code to handle this if you like).
Another nice feature is that we can use this custom transformer in a pipeline:
encoding_pipeline = Pipeline([
('encoding',MultiColumnLabelEncoder(columns=['fruit','color']))
# add more pipeline steps as needed
])
encoding_pipeline.fit_transform(fruit_data)
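One possible way to handle the already-numeric case mentioned in the note above is to pick the columns by dtype rather than by name. The sketch below is an assumption about how that could look, not part of the original answer; the helper name `encode_non_numeric` is hypothetical, and it relies on pandas' `select_dtypes`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

fruit_data = pd.DataFrame({
    'fruit': ['apple', 'orange', 'pear', 'orange'],
    'color': ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4],
})

def encode_non_numeric(frame):
    """Label-encode every non-numeric column and return a new dataframe."""
    output = frame.copy()
    # select_dtypes(exclude='number') keeps only the non-numeric columns,
    # so numeric attributes such as 'weight' are left untouched.
    for col in output.select_dtypes(exclude='number').columns:
        output[col] = LabelEncoder().fit_transform(output[col])
    return output

encoded = encode_non_numeric(fruit_data)
print(encoded)
```

The same column list could presumably be passed as the columns argument of the MultiColumnLabelEncoder class above instead of using a separate helper.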
Answered by Napitupulu Jon
You can easily do this, though:
df.apply(LabelEncoder().fit_transform)
EDIT2:
In scikit-learn 0.20, the recommended way is
OneHotEncoder().fit_transform(df)
as the OneHotEncoder now supports string input. Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.
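A minimal sketch of the ColumnTransformer route mentioned above, assuming scikit-learn 0.20 or later. The toy data is borrowed from another answer on this page, and the step name 'onehot' is arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

fruit_data = pd.DataFrame({
    'fruit': ['apple', 'orange', 'pear', 'orange'],
    'color': ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4],
})

# One-hot encode only the two string columns; pass 'weight' through unchanged.
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['fruit', 'color'])],
    remainder='passthrough',
)
result = ct.fit_transform(fruit_data)

# 3 fruit categories + 3 color categories + 1 passthrough column.
print(result.shape)
```

Note that this produces one indicator column per category rather than a single integer column, so it is a different encoding from LabelEncoder.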
EDIT:
Since this answer is over a year old and has generated many upvotes (including a bounty), I should probably extend it further.
For inverse_transform and transform, you have to do a little bit of a hack.
from collections import defaultdict
d = defaultdict(LabelEncoder)
With this, you now retain each column's LabelEncoder in a dictionary.
# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))
# Inverse the encoded
fit.apply(lambda x: d[x.name].inverse_transform(x))
# Using the dictionary to label future data
df.apply(lambda x: d[x.name].transform(x))
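Put together, the defaultdict approach can be exercised end to end as in the sketch below. The question's dummy data is reused, and `new_rows` is a made-up second dataframe that only contains labels already seen during fitting.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# A new LabelEncoder is created the first time each column name is looked up.
d = defaultdict(LabelEncoder)

encoded = df.apply(lambda x: d[x.name].fit_transform(x))
restored = encoded.apply(lambda x: d[x.name].inverse_transform(x))

# Hypothetical future data, encoded with the already-fitted encoders.
new_rows = pd.DataFrame({
    'pets': ['monkey', 'cat'],
    'owner': ['Ron', 'Brick'],
    'location': ['New_York', 'San_Diego'],
})
new_encoded = new_rows.apply(lambda x: d[x.name].transform(x))
print(new_encoded)
```

Each column keeps its own code space here, so the integer 0 means 'cat' in pets but 'Brick' in owner.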
Answered by Jason Wolosonovich
This is a year and a half after the fact, but I too needed to be able to .transform() multiple pandas dataframe columns at once (and be able to .inverse_transform() them as well). This expands upon the excellent suggestion of @PriceHardman above:
import numpy as np
from sklearn.preprocessing import LabelEncoder


class MultiColumnLabelEncoder(LabelEncoder):
    """
    Wraps sklearn LabelEncoder functionality for use on multiple columns of a
    pandas dataframe.
    """

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, dframe):
        """
        Fit label encoder to pandas columns.

        Access individual column classes via indexing `self.all_classes_`
        Access individual column encoders via indexing `self.all_encoders_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                # fit LabelEncoder to get `classes_` for the column
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                # append this column's encoder
                self.all_encoders_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            # the encoder container must exist in this branch as well
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
        return self

    def fit_transform(self, dframe):
        """
        Fit label encoder and return encoded labels.

        Access individual column classes via indexing `self.all_classes_`
        Access individual column encoders via indexing `self.all_encoders_`
        Access individual column encoded labels via indexing
        `self.all_labels_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                          dtype=object)
            for idx, column in enumerate(self.columns):
                # instantiate LabelEncoder
                le = LabelEncoder()
                # fit and transform labels in the column
                labels = le.fit_transform(dframe.loc[:, column].values)
                dframe.loc[:, column] = labels
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
                # store the encoded labels (not the encoder) for this column
                self.all_labels_[idx] = labels
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                          dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                labels = le.fit_transform(dframe.loc[:, column].values)
                dframe.loc[:, column] = labels
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
                self.all_labels_[idx] = labels
        return dframe.loc[:, self.columns].values

    def transform(self, dframe):
        """
        Transform labels to normalized encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[
                    idx].transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values

    def inverse_transform(self, dframe):
        """
        Transform labels back to original encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values
Example:
If df and df_copy are mixed-type pandas dataframes, you can apply the MultiColumnLabelEncoder() to the dtype=object columns in the following way:
# get `object` columns
df_object_columns = df.iloc[:, :].select_dtypes(include=['object']).columns
df_copy_object_columns = df_copy.iloc[:, :].select_dtypes(include=['object']).columns
# instantiate `MultiColumnLabelEncoder`
mcle = MultiColumnLabelEncoder(columns=df_object_columns)
# fit to `df` data
mcle.fit(df)
# transform the `df` data
mcle.transform(df)
# returns output like below
array([[1, 0, 0, ..., 1, 1, 0],
[0, 5, 1, ..., 1, 1, 2],
[1, 1, 1, ..., 1, 1, 2],
...,
[3, 5, 1, ..., 1, 1, 2],
# transform `df_copy` data
mcle.transform(df_copy)
# returns output like below (assuming the respective columns
# of `df_copy` contain the same unique values as that particular
# column in `df`)
array([[1, 0, 0, ..., 1, 1, 0],
[0, 5, 1, ..., 1, 1, 2],
[1, 1, 1, ..., 1, 1, 2],
...,
[3, 5, 1, ..., 1, 1, 2],
# inverse `df` data
mcle.inverse_transform(df)
# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
...,
['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)
# inverse `df_copy` data
mcle.inverse_transform(df_copy)
# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
...,
['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)
You can access the individual column classes, column encoders, and column labels used to fit each column via indexing:
mcle.all_classes_
mcle.all_encoders_
mcle.all_labels_
Answered by Alexander
We don't need a LabelEncoder.
You can convert the columns to categoricals and then get their codes. I used a dictionary comprehension below to apply this process to every column and wrap the result back into a dataframe of the same shape with identical indices and column names.
>>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1
To create a mapping dictionary, you can just enumerate the categories using a dictionary comprehension:
>>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)}
for col in df}
{'location': {0: 'New_York', 1: 'San_Diego'},
'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
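The categorical approach can also be reversed and reused on new data, roughly as sketched below. This goes slightly beyond the original answer: the names `categories`, `codes`, `decoded`, and `new_pets` are illustrative, and the label 'parrot' is a made-up unseen value.

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# Remember each column's categories; this plays the role of the fitted state.
categories = {col: df[col].astype('category').cat.categories for col in df}

# Encode every column to its integer category codes.
codes = pd.DataFrame(
    {col: df[col].astype('category').cat.codes for col in df},
    index=df.index)

# Decode by rebuilding categoricals from the codes and stored categories.
decoded = pd.DataFrame(
    {col: pd.Categorical.from_codes(codes[col].to_numpy(),
                                    categories=categories[col])
     for col in df},
    index=df.index)

# Encode new values with the stored categories; unseen labels get code -1.
new_pets = pd.Categorical(['dog', 'parrot'],
                          categories=categories['pets']).codes
print(new_pets)
```

Unlike LabelEncoder.transform, which raises an error on unseen labels, this silently maps them to -1, so downstream code may need to check for that value.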
Answered by Anurag Priyadarshi
This does not directly answer your question (for which Napitupulu Jon and PriceHardman have fantastic replies). However, for some classification tasks you could use
pandas.get_dummies(input_df)
This takes a dataframe with categorical data and returns a dataframe with binary values. The variable values are encoded into the column names of the resulting dataframe.
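A small illustration of what get_dummies returns, using a made-up two-column `input_df` (the data is hypothetical, not from the original answer):

```python
import pandas as pd

input_df = pd.DataFrame({
    'pets': ['cat', 'dog', 'monkey'],
    'weight': [4, 20, 7],
})

# String columns are expanded into one indicator column per distinct value,
# named <column>_<value>; numeric columns are kept as they are.
dummies = pd.get_dummies(input_df)
print(dummies)
```

The indicator columns hold booleans or 0/1 integers depending on the pandas version.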
Answered by Puja Sharma
With a single column, label encoding and its inverse transform are easy. Here is how to do it in Python when there are multiple columns:
from sklearn import preprocessing

def stringtocategory(dataset):
    '''
    @author puja.sharma
    @see The function label-encodes the object-type columns and returns both the label-encoded data and its inverse transform
    @param dataset dataframe whose columns are to be label encoded
    @return the label-encoded data and the inverse transform of the label-encoded data
    '''
    # work on copies so the caller's dataframe is not modified
    data_original = dataset.copy()
    data_transformed = dataset.copy()
    for y in dataset.columns:
        # check whether the column has object dtype (strings or chars)
        if dataset[y].dtype == object:
            print("The string type features are : " + y)
            le = preprocessing.LabelEncoder()
            le.fit(dataset[y].unique())
            # label encoded data
            data_transformed[y] = le.transform(dataset[y])
            # inverse label transform data
            data_original[y] = le.inverse_transform(data_transformed[y])
    return data_transformed, data_original
Answered by Dror
Following up on the comments raised on the solution of @PriceHardman, I would propose the following version of the class:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import LabelEncoder
# NOTE: `pdu` is the answer author's own helper module; it is not shown here.


class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
    def __init__(self, cols=None):
        pdu._is_cols_input_valid(cols)
        self.cols = cols
        self.les = {col: LabelEncoder() for col in cols}
        self._is_fitted = False

    def transform(self, df, **transform_params):
        """
        Label-encoding ``cols`` of ``df`` using the fitted encoders

        Parameters
        ----------
        df : DataFrame
            DataFrame to be preprocessed
        """
        if not self._is_fitted:
            raise NotFittedError("Fitting was not performed")
        pdu._is_cols_subset_of_df_cols(self.cols, df)

        df = df.copy()

        label_enc_dict = {}
        for col in self.cols:
            label_enc_dict[col] = self.les[col].transform(df[col])

        # The index of the resulting DataFrame should be assigned and
        # equal to the one of the original DataFrame. Otherwise, upon
        # concatenation NaNs will be introduced.
        labelenc_cols = pd.DataFrame(label_enc_dict, index=df.index)

        for col in self.cols:
            df[col] = labelenc_cols[col]
        return df

    def fit(self, df, y=None, **fit_params):
        """
        Fitting the preprocessing

        Parameters
        ----------
        df : DataFrame
            Data to use for fitting.
            In many cases, should be ``X_train``.
        """
        pdu._is_cols_subset_of_df_cols(self.cols, df)
        for col in self.cols:
            self.les[col].fit(df[col])
        self._is_fitted = True
        return self
This class fits the encoder on the training set and uses the fitted version when transforming.
Answered by Tom
A short way to apply LabelEncoder() to multiple columns with a dict():
from sklearn.preprocessing import LabelEncoder
le_dict = {col: LabelEncoder() for col in columns }
for col in columns:
    le_dict[col].fit_transform(df[col])
You can then use this le_dict to label-encode the same column of any other dataframe:
le_dict[col].transform(df_another[col])
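For completeness, here is a self-contained sketch of the dict approach with concrete data. The contents of `df` and `df_another` and the `columns` list are made up for illustration; the answer itself leaves them undefined.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
})
df_another = pd.DataFrame({
    'pets': ['dog', 'monkey'],
    'owner': ['Veronica', 'Champ'],
})
columns = ['pets', 'owner']

# Fit one encoder per column; fit() returns the encoder itself.
le_dict = {col: LabelEncoder().fit(df[col]) for col in columns}

# Reuse the fitted encoders on another dataframe with the same columns.
encoded_another = df_another.copy()
for col in columns:
    encoded_another[col] = le_dict[col].transform(df_another[col])

print(encoded_another)
```

Keep in mind that transform raises a ValueError if the other dataframe contains a label that was not present when the encoder was fitted.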