pandas: Using Scikit's LabelEncoder correctly across multiple programs
Original question: http://stackoverflow.com/questions/28656736/
License: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by alphacentauri
The basic task that I have at hand is:
a) Read some tab-separated data.
b) Do some basic preprocessing.
c) For each categorical column, use LabelEncoder to create a mapping. This is done somewhat like this:
from sklearn import preprocessing
mapper = {}
# Converting categorical data: one LabelEncoder per column
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()
for x in categorical_list:
    df[x] = mapper[x].fit_transform(df[x])
where df is a pandas dataframe and categorical_list is a list of column headers that need to be transformed.
d) Train a classifier and save it to disk using pickle.
e) Now, in a different program, the saved model is loaded.
f) The test data is loaded and the same preprocessing is performed.
g) The LabelEncoders are used for converting categorical data.
h) The model is used to predict.
Now the question that I have is: will step g) work correctly?
As the documentation for LabelEncoder says:
"It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels."
So will each entry hash to the exact same value every time?
If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder? Or is there an altogether different approach that avoids LabelEncoder?
Answered by Artem Sobolev
According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if you fit the LabelEncoders at test time with data that have exactly the same set of unique values.
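A minimal sketch of why the set of unique values matters. It assumes the behavior of current scikit-learn releases, where `classes_` holds the sorted unique values seen during `fit` and each code is an index into that array. The labels are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Same set of unique values, seen in a different order and frequency.
first = LabelEncoder().fit(["b", "c", "a", "c"])
second = LabelEncoder().fit(["a", "b", "c"])

# A fit on data that lacks one of the values ("a" is missing).
subset = LabelEncoder().fit(["b", "c"])

# Codes are positions in the sorted classes_ array, so the first two agree.
same_mapping = list(first.transform(["a", "b", "c"])) == list(
    second.transform(["a", "b", "c"])
)

# The subset encoder shifts every code, so "c" no longer maps to the same integer.
code_full = int(first.transform(["c"])[0])
code_subset = int(subset.transform(["c"])[0])
print(same_mapping, code_full, code_subset)
```

The mapping depends only on which values were seen, not on Python's hash values, so it is stable across programs as long as the fitted classes are identical.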
There's a somewhat hacky way to reuse the LabelEncoders you got during training. LabelEncoder has only one property, namely classes_. You can save it to disk and then restore it like this:
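Train:
import numpy
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)
Test:
import numpy
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# allow_pickle=True is needed when classes_ has object dtype (e.g. strings from a pandas column)
encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
# Now you should be able to use encoder
# as you would do after `fit`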
This seems more efficient than refitting it using the same data.
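As a self-contained check of this approach, the sketch below runs both halves in one script. The label values and the temporary file location are assumptions made for illustration; in practice the two halves would live in separate programs.

```python
import os
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for one categorical column.
train_labels = ["paris", "tokyo", "amsterdam", "tokyo"]
test_labels = ["tokyo", "paris"]

# "Training program": fit the encoder and save only its classes_ array.
train_encoder = LabelEncoder()
train_codes = train_encoder.fit_transform(train_labels)
path = os.path.join(tempfile.mkdtemp(), "classes.npy")
np.save(path, train_encoder.classes_)

# "Test program": rebuild an encoder from the saved classes_ without refitting.
test_encoder = LabelEncoder()
test_encoder.classes_ = np.load(path, allow_pickle=True)
test_codes = test_encoder.transform(test_labels)

print(list(train_codes), list(test_codes))
```

The restored encoder assigns "tokyo" and "paris" the same integers they received during training.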
Answered by Shady Sherif
For me, the easiest way was exporting the LabelEncoder as a .pkl file for each column. You have to export the encoder for each column after calling the fit_transform() function.
For example:
from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])
# Export the fitted Departure encoder
with open('Departure_encoder.pkl', 'wb') as output:
    pickle.dump(le, output)
Then, in the testing project, you can load the LabelEncoder object and apply its transform() function directly:
from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
df_test = pd.read_csv('testing_data.csv')
# Load the encoder file saved by the training program
with open('Departure_encoder.pkl', 'rb') as pkl_file:
    le_departure = pickle.load(pkl_file)
df_test['Departure'] = le_departure.transform(df_test['Departure'])
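One caveat when reusing a fitted encoder this way: as far as I know, `transform()` raises a `ValueError` for labels that were not present during `fit`. The sketch below illustrates this with made-up airport codes; the `le_departure` name mirrors the example above, but the data is hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical training values for the Departure column.
le_departure = LabelEncoder().fit(["AMS", "CDG", "LHR"])

# Labels seen during fit are encoded as usual.
known = le_departure.transform(["LHR", "AMS"])

# A label that never appeared in the training data is rejected.
try:
    le_departure.transform(["JFK"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True

print(list(known), unseen_rejected)
```

If the test data can contain new categories, they need to be handled before calling `transform()`, for example by filtering or mapping them to a value that was present at training time.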
Answered by wannabe_nerd
What works for me is LabelEncoder().fit(X_train[col]), pickling these objects for each categorical column col, and then reusing the same objects to transform the same categorical column col in the validation dataset. Basically, you have a label encoder object for each of your categorical columns.
- So fit() on the training data and pickle the objects/models corresponding to each column in the training dataframe X_train.
- For each col in the columns of the validation set X_cv, load the corresponding object/model and apply the transformation by calling the transform function as: transform(X_cv[col]).
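The two steps above can be sketched as follows. This is a minimal illustration, not the answerer's exact code: the column names, the data, and the choice to store all encoders in one dictionary inside a single pickle file (rather than one file per column) are assumptions.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up training and validation frames with two categorical columns.
X_train = pd.DataFrame(
    {"city": ["paris", "tokyo", "paris"], "color": ["red", "blue", "green"]}
)
X_cv = pd.DataFrame({"city": ["tokyo", "paris"], "color": ["green", "red"]})
categorical_cols = ["city", "color"]

# Step 1: fit one encoder per column on the training data and pickle them.
encoders = {col: LabelEncoder().fit(X_train[col]) for col in categorical_cols}
path = os.path.join(tempfile.mkdtemp(), "encoders.pkl")
with open(path, "wb") as f:
    pickle.dump(encoders, f)

# Step 2: load the encoders and transform the matching validation columns.
with open(path, "rb") as f:
    loaded = pickle.load(f)
X_cv_encoded = X_cv.copy()
for col in categorical_cols:
    X_cv_encoded[col] = loaded[col].transform(X_cv[col])

print(X_cv_encoded)
```

Keeping the encoders in a dictionary keyed by column name makes it harder to apply the wrong encoder to a column.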
Answered by geniolius
You can do something like this after you have encoded the values with the "le" object:
encoding = {}
for i in list(le.classes_):
    encoding[i] = le.transform([i])[0]
You will get the "encoding" dictionary with the encoding for later use. With pandas you can, for example, export this dictionary to a CSV file.
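One possible way to do the CSV export and reuse the mapping later is sketched below. The labels, the `label` and `code` column names, and the in-memory buffer standing in for a file on disk are assumptions made for illustration.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up labels; "le" plays the role of the already fitted encoder.
le = LabelEncoder().fit(["low", "high", "medium", "low"])

# Build the label -> code dictionary in one pass over classes_.
encoding = {
    str(label): int(code)
    for label, code in zip(le.classes_, le.transform(le.classes_))
}

# Export to CSV (a StringIO buffer is used here instead of a real file).
buffer = io.StringIO()
pd.Series(encoding, name="code").rename_axis("label").to_csv(buffer)

# Later, possibly in another program: read the mapping back and apply it.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="label")["code"].to_dict()
new_codes = pd.Series(["medium", "high"]).map(restored)

print(restored, new_codes.tolist())
```

Unlike a pickled encoder, the CSV is human-readable, and `map()` returns NaN for labels missing from the mapping instead of raising an error.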