Python: label encoder encoding missing values
Original source: http://stackoverflow.com/questions/36808434/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
label-encoder encoding missing values
Asked by saurabh agarwal
I am using the label encoder to convert categorical data into numeric values.
How does LabelEncoder handle missing values?
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
le.fit_transform(a)
Output:
array([1, 2, 3, 0, 4, 1])
For the above example, label encoder changed NaN values to a category. How would I know which category represents missing values?
Accepted answer by dukebody
Don't use LabelEncoder with missing values. I don't know which version of scikit-learn you're using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().
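To illustrate the kind of failure described here, this is a minimal sketch in plain Python (it does not call scikit-learn, and the variable names are made up for the example). Ordering a list that mixes strings with a float NaN fails in Python 3, which is the same comparison problem behind the error message above. Other scikit-learn versions may behave differently.

```python
# Hypothetical sample: string labels plus one float NaN
values = ["A", "B", float("nan")]

try:
    sorted(values)  # ordering needs str-vs-float comparisons
    raised = False
except TypeError as exc:
    # Python 3 refuses to order str against float
    raised = True
    print("TypeError:", exc)
```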
As you can see in the source, it uses numpy.unique against the data to encode, which raises TypeError if missing values are found. If you want to encode missing values, first change their type to a string:
a[pd.isnull(a)] = 'NaN'
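As for the original question of which category stands for the missing values: once they are replaced by a placeholder string, the code can be looked up on the fitted encoder. A minimal sketch, assuming a 1-D Series and a placeholder name (MISSING) that I picked for illustration; any string that does not already occur in the data should work:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

MISSING = "NaN"  # hypothetical placeholder label, not part of the original data

s = pd.Series(["A", "B", "C", np.nan, "D", "A"])
filled = s.fillna(MISSING).astype(str)

le = LabelEncoder()
codes = le.fit_transform(filled.tolist())

# classes_ is sorted, so the index of each label is its integer code
label_to_code = {label: int(code) for code, label in enumerate(le.classes_)}
missing_code = int(le.transform([MISSING])[0])

print(label_to_code)
print("code used for missing values:", missing_code)
```

Reading the code from le.classes_ avoids guessing where the placeholder ends up in the sort order.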
Answer by Kerem T
Hello, here is a little computational hack I did for my own work:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
### fit with the desired col, col in position 0 for this example
fit_by = pd.Series([i for i in a.iloc[:,0].unique() if type(i) == str])
le.fit(fit_by)
### Set transformed col leaving np.NaN as they are
a["transformed"] = fit_by.apply(lambda x: le.transform([x])[0] if type(x) == str else x)
Answer by Niclas von Caprivi
This is my solution, because I was not pleased with the solutions posted here. I needed a LabelEncoder that keeps my missing values as 'NaN' to use an Imputer afterwards. So I have written my own LabelEncoder class. It works with DataFrames.
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelEncoder

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        # List of column names in the DataFrame that should be encoded
        self.col = col
        # Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self, x, y=None):
        # Fill missing values with the string 'NaN' (note: modifies x in place)
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            # Only use the values that are not 'NaN' to fit the encoder
            a = x[el][x[el] != 'NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self, x, y=None):
        # Fill missing values with the string 'NaN' (note: modifies x in place)
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            # Only transform the values that are not 'NaN'
            a = x[el][x[el] != 'NaN']
            # Store an ndarray copy of the current column
            # (.values replaces get_values(), which was removed in pandas 1.0)
            b = x[el].values.copy()
            # Replace the elements in the ndarray that are not 'NaN'
            # using the transformer
            b[b != 'NaN'] = self.le_dic[el].transform(a)
            # Overwrite the column in the DataFrame
            x[el] = b
        # Return the transformed DataFrame
        return x
You can pass in a DataFrame, not only a 1-dimensional Series. With col you can choose the columns that should be encoded.
I would like to hear some feedback.
Answer by ulrich
You can also use a mask to restore the missing values from the original data frame after labelling:
df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})
A B C
0 x 1 2.0
1 NaN 6 1.0
2 z 9 NaN
original = df
mask = df.isnull()
A B C
0 False False False
1 True False False
2 False False True
df = df.astype(str).apply(LabelEncoder().fit_transform)
df.where(~mask, original)
A B C
0 1.0 0 1.0
1 NaN 1 0.0
2 2.0 2 NaN
Answer by raghu nanden
You can fill the NAs with some value and then change the dataframe column type to string to make things work.
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
a = a.fillna(99)  # fillna returns a new DataFrame, so assign the result
le = LabelEncoder()
le.fit_transform(a.astype(str))
Answer by prony
The most voted answer by @Kerem has typos, therefore I am posting the corrected and improved answer here:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
for j in a.columns.values:
    le = LabelEncoder()
    ### fit with the desired col, col in position 0 for this example
    fit_by = pd.Series([i for i in a[j].unique() if type(i) == str])
    le.fit(fit_by)
    ### Set transformed col leaving np.NaN as they are
    a["transformed"] = a[j].apply(lambda x: le.transform([x])[0] if type(x) == str else x)
Answer by Ashok Kumar Pant
The following encoder handles None values in each column.
from collections import defaultdict
from sklearn import preprocessing

class MultiColumnLabelEncoder:
    def __init__(self):
        self.columns = None
        # One LabelEncoder per column, created on first access
        self.led = defaultdict(preprocessing.LabelEncoder)

    def fit(self, X):
        self.columns = X.columns
        for col in self.columns:
            cat = X[col].unique()
            # Encode None as the string "None"
            cat = [x if x is not None else "None" for x in cat]
            self.led[col].fit(cat)
        return self

    def fit_transform(self, X):
        if self.columns is None:
            self.fit(X)
        return self.transform(X)

    def transform(self, X):
        return X.apply(lambda x: self.led[x.name].transform(x.apply(lambda e: e if e is not None else "None")))

    def inverse_transform(self, X):
        return X.apply(lambda x: self.led[x.name].inverse_transform(x))
Usage example
df = pd.DataFrame({
'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
'owner': ['Champ', 'Ron', 'Brick', None, 'Veronica', 'Ron'],
'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
None]
})
print(df)
location owner pets
0 San_Diego Champ cat
1 New_York Ron dog
2 New_York Brick cat
3 San_Diego None monkey
4 San_Diego Veronica dog
5 None Ron dog
le = MultiColumnLabelEncoder()
le.fit(df)
transformed = le.transform(df)
print(transformed)
location owner pets
0 2 1 0
1 0 3 1
2 0 0 0
3 2 2 2
4 2 4 1
5 1 3 1
inverted = le.inverse_transform(transformed)
print(inverted)
location owner pets
0 San_Diego Champ cat
1 New_York Ron dog
2 New_York Brick cat
3 San_Diego None monkey
4 San_Diego Veronica dog
5 None Ron dog
Answer by muon
This is how I did it:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
UNKNOWN_TOKEN = '<unknown>'
a = pd.Series(['A','B','C', 'D','A'], dtype=str).unique().tolist()
a.append(UNKNOWN_TOKEN)
le = LabelEncoder()
le.fit_transform(a)
embedding_map = dict(zip(le.classes_, le.transform(le.classes_)))
And when applying it to new test data:
test_df = test_df.apply(lambda x: x if x in embedding_map else UNKNOWN_TOKEN)
le.transform(test_df)
Answer by chankane
Answer by rorance_
I also wanted to contribute my workaround, as I found the others a bit more tedious when working with categorical data that contains missing values.
# Create a random dataframe
foo = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
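# Randomly intersperse column 'A' with missing data (NaN);
# cast to float first so NaN fits, and use .loc to avoid chained assignment
foo['A'] = foo['A'].astype(float)
foo.loc[np.random.randint(0, len(foo), size=20), 'A'] = np.nan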
# Convert this series to string, to simulate our problem
series = foo['A'].astype(str)
# np.nan are converted to the string "nan", mask these out
mask = (series == "nan")
# Apply the LabelEncoder to the unmasked series, replace the masked series with np.nan
series[~mask] = LabelEncoder().fit_transform(series[~mask])
series[mask] = np.nan
foo['A'] = series
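The masking idea from the answers above can be wrapped in a small helper. This is only a sketch under my own assumptions: the function name encode_keep_nan is made up, it works on a single Series, and it returns the fitted encoder so the codes can be mapped back later.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_keep_nan(series):
    """Label-encode the non-missing entries and leave missing ones as NaN."""
    mask = series.notna()
    le = LabelEncoder()
    # Start from an all-NaN float column so missing rows stay NaN
    encoded = pd.Series(np.nan, index=series.index, dtype=float)
    encoded.loc[mask] = le.fit_transform(series[mask].astype(str).tolist())
    return encoded, le

s = pd.Series(["A", "B", "C", np.nan, "D", "A"])
encoded, le = encode_keep_nan(s)
print(encoded)
print(le.classes_)
```

The result is a float column because NaN cannot live in a plain integer column; an imputer can then be applied to it afterwards.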