Python: LabelEncoder and encoding missing values

Question

Asked by saurabh agarwal

I am using the label encoder to convert categorical data into numeric values.

How does LabelEncoder handle missing values?

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
le.fit_transform(a)

Output:

array([1, 2, 3, 0, 4, 1])

For the above example, label encoder changed NaN values to a category. How would I know which category represents missing values?

Answer 1

Accepted answer by dukebody

Don't use LabelEncoder with missing values. I don't know which version of scikit-learn you're using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().

As you can see in the source, it uses numpy.unique against the data to encode, which raises TypeError if missing values are found. If you want to encode missing values, first change their type to a string:

a[pd.isnull(a)]  = 'NaN'
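
To find out which integer ends up standing for the missing values after such a replacement, one option is to look it up through the fitted encoder. This is a minimal sketch, assuming the placeholder string 'NaN' does not already occur in the data; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['A', 'B', 'C', np.nan, 'D', 'A'])

# Replace missing entries with a placeholder string before encoding
filled = s.fillna('NaN')

le = LabelEncoder()
codes = le.fit_transform(filled)

# classes_ holds the sorted labels; a label's position is its code
print(list(le.classes_))

# Look up the code that was assigned to the placeholder
nan_code = le.transform(['NaN'])[0]
print(codes, nan_code)
```

Since the labels are sorted before codes are assigned, the code for the placeholder depends on how it compares with the other labels, so it is safer to look it up than to assume it is first or last.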

Answer 2

Answered by Kerem T

Hello, here is a little computational hack I did for my own work:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
### fit with the desired col, col in position 0 for this example
fit_by = pd.Series([i for i in a.iloc[:,0].unique() if type(i) == str])
le.fit(fit_by)
### Set transformed col leaving np.NaN as they are
a["transformed"] = fit_by.apply(lambda x: le.transform([x])[0] if type(x) == str else x)

Answer 3

Answered by Niclas von Caprivi

This is my solution, because I was not pleased with the solutions posted here. I needed a LabelEncoder that keeps my missing values as 'NaN' to use an Imputer afterwards. So I have written my own LabelEncoder class. It works with DataFrames.

from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelEncoder

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only transform the values that are not 'NaN'
            a = x[el][x[el]!='NaN']
            #Store an ndarray copy of the current column
            #(get_values() was removed in pandas 1.0)
            b = x[el].to_numpy(copy=True)
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        #return the transformed DataFrame
        return x

You can pass in a DataFrame, not only a 1-dimensional Series. With col you can choose the columns that should be encoded.

I would like to hear some feedback.

Answer 4

Answered by ulrich

You can also use a mask to restore the missing values from the original data frame after labelling:

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

    A   B   C
0   x   1   2.0
1   NaN 6   1.0
2   z   9   NaN

original = df
mask = df.isnull()
       A    B   C
0   False   False   False
1   True    False   False
2   False   False   True

df = df.astype(str).apply(LabelEncoder().fit_transform)
df.where(~mask, original)

    A   B   C
0   1.0 0   1.0
1   NaN 1   0.0
2   2.0 2   NaN
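
One thing the one-liner above gives up is the fitted encoders, so the codes cannot be mapped back to labels later. A possible variation, sketched here under the assumption that the columns hold strings and that the sentinel '<missing>' never occurs in the data (the sentinel, the `encoders` dict and the sample frame are all made up for illustration), keeps one encoder per column:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': ['p', 'q', 'p']})

# Remember where the missing values were
mask = df.isnull()

encoders = {}
encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    encoders[col] = LabelEncoder()
    # Encode with a sentinel in place of NaN so the encoder sees only strings
    encoded[col] = encoders[col].fit_transform(df[col].fillna('<missing>'))

# Put NaN back wherever the original value was missing
encoded = encoded.where(~mask)
print(encoded)

# The stored encoder maps codes back to the original labels
print(encoders['A'].inverse_transform([1, 2]))
```

Note that the sentinel still occupies one code in each column where it was used, so the remaining codes are not necessarily contiguous from zero.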

Answer 5

Answered by raghu nanden

You can fill the NaNs with some value and then change the DataFrame column type to string to make things work.

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
a = a.fillna(99)  # fillna returns a new DataFrame, so assign it back
le = LabelEncoder()
le.fit_transform(a.astype(str))

Answer 6

Answered by prony

The most voted answer by @Kerem has typos, therefore I am posting the corrected and improved answer here:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
for j in a.columns.values:
    le = LabelEncoder()
    ### fit with the desired col, col in position 0 for this example
    fit_by = pd.Series([i for i in a[j].unique() if type(i) == str])
    le.fit(fit_by)
    ### Set transformed col leaving np.NaN as they are
    a["transformed"] = a[j].apply(lambda x: le.transform([x])[0] if type(x) == str else x)

Answer 7

Answered by Ashok Kumar Pant

The following encoder handles None values in each category.

from collections import defaultdict

import pandas as pd
from sklearn import preprocessing

class MultiColumnLabelEncoder:
    def __init__(self):
        self.columns = None
        self.led = defaultdict(preprocessing.LabelEncoder)

    def fit(self, X):
        self.columns = X.columns
        for col in self.columns:
            cat = X[col].unique()
            cat = [x if x is not None else "None" for x in cat]
            self.led[col].fit(cat)
        return self

    def fit_transform(self, X):
        if self.columns is None:
            self.fit(X)
        return self.transform(X)

    def transform(self, X):
        return X.apply(lambda x:  self.led[x.name].transform(x.apply(lambda e: e if e is not None else "None")))

    def inverse_transform(self, X):
        return X.apply(lambda x: self.led[x.name].inverse_transform(x))

Usage example

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', None, 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 None]
})


print(df)

    location     owner    pets
0  San_Diego     Champ     cat
1   New_York       Ron     dog
2   New_York     Brick     cat
3  San_Diego      None  monkey
4  San_Diego  Veronica     dog
5       None       Ron     dog

le = MultiColumnLabelEncoder()
le.fit(df)

transformed = le.transform(df)
print(transformed)

   location  owner  pets
0         2      1     0
1         0      3     1
2         0      0     0
3         2      2     2
4         2      4     1
5         1      3     1

inverted = le.inverse_transform(transformed)
print(inverted)

    location     owner    pets
0  San_Diego     Champ     cat
1   New_York       Ron     dog
2   New_York     Brick     cat
3  San_Diego      None  monkey
4  San_Diego  Veronica     dog
5       None       Ron     dog

Answer 8

Answered by muon

This is how I did it:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN_TOKEN = '<unknown>'
a = pd.Series(['A','B','C', 'D','A'], dtype=str).unique().tolist()
a.append(UNKNOWN_TOKEN)
le = LabelEncoder()
le.fit_transform(a)
embedding_map = dict(zip(le.classes_, le.transform(le.classes_)))

And when applying it to new test data:

test_df = test_df.apply(lambda x: x if x in embedding_map else UNKNOWN_TOKEN)
le.transform(test_df)
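
For a fuller picture of the same idea, here is a self-contained sketch that fits on training labels plus an unknown token and then encodes test data containing both a missing value and an unseen label. The `train` and `test` series are made-up sample data, and mapping NaN to the unknown token is an assumption of this sketch:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN_TOKEN = '<unknown>'

train = pd.Series(['A', 'B', 'C', 'D', 'A'])

# Fit on the training labels plus one extra token reserved for
# anything that was missing or not seen during training
le = LabelEncoder()
le.fit(list(train) + [UNKNOWN_TOKEN])
known = set(le.classes_)

# Test data with a missing value and a label not seen in training
test = pd.Series(['A', np.nan, 'Z', 'D'])

# Anything that is not a known label (NaN included) becomes the token
cleaned = [v if v in known else UNKNOWN_TOKEN for v in test]
codes = le.transform(cleaned)
print(codes)
```

With this setup missing and unseen values share one code, which can be looked up with le.transform([UNKNOWN_TOKEN]).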

Answer 9

Answered by chankane

An easy way is the following, using the Titanic dataset as an example:

from sklearn.preprocessing import LabelEncoder

LABEL_COL = ["Sex", "Embarked"]

def label(df):
    _df = df.copy()
    le = LabelEncoder()
    for col in LABEL_COL:
        # Not NaN index
        idx = ~_df[col].isna()
        _df.loc[idx, col] \
            = le.fit(_df.loc[idx, col]).transform(_df.loc[idx, col])
    return _df
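
To try the idea without the real dataset, here is a small sketch on a made-up frame with the same two column names (the rows are invented, not actual Titanic data). It differs slightly from the function above: the helper `encode_keep_nan` is a hypothetical name, and it writes the codes into a fresh float Series so that the NaN entries stay NaN and the result has a numeric dtype:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_keep_nan(s):
    """Label-encode the non-missing entries of s and leave NaN in place."""
    out = pd.Series(np.nan, index=s.index, dtype=float)
    idx = s.notna()
    # Fit and transform only the rows that actually have a value
    out.loc[idx] = LabelEncoder().fit_transform(s[idx])
    return out

# Made-up sample rows, for illustration only
df = pd.DataFrame({
    "Sex": ["male", "female", np.nan, "male"],
    "Embarked": ["S", np.nan, "C", "S"],
})

encoded = df.apply(encode_keep_nan)
print(encoded)
```

Because the missing entries never reach the encoder, no code is spent on them and they can be imputed afterwards.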

Answer 10

Answered by rorance_

I also wanted to contribute my workaround, as I found the others a bit more tedious when working with categorical data that contains missing values:

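import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a random dataframe
foo = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

# Randomly intersperse column 'A' with missing data (NaN);
# cast to float first, since an integer column cannot hold NaN
foo['A'] = foo['A'].astype(float)
foo.loc[np.random.randint(0,len(foo), size=20), 'A'] = np.nan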

# Convert this series to string, to simulate our problem
series = foo['A'].astype(str)

# np.nan are converted to the string "nan", mask these out
mask = (series == "nan")

# Apply the LabelEncoder to the unmasked series, replace the masked series with np.nan
series[~mask] = LabelEncoder().fit_transform(series[~mask])
series[mask] = np.nan

foo['A'] = series
