pandas: Unable to convert a string column to a categorical matrix using Keras and Sklearn

Question

Asked by Hugo Sanchez

I am trying to build a simple Keras model, with Python 3.6 on macOS, to predict which price range a house falls into, but I fail to transform the output into a category matrix. I am using this dataset from Kaggle.

I've created a new column in the dataframe with the price ranges as strings to serve as the target output of my model, and then tried to use keras.utils and Sklearn's LabelEncoder to create the binary output matrix, but I keep getting this error:

ValueError: invalid literal for int() with base 10: '0 - 50000'

Here is my code:

import pandas as pd
import numpy as np
from keras.layers import Dense
from keras.models import Sequential, load_model
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical, np_utils
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

seed = 7
np.random.seed(seed)

data = pd.read_csv("Melbourne_housing_FULL.csv")
data.fillna(0, inplace=True)

price_range = 50000
bins = np.arange(0, 12000000, price_range)
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])] 

#correct first value 
labels[0] = '0 - 50000'

for item in labels:
    str(item)

print (labels[:10])
['0 - 50000', '50001 - 100000', '100001 - 150000', '150001 - 200000', 
 '200001 - 250000', '250001 - 300000', '300001 - 350000', '350001 - 400000', 
 '400001 - 450000', '450001 - 500000']

data['PriceRange'] = pd.cut(data.Price, 
                            bins=bins, 
                            labels=labels, 
                            right=True, 
                            include_lowest=True)

#print(data.PriceRange.value_counts())
output_len = len(labels)
print(output_len)

Everything works as expected until I run the next piece:

predictors = data.drop(['Suburb', 'Address', 'SellerG', 'CouncilArea', 
                        'Propertycount', 'Date', 'Type', 'Price', 'PriceRange'], axis=1).as_matrix()

target = data['PriceRange']


# encode class values as integers
encoder = LabelEncoder()
encoder.fit(target)
encoded_Y = encoder.transform(target)

target = np_utils.to_categorical(data.PriceRange)

n_cols = predictors.shape[1]

And I get ValueError: invalid literal for int() with base 10: '0 - 50000'

Can someone help me here? I don't really understand what I am doing wrong.

Many thanks

Answer 1

Answered by Bharath

That's because np_utils.to_categorical expects y to be of integer dtype, but you are passing strings. Either convert them to integers by mapping each category to a key, i.e.:

cats = data.PriceRange.values.categories
di = dict(zip(cats,np.arange(len(cats))))
#{'0 - 50000': 0,
# '10000001 - 10050000': 200,
# '1000001 - 1050000': 20,
# '100001 - 150000': 2,
# '10050001 - 10100000': 201,
# '10100001 - 10150000': 202,

target = np_utils.to_categorical(data.PriceRange.map(di))

Or, since you are using pandas, you can use pd.get_dummies to get a one-hot encoding:

onehot = pd.get_dummies(data.PriceRange)
target_labels = onehot.columns
# as_matrix() was removed in pandas 1.0; .values works in old and new versions
target = onehot.values

array([[ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
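
For a categorical column such as the one pd.cut returns, the integer codes should also be available directly through .cat.codes, which avoids building the dictionary by hand. Here is a minimal sketch on toy prices (hypothetical values, not the Kaggle data). Note that np.eye is used as a NumPy stand-in for np_utils.to_categorical so the snippet runs without Keras; it is not the Keras function itself.

```python
import numpy as np
import pandas as pd

# Toy prices standing in for data.Price (hypothetical values).
prices = pd.Series([30000, 120000, 75000, 120000])

# Same labelling scheme as in the question, on a smaller range.
bins = np.arange(0, 200001, 50000)  # 0, 50000, ..., 200000
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
labels[0] = '0 - 50000'

price_range = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)

# Integer position of each row's category within `labels`.
codes = price_range.cat.codes.to_numpy()

# One-hot encode: one column per label, including labels that never occur.
one_hot = np.eye(len(labels))[codes]
```

Rows whose price falls outside the bins become NaN in pd.cut and get the code -1, so it is worth checking for negative codes before one-hot encoding.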

Answer 2

Answered by Marco Cerliani

With only one line of code...

np_utils.to_categorical(data.PriceRange.factorize()[0])
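
To illustrate what factorize returns, here is a small sketch on toy string labels (hypothetical values, assuming a plain string column). As before, np.eye is only a NumPy stand-in for np_utils.to_categorical so that the example does not need Keras.

```python
import numpy as np
import pandas as pd

# Toy string labels standing in for data.PriceRange (hypothetical values).
price_range = pd.Series(['50001 - 100000', '0 - 50000', '50001 - 100000'])

# codes numbers the labels in order of first appearance;
# uniques[k] is the label that received code k.
codes, uniques = price_range.factorize()

# One-hot encode: one column per label that actually occurs in the data.
one_hot = np.eye(len(uniques))[codes]

# Recover the label of each row from its one-hot vector.
decoded = uniques[one_hot.argmax(axis=1)]
```

Two things to keep in mind with this approach: the number of columns equals the number of distinct labels present in the data, which can be smaller than len(labels), and missing values receive the code -1, so they should be handled before encoding. Keeping uniques around lets you map predictions back to the price-range strings.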
