Original question: http://stackoverflow.com/questions/47573293/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Unable to transform string column to categorical matrix using Keras and Sklearn
Asked by Hugo Sanchez
I am trying to build a simple Keras model, with Python 3.6 on MacOS, to predict which price range a house falls into, but I fail to transform the output into a category matrix. I am using this dataset from Kaggle.
I've created a new column in the dataframe that holds the price ranges as strings, to serve as the target output of my model. I then use keras.utils and Sklearn's LabelEncoder to try to create the binary output matrix, but I keep getting this error:
ValueError: invalid literal for int() with base 10: '0 - 50000'
Here is my code:
import pandas as pd
import numpy as np
from keras.layers import Dense
from keras.models import Sequential, load_model
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical, np_utils
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
seed = 7
np.random.seed(seed)
data = pd.read_csv("Melbourne_housing_FULL.csv")
data.fillna(0, inplace=True)
price_range = 50000
bins = np.arange(0, 12000000, price_range)
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
#correct first value
labels[0] = '0 - 50000'
for item in labels:
    str(item)
print(labels[:10])
# Output:
# ['0 - 50000', '50001 - 100000', '100001 - 150000', '150001 - 200000',
#  '200001 - 250000', '250001 - 300000', '300001 - 350000', '350001 - 400000',
#  '400001 - 450000', '450001 - 500000']
data['PriceRange'] = pd.cut(data.Price,
                            bins=bins,
                            labels=labels,
                            right=True,
                            include_lowest=True)
#print(data.PriceRange.value_counts())
output_len = len(labels)
print(output_len)
Everything works up to this point, until I run the next piece:
predictors = data.drop(['Suburb', 'Address', 'SellerG', 'CouncilArea',
                        'Propertycount', 'Date', 'Type', 'Price', 'PriceRange'],
                       axis=1).as_matrix()
target = data['PriceRange']
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(target)
encoded_Y = encoder.transform(target)
target = np_utils.to_categorical(data.PriceRange)
n_cols = predictors.shape[1]
And I get the error: ValueError: invalid literal for int() with base 10: '0 - 50000'
Can someone help me here? I don't really understand what I am doing wrong.
Many thanks
Answered by Bharath
That's because np_utils.to_categorical expects y to have an integer dtype, but you are passing strings. Either convert them to integers by mapping each category to an integer key, i.e.:
cats = data.PriceRange.values.categories
di = dict(zip(cats,np.arange(len(cats))))
#{'0 - 50000': 0,
# '10000001 - 10050000': 200,
# '1000001 - 1050000': 20,
# '100001 - 150000': 2,
# '10050001 - 10100000': 201,
# '10100001 - 10150000': 202,
target = np_utils.to_categorical(data.PriceRange.map(di))
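As a side note, a categorical column already stores one integer code per row, so the same mapping can also be read from .cat.codes. The following is a minimal sketch, not the original answer's code: it assumes a small made-up categorical Series in place of data.PriceRange, and it uses np.eye as a stand-in for np_utils.to_categorical so that it runs without Keras. It also reproduces the integer cast failure with plain NumPy, on the assumption that to_categorical performs a similar cast internally.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for data.PriceRange (assumption: three price bins only).
categories = ['0 - 50000', '50001 - 100000', '100001 - 150000']
price_range = pd.Series(pd.Categorical(
    ['0 - 50000', '100001 - 150000', '0 - 50000'],
    categories=categories, ordered=True))

# Casting the string labels to int fails with the same kind of message
# as in the question.
try:
    np.array(['0 - 50000'], dtype=int)
except ValueError as err:
    cast_error = str(err)

# Each row's integer code is its position in the category list.
# Missing values would get code -1 and need handling before one-hot encoding.
codes = price_range.cat.codes.to_numpy()
num_classes = len(price_range.cat.categories)

# One-hot encode by picking rows of the identity matrix
# (stand-in for np_utils.to_categorical(codes, num_classes)).
target = np.eye(num_classes)[codes]
print(target)
```

The asker's encoded_Y from LabelEncoder is likewise an integer array, so passing it to to_categorical instead of data.PriceRange should also avoid the error.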
Or, since you are using pandas, you can use pd.get_dummies to get a one-hot encoding:
onehot = pd.get_dummies(data.PriceRange)
target_labels = onehot.columns
target = onehot.values  # as_matrix() was removed in pandas 1.0
array([[ 1., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 1., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]])
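Keeping target_labels around is useful later, because it lets you turn a predicted class index back into a price range string. Here is a small sketch of that round trip, assuming made-up prices and only three bins rather than the Melbourne data. The dtype=float argument is there because recent pandas versions return boolean columns by default.

```python
import numpy as np
import pandas as pd

# Made-up prices and a shortened bin list (assumption, for illustration only).
prices = pd.Series([30000, 120000, 75000, 45000])
bins = np.arange(0, 200000, 50000)  # edges: 0, 50000, 100000, 150000
labels = ['0 - 50000', '50001 - 100000', '100001 - 150000']
price_range = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)

# One column per price range, in the same order as `labels`.
onehot = pd.get_dummies(price_range, dtype=float)
target_labels = onehot.columns
target = onehot.to_numpy()

# argmax gives the class index per row; in practice these rows would be
# model predictions. Index into target_labels to recover the strings.
decoded = list(target_labels[target.argmax(axis=1)])
print(decoded)
```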
Answered by Marco Cerliani
With only one line of code:
np_utils.to_categorical(data.PriceRange.factorize()[0])
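One thing to keep in mind with this approach: factorize returns a pair (codes, uniques), and the codes are numbered by order of first appearance rather than by price. Holding on to uniques makes it possible to decode predictions later. A minimal sketch, assuming plain string labels for illustration and np.eye in place of np_utils.to_categorical:

```python
import numpy as np
import pandas as pd

# Made-up labels (assumption); note the highest range appears first.
price_range = pd.Series(['100001 - 150000', '0 - 50000',
                         '100001 - 150000', '50001 - 100000'])

# codes follow order of first appearance; uniques maps code -> label.
# Missing values would be coded as -1.
codes, uniques = price_range.factorize()

# Stand-in for np_utils.to_categorical(codes).
target = np.eye(len(uniques))[codes]

# Decode class indices back to labels via uniques.
decoded = list(uniques[target.argmax(axis=1)])
print(codes, list(uniques))
```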