Pandas for Python: Exception: Data must be 1-dimensional

Notice: This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/45828228/

Time: 2020-09-14 04:17:58  Source: igfitidea

python pandas scikit-learn one-hot-encoding

Asked by Tyler L

Here's what I got from a tutorial:

# Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

This is the X matrix with encoded dummy variables:

1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    4.400000000000000000e+01    7.200000000000000000e+04
0.000000000000000000e+00    0.000000000000000000e+00    1.000000000000000000e+00    2.700000000000000000e+01    4.800000000000000000e+04
0.000000000000000000e+00    1.000000000000000000e+00    0.000000000000000000e+00    3.000000000000000000e+01    5.400000000000000000e+04
0.000000000000000000e+00    0.000000000000000000e+00    1.000000000000000000e+00    3.800000000000000000e+01    6.100000000000000000e+04
0.000000000000000000e+00    1.000000000000000000e+00    0.000000000000000000e+00    4.000000000000000000e+01    6.377777777777778101e+04
1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    3.500000000000000000e+01    5.800000000000000000e+04
0.000000000000000000e+00    0.000000000000000000e+00    1.000000000000000000e+00    3.877777777777777857e+01    5.200000000000000000e+04
1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    4.800000000000000000e+01    7.900000000000000000e+04
0.000000000000000000e+00    1.000000000000000000e+00    0.000000000000000000e+00    5.000000000000000000e+01    8.300000000000000000e+04
1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    3.700000000000000000e+01    6.700000000000000000e+04

The problem is that there are no column labels. I tried:

something = pd.get_dummies(X)

But I get the following exception:

Exception: Data must be 1-dimensional
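
For context, pd.get_dummies builds its result from a Series or DataFrame, so passing a bare 2-D NumPy array such as X is the likely trigger for this exception. A minimal sketch, assuming a hypothetical three-row stand-in for the predictors in Data.csv and an assumed name "Country" for the first column (the original column name is not shown): wrapping the raw values in a DataFrame first lets get_dummies encode the categorical column and keep labels.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first three rows of the predictors in Data.csv
raw = np.array(
    [["France", 44.0, 72000.0],
     ["Spain", 27.0, 48000.0],
     ["Germany", 30.0, 54000.0]],
    dtype=object,
)

# Wrap the 2-D array in a DataFrame so every column has a label.
# "Country" is an assumed name for the first column.
df = pd.DataFrame(raw, columns=["Country", "Age", "Salary"])

# Encode only the categorical column; the other columns pass through unchanged
dummies = pd.get_dummies(df, columns=["Country"], dtype=float)
print(dummies.columns.tolist())
```

Note that this applies to the raw data, before LabelEncoder and OneHotEncoder have run; it does not label a matrix that is already encoded.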

Accepted answer by andrew_reece

Most sklearn methods don't care about column names, as they're mainly concerned with the math behind the ML algorithms they implement. You can add column names back onto the OneHotEncoder output after fit_transform(), if you can figure out the label encoding ahead of time.

First, grab the column names of your predictors from the original dataset, excluding the first one (which we reserve for LabelEncoder):

X_cols = dataset.columns[1:-1]
X_cols
# Index(['Age', 'Salary'], dtype='object')

Now get the order of the encoded labels. In this particular case, it looks like LabelEncoder() organizes its integer mapping alphabetically:

labels = labelencoder_X.fit(X[:, 0]).classes_ 
labels
# ['France' 'Germany' 'Spain']
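
The alphabetical ordering can be reproduced without scikit-learn. A minimal sketch, assuming a hypothetical list of country values: np.unique returns the sorted distinct values together with each element's index into them, which mirrors the pairing of classes_ with integer codes seen above. It illustrates the ordering only and is not LabelEncoder's actual implementation.

```python
import numpy as np

# Hypothetical sample of the categorical column
countries = np.array(["France", "Spain", "Germany", "Spain", "Germany", "France"])

# classes: sorted distinct values; codes: position of each value in classes
classes, codes = np.unique(countries, return_inverse=True)

print(classes)
print(codes)
```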

Combine these column names, and then add them to X when you convert it to a DataFrame:

# X gets re-used, so make sure to define encoded_cols after this line
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
encoded_cols = np.append(labels, X_cols)
# ...
X = onehotencoder.fit_transform(X).toarray()
encoded_df = pd.DataFrame(X, columns=encoded_cols)

encoded_df
   France  Germany  Spain        Age        Salary
0     1.0      0.0    0.0  44.000000  72000.000000
1     0.0      0.0    1.0  27.000000  48000.000000
2     0.0      1.0    0.0  30.000000  54000.000000
3     0.0      0.0    1.0  38.000000  61000.000000
4     0.0      1.0    0.0  40.000000  63777.777778
5     1.0      0.0    0.0  35.000000  58000.000000
6     0.0      0.0    1.0  38.777778  52000.000000
7     1.0      0.0    0.0  48.000000  79000.000000
8     0.0      1.0    0.0  50.000000  83000.000000
9     1.0      0.0    0.0  37.000000  67000.000000

NB: For example data I'm using this dataset, which seems either very similar or identical to the one used by OP. Note how the output is identical to OP's X matrix.