Python: how do I merge the results of model.predict() with the original pandas DataFrame?

Question

Asked by blacksite

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

X = df.loc[:, [0, 1, 2, 3]].to_numpy()  # plain ndarray; recent scikit-learn rejects np.matrix
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back with the original df, I try this:

df['y_hats'] = y_hats

But that raises:

ValueError: Length of values does not match length of index
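
The mismatch can be reproduced without a model at all. The sketch below uses stand-in data (a 10-row frame and 2 made-up predictions, neither taken from the question) to show that the assignment fails simply because the array is shorter than the frame's index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature": np.arange(10)})  # stand-in for the full dataset (10 rows)
y_hats = np.array([0, 1])                      # stand-in for predictions on a 2-row test set

message = None
try:
    df["y_hats"] = y_hats                      # 2 values offered for a 10-row index
except ValueError as err:
    message = str(err)

print(message)
```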

I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.

Answer 1

Answered by flyingmeatball

Your y_hats array is only as long as the test data (20% of the rows), because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the predictions on X_test with the true values in y_test), you should rerun predict on the full dataset (X). Add these two lines to the bottom:

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2

EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended to the rows that were in the test dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
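
If X and y have to stay NumPy arrays, as in the question, one possible variant is to pass an array of row labels through train_test_split alongside them. This is only a sketch: row_labels, idx_train and idx_test are hypothetical names introduced here, and random_state is fixed purely to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data.data)
df["class"] = data.target

X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df["class"].to_numpy()
row_labels = df.index.to_numpy()  # hypothetical helper: which df row each sample came from

# train_test_split applies the same shuffle to every array it receives,
# so idx_test lists the df rows that ended up in X_test
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, row_labels, train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

df["y_hats"] = np.nan                               # training rows stay NaN
df.loc[idx_test, "y_hats"] = model.predict(X_test)  # fill only the test rows
```

With this approach the feature matrix can still be built and normalized before the split, and the predictions land on the right rows without any merge.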

Answer 2

Answered by Adam Milecki

You can create a y_hats dataframe that copies its index from X_test, then merge it with the original data. This requires X_test to be a DataFrame (as in the edited code of Answer 1), since a NumPy array has no index.

y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)

Note that the left join keeps the training rows too (their y_hats value is NaN). Omitting the how parameter gives pd.merge's default inner join, which returns only the test rows.
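
The difference between the two joins can be seen on a toy example. The frames below are made up for illustration (4 rows, with predictions for rows 3 and 1 only) and are not taken from the answer:

```python
import pandas as pd

df = pd.DataFrame({"feature": [10, 20, 30, 40]})            # index 0..3
y_hats_df = pd.DataFrame({"y_hats": [1, 0]}, index=[3, 1])  # predictions for rows 3 and 1 only

# Left join: every row of df survives; rows without a prediction get NaN
left = pd.merge(df, y_hats_df, how="left", left_index=True, right_index=True)

# No how argument: pd.merge defaults to an inner join, so only matched rows survive
inner = pd.merge(df, y_hats_df, left_index=True, right_index=True)

print(left)
print(inner)
```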

Answer 3

Answered by PATRICK KANYI

Try this:

y_hats2 = model.predict(X)
df['y_hats'] = y_hats2

Answer 4

Answered by Nidhi Garg

You can make a new dataframe and add the test data to it along with the predicted values:

data = pd.DataFrame(X_test)  # new frame holding only the test rows
data['y_hats'] = y_hats
data.to_csv('data1.csv')

Answer 5

Answered by asmgx

I had (almost) the same problem, and fixed it this way:

...
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

y_hats = model.predict(X_test)

y_hats  = pd.DataFrame(y_hats)

df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]


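# y_hats is now a DataFrame with a fresh 0..n-1 index, so pass its raw values
# to avoid label alignment against y_test's shuffled index
y_test['preds'] = y_hats[0].to_numpy()

# separate name so the positional df_out built above is not overwritten
df_merged = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)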
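
The reset_index calls matter because pandas aligns Series and DataFrames by label, not by position, when assigning a column. A small sketch with made-up values (a 3-row X_test with shuffled labels, not the answer's real data) shows the difference:

```python
import numpy as np
import pandas as pd

X_test = pd.DataFrame({"feature": [5.1, 6.2, 4.9]}, index=[7, 2, 9])  # shuffled labels, as after a split
y_hats = np.array([0, 1, 0])                                          # predict() output, purely positional

by_label = X_test.copy()
by_label["Prediction"] = pd.Series(y_hats)  # Series index is 0, 1, 2, so values are matched by label

by_position = X_test.copy()
by_position["Prediction"] = y_hats          # plain array, so values are assigned row by row

print(by_label)
print(by_position)
```

Only label 2 exists in both indexes, so by_label ends up mostly NaN, while by_position keeps each prediction next to the row it was made for.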

Answer 6

Answered by ambar003

You can also use:

y_hats = model.predict(X)

df['y_hats'] = y_hats  # predict() returns a NumPy array, which has no reset_index()
