Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/40729162/

Time: 2020-08-19 23:55:13  Source: igfitidea

Merging results from model.predict() with original pandas DataFrame?

Tags: python, pandas, scikit-learn

Asked by blacksite

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in later releases
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

X = np.asarray(df.loc[:, [0, 1, 2, 3]])  # plain ndarray; recent scikit-learn versions reject np.matrix
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back with the original df, I try this:

df['y_hats'] = y_hats

But that raises:

ValueError: Length of values does not match length of index

I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.

Answered by flyingmeatball

Your y_hats length will only be the length of the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the model's predictions on X_test with the true values in y_test), you should rerun the predict on the full dataset (X). Add these two lines to the bottom:

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2
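
The length mismatch described above can be checked directly. The following is a minimal sketch, assuming a current scikit-learn where train_test_split lives in sklearn.model_selection; the random_state values and the names test_preds and full_preds are additions for illustration and are not part of the original answer. Keep in mind that predictions for the training rows come from data the model was fitted on, so they say little about how well the model generalizes.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

X = df[[0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# random_state is an added assumption so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

test_preds = model.predict(X_test)  # one prediction per test row only
full_preds = model.predict(X)       # one prediction per row of df

n_rows, n_test, n_full = len(df), len(test_preds), len(full_preds)
print(n_rows, n_test, n_full)

# Lengths match here, so this assignment does not raise the ValueError
df['y_hats'] = full_preds
```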

EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended on the rows that were in the test dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in later releases
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
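
As a possible alternative to the merge, the predictions can be written straight onto the test rows by label, because train_test_split keeps the original index when it is given pandas objects. This is only a sketch under assumptions: it uses the sklearn.model_selection import, a pd.Series named target for the outcome, and fixed random_state values, none of which appear in the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
target = pd.Series(data.target, name='class')  # assumed name for the outcome

# Splitting pandas objects preserves their original row labels
X_train, X_test, y_train, y_test = train_test_split(
    df, target, train_size=0.8, random_state=0
)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

df_out = df.copy()
df_out['y_hats'] = np.nan  # training rows stay NaN
df_out.loc[X_test.index, 'y_hats'] = model.predict(X_test)
```

Rows that were used for training keep np.nan in y_hats, which is the layout the question asks for.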

Answered by Adam Milecki

You can create a y_hat dataframe copying indices from X_test then merge with the original data.

y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)

Note that the left join keeps every row of df, including the training rows, which get NaN in y_hats. Omitting the how parameter falls back to the default inner join, which returns only the test rows.
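
The difference between the two joins can be seen on a toy frame. This is a small illustration with made-up values (the feature column and the choice of rows 1 and 3 as "test" rows are hypothetical), not the iris data:

```python
import pandas as pd

df = pd.DataFrame({'feature': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Pretend rows 1 and 3 were the test rows (hypothetical values)
y_hats_df = pd.DataFrame({'y_hats': [0, 1]}, index=[1, 3])

# Left join: all five rows survive, training rows get NaN
left = pd.merge(df, y_hats_df, how='left', left_index=True, right_index=True)

# No how argument: pandas defaults to an inner join, so only rows 1 and 3 remain
inner = pd.merge(df, y_hats_df, left_index=True, right_index=True)

print(left)
print(inner)
```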

Answered by PATRICK KANYI

Try this:

y_hats2 = model.predict(X)
df[['y_hats']] = y_hats2

Answered by Nidhi Garg

You can probably make a new dataframe and add to it the test data along with the predicted values:

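data = pd.DataFrame(X_test)  # new dataframe holding only the test rows
data['y_hats'] = y_hats      # one prediction per test row, so the lengths match
data.to_csv('data1.csv')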

Answered by asmgx

I had (almost) the same problem and fixed it this way:

...
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

y_hats = model.predict(X_test)

y_hats  = pd.DataFrame(y_hats)

df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]


y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
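
When X and y are plain NumPy arrays, as in the question, there is no index to reset or merge on. One option is to pass an extra array of row positions through train_test_split, which accepts any number of arrays and splits them consistently. The sketch below assumes the sklearn.model_selection import; row_ids, ids_train and ids_test are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X, y = data.data, data.target
row_ids = np.arange(len(X))  # hypothetical helper: original row positions

# All three arrays are shuffled and split with the same permutation
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, row_ids, train_size=0.8, random_state=0
)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Actual and predicted labels, indexed by the original row positions
df_out = pd.DataFrame(
    {'Actual': y_test, 'Prediction': model.predict(X_test)},
    index=ids_test,
)
```

Because df_out is indexed by the original positions, it can be joined back onto the full dataframe with a left join on the index.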

Answered by ambar003

You can also use:

y_hats = model.predict(X)

df['y_hats'] = y_hats.reset_index()['name of the target column']