pandas: add a column to a data set in Python
Original source: http://stackoverflow.com/questions/44562743/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
add column to data set in python
Asked by zipline86
I am trying to add predicted data back to my original dataset in Python. I think I'm supposed to use Pandas and ASSIGN and pd.DataFrame but I have no clue how to write this after reading all the documentation (sorry I'm new to all this and just started learning coding recently). I've written my code below and just need help with the code for adding my predictions back to the dataset. Thanks for the help!
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Feature Scaling X_train and X_test
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Feature scaling all the independent variables used to build the model
whole_dataset = sc.transform(X)
# Fitting classifier to the Training set
# Create your Naive Bayes here
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict_proba(X_test)
# Predicting the results for the whole dataset
y_pred2 = classifier.predict_proba(whole_dataset)
# Add y_pred2 predictions back to the dataset
???
Answered by EdChum
You can just do dataset['prediction'] = y_pred to add a new column.
pandas supports a simple syntax for adding new columns. Here it adds a new column from the NumPy array returned by sklearn, possibly as a view rather than a copy, so it should be nice and fast.
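As a minimal sketch of this column-assignment syntax, here is a small made-up frame (the names `df`, `scores`, and `too_short` are illustrative and not from the question). The assigned array needs one value per row; otherwise pandas raises a `ValueError`.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame standing in for the real dataset.
df = pd.DataFrame({"age": [19, 35, 26, 27]})

# A 1-D array with one value per row can be assigned as a new column.
scores = np.array([0.1, 0.8, 0.3, 0.4])
df["prediction"] = scores

# An array whose length differs from the number of rows is rejected.
too_short = np.array([0.1, 0.8, 0.3])
try:
    df["bad"] = too_short
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
```

This is the same kind of length mismatch you hit when predictions for only part of the rows are assigned to the full frame.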
EDIT
Looking at your code and the data, you're misunderstanding what train_test_split does. It splits your original dataset, which has 400 rows, into 3/4 and 1/4 parts: your X train data contains 300 rows and the test data contains 100 rows. You're then trying to assign back to your original dataset, which has 400 rows. Firstly, the number of rows doesn't match; secondly, what predict_proba returns is a matrix of predicted class probabilities, with one column per class. So what you want to do after training is predict on the whole original dataset (scaled the same way as the training data) and assign the result back as 2 columns by sub-selecting each column:
# whole_dataset = sc.transform(X), i.e. all 400 rows scaled like the training data
y_pred = classifier.predict_proba(whole_dataset)
Now assign this back:
dataset['predict_class_1'],dataset['predict_class_2'] = y_pred[:,0],y_pred[:,1]
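The whole flow can be sketched end to end on synthetic data. This is only a sketch under stated assumptions, not the asker's actual pipeline: the column names `Age`, `EstimatedSalary`, and `Purchased` and the random values are stand-ins for `Social_Network_Ads.csv`, and the output columns are named `proba_class_0` and `proba_class_1` purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real CSV: 400 rows, two features, binary target.
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "Age": rng.integers(18, 60, size=400),
    "EstimatedSalary": rng.integers(15_000, 150_000, size=400),
})
dataset["Purchased"] = (dataset["Age"] > 40).astype(int)

X = dataset[["Age", "EstimatedSalary"]].values
y = dataset["Purchased"].values

# A 25% test split of 400 rows leaves 300 for training and 100 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training rows only, then reuse it everywhere.
sc = StandardScaler()
classifier = GaussianNB()
classifier.fit(sc.fit_transform(X_train), y_train)

# Predict on all 400 rows so the result lines up with the original frame.
proba = classifier.predict_proba(sc.transform(X))  # one column per class

# Columns follow the class order given by classifier.classes_.
dataset["proba_class_0"] = proba[:, 0]
dataset["proba_class_1"] = proba[:, 1]
```

Because the prediction is made on every row of `X`, the probability matrix has as many rows as `dataset`, so the two column assignments line up without any index handling.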
Answered by CDtoday
There are several solutions. EdChum's answer already mentioned one. As far as I know, pandas has two other methods for this.
Since you didn't provide the data in use, here's a pretty simple example.
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randn(10), columns=['raw'])
df = df.assign(cube_raw=df['raw']**3)  # assign() returns a new DataFrame
df.insert(1, 'square_raw', df['raw']**2)  # insert() modifies df in place at position 1
df
        raw  square_raw    cube_raw
0  1.624345    2.638498    4.285832
1 -0.611756    0.374246   -0.228947
2 -0.528172    0.278965   -0.147342
3 -1.072969    1.151262   -1.235268
4  0.865408    0.748930    0.648130
5 -2.301539    5.297080  -12.191435
6  1.744812    3.044368    5.311849
7 -0.761207    0.579436   -0.441071
8  0.319039    0.101786    0.032474
9 -0.249370    0.062186   -0.015507
Just keep in mind that df.assign() doesn't work in place, so you should reassign the result to your previous variable.
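A small sketch of that behaviour, using a made-up three-row frame (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"raw": [1.0, 2.0, 3.0]})

# assign() returns a new DataFrame; without reassignment the result is lost.
df.assign(square_raw=df["raw"] ** 2)
columns_without_reassign = list(df.columns)  # still only 'raw'

# Reassigning keeps the new column.
df = df.assign(square_raw=df["raw"] ** 2)
columns_with_reassign = list(df.columns)  # 'raw' and 'square_raw'
```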
Personally, I prefer df.insert() the most, because its loc parameter lets you choose the position at which the new column is inserted.
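To illustrate the loc parameter, here is a minimal sketch on a made-up frame (values and variable names are illustrative). Note that, unlike assign(), insert() changes the frame in place and returns None, and by default it refuses to insert a column name that already exists.

```python
import pandas as pd

df = pd.DataFrame({"raw": [1.0, 2.0, 3.0], "cube_raw": [1.0, 8.0, 27.0]})

# insert() modifies df in place and returns None; loc=1 places the new
# column between 'raw' and 'cube_raw'.
result = df.insert(loc=1, column="square_raw", value=df["raw"] ** 2)

# Inserting a column name that already exists raises ValueError by default.
try:
    df.insert(0, "raw", 0.0)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc
```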