Original question: http://stackoverflow.com/questions/29975325/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
pandas dataframe conversion for linear regression
Asked by Jin
I read a CSV file into a DataFrame (named data) that has several columns. The first few columns are numeric (each one is a pandas.core.series.Series), and the last column (the label) is a binary response variable stored as a string: 'P' (pass) or 'F' (fail).
import statsmodels.api as sm
label = data.ix[:, -1]
label[label == 'P'] = 1
label[label == 'F'] = 0
fea = data.ix[:, 0: -1]
logit = sm.Logit(label, fea)
result = logit.fit()
print result.summary()
This raises the following error: "ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data)". NumPy, pandas and the other modules are already imported. I tried converting the fea columns to float, but it still fails. Could someone tell me how to fix this?
Thanks
Update:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 68135 to 3002
Data columns (total 8 columns):
TestQty 500 non-null int64
WaferSize 500 non-null int64
ChuckTemp 500 non-null int64
Notch 500 non-null int64
ORIGINALDIEX 500 non-null int64
ORIGINALDIEY 500 non-null int64
DUTNo 500 non-null int64
PassFail 500 non-null object
dtypes: int64(7), object(1)
memory usage: 35.2+ KB
data.sum()
TestQty 530
WaferSize 6000
ChuckTemp 41395
Notch 135000
ORIGINALDIEX 12810
ORIGINALDIEY 7885
DUTNo 271132
PassFail 20
dtype: float64
Answered by Alexander
Shouldn't your features be this:
fea = data.ix[:, 0:-1]
From your data, you can see that PassFail sums to 20 before you convert 'P' to 1 and 'F' to 0. I believe that is the source of your error.
To see what is in there, try:
data.PassFail.unique()
To verify that the converted labels account for all 500 rows (the number of rows in the DataFrame):
sum(label == 0) + sum(label == 1)
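As a rough sketch of both checks, the snippet below uses a small made-up frame; the five values, including the stray ' P' and the missing entry, are assumptions for illustration and not the asker's data. It counts each distinct value and confirms whether the rows labeled 'P' or 'F' account for every row.

```python
import pandas as pd

# Hypothetical data; the real PassFail column has 500 rows.
data = pd.DataFrame({"PassFail": ["P", "F", "P", " P", None]})

# Count every distinct value, including missing ones.
counts = data["PassFail"].value_counts(dropna=False)
print(counts)

# Rows whose label is neither 'P' nor 'F' would not be converted to 0/1.
is_valid = data["PassFail"].isin(["P", "F"])
n_valid = int(is_valid.sum())
n_invalid = int((~is_valid).sum())
print(n_valid, n_invalid, len(data))
```

If n_invalid is greater than zero, the column holds something other than plain 'P' and 'F' strings, and those rows need cleaning before the model is fitted.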
Finally, try passing values to the function rather than Series and DataFrames:
logit = sm.Logit(label.values, fea.values)
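The error message itself points at the dtype, so one alternative to try (not part of the answer above) is to make sure both inputs are numeric before they reach sm.Logit. The sketch below assumes a small hypothetical frame with two of the question's feature columns. It maps the strings to integers in a new Series rather than assigning in place, since a Series that starts out holding strings typically keeps the object dtype after such assignments.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's frame (two feature columns only).
data = pd.DataFrame({
    "TestQty": [1, 2, 1, 1],
    "ChuckTemp": [25, 85, 25, 85],
    "PassFail": ["P", "F", "P", "P"],
})

# Map the strings to integers in a new Series; any value other than
# 'P' or 'F' becomes NaN and is caught by the check below.
label = data["PassFail"].map({"P": 1, "F": 0})
if label.isna().any():
    raise ValueError("unexpected values in PassFail")
label = label.astype("int64")

# Every remaining column is treated as a feature and cast to float.
fea = data.drop(columns="PassFail").astype("float64")

# Both arrays now have numeric dtypes rather than object.
label_dtype = np.asarray(label).dtype
fea_dtype = np.asarray(fea).dtype
print(label_dtype, fea_dtype)
```

With numeric dtypes on both sides, label and fea (or label.values and fea.values) can be passed to sm.Logit as in the question.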

