Python RandomForestClassifier.fit(): ValueError: could not convert string to float
原文地址: http://stackoverflow.com/questions/30384995/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
RandomForestClassifier.fit(): ValueError: could not convert string to float
Asked by nilkn
Given is a simple CSV file:
A,B,C
Hello,Hi,0
Hola,Bueno,1
Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ['A','B','C']
col_types = {'A': str, 'B': str, 'C': int}
test = pd.read_csv('test.csv', dtype=col_types)
train_y = test['C'] == 1
train_x = test[cols]
clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)
But I just get this traceback when invoking fit():
ValueError: could not convert string to float: 'Bueno'
scikit-learn version is 0.16.1.
Accepted answer by RPresle
You have to encode the data before calling fit(). As noted, fit() does not accept strings, but you can work around this.
There are several classes that can be used:
- LabelEncoder: maps each distinct string to an integer code
- OneHotEncoder: uses a one-of-K scheme to turn each category into its own binary column
Personally, I posted almost the same question on StackOverflow some time ago. I wanted a scalable solution but didn't get any answer. I chose OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have many distinct strings the matrix grows very quickly and needs a lot of memory.
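As a rough illustration of the one-of-K approach described above, here is a minimal sketch. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly (the 0.16.1 release from the question does not), and it rebuilds the question's two string columns in memory instead of reading test.csv.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The two string feature columns from the question's CSV, rebuilt in memory.
features = pd.DataFrame({"A": ["Hello", "Hola"], "B": ["Hi", "Bueno"]})

# Assumption: scikit-learn >= 0.20, where string categories are supported.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; densify it for inspection.
one_hot = encoder.fit_transform(features).toarray()

# One binary column per distinct value: Hello, Hola, Bueno, Hi.
print(one_hot.shape)
```

Each distinct string adds a column, which is why memory use climbs quickly when a feature has many distinct values.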
Answer by farhawa
You can't pass str to your model's fit() method, as mentioned here:
The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.
Try converting your data to float, and give LabelEncoder a try.
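The float32 conversion quoted above is where the error comes from. The following small sketch reproduces the same failure with NumPy alone; it only imitates the conversion step and is not scikit-learn's actual internal code.

```python
import numpy as np

# Try to convert string features to float32, as the quoted documentation
# says happens internally to the training input.
try:
    np.asarray([["Hello", "Bueno"]], dtype=np.float32)
    conversion_error = None
except ValueError as exc:
    conversion_error = str(exc)

print(conversion_error)
```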
Answer by SinOfWrath
LabelEncoder worked for me (basically, you have to encode your data feature by feature). Here myData is a 2D array of string dtype:
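import numpy as np
from sklearn import preprocessing

# filecsv is the path to the CSV file; every column is read as a byte string
myData = np.genfromtxt(filecsv, delimiter=",", dtype="|S20", skip_header=1)
# Use a separate integer array: codes written back into myData would be stored as strings
encoded = np.empty(myData.shape, dtype=int)
for i in range(myData.shape[1]):  # one encoder fit per feature column
    encoded[:, i] = preprocessing.LabelEncoder().fit_transform(myData[:, i])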
Answer by pittsburgh137
I had a similar issue and found that pandas.get_dummies() solved the problem. Specifically, it splits columns of categorical data into sets of boolean columns, one new column for each unique value in each input column. In your case, you would replace train_x = test[cols] with:
train_x = pandas.get_dummies(test[cols])
This transforms the train_x DataFrame into the following form, which RandomForestClassifier can accept:
   C  A_Hello  A_Hola  B_Bueno  B_Hi
0  0        1       0        0     1
1  1        0       1        1     0
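For reference, the whole flow can be sketched end to end as follows. A few things here are assumptions that are not in the original posts: the CSV is read from an in-memory string instead of test.csv, the target column C is left out of the features, and random_state is fixed only to make the run repeatable.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# The question's CSV, held in memory so the sketch is self-contained.
csv_text = "A,B,C\nHello,Hi,0\nHola,Bueno,1\n"
test = pd.read_csv(io.StringIO(csv_text), dtype={"A": str, "B": str, "C": int})

train_y = test["C"] == 1

# Assumption: only the string columns are used as features; C is the target.
train_x = pd.get_dummies(test[["A", "B"]])
encoded_columns = list(train_x.columns)

clf_rf = RandomForestClassifier(n_estimators=50, random_state=0)
clf_rf.fit(train_x, train_y)  # no ValueError: every column is now numeric/boolean

predictions = clf_rf.predict(train_x)
```

New data has to be encoded into the same set of dummy columns before calling predict(), otherwise the feature layout will not match what the model was trained on.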
Answer by jo nova
You cannot pass str values to fit() for this kind of classifier.
For example, suppose you have a feature column named 'grade' which has 3 different grades: A, B and C.
You have to convert the str values "A", "B", "C" into a matrix with an encoder, like the following:
A = [1,0,0]
B = [0,1,0]
C = [0,0,1]
because a str does not have numerical meaning for the classifier.
In scikit-learn, OneHotEncoder and LabelEncoder are available in the preprocessing module. However, in older scikit-learn releases such as the 0.16.1 used in the question, OneHotEncoder does not support fit_transform() on strings: "ValueError: could not convert string to float" may happen during the transform.
You may use LabelEncoder to convert str values to consecutive integer codes. You can then pass those codes to OneHotEncoder if you wish.
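The two-step route just described might look like the sketch below for the 'grade' example. The sample values are made up for illustration, and the code assumes a scikit-learn release where OneHotEncoder infers the categories from integer input by default.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical 'grade' column with three distinct grades.
grades = np.array(["A", "B", "C", "A"])

# Step 1: strings -> integer codes (A=0, B=1, C=2, assigned in sorted order).
codes = LabelEncoder().fit_transform(grades)

# Step 2: integer codes -> one-hot rows; OneHotEncoder expects a 2D input.
one_hot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

print(codes)
print(one_hot)
```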
In a pandas DataFrame, I have to encode all the columns whose dtype is object. The following code works for me and I hope it helps you.
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for column_name in train_data.columns:
    # Only string-like (object dtype) columns need encoding
    if train_data[column_name].dtype == object:
        train_data[column_name] = le.fit_transform(train_data[column_name])
Answer by raghu
Because your input is a string, you get the ValueError. Use CountVectorizer: it converts the data set into a sparse matrix, which you can then use to train your ML algorithm.
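A minimal sketch of that idea follows, under the assumption that the string columns of each row are joined into one piece of text. The joined sample texts, the labels, and the fixed random_state are illustrative choices, not part of the original answer.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: columns A and B of each row are joined into a single string.
texts = ["Hello Hi", "Hola Bueno"]
labels = [0, 1]

# CountVectorizer lowercases and tokenizes the text, then counts each token.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)  # sparse token-count matrix
vocabulary = sorted(vectorizer.vocabulary_)

# RandomForestClassifier accepts the sparse matrix directly.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features, labels)
```

This treats every token as a separate count feature, so it suits free text better than a small, fixed set of categories.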
Answer by Aleksandar Gakovic
Indeed, a one-hot encoder will work just fine here. Convert any string and numerical categorical variables you want into 1's and 0's this way, and random forest should not complain.