Python RandomForestClassifier.fit(): ValueError: could not convert string to float

Question

Asked by nilkn

Given is a simple CSV file:

A,B,C
Hello,Hi,0
Hola,Bueno,1

Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ['A','B','C']
col_types = {'A': str, 'B': str, 'C': int}
test = pd.read_csv('test.csv', dtype=col_types)

train_y = test['C'] == 1
train_x = test[cols]

clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)

But I just get this traceback when invoking fit():

ValueError: could not convert string to float: 'Bueno'

scikit-learn version is 0.16.1.

Answer 1

Accepted answer by RPresle

You have to encode your data before calling fit(). As already noted, fit() does not accept strings, but you can work around this.

There are several classes that can be used:

LabelEncoder: maps each distinct string to an integer code (0, 1, 2, ...)
OneHotEncoder: uses one-of-K encoding to turn each distinct string into its own 0/1 column

Personally, I posted almost the same question on StackOverflow some time ago. I wanted a scalable solution but didn't get any answer. I chose OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have many distinct strings the matrix grows very quickly and needs a lot of memory.
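
As a minimal sketch of the OneHotEncoder route on the data from the question: this assumes scikit-learn 0.20 or newer, where OneHotEncoder accepts string columns directly (the 0.16.1 release in the question does not). The DataFrame is built inline instead of being read from test.csv, and the target column C is left out of the features; both are choices made for this illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Same rows as the CSV in the question, built inline for the sketch
test = pd.DataFrame({"A": ["Hello", "Hola"], "B": ["Hi", "Bueno"], "C": [0, 1]})
train_y = test["C"] == 1

# One 0/1 column per distinct string; the result is a sparse matrix
encoder = OneHotEncoder(handle_unknown="ignore")
train_x = encoder.fit_transform(test[["A", "B"]])

clf_rf = RandomForestClassifier(n_estimators=50, random_state=0)
clf_rf.fit(train_x, train_y)
```

Because the result is sparse, memory use grows with the number of rows rather than with rows times distinct strings, which softens the memory concern mentioned above.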

Answer 2

Answer by farhawa

You can't pass str values to your model's fit() method. As the documentation of its X parameter says:

The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.

Try converting your data to float, and give LabelEncoder a try.
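
To see where the error comes from, here is a small sketch of the float conversion the quoted documentation describes. It uses NumPy directly rather than scikit-learn's internal code, and the array name features is only illustrative.

```python
import numpy as np

# String features like the A and B columns in the question
features = np.array([["Hello", "Hi"], ["Hola", "Bueno"]])

try:
    features.astype(np.float32)  # the kind of conversion fit() attempts
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

The conversion fails on the first string it cannot parse as a number, which is why the data has to be encoded before fit() is called.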

Answer 3

Answer by SinOfWrath

LabelEncoder worked for me (basically you have to encode your data feature by feature; myData is a 2D array of string dtype):

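import numpy as np
from sklearn import preprocessing

# filecsv is the path to the CSV file; every column is read as a string
myData = np.genfromtxt(filecsv, delimiter=",", dtype="U20", skip_header=1)

# Encode one feature (column) at a time, storing the integer codes in a
# numeric array so they are not converted back to strings
encoded = np.empty(myData.shape, dtype=int)
for i in range(myData.shape[1]):
    encoded[:, i] = preprocessing.LabelEncoder().fit_transform(myData[:, i])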

Answer 4

Answer by pittsburgh137

I had a similar issue and found that pandas.get_dummies() solved the problem. Specifically, it splits columns of categorical data into sets of boolean columns, one new column for each unique value in each input column. In your case, you would replace train_x = test[cols] with:

train_x = pd.get_dummies(test[cols])

This transforms the train_x DataFrame into the following form, which RandomForestClassifier can accept:

   C  A_Hello  A_Hola  B_Bueno  B_Hi
0  0        1       0        0     1
1  1        0       1        1     0
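
Put together, the get_dummies approach might look like the sketch below. The DataFrame is built inline instead of being read from test.csv, and the target column C is dropped from the features so that the label is not also used as an input; that last step is an assumption added here, not part of the answer above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same rows as the CSV in the question, built inline for the sketch
test = pd.DataFrame({"A": ["Hello", "Hola"], "B": ["Hi", "Bueno"], "C": [0, 1]})
train_y = test["C"] == 1

# Expand only the string feature columns into indicator columns
train_x = pd.get_dummies(test[["A", "B"]])

clf_rf = RandomForestClassifier(n_estimators=50, random_state=0)
clf_rf.fit(train_x, train_y)
```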

Answer 5

Answer by jo nova

You cannot pass str values when fitting this kind of classifier.

For example, if you have a feature column named 'grade' which has 3 different grades:

A, B and C,

you have to convert the strings "A", "B" and "C" into a matrix with an encoder, like the following:

A = [1,0,0]

B = [0,1,0]

C = [0,0,1]

because a str has no numerical meaning for the classifier.

In scikit-learn, OneHotEncoder and LabelEncoder are available in the preprocessing module. However, in older releases such as the 0.16.1 used in the question, OneHotEncoder does not support fit_transform() on strings (string input is accepted from version 0.20 on), so "ValueError: could not convert string to float" may occur during the transform.

You can use LabelEncoder to convert the str values to consecutive integer codes. You can then pass those codes to OneHotEncoder if you wish.
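
The two-step encoding of the grade example can be sketched as follows. The grades array is made-up illustration data, and the code assumes a scikit-learn release where OneHotEncoder infers the categories from the integer codes it is given.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical 'grade' feature column
grades = np.array(["A", "B", "C", "A"])

# Step 1: strings -> consecutive integer codes (A=0, B=1, C=2)
codes = LabelEncoder().fit_transform(grades)

# Step 2: integer codes -> one 0/1 column per grade
# OneHotEncoder expects a 2D input, hence the reshape
one_hot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

print(codes)
print(one_hot)
```

Each row of one_hot is the vector shown above for the corresponding grade.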

In a pandas DataFrame, I have to encode every column whose dtype is object. The following code works for me, and I hope it helps you.

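from sklearn import preprocessing

# train_data is a pandas DataFrame; encode every string (object) column in place
le = preprocessing.LabelEncoder()
for column_name in train_data.columns:
    if train_data[column_name].dtype == object:
        # fit_transform refits the encoder on each column
        train_data[column_name] = le.fit_transform(train_data[column_name])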

Answer 6

Answer by raghu

Because your input is made of strings, you get the ValueError. Use CountVectorizer: it converts the data set into a sparse matrix of token counts, which you can then use to train your ML algorithm.
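
A rough sketch of the CountVectorizer idea on the question's data follows. Joining columns A and B into one text string per row is an assumption made here, since CountVectorizer expects one document per sample; the answer does not say how the columns should be combined.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Same rows as the CSV in the question, built inline for the sketch
test = pd.DataFrame({"A": ["Hello", "Hola"], "B": ["Hi", "Bueno"], "C": [0, 1]})
train_y = test["C"] == 1

# One text "document" per row, e.g. "Hello Hi"
documents = test["A"] + " " + test["B"]

# Sparse matrix of token counts, one column per distinct lowercased token
vectorizer = CountVectorizer()
train_x = vectorizer.fit_transform(documents)

clf_rf = RandomForestClassifier(n_estimators=50, random_state=0)
clf_rf.fit(train_x, train_y)
```

Note that this treats the values as free text, so a token such as "hi" is counted the same whichever column it came from. For purely categorical columns, one-hot encoding keeps the columns separate.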

Answer 7

Answer by Aleksandar Gakovic

Indeed, a one-hot encoder will work just fine here. Convert any string and numerical categorical variables you want into 1s and 0s this way, and the random forest should not complain.
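
One way to one-hot encode a chosen set of columns, string and numeric alike, is a ColumnTransformer inside a Pipeline. This is only a sketch, assuming scikit-learn 0.20 or newer. The extra rows and the numeric store_id column are hypothetical and are added purely to show a numeric categorical variable.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: A and B as in the question, store_id is hypothetical
data = pd.DataFrame({
    "A": ["Hello", "Hola", "Hello", "Hola"],
    "B": ["Hi", "Bueno", "Hi", "Bueno"],
    "store_id": [10, 20, 10, 30],
    "C": [0, 1, 0, 1],
})
categorical = ["A", "B", "store_id"]

model = Pipeline([
    # One-hot encode every listed column, whatever its dtype
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)]
    )),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(data[categorical], data["C"])

# Inspect the encoded feature matrix produced inside the pipeline
encoded = model.named_steps["encode"].transform(data[categorical])
```

Keeping the encoder inside the pipeline means the same encoding is applied automatically when predict() is called on new rows.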
