Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/43000825/

Date: 2020-09-14 03:16:27 · Source: igfitidea

How to create an SVM with multiple features for classification?

Tags: python, pandas, opencv, scikit-learn, svm

Asked by Thom Elliott

I am writing a piece of code to identify different 2D shapes using opencv. I get 4 sets of data from each image of a 2D shape and these are stored in the multidimensional array featureVectors.

I am trying to write an svm/svc that takes into account all 4 features obtained from the image. I have been able to make it work with just 2 features, but when I try all 4 my graph comes out looking like this.

My Graph which is incorrect

My values for featureVectors are:

[[  4.00000000e+00   1.74371349e-03   6.49705560e-01   9.07957236e+01]
 [  4.00000000e+00   4.60937436e-02   1.97642179e-01   9.02041472e+01]
 [  1.00000000e+00   1.18553450e-03   3.03491372e-01   6.03489082e+01]
 [  1.00000000e+00   1.54552898e-02   8.38091425e-01   1.09021207e+02]
 [  3.00000000e+00   1.69961646e-02   4.13691915e+01   1.36838300e+02]]

And my Labels are:

[[2]
 [2]
 [0]
 [0]
 [1]]

Here is my code for the SVM:

# Imports the snippet relies on (they were presumably present in the full script)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm

#Saving featureVectors to a csv file
values1 = featureVectors
header1 = ["Number of Sides", "Standard Deviation of Number of Sides/Perimeter",
           "Standard Deviation of the Angles", "Largest Angle"]
my_df = pd.DataFrame(featureVectors)
my_df.to_csv('featureVectors.csv', index=True, header=header1)

#Saving labels to a csv file
values2 = labels
header2 = ["Label"]
my_df = pd.DataFrame(labels)
my_df.to_csv('labels.csv', index=True, header=header2)

#Writing the SVM
def Build_Data_Set(features = header1, features1 = header2):

    # pd.DataFrame.from_csv was removed from pandas; read_csv with index_col=0 is the equivalent
    data_df = pd.read_csv("featureVectors.csv", index_col=0)
    #data_df = data_df[:250]
    X = np.array(data_df[features].values)

    data_df2 = pd.read_csv("labels.csv", index_col=0)
    y = np.array(data_df2[features1].values)
    #print(X)
    #print(y)

    return X,y

def Analysis():
    X,y = Build_Data_Set()

    clf = svm.SVC(kernel = 'linear', C = 1.0)
    clf.fit(X, y)

    w = clf.coef_[0]
    a = -w[0] / w[1]
    xx = np.linspace(0,5)
    yy = np.linspace(0,185)

    h0 = plt.plot(xx,yy, "k-", label="non weighted")

    plt.scatter(X[:, 0],X[:, 1],c=y)
    plt.ylabel("Maximum Angle (Degrees)")
    plt.xlabel("Number Of Sides")
    plt.title('Shapes')
    plt.legend()


    plt.show()

Analysis()

I have only used 5 data sets (shapes) so far because I knew it wasn't working correctly.

Answered by s1h

The SVM part of your code is actually correct. The plotting part around it is not, and given the code I'll try to give you some pointers.

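To back that up, here is a minimal sketch (the feature matrix and labels are copied from the question; everything else, including flattening the labels to a 1-D array, is my own rewrite) showing that `SVC` happily trains on all four features:

```python
import numpy as np
from sklearn import svm

# Feature matrix copied from the question: 4 features per shape.
X = np.array([
    [4.00000000e+00, 1.74371349e-03, 6.49705560e-01, 9.07957236e+01],
    [4.00000000e+00, 4.60937436e-02, 1.97642179e-01, 9.02041472e+01],
    [1.00000000e+00, 1.18553450e-03, 3.03491372e-01, 6.03489082e+01],
    [1.00000000e+00, 1.54552898e-02, 8.38091425e-01, 1.09021207e+02],
    [3.00000000e+00, 1.69961646e-02, 4.13691915e+01, 1.36838300e+02],
])
# Labels flattened to shape (5,) -- sklearn prefers this over a (5, 1) column.
y = np.array([2, 2, 0, 0, 1])

clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Training succeeds on all 4 features; only plotting them is the problem.
print(clf.predict(X))
```

The fit itself is four-dimensional and perfectly fine; nothing about `SVC` limits you to two features.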
First of all:

another example I found (I can't find the link again) said to do that

Copying code without understanding it will probably cause more problems than it solves. Given your code, I'm assuming you used this example as a starter.

plt.scatter(X[:, 0],X[:, 1],c=y)

In the sk-learn example, this snippet is used to plot data points, coloring them according to their label. That works in the example because the data there is 2-dimensional. The data you're dealing with is 4-dimensional, so you're actually only plotting the first two dimensions.

plt.scatter(X[:, 0], y, c=y)

on the other hand, makes no sense.

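If the goal is a meaningful 2D plot, one option is to pick the two features you actually care about, fit on just those two, and shade the classifier's decision regions behind the scatter. A hedged sketch (choosing columns 0 and 3, i.e. number of sides and largest angle, is my arbitrary choice; the data is from the question):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn import svm

X = np.array([
    [4.0, 1.74371349e-03, 6.49705560e-01, 90.7957236],
    [4.0, 4.60937436e-02, 1.97642179e-01, 90.2041472],
    [1.0, 1.18553450e-03, 3.03491372e-01, 60.3489082],
    [1.0, 1.54552898e-02, 8.38091425e-01, 109.021207],
    [3.0, 1.69961646e-02, 41.3691915, 136.8383],
])
y = np.array([2, 2, 0, 0, 1])

# Keep only the two features we want to visualize, and fit on those alone,
# so the plotted regions really are the classifier's decision regions.
X2 = X[:, [0, 3]]
clf = svm.SVC(kernel="linear", C=1.0).fit(X2, y)

# Classify every point of a grid covering the plot area.
xx, yy = np.meshgrid(np.linspace(0, 5, 200), np.linspace(50, 150, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)    # decision regions
plt.scatter(X2[:, 0], X2[:, 1], c=y)  # the 5 shapes
plt.xlabel("Number Of Sides")
plt.ylabel("Largest Angle (Degrees)")
plt.title("Shapes")
plt.savefig("decision_regions.png")
```

Note this classifier only ever sees two of the four features; it is a visualization aid, not the full 4-feature model.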
xx = np.linspace(0,5)
yy = np.linspace(0,185)

h0 = plt.plot(xx,yy, "k-", label="non weighted")

The line you're plotting actually has nothing to do with the real decision boundary. It's just a plot of y over x in your coordinate system. (In addition to that, you're dealing with multi-class data, so you'll have as many decision boundaries as you have classes.)

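You can see the "several boundaries" point directly: scikit-learn's `SVC` trains one-vs-one classifiers under the hood, so with k classes there are k*(k-1)/2 pairwise decision functions, each with its own boundary. A small sketch (using a toy 2-feature version of the question's data):

```python
import numpy as np
from sklearn import svm

# Toy 2-feature version of the data: number of sides and largest angle.
X = np.array([[4.0, 90.8], [4.0, 90.2], [1.0, 60.3], [1.0, 109.0], [3.0, 136.8]])
y = np.array([2, 2, 0, 0, 1])

clf = svm.SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

# 3 classes -> 3 * (3 - 1) / 2 = 3 pairwise decision functions,
# i.e. one decision boundary per pair of classes.
print(clf.decision_function(X).shape)  # (5, 3)
```

So a single `k-` line drawn with `plt.plot` could never represent what this classifier is doing, even in two dimensions.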
Now your actual problem is data dimensionality. You're trying to plot 4-dimensional data in a 2D plot, which simply won't work. A possible approach is to perform dimensionality reduction to map your 4D data into a lower-dimensional space. If you want to go further, I'd suggest reading e.g. the excellent sklearn documentation for an introduction to SVMs, plus something about dimensionality reduction.

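As a concrete sketch of the dimensionality-reduction route (PCA is one common choice, not the only one; standardizing first matters here because the four features live on very different scales):

```python
import numpy as np
from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([
    [4.0, 1.74371349e-03, 6.49705560e-01, 90.7957236],
    [4.0, 4.60937436e-02, 1.97642179e-01, 90.2041472],
    [1.0, 1.18553450e-03, 3.03491372e-01, 60.3489082],
    [1.0, 1.54552898e-02, 8.38091425e-01, 109.021207],
    [3.0, 1.69961646e-02, 41.3691915, 136.8383],
])
y = np.array([2, 2, 0, 0, 1])

# Standardize, then project the 4 features down to 2 principal components.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(X_2d.shape)  # (5, 2)

# The 2D projection can now be scattered and fitted exactly like
# the 2-feature case that already worked for you.
clf = svm.SVC(kernel="linear").fit(X_2d, y)
```

The axes of such a plot are principal components rather than "number of sides" or "largest angle", so label them accordingly.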