Precomputed Kernels with LibSVM in Python

Disclaimer: This page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2474460/



python, machine-learning, libsvm

Asked by Lyyli

I've been searching the net for ~3 hours, but I haven't found a solution yet. I want to give a precomputed kernel to libsvm and classify a dataset, but:


  • How can I generate a precomputed kernel? (for example, what is the basic precomputed kernel for Iris data?)

  • In the libsvm documentation, it is stated that:

    For precomputed kernels, the first element of each instance must be the ID. For example,

            samples = [[1, 0, 0, 0, 0], [2, 0, 1, 0, 1], [3, 0, 0, 1, 1], [4, 0, 1, 1, 2]]
            problem = svm_problem(labels, samples)
            param = svm_parameter(kernel_type=PRECOMPUTED)
    
    

What is an ID? There are no further details on that. Can I assign IDs sequentially?


Any libsvm help and an example of precomputed kernels would be really appreciated.


Answered by StompChicken

First of all, some background on kernels and SVMs...


If you want to pre-compute a kernel for n vectors (of any dimension), what you need to do is calculate the kernel function between each pair of examples. The kernel function takes two vectors and gives a scalar, so you can think of a precomputed kernel as an n x n matrix of scalars. It's usually called the kernel matrix, or sometimes the Gram matrix.


There are many different kernels; the simplest is the linear kernel (also known as the dot product):


sum(x_i * y_i) for i in [1..N], where x = (x_1, ..., x_N) and y = (y_1, ..., y_N) are vectors

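To make this concrete, here is a minimal sketch (assuming NumPy is available) that computes the linear-kernel Gram matrix for the instances used in the libsvm documentation example quoted below:

    import numpy as np

    # The three training instances and one test instance from the libsvm
    # documentation example below, written out densely.
    X = np.array([[1.0, 1.0, 1.0, 1.0],
                  [0.0, 3.0, 0.0, 3.0],
                  [0.0, 0.0, 1.0, 0.0]])
    x_test = np.array([1.0, 0.0, 1.0, 0.0])

    K_train = X @ X.T    # K_train[i, j] = linear kernel K(x_i, x_j)
    K_test = X @ x_test  # kernel values between the test point and each training point

    print(K_train)       # [[ 4.  6.  1.]
                         #  [ 6. 18.  0.]
                         #  [ 1.  0.  1.]]
    print(K_test)        # [2. 0. 1.]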

Secondly, trying to answer your question...


The documentation about precomputed kernels in libsvm is actually pretty good...


Assume the original training data has three four-feature instances 
and testing data has one instance:

15  1:1 2:1 3:1 4:1
45      2:3     4:3
25          3:1
15  1:1     3:1

If the linear kernel is used, we have the following 
new training/testing sets:

15  0:1 1:4 2:6  3:1
45  0:2 1:6 2:18 3:0 
25  0:3 1:1 2:0  3:1

15  0:? 1:2 2:0  3:1

Each vector here in the second example is a row in the kernel matrix. The value at index zero is the ID value, and it just seems to be a sequential count. The value at index 1 of the first vector is the value of the kernel function of the first vector from the first example with itself (i.e. (1x1)+(1x1)+(1x1)+(1x1) = 4), and the second is the value of the kernel function of the first vector with the second (i.e. (1x3)+(1x3) = 6). It follows on like that for the rest of the example. You can see that the kernel matrix is symmetric, as it should be, because K(x,y) = K(y,x).


It's worth pointing out that the first set of vectors is represented in a sparse format (i.e. missing values are zero), but the kernel matrix isn't and shouldn't be sparse. I don't know why that is; it just seems to be a libsvm thing.

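To put that into code, below is a hedged sketch using the svmutil interface that ships with modern libsvm (depending on the install, the import may be from libsvm.svmutil instead; the svm_problem(labels, samples) syntax quoted in the question is from the older interface). The numbers are the Gram-matrix rows worked out above.

    from svmutil import svm_problem, svm_parameter, svm_train, svm_predict

    labels = [15, 45, 25]
    # Each row is [ID, K(x_i, x_1), K(x_i, x_2), K(x_i, x_3)]; the IDs are
    # simply the sequential counts 1, 2, 3.
    samples = [[1, 4, 6, 1],
               [2, 6, 18, 0],
               [3, 1, 0, 1]]

    prob = svm_problem(labels, samples, isKernel=True)  # flag a precomputed kernel
    param = svm_parameter('-t 4')                       # -t 4 = precomputed kernel type
    model = svm_train(prob, param)

    # A test row is [ID, K(x_test, x_1), ..., K(x_test, x_n)].
    predicted, _, _ = svm_predict([15], [[1, 2, 0, 1]], model)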

Answered by Fabian Pedregosa

scikit-learn hides most of the details of libsvm when handling custom kernels. You can either just pass an arbitrary function as your kernel and it will compute the Gram matrix for you, or pass the precomputed Gram matrix of the kernel.


For the first one, the syntax is:


   >>> from sklearn import svm
   >>> clf = svm.SVC(kernel=my_kernel)

where my_kernel is your kernel function, and then you can call clf.fit(X, y) and it will compute the kernel matrix for you. In the second case the syntax is:


   >>> from sklearn import svm
   >>> clf = svm.SVC(kernel="precomputed")

And when you call clf.fit(X, y), X must be the matrix k(X, X), where k is your kernel. See also this example for more details:


http://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html

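To tie the two options together, here is a minimal sketch on the Iris data mentioned in the question (assuming a current scikit-learn installation, where the package is imported as sklearn):

    from sklearn import svm
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)

    # Option 1: pass a callable kernel; scikit-learn builds the Gram matrix itself.
    def my_kernel(A, B):
        return A @ B.T  # linear kernel, as an example

    clf = svm.SVC(kernel=my_kernel).fit(X, y)

    # Option 2: precompute the Gram matrix yourself.
    K = X @ X.T
    clf2 = svm.SVC(kernel="precomputed").fit(K, y)

    # At predict time you must pass K(X_new, X_train), not X_new itself.
    print(clf2.predict(X[:5] @ X.T))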

Answered by Fabian Pedregosa

Here is a simple two-category, three-vector custom-kernel input file that works correctly. I will explain the parts (though you should also see StompChicken's answer):


1 0:1 1:10 2:12 3:21
2 0:2 1:12 2:19 3:30
1 0:3 1:21 2:30 3:130


The first number on each line is which category it belongs to. The next entry on each line is of the form 0:n, and it must be sequential, i.e.
0:1 on the first entry
0:2 on the second entry
0:3 on the third entry


A possible reason for this is that libsvm returns values alpha_i that go with your vectors in the output file, but for precomputed kernels the vectors themselves are not displayed (they could be truly huge); instead, the index 0:n that went with each vector is shown, to make the output easier to match up with your input. This matters especially because the output is not in the same order you put the vectors in: it is grouped by category. It is thus very useful, when reading the output, to have those 0:n values so you can match libsvm's outputs up with your own inputs. Here you can see the output:


svm_type c_svc
kernel_type precomputed
nr_class 2
total_sv 3
rho -1.53951
label 1 2
nr_sv 2 1
SV
0.4126650675419768 0:1
0.03174528241667363 0:3
-0.4444103499586504 0:2


It is important to note that with precomputed kernels you cannot omit the zero entries like you can with all other kernels. They must be explicitly included.

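As a hedged illustration (the file name here is an assumption, not part of the answer), training on a precomputed-kernel file like the one above could look like this with the libsvm Python bindings:

    from svmutil import svm_read_problem, svm_train

    # Assumes the three input lines above were saved as 'precomputed_kernel.txt'.
    y, x = svm_read_problem('precomputed_kernel.txt')
    model = svm_train(y, x, '-t 4')  # -t 4 selects the precomputed kernel type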

Answered by Gael Varoquaux

I believe that scikit-learn's Python binding of libSVM should address the problem.


See the documentation at http://scikit-learn.sourceforge.net/modules/svm.html#kernel-functions for more information.
