How long does it take to train an SVM classifier in Python?
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original address and author information, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/18165213/
Asked by Il'ya Zhenin
I wrote the following code and tested it on a small dataset:
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier
classif = OneVsRestClassifier(svm.SVC(kernel='rbf'))
classif.fit(X, y)
where X and y are NumPy arrays (X is a 30000x784 matrix, y is 30000x1). On a small dataset the algorithm works well and gives me correct results.
But I started my program about 10 hours ago, and it is still running.
I want to know how long it will take, or whether it is stuck in some way. (Laptop specs: 4 GB memory, Core i5-480M.)
Answered by lejlot
SVM training can take arbitrarily long; this depends on dozens of parameters:
- C parameter: the greater the misclassification penalty, the slower the process
- kernel: the more complicated the kernel, the slower the process (rbf is the most complex of the predefined ones)
- data size/dimensionality: again, the same rule
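The effect of data size can be checked empirically before committing to a long run. Below is a minimal sketch, assuming scikit-learn and NumPy are installed; the helper `time_svc_fit`, the synthetic data, and the chosen sizes are illustrative assumptions, not part of the original question. It times an RBF `SVC` fit at growing sample counts and estimates a rough growth exponent from a log-log fit.

```python
import time

import numpy as np
from sklearn.svm import SVC


def time_svc_fit(sizes, n_features=20, seed=0):
    """Return (n_samples, seconds) pairs for fitting an RBF SVC on synthetic data."""
    rng = np.random.default_rng(seed)
    timings = []
    for n in sizes:
        # Synthetic stand-in for real data: random features, binary labels
        # given by the sign of the first feature.
        X = rng.normal(size=(n, n_features))
        y = (X[:, 0] > 0).astype(int)
        start = time.perf_counter()
        SVC(kernel="rbf").fit(X, y)
        timings.append((n, time.perf_counter() - start))
    return timings


timings = time_svc_fit([250, 500, 1000, 2000])
for n, seconds in timings:
    print(f"n={n:5d}  fit time={seconds:.3f}s")

# The slope of log(time) against log(n) is a rough empirical growth exponent.
ns, secs = zip(*timings)
exponent = np.polyfit(np.log(ns), np.log(secs), 1)[0]
print(f"estimated growth exponent: {exponent:.2f}")
```

Timings on such small synthetic samples are noisy, and real data may behave differently, so treat the exponent only as a rough guide when extrapolating to the full dataset.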
In general, the basic SMO algorithm is O(n^3), so for 30 000 data points it has to run a number of operations proportional to 27 000 000 000 000 (30 000^3 = 2.7 × 10^13), which is a really huge number. What are your options?
- change the kernel to a linear one; 784 features is quite a lot, so rbf may be redundant
- reduce the features' dimensionality (PCA?)
- lower the C parameter
- train the model on a subset of your data to find good parameters, and then train on the whole dataset on some cluster/supercomputer
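The options above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tuned solution: it uses scikit-learn's small bundled digits dataset (1797x64) as a stand-in for the 30000x784 data, and the values `n_components=20`, the `C` grid, and the subset size of 300 are arbitrary choices made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Small bundled dataset used as a stand-in for the real data.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Option 1: linear kernel. LinearSVC handles multiclass one-vs-rest by default.
linear = make_pipeline(StandardScaler(), LinearSVC(C=0.1, max_iter=10000))
linear.fit(X_train, y_train)
linear_score = linear.score(X_test, y_test)
print("linear:", linear_score)

# Option 2: reduce dimensionality with PCA before applying the RBF kernel.
pca_rbf = make_pipeline(StandardScaler(), PCA(n_components=20), SVC(kernel="rbf"))
pca_rbf.fit(X_train, y_train)
pca_score = pca_rbf.score(X_test, y_test)
print("PCA + rbf:", pca_score)

# Options 3 and 4: search over C on a small subset only,
# then refit once on all training data with the chosen value.
X_sub, _, y_sub, _ = train_test_split(
    X_train, y_train, train_size=300, stratify=y_train, random_state=0
)
search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]}, cv=3)
search.fit(X_sub, y_sub)
best_C = search.best_params_["C"]
final = SVC(kernel="rbf", C=best_C).fit(X_train, y_train)
final_score = final.score(X_test, y_test)
print("best C on subset:", best_C)
print("rbf on full training data:", final_score)
```

Searching on the subset keeps the expensive part (many fits) cheap, so the full-size fit only has to run once with the selected parameters.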