Notice: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/23056460/
Does the SVM in sklearn support incremental (online) learning?
Asked by Michael Aquilina
I am currently in the process of designing a recommender system for text articles (a binary case of 'interesting' or 'not interesting'). One of my specifications is that it should continuously update to changing trends.
From what I can tell, the best way to do this is to use a machine learning algorithm that supports incremental/online learning.
Algorithms like the Perceptron and Winnow support online learning, but I am not completely certain about Support Vector Machines. Does the scikit-learn Python library support online learning and, if so, is a support vector machine one of the algorithms that can make use of it?
I am obviously not completely tied down to using support vector machines, but they are usually the go-to algorithm for binary classification due to their all-round performance. I would be willing to change to whatever fits best in the end.
Accepted answer by Raff.Edward
While online algorithms for SVMs do exist, it has become important to specify if you want kernel or linear SVMs, as many efficient algorithms have been developed for the special case of linear SVMs.
For the linear case, if you use the SGD classifier in scikit-learn (SGDClassifier) with the hinge loss and L2 regularization, you will get an SVM that can be updated online/incrementally. You can combine this with feature transforms that approximate a kernel to get something similar to an online kernel SVM.
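As a minimal sketch of the setup described above, assuming scikit-learn's `SGDClassifier` together with `RBFSampler` from `sklearn.kernel_approximation` as the kernel-approximating transform. The toy stream, the `gamma` value, and the helper names `make_batch` and `train_on_stream` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier


def make_batch(rng, n=200):
    """Hypothetical toy stream: points inside a circle are class 1."""
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = (np.sum(X ** 2, axis=1) < 0.5).astype(int)
    return X, y


def train_on_stream(n_batches=20, seed=0):
    rng = np.random.default_rng(seed)
    # Random Fourier features approximating an RBF kernel. With a fixed
    # gamma, fit() only draws the random projection, so fitting once on
    # the first batch is enough.
    feature_map = RBFSampler(gamma=2.0, n_components=200, random_state=seed)
    # Hinge loss + L2 penalty is the linear SVM objective, optimised by SGD.
    clf = SGDClassifier(loss="hinge", penalty="l2", random_state=seed)

    for i in range(n_batches):
        X, y = make_batch(rng)
        if i == 0:
            feature_map.fit(X)
        # classes is required on the first call and harmless afterwards.
        clf.partial_fit(feature_map.transform(X), y, classes=[0, 1])
    return feature_map, clf, rng


feature_map, clf, rng = train_on_stream()
X_test, y_test = make_batch(rng, n=1000)
accuracy = clf.score(feature_map.transform(X_test), y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The feature map is fitted once and then reused, so every batch lands in the same approximate kernel space and the linear model can keep being updated with `partial_fit`.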
"One of my specifications is that it should continuously update to changing trends."
This is referred to as concept drift, and it will not be handled well by a simple online SVM. Using the PassiveAggressive classifier will likely give you better results, as its learning rate does not decrease over time.
Assuming you get feedback while training / running, you can attempt to detect decreases in accuracy over time and begin training a new model when the accuracy starts to decrease (and switch to the new one when you believe that it has become more accurate). JSAT has 2 drift detection methods (see jsat.driftdetectors) that can be used to track accuracy and alert you when it has changed.
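To make the idea of tracking accuracy concrete, here is a rough sketch of a sliding-window monitor. It is a hand-rolled heuristic for illustration only and is not one of the JSAT drift detectors; the class name `AccuracyDriftMonitor`, the window size, and the tolerance are all assumptions.

```python
from collections import deque


class AccuracyDriftMonitor:
    """Flags drift when recent accuracy falls well below the best seen so far.

    A hand-rolled heuristic for illustration; it is not one of the
    JSAT drift detectors.
    """

    def __init__(self, window=50, tolerance=0.15):
        self.recent = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.best = 0.0
        self.tolerance = tolerance

    def update(self, was_correct):
        """Record one prediction outcome; return True if drift is suspected."""
        self.recent.append(1 if was_correct else 0)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough feedback yet
        accuracy = sum(self.recent) / len(self.recent)
        self.best = max(self.best, accuracy)
        return accuracy < self.best - self.tolerance


monitor = AccuracyDriftMonitor(window=50, tolerance=0.15)
# Simulated feedback: 200 mostly correct predictions, then mostly wrong ones.
feedback = [i % 10 != 0 for i in range(200)] + [i % 10 == 0 for i in range(100)]
first_alarm = next(
    (t for t, ok in enumerate(feedback) if monitor.update(ok)), None
)
print("drift first flagged at step", first_alarm)
```

When `update` returns True, that would be the point to start training a replacement model alongside the current one, as the answer suggests.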
It also has more online linear and kernel methods.
(bias note: I'm the author of JSAT).
Answer by lejlot
Technical aspects
The short answer is no. The sklearn implementation (like most of the other existing ones) does not support online SVM training. It is possible to train an SVM incrementally, but it is not a trivial task.
If you want to limit yourself to the linear case, then the answer is yes, as sklearn provides Stochastic Gradient Descent (SGD), which has an option to minimize the SVM criterion.
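For reference, the "SVM criterion" in the linear case is the L2-regularised hinge loss. A sketch of the objective and of the per-sample SGD update follows. The symbols are notation chosen for this sketch rather than the original answer's: weights $w$, bias $b$, regularisation strength $\alpha$, learning rate $\eta$, and labels $y_i \in \{-1, +1\}$.

```latex
\min_{w,\,b}\; \frac{\alpha}{2}\lVert w \rVert_2^2
  + \frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr)

% update for one sample (x_i, y_i):
w \leftarrow
\begin{cases}
  (1 - \eta\alpha)\,w + \eta\, y_i x_i & \text{if } y_i\,(w^\top x_i + b) < 1 \\
  (1 - \eta\alpha)\,w                  & \text{otherwise}
\end{cases}
```

Because each update touches only one sample, the same rule can be applied to data as it arrives, which is what makes the linear case suitable for online training.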
You can also try the pegasos library instead, which supports online SVM training.
Theoretical aspects
The problem of trend adaptation is currently very popular in the ML community. As @Raff stated, it is called concept drift, and there are numerous approaches to it. These are often kinds of meta-models that analyze "how the trend is behaving" and change the underlying ML model (for example, by forcing it to retrain on a subset of the data). So you have two independent problems here:
- the online training issue, which is purely technical and can be addressed by SGD or by libraries other than sklearn
- concept drift, which is currently a hot topic and has no "just works" answers. There are many possibilities, hypotheses and proofs of concept, but no single, generally accepted way of dealing with this phenomenon; in fact, many PhD dissertations in ML are currently based on this issue.
Answer by Jariani
Maybe it's me being naive, but I think it is worth mentioning how to actually update the scikit-learn SGD classifier when you present your data incrementally:
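from sklearn import linear_model

clf = linear_model.SGDClassifier()
# the first partial_fit call must list every class label that will ever appear
# (all_classes is a placeholder, e.g. [0, 1] for the binary case)
x1 = some_new_data
y1 = the_labels
clf.partial_fit(x1, y1, classes=all_classes)
# later batches update the same model in place
x2 = some_newer_data
y2 = the_labels
clf.partial_fit(x2, y2)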
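Since the question is about text articles, the same update loop could be paired with a stateless vectorizer so that no vocabulary has to be refitted as new words appear. The following is a sketch under that assumption, using scikit-learn's `HashingVectorizer`; the example texts, labels, and `n_features` value are made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless: no vocabulary to fit, so unseen words in later batches are fine.
vectorizer = HashingVectorizer(n_features=2 ** 16)
clf = SGDClassifier(loss="hinge", random_state=0)

# Hypothetical labelled articles: 1 = interesting, 0 = not interesting.
batch_1 = (["new python machine learning release", "celebrity gossip roundup"], [1, 0])
batch_2 = (["online learning with gradient descent", "fashion week highlights"], [1, 0])

for i, (texts, labels) in enumerate([batch_1, batch_2]):
    X = vectorizer.transform(texts)
    if i == 0:
        # The first call has to declare every class that may ever appear.
        clf.partial_fit(X, labels, classes=[0, 1])
    else:
        clf.partial_fit(X, labels)

prediction = clf.predict(vectorizer.transform(["a tutorial on machine learning"]))
print(prediction)
```

With so little data the prediction itself is not meaningful; the point is only that each new batch is transformed and passed to `partial_fit` without retraining from scratch.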
Answer by SemanticBeeng
If you are interested in online learning with concept drift, here is some previous work:
Learning under Concept Drift: an Overview https://arxiv.org/pdf/1010.4784.pdf
The problem of concept drift: definitions and related work http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.9085&rep=rep1&type=pdf
A Survey on Concept Drift Adaptation http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf
MOA Concept Drift Active Learning Strategies for Streaming Data http://videolectures.net/wapa2011_bifet_moa/
A Stream of Algorithms for Concept Drift http://people.cs.georgetown.edu/~maloof/pubs/maloof.heilbronn12.handout.pdf
MINING DATA STREAMS WITH CONCEPT DRIFT http://www.cs.put.poznan.pl/dbrzezinski/publications/ConceptDrift.pdf
Analyzing time series data with stream processing and machine learning http://www.ibmbigdatahub.com/blog/analyzing-time-series-data-stream-processing-and-machine-learning
Answer by Alaleh Rz
SGD for batch learning tasks normally has a decreasing learning rate and goes over the training set multiple times. So, for purely online learning, make sure learning_rate is set to 'constant' in sklearn.linear_model.SGDClassifier() and eta0 is set to 0.1 or any other desired value. The process is then as follows:
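from sklearn.linear_model import SGDClassifier

# n_iter is omitted: recent scikit-learn releases no longer accept it, and partial_fit makes one pass per call anyway
clf = SGDClassifier(learning_rate='constant', eta0=0.1, shuffle=False)
# get x1, y1 as a new instance; the first call must list all class labels (classes is a placeholder)
clf.partial_fit(x1, y1, classes=classes)
# get x2, y2
# update accuracy if needed
clf.partial_fit(x2, y2)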
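The "update accuracy if needed" step can be done in a test-then-train fashion: predict on each new instance before its label is used for the update. Below is a sketch of that loop with the constant learning rate from above, run on a synthetic stream whose labelling rule inverts halfway through. The stream, the drift point, and the function name `prequential_run` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def prequential_run(n_samples=2000, drift_at=1000, seed=0):
    """Test-then-train on a synthetic stream whose labels flip at drift_at."""
    rng = np.random.default_rng(seed)
    clf = SGDClassifier(learning_rate="constant", eta0=0.1,
                        shuffle=False, random_state=seed)
    outcomes = []
    for t in range(n_samples):
        x = rng.uniform(-1.0, 1.0, size=(1, 2))
        label = int(x[0, 0] + x[0, 1] > 0)
        if t >= drift_at:
            label = 1 - label  # abrupt concept drift: the rule inverts
        y = np.array([label])
        if t > 0:
            # Test first: score the prediction before the label is used.
            outcomes.append(int(clf.predict(x)[0] == label))
        clf.partial_fit(x, y, classes=[0, 1])
    return np.array(outcomes)


outcomes = prequential_run()
print("accuracy before drift:", outcomes[500:999].mean())
print("accuracy right after drift:", outcomes[999:1100].mean())
print("accuracy at end of stream:", outcomes[-500:].mean())
```

Because the step size never shrinks, the model keeps moving towards the new rule after the flip instead of freezing around the old one.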
Answer by Sergey Zakharov
One way to scale an SVM could be to split your large dataset into batches that can be safely consumed by an SVM algorithm, find the support vectors for each batch separately, and then build the final SVM model on a dataset consisting of all the support vectors found in all the batches.
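The batching idea above can be sketched with scikit-learn's `SVC`, whose `support_` attribute holds the indices of the support vectors. The helper name `fit_svm_on_support_vectors`, the batch count, and the synthetic data are assumptions for illustration, and the result is an approximation that need not match an SVM trained on the full dataset.

```python
import numpy as np
from sklearn.svm import SVC


def fit_svm_on_support_vectors(X, y, n_batches=5, **svc_params):
    """Fit one SVC per batch, then a final SVC on the pooled support vectors."""
    sv_X, sv_y = [], []
    for X_b, y_b in zip(np.array_split(X, n_batches), np.array_split(y, n_batches)):
        batch_model = SVC(**svc_params).fit(X_b, y_b)
        # support_ holds the row indices of this batch's support vectors.
        sv_X.append(X_b[batch_model.support_])
        sv_y.append(y_b[batch_model.support_])
    X_sv, y_sv = np.vstack(sv_X), np.concatenate(sv_y)
    return SVC(**svc_params).fit(X_sv, y_sv), len(y_sv)


# Synthetic two-class data, assumed only for demonstration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, size=(500, 2)),
               rng.normal(1.5, 1.0, size=(500, 2))])
y = np.array([0] * 500 + [1] * 500)
order = rng.permutation(len(y))  # shuffle so every batch sees both classes
X, y = X[order], y[order]

model, n_kept = fit_svm_on_support_vectors(X, y, n_batches=5, kernel="linear")
print(f"kept {n_kept} of {len(y)} points; accuracy {model.score(X, y):.3f}")
```

Each batch must contain examples of both classes for the per-batch fit to work, which is why the data is shuffled before splitting.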
Updating to trends could be achieved by maintaining a time window each time you run your training pipeline. For example, if you train once a day and a month of historical data contains enough information, create your training dataset from the data obtained in the most recent 30 days.
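A small sketch of the time-window selection, assuming each historical sample is stored as a `(timestamp, features, label)` tuple. That layout, the helper name `recent_window`, and the generated history are hypothetical.

```python
from datetime import datetime, timedelta


def recent_window(samples, now, days=30):
    """Keep only (timestamp, features, label) samples from the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [s for s in samples if cutoff <= s[0] <= now]


now = datetime(2024, 3, 31)
# Hypothetical history: one labelled sample per day for the last 90 days.
history = [(now - timedelta(days=d), [float(d)], d % 2) for d in range(90)]

window = recent_window(history, now, days=30)
X_train = [features for _, features, _ in window]
y_train = [label for _, _, label in window]
print(len(window), "samples in the training window")
```

`X_train` and `y_train` would then be handed to whatever model the daily training pipeline fits.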