Imbalance in Python's scikit-learn

Disclaimer: this page is an English rendering of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/15065833/

Imbalance in scikit-learn

Tags: python, scikit-learn

Asked by Maoritzio

I'm using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my data-set has severe imbalance issues.

Is anyone familiar with a solution for imbalance in scikit-learn or in Python in general? In Java there's the SMOTE mechanism. Is there something comparable in Python?

Answered by Junuxx

SMOTE is not built into scikit-learn, but there are implementations available online nevertheless.

Edit: The discussion with a SMOTE implementation on GMane that I originally linked to appears to be no longer available. The code is preserved here.

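For intuition only, here is a minimal sketch of the core SMOTE idea (interpolate between a minority-class sample and one of its k nearest minority-class neighbors). This is an illustration of the technique, not the implementation preserved at that link:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from minority-class rows X_min."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # drop column 0: each point is its own nearest neighbor
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))   # pick a random minority sample
        j = rng.choice(neighbors[i])   # pick one of its k nearest neighbors
        gap = rng.random()             # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
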
The newer answer below, by @nos, is also quite good.

Answered by Lucas Ribeiro

In scikit-learn there are some imbalance-correction techniques, which vary depending on which learning algorithm you are using.

Some of them, like SVM or logistic regression, have the class_weight parameter. If you instantiate an SVC with this parameter set to 'auto' (called 'balanced' in current scikit-learn), it will weight each class example proportionally to the inverse of its frequency.

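For illustration, a minimal sketch of class_weight; the explicit dict at the end uses made-up weights:

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# 'balanced' weights each class inversely to its frequency:
#   weight_c = n_samples / (n_classes * n_samples_in_class_c)
svc = SVC(class_weight='balanced')
logreg = LogisticRegression(class_weight='balanced')

# An explicit mapping also works, e.g. make class 1 count five times as much:
svc_manual = SVC(class_weight={0: 1, 1: 5})
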
Unfortunately, there isn't a preprocessing tool for this purpose.

Answered by burgersmoke

I found one other library here which implements undersampling and also multiple oversampling techniques, including multiple SMOTE implementations, and another which uses SVM:

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

Answered by nos

There is a new one here:

https://github.com/scikit-learn-contrib/imbalanced-learn

It contains many algorithms in the following categories, including SMOTE (a short sketch of one sampler per category follows the list):

  • Under-sampling the majority class(es).
  • Over-sampling the minority class.
  • Combining over- and under-sampling.
  • Create ensemble balanced sets.
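
As a quick sketch, here is one sampler from each category; the class and parameter names are from recent imbalanced-learn releases and may differ in older versions:

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from imblearn.ensemble import BalancedBaggingClassifier

# toy imbalanced binary problem (roughly 90% / 10%)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)  # under-sampling
X_o, y_o = SMOTE(random_state=0).fit_resample(X, y)               # over-sampling
X_c, y_c = SMOTEENN(random_state=0).fit_resample(X, y)            # combined over/under
clf = BalancedBaggingClassifier(random_state=0).fit(X, y)         # balanced ensemble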

Answered by Matt Elgazar

Since others have listed links to the very popular imbalanced-learn library, I'll give an overview of how to use it properly, along with some links.

https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html

https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html

https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html

https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py

https://imbalanced-learn.org/en/stable/combine.html

Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.over_sampling.SMOTE. These classes have a convenient parameter (sampling_strategy, formerly ratio) that allows the user to change the sampling ratio.

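As a small illustration of that parameter (sampling_strategy in recent imbalanced-learn releases; the toy data below is made up):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print('before:', Counter(y))

# a float sampling_strategy of 0.5 requests a 1:2 minority-to-majority ratio
ros = RandomOverSampler(sampling_strategy=0.5, random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print('after: ', Counter(y_res))
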
For example, in SMOTE, to change the ratio you would pass in a dictionary, and each value must be greater than or equal to that class's current sample count (since SMOTE is an over-sampling technique). The reason I have found SMOTE to be a better fit for model performance, in my experience, is probably that with RandomOverSampler you are duplicating rows, which means the model can start to memorize the data rather than generalize to new data. SMOTE instead uses the k-nearest-neighbors algorithm to create synthetic data points that are "similar" to the existing minority-class samples.

It is not good practice to blindly use SMOTE with the ratio set to its default (even class balance), because the model may overfit one or more of the minority classes (even though SMOTE is using nearest neighbors to make "similar" observations). In a similar way to tuning the hyperparameters of an ML model, you will tune the hyperparameters of the SMOTE algorithm, such as the ratio and/or the number of nearest neighbors. Below is a working example of how to properly use SMOTE.

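One way to tune those SMOTE hyperparameters without leaking data is to put the sampler inside an imblearn Pipeline and cross-validate. A minimal sketch on made-up toy data, not from the original answer:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# imblearn's Pipeline resamples only while fitting, and only on each CV
# fold's training split, so the validation folds stay untouched.
pipe = Pipeline([('smote', SMOTE(random_state=0)),
                 ('clf', LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {'smote__k_neighbors': [3, 5, 7]},
                    scoring='f1_macro', cv=5).fit(X, y)
print(grid.best_params_)
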
NOTE: It is vital that you do not use SMOTE on the full data set. You MUST use SMOTE on the training set only (after you split). Then validate on your val/test sets and see if your SMOTE model outperformed your other model(s). If you do not do this there will be data leakage and your model is essentially cheating.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Split FIRST, so SMOTE only ever sees the training data; resampling before
# splitting would leak synthetic copies of test rows into the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X_normalized, y, stratify=y, random_state=0)

# sampling_strategy (formerly `ratio`) maps each class label to its desired
# sample count after resampling; fit_sample is fit_resample in current imblearn.
sm = SMOTE(random_state=0,
           sampling_strategy={'class1': 100, 'class2': 100, 'class3': 80,
                              'class4': 60, 'class5': 90})
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)

print('Original training set shape:', Counter(y_train))
print('Resampled training set shape:', Counter(y_train_smote))

smote_xgbc = XGBClassifier(n_jobs=8).fit(X_train_smote, y_train_smote)

# Evaluate on the untouched (non-resampled) splits; macro-F1 because this
# is a multi-class problem.
print('TRAIN')
print(accuracy_score(y_train, smote_xgbc.predict(X_train)))
print(f1_score(y_train, smote_xgbc.predict(X_train), average='macro'))

print('TEST')
print(accuracy_score(y_test, smote_xgbc.predict(X_test)))
print(f1_score(y_test, smote_xgbc.predict(X_test), average='macro'))