Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/42699243/

Time: 2020-08-19 22:02:19. Source: igfitidea

How to build a lift chart (a.k.a gains chart) in Python?

Tags: python, machine-learning, modeling, evaluation

Asked by Abhishek Arora

I just created a model using scikit-learn which estimates the probability that a client will respond to some offer. Now I'm trying to evaluate my model, and for that I want to plot the lift chart. I understand the concept of lift, but I'm struggling to understand how to actually implement it in Python.

Accepted answer by morganics

Lift/cumulative gains charts aren't a good way to evaluate a model (they cannot be used to compare models); instead, they are a means of evaluating results when your resources are finite: either because there is a cost to acting on each result (as in a marketing scenario), or because you want to ignore a certain number of guaranteed voters and act only on those who are on the fence. Where your model is very good and has high classification accuracy for all results, you won't get much lift from ordering your results by confidence.

import numpy as np
import pandas as pd
import sklearn.metrics

def calc_cumulative_gains(df: pd.DataFrame, actual_col: str, predicted_col: str, probability_col: str):

    # Rank cases from most to least confident; work on a sorted copy
    # so the caller's DataFrame is not modified
    df = df.sort_values(by=probability_col, ascending=False)

    # Keep only the cases the model predicted as positive
    subset = df[df[predicted_col] == True]

    rows = []
    # Split the row positions into 10 near-equal groups (deciles)
    for positions in np.array_split(np.arange(len(subset)), 10):
        group = subset.iloc[positions]
        # normalize=False returns the number of correct predictions, not the fraction
        score = sklearn.metrics.accuracy_score(group[actual_col].tolist(),
                                               group[predicted_col].tolist(),
                                               normalize=False)

        rows.append({'NumCases': len(group), 'NumCorrectPredictions': score})

    lift = pd.DataFrame(rows)
    total_correct = lift['NumCorrectPredictions'].sum()

    # Cumulative Gains Calculation
    lift['RunningCorrect'] = lift['NumCorrectPredictions'].cumsum()
    lift['PercentCorrect'] = 100 * lift['RunningCorrect'] / total_correct
    lift['CumulativeCorrectBestCase'] = lift['NumCases'].cumsum()
    # Best case: every case in each group is correct, capped at 100%
    lift['PercentCorrectBestCase'] = (100 * lift['CumulativeCorrectBestCase'] / total_correct).clip(upper=100)
    # Average case: correct predictions spread evenly across the groups
    lift['AvgCase'] = total_correct / len(lift)
    lift['CumulativeAvgCase'] = lift['AvgCase'].cumsum()
    lift['PercentAvgCase'] = 100 * lift['CumulativeAvgCase'] / total_correct

    # Lift Chart
    lift['NormalisedPercentAvg'] = 1
    lift['NormalisedPercentWithModel'] = lift['PercentCorrect'] / lift['PercentAvgCase']

    return lift
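
The function above expects a DataFrame with one row per case and three columns: the actual outcome, the predicted class, and the predicted probability. The answer does not show how to build that input, so the following is only a sketch under assumptions: the synthetic data, the `LogisticRegression` model, and the column names `actual`, `predicted`, and `probability` are all placeholders for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the real client data (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hypothetical column names; any names work as long as the same ones
# are passed to calc_cumulative_gains
scored = pd.DataFrame({
    'actual': y_test.astype(bool),
    'predicted': model.predict(X_test).astype(bool),
    'probability': model.predict_proba(X_test)[:, 1],  # P(class == 1)
})

# With the function above in scope, the call would then be:
# lift = calc_cumulative_gains(scored, 'actual', 'predicted', 'probability')
```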

To plot the cumulative gains chart, you can use this code below.

    import matplotlib.pyplot as plt

    def plot_cumulative_gains(lift: pd.DataFrame):
        fig, ax = plt.subplots()

        # Each row of `lift` is one group, so with 10 groups the x-axis
        # runs 10%, 20%, ..., 100% of the population
        population_pct = 100 * (lift.index.to_numpy() + 1) / len(lift)

        ax.plot(population_pct, lift['PercentCorrect'], 'r-', label='Percent Correct Predictions')
        ax.plot(population_pct, lift['PercentCorrectBestCase'], 'g-', label='Best Case (for current model)')
        ax.plot(population_pct, lift['PercentAvgCase'], 'b-', label='Average Case (for current model)')
        ax.set_xlabel('Total Population (%)')
        ax.set_ylabel('Number of Respondents (%)')

        ax.set_xlim([10, 100])
        ax.set_ylim([10, 100])
        ax.set_xticks(population_pct)

        ax.legend()
        plt.show()

And to visualise lift:

    def plot_lift_chart(lift: pd.DataFrame):
        plt.figure()
        plt.plot(lift['NormalisedPercentAvg'], 'r-', label='Normalised \'response rate\' with no model')
        plt.plot(lift['NormalisedPercentWithModel'], 'g-', label='Normalised \'response rate\' with using model')
        plt.legend()
        plt.show()

The result looks like this:

[Image: Cumulative Gains Chart]

I found these websites useful for reference:

Edit:

I found the MS link somewhat misleading in its descriptions, but the Paul Te Braak link very informative. To answer the comment:

@Tanguy: for the cumulative gains chart above, all the calculations are based upon the accuracy of that specific model. As the Paul Te Braak link notes, how can my model's prediction accuracy reach 100% (the red line in the chart)? The best-case scenario (the green line) is how quickly we can reach the same accuracy that the red line achieves over the course of the whole population (i.e. our optimum cumulative gains scenario). Blue is what we get if we just randomly pick the classification for each sample in the population. So the cumulative gains and lift charts are purely for understanding how that model (and that model only) will give me more impact in a scenario where I'm not going to interact with the entire population.
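
For comparison, here is a rough sketch of the same three curves (model, best case, random) using the conventional definition of cumulative gains: the share of all actual positives captured within the top-ranked cases. This is not the accuracy-based variant used in the code above. The function name `cumulative_gains`, its `n_bins` parameter, and the toy data are assumptions for illustration.

```python
import numpy as np

def cumulative_gains(y_true, y_score, n_bins=10):
    """Return (population_pct, model_pct, perfect_pct, random_pct) per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)

    # Rank cases from highest to lowest score
    order = np.argsort(-y_score, kind="stable")
    cum_pos = np.cumsum(y_true[order])
    total_pos = y_true.sum()

    # Number of cases contacted at the end of each bin
    cutoffs = np.ceil(np.arange(1, n_bins + 1) * len(y_true) / n_bins).astype(int)

    population_pct = 100 * cutoffs / len(y_true)
    # Model: positives actually captured in the top-ranked cases
    model_pct = 100 * cum_pos[cutoffs - 1] / total_pos
    # Perfect ordering: every contacted case is a positive until none remain
    perfect_pct = 100 * np.minimum(cutoffs, total_pos) / total_pos
    # Random ordering: positives are captured in proportion to cases contacted
    random_pct = population_pct.copy()
    return population_pct, model_pct, perfect_pct, random_pct

# Toy data: 3 positives among 10 cases, already sorted by score
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
population_pct, model_pct, perfect_pct, random_pct = cumulative_gains(y_true, y_score, n_bins=5)
```

On the toy data, the top 20% of cases captures two of the three positives, and the top 40% captures all of them.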

One scenario where I have used the cumulative gains chart is fraud cases, where I want to know how many applications we can essentially ignore or prioritise (because I know that the model predicts them as well as it can) for the top X percent. In that case, for the 'average model' I instead selected the classification from the real unordered dataset (to show how existing applications were being processed, and how, using the model, we could instead prioritise types of application).
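
The "top X percent" question can be reduced to a single number: the response rate among the top-ranked X% of cases divided by the overall response rate. The sketch below assumes that conventional definition of lift; the helper name `lift_at` and the toy data are hypothetical.

```python
import numpy as np

def lift_at(y_true, y_score, fraction):
    """Lift in the top `fraction` of cases when ranked by score."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)

    # Number of cases in the top slice (at least one)
    n_top = max(1, int(np.ceil(fraction * len(y_true))))
    top = np.argsort(-y_score, kind="stable")[:n_top]

    # Positive rate in the top slice relative to the overall positive rate
    return y_true[top].mean() / y_true.mean()

# Toy data: 3 positives among 10 cases
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

top20 = lift_at(y_true, y_score, 0.2)   # both top cases are positive: 1.0 / 0.3, about 3.33
whole = lift_at(y_true, y_score, 1.0)   # the whole population always has lift 1.0
```

A lift near 1 in the top slice would suggest that ranking by the model's score adds little over picking cases at random.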

So, for comparing models, just stick with ROC/AUC, and once you're happy with the selected model, use the cumulative gains/lift chart to see how it responds to the data.
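
A minimal sketch of that comparison step, assuming two candidate scikit-learn classifiers and synthetic data (both are stand-ins chosen for illustration, not part of the original answer). `roc_auc_score` is given the predicted probability of the positive class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data as a placeholder for the real dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Hypothetical candidates to compare
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'gaussian_nb': GaussianNB(),
}

auc_by_model = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # Score with the probability of the positive class on held-out data
    auc_by_model[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Pick the candidate with the highest held-out AUC
best_name = max(auc_by_model, key=auc_by_model.get)
print(auc_by_model, best_name)
```

The gains or lift chart would then be drawn only for the model stored under `best_name`.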

Answer by Jonny Brooks

You can use the scikit-plot package to do the heavy lifting.

skplt.metrics.plot_cumulative_gain(y_test, predicted_probas)

Example

# The usual train-test split mumbo-jumbo
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
nb = GaussianNB()
nb.fit(X_train, y_train)
predicted_probas = nb.predict_proba(X_test)

# The magic happens here
import matplotlib.pyplot as plt
import scikitplot as skplt
skplt.metrics.plot_cumulative_gain(y_test, predicted_probas)
plt.show()

This should result in a plot like this: [image not included]
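
If scikit-plot is not available, a broadly similar curve for the positive class can be sketched with NumPy and matplotlib alone. This is an approximation under assumptions, not a reimplementation of `plot_cumulative_gain`: the helper name `plot_gain_curve`, the axis labels, and the random toy data are all hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_gain_curve(y_true, y_score, ax=None):
    """Plot the cumulative gain curve for the positive class (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")

    # Fraction of all positives captured after each additional ranked sample,
    # with a leading 0 so the curve starts at the origin
    gains = np.insert(np.cumsum(y_true[order]) / y_true.sum(), 0, 0.0)
    population = np.arange(len(y_true) + 1) / len(y_true)

    if ax is None:
        _, ax = plt.subplots()
    ax.plot(population, gains, label='Model')
    ax.plot([0, 1], [0, 1], 'k--', label='Baseline (random)')
    ax.set_xlabel('Fraction of sample')
    ax.set_ylabel('Fraction of positives captured')
    ax.legend()
    return ax, population, gains

# Toy data: scores are noisy but correlated with the true label
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = 0.3 * y_true + 0.7 * rng.random(200)

ax, population, gains = plot_gain_curve(y_true, y_score)
# Call plt.show() to display the figure in an interactive session
```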