Pruning Decision Trees in Python

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/49428469/

Date: 2020-08-19 19:05:42 · Source: igfitidea

Pruning Decision Trees

Tags: python, scikit-learn

Asked by David Dale

Hi guys, below is a snippet of the decision tree, as the full tree is pretty huge.

[image: snippet of the decision tree]

How can I make the tree stop growing when the lowest value in a node is under 5? Here is the code that produces the decision tree. In the SciKit decision tree documentation, the only way to do so appears to be min_impurity_decrease, but I am not sure how it specifically works.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


X, y = make_classification(n_samples=1000,
                           n_features=6,
                           n_informative=3,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Creating a dataFrame
df = pd.DataFrame({'Feature 1': X[:, 0],
                   'Feature 2': X[:, 1],
                   'Feature 3': X[:, 2],
                   'Feature 4': X[:, 3],
                   'Feature 5': X[:, 4],
                   'Feature 6': X[:, 5],
                   'Class': y})


y_train = df['Class']
X_train = df.drop('Class',axis = 1)

dt = DecisionTreeClassifier( random_state=42)                
dt.fit(X_train, y_train)

from IPython.display import display, Image
import pydotplus
from sklearn import tree
import os

# use a raw string so the backslashes in the Windows path are not
# interpreted as escape sequences
os.environ["PATH"] += os.pathsep + r'C:\Anaconda3\Library\bin\graphviz'

dot_data = tree.export_graphviz(dt, out_file='thisIsTheImagetree.dot',
                                feature_names=X_train.columns,
                                filled=True, rounded=True,
                                special_characters=True)

graph = pydotplus.graph_from_dot_file('thisIsTheImagetree.dot')  

thisIsTheImage = Image(graph.create_png())
display(thisIsTheImage)
#print(dt.tree_.feature)

from subprocess import check_call
check_call(['dot','-Tpng','thisIsTheImagetree.dot','-o','thisIsTheImagetree.png'])

Update


I think min_impurity_decrease can, in a way, help reach the goal, since tweaking min_impurity_decrease does actually prune the tree. Can anyone kindly explain min_impurity_decrease?

I am trying to understand the equation in scikit-learn, but I am not sure what the values of right_impurity and left_impurity are.

N = 256
N_t = 256
impurity = ??
N_t_R = 242
N_t_L = 14
right_impurity = ??
left_impurity = ??

New_Value = N_t / N * (impurity - ((N_t_R / N_t) * right_impurity)
                    - ((N_t_L / N_t) * left_impurity))
New_Value
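For reference, with the default criterion='gini' the impurity terms in that formula are the Gini impurities of the parent and the two child nodes. Here is a sketch with made-up class counts; only N, N_t, N_t_R and N_t_L are taken from the question, and the class counts are assumptions chosen to be consistent with those sample sizes:

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

N, N_t = 256, 256        # from the question
N_t_R, N_t_L = 242, 14   # from the question

impurity = gini([160, 96])        # assumed parent class counts
right_impurity = gini([156, 86])  # assumed right-child class counts
left_impurity = gini([4, 10])     # assumed left-child class counts

new_value = N_t / N * (impurity
                       - (N_t_R / N_t) * right_impurity
                       - (N_t_L / N_t) * left_impurity)
# the split is kept only if new_value >= min_impurity_decrease
print(new_value)
```

With these assumed counts the decrease is small but positive, so the split would survive only a very low min_impurity_decrease threshold.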

Update 2


Instead of pruning at a certain value, we prune under a certain condition. For example, we do split at 6/4 and 5/5, but not at 6000/4 or 5000/5. Say we prune when one value in a node is under a certain percentage of its adjacent value, rather than under a certain absolute value.

      11/9
   /       \
  6/4       5/5
 /   \     /   \
6/0  0/4  2/2  3/3

Answered by David Dale

Directly restricting the lowest value (number of occurrences of a particular class) of a leaf cannot be done with min_impurity_decrease or any other built-in stopping criterion.

I think the only way you can accomplish this without changing the source code of scikit-learn is to post-prune your tree. To accomplish this, you can just traverse the tree and remove all children of the nodes with a minimum class count less than 5 (or any other condition you can think of). I will continue your example:

from sklearn.tree._tree import TREE_LEAF

def prune_index(inner_tree, index, threshold):
    if inner_tree.value[index].min() < threshold:
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
    # if there are children, visit them as well
    if inner_tree.children_left[index] != TREE_LEAF:
        prune_index(inner_tree, inner_tree.children_left[index], threshold)
        prune_index(inner_tree, inner_tree.children_right[index], threshold)

print(sum(dt.tree_.children_left < 0))
# start pruning from the root
prune_index(dt.tree_, 0, 5)
sum(dt.tree_.children_left < 0)

This code will first print 74, and then 91. That means the code has created 17 new leaf nodes (by practically removing the links from their parents to their former children). The tree, which previously looked like

[image: the tree before pruning]

now looks like

现在看起来像

[image: the tree after pruning]

So you can see that the tree has indeed shrunk a lot.

Answered by Seljuk Gülcan

Edit: This is not correct, as @SBylemans and @Viktor point out in the comments. I'm not deleting the answer since someone else may also think this is the solution.

Set min_samples_leaf to 5.

min_samples_leaf: the minimum number of samples required to be at a leaf node.
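As a quick sketch of why this doesn't match the asker's condition (using the same synthetic data as the question; the seed is arbitrary): min_samples_leaf constrains the total sample count per leaf, not the per-class count, so a 5/0 or 4/1 leaf is still allowed.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)

dt = DecisionTreeClassifier(min_samples_leaf=5, random_state=42)
dt.fit(X, y)

# leaves are the nodes with no left child (-1 means "no child")
leaf_mask = dt.tree_.children_left == -1

# every leaf holds at least 5 samples in total, but the counts of the
# individual classes inside a leaf can still be below 5
print(dt.tree_.n_node_samples[leaf_mask].min())
```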

Update: I think it cannot be done with min_impurity_decrease. Consider the following scenario:

      11/9
   /         \
  6/4       5/5
 /   \     /   \
6/0  0/4  2/2  3/3

According to your rule, you do not want to split the 6/4 node, since 4 is less than 5, but you do want to split the 5/5 node. However, splitting the 6/4 node has an information gain of 0.48, while splitting the 5/5 node has an information gain of 0.
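The two impurity decreases quoted above can be checked in a few lines. This is a sketch assuming the Gini criterion (scikit-learn's default), which is what makes the 6/4 split come out at exactly 0.48:

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gain(parent, left, right):
    """Impurity decrease of splitting `parent` into `left` and `right`."""
    n = sum(parent)
    return (gini(parent)
            - sum(left) / n * gini(left)
            - sum(right) / n * gini(right))

# splitting 6/4 into the pure children 6/0 and 0/4 removes all impurity
print(split_gain([6, 4], [6, 0], [0, 4]))  # ~0.48

# splitting 5/5 into 2/2 and 3/3 leaves the children as mixed as the parent
print(split_gain([5, 5], [2, 2], [3, 3]))  # ~0.0
```

Because the gain is 0.48 for the split you want to forbid and 0 for the one you want to allow, no single min_impurity_decrease threshold can separate the two cases.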

Answered by James

Interestingly, min_impurity_decrease doesn't look as if it would allow growing any of the nodes shown in the snippet you provided (the sum of the impurities after splitting equals the pre-split impurity, so there is no impurity decrease). However, while it won't give you exactly the result you want (terminate a node if its lowest value is under 5), it may give you something similar.

If my testing is right, the official docs make it look more complicated than it actually is. Just take the lower value from the potential parent node, then subtract the sum of the lower values of the proposed new nodes; this is the gross impurity reduction. Then divide by the total number of samples in the whole tree; this gives you the fractional impurity decrease achieved if the node is split.

If you have 1000 samples, and a node with a lower value of 5 (i.e. 5 "impurities"), 5/1000 represents the maximum impurity decrease you could achieve if this node were perfectly split. So setting a min_impurity_decrease of 0.005 would approximate stopping leaves with fewer than 5 impurities. It would actually stop most leaves with a bit more than 5 impurities (depending upon the impurities resulting from the proposed split), so it is only an approximation, but as best I can tell it's the closest you can get without post-pruning.
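A minimal sketch of this rule of thumb, using the same synthetic data as the question (the 0.005 threshold comes from the 5/1000 reasoning above; the seed is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)

# unrestricted tree for comparison
dt_full = DecisionTreeClassifier(random_state=42).fit(X, y)

# with 1000 samples, min_impurity_decrease=0.005 roughly corresponds to
# refusing splits that cannot "fix" at least ~5 impure samples
dt_pruned = DecisionTreeClassifier(min_impurity_decrease=0.005,
                                   random_state=42).fit(X, y)

print(dt_full.get_n_leaves(), dt_pruned.get_n_leaves())
```

The pruned tree ends up with far fewer leaves, though the cut is only approximately aligned with the "lowest value under 5" condition.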