Python 使用 Scikit-learn 计算信息增益

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/46752650/

Date: 2020-08-19 17:49:15  Source: igfitidea

Information Gain calculation with Scikit-learn

python  machine-learning  scikit-learn  text-classification  feature-selection

Asked by Characeae

I am using Scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. The Information Gain is defined as H(Class) - H(Class | Attribute), where H is the entropy.

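As a worked illustration of this definition (a minimal sketch with made-up data, not part of the original question):

```python
import numpy as np

def shannon_entropy(counts):
    """Shannon entropy (in nats) from a vector of counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-np.sum(p * np.log(p)))

# Toy data: one binary attribute (e.g. word present/absent) and a class.
attribute = np.array([1, 1, 0, 0, 0, 1])
label = np.array([1, 1, 0, 0, 1, 1])

# H(Class)
h_class = shannon_entropy(np.bincount(label))

# H(Class | Attribute): entropy within each attribute value,
# weighted by how often that value occurs
h_cond = sum(
    np.mean(attribute == v) * shannon_entropy(np.bincount(label[attribute == v]))
    for v in np.unique(attribute)
)

info_gain = h_class - h_cond  # H(Class) - H(Class | Attribute)
```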

In Weka, this can be accomplished with the InfoGainAttributeEval attribute evaluator. But I haven't found this measure in scikit-learn.


However, it has been suggested that the formula above for Information Gain is the same measure as mutual information. This also matches the definition in Wikipedia.

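This equivalence is easy to check numerically: scikit-learn's mutual_info_score computes I(X; Y) for two discrete variables, which should equal H(Class) - H(Class | Attribute). A small sketch (the toy arrays below are illustrative only):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

x = np.array([0, 0, 1, 1, 1, 0, 1])  # attribute values
y = np.array([0, 0, 1, 1, 0, 0, 1])  # class labels

# I(X; Y) as computed by scikit-learn (natural logarithm)
mi = mutual_info_score(x, y)

# H(Y) - H(Y | X), computed directly from counts
h_y = entropy(np.bincount(y) / y.size)
h_y_given_x = sum(
    np.mean(x == v) * entropy(np.bincount(y[x == v]) / np.sum(x == v))
    for v in np.unique(x)
)

print(mi, h_y - h_y_given_x)  # the two numbers coincide
```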

Is it possible to use a specific setting for mutual information in scikit-learn to accomplish this task?


Accepted answer by sgDysregulation

You can use scikit-learn's mutual_info_classif; here is an example:


from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import CountVectorizer

categories = ['talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)

X, Y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=10000,
                     stop_words='english')
X_vec = cv.fit_transform(X)

# get_feature_names() was removed in scikit-learn 1.2;
# use get_feature_names_out() instead
res = dict(zip(cv.get_feature_names_out(),
               mutual_info_classif(X_vec, Y, discrete_features=True)
               ))
print(res)

This will output a dictionary with each attribute (i.e. each item in the vocabulary) as a key and its information gain as the value.


Here is a sample of the output:


{'bible': 0.072327479595571439,
 'christ': 0.057293733680219089,
 'christian': 0.12862867565281702,
 'christians': 0.068511328611810071,
 'file': 0.048056478042481157,
 'god': 0.12252523919766867,
 'gov': 0.053547274485785577,
 'graphics': 0.13044709565039875,
 'jesus': 0.09245436105573257,
 'launch': 0.059882179387444862,
 'moon': 0.064977781072557236,
 'morality': 0.050235104394123153,
 'nasa': 0.11146392824624819,
 'orbit': 0.087254803670582998,
 'people': 0.068118370234354936,
 'prb': 0.049176995204404481,
 'religion': 0.067695617096125316,
 'shuttle': 0.053440976618359261,
 'space': 0.20115901737978983,
 'thanks': 0.060202010019767334}

Answer by Gaël Bernard

Here is my proposal for calculating the information gain using pandas:


from scipy.stats import entropy
import pandas as pd

def information_gain(members, split):
    '''
    Measures the reduction in entropy of `members` after the split.
    :param members: Pandas Series of class labels
    :param split: Pandas Series of attribute values (same index as members)
    :return: information gain of the split
    '''
    entropy_before = entropy(members.value_counts(normalize=True))
    split.name = 'split'
    members.name = 'members'
    grouped_distrib = members.groupby(split) \
                        .value_counts(normalize=True) \
                        .reset_index(name='count') \
                        .pivot_table(index='split', columns='members',
                                     values='count').fillna(0)
    # entropy of each group, kept aligned with the group labels
    entropy_after = pd.Series(entropy(grouped_distrib, axis=1),
                              index=grouped_distrib.index)
    # weight each group's entropy by the group's relative size
    weights = split.value_counts(normalize=True)
    return entropy_before - (entropy_after * weights).sum()

members = pd.Series(['yellow', 'yellow', 'green', 'green', 'blue'])
split = pd.Series([0, 0, 1, 1, 0])
print(information_gain(members, split))
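As a cross-check (not part of the original answer), the result should agree with scikit-learn's mutual_info_score, which computes the same quantity in nats for two discrete label arrays:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

members = pd.Series(['yellow', 'yellow', 'green', 'green', 'blue'])
split = pd.Series([0, 0, 1, 1, 0])

# mutual_info_score(x, y) = H(y) - H(y | x), i.e. the information gain above
print(mutual_info_score(split, members))
```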