Python: Information Gain calculation with Scikit-learn
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/46752650/
Information Gain calculation with Scikit-learn
Asked by Characeae
I am using Scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. The Information Gain is defined as H(Class) - H(Class | Attribute), where H is the entropy.
Using weka, this can be accomplished with the InfoGainAttribute. But I haven't found this measure in scikit-learn.
However, it has been suggested that the formula above for Information Gain is the same measure as mutual information. This also matches the definition in Wikipedia.
Is it possible to use a specific setting for mutual information in scikit-learn to accomplish this task?
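As background, the identity the question relies on can be checked numerically. The sketch below (using a made-up toy dataset) computes H(Class) - H(Class | Attribute) by hand and compares it with sklearn.metrics.mutual_info_score, which returns mutual information in nats:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

# toy data: one discrete attribute and a binary class label
attr = np.array([0, 0, 1, 1, 0, 1])
label = np.array([0, 0, 1, 1, 1, 1])

# H(Class)
h_class = entropy(np.bincount(label) / len(label))

# H(Class | Attribute) = sum over a of P(Attribute = a) * H(Class | Attribute = a)
h_cond = 0.0
for a in np.unique(attr):
    mask = attr == a
    h_cond += mask.mean() * entropy(np.bincount(label[mask]) / mask.sum())

info_gain = h_class - h_cond
print(info_gain, mutual_info_score(label, attr))  # the two values match
```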
Accepted answer by sgDysregulation
You can use scikit-learn's mutual_info_classif. Here is an example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import CountVectorizer

categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
X, Y = newsgroups_train.data, newsgroups_train.target

cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=10000,
                     stop_words='english')
X_vec = cv.fit_transform(X)

# note: get_feature_names() was removed in scikit-learn 1.2;
# use cv.get_feature_names_out() on newer versions
res = dict(zip(cv.get_feature_names(),
               mutual_info_classif(X_vec, Y, discrete_features=True)))
print(res)
This will output a dictionary with each attribute (i.e. each item in the vocabulary) as a key and its information gain as the value.
Here is a sample of the output:
{'bible': 0.072327479595571439,
'christ': 0.057293733680219089,
'christian': 0.12862867565281702,
'christians': 0.068511328611810071,
'file': 0.048056478042481157,
'god': 0.12252523919766867,
'gov': 0.053547274485785577,
'graphics': 0.13044709565039875,
'jesus': 0.09245436105573257,
'launch': 0.059882179387444862,
'moon': 0.064977781072557236,
'morality': 0.050235104394123153,
'nasa': 0.11146392824624819,
'orbit': 0.087254803670582998,
'people': 0.068118370234354936,
'prb': 0.049176995204404481,
'religion': 0.067695617096125316,
'shuttle': 0.053440976618359261,
'space': 0.20115901737978983,
'thanks': 0.060202010019767334}
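To go from this dictionary to a ranked feature list, the scores can be sorted by value. This is a small usage sketch, not part of the original answer; the dict below is a truncated stand-in for the full res:

```python
# a few {feature: information gain} entries taken from the output above
res = {'bible': 0.0723, 'god': 0.1225, 'space': 0.2012, 'thanks': 0.0602}

# rank features by information gain, highest first
top = sorted(res.items(), key=lambda kv: kv[1], reverse=True)
print(top[:2])  # → [('space', 0.2012), ('god', 0.1225)]
```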
Answer by Gaël Bernard
Here is my proposal for calculating information gain using pandas:
from scipy.stats import entropy
import pandas as pd

def information_gain(members, split):
    '''
    Measures the reduction in entropy after the split
    :param members: Pandas Series of the class labels
    :param split: Pandas Series of the attribute used to split
    :return: information gain of the split
    '''
    entropy_before = entropy(members.value_counts(normalize=True))
    split.name = 'split'
    members.name = 'members'
    grouped_distrib = members.groupby(split) \
                             .value_counts(normalize=True) \
                             .reset_index(name='count') \
                             .pivot_table(index='split', columns='members', values='count') \
                             .fillna(0)
    entropy_after = entropy(grouped_distrib, axis=1)
    entropy_after *= split.value_counts(sort=False, normalize=True)
    return entropy_before - entropy_after.sum()

members = pd.Series(['yellow', 'yellow', 'green', 'green', 'blue'])
split = pd.Series([0, 0, 1, 1, 0])
print(information_gain(members, split))
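Since information gain equals mutual information on the same data, the value printed by information_gain can be cross-checked against sklearn.metrics.mutual_info_score, which also measures in nats (matching scipy.stats.entropy's natural-log default). This is a sketch, assuming scikit-learn is available:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

members = pd.Series(['yellow', 'yellow', 'green', 'green', 'blue'])
split = pd.Series([0, 0, 1, 1, 0])

# mutual_info_score accepts arbitrary label arrays and returns
# mutual information in nats
print(mutual_info_score(members, split))
```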