Python 使用 Scikit-learn 计算信息增益

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/46752650/

Date: 2020-08-19 17:49:15  Source: igfitidea

Information Gain calculation with Scikit-learn

python  machine-learning  scikit-learn  text-classification  feature-selection

Asked by Characeae

I am using Scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. The Information Gain is defined as H(Class) - H(Class | Attribute), where H is the entropy.

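As a worked illustration of this definition (a minimal sketch with made-up data, not part of the original question):

```python
import numpy as np

def shannon_entropy(counts):
    """Shannon entropy (in nats) from a vector of counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-np.sum(p * np.log(p)))

# Toy data: one binary attribute (e.g. word present/absent) and a class.
attribute = np.array([1, 1, 0, 0, 0, 1])
label = np.array([1, 1, 0, 0, 1, 1])

# H(Class)
h_class = shannon_entropy(np.bincount(label))

# H(Class | Attribute): entropy within each attribute value,
# weighted by how often that value occurs
h_cond = sum(
    np.mean(attribute == v) * shannon_entropy(np.bincount(label[attribute == v]))
    for v in np.unique(attribute)
)

info_gain = h_class - h_cond  # H(Class) - H(Class | Attribute)
```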

In Weka, this can be accomplished with the InfoGainAttributeEval attribute evaluator. But I haven't found this measure in scikit-learn.


However, it has been suggested that the formula above for Information Gain is the same measure as mutual information. This also matches the definition in Wikipedia.

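This equivalence is easy to check numerically: scikit-learn's mutual_info_score computes I(X; Y) for two discrete variables, which should equal H(Class) - H(Class | Attribute). A small sketch (the toy arrays below are illustrative only):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

x = np.array([0, 0, 1, 1, 1, 0, 1])  # attribute values
y = np.array([0, 0, 1, 1, 0, 0, 1])  # class labels

# I(X; Y) as computed by scikit-learn (natural logarithm)
mi = mutual_info_score(x, y)

# H(Y) - H(Y | X), computed directly from counts
h_y = entropy(np.bincount(y) / y.size)
h_y_given_x = sum(
    np.mean(x == v) * entropy(np.bincount(y[x == v]) / np.sum(x == v))
    for v in np.unique(x)
)

print(mi, h_y - h_y_given_x)  # the two numbers coincide
```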

Is it possible to use a specific setting for mutual information in scikit-learn to accomplish this task?


Accepted answer by sgDysregulation

You can use scikit-learn's mutual_info_classif; here is an example:


from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import CountVectorizer

categories = ['talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)

X, Y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=10000,
                     stop_words='english')
X_vec = cv.fit_transform(X)

# get_feature_names() was removed in scikit-learn 1.2;
# use get_feature_names_out() instead
res = dict(zip(cv.get_feature_names_out(),
               mutual_info_classif(X_vec, Y, discrete_features=True)
               ))
print(res)

This will output a dictionary with each attribute (i.e. each item in the vocabulary) as a key and its information gain as the value.


Here is a sample of the output:


{'bible': 0.072327479595571439,
 'christ': 0.057293733680219089,
 'christian': 0.12862867565281702,
 'christians': 0.068511328611810071,
 'file': 0.048056478042481157,
 'god': 0.12252523919766867,
 'gov': 0.053547274485785577,
 'graphics': 0.13044709565039875,
 'jesus': 0.09245436105573257,
 'launch': 0.059882179387444862,
 'moon': 0.064977781072557236,
 'morality': 0.050235104394123153,
 'nasa': 0.11146392824624819,
 'orbit': 0.087254803670582998,
 'people': 0.068118370234354936,
 'prb': 0.049176995204404481,
 'religion': 0.067695617096125316,
 'shuttle': 0.053440976618359261,
 'space': 0.20115901737978983,
 'thanks': 0.060202010019767334}

Answer by Gaël Bernard

Here is my proposal for calculating the information gain using pandas:


from scipy.stats import entropy
import pandas as pd

def information_gain(members, split):
    '''
    Measures the reduction in entropy of `members` after the split.
    :param members: Pandas Series of class labels
    :param split: Pandas Series of attribute values (same index as members)
    :return: information gain of the split
    '''
    entropy_before = entropy(members.value_counts(normalize=True))
    split.name = 'split'
    members.name = 'members'
    grouped_distrib = members.groupby(split) \
                        .value_counts(normalize=True) \
                        .reset_index(name='count') \
                        .pivot_table(index='split', columns='members',
                                     values='count').fillna(0)
    # entropy of each group, kept aligned with the group labels
    entropy_after = pd.Series(entropy(grouped_distrib, axis=1),
                              index=grouped_distrib.index)
    # weight each group's entropy by the group's relative size
    weights = split.value_counts(normalize=True)
    return entropy_before - (entropy_after * weights).sum()

members = pd.Series(['yellow', 'yellow', 'green', 'green', 'blue'])
split = pd.Series([0, 0, 1, 1, 0])
print(information_gain(members, split))
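As a cross-check (not part of the original answer), the result should agree with scikit-learn's mutual_info_score, which computes the same quantity in nats for two discrete label arrays:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

members = pd.Series(['yellow', 'yellow', 'green', 'green', 'blue'])
split = pd.Series([0, 0, 1, 1, 0])

# mutual_info_score(x, y) = H(y) - H(y | x), i.e. the information gain above
print(mutual_info_score(split, members))
```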