Feature selection in pandas using MRMR

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/49232854/

Date: 2020-09-14 05:18:41 · Source: igfitidea

Feature Selection using MRMR

Tags: python, pandas, numpy

Asked by

I found two ways to implement MRMR for feature selection in python. The source of the paper that contains the method is:

https://www.dropbox.com/s/tr7wjpc2ik5xpxs/doc.pdf?dl=0

This is my code for the dataset.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

X, y = make_classification(n_samples=10000,
                           n_features=6,
                           n_informative=3,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Creating a dataFrame
df = pd.DataFrame({'Feature 1': X[:, 0],
                   'Feature 2': X[:, 1],
                   'Feature 3': X[:, 2],
                   'Feature 4': X[:, 3],
                   'Feature 5': X[:, 4],
                   'Feature 6': X[:, 5],
                   'Class': y})


y_train = df['Class']
X_train = df.drop('Class', axis=1)

Method 1: Applying MRMR using pymrmr

Contains MID and MIQ

It is published by the author. The link is https://github.com/fbrundu/pymrmr

import pymrmr

pymrmr.mRMR(df, 'MIQ',6)

['Feature 4', 'Feature 5', 'Feature 2', 'Feature 6', 'Feature 1', 'Feature 3']

or, running it the second way:

pymrmr.mRMR(df, 'MID',6)

['Feature 4', 'Feature 6', 'Feature 5', 'Feature 2', 'Feature 1', 'Feature 3']

On the above dataset, these two methods yield the two outputs shown. Another author on GitHub claims that you can use his version to apply the MRMR method. However, when I use it on the same dataset, I get a different result.

Method 2: Applying MRMR using MIFS

Github link

https://github.com/danielhomola/mifs

import mifs

for i in range(1,11):

    feat_selector = mifs.MutualInformationFeatureSelector('MRMR',k=i)
    feat_selector.fit(X_train, y_train)

    # call transform() on X to filter it down to selected features
    X_filtered = feat_selector.transform(X_train.values)

    #Create list of features
    feature_name = X_train.columns[feat_selector.ranking_]


    print(feature_name)

And if you run the above iteration for all the different values of i, there is no value for which the two methods yield the same feature selection output.

What seems to be the problem here?

Accepted answer by carrdelling

You'll probably need to contact the authors of the original paper and/or the owner of the GitHub repo for a definitive answer, but most likely the differences here come from the fact that you are comparing three different algorithms (despite the shared name).

Minimum Redundancy Maximum Relevance algorithms are actually a family of feature selection algorithms whose common objective is to select features that are mutually far away from each other while still having "high" correlation to the classification variable.

You can measure that objective using mutual information measures, but the specific method to follow (i.e. what to do with the computed scores? In what order? What other post-processing methods will be used? ...) is going to be different from one author to another - even in the paper they are actually giving you two different implementations, MIQ and MID.

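To make that distinction concrete, here is a minimal greedy MRMR sketch - an illustration, not the paper's reference implementation or either library's actual code. It estimates relevance with scikit-learn's mutual_info_classif, estimates redundancy as the mean pairwise mutual information against the already-selected (binned) features, and combines the two either by difference (MID) or by quotient (MIQ). The binning choice (10 equal-width bins) is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr(X, y, n_selected, scheme='MID'):
    """Greedily select n_selected columns of DataFrame X for target y."""
    features = list(X.columns)
    # Relevance: MI between each feature and the class variable.
    relevance = pd.Series(mutual_info_classif(X, y, random_state=0),
                          index=features)
    # Discretize features so pairwise feature-feature MI is well defined.
    binned = X.apply(lambda c: pd.cut(c, bins=10, labels=False))
    selected = [relevance.idxmax()]          # start with the most relevant
    while len(selected) < n_selected:
        scores = {}
        for f in (f for f in features if f not in selected):
            # Redundancy: mean MI against everything already selected.
            redundancy = np.mean([mutual_info_score(binned[f], binned[s])
                                  for s in selected])
            if scheme == 'MID':              # Mutual Information Difference
                scores[f] = relevance[f] - redundancy
            else:                            # 'MIQ': Mutual Information Quotient
                scores[f] = relevance[f] / max(redundancy, 1e-12)
        selected.append(max(scores, key=scores.get))
    return selected

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)
X = pd.DataFrame(X, columns=[f'Feature {i+1}' for i in range(6)])
print(mrmr(X, y, 3, 'MID'))
print(mrmr(X, y, 3, 'MIQ'))
```

Because the MI estimator, the discretization, and the tie-breaking all vary between implementations, two libraries that both advertise "MRMR" can legitimately return different rankings.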

So my suggestion would be to just choose the implementation you are more comfortable with (or, even better, the one that produces better results in your pipeline after conducting a proper validation), and report which specific source you chose and why.

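As a sketch of what such a validation could look like: cross-validate a simple classifier on each candidate feature subset and keep the winner. The subsets below are just the top-3 orderings that pymrmr reported in the question - substitute whatever subsets your chosen implementation returns, and a model closer to your real pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)
df = pd.DataFrame(X, columns=[f'Feature {i+1}' for i in range(6)])

# Candidate subsets: top-3 features from each pymrmr run in the question.
candidate_subsets = {
    'MIQ top-3': ['Feature 4', 'Feature 5', 'Feature 2'],
    'MID top-3': ['Feature 4', 'Feature 6', 'Feature 5'],
}

results = {}
for name, cols in candidate_subsets.items():
    # 5-fold CV accuracy of a simple stand-in classifier on this subset.
    results[name] = cross_val_score(LogisticRegression(max_iter=1000),
                                    df[cols], y, cv=5).mean()
    print(f'{name}: mean CV accuracy = {results[name]:.3f}')
```

Whichever subset scores better under your own metric and model is the one worth reporting, along with the implementation that produced it.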