Feature selection in pandas using MRMR

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/49232854/

Date: 2020-09-14 05:18:41 · Source: igfitidea

Feature Selection using MRMR

Tags: python, pandas, numpy

Asked by

I found two ways to implement MRMR for feature selection in python. The source of the paper that contains the method is:

https://www.dropbox.com/s/tr7wjpc2ik5xpxs/doc.pdf?dl=0

This is my code for the dataset.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

X, y = make_classification(n_samples=10000,
                           n_features=6,
                           n_informative=3,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Creating a dataFrame
df = pd.DataFrame({'Feature 1': X[:, 0],
                   'Feature 2': X[:, 1],
                   'Feature 3': X[:, 2],
                   'Feature 4': X[:, 3],
                   'Feature 5': X[:, 4],
                   'Feature 6': X[:, 5],
                   'Class': y})


y_train = df['Class']
X_train = df.drop('Class', axis=1)

Method 1: Applying MRMR using pymrmr

Contains MID and MIQ

It is published by the author. The link is https://github.com/fbrundu/pymrmr

import pymrmr

pymrmr.mRMR(df, 'MIQ',6)

['Feature 4', 'Feature 5', 'Feature 2', 'Feature 6', 'Feature 1', 'Feature 3']

or, running it the second way:

pymrmr.mRMR(df, 'MID',6)

['Feature 4', 'Feature 6', 'Feature 5', 'Feature 2', 'Feature 1', 'Feature 3']

On the above dataset, these two methods yield the two outputs shown. Another author on GitHub claims that you can use his version to apply the MRMR method. However, when I use it on the same dataset, I get a different result.

Method 2: Applying MRMR using MIFS

Github link

https://github.com/danielhomola/mifs

import mifs

for i in range(1,11):

    feat_selector = mifs.MutualInformationFeatureSelector('MRMR',k=i)
    feat_selector.fit(X_train, y_train)

    # call transform() on X to filter it down to selected features
    X_filtered = feat_selector.transform(X_train.values)

    #Create list of features
    feature_name = X_train.columns[feat_selector.ranking_]


    print(feature_name)

And if you run the above iteration for all the different values of i, there is no value for which the two methods yield the same feature selection output.

What seems to be the problem here?

Accepted answer by carrdelling

You'll probably need to contact the authors of the original paper and/or the owner of the GitHub repo for a definitive answer, but most likely the differences here come from the fact that you are comparing three different algorithms (despite the shared name).

Minimum Redundancy Maximum Relevance algorithms are actually a family of feature selection algorithms whose common objective is to select features that are mutually far away from each other while still having "high" correlation to the classification variable.

You can measure that objective using mutual information measures, but the specific method to follow (i.e. what to do with the computed scores? In what order? What other post-processing methods will be used? ...) is going to be different from one author to another - even in the paper they are actually giving you two different implementations, MIQ and MID.

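To make that distinction concrete, here is a minimal greedy MRMR sketch - an illustration, not the paper's reference implementation or either library's actual code. It estimates relevance with scikit-learn's mutual_info_classif, estimates redundancy as the mean pairwise mutual information against the already-selected (binned) features, and combines the two either by difference (MID) or by quotient (MIQ). The binning choice (10 equal-width bins) is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr(X, y, n_selected, scheme='MID'):
    """Greedily select n_selected columns of DataFrame X for target y."""
    features = list(X.columns)
    # Relevance: MI between each feature and the class variable.
    relevance = pd.Series(mutual_info_classif(X, y, random_state=0),
                          index=features)
    # Discretize features so pairwise feature-feature MI is well defined.
    binned = X.apply(lambda c: pd.cut(c, bins=10, labels=False))
    selected = [relevance.idxmax()]          # start with the most relevant
    while len(selected) < n_selected:
        scores = {}
        for f in (f for f in features if f not in selected):
            # Redundancy: mean MI against everything already selected.
            redundancy = np.mean([mutual_info_score(binned[f], binned[s])
                                  for s in selected])
            if scheme == 'MID':              # Mutual Information Difference
                scores[f] = relevance[f] - redundancy
            else:                            # 'MIQ': Mutual Information Quotient
                scores[f] = relevance[f] / max(redundancy, 1e-12)
        selected.append(max(scores, key=scores.get))
    return selected

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)
X = pd.DataFrame(X, columns=[f'Feature {i+1}' for i in range(6)])
print(mrmr(X, y, 3, 'MID'))
print(mrmr(X, y, 3, 'MIQ'))
```

Because the MI estimator, the discretization, and the tie-breaking all vary between implementations, two libraries that both advertise "MRMR" can legitimately return different rankings.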

So my suggestion would be to just choose the implementation you are more comfortable with (or, even better, the one that produces better results in your pipeline after conducting a proper validation), and report which specific source you chose and why.

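As a sketch of what such a validation could look like: cross-validate a simple classifier on each candidate feature subset and keep the winner. The subsets below are just the top-3 orderings that pymrmr reported in the question - substitute whatever subsets your chosen implementation returns, and a model closer to your real pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)
df = pd.DataFrame(X, columns=[f'Feature {i+1}' for i in range(6)])

# Candidate subsets: top-3 features from each pymrmr run in the question.
candidate_subsets = {
    'MIQ top-3': ['Feature 4', 'Feature 5', 'Feature 2'],
    'MID top-3': ['Feature 4', 'Feature 6', 'Feature 5'],
}

results = {}
for name, cols in candidate_subsets.items():
    # 5-fold CV accuracy of a simple stand-in classifier on this subset.
    results[name] = cross_val_score(LogisticRegression(max_iter=1000),
                                    df[cols], y, cv=5).mean()
    print(f'{name}: mean CV accuracy = {results[name]:.3f}')
```

Whichever subset scores better under your own metric and model is the one worth reporting, along with the implementation that produced it.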