pandas: Feature Selection using MRMR
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/49232854/
Feature Selection using MRMR
Question
I found two ways to implement MRMR for feature selection in Python. The paper that describes the method is:
https://www.dropbox.com/s/tr7wjpc2ik5xpxs/doc.pdf?dl=0
This is my code to generate the dataset.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

X, y = make_classification(n_samples=10000,
                           n_features=6,
                           n_informative=3,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Creating a DataFrame
df = pd.DataFrame({'Feature 1': X[:, 0],
                   'Feature 2': X[:, 1],
                   'Feature 3': X[:, 2],
                   'Feature 4': X[:, 3],
                   'Feature 5': X[:, 4],
                   'Feature 6': X[:, 5],
                   'Class': y})

y_train = df['Class']
X_train = df.drop('Class', axis=1)
Method 1: Applying MRMR using pymrmr
Contains MID and MIQ
It is published by the author. The link is https://github.com/fbrundu/pymrmr
import pymrmr
pymrmr.mRMR(df, 'MIQ',6)
['Feature 4', 'Feature 5', 'Feature 2', 'Feature 6', 'Feature 1', 'Feature 3']
Or, running it with the second scheme:
pymrmr.mRMR(df, 'MID',6)
['Feature 4', 'Feature 6', 'Feature 5', 'Feature 2', 'Feature 1', 'Feature 3']
On the above dataset, these two schemes yield the two outputs shown. Another author on GitHub claims that you can use his version to apply the MRMR method. However, when I use it on the same dataset, I get a different result.
Method 2: Applying MRMR using MIFS
GitHub link:
https://github.com/danielhomola/mifs
import mifs

for i in range(1, 11):
    feat_selector = mifs.MutualInformationFeatureSelector('MRMR', k=i)
    feat_selector.fit(X_train, y_train)

    # call transform() on X to filter it down to selected features
    X_filtered = feat_selector.transform(X_train.values)

    # Create list of features
    feature_name = X_train.columns[feat_selector.ranking_]
    print(feature_name)
If you run the above loop for all the different values of i, the two methods never yield the same feature selection output.
What seems to be the problem here?
Accepted answer by carrdelling
You'll probably need to contact the authors of the original paper and/or the owner of the GitHub repo for a definitive answer, but most likely the differences come from the fact that you are comparing three different algorithms (despite the shared name).
Minimum redundancy maximum relevance (mRMR) algorithms are actually a family of feature selection algorithms whose common objective is to select features that are mutually far away from each other (low redundancy) while still having "high" correlation with the classification variable (high relevance).
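For reference, this objective is commonly written as a greedy criterion in two forms, a difference and a quotient, usually called MID and MIQ. The notation below is an assumption for illustration (S is the set of features selected so far, c the class variable, I mutual information); the linked paper's exact notation may differ.

```latex
\text{MID:}\quad \max_{x_i \notin S}\;\Bigl[\, I(x_i; c) \;-\; \frac{1}{|S|}\sum_{x_j \in S} I(x_i; x_j) \Bigr]
\qquad
\text{MIQ:}\quad \max_{x_i \notin S}\;\Bigl[\, I(x_i; c) \;\Big/\; \frac{1}{|S|}\sum_{x_j \in S} I(x_i; x_j) \Bigr]
```

The first term rewards relevance to the class and the averaged term penalizes redundancy with features already chosen.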
You can measure that objective using mutual information, but the specific method to follow (i.e. what to do with the computed scores, in what order, which other post-processing steps to apply, ...) is going to differ from one author to another. Even the paper actually gives you two different implementations, MIQ and MID.
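To make the point concrete, here is a minimal sketch of a greedy mRMR ranking with both a difference (MID) and a quotient (MIQ) score. It is not the code of pymrmr or mifs: the function name greedy_mrmr, the use of scikit-learn's nearest-neighbour mutual information estimators, the small epsilon in the quotient and the reduced sample size are all assumptions made for illustration. Other implementations discretise the data or estimate mutual information differently, which is one reason their rankings can disagree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression


def greedy_mrmr(X, y, n_select, scheme="MID", random_state=0):
    """Greedy mRMR ranking (illustrative sketch, not pymrmr's or mifs's code)."""
    n_features = X.shape[1]
    # Relevance: estimated mutual information between each feature and the class.
    relevance = mutual_info_classif(X, y, random_state=random_state)
    # Start with the single most relevant feature.
    selected = [int(np.argmax(relevance))]
    # Running sum of I(candidate; s) over the features s selected so far.
    redundancy_sum = np.zeros(n_features)
    while len(selected) < n_select:
        last = selected[-1]
        redundancy_sum += mutual_info_regression(
            X, X[:, last], random_state=random_state
        )
        redundancy = redundancy_sum / len(selected)
        if scheme == "MID":
            score = relevance - redundancy
        elif scheme == "MIQ":
            # Epsilon (an assumption of this sketch) avoids division by zero.
            score = relevance / (redundancy + 1e-12)
        else:
            raise ValueError("scheme must be 'MID' or 'MIQ'")
        # Never pick an already-selected feature again.
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected


# Smaller sample than in the question, only to keep the sketch fast.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)

mid_order = greedy_mrmr(X, y, n_select=6, scheme="MID")
miq_order = greedy_mrmr(X, y, n_select=6, scheme="MIQ")
print("MID ranking (column indices):", mid_order)
print("MIQ ranking (column indices):", miq_order)
```

Both schemes start from the same most relevant feature, but from the second pick onward the difference and the quotient can rank candidates differently, so the two orderings need not match.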
So my suggestion would be to choose the implementation you are more comfortable with (or, even better, the one that produces better results in your pipeline after proper validation), and report which specific source you chose and why.
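One way to do that validation is to score each candidate feature subset with cross-validation and compare. The sketch below is only an assumption about how this could look: the two subsets are the top three features of the MIQ and MID rankings shown in the question (as zero-based column indices), while the logistic regression model, the 5-fold setup, the helper name mean_cv_accuracy and the reduced sample size are choices made here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Smaller sample than in the question, only to keep the sketch fast.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)

# Top-3 features from the two pymrmr rankings in the question,
# written as zero-based column indices (Feature 4 -> column 3, etc.).
candidate_subsets = {
    "pymrmr_MIQ_top3": [3, 4, 1],
    "pymrmr_MID_top3": [3, 5, 4],
}


def mean_cv_accuracy(columns):
    """Mean 5-fold accuracy of a simple classifier on the given columns."""
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, X[:, columns], y, cv=5, scoring="accuracy")
    return float(scores.mean())


results = {name: mean_cv_accuracy(cols) for name, cols in candidate_subsets.items()}
best = max(results, key=results.get)
print(results)
print("Best subset under this setup:", best)
```

For a stricter comparison, the feature selection itself should be run inside each training fold rather than once on the full dataset; otherwise the scores can be optimistic.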