pandas: Retain feature names after Scikit feature selection
Original source: http://stackoverflow.com/questions/39812885/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Retain feature names after Scikit Feature Selection
Asked by Zakery Alexander Fyke
After running a Variance Threshold from Scikit-Learn on a set of data, it removes a couple of features. I feel I'm doing something simple yet stupid, but I'd like to retain the names of the remaining features. The following code:
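import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def VarianceThreshold_selector(data):
    selector = VarianceThreshold(.5)
    selector.fit(data)
    # Wrapping the transformed array in a new DataFrame discards the column names
    selector = (pd.DataFrame(selector.transform(data)))
    return selector

x = VarianceThreshold_selector(data)
print(x)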
changes the following data (this is just a small subset of the rows):
Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
       0       3    1   22      1      0         0
       1       1    2   38      1      0         0
       1       3    2   26      0      0         0
into this (again just a small subset of the rows)
   0     1  2  3
0  3  22.0  1  0
1  1  38.0  1  0
2  3  26.0  0  0
Using the get_support method, I know that these are Pclass, Age, SibSp, and Parch, so I'd rather this return something more like:
   Pclass   Age  SibSp  Parch
0       3  22.0      1      0
1       1  38.0      1      0
2       3  26.0      0      0
Is there an easy way to do this? I'm very new to scikit-learn, so I'm probably just doing something silly.
Answered by Jarad
Would something like this help? If you pass it a pandas DataFrame, it gets the columns and uses get_support, as you mentioned, to index the column list by position and pull out only the column headers that met the variance threshold.
>>> df
   Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
0         0       3    1   22      1      0         0
1         1       1    2   38      1      0         0
2         1       3    2   26      0      0         0
>>> from sklearn.feature_selection import VarianceThreshold
>>> def variance_threshold_selector(data, threshold=0.5):
...     selector = VarianceThreshold(threshold)
...     selector.fit(data)
...     return data[data.columns[selector.get_support(indices=True)]]
...
>>> variance_threshold_selector(df, 0.5)
   Pclass  Age
0       3   22
1       1   38
2       3   26
>>> variance_threshold_selector(df, 0.9)
   Age
0   22
1   38
2   26
>>> variance_threshold_selector(df, 0.1)
   Survived  Pclass  Sex  Age  SibSp
0         0       3    1   22      1
1         1       1    2   38      1
2         1       3    2   26      0
Answered by pteehan
I came here looking for a way to get transform() or fit_transform() to return a data frame, but I suspect it's not supported.
However, you can subset the data a bit more cleanly like this:
data_transformed = data.loc[:, selector.get_support()]
Answered by Zakery Alexander Fyke
There are probably better ways to do this, but for those interested, here's how I did it:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def VarianceThreshold_selector(data):
    # Select the model. The threshold defaults to 0.0, i.e. only features
    # with the same value in all samples are removed.
    selector = VarianceThreshold(0)
    # Fit the model
    selector.fit(data)
    # Integer positions of the retained features
    features = selector.get_support(indices=True)
    # Names of the retained features (positional lookup on the column index)
    features = list(data.columns[features])
    # Format and return
    selector = pd.DataFrame(selector.transform(data))
    selector.columns = features
    return selector
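One detail worth noting: wrapping selector.transform(data) in a fresh DataFrame resets the row index to 0..n-1. A possible variant, sketched below on the assumption that the original index should be kept, passes both the retained column names and data.index to the constructor. The function name select_by_variance and the toy frame are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def select_by_variance(data, threshold=0.0):
    """Return the retained columns as a DataFrame with original names and index."""
    selector = VarianceThreshold(threshold)
    values = selector.fit_transform(data)
    kept = data.columns[selector.get_support()]  # boolean mask over the columns
    return pd.DataFrame(values, columns=kept, index=data.index)


# Toy frame with a non-default index (illustrative only).
toy = pd.DataFrame(
    {"Pclass": [3, 1, 3], "Age": [22, 38, 26], "Parch": [0, 0, 0]},
    index=["a", "b", "c"],
)
result = select_by_variance(toy)
print(result)
```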
Answered by Jan Janiszewski
I had some problems with Jarad's function, so I combined it with pteehan's solution, which I found more reliable. I also added NA replacement by default, since VarianceThreshold does not like NA values.
from sklearn.feature_selection import VarianceThreshold

def variance_threshold_select(df, thresh=0.0, na_replacement=-999):
    df1 = df.copy(deep=True)  # Make a deep copy of the dataframe
    selector = VarianceThreshold(thresh)
    # Fill NA values before fitting, as VarianceThreshold cannot deal with those
    selector.fit(df1.fillna(na_replacement))
    # Select the retained columns from the original dataframe (its NA values are left as they were)
    df2 = df.loc[:, selector.get_support(indices=False)]
    return df2
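A caveat on the sentinel fill: replacing NA with a value such as -999 changes the variance that gets measured, so an otherwise constant column with one missing entry can survive the filter. The sketch below illustrates this with pandas alone, using var(ddof=0) (population variance, which skips NA by default) as a stand-in for the selector. The frame, column names, and threshold are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "flat" is constant apart from one missing value.
frame = pd.DataFrame({
    "flat": [1.0, 1.0, np.nan, 1.0],
    "varied": [1.0, 5.0, 9.0, 2.0],
})

threshold = 0.5

# Variance measured after a sentinel fill: the -999 dominates "flat".
filled_var = frame.fillna(-999).var(ddof=0)

# Variance measured on the observed values only (NA skipped).
observed_var = frame.var(ddof=0)

kept_after_fill = list(filled_var[filled_var > threshold].index)
kept_observed = list(observed_var[observed_var > threshold].index)

print(kept_after_fill)
print(kept_observed)
```

If that side effect is unwanted, choosing a replacement close to the column's typical value, or measuring variance on the observed values only, may be a better fit.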
Answered by SaTa
You can use pandas for thresholding too:
data_new = data.loc[:, data.std(axis=0) > 0.75]
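Note that this filters on the standard deviation, and pandas uses the sample estimate (ddof=1) by default, whereas VarianceThreshold, as far as I know, compares the population variance against its threshold. If the goal is to mirror the selector, a sketch along these lines may be closer. The threshold value and the toy frame are assumptions for illustration.

```python
import pandas as pd

# Toy data built from the three sample rows in the question (illustrative only).
data = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22, 38, 26],
    "Parch": [0, 0, 0],
})

threshold = 0.5

# Population variance (ddof=0), kept only when strictly above the threshold.
variances = data.var(axis=0, ddof=0)
data_new = data.loc[:, variances > threshold]

print(list(data_new.columns))
```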