Original source: http://stackoverflow.com/questions/29294983/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to calculate correlation between all columns and remove highly correlated ones using python or pandas
Asked by jax
I have a huge data set, and before machine learning modeling it is always suggested to first remove highly correlated descriptors (columns). How can I calculate the column-wise correlation and remove columns above a threshold value, say all columns or descriptors with a correlation > 0.8? The headers should also be retained in the reduced data.
Example data set
GA PN PC MBP GR AP
0.033 6.652 6.681 0.194 0.874 3.177
0.034 9.039 6.224 0.194 1.137 3.4
0.035 10.936 10.304 1.015 0.911 4.9
0.022 10.11 9.603 1.374 0.848 4.566
0.035 2.963 17.156 0.599 0.823 9.406
0.033 10.872 10.244 1.015 0.574 4.871
0.035 21.694 22.389 1.015 0.859 9.259
0.035 10.936 10.304 1.015 0.911 4.5
Please help....
Answered by Jamie Bull
Firstly, I'd suggest using something like PCA as a dimensionality reduction method, but if you have to roll your own then your question is insufficiently constrained. Where two columns are correlated, which one do you want to remove? What if column A is correlated with column B, while column B is correlated with column C, but not column A?
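To make the A/B/C case concrete, here is a small sketch. The data is synthetic and constructed only for illustration (the vectors u and v, the 0.8 threshold and the "drop the later column" rule are all assumptions, not part of the question). A and B are correlated at about 0.87, B and C at about 0.87, but A and C only at 0.5.

```python
import numpy as np
import pandas as pd

# Two orthogonal, zero-mean vectors (hypothetical data for illustration).
u = np.array([1.0, -1.0, 1.0, -1.0])
v = np.array([1.0, 1.0, -1.0, -1.0])
c30, s30 = np.cos(np.pi / 6), np.sin(np.pi / 6)

# B sits "between" A and C: corr(A, B) = corr(B, C) ~ 0.87, corr(A, C) = 0.5
df = pd.DataFrame({"A": u, "B": c30 * u + s30 * v, "C": s30 * u + c30 * v})
corr = df.corr().abs()
threshold = 0.8

# One possible rule: drop any column that is highly correlated
# with a column appearing earlier in the frame.
cols = list(corr.columns)
drop_later = [c for i, c in enumerate(cols) if (corr.iloc[:i, i] > threshold).any()]

print(corr.round(2))
print("dropped by the 'later column' rule:", drop_later)
```

Under this rule both B and C are dropped, although dropping B alone would already leave A and C below the threshold. So the rule for choosing which column to remove changes the result.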
You can get a pairwise matrix of correlations by calling DataFrame.corr() (see the pandas docs), which might help you with developing your algorithm, but eventually you need to convert that into a list of columns to keep.
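As a minimal sketch of that first step, the sample data from the question can be loaded and passed to DataFrame.corr(). The variable names raw, df and corr are just placeholders chosen here.

```python
import io
import pandas as pd

# Sample data copied from the question (whitespace-separated).
raw = """GA PN PC MBP GR AP
0.033 6.652 6.681 0.194 0.874 3.177
0.034 9.039 6.224 0.194 1.137 3.4
0.035 10.936 10.304 1.015 0.911 4.9
0.022 10.11 9.603 1.374 0.848 4.566
0.035 2.963 17.156 0.599 0.823 9.406
0.033 10.872 10.244 1.015 0.574 4.871
0.035 21.694 22.389 1.015 0.859 9.259
0.035 10.936 10.304 1.015 0.911 4.5
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Pairwise Pearson correlations; the result keeps the column headers
# as both its index and its columns.
corr = df.corr()
print(corr.round(2))
```

The result is a square, symmetric DataFrame with ones on the diagonal, so only one triangle needs to be inspected when looking for highly correlated pairs.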
Answered by TomDobbs
Plug your features dataframe into this function and just set your correlation threshold. It'll auto drop columns, but will also give you a diagnostic of the columns it drops if you want to do it manually.
def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
          This lowers model complexity, and aids in generalizing the model.
    Inputs:
          df: features df (x)
          corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
    Output: df that only includes uncorrelated features
    '''

    # Creates Correlation Matrix and Instantiates
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterates through Correlation Matrix Table to find correlated columns
    for i in iters:
        # range(i + 1) so that neighbouring columns (i, i + 1) are compared too
        for j in range(i + 1):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values[0][0]
            if val >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col.values[0], "|", row.values[0], "|", round(val, 2))
                drop_cols.append(i)

    drops = sorted(set(drop_cols))[::-1]

    # Drops the correlated columns (reassign x so that every drop is kept)
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        x = x.drop(col, axis=1)

    return x
Answered by NISHA DAGA
Here is the approach which I have used -
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname]  # deleting the column from the dataset (in place)

    print(dataset)
    return dataset  # also return the reduced dataset for convenience
Hope this helps!
Answered by Mojgan Mazouchi
You can use the following for a given data frame df:
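import numpy as np

corr_matrix = df.corr().abs()
# Positions (row, column) of all entries above the threshold
high_corr_var = np.where(corr_matrix > 0.8)
# Keep each pair once: x < y selects the upper triangle and skips the diagonal
high_corr_var = [(corr_matrix.columns[x], corr_matrix.columns[y]) for x, y in zip(*high_corr_var) if x < y]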
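The snippet above only lists the correlated pairs. A possible next step, sketched below on made-up data, is to turn the pairs into a list of columns to drop. Dropping the second column of each pair is an assumption made here for illustration, not something the answer prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): b is an exact multiple of a, c is unrelated.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [2, 4, 6, 8, 10],
                   "c": [5, 1, 4, 2, 3]})

corr_matrix = df.corr().abs()
rows, cols = np.where(corr_matrix.to_numpy() > 0.8)
high_corr_pairs = [(corr_matrix.columns[x], corr_matrix.columns[y])
                   for x, y in zip(rows, cols) if x < y]

# Assumed policy: keep the first column of each pair, drop the second.
to_drop = sorted({second for _, second in high_corr_pairs})
reduced = df.drop(columns=to_drop)

print(high_corr_pairs)
print(list(reduced.columns))
```

Because drop() works with column names, the headers are retained in the reduced data.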
Answered by azuber
I took the liberty to modify TomDobbs' answer. The reported bug in the comments is removed now. Also, the new function filters out the negative correlation, too.
def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
          This lowers model complexity, and aids in generalizing the model.
    Inputs:
          df: features df (x)
          corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
    Output: df that only includes uncorrelated features
    '''

    # Creates Correlation Matrix and Instantiates
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterates through Correlation Matrix Table to find correlated columns
    for i in iters:
        # range(i + 1) so that neighbouring columns (i, i + 1) are compared too
        for j in range(i + 1):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values[0][0]
            # abs() so that strong negative correlations are caught as well
            if abs(val) >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col.values[0], "|", row.values[0], "|", round(val, 2))
                drop_cols.append(i)

    drops = sorted(set(drop_cols))[::-1]

    # Drops the correlated columns
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        x = x.drop(col, axis=1)

    return x
Answered by Cherry Wu
The method here worked well for me, only a few lines of code: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/
import numpy as np

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
# (the np.bool alias was removed from recent NumPy releases; the builtin bool works everywhere)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features
df.drop(to_drop, axis=1, inplace=True)
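If you would rather not modify df in place, the same upper-triangle idea can be wrapped in a small function. This is only a sketch: the function name drop_highly_correlated, the returned tuple and the toy data are assumptions made here, not part of the linked article.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    """Return (reduced copy of df, list of dropped column names)."""
    corr_matrix = df.corr().abs()
    # Boolean mask of the strict upper triangle (diagonal excluded)
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    upper = corr_matrix.where(mask)
    # A column is dropped if it is highly correlated with any earlier column
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy frame (hypothetical data): b is an exact multiple of a, c is unrelated.
data = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                     "b": [2, 4, 6, 8, 10],
                     "c": [5, 1, 4, 2, 3]})
reduced, dropped = drop_highly_correlated(data, threshold=0.95)
print(dropped, list(reduced.columns))
```

Note that with this approach the later column of a correlated pair is always the one removed, so the column order of the input decides which feature survives.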
Answered by Ryan
A small revision to the solution posted by user3025698 that resolves an issue where the correlation between the first two columns is not captured, and adds some data type checking.
import numpy as np
import pandas as pd

def filter_df_corr(inp_data, corr_val):
    '''
    Returns an array or dataframe (based on type(inp_data)) adjusted to drop
    columns with high correlation to one another. Takes second arg corr_val
    that defines the cutoff.
    ----------
    inp_data : np.array, pd.DataFrame
        Values to consider
    corr_val : float
        Value [0, 1] on which to base the correlation cutoff
    '''
    # Creates Correlation Matrix
    if isinstance(inp_data, np.ndarray):
        inp_data = pd.DataFrame(data=inp_data)
        array_flag = True
    else:
        array_flag = False
    corr_matrix = inp_data.corr()

    # Iterates through Correlation Matrix Table to find correlated columns
    drop_cols = []
    n_cols = len(corr_matrix.columns)

    for i in range(n_cols):
        for k in range(i+1, n_cols):
            val = corr_matrix.iloc[k, i]
            col = corr_matrix.columns[i]
            row = corr_matrix.index[k]
            if abs(val) >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col, "|", row, "|", round(val, 2))
                drop_cols.append(col)

    # Drops the correlated columns
    drop_cols = set(drop_cols)
    inp_data = inp_data.drop(columns=drop_cols)
    # Return same type as inp
    if array_flag:
        return inp_data.values
    else:
        return inp_data
Answered by iPhoneDeveloper
Another effective way I found to find correlations is to use pandas profiling. Once you have your dataframe ready, just use:
import pandas_profiling as pp
your_df_report= pp.ProfileReport(your_df)
your_df_report.to_file("your_df_report.html")
The HTML report gives you a detailed overview of your data frame, essentially an EDA, which also includes the correlation between different features. It will also suggest that you drop columns with high correlation.
Answered by Celso
This is the approach I used on my job last month. Perhaps it is not the best or quickest way, but it works fine. Here, df is my original Pandas dataframe:
dropvars = []
threshold = 0.95
df_corr = df.corr().stack().reset_index().rename(columns={'level_0': 'Var 1', 'level_1': 'Var 2', 0: 'Corr'})
df_corr = df_corr[(df_corr['Corr'].abs() >= threshold) & (df_corr['Var 1'] != df_corr['Var 2'])]
while len(df_corr) > 0:
    var = df_corr['Var 1'].iloc[0]
    df_corr = df_corr[((df_corr['Var 1'] != var) & (df_corr['Var 2'] != var))]
    dropvars.append(var)
df.drop(columns=dropvars, inplace=True)
My idea is as follows: first, I create a dataframe containing the columns Var 1, Var 2 and Corr, where I keep only those pairs of variables whose correlation is higher than or equal to my threshold (in absolute value). Then, I iteratively choose the first variable (Var 1 value) in this correlations dataframe, add it to the dropvars list, and remove all rows of the correlations dataframe where it appears, until the correlations dataframe is empty. In the end, I remove the columns in my dropvars list from my original dataframe.
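To see what the intermediate correlations dataframe looks like, here is a small sketch on made-up data (the toy frame and the names corr_long, high, pairs and first_to_drop are assumptions made here for illustration).

```python
import pandas as pd

# Toy frame (hypothetical data): b is an exact multiple of a, c is unrelated.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [2, 4, 6, 8, 10],
                   "c": [5, 1, 4, 2, 3]})
threshold = 0.95

# Long format: one row per (Var 1, Var 2) combination
corr_long = df.corr().stack().reset_index()
corr_long.columns = ["Var 1", "Var 2", "Corr"]

# Keep only highly correlated pairs, excluding each variable with itself
high = corr_long[(corr_long["Corr"].abs() >= threshold)
                 & (corr_long["Var 1"] != corr_long["Var 2"])]
pairs = list(zip(high["Var 1"], high["Var 2"]))

# The greedy loop starts from the first "Var 1" value in this table
first_to_drop = high["Var 1"].iloc[0]
print(pairs, first_to_drop)
```

Each correlated pair shows up twice, once in each direction. Because the loop always takes the first Var 1 value, it drops the earlier column of a pair ("a" here), whereas the upper-triangle approaches in the other answers drop the later one.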
Answered by b-shields
I had a similar question today and came across this post. This is what I ended up with.
def uncorrelated_features(df, threshold=0.7):
    """
    Returns a subset of df columns with Pearson correlations
    below threshold.
    """

    corr = df.corr().abs()
    keep = []
    for i in range(len(corr.iloc[:,0])):
        above = corr.iloc[:i,i]
        if len(keep) > 0: above = above[keep]
        if len(above[above < threshold]) == len(above):
            keep.append(corr.columns.values[i])

    return df[keep]