Python: Plot correlation matrix using pandas

Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/29432629/

Time: 2020-08-19 04:32:45  Source: igfitidea

Plot correlation matrix using pandas

python pandas matplotlib data-visualization information-visualization

Asked by Gaurav Singh

I have a data set with a huge number of features, so analysing the correlation matrix has become very difficult. I want to plot the correlation matrix that we get using the dataframe.corr() function from the pandas library. Is there any built-in function provided by the pandas library to plot this matrix?

Accepted answer by jrjc

You can use pyplot.matshow() from matplotlib:

import matplotlib.pyplot as plt

plt.matshow(dataframe.corr())
plt.show()


Edit:

In the comments was a request for how to change the axis tick labels. Here's a deluxe version that is drawn on a bigger figure size, has axis labels to match the dataframe, and a colorbar legend to interpret the color scale.

I'm including how to adjust the size and rotation of the labels, and I'm using a figure ratio that makes the colorbar and the main figure come out the same height.

f = plt.figure(figsize=(19, 15))
corr = df.corr()  # compute once so the tick labels always match the plotted matrix
plt.matshow(corr, fignum=f.number)
plt.xticks(range(corr.shape[1]), corr.columns, fontsize=14, rotation=45)
plt.yticks(range(corr.shape[0]), corr.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16)

[Image: correlation plot example]
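
For readers who want to try the labeled matshow approach without their own data, here is a minimal, self-contained sketch. The random DataFrame, the column names, the headless backend, and the output file name are all assumptions made for illustration. `select_dtypes` is used so that a non-numeric column neither breaks `corr()` nor misaligns the tick labels.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: five numeric features plus one text column
rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(50, 5), columns=list("abcde"))
df["label"] = "x"

# Keep numeric columns only, then compute the correlation matrix once
corr = df.select_dtypes(include="number").corr()

fig = plt.figure(figsize=(8, 6))
plt.matshow(corr, fignum=fig.number)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
fig.savefig("corr_matshow.png")  # hypothetical output file name
```

In an interactive session you would drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving.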

Answer by Apogentus

Try this function, which also displays variable names for the correlation matrix:

import matplotlib.pyplot as plt

def plot_corr(df, size=10):
    '''Plot a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: pandas DataFrame
        size: vertical and horizontal size of the plot'''

    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    # Label both axes with the column names of the correlation matrix
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)

Answer by rafaelvalle

Seaborn's heatmap version:

import seaborn as sns
corr = dataframe.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

Answer by phanindravarma

You can observe the relation between features either by drawing a heat map with seaborn or a scatter matrix with pandas.

Scatter Matrix:

pd.plotting.scatter_matrix(dataframe, alpha=0.3, figsize=(14, 8), diagonal='kde')  # was pd.scatter_matrix in older pandas

If you want to visualize each feature's skewness as well, use seaborn pairplots.

sns.pairplot(dataframe)

Sns Heatmap:

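import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

f, ax = plt.subplots(figsize=(10, 8))
corr = dataframe.corr()
# An all-False mask hides nothing; the builtin bool replaces np.bool, which newer NumPy no longer provides
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)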
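
The mask in the snippet above is all False, so every cell is drawn. Because a correlation matrix is symmetric, one common variation is to hide the diagonal and the upper triangle. The sketch below is one way to do that, not part of the original answer; the random data, the column names, and the fixed `vmin`/`vmax` are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rs = np.random.RandomState(0)
dataframe = pd.DataFrame(rs.rand(100, 6), columns=list("ABCDEF"))  # hypothetical data

corr = dataframe.corr()
# True marks cells to hide: the diagonal and everything above it
mask = np.triu(np.ones_like(corr, dtype=bool))

f, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr, mask=mask, cmap=sns.diverging_palette(220, 10, as_cmap=True),
            vmin=-1, vmax=1, square=True, ax=ax)
```

Fixing the color limits to -1 and 1 keeps the palette centered on zero correlation regardless of the data.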

The output will be a correlation map of the features; see the example below.

[Image: example correlation heatmap]

The correlation between grocery and detergents is high. Similarly:

Products with high correlation:
  1. Grocery and Detergents.
Products with medium correlation:
  1. Milk and Grocery
  2. Milk and Detergents_Paper
Products with low correlation:
  1. Milk and Deli
  2. Frozen and Fresh.
  3. Frozen and Deli.

From Pairplots: You can observe the same set of relations from pairplots or a scatter matrix. In addition, from these we can tell whether the data is normally distributed or not.

[Image: example pairplot]

Note: The graph above is drawn from the same data that was used to draw the heatmap.

Answer by joelostblom

If your main goal is to visualize the correlation matrix, rather than creating a plot per se, the convenient pandas styling options are a viable built-in solution:

import pandas as pd
import numpy as np

rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(10, 10))
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')
# 'RdBu_r' & 'BrBG' are other good diverging colormaps

Note that this needs to be in a backend that supports rendering HTML, such as the JupyterLab Notebook. (The automatic light text on dark backgrounds is from an existing PR and not the latest released version, pandas 0.23.)


Styling

You can easily limit the digit precision:

corr.style.background_gradient(cmap='coolwarm').set_precision(2)

Or get rid of the digits altogether if you prefer the matrix without annotations:

corr.style.background_gradient(cmap='coolwarm').set_properties(**{'font-size': '0pt'})

The styling documentation also includes instructions for more advanced styles, such as how to change the display of the cell the mouse pointer is hovering over. To save the output you could return the HTML by appending the render() method and then write it to a file (or just take a screenshot for less formal purposes).


Time comparison

In my testing, style.background_gradient() was 4x faster than plt.matshow() and 120x faster than sns.heatmap() with a 10x10 matrix. Unfortunately it doesn't scale as well as plt.matshow(): the two take about the same time for a 100x100 matrix, and plt.matshow() is 10x faster for a 1000x1000 matrix.


Saving

There are a few possible ways to save the stylized dataframe:

  • Return the HTML by appending the render() method and then write the output to a file.
  • Save as an .xlsx file with conditional formatting by appending the to_excel() method.
  • Combine with imgkit to save a bitmap.
  • Take a screenshot (for less formal purposes).


Update for pandas >= 0.24

By setting axis=None, it is now possible to compute the colors based on the entire matrix rather than per column or per row:

corr.style.background_gradient(cmap='coolwarm', axis=None)

Answer by Khandelwal-manik

You can use the imshow() method from matplotlib:

import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

plt.imshow(X.corr(), cmap=plt.cm.Reds, interpolation='nearest')
plt.colorbar()
tick_marks = range(len(X.columns))
plt.xticks(tick_marks, X.columns, rotation='vertical')
plt.yticks(tick_marks, X.columns)
plt.show()

Answer by Harvey

If your dataframe is df, you can simply use:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 10))
sns.heatmap(df.corr(), annot=True)

Answer by Shahriar Miraj

statsmodels graphics also gives a nice view of the correlation matrix:

import statsmodels.api as sm
import matplotlib.pyplot as plt

corr = dataframe.corr()
sm.graphics.plot_corr(corr, xnames=list(corr.columns))
plt.show()

Answer by Marcin

For completeness, the simplest solution I know with seaborn as of late 2019, if one is using Jupyter:

import seaborn as sns
sns.heatmap(dataframe.corr())

Answer by Nishant Tyagi

Along with the other methods, it is also good to have a pairplot, which gives a scatter plot for every pair of columns:

import pandas as pd
import numpy as np
import seaborn as sns
rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(10, 10))
sns.pairplot(df)