pandas: Plotting a dataframe with seaborn.pairplot() in multiple colors?

Notice: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/54317168/

Time: 2020-09-14 06:17:22  Source: igfitidea

Plotting a dataframe with seaborn.pairplot() in multiple colors?

python, pandas, seaborn

Asked by Philipp

I want to create a plot similar to this image in order to compare multiple dimensions of my dataset. The dataset is not a preset one. I managed to display the data correctly in one color, but I want one color for y=0 and another for y=1 to compare the points, just like in the image of the iris dataset. As soon as I include hue='y' in the sns.pairplot call, the code no longer runs to completion.

Also, I don't understand the console output. What's the issue?

import seaborn as sns; sns.set(style="ticks", color_codes=True)
import pandas as pd

dataframe = pd.DataFrame(dict(F1=X[:, 0], F2=X[:, 1], F3=X[:, 2], F4=X[:, 3], y=y))

print(dataframe)

g = sns.pairplot(dataframe, hue='y')

This is the output for the dataframe. It looks alright to me:

            F1        F2        F3        F4    y
0     3.173182  2.849991  2.497907  2.851715  0.0
1     2.468625 -0.216985  0.275206  1.232518  1.0
2     2.398419  2.258931  2.255533  4.895872  0.0
3     1.379937  1.041677  1.165911  1.992650  1.0
4     2.489665  2.269068  4.129961  2.218203  0.0
5     4.140160  2.809088  2.973027  3.553128  0.0
6     2.997969  1.701299  2.978875  1.946793  0.0
7     3.864436  3.554276  3.568455  2.839489  0.0
8    -0.000605  1.376971  1.128350  1.293777  1.0
9     2.398057  1.180861  2.400801  2.264726  1.0
10    0.997385 -0.560205  0.954628  2.788858  1.0

...        ...       ...       ...       ...  ...

3990  3.334553  4.576306  2.470476  3.032781  0.0
3991  1.465784  2.304793  1.267303 -0.030802  1.0
3992  0.505905 -0.280769 -1.223464  1.077305  1.0
3993  2.581596  3.924394  3.878303  2.579366  0.0
3994  4.362067  2.247818  2.948595  1.906314  0.0
3995  2.310546  0.006672  2.382227  1.940343  1.0
3996 -0.944635  1.387136  0.604135  2.421478  1.0
3997  1.290999  1.485965  0.262792  0.899340  1.0
3998  0.864532  1.759607  1.118346  1.038935  1.0
3999  1.819110  2.218838  3.927945  2.593009  0.0

[4000 rows x 5 columns]

But eventually I receive this error:

Traceback (most recent call last):
  File "/Users//PycharmProjects//V3_multiTops/vergleich.py", line 131, in <module>
    g = sns.pairplot(dataframe, hue='y')
  File "/Users//PycharmProjects//venv/lib/python3.7/site-packages/seaborn/axisgrid.py", line 2111, in pairplot
    grid.map_diag(kdeplot, **diag_kws)
  File "/Users//PycharmProjects//venv/lib/python3.7/site-packages/seaborn/axisgrid.py", line 1399, in map_diag
    func(data_k, label=label_k, color=color, **kwargs)
  File "/Users//PycharmProjects//venv/lib/python3.7/site-packages/seaborn/distributions.py", line 691, in kdeplot
    cumulative=cumulative, **kwargs)
  File "/Users//PycharmProjects//venv/lib/python3.7/site-packages/seaborn/distributions.py", line 294, in _univariate_kdeplot
    x, y = _scipy_univariate_kde(data, bw, gridsize, cut, clip)
  File "/Users//PycharmProjects//venv/lib/python3.7/site-packages/seaborn/distributions.py", line 366, in _scipy_univariate_kde
    kde = stats.gaussian_kde(data, bw_method=bw)
  File "/Users//PycharmProjects//venv/lib/python3.7/site-packages/scipy/stats/kde.py", line 172, in __init__
    self.set_bandwidth(bw_method=bw_method)
  File "/Users//PycharmProjects//venv/lib/python3.7/site-packages/scipy/stats/kde.py", line 499, in set_bandwidth
    self._compute_covariance()
  File "/Users//PycharmProjects//venv/lib/python3.7/site-packages/scipy/stats/kde.py", line 510, in _compute_covariance
    self._data_inv_cov = linalg.inv(self._data_covariance)
  File "/Users//PycharmProjects//venv/lib/python3.7/site-packages/scipy/linalg/basic.py", line 975, in inv
    raise LinAlgError("singular matrix")
numpy.linalg.linalg.LinAlgError: singular matrix

I think I am doing something wrong with sns.pairplot() that I don't understand yet. Could you explain it to me, please?

Answered by ImportanceOfBeingErnest

The problem seems to be that the "y" column itself is numeric. It is therefore included in the PairGrid as a column/row, which seems undesired anyway. To select the variables that take part in the grid, use pairplot's vars keyword.

 sns.pairplot(df, vars=df.columns[:-1], hue="y")
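
To illustrate why a numeric "y" in the grid most likely ends in the `singular matrix` error from the traceback, here is a minimal sketch with made-up data (the names `df`, `y_var` and `f1_var` are only for illustration and do not come from the question). On the diagonal, each variable gets one density estimate per hue level. For the "y" panel, every subset then holds a single repeated value, so its variance is zero and the covariance that scipy tries to invert is singular.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the asker's data: one feature plus a binary label.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "F1": rng.normal(size=200),
    "y": np.tile([0.0, 1.0], 100),  # 100 rows per class
})

# Variance of each column within each hue level (grouped by the label).
f1_var = df["F1"].groupby(df["y"]).var()
y_var = df["y"].groupby(df["y"]).var()

print(f1_var)  # positive in both groups, so a density can be estimated
print(y_var)   # zero in both groups: "y" is constant within its own group
```

A feature column keeps a positive spread inside each class, while "y" split by itself does not, which is why leaving "y" out of `vars` avoids the error.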

The reason the iris dataset works without specifying vars is that its hue column is not numeric. Non-numeric columns are not included in the grid.
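
Building on that observation, the variables can also be picked by name rather than by position, or the labels can be stored as strings so the hue column is no longer numeric. The following is only a sketch with made-up data; the helper `feature_columns` and the label texts "class 0" / "class 1" are hypothetical and not part of seaborn or the original question.

```python
import numpy as np
import pandas as pd

def feature_columns(df, hue):
    """Return the numeric columns of df, leaving out the hue column."""
    numeric = df.select_dtypes(include="number").columns
    return [c for c in numeric if c != hue]

# Made-up data shaped like the question's dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 4)),
                  columns=[f"F{i+1}" for i in range(4)])
df["y"] = np.tile([0.0, 1.0], 150)

# Option 1: choose the grid variables explicitly, independent of column order.
plot_vars = feature_columns(df, hue="y")
print(plot_vars)  # ['F1', 'F2', 'F3', 'F4']

# Option 2: store the labels as strings so "y" is not a numeric column.
df_labeled = df.assign(y=df["y"].map({0.0: "class 0", 1.0: "class 1"}))
print(df_labeled.dtypes)
```

The list from option 1 can then be passed on as `sns.pairplot(df, vars=plot_vars, hue="y")`. This does not rely on "y" being the last column, unlike `df.columns[:-1]`.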

Complete example:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(300, 4), columns=[f"F{i+1}" for i in range(4)])
df["y"] = np.random.choice([1., 0.], size=len(df))

sns.pairplot(df, vars=df.columns[:-1], hue="y")
plt.show()
