Python Seaborn pairplot and NaN values

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/31493446/

Time: 2020-08-19 10:06:00  Source: igfitidea

Seaborn pairplot and NaN values

python pandas matplotlib seaborn

Asked by Diziet Asahi

I'm trying to understand why this fails, even though the documentation says:

dropna : boolean, optional
    Drop missing values from the data before plotting.

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
    'a': np.random.normal(size=(100,)),
    'b': np.random.lognormal(size=(100,)),
    'c': np.random.exponential(size=(100,))})
sns.pairplot(a) # this works as expected
# snip
b = a.copy()
b.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(b) # this fails with the error
                # "AttributeError: max must be larger than min in range parameter."
                # raised in histogram(a, bins, range, normed, weights, density)
sns.pairplot(b, dropna=True) # same error as above

Accepted answer by Diziet Asahi

I'm going to post an answer to my own question. It doesn't solve the problem in general, but at least it solves my problem.

The problem arises when trying to draw histograms. However, it looks like the KDEs are much more robust to missing data. Therefore, this works, despite the NaN in the middle of the dataframe:

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
    'a': np.random.normal(size=(100,)),
    'b': np.random.lognormal(size=(100,)),
    'c': np.random.exponential(size=(100,))})
a.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(a, diag_kind='kde')
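
A plausible explanation for why the histograms are affected, which the answer above does not confirm: a histogram's bin range is typically derived from the minimum and maximum of the data, and ordinary NumPy reductions return NaN as soon as one value is missing. The sketch below uses NumPy only, and the sample array is made up for illustration.

```python
import numpy as np

values = np.array([0.5, 1.2, np.nan, 3.4])

# Ordinary reductions propagate NaN, so a bin range built from them is unusable.
plain_min, plain_max = values.min(), values.max()

# Filtering out the NaN first recovers a valid range for the histogram.
finite = values[~np.isnan(values)]
counts, edges = np.histogram(finite, bins=3)
```

Filtering before binning is one option; switching the diagonal to a KDE, as above, is another.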

Answer by Suresh2692

When you use the data directly, i.e.

sns.pairplot(b) #Same as sns.pairplot(b, x_vars=['a','b','c'] , y_vars=['a','b','c'],dropna=True)

you are plotting against all the columns in the DataFrame, so make sure the number of rows is the same in all columns.

sns.pairplot(b, x_vars=['a','c'] , y_vars=['a','b','c'],dropna=True)

In this case it works fine, but the graph will differ slightly because the NaN value is removed.

So, if you want to plot the whole data, then:

  • either the null values must be replaced using "fillna()",

  • or the whole row containing 'nan values' must be dropped

    b = b.drop(b.index[5])
    sns.pairplot(b)
    

    pairplot for dropped values
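
Both options in the list above can be sketched with pandas alone. The snippet rebuilds a DataFrame like `b` from the question; the seeded generator and the choice of the column mean as the fill value are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
b = pd.DataFrame({
    'a': rng.normal(size=100),
    'b': rng.lognormal(size=100),
    'c': rng.exponential(size=100)})
b.iloc[5, 2] = np.nan  # one missing value in column 'c'

# Option 1: replace the missing value (the column mean is an arbitrary choice).
filled = b.fillna(b.mean())

# Option 2: drop every row that contains at least one NaN.
dropped = b.dropna()
```

Either `filled` or `dropped` can then be passed to `sns.pairplot`. Unlike `b.drop(b.index[5])`, `dropna()` does not require knowing in advance which rows hold the missing values.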

Answer by Tom Dowling

This is something of a necro, but as I cracked the answer to this today I thought it might be worth sharing. I could not find this solution elsewhere on the web. If the Seaborn dropna keyword has not worked for your data and you don't want to drop all rows that have any NaN, this should work for you.

All of this was done with Seaborn 0.9 and pandas 0.23.4, assuming a data frame (df) with j rows (samples) and n columns (attributes).

The solution to Seaborn being unable to cope with arrays that contain NaNs, particularly when you want to retain a row because it holds other useful data, is to use a function that intercepts the pair-wise columns before they are passed to the PairGrid for plotting.

Functions can be passed to the grid sectors to carry out an operation per subplot. A simple example of this would be to calculate and annotate RMSE for a column pair (subplot) onto each plot:

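import math
from matplotlib import pyplot as plt
from sklearn import metrics as skm  # assumption: skm refers to sklearn.metrics

def rmse(x, y, **kwargs):
    # root-mean-square error between the two columns of this subplot
    value = math.sqrt(skm.mean_squared_error(x, y))

    label = 'RMSE = ' + str(round(value, 2))
    ax = plt.gca()
    ax.annotate(label, xy=(0.1, 0.95), size=20, xycoords=ax.transAxes)

# grid is an existing sns.PairGrid
grid = grid.map_upper(rmse)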
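
The RMSE helper above assumes both columns are complete. A minimal sketch of a NaN-tolerant variant follows, using NumPy only; `pairwise_rmse` is a hypothetical name introduced here, and it keeps only the positions where both inputs are present before computing the error.

```python
import numpy as np

def pairwise_rmse(x, y):
    """RMSE over the positions where neither x nor y is NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))  # True where both values exist
    return float(np.sqrt(np.mean((x[keep] - y[keep]) ** 2)))

# Only the first two positions are complete in both inputs.
value = pairwise_rmse([1.0, 2.0, np.nan, 4.0], [1.0, 3.0, 5.0, np.nan])
```

The same mask-then-compute pattern applies to any per-subplot statistic.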

Therefore, by writing a plotting function for Seaborn that drops NaNs per column pair as the grid.map_* methods iterate over the main data frame, we can minimize data loss per sample (row). One NaN in a row then no longer causes the entire row to be lost for all subplots; only the subplots for the column pairs that involve the missing value exclude that row.
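
To make the difference concrete, here is a small sketch that counts how many rows survive a row-wise drop versus a drop per column pair. The data is made up for illustration, and only pandas and NumPy are used.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical data: each column is missing a value in a different row.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0, 5.0],
    'b': [2.0, 1.0, np.nan, 4.0, 6.0],
    'c': [5.0, 3.0, 2.0, np.nan, 1.0]})

# Row-wise drop: a row is lost if any of its columns is NaN.
rows_after_global_drop = len(df.dropna())

# Pairwise drop: a row is lost only for the pairs that involve the missing column.
rows_per_pair = {
    (x, y): len(df[[x, y]].dropna())
    for x, y in combinations(df.columns, 2)}
```

Here the row-wise drop keeps 2 of 5 rows, while every column pair keeps 3.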

The following function carries out the pairwise NaN drop and then plots the two remaining series on the current axes with matplotlib's scatter plot:

df = [YOUR DF HERE]

def col_nan_scatter(x, y, **kwargs):
    # keep only the rows where both columns of this pair are present
    pair = pd.DataFrame({'x': x[:], 'y': y[:]}).dropna()
    plt.scatter(pair['x'], pair['y'])

cols = df.columns
grid = sns.PairGrid(data=df, vars=cols, height=4)
grid = grid.map_upper(col_nan_scatter)

The same can be done with seaborn plotting functions (for example, using just the x value):

def col_nan_kde_histo(x, **kwargs):
    # drop missing values from this single column before estimating the density
    col = pd.DataFrame({'x': x[:]}).dropna()
    sns.kdeplot(col['x'])

cols = df.columns
grid = sns.PairGrid(data=df, vars=cols, height=4)
grid = grid.map_upper(col_nan_scatter)
# a function that takes only x belongs on the diagonal, not on the upper triangle
grid = grid.map_diag(col_nan_kde_histo)