Python Seaborn pairplot and NaN values

Question

Asked by Diziet Asahi

I'm trying to understand why this fails, even though the documentation says:

dropna : boolean, optional Drop missing values from the data before plotting.

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
    'a': np.random.normal(size=(100,)),
    'b': np.random.lognormal(size=(100,)),
    'c': np.random.exponential(size=(100,))})
sns.pairplot(a) # this works as expected
# snip
b = a.copy()
b.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(b) # this fails with the error
                # "AttributeError: max must be larger than min in range parameter."
                # raised in histogram(a, bins, range, normed, weights, density)
sns.pairplot(b, dropna=True) # same error as above

Answer 1

Accepted answer by Diziet Asahi

I'm going to post an answer to my own question. It doesn't solve the problem in general, but at least it solves my problem.

The problem arises when trying to draw histograms. However, it looks like the KDEs are much more robust to missing data. Therefore, this works, despite the NaN in the middle of the dataframe:

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
    'a': np.random.normal(size=(100,)),
    'b': np.random.lognormal(size=(100,)),
    'c': np.random.exponential(size=(100,))})
a.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(a, diag_kind='kde')

Answer 2

Answered by Suresh2692

When you use the data directly, i.e.

sns.pairplot(b) #Same as sns.pairplot(b, x_vars=['a','b','c'] , y_vars=['a','b','c'],dropna=True)

you are plotting against all the columns in the DataFrame, so make sure the number of rows is the same in all columns.

sns.pairplot(b, x_vars=['a','c'] , y_vars=['a','b','c'],dropna=True)

In this case it works fine, but the graph will differ slightly because the NaN value is removed.

So, if you want to plot with the whole data, then:

either the null values must be replaced using "fillna()",
or the whole row containing NaN values must be dropped:
```
b = b.drop(b.index[5])
sns.pairplot(b)
```
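
As a minimal sketch of both options, the snippet below uses a small made-up frame (the values are illustrative assumptions, not the data from the question) and only pandas calls. Filling with the column median is just one possible choice of fill value, not something the answer prescribes. Plotting is left out so the effect on the data is easy to check:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one NaN in column 'c' (values are made up)
b = pd.DataFrame({
    'a': [0.1, 0.2, 0.3, 0.4],
    'b': [1.0, 2.0, 3.0, 4.0],
    'c': [5.0, np.nan, 7.0, 9.0],
})

# Option 1: replace the missing value; the median is an arbitrary choice here
filled = b.fillna(b.median())

# Option 2: drop every row that contains at least one NaN
dropped = b.dropna()

print(filled.shape, dropped.shape)
```

Either result no longer contains NaN and can then be passed to sns.pairplot. Filling keeps all rows but invents a value, while dropping keeps only observed values but loses the row in every column.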

Answer 3

Answered by Tom Dowling

This is something of a necro, but as I cracked the answer to this today I thought it might be worth sharing. I could not find this solution elsewhere on the web. If the Seaborn dropna keyword has not worked for your data and you don't want to drop all rows that have any NaN, this should work for you.

All of this is in Seaborn 0.9 with pandas 0.23.4, assuming a data frame (df) with j rows (samples) that have n columns (attributes).

The issue is that Seaborn cannot cope with arrays containing NaN being passed to it, which matters most when you want to retain a row because it holds other useful data. The solution is based on using a function to intercept the pair-wise columns before they are passed to the PairGrid for plotting.

Functions can be passed to the grid sectors to carry out an operation per subplot. A simple example of this would be to calculate and annotate RMSE for a column pair (subplot) onto each plot:

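import math

import matplotlib.pyplot as plt
import sklearn.metrics as skm  # skm is assumed to be sklearn.metrics

def rmse(x, y, **kwargs):
    # Root-mean-square error between the two columns of this subplot
    value = math.sqrt(skm.mean_squared_error(x, y))

    label = 'RMSE = ' + str(round(value, 2))
    ax = plt.gca()
    ax.annotate(label, xy=(0.1, 0.95), size=20, xycoords=ax.transAxes)

# grid is an existing sns.PairGrid
grid = grid.map_upper(rmse)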

Therefore, we can write a function that Seaborn can take as a data plotting argument and that drops NaNs on a column-pair basis as the grid.map_ methods iterate over the main data frame. This minimizes data loss per sample (row), because one NaN in a row will not cause the entire row to be lost for all subplots. Only the subplots whose column pair includes the NaN will exclude that row.

The following function carries out the pairwise NaN drop and then plots the two resulting series on the axes with matplotlib's scatter plot:

df = [YOUR DF HERE]

def col_nan_scatter(x, y, **kwargs):
    # Keep only the rows where both columns of this pair are present
    pair = pd.DataFrame({'x': x[:], 'y': y[:]})
    pair = pair.dropna()
    x = pair['x']
    y = pair['y']
    ax = plt.gca()
    ax.scatter(x, y)

cols = df.columns
grid = sns.PairGrid(data=df, vars=cols, height=4)
grid = grid.map_upper(col_nan_scatter)

The same can be done with Seaborn plotting functions (for example, using just the x value):

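def col_nan_kde_histo(x, **kwargs):
    # Drop NaNs from the single column before estimating the density
    col = pd.DataFrame({'x': x[:]})
    col = col.dropna()
    x = col['x']
    plt.gca()
    sns.kdeplot(x)

cols = df.columns
grid = sns.PairGrid(data=df, vars=cols, height=4)
grid = grid.map_upper(col_nan_scatter)
# A single-variable function belongs on the diagonal, so use map_diag here
grid = grid.map_diag(col_nan_kde_histo)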
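
To see how much data the pairwise drop can preserve compared with a row-wise dropna(), here is a rough sketch that only counts rows and does no plotting. The helper name pairwise_dropna and the frame values are hypothetical, made up for illustration. The helper mirrors the filtering step inside col_nan_scatter above:

```python
import itertools

import numpy as np
import pandas as pd

def pairwise_dropna(x, y):
    """Return x and y restricted to rows where both values are present."""
    pair = pd.DataFrame({'x': x, 'y': y}).dropna()
    return pair['x'], pair['y']

# Illustrative frame (made-up values): one NaN in 'a' and one in 'c',
# located on different rows
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
    'c': [1.5, 2.5, 3.5, np.nan, 5.5],
})

# Row-wise drop: any NaN removes the row from every subplot
rows_after_global_drop = len(df.dropna())

# Pairwise drop: each column pair keeps every row that is complete for that pair
rows_per_pair = {
    (cx, cy): len(pairwise_dropna(df[cx], df[cy])[0])
    for cx, cy in itertools.combinations(df.columns, 2)
}

print(rows_after_global_drop)
print(rows_per_pair)
```

In this made-up frame the row-wise drop leaves 3 of 5 rows for every subplot, while the pairs ('a', 'b') and ('b', 'c') each keep 4 rows. Only the ('a', 'c') pair, which is affected by both NaNs, falls back to 3.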
