Python: how to handle missing values when plotting with seaborn?

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/32902832/

Date: 2020-08-19 12:27:18  Source: igfitidea

What to do with missing values when plotting with seaborn?

python python-2.7 pandas data-analysis seaborn

Asked by datavinci

I replaced the missing values with NaN using the following lambda function:

data = data.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

where data is the dataframe I am working on.

Afterwards, I tried to plot one of its attributes, alcconsumption, using seaborn.distplot as follows:

seaborn.distplot(data['alcconsumption'],hist=True,bins=100)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

It's giving me the following error:

AttributeError: max must be larger than min in range parameter.

Answered by vestland

I would definitely handle missing values before you plot your data. Whether or not to use dropna() would depend entirely on the nature of your dataset. Is alcconsumption a single series or part of a dataframe? In the latter case, using dropna() on the dataframe would remove the corresponding rows in other columns as well. Are the missing values few or many? Are they spread around in your series, or do they tend to occur in groups? Is there perhaps reason to believe that there is a trend in your dataset?

If the missing values are few and scattered, you could easily use dropna(). In other cases I would choose to fill missing values with the previously observed value (1), or even with interpolated values (2). But be careful! Replacing a lot of data with filled or interpolated observations could seriously distort your dataset and lead to very wrong conclusions.
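
As a minimal sketch of the point about dropna() (the small frame and its values are made up for illustration), dropping NaNs from a single column leaves the other columns alone, while dropping on the whole dataframe removes the complete rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'A' has two missing values, 'B' has none.
toy = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan, 5.0],
                    'B': [10, 20, 30, 40, 50]})

# Dropping NaNs from the single series only shortens that series.
a_only = toy['A'].dropna()

# Dropping NaNs on the dataframe removes whole rows, so 'B' loses values too.
rows_dropped = toy.dropna()

print(len(a_only), len(toy['B']), len(rows_dropped))
```

Which of the two is appropriate depends on whether the other columns are needed for the same plot or analysis.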

Here are some examples that use your snippet...

seaborn.distplot(data['alcconsumption'],hist=True,bins=100)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

... on a synthetic dataset:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def sample(rows, names):
    ''' Function to create data sample with random returns

    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of names to represent assets

    Example
    =======

    >>> sample(rows = 2, names = ['A', 'B'])

                  A       B
    2017-01-01  0.0027  0.0075
    2017-01-02 -0.0050 -0.0024
    '''
    listVars= names
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars) 
    df_temp = df_temp.set_index(rng)


    return df_temp

df = sample(rows = 15, names = ['A', 'B'])
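df['A'] = df['A'].astype(float)  # NaN needs a float column
df.loc[df.index[8:12], 'A'] = np.nan  # avoid chained assignment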
df

Output:

            A   B
2017-01-01 -63.0  10
2017-01-02  49.0  79
2017-01-03 -55.0  59
2017-01-04  89.0  34
2017-01-05 -13.0 -80
2017-01-06  36.0  90
2017-01-07 -41.0  86
2017-01-08  10.0 -81
2017-01-09   NaN -61
2017-01-10   NaN -80
2017-01-11   NaN -39
2017-01-12   NaN  24
2017-01-13 -73.0 -25
2017-01-14 -40.0  86
2017-01-15  97.0  60

(1) Using forward fill with pandas.DataFrame.fillna(method='ffill')

ffill will "fill values forward", meaning it will replace each NaN with the last observed value above it.

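# Keep df intact so that it can be reused in example (2)
# (recent pandas versions prefer the equivalent df['A'].ffill())
df_ffill = df['A'].fillna(axis=0, method='ffill')
sns.distplot(df_ffill, hist=True,bins=5)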
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

(2) Using interpolation with pandas.DataFrame.interpolate()

interpolate() fills the NaNs with values computed by the chosen method. With method='time', the interpolation works on daily and higher resolution data and takes the length of the interval between timestamps into account.

df['A'] = df['A'].interpolate(method = 'time')
sns.distplot(df['A'], hist=True,bins=5)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

As you can see, the different methods render two very different results. I hope this will be useful to you. If not then let me know and I'll have a look at it again.
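
To make the difference concrete without a plot, here is a small sketch on a made-up daily series (the values and dates are assumptions for illustration only). Forward fill repeats the last observation across the gap, whereas time interpolation draws a straight line between the observations on either side:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a gap of three missing days.
idx = pd.date_range('2017-01-01', periods=6, freq='D')
s = pd.Series([0.0, 10.0, np.nan, np.nan, np.nan, 50.0], index=idx)

# Forward fill repeats the last observed value across the gap.
ffilled = s.ffill()

# Time interpolation connects the surrounding observations linearly in time.
interpolated = s.interpolate(method='time')

print(pd.DataFrame({'ffill': ffilled, 'time': interpolated}))
```

The longer the gap, the more the two filled series diverge, which is why the resulting histograms can look so different.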

Answered by jtlz2

This is a known issue with matplotlib/pylab histograms!

See e.g. https://github.com/matplotlib/matplotlib/issues/6483

where various workarounds are suggested, two favourites (for example from https://stackoverflow.com/a/19090183/1021819) being:

import numpy as np
import seaborn

nbins=100
A=data['alcconsumption']
Anan=A[~np.isnan(A)] # Remove the NaNs

seaborn.distplot(Anan,hist=True,bins=nbins)

Alternatively, specify the bin edges (in this case still making use of Anan):

Amin=min(Anan)
Amax=max(Anan)
# nbins+1 edges give nbins bins
seaborn.distplot(A,hist=True,bins=np.linspace(Amin,Amax,nbins+1))
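
A rough sketch of why the NaNs cause trouble in the first place: when no range is given, the histogram range is derived from the minimum and maximum of the data, and both are NaN as soon as a single NaN is present. The array below is made up, and the exact error message depends on the numpy and matplotlib versions installed.

```python
import numpy as np

values = np.array([1.0, 2.5, np.nan, 4.0])  # hypothetical data with one NaN

# Plain min/max propagate NaN, so no finite bin range can be derived from them.
print(values.min(), values.max())

# After masking out the NaNs, the range is well defined and binning works.
clean = values[~np.isnan(values)]
counts, edges = np.histogram(clean, bins=3)
print(counts, edges)
```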

Answered by ZicoNuna

You can use the following line to select the non-NaN values for a distribution plot using seaborn:

# notnull() returns a boolean mask, so use it to index the column
seaborn.distplot(data['alcconsumption'][data['alcconsumption'].notnull()],hist=True,bins=100)
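
One thing worth checking with this approach: notnull() on its own returns a boolean mask rather than the values, so the mask normally has to be used to index the series before plotting. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

col = pd.Series([0.5, np.nan, 2.0, np.nan, 7.5])  # hypothetical column

mask = col.notnull()   # boolean Series: True where a value is present
values = col[mask]     # the non-NaN values themselves

print(mask.dtype, values.tolist())
```

Indexing with the mask selects the same values as col.dropna(), so either form can be passed to the plotting call.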