Notice: this page reproduces a popular StackOverflow question and its answer under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/20915800/

Time: 2020-09-13 21:31:09  Source: igfitidea

Plotting with GroupBy in Pandas/Python

Tags: python, matplotlib, pandas

Asked by David Hagan

Although it is straightforward to plot groupby objects in pandas, I am wondering what the most pythonic (pandastic?) way to grab the unique groups from a groupby object is. For example: I am working with atmospheric data and trying to plot diurnal trends over a period of several days or more. The following is the DataFrame containing many days' worth of data, where the timestamp is the index:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10909 entries, 2013-08-04 12:01:00 to 2013-08-13 17:43:00
Data columns (total 17 columns):
Date     10909  non-null values
Flags    10909  non-null values
Time     10909  non-null values
convt    10909  non-null values
hino     10909  non-null values
hinox    10909  non-null values
intt     10909  non-null values
no       10909  non-null values
nox      10909  non-null values
ozonf    10909  non-null values
pmtt     10909  non-null values
pmtv     10909  non-null values
pres     10909  non-null values
rctt     10909  non-null values
smplf    10909  non-null values
stamp    10909  non-null values
no2      10909  non-null values
dtypes: datetime64[ns](1), float64(11), int64(2), object(3)

To average the data (and compute other statistics) at each minute of the day across several days, I group the DataFrame by its Time column: data = no.groupby('Time')

I can then easily plot the mean NO concentration as well as quartiles:

from pylab import figure, title, ylabel, percentile  # the original presumably ran in a pylab session

ax = figure(figsize=(12,8)).add_subplot(111)
title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
ylabel('Concentration [ppb]')
# mean and quartiles of NO for each minute of the day
data.no.mean().plot(ax=ax, style='b', label='Mean')
data.no.apply(lambda x: percentile(x, 25)).plot(ax=ax, style='r', label='25%')
data.no.apply(lambda x: percentile(x, 75)).plot(ax=ax, style='r', label='75%')
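
For readers without the original data, the same per-minute statistics can be sketched on synthetic values. Everything below (the random data, the names idx, rng and stats) is an assumption made for illustration and is not part of the original post; quantile() is used here in place of the lambda with percentile().

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the atmospheric data: 3 days of 1-minute samples.
idx = pd.date_range("2013-08-04", periods=3 * 1440, freq="min")
rng = np.random.default_rng(0)
no = pd.DataFrame({"no": rng.normal(10.0, 2.0, len(idx))}, index=idx)
no["Time"] = idx.strftime("%H:%M")  # minute-of-day label, as in the question

# Group by minute of day and collect the statistics as columns.
grouped = no.groupby("Time")["no"]
stats = pd.DataFrame({
    "mean": grouped.mean(),
    "25%": grouped.quantile(0.25),
    "75%": grouped.quantile(0.75),
})
print(stats.head())
```

Each row of stats corresponds to one minute of the day, so the three columns share a single index.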

The issue behind my question is that, in order to plot more interesting things such as shaded bands drawn with fill_between(), it is necessary to know the x-axis values, per the documentation:

fill_between(x, y1, y2=0, where=None, interpolate=False, hold=None, **kwargs)

For the life of me, I cannot figure out the best way to accomplish this. I have tried:

  1. Iterating over the groupby object and creating an array of the groups
  2. Grabbing all of the unique Time entries from the original DataFrame

I can make these work, but I know there is a better way. Python is far too beautiful. Any ideas/hints?
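
As a rough sketch of the options mentioned above, the group keys can be read from the groupby object, from the index of an aggregated result, or from the unique values of the column. The tiny DataFrame here is made up for illustration; only the pandas calls themselves are standard.

```python
import pandas as pd

# Hypothetical miniature of the question's data.
df = pd.DataFrame({
    "Time": ["00:00", "00:01", "00:00", "00:01", "00:02"],
    "no": [1.0, 2.0, 3.0, 4.0, 5.0],
})
grouped = df.groupby("Time")

# 1. The .groups mapping has one key per group.
keys_from_groups = sorted(grouped.groups)

# 2. Any aggregation is indexed by the (sorted) group keys.
keys_from_result = list(grouped["no"].mean().index)

# 3. Unique values of the grouping column in the original frame.
keys_from_unique = sorted(df["Time"].unique())

print(keys_from_groups, keys_from_result, keys_from_unique)
```

The second form is usually the most convenient for plotting, because the x values and the statistics already sit in the same object.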

UPDATE: The statistics can be dumped into a new DataFrame using unstack(), such as:

no_new = no.groupby('Time')['no'].describe().unstack()
no_new.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1440 entries, 00:00 to 23:59
Data columns (total 8 columns):
count    1440  non-null values
mean     1440  non-null values
std      1440  non-null values
min      1440  non-null values
25%      1440  non-null values
50%      1440  non-null values
75%      1440  non-null values
max      1440  non-null values
dtypes: float64(8)

Although I should be able to plot with fill_between() using no_new.index, I receive a TypeError.

Current Plot code and TypeError:

ax = figure(figsize=(12,8)).add_subplot(111)
ax.plot(no_new['mean'])
ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5, facecolor='green')

TypeError:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-47493de920f1> in <module>()
      2 ax = figure(figsize=(12,8)).add_subplot(111)
      3 ax.plot(no_new['mean'])
----> 4 ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5,     facecolor='green')
      5 #title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
      6 #ylabel('Concentration [ppb]')

C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\matplotlib\axes.pyc in fill_between(self, x, y1, y2, where, interpolate, **kwargs)
   6986 
   6987         # Convert the arrays so we can work with them
-> 6988         x = ma.masked_invalid(self.convert_xunits(x))
   6989         y1 = ma.masked_invalid(self.convert_yunits(y1))
   6990         y2 = ma.masked_invalid(self.convert_yunits(y2))

C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\numpy\ma\core.pyc in masked_invalid(a, copy)
   2237         cls = type(a)
   2238     else:
-> 2239         condition = ~(np.isfinite(a))
   2240         cls = MaskedArray
   2241     result = a.view(cls)

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
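
The failure in the traceback can be reproduced in isolation. This is a minimal sketch, assuming (as the answer below states) that the index holds "HH:MM" strings; the array names are made up for illustration.

```python
import numpy as np

# An object array of "HH:MM" strings, like an index built from a string column.
x_strings = np.array(["00:00", "00:01", "00:02"], dtype=object)

try:
    np.isfinite(x_strings)  # the call that masked_invalid() makes internally
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# A numeric x-axis passes the same check without complaint.
x_numeric = np.arange(len(x_strings))
finite = np.isfinite(x_numeric)
print(finite)
```

So the error comes from the type of the x values rather than from the groupby step.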

The plot as of now looks like this: [image not preserved]

Answered by Garrett

Storing the groupby stats (mean/25/75) as columns in a new dataframe and then passing the new dataframe's index as the x parameter of plt.fill_between() works for me (tested with matplotlib 1.3.1), e.g.:

gdf = df.groupby('Time')[col].describe().unstack()
plt.fill_between(gdf.index, gdf['25%'], gdf['75%'], alpha=.5)

gdf.info() should look like this:

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 00:00:00 to 22:00:00
Data columns (total 8 columns):
count    12 non-null float64
mean     12 non-null float64
std      12 non-null float64
min      12 non-null float64
25%      12 non-null float64
50%      12 non-null float64
75%      12 non-null float64
max      12 non-null float64
dtypes: float64(8)
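
A self-contained sketch of the same idea on synthetic data follows. Two things differ from the snippet above and are assumptions of this sketch: the grouping key is a numeric "minutes since midnight" column (a hypothetical name, chosen so the x-axis is plainly numeric), and describe() is used without unstack(), since on recent pandas versions a grouped describe() already returns one column per statistic.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: 3 days of 1-minute samples.
idx = pd.date_range("2013-08-04", periods=3 * 1440, freq="min")
rng = np.random.default_rng(0)
df = pd.DataFrame({"no": rng.normal(10.0, 2.0, len(idx))}, index=idx)
df["minute"] = idx.hour * 60 + idx.minute  # numeric minute of day

# One row per minute of day, one column per statistic.
gdf = df.groupby("minute")["no"].describe()

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(gdf.index, gdf["mean"], "b", label="Mean")
band = ax.fill_between(gdf.index, gdf["25%"], gdf["75%"],
                       alpha=0.5, facecolor="green")
ax.set_xlabel("Minutes since midnight")
ax.set_ylabel("Concentration [ppb]")
ax.legend()
```

The shaded band spans the interquartile range for each minute, with the mean drawn on top.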

Update: to address the TypeError: ufunc 'isfinite' not supported exception, it is necessary to first convert the Time column from a series of string objects in "HH:MM" format to a series of datetime.time objects, which can be done as follows:

df['Time'] = df.Time.map(lambda x: pd.datetools.parse(x).time())
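
On recent pandas releases pd.datetools is no longer available, so the line above may fail there. A possible equivalent, offered as a sketch rather than as the original author's method, uses pd.to_datetime with an explicit format and the .dt.time accessor; the sample DataFrame is made up for illustration.

```python
import datetime

import pandas as pd

# Hypothetical frame with "HH:MM" strings in the Time column.
df = pd.DataFrame({"Time": ["00:00", "12:30", "23:59"]})

# Parse the strings, then keep only the time-of-day part.
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M").dt.time

print(df["Time"].tolist())
```

The resulting column holds datetime.time objects, matching what the original conversion was meant to produce.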