How to have clusters of stacked bars with python (Pandas)

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/22787209/

Time: 2020-08-19 01:40:45  Source: igfitidea

Tags: python, pandas, matplotlib, plot, seaborn

Asked by jrjc

Here is what my data set looks like:

In [1]: df1=pd.DataFrame(np.random.rand(4,2),index=["A","B","C","D"],columns=["I","J"])

In [2]: df2=pd.DataFrame(np.random.rand(4,2),index=["A","B","C","D"],columns=["I","J"])

In [3]: df1
Out[3]: 
          I         J
A  0.675616  0.177597
B  0.675693  0.598682
C  0.631376  0.598966
D  0.229858  0.378817

In [4]: df2
Out[4]: 
          I         J
A  0.939620  0.984616
B  0.314818  0.456252
C  0.630907  0.656341
D  0.020994  0.538303

I want a stacked bar plot for each dataframe, but since they share the same index, I'd like to have 2 stacked bars per index.

I've tried to plot both on the same axes:

In [5]: ax = df1.plot(kind="bar", stacked=True)

In [5]: ax2 = df2.plot(kind="bar", stacked=True, ax = ax)

But the bars overlap.

Then I tried to concatenate the two datasets first:

pd.concat(dict(df1 = df1, df2 = df2),axis = 1).plot(kind="bar", stacked=True)

But here everything is stacked.

My best attempt is:

 pd.concat(dict(df1 = df1, df2 = df2),axis = 0).plot(kind="bar", stacked=True)

Which gives:

[image]

This is basically what I want, except that I want the bars ordered as:

(df1,A) (df2,A) (df1,B) (df2,B) etc...

I guess there is a trick, but I can't find it!

After @bgschiller's answer, I got this:

[image]

This is almost what I want. I would like the bars to be clustered by index, so that the plot is visually clear.

Bonus: x-labels that are not redundant, something like:

df1 df2    df1 df2
_______    _______ ...
   A          B

Thanks for helping.

Accepted answer by jrjc

So, I eventually found a trick (edit: see below for a version using seaborn and a long-form dataframe):

Solution with pandas and matplotlib

Here it is with a more complete example:

import pandas as pd
import matplotlib.cm as cm
import numpy as np
import matplotlib.pyplot as plt

def plot_clustered_stacked(dfall, labels=None, title="multiple stacked bar plot",  H="/", **kwargs):
    """Given a list of dataframes, with identical columns and index, create a clustered stacked bar plot.

    labels is a list of the names of the dataframes, used for the legend
    title is a string for the title of the plot
    H is the hatch used to identify the different dataframes
    """

    n_df = len(dfall)
    n_col = len(dfall[0].columns) 
    n_ind = len(dfall[0].index)
    axe = plt.subplot(111)

    for df in dfall : # for each data frame
        axe = df.plot(kind="bar",
                      linewidth=0,
                      stacked=True,
                      ax=axe,
                      legend=False,
                      grid=False,
                      **kwargs)  # make bar plots

    h,l = axe.get_legend_handles_labels() # get the handles we want to modify
    for i in range(0, n_df * n_col, n_col): # len(h) = n_col * n_df
        for j, pa in enumerate(h[i:i+n_col]):
            for rect in pa.patches: # for each index
                rect.set_x(rect.get_x() + 1 / float(n_df + 1) * i / float(n_col))
                rect.set_hatch(H * int(i / n_col)) #edited part     
                rect.set_width(1 / float(n_df + 1))

    axe.set_xticks((np.arange(0, 2 * n_ind, 2) + 1 / float(n_df + 1)) / 2.)
    axe.set_xticklabels(df.index, rotation = 0)
    axe.set_title(title)

    # Add invisible data to add another legend
    n=[]        
    for i in range(n_df):
        n.append(axe.bar(0, 0, color="gray", hatch=H * i))

    l1 = axe.legend(h[:n_col], l[:n_col], loc=[1.01, 0.5])
    if labels is not None:
        l2 = plt.legend(n, labels, loc=[1.01, 0.1]) 
    axe.add_artist(l1)
    return axe

# create fake dataframes
df1 = pd.DataFrame(np.random.rand(4, 5),
                   index=["A", "B", "C", "D"],
                   columns=["I", "J", "K", "L", "M"])
df2 = pd.DataFrame(np.random.rand(4, 5),
                   index=["A", "B", "C", "D"],
                   columns=["I", "J", "K", "L", "M"])
df3 = pd.DataFrame(np.random.rand(4, 5),
                   index=["A", "B", "C", "D"], 
                   columns=["I", "J", "K", "L", "M"])

# Then, just call :
plot_clustered_stacked([df1, df2, df3],["df1", "df2", "df3"])

And it gives this:

[image: multiple stacked bar plot]

You can change the colors of the bars by passing a cmap argument:

plot_clustered_stacked([df1, df2, df3],
                       ["df1", "df2", "df3"],
                       cmap=plt.cm.viridis)
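
The bar placement used in plot_clustered_stacked can be checked with a little arithmetic. The sketch below is only an illustration, assuming pandas' default bar width of 0.5 (so the bar for index position j initially starts at j - 0.25); the names width, lefts, centers and ticks are made up for this example and do not appear in the function.

```python
import numpy as np

n_df, n_ind = 3, 4            # three dataframes, four index entries (as in the example)
width = 1 / (n_df + 1)        # new width given to every bar

# Assumption: pandas draws bar j centered on x = j with width 0.5,
# so its left edge starts at j - 0.25 before being shifted.
left_default = np.arange(n_ind) - 0.25

# Dataframe number k is shifted right by k * width, as in the rect.set_x(...) line.
lefts = np.array([left_default + k * width for k in range(n_df)])

# Each cluster spans n_df bars; its center should match the tick position.
centers = lefts[0] + n_df * width / 2
ticks = (np.arange(0, 2 * n_ind, 2) + width) / 2   # same formula as axe.set_xticks(...)

print(lefts)
print(centers, ticks)
```

Under these assumptions the ticks fall on the middle of each cluster, and neighbouring clusters are separated by a gap of one bar width.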


Solution with seaborn:

Given the same df1, df2, df3 as above, I convert them to long form:

df1["Name"] = "df1"
df2["Name"] = "df2"
df3["Name"] = "df3"
dfall = pd.concat([pd.melt(i.reset_index(),
                           id_vars=["Name", "index"]) # transform in tidy format each df
                   for i in [df1, df2, df3]],
                   ignore_index=True)

The problem with seaborn is that it doesn't stack bars natively, so the trick is to plot the cumulative sum of each bar on top of each other:

dfall.set_index(["Name", "index", "variable"], inplace=True)
dfall["vcs"] = dfall.groupby(level=["Name", "index"]).cumsum()
dfall.reset_index(inplace=True) 

>>> dfall.head(6)
  Name index variable     value       vcs
0  df1     A        I  0.717286  0.717286
1  df1     B        I  0.236867  0.236867
2  df1     C        I  0.952557  0.952557
3  df1     D        I  0.487995  0.487995
4  df1     A        J  0.174489  0.891775
5  df1     B        J  0.332001  0.568868
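
The cumulative-sum trick can be sketched on a single dataframe. This is a minimal illustration, not the answer's exact code: the names long and top are made up here, and it relies on reset_index() naming the unnamed index column "index".

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.random((4, 3)),
                   index=list("ABCD"),
                   columns=["I", "J", "K"])

# Long (tidy) form: one row per (index, variable) pair.
long = df1.reset_index().melt(id_vars="index")

# Running total within each index entry, in column order I, J, K.
long["vcs"] = long.groupby("index")["value"].cumsum()

# The tallest bar of each index entry equals the total of the stacked segments.
top = long.groupby("index")["vcs"].max()
print(long)
print(top)
```

Drawing the bars for K first (tallest), then J, then I on top reproduces a stacked bar, which is what the zorder argument achieves in the seaborn loop.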

Then loop over each group of variable and plot the cumulative sum:

import seaborn as sns

c = ["blue", "purple", "red", "green", "pink"]
for i, g in enumerate(dfall.groupby("variable")):
    ax = sns.barplot(data=g[1],
                     x="index",
                     y="vcs",
                     hue="Name",
                     color=c[i],
                     zorder=-i, # so first bars stay on top
                     edgecolor="k")
ax.legend_.remove() # remove the redundant legends 

[image: multiple stack bar plot seaborn]

It lacks a legend, which I think can be added easily. The problem is that, instead of hatches (which can be added easily), the dataframes are differentiated by a gradient of lightness, which is a bit too light for the first one. I don't really know how to change that without modifying each rectangle one by one (as in the first solution).

Tell me if you don't understand something in the code.

Feel free to re-use this code which is under CC0.

Answer by bgschiller

You're on the right track! In order to change the order of the bars, you should change the order in the index.

In [5]: df_both = pd.concat(dict(df1 = df1, df2 = df2),axis = 0)

In [6]: df_both
Out[6]:
              I         J
df1 A  0.423816  0.094405
    B  0.825094  0.759266
    C  0.654216  0.250606
    D  0.676110  0.495251
df2 A  0.607304  0.336233
    B  0.581771  0.436421
    C  0.233125  0.360291
    D  0.519266  0.199637

[8 rows x 2 columns]

So we want to swap the index levels, then reorder. Here's an easy way to do this:

In [7]: df_both.swaplevel(0,1)
Out[7]:
              I         J
A df1  0.423816  0.094405
B df1  0.825094  0.759266
C df1  0.654216  0.250606
D df1  0.676110  0.495251
A df2  0.607304  0.336233
B df2  0.581771  0.436421
C df2  0.233125  0.360291
D df2  0.519266  0.199637

[8 rows x 2 columns]

In [8]: df_both.swaplevel(0,1).sort_index()
Out[8]:
              I         J
A df1  0.423816  0.094405
  df2  0.607304  0.336233
B df1  0.825094  0.759266
  df2  0.581771  0.436421
C df1  0.654216  0.250606
  df2  0.233125  0.360291
D df1  0.676110  0.495251
  df2  0.519266  0.199637

[8 rows x 2 columns]

If it's important that your horizontal labels show up in the old order (df1,A) rather than (A,df1), we can just call swaplevel again, without a second sort_index:

In [9]: df_both.swaplevel(0,1).sort_index().swaplevel(0,1)
Out[9]:
              I         J
df1 A  0.423816  0.094405
df2 A  0.607304  0.336233
df1 B  0.825094  0.759266
df2 B  0.581771  0.436421
df1 C  0.654216  0.250606
df2 C  0.233125  0.360291
df1 D  0.676110  0.495251
df2 D  0.519266  0.199637

[8 rows x 2 columns]
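
The steps above can be put together as a small script. It is only a sketch: the random values and the variable names ordered and order are illustrative, and the plotting call is left as a comment so the snippet runs without a display.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = ["A", "B", "C", "D"]
df1 = pd.DataFrame(rng.random((4, 2)), index=idx, columns=["I", "J"])
df2 = pd.DataFrame(rng.random((4, 2)), index=idx, columns=["I", "J"])

# Stack the two frames vertically; the dict keys become the outer index level.
df_both = pd.concat({"df1": df1, "df2": df2}, axis=0)

# Put the letters on the outer level, then sort so rows alternate df1/df2.
ordered = df_both.swaplevel(0, 1).sort_index()
order = list(ordered.index)
print(order)

# ordered.plot(kind="bar", stacked=True)  # one stacked bar per (letter, frame) pair
```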

Answer by Cord Kaldemeyer

I have managed to do the same using pandas and matplotlib subplots with basic commands.

Here's an example:

fig, axes = plt.subplots(nrows=1, ncols=3)

ax_position = 0
for concept in df.index.get_level_values('concept').unique():
    idx = pd.IndexSlice
    subset = df.loc[idx[[concept], :],
                    ['cmp_tr_neg_p_wrk', 'exp_tr_pos_p_wrk',
                     'cmp_p_spot', 'exp_p_spot']]     
    print(subset.info())
    subset = subset.groupby(
        subset.index.get_level_values('datetime').year).sum()
    subset = subset / 4  # quarter hours
    subset = subset / 100  # installed capacity
    ax = subset.plot(kind="bar", stacked=True, colormap="Blues",
                     ax=axes[ax_position])
    ax.set_title("Concept \"" + concept + "\"", fontsize=30, alpha=1.0)
    ax.set_ylabel("Hours", fontsize=30)
    ax.set_xlabel("Concept \"" + concept + "\"", fontsize=30, alpha=0.0)
    ax.set_ylim(0, 9000)
    ax.set_yticks(range(0, 9000, 1000))
    ax.set_yticklabels(labels=range(0, 9000, 1000), rotation=0,
                       minor=False, fontsize=28)
    ax.set_xticklabels(labels=['2012', '2013', '2014'], rotation=0,
                       minor=False, fontsize=28)
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(['Market A', 'Market B',
               'Market C', 'Market D'],
              loc='upper right', fontsize=28)
    ax_position += 1

# look "three subplots"
#plt.tight_layout(pad=0.0, w_pad=-8.0, h_pad=0.0)

# look "one plot"
plt.tight_layout(pad=0., w_pad=-16.5, h_pad=0.0)
axes[1].set_ylabel("")
axes[2].set_ylabel("")
axes[1].set_yticklabels("")
axes[2].set_yticklabels("")
axes[0].legend().set_visible(False)
axes[1].legend().set_visible(False)
axes[2].legend(['Market A', 'Market B',
                'Market C', 'Market D'],
               loc='upper right', fontsize=28)

The dataframe structure of "subset" before grouping looks like this:

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 105216 entries, (D_REC, 2012-01-01 00:00:00) to (D_REC, 2014-12-31 23:45:00)
Data columns (total 4 columns):
cmp_tr_neg_p_wrk    105216 non-null float64
exp_tr_pos_p_wrk    105216 non-null float64
cmp_p_spot          105216 non-null float64
exp_p_spot          105216 non-null float64
dtypes: float64(4)
memory usage: 4.0+ MB

and the plot looks like this:

[image]

It is formatted in the "ggplot" style with the following header:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
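
The yearly aggregation step inside the loop can be tried on synthetic data. The sketch below is an assumption-laden illustration: it builds a two-year quarter-hourly MultiIndex with a single made-up column that is 1 in every quarter hour, so that dividing the yearly sum by 4 gives the number of hours in each year.

```python
import numpy as np
import pandas as pd

# Synthetic quarter-hourly index for one concept (values are made up).
times = pd.date_range("2012-01-01", "2013-12-31 23:45", freq="15min")
index = pd.MultiIndex.from_product([["D_REC"], times],
                                   names=["concept", "datetime"])
df = pd.DataFrame({"cmp_p_spot": np.ones(len(index))}, index=index)

# Group by the year of the 'datetime' level, sum, and convert quarter hours to hours.
years = df.index.get_level_values("datetime").year
yearly = df.groupby(years).sum() / 4
print(yearly)
```

2012 is a leap year, which is why its total is 24 hours larger than that of 2013.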

Answer by Nipun Batra

Altair can be helpful here. Here is the produced plot.

[image]

Imports

import pandas as pd
import numpy as np
from altair import *

Dataset creation

df1=pd.DataFrame(10*np.random.rand(4,2),index=["A","B","C","D"],columns=["I","J"])
df2=pd.DataFrame(10*np.random.rand(4,2),index=["A","B","C","D"],columns=["I","J"])

Preparing dataset

def prep_df(df, name):
    df = df.stack().reset_index()
    df.columns = ['c1', 'c2', 'values']
    df['DF'] = name
    return df

df1 = prep_df(df1, 'DF1')
df2 = prep_df(df2, 'DF2')

df = pd.concat([df1, df2])

Altair plot

Chart(df).mark_bar().encode(y=Y('values', axis=Axis(grid=False)),
                            x='c2:N', 
                            column=Column('c1:N') ,
                            color='DF:N').configure_facet_cell( strokeWidth=0.0).configure_cell(width=200, height=200)

Answer by Grant Langseth

This is a great start, but I think the colors could be modified a bit for clarity. Also be careful about importing everything from Altair (from altair import *), as this may cause collisions with existing objects in your namespace. Here is some reconfigured code that displays the colors correctly when stacking the values:

[image: Altair Clustered Column Chart]

Import packages

import pandas as pd
import numpy as np
import altair as alt

Generate some random data

df1=pd.DataFrame(10*np.random.rand(4,3),index=["A","B","C","D"],columns=["I","J","K"])
df2=pd.DataFrame(10*np.random.rand(4,3),index=["A","B","C","D"],columns=["I","J","K"])
df3=pd.DataFrame(10*np.random.rand(4,3),index=["A","B","C","D"],columns=["I","J","K"])

def prep_df(df, name):
    df = df.stack().reset_index()
    df.columns = ['c1', 'c2', 'values']
    df['DF'] = name
    return df

df1 = prep_df(df1, 'DF1')
df2 = prep_df(df2, 'DF2')
df3 = prep_df(df3, 'DF3')

df = pd.concat([df1, df2, df3])

Plot data with Altair

alt.Chart(df).mark_bar().encode(

    # tell Altair which field to group columns on
    x=alt.X('c2:N', title=None),

    # tell Altair which field to use as Y values and how to calculate
    y=alt.Y('sum(values):Q',
        axis=alt.Axis(
            grid=False,
            title=None)),

    # tell Altair which field to use as the set of columns to be represented in each group
    column=alt.Column('c1:N', title=None),

    # tell Altair which field to use for color segmentation 
    color=alt.Color('DF:N',
            scale=alt.Scale(
                # make it look pretty with an enjoyable color palette
                range=['#96ceb4', '#ffcc5c','#ff6f69'],
            ),
        ))\
    .configure_view(
        # remove grid lines around column clusters
        strokeOpacity=0    
    )

Answer by billjoie

The answer by @jrjc using seaborn is very clever, but it has a few problems, as noted by the author:

  1. The "light" shading is too pale when only two or three categories are needed. It makes colour series (pale blue, blue, dark blue, etc.) difficult to distinguish.
  2. No legend is produced to explain the meaning of the shadings ("pale" means what?)

More importantly, however, I found out that, because of the groupby statement in the code:

  1. This solution works only if the columns are ordered alphabetically. If I rename columns ["I", "J", "K", "L", "M"] to something anti-alphabetical (["zI", "yJ", "xK", "wL", "vM"]), I get this graph instead:

[image: Stacked bar construction fails if columns are not in alphabetical order]
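
A likely explanation for this behaviour is that groupby sorts the group keys by default, so the loop no longer visits the columns in their original order and the zorder assignment ends up reversed. The sketch below uses a tiny made-up long-form frame to show the difference; passing sort=False is one possible way to keep the order of appearance, not necessarily what the module mentioned below does.

```python
import pandas as pd

# Made-up long-form data whose column names are in anti-alphabetical order.
long = pd.DataFrame({
    "variable": ["zI", "zI", "yJ", "yJ", "xK", "xK"],
    "vcs": [1, 2, 3, 4, 5, 6],
})

# Default behaviour: group keys come back sorted alphabetically.
sorted_keys = [key for key, _ in long.groupby("variable")]

# sort=False keeps the keys in order of first appearance.
appearance_keys = [key for key, _ in long.groupby("variable", sort=False)]

print(sorted_keys)
print(appearance_keys)
```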



I strove to resolve these problems with the plot_grouped_stackedbars() function in this open-source python module.

  1. It keeps the shading within reasonable range
  2. It auto-generates a legend that explains the shading
  3. It does not rely on groupby

[image: Proper grouped stacked-bars graph with legend and narrow shading range]

It also allows for

  1. various normalization options (see below normalization to 100% of maximum value)
  2. the addition of error bars

[image: Example with normalization and error bars]

See full demo here. I hope this proves useful and can answer the original question.

Answer by Simoons

I liked the solution of Cord Kaldemeyer, but it is not robust at all (and contains some useless lines). Here is a modified version. The idea is to reserve as much width as necessary for the plots. Then each cluster gets a subplot of the required length.

# Data and imports

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import MaxNLocator
import matplotlib.gridspec as gridspec
import matplotlib

matplotlib.style.use('ggplot')

np.random.seed(0)

df = pd.DataFrame(np.asarray(1+5*np.random.random((10,4)), dtype=int),columns=["Cluster", "Bar", "Bar_part", "Count"])
df = df.groupby(["Cluster", "Bar", "Bar_part"])["Count"].sum().unstack(fill_value=0)
print(df)  # in a Jupyter notebook, display(df) renders a nicer table

# plotting

clusters = df.index.levels[0]
inter_graph = 0
maxi = np.max(np.sum(df, axis=1))
total_width = len(df)+inter_graph*(len(clusters)-1)

fig = plt.figure(figsize=(total_width,10))
gridspec.GridSpec(1, total_width)
axes=[]

ax_position = 0
for cluster in clusters:
    subset = df.loc[cluster]
    ax = subset.plot(kind="bar", stacked=True, width=0.8, ax=plt.subplot2grid((1,total_width), (0,ax_position), colspan=len(subset.index)))
    axes.append(ax)
    ax.set_title(cluster)
    ax.set_xlabel("")
    ax.set_ylim(0,maxi+1)
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
    ax_position += len(subset.index)+inter_graph

for i in range(1,len(clusters)):
    axes[i].set_yticklabels("")
    axes[i-1].legend().set_visible(False)
axes[0].set_ylabel("y_label")

fig.suptitle('Big Title', fontsize="x-large")
legend = axes[-1].legend(loc='upper right', fontsize=16, framealpha=1).get_frame()
legend.set_linewidth(3)
legend.set_edgecolor("black")

plt.show()

The result is the following:

(not yet able to post an image directly on the site)
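
The layout idea in this answer (one grid column per bar, one subplot per cluster) can be checked without drawing anything. The sketch below recomputes, for the same random data, how many grid columns each cluster needs and where its subplot would start, assuming inter_graph = 0 as in the code above; bars_per_cluster and starts are names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
raw = pd.DataFrame(np.asarray(1 + 5 * np.random.random((10, 4)), dtype=int),
                   columns=["Cluster", "Bar", "Bar_part", "Count"])
df = raw.groupby(["Cluster", "Bar", "Bar_part"])["Count"].sum().unstack(fill_value=0)

# Number of bars (rows) in each cluster = colspan of that cluster's subplot.
bars_per_cluster = df.groupby(level="Cluster").size()

# Starting grid column of each cluster = bars used by the clusters before it.
starts = bars_per_cluster.cumsum() - bars_per_cluster

print(pd.DataFrame({"start": starts, "colspan": bars_per_cluster}))
```

These are the values that the plotting loop passes as the column offset and colspan to plt.subplot2grid.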