Stacked bar plot by grouped counts in pandas (Python)

Question

Asked by Acerace.py

My CSV data looks something like the sample below. I want to create a stacked bar plot with pandas/Python in which each bar shows the male and female portions in two colors, with the total count of males and females taking the drug (in my case) on top of the bar. For instance, 7 people fall under age 20; 6 of them are male and 1 is female, so the bar should have 7 on top and show the 6:1 split in two colors. I managed to group the people by age, count them, and plot the result, but I also want to show the two genders in different colors within each bar. Any help will be appreciated. Thank you.

Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')

# Number of patients per age and per gender
temp1 = df.groupby('Age').Age.count()
temp2 = df.groupby('Gender').Age.count()

ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2.,   p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()

Got something like this as a result:

Answer 1

Answered by Diziet Asahi

This question comes up often, so I decided to write a step-by-step explanation. Note that I'm not a pandas guru, so some of this could probably be optimized.

I started by getting a list of ages that I will use for my x-axis:

cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''

import matplotlib.pyplot as plt
import pandas as pd
from io import StringIO  # Python 3 location of StringIO

df = pd.read_csv(StringIO(cvsdata))
ages = df.Age.unique()

array([15, 17, 19, 20, 21, 23, 24])

Then I generated a grouped dataframe with the counts of each M and F per age:

counts = df.groupby(['Age','Gender']).count()
print(counts)

            Drug_ID
Age Gender         
15  F             1
17  M             1
19  M             2
20  F             1
    M             6
21  F             1
    M             3
23  F             3
    M             4
24  F             3
    M             2

Using that, I can easily calculate the total number of individuals per age group:

# sum(level=0) was removed in pandas 2.0; group on the first index level instead
totals = counts.groupby(level=0).sum()
print(totals)

     Drug_ID
Age         
15         1
17         1
19         2
20         7
21         4
23         7
24         5
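
As a side note, the per-age totals can also be taken straight from the raw rows. The sketch below is not part of the original answer: it rebuilds only the Age and Gender columns of the sample data by hand (an assumption made so the snippet runs on its own) and uses groupby(...).size(), which returns a Series rather than a one-column dataframe. The name totals_by_age is illustrative.

```python
import pandas as pd

# Age/Gender columns of the sample data, rebuilt by hand so the snippet is standalone
ages_col = [15, 17, 19, 19] + [20] * 7 + [21] * 4 + [23] * 7 + [24] * 5
gender_col = list("FMMM") + list("FMMMMMM") + list("FMMM") + list("FFFMMMM") + list("FFFMM")
df = pd.DataFrame({"Age": ages_col, "Gender": gender_col})

# size() counts the rows in each group and returns a Series indexed by Age
totals_by_age = df.groupby("Age").size()
print(totals_by_age)
```

Because the result is a Series, it can be iterated over directly, without taking .values and flattening.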

To prepare for plotting, I'll transform my counts dataframe so that each sex gets its own column instead of being an index level. Here I also drop the 'Drug_ID' column level, because the unstack() operation creates a MultiIndex on the columns and the dataframe is much easier to manipulate without it.

counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print(counts)

Gender    F    M
Age             
15      1.0  NaN
17      NaN  1.0
19      NaN  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0

Looks pretty good. I'll just do a final refinement and replace the NaN values with 0.

counts = counts.fillna(0)
print(counts)

Gender    F    M
Age             
15      1.0  0.0
17      0.0  1.0
19      0.0  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0
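
The groupby / unstack / fillna sequence can likely be collapsed into one call. The following is a minimal sketch, not the original author's method: it uses pd.crosstab on a hand-rebuilt copy of the Age and Gender columns (an assumption made for self-containment), and counts_alt is an illustrative name.

```python
import pandas as pd

# Age/Gender columns of the sample data, rebuilt by hand so the snippet is standalone
ages_col = [15, 17, 19, 19] + [20] * 7 + [21] * 4 + [23] * 7 + [24] * 5
gender_col = list("FMMM") + list("FMMMMMM") + list("FMMM") + list("FFFMMMM") + list("FFFMM")
df = pd.DataFrame({"Age": ages_col, "Gender": gender_col})

# crosstab tabulates how often each (Age, Gender) pair occurs;
# combinations that never occur are filled with 0 instead of NaN
counts_alt = pd.crosstab(df["Age"], df["Gender"])
print(counts_alt)
```

One difference from the table above: the crosstab result holds integers, since no NaN ever appears to force a conversion to float.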

With this dataframe, it is trivial to plot the stacked bars:

plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')

To plot the total counts on top of the bars, we'll use the annotate() function. We cannot do it in a single call; instead we'll loop through ages and totals (for simplicity's sake, I take the values and flatten() them, because totals is a one-column dataframe and its values come back as a 2-D array):

for age, tot in zip(ages, totals.values.flatten()):
    # bars are centered on their x value by default (matplotlib >= 2.0)
    plt.annotate('N={:d}'.format(tot), xy=(age, tot), xytext=(0, 5), textcoords='offset points', ha='center', va='bottom')
plt.show()

The coordinates for the annotations are (age, tot) because, since matplotlib 2.0, bar() centers each bar on its x value by default (align='center'), while tot is of course the full height of the bar. In older versions the bars went from x to x+width with width=0.8 by default, so the center was at x+0.4 and the annotation had to be placed at (age+0.4, tot). To move the text off the bar a bit, I shift it by a few (5) points in the y direction. Adjust to your liking.
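
As an alternative to the manual annotate() loop, newer matplotlib releases (3.4 and later, as far as I know) provide Axes.bar_label(), which attaches labels to the bars of a container. The sketch below is an assumption-laden illustration rather than the original answer's approach: the data is rebuilt by hand, the counts table comes from pd.crosstab, and the labels are attached to the top (F) segment so they land at the full height of each stack.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Age/Gender columns of the sample data, rebuilt by hand so the snippet is standalone
ages_col = [15, 17, 19, 19] + [20] * 7 + [21] * 4 + [23] * 7 + [24] * 5
gender_col = list("FMMM") + list("FMMMMMM") + list("FMMM") + list("FFFMMMM") + list("FFFMM")
df = pd.DataFrame({"Age": ages_col, "Gender": gender_col})
counts = pd.crosstab(df["Age"], df["Gender"])
totals = counts.sum(axis=1)  # total per age, as a Series

fig, ax = plt.subplots()
ax.bar(counts.index, counts["M"], color="blue", label="M")
top = ax.bar(counts.index, counts["F"], bottom=counts["M"], color="pink", label="F")

# Label the upper segment of every stack with the overall total, 5 points above the bar
labels = ax.bar_label(top, labels=[f"N={t}" for t in totals], padding=5)
ax.legend()
ax.set_xlabel("Ages")
ax.set_ylabel("Count")
```

The padding argument plays the same role as xytext=(0, 5) in the loop version.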

Check out the documentation for bar() to adjust the parameters of the bar plots, and the documentation for annotate() to customize your annotations.
