Creating a matplotlib or seaborn histogram which uses percent rather than count?
Source: http://stackoverflow.com/questions/40092294/
This page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors on Stack Overflow (not the translator).
Asked by WillacyMe
Specifically, I'm dealing with the Kaggle Titanic dataset. I've plotted a stacked histogram which shows the ages of the passengers who survived and died on the Titanic. Code below.
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Age'], data[data['Survived']==0]['Age']], stacked=True, bins=30, label=['Survived','Dead'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()
I would like to alter the chart to show a single bar per bin giving the percentage of that age group that survived. E.g. if a bin contained the ages between 10 and 20 years and 60% of the people aboard the Titanic in that age group survived, then the bar's height would line up with 60% on the y-axis.
Edit: I may have given a poor explanation of what I'm looking for. Rather than alter the y-axis values, I'm looking to change the actual shape of the bars based on the percentage that survived.
The first bin on the graph shows roughly 65% survived in that age group. I would like this bin to line up against the y-axis at 65%. The following bins look to be 90%, 50%, 10% respectively, and so on.
The graph would end up actually looking something like this:
Accepted answer by bahaugen
Perhaps the following will help ...
Split the dataframe based on 'Survived'
df_survived = df[df['Survived'] == 1]
df_not_survive = df[df['Survived'] == 0]
Create bins
age_bins = np.linspace(0, 80, 21)
Use np.histogram to generate histogram data
survived_hist = np.histogram(df_survived['Age'], bins=age_bins, range=(0, 80))
not_survive_hist = np.histogram(df_not_survive['Age'], bins=age_bins, range=(0, 80))
Calculate survival rate in each bin
surv_rates = survived_hist[0] / (survived_hist[0] + not_survive_hist[0])
Plot
# align='edge' places each bar's left side on its bin's left edge
plt.bar(age_bins[:-1], surv_rates, width=age_bins[1] - age_bins[0], align='edge')
plt.xlabel('Age')
plt.ylabel('Survival Rate')
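The steps above can be sketched end to end on a small made-up sample. The `ages` and `survived` arrays below are hypothetical stand-ins for the Titanic columns, and the handling of empty bins (reported as NaN rather than dividing by zero) is an addition that is not part of the original answer.

```python
import numpy as np

# Hypothetical stand-in for the Titanic 'Age' and 'Survived' columns.
ages = np.array([2, 3, 8, 15, 17, 22, 25, 28, 35, 61], dtype=float)
survived = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0])

age_bins = np.linspace(0, 80, 5)  # edges: 0, 20, 40, 60, 80

# Count survivors and all passengers per bin using the same edges.
surv_counts, _ = np.histogram(ages[survived == 1], bins=age_bins)
total_counts, _ = np.histogram(ages, bins=age_bins)

# Divide only where a bin is non-empty; empty bins stay NaN.
surv_rates = np.full(len(total_counts), np.nan)
np.divide(surv_counts, total_counts, out=surv_rates, where=total_counts > 0)

surv_percent = surv_rates * 100
print(surv_percent)
```

Passing `surv_percent` to `plt.bar` in place of `surv_rates` would put the y-axis directly in percent; NaN bars are simply not drawn.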
Answer by piRSquared
pd.Series.hist uses np.histogram underneath.
Let's explore that
import numpy as np
import pandas as pd
np.random.seed([3,1415])
s = pd.Series(np.random.randn(100))
# density=True replaces the older normed=True argument, which current NumPy releases no longer accept
d = np.histogram(s, density=True)
print('\nthese are the normalized counts\n')
print(d[0])
print('\nthese are the bin edges\n')
print(d[1])
these are the normalized counts
[ 0.11552497 0.18483996 0.06931498 0.32346993 0.39278491 0.36967992
0.32346993 0.25415494 0.25415494 0.02310499]
these are the bin edges
[-2.25905503 -1.82624818 -1.39344133 -0.96063448 -0.52782764 -0.09502079
0.33778606 0.77059291 1.20339976 1.6362066 2.06901345]
We can plot these, using the mean of each pair of adjacent bin edges (the bin centers) as the bar labels
pd.Series(d[0], pd.Series(d[1]).rolling(2).mean().dropna().round(2).values).plot.bar()
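As a small check of what the rolling-mean expression computes, the sketch below compares it with the plain midpoint formula on hypothetical bin edges (the `edges` values are made up for illustration).

```python
import numpy as np
import pandas as pd

edges = np.array([0.0, 10.0, 20.0, 30.0])  # hypothetical bin edges

# Midpoint of each bin: average of its left and right edge.
centers = (edges[:-1] + edges[1:]) / 2

# The rolling-mean expression from the answer yields the same values.
rolled = pd.Series(edges).rolling(2).mean().dropna().values

print(centers)
print(rolled)
```

Both give one value per bin, which is why there is one fewer label than there are edges.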
ACTUAL ANSWER
OR
We could have simply passed density=True (normed=True in older versions) to the pd.Series.hist method, which passes it along to np.histogram.
s.hist(density=True)
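Note that a density is not the same thing as a percentage: density scaling makes the bar areas sum to 1, so bar heights depend on the bin width. One possible way to get per-bin percentages instead, offered here as a sketch rather than as part of the original answer, is to weight each observation by 100/N. The seed and sample below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, illustration only
values = rng.normal(size=100)

# density=True scales the bars so that their total area is 1.
density, edges = np.histogram(values, bins=10, density=True)
area = np.sum(density * np.diff(edges))

# Weighting each observation by 100/N turns bar heights into percentages.
weights = np.full(len(values), 100.0 / len(values))
percent, _ = np.histogram(values, bins=10, weights=weights)

print(area)           # close to 1.0
print(percent.sum())  # close to 100.0
```

The same `weights` array can be passed to `plt.hist`, which accepts a `weights` argument as well.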
Answer by Nikos Tavoularis
First of all, it would be better to create a function that splits your data into age groups
# This function splits a column of ages into predefined age groups
def cutDF(ages):
    return pd.cut(
        ages, [0, 10, 20, 30, 40, 50, 60, 70, 80],
        labels=['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80'])
data['AgeGroup'] = cutDF(data['Age'])
Then you can plot your graph as follows:
survival_per_age_group = data.groupby('AgeGroup')['Survived'].mean()
# Creating the plot that will show survival % per age group
ax = survival_per_age_group.plot(kind='bar', color='green')
ax.set_title("Survivors by Age Group", fontsize=14, fontweight='bold')
ax.set_xlabel("Age Groups")
ax.set_ylabel("Percentage")
# Hide the ticks on the top and right sides (booleans replace the old 'off' strings)
ax.tick_params(axis='x', top=False)
ax.tick_params(axis='y', right=False)
plt.xticks(rotation='horizontal')
# Importing the relevant function to format the y axis
from matplotlib.ticker import FuncFormatter
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
plt.show()
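To see what the grouping step produces without the plotting code, here is a minimal sketch on a made-up seven-row frame. The data, the shortened `edges` and `labels` lists, and the 'n/a' placeholder for empty groups are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic data.
data = pd.DataFrame({
    'Age':      [4, 10, 11, 18, 20, 35, 40],
    'Survived': [1,  1,  0,  1,  0,  0,  1],
})

edges = [0, 10, 20, 30, 40]
labels = ['0-10', '11-20', '21-30', '31-40']

# pd.cut uses right-closed intervals by default, so age 10 lands in '0-10'.
data['AgeGroup'] = pd.cut(data['Age'], edges, labels=labels)

# The mean of a 0/1 column is the share of ones in each group.
rates = data.groupby('AgeGroup', observed=False)['Survived'].mean()

# Format as percentages; a group with no passengers has no defined rate.
formatted = rates.map(lambda r: 'n/a' if pd.isna(r) else '{:.0%}'.format(r))
print(formatted)
```

Because the intervals exclude their left edge, an age of exactly 0 would fall outside every group and be assigned NaN.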
Answer by Ted Petrou
The library Dexplot is capable of returning relative frequencies of groups. Currently, you'll need to bin the age variable in pandas with the cut function. You can then use Dexplot.
titanic['age2'] = pd.cut(titanic['age'], range(0, 110, 10))
Pass the variable you would like to count (age2) to the agg parameter. Subdivide the counts with the hue parameter and normalize by age2. Also, this might be a good time for a stacked bar plot.
dxp.aggplot(agg='age2', data=titanic, hue='survived', stacked=True, normalize='age2')
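For readers who want to inspect the numbers behind such a normalized plot, the same relative frequencies can be computed in plain pandas. This is a sketch using pd.crosstab on a made-up frame, not Dexplot's own implementation; the column names follow the answer, while the data and the shortened bin range are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for the titanic frame used above.
titanic = pd.DataFrame({
    'age':      [5, 8, 15, 18, 19, 25],
    'survived': [1, 0, 1, 1, 0, 0],
})
titanic['age2'] = pd.cut(titanic['age'], list(range(0, 40, 10)))

# normalize='index' divides each row (age bin) by its own total,
# so every row holds the shares of survived == 0 and survived == 1.
shares = pd.crosstab(titanic['age2'], titanic['survived'], normalize='index')
print(shares)
```

Each row sums to 1, which corresponds to stacked bars of equal height split by survival status.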