Creating a matplotlib or seaborn histogram which uses percent rather than count?

This page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/40092294/

Time: 2020-09-14 02:14:02  Source: igfitidea

Tags: python, pandas, matplotlib, dataset, histogram

Asked by WillacyMe

Specifically, I'm dealing with the Kaggle Titanic dataset. I've plotted a stacked histogram which shows the ages of the passengers who survived and died aboard the Titanic. Code below.

import matplotlib.pyplot as plt

# data: the Kaggle Titanic DataFrame, already loaded with pandas
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Age'], data[data['Survived']==0]['Age']], stacked=True, bins=30, label=['Survived','Dead'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()

I would like to alter the chart to show a single bar per bin giving the percentage of that age group that survived. E.g. if a bin contained the ages between 10 and 20 and 60% of the people aboard the Titanic in that age group survived, then the bar's height would line up with 60% on the y-axis.

Edit: I may have given a poor explanation of what I'm looking for. Rather than alter the y-axis values, I'm looking to change the actual shape of the bars based on the percentage that survived.

The first bin on the graph shows roughly 65% survived in that age group. I would like this bin to line up against the y-axis at 65%. The following bins look to be 90%, 50%, 10% respectively, and so on.

The graph would end up looking something like this:

[Image: sketch of the desired chart, not reproduced here]

Accepted answer by bahaugen

Perhaps the following will help ...

  1. Split the dataframe based on 'Survived'

    df_survived=df[df['Survived']==1]
    df_not_survive=df[df['Survived']==0]
    
  2. Create bins (20 equal-width bins covering ages 0 to 80)

    age_bins=np.linspace(0,80,21)
    
  3. Use np.histogram to generate histogram data (the explicit bin edges already fix the range, so no range argument is needed)

    survived_hist=np.histogram(df_survived['Age'],bins=age_bins)
    not_survive_hist=np.histogram(df_not_survive['Age'],bins=age_bins)
    
  4. Calculate the survival rate in each bin (bins with no passengers come out as NaN)

    surv_rates=survived_hist[0]/(survived_hist[0]+not_survive_hist[0])
    
  5. Plot, aligning each bar with the left edge of its bin

    plt.bar(age_bins[:-1],surv_rates,width=age_bins[1]-age_bins[0],align='edge')
    plt.xlabel('Age')
    plt.ylabel('Survival Rate')
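
The steps above can be condensed into one helper. This is only a minimal sketch: the function name survival_rate_per_bin and the toy arrays are made up for illustration and are not part of the original answer or of the Titanic data. It counts survivors and all passengers per bin with np.histogram, then divides while leaving empty bins as NaN.

```python
import numpy as np

def survival_rate_per_bin(ages, survived, bin_edges):
    """Return the fraction of survivors in each age bin (NaN for empty bins)."""
    ages = np.asarray(ages, dtype=float)
    survived = np.asarray(survived, dtype=bool)
    known = ~np.isnan(ages)  # ignore passengers whose age is missing
    surv_counts, _ = np.histogram(ages[known & survived], bins=bin_edges)
    all_counts, _ = np.histogram(ages[known], bins=bin_edges)
    rates = np.full(len(all_counts), np.nan)
    # Divide only where the bin is non-empty; other entries stay NaN
    np.divide(surv_counts, all_counts, out=rates, where=all_counts > 0)
    return rates

# Made-up toy data, not the Titanic dataset
ages = [5, 8, 15, 18, 25, np.nan, 35]
survived = [1, 0, 1, 1, 0, 1, 0]
edges = np.array([0, 10, 20, 30, 40, 50])
rates = survival_rate_per_bin(ages, survived, edges)
print(rates)
```

The resulting array can be passed to plt.bar in the same way as surv_rates in step 5.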
    
Answer by piRSquared

pd.Series.hist uses np.histogram underneath.

Let's explore that

import numpy as np
import pandas as pd

np.random.seed([3,1415])
s = pd.Series(np.random.randn(100))
# density=True replaces normed=True, which has been removed from recent NumPy releases
d = np.histogram(s, density=True)
print('\nthese are the normalized counts\n')
print(d[0])
print('\nthese are the bin edges\n')
print(d[1])

these are the normalized counts

[ 0.11552497  0.18483996  0.06931498  0.32346993  0.39278491  0.36967992
  0.32346993  0.25415494  0.25415494  0.02310499]

these are the bin edges

[-2.25905503 -1.82624818 -1.39344133 -0.96063448 -0.52782764 -0.09502079
  0.33778606  0.77059291  1.20339976  1.6362066   2.06901345]

We can plot these against the bin centers, computed as the mean of each pair of adjacent bin edges:

pd.Series(d[0], pd.Series(d[1]).rolling(2).mean().dropna().round(2).values).plot.bar()

ACTUAL ANSWER

Or we could simply have passed density=True (spelled normed=True in older pandas and matplotlib releases) to the pd.Series.hist method, which passes it along, via matplotlib, to np.histogram.

s.hist(density=True)
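
Note that a normalized (density) histogram is not the same thing as a percentage per bin: the heights are scaled so that the bar areas sum to 1. A minimal sketch of the difference, assuming NumPy's weights argument is used to get percentages; the random data and variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(100)

# Density: height * bin width sums to 1 over all bins
density, edges = np.histogram(values, bins=10, density=True)
area = np.sum(density * np.diff(edges))

# Percent: each observation contributes 100/N, so heights sum to 100
weights = np.full(len(values), 100.0 / len(values))
percent, _ = np.histogram(values, bins=edges, weights=weights)

print(area)
print(percent.sum())
```

matplotlib's hist also accepts a weights argument, so the same array of weights should give a percent axis when plotting directly.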

Answer by Nikos Tavoularis

First of all, it would be better to create a function that splits your data into age groups:

# This function splits a column of ages into predefined age groups
def cutDF(ages):
    return pd.cut(
        ages, [0, 10, 20, 30, 40, 50, 60, 70, 80],
        labels=['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80'])


# Apply it to the Age column directly so that the result is a single Series
data['AgeGroup'] = cutDF(data['Age'])

Then you can plot your graph as follows:

# Importing the relevant function to format the y axis
from matplotlib.ticker import FuncFormatter

# The mean of the 0/1 Survived column is the survival rate of each group
survival_per_age_group = data.groupby('AgeGroup')['Survived'].mean()

# Creating the plot that will show survival % per age group
ax = survival_per_age_group.plot(kind='bar', color='green')
ax.set_title("Survivors by Age Group", fontsize=14, fontweight='bold')
ax.set_xlabel("Age Groups")
ax.set_ylabel("Percentage")
# Recent matplotlib releases expect booleans here rather than 'off'
ax.tick_params(axis='x', top=False)
ax.tick_params(axis='y', right=False)
plt.xticks(rotation='horizontal')

ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
plt.show()
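
To see what the grouping step produces before anything is drawn, here is a small sketch on made-up data. The toy frame and its values are assumptions for illustration, not rows from the Titanic dataset. It bins ages with pd.cut, takes the mean of the 0/1 Survived column per group, and formats the rates as percent strings.

```python
import pandas as pd

# Made-up toy data standing in for the Titanic frame
toy = pd.DataFrame({
    "Age": [4, 9, 15, 19, 25, 28, 33],
    "Survived": [1, 0, 1, 1, 0, 0, 1],
})

# Bin the ages into labelled groups: (0, 10], (10, 20], (20, 30], (30, 40]
toy["AgeGroup"] = pd.cut(
    toy["Age"],
    bins=[0, 10, 20, 30, 40],
    labels=["0-10", "11-20", "21-30", "31-40"],
)

# Mean of a 0/1 column per group is the survival rate of that group
rates = toy.groupby("AgeGroup", observed=False)["Survived"].mean()

# Same percent formatting that the FuncFormatter applies to the y axis
percent_labels = rates.map("{:.0%}".format)
print(percent_labels.to_dict())
```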

Answer by Ted Petrou

The library Dexplot is capable of returning relative frequencies of groups. Currently, you'll need to bin the age variable in pandas with the cut function. You can then use Dexplot.

titanic['age2'] = pd.cut(titanic['age'], range(0, 110, 10))

Pass the variable you would like to count (age2) to the agg parameter. Subdivide the counts with the hue parameter and normalize by age2. Also, this might be a good time for a stacked bar plot.

dxp.aggplot(agg='age2', data=titanic, hue='survived', stacked=True, normalize='age2')
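
If Dexplot is not available, a similar table of per-group shares can be built with plain pandas. This is a swapped-in technique, not what the answer uses: a minimal sketch with pd.crosstab and normalize='index' on made-up data, so that each age bin's row sums to 1.

```python
import pandas as pd

# Made-up toy data; column names mirror those used in the answer
toy = pd.DataFrame({
    "age": [5, 8, 15, 18, 25, 27],
    "survived": [1, 0, 1, 1, 0, 0],
})
toy["age2"] = pd.cut(toy["age"], list(range(0, 40, 10)))

# normalize='index' divides each row by its total, giving shares per age bin
shares = pd.crosstab(toy["age2"], toy["survived"], normalize="index")
print(shares)
```

Calling shares.plot(kind='bar', stacked=True) on the result should then draw one stacked bar per age bin, each with total height 1.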
