pandas: create a matplotlib or seaborn histogram that uses percentages rather than counts?

Question

Asked by WillacyMe

Specifically, I'm dealing with the Kaggle Titanic dataset. I've plotted a stacked histogram that shows the ages of the passengers who survived and died on the Titanic. Code below.

figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Age'], data[data['Survived']==0]['Age']], stacked=True, bins=30, label=['Survived','Dead'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()

I would like to alter the chart to show a single bar per bin whose height is the percentage of that age group that survived. For example, if a bin covers ages 10-20 and 60% of the people aboard the Titanic in that age group survived, the bar should reach 60% on the y-axis.

Edit: I may have given a poor explanation of what I'm looking for. Rather than alter the y-axis values, I'm looking to change the actual height of the bars based on the percentage that survived.

The first bin on the graph shows roughly 65% survived in that age group. I would like this bin to line up against the y-axis at 65%. The following bins look to be 90%, 50%, 10% respectively, and so on.

The graph would end up actually looking something like this:

Answer 1

Accepted answer by bahaugen

Perhaps the following will help ...

Split the dataframe based on 'Survived'

df_survived=df[df['Survived']==1]
df_not_survive=df[df['Survived']==0]

Create Bins
```
age_bins=np.linspace(0,80,21)
```

Use np.histogram to generate histogram data

survived_hist=np.histogram(df_survived['Age'],bins=age_bins,range=(0,80))
not_survive_hist=np.histogram(df_not_survive['Age'],bins=age_bins,range=(0,80))

Calculate survival rate in each bin

surv_rates=survived_hist[0]/(survived_hist[0]+not_survive_hist[0])

Plot

plt.bar(age_bins[:-1],surv_rates,width=age_bins[1]-age_bins[0],align='edge')
plt.xlabel('Age')
plt.ylabel('Survival Rate')
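
The steps above can be pulled together into one small function. This is only a sketch under a few assumptions: the helper name `survival_rate_per_bin` and the toy data are made up for illustration, missing ages are dropped, and empty bins are left as NaN instead of triggering a divide-by-zero warning.

```python
import numpy as np

def survival_rate_per_bin(ages, survived, bin_edges):
    """Return the fraction of survivors in each age bin (NaN for empty bins)."""
    ages = np.asarray(ages, dtype=float)
    survived = np.asarray(survived, dtype=bool)
    known = ~np.isnan(ages)  # ignore passengers whose age is missing

    # Count survivors and all passengers using the same bin edges.
    surv_counts, _ = np.histogram(ages[known & survived], bins=bin_edges)
    all_counts, _ = np.histogram(ages[known], bins=bin_edges)

    # Divide only where a bin is non-empty; empty bins stay NaN.
    rates = np.full(all_counts.shape, np.nan)
    np.divide(surv_counts, all_counts, out=rates, where=all_counts > 0)
    return rates

# Hypothetical toy data, not taken from the Titanic dataset.
ages = [5, 8, 15, 18, 19, 25, np.nan, 35]
survived = [1, 0, 1, 1, 0, 0, 1, 1]
edges = np.array([0, 10, 20, 30, 40, 50])

rates = survival_rate_per_bin(ages, survived, edges)
print(rates)
```

The resulting array can be passed to plt.bar in place of surv_rates, with edges[:-1] as the x positions.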

Answer 2

Answer by piRSquared

pd.Series.hist uses np.histogram underneath.

Let's explore that:

np.random.seed([3,1415])
s = pd.Series(np.random.randn(100))
d = np.histogram(s, density=True)
print('\nthese are the normalized counts\n')
print(d[0])
print('\nthese are the bin edges\n')
print(d[1])

these are the normalized counts

[ 0.11552497  0.18483996  0.06931498  0.32346993  0.39278491  0.36967992
  0.32346993  0.25415494  0.25415494  0.02310499]

these are the bin edges

[-2.25905503 -1.82624818 -1.39344133 -0.96063448 -0.52782764 -0.09502079
  0.33778606  0.77059291  1.20339976  1.6362066   2.06901345]

We can plot these against the bin midpoints, computed as the mean of each pair of adjacent bin edges:

pd.Series(d[0], pd.Series(d[1]).rolling(2).mean().dropna().round(2).values).plot.bar()

ACTUAL ANSWER
OR

We could simply have passed density=True to the pd.Series.hist method, which passes it along to np.histogram. (Older versions called this parameter normed; it has since been removed from NumPy and matplotlib.)

s.hist(density=True)
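
Note that a density-normalized histogram makes the bar areas sum to 1, which is not the same as each bar showing the share of observations in its bin. If per-bin fractions are what you want, one common option is the weights argument. The sketch below assumes made-up random data and variable names chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=100)  # hypothetical sample data

# density=True: bar heights are scaled so the total AREA is 1.
density, edges = np.histogram(values, bins=10, density=True)

# weights of 1/N: each bar height is the FRACTION of observations in that bin.
weights = np.full(values.shape, 1.0 / values.size)
fractions, _ = np.histogram(values, bins=edges, weights=weights)

# The two are related through the bin widths.
bin_widths = np.diff(edges)
print(np.allclose(fractions, density * bin_widths))
```

The same weights array can be handed to s.hist(weights=...) or plt.hist(..., weights=...), since both forward it to np.histogram.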

Answer 3

Answer by Nikos Tavoularis

First of all, it would be better to create a function that splits your data into age groups:

# This function splits a column of ages into predefined age groups
def cutDF(ages):
    return pd.cut(
        ages, [0, 10, 20, 30, 40, 50, 60, 70, 80],
        labels=['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80'])


data['AgeGroup'] = cutDF(data['Age'])

Then you can plot your graph as follows:

survival_per_age_group = data.groupby('AgeGroup')['Survived'].mean()

# Creating the plot that will show survival % per age group
ax = survival_per_age_group.plot(kind='bar', color='green')
ax.set_title("Survivors by Age Group", fontsize=14, fontweight='bold')
ax.set_xlabel("Age Groups")
ax.set_ylabel("Percentage")
ax.tick_params(axis='x', top=False)
ax.tick_params(axis='y', right=False)
plt.xticks(rotation='horizontal')

# Importing the relevant function to format the y axis
from matplotlib.ticker import FuncFormatter

ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
plt.show()
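
The groupby step works because the mean of a 0/1 column is the fraction of ones, so the mean of Survived within an age group is that group's survival rate. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not from the Titanic dataset):

```python
import pandas as pd

# Hypothetical toy data: six passengers in three age groups.
toy = pd.DataFrame({
    "Age": [4, 9, 15, 18, 19, 25],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# Bin ages into right-closed intervals (0, 10], (10, 20], (20, 30].
toy["AgeGroup"] = pd.cut(
    toy["Age"], bins=[0, 10, 20, 30], labels=["0-10", "11-20", "21-30"]
)

# Mean of the 0/1 Survived column per group is the survival fraction.
rate = toy.groupby("AgeGroup", observed=False)["Survived"].mean()
percent = (rate * 100).round(1)
print(percent)
```

Here `observed=False` keeps every age-group label in the result even if no passenger falls into it; such groups show up as NaN.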

Answer 4

Answer by Ted Petrou

The library Dexplot is capable of returning relative frequencies of groups. Currently, you'll need to bin the age variable in pandas with the cut function. You can then use Dexplot.

titanic['age2'] = pd.cut(titanic['age'], range(0, 110, 10))

Pass the variable you would like to count (age2) to the agg parameter. Subdivide the counts with the hue parameter and normalize by age2. Also, this might be a good time for a stacked bar plot.

dxp.aggplot(agg='age2', data=titanic, hue='survived', stacked=True, normalize='age2')
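
If Dexplot is not available, similar within-group relative frequencies can be computed in plain pandas with pd.crosstab and normalize="index". This is an alternative sketch, not how Dexplot works internally; the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the answer above.
toy = pd.DataFrame({
    "age": [5, 8, 15, 18, 19, 25],
    "survived": [1, 0, 1, 1, 0, 0],
})
toy["age2"] = pd.cut(toy["age"], [0, 10, 20, 30])

# normalize="index" divides each row by its row total, so every age
# group's shares of survived == 0 and survived == 1 add up to 1.
shares = pd.crosstab(toy["age2"], toy["survived"], normalize="index")
print(shares)
```

Calling shares.plot(kind="bar", stacked=True) then draws one stacked bar per age group, each with a total height of 1.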
