Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/49799356/

Time: 2020-09-14 05:27:23  Source: igfitidea

pandas - Plot distribution of column variable

python, pandas, visualization

Asked by Mike S

I'm trying to visualize some data, but I'm not very experienced with the subject, and am having trouble finding the best way to get what I'm looking for. I've searched around and found similar questions, but nothing that answers exactly what I want, so hopefully I'm not duplicating a common question.

Anyway, I have a DataFrame with a column for patient_id (and others, but this is the relevant one). For example:

   patient_id  other_stuff
0      000001          ...
1      000001          ...
2      000001          ...
3      000002          ...
4      000003          ...
5      000003          ...
6      000004          ...
etc

Each row represents a specific episode that the patient had. I want to plot the distribution in which the x axis is the number of episodes a patient had, and the y axis is the number of patients who have had that number of episodes. For example, based on the above, there's one patient with three episodes, one patient with two episodes, and two patients with one episode each, i.e. x = [1, 2, 3], y = [2, 1, 1]. Currently, I do the following:

episode_count_distribution = (
    patients.patient_id
    .value_counts() # the number of rows for each patient_id (i.e. episodes per patient)
    .value_counts() # the number of patients for each possible row count above (i.e. distribution of episodes per patient)
    .sort_index()
)
episode_count_distribution.plot()

This method does what I want, but strikes me as a bit opaque and hard to follow, so I'm wondering if there's a better way.
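One way to make the two counting steps easier to follow is to give each intermediate result a name. The following is only a sketch: the `patients` frame is hypothetical sample data copied from the table in the question, and `groupby(...).size()` is swapped in for the first `value_counts()` because it counts rows per patient in the same way.

```python
import pandas as pd

# Hypothetical sample data mirroring the table in the question.
patients = pd.DataFrame(
    {
        "patient_id": [
            "000001", "000001", "000001",
            "000002",
            "000003", "000003",
            "000004",
        ]
    }
)

# Step 1: episodes per patient (one row per episode, so count rows per id).
episodes_per_patient = patients.groupby("patient_id").size()

# Step 2: number of patients for each episode count, ordered by episode count.
episode_count_distribution = episodes_per_patient.value_counts().sort_index()

print(episode_count_distribution.index.tolist())  # [1, 2, 3]
print(episode_count_distribution.tolist())        # [2, 1, 1]
```

With matplotlib installed, calling `episode_count_distribution.plot(kind="bar")` on the result should draw one bar per episode count, which may read more naturally than the default line plot for discrete counts.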

Answered by Ami Tavory

You might be looking for something like

df.procedure_id.groupby(df.patient_id).nunique().hist();

Explanation:

  • df.procedure_id.groupby(df.patient_id).nunique() finds the number of unique procedures per patient.

  • hist() plots a histogram.
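To see what the histogram is built from, it can help to print the intermediate Series. The sketch below uses the same values as the answer's example and also shows `size()` for comparison; the variable names are illustrative and not part of the original answer. `nunique()` ignores repeated procedure ids within a patient, while `size()` counts every row.

```python
import pandas as pd

# Same values as the answer's example; patient 2 has procedure 2 recorded twice.
df = pd.DataFrame(
    {
        "procedure_id": [3, 2, 3, 2, 4, 1, 2, 3],
        "patient_id": [1, 2, 3, 2, 1, 2, 3, 2],
    }
)

per_patient = df.procedure_id.groupby(df.patient_id)

unique_procedures = per_patient.nunique()  # distinct procedure ids per patient
total_rows = per_patient.size()            # all rows per patient, repeats included

print(unique_procedures.tolist())  # [2, 3, 2] for patients 1, 2, 3
print(total_rows.tolist())         # [2, 4, 2] for patients 1, 2, 3
```

If every row is a separate episode, as in the question, `size()` is probably the count to plot; `nunique()` fits when repeated ids should be counted once.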

Example

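import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'procedure_id': [3, 2, 3, 2, 4, 1, 2, 3], 'patient_id': [1, 2, 3, 2, 1, 2, 3, 2]})
# Count distinct procedures per patient, then plot a histogram of those counts.
df.procedure_id.groupby(df.patient_id).nunique().hist()
plt.xlabel('num treatments')  # x axis: distinct procedures per patient
plt.ylabel('num patients')    # y axis: patients with that many procedures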
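Because the counts are integers, the default of 10 equal-width bins in `hist()` may not line up with them. A possible sketch, assuming the hypothetical `patients` sample from the question, builds one bin per integer and checks the bin counts with `numpy.histogram` before plotting:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the table in the question.
patients = pd.DataFrame(
    {
        "patient_id": [
            "000001", "000001", "000001",
            "000002",
            "000003", "000003",
            "000004",
        ]
    }
)

# Rows per patient; the question has one row per episode.
episodes_per_patient = patients.groupby("patient_id").size()

# One bin per integer count: edges at 0.5, 1.5, ..., max + 0.5.
bin_edges = np.arange(0.5, episodes_per_patient.max() + 1.5)
patients_per_bin, _ = np.histogram(episodes_per_patient, bins=bin_edges)

print(bin_edges.tolist())         # [0.5, 1.5, 2.5, 3.5]
print(patients_per_bin.tolist())  # [2, 1, 1]

# With matplotlib installed, the same edges can be passed to the plot:
# episodes_per_patient.hist(bins=bin_edges)
```

The bin counts match the x = [1, 2, 3], y = [2, 1, 1] distribution described in the question, so the histogram and the double `value_counts()` approach should agree on this data.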
