Original question: http://stackoverflow.com/questions/37598665/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
How to plot a value_counts in pandas that has a huge number of different counts not distributed evenly
Asked by user3139545
Let's say I have the following data:
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
s2.value_counts(normalize=True).plot()
What I want to show in the plot is that a few numbers make up the majority of cases. The problem is that this shows up only at the far left of the graph, followed by a flat line for all the other categories. In the real data the x axis will be categorical with about 18000 categories; 4% of the counts will be around 10000, and the rest will drop off to around 50.
I want to show this to an audience of "ordinary" business people, so it can't be some fancy, hard-to-read solution.
Update: see @unutbu's answer.
I updated the code, and I am getting an error from qcut when trying to use tuples:
TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
df = pd.DataFrame({'s1':[1,0,1,0], 's2':[1,0,1,1], 's3':[1,0,1,1], 's4':[0,0,0,1]})
perms = df.apply(tuple, axis=1)
prob = perms.value_counts(normalize=True).reset_index(drop='True')
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
Answered by unutbu
You could keep the normalized value counts above a certain threshold. Then sum together the values below the threshold and clump them together in one category which could be called, say, "other".
By choosing threshold high enough, you will be able to display the most important contributors to the overall probability distribution, while still showing the size of the tail in the bar labeled "other":
import matplotlib.pyplot as plt
import pandas as pd
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
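# Share of observations for each distinct value
prob = s2.value_counts(normalize=True)
# Keep only the values whose share exceeds the threshold
threshold = 0.02
mask = prob > threshold
# Lump everything at or below the threshold into a single "other" entry
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob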
prob.plot(kind='bar')
plt.xticks(rotation=25)
plt.show()
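The thresholding step above can be wrapped in a small helper so it is easy to try several cut-offs. This is only a sketch: the function name lump_tail, its other_label parameter, and the ten-element sample series are made up for illustration and are not part of the original answer.

```python
import pandas as pd


def lump_tail(prob, threshold, other_label="other"):
    """Keep entries above `threshold`; sum the rest into one extra entry.

    Hypothetical helper for illustration, not part of pandas.
    """
    mask = prob > threshold
    head = prob.loc[mask]
    tail_total = prob.loc[~mask].sum()
    # Build the extra entry as its own Series so the resulting index can
    # hold both the original labels and the string label.
    return pd.concat([head, pd.Series({other_label: tail_total})])


# Illustrative data: 0 appears 4 times, 1 appears 3 times, 2/3/4 once each.
s = pd.Series([0, 0, 0, 0, 1, 1, 1, 2, 3, 4])
prob = s.value_counts(normalize=True)
lumped = lump_tail(prob, threshold=0.15)
print(lumped)
```

Calling lumped.plot(kind='bar') would then draw one bar per retained value plus a single bar for the tail. A higher threshold gives fewer bars and a larger "other" bar.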
There is a limit to the number of category labels you can sensibly display on a bar graph. For a normal-sized graph 3000 is way too many. Moreover, it is probably not reasonable to expect an audience to glean any meaning out of reading 3000 labels.
The graph should summarize the data. And the main point seems to be that 4 or 5% of the categories constitute the vast majority of the cases. So to drive home that point, perhaps use pd.qcut to categorize the cases into simple categories such as bottom 25%, mid 70%, and top 5%:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
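# Simulate 18000 categories: 4% with counts near 10000, the rest below 100
N = 18000
categories = np.arange(N)
np.random.shuffle(categories)
M = int(N*0.04)
prob = pd.Series(np.concatenate([np.random.randint(9000, 11000, size=M),
                                 np.random.randint(0, 100, size=N-M)]),
                 index=categories)
# Normalize the counts so they sum to 1
prob /= prob.sum()
# Bin the categories by quantile of their probability
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
                           labels=['bottom 25%', 'mid 70%', 'top 5%'])
# Total probability held by each quantile group
prob_groups = prob.groupby(category_classes).sum()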
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
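The same point can also be stated as a single number to go with the chart: the share of all cases held by the largest few percent of categories. The sketch below assumes simulated counts shaped like the ones described in the question. The helper top_share is a hypothetical name, not a pandas function.

```python
import numpy as np
import pandas as pd


def top_share(counts, top_fraction):
    """Share of the total held by the largest `top_fraction` of categories.

    Hypothetical helper for illustration.
    """
    ordered = counts.sort_values(ascending=False)
    k = max(1, int(round(len(ordered) * top_fraction)))
    return ordered.iloc[:k].sum() / ordered.sum()


# Simulated data (an assumption): 4% of 18000 categories have counts
# near 10000, the rest have counts below 100.
rng = np.random.default_rng(0)
N = 18000
M = int(N * 0.04)
counts = pd.Series(np.concatenate([rng.integers(9000, 11000, size=M),
                                   rng.integers(0, 100, size=N - M)]))

share = top_share(counts, 0.04)
print(f"top 4% of categories hold {share:.1%} of all cases")
```

A sentence like the printed one may be easier for a non-technical audience to take in than any axis labels, and it can serve as the chart title.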
Answered by BrutalGames
Just log the axis (I have no pandas, but it should be similar):
import numpy as np
import matplotlib.pyplot as plt
s2 = np.log([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
plt.plot(s2)
plt.show()
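With pandas available, one way to read "log the axis" is to leave the data untouched and put the value counts on a log-scaled y axis, which avoids taking the log of zero values (np.log(0) is -inf). This is a sketch under assumptions: the sample series is illustrative rather than the data from the question, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data: a few very common values and a long tail of rare ones.
s = pd.Series([2] * 40 + [0] * 16 + [3] * 9 + [1] * 6 + [11] * 2
              + [333, 434, 30000])

counts = s.value_counts()
# logy=True asks pandas to draw the y axis on a log scale, so counts of 1
# and counts of 40 are both visible in the same chart.
ax = counts.plot(kind="bar", logy=True)
ax.set_xlabel("value")
ax.set_ylabel("count (log scale)")
plt.tight_layout()
# In an interactive session, call plt.show() here.
```

A log axis keeps every category visible, but bar heights no longer compare linearly, which may need a word of explanation for a business audience.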