Python: How to plot value_counts in Pandas when there are many distinct values with unevenly distributed counts

Question

Asked by user3139545

Let's say I have the following data:

s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
s2.value_counts(normalize=True).plot()

What I want to show in the plot is that a few numbers make up the majority of cases. The problem is that this is only visible at the far left of the graph, followed by a flat line for all the other categories. In the real data the x axis will be categorical with about 18000 categories; about 4% of them will have counts around 10000, and the rest will drop off to around 50.

I want to show this to an audience of "ordinary" business people, so it can't be some fancy, hard-to-read solution.

Update: see @unutbu's answer. With the updated code below, I'm getting an error from qcut when trying to use tuples:

TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'

df = pd.DataFrame({'s1':[1,0,1,0], 's2':[1,0,1,1], 's3':[1,0,1,1], 's4':[0,0,0,1]})
perms = df.apply(tuple, axis=1)
prob = perms.value_counts(normalize=True).reset_index(drop='True')
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.], 
                 labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()

Answer 1

Answered by unutbu

You could keep the normalized value counts above a certain threshold. Then sum the values below the threshold and lump them together in one category which could be called, say, "other".

By choosing threshold high enough, you will be able to display the most important contributors to the overall probability distribution, while still showing the size of the tail in the bar labeled "other":

import matplotlib.pyplot as plt
import pandas as pd
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
prob = s2.value_counts(normalize=True)
threshold = 0.02
mask = prob > threshold
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob
prob.plot(kind='bar')
plt.xticks(rotation=25)
plt.show()
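If picking a numeric threshold is awkward, a related variant is to keep a fixed number of the most frequent values and lump everything else into "other". The sketch below is not part of the original answer: the helper name top_n_with_other, its parameters, and the small sample Series are hypothetical and only illustrate the idea.

```python
import pandas as pd

def top_n_with_other(s, n=5, other_label="other"):
    """Return normalized value counts for the n most frequent values of s,
    plus one extra entry holding the combined share of all other values.
    (Hypothetical helper, not from the original answer.)"""
    prob = s.value_counts(normalize=True)
    top = prob.nlargest(n)
    # Everything that did not make the top n is summed into one number.
    tail_prob = prob.drop(top.index).sum()
    return pd.concat([top, pd.Series({other_label: tail_prob})])

# Small made-up sample: 0 appears 5 times, 2 four times, 1 three times,
# and 7, 8, 9 once each (15 values in total).
sample = pd.Series([0] * 5 + [2] * 4 + [1] * 3 + [7, 8, 9])
result = top_n_with_other(sample, n=2)
print(result)
```

The result can be passed to result.plot(kind='bar') in the same way as prob above. A fixed n keeps the number of bars predictable, whereas a threshold keeps the cut-off meaningful in terms of probability.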

There is a limit to the number of category labels you can sensibly display on a bar graph. For a normal-sized graph, 3000 is far too many. Moreover, it is probably not reasonable to expect an audience to glean any meaning from reading 3000 labels.

The graph should summarize the data. And the main point seems to be that 4 or 5% of the categories constitute the vast majority of the cases. So to drive home that point, perhaps use pd.qcut to sort the cases into simple categories such as bottom 25%, mid 70%, and top 5%:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

N = 18000
categories = np.arange(N)
np.random.shuffle(categories)
M = int(N*0.04)
prob = pd.Series(np.concatenate([np.random.randint(9000, 11000, size=M),
                      np.random.randint(0, 100, size=N-M), ]), index=categories)
prob /= prob.sum()
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.], 
                 labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
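To back the chart with a single number, the share of cases covered by the most frequent categories can be computed directly with a cumulative sum. This is a minimal sketch, not part of the original answer; the counts below are made up to mirror the question (4% of categories at about 10000 cases, the rest at about 50), and the names counts, cum_share, and top_share are illustrative.

```python
import pandas as pd

# Hypothetical counts: 4 of 100 categories have 10000 cases each,
# the remaining 96 have 50 cases each.
counts = pd.Series([10000] * 4 + [50] * 96,
                   index=[f"cat_{i}" for i in range(100)])

# Sort from most to least frequent, convert to shares of all cases,
# and accumulate so each entry is the share covered up to that category.
share = counts.sort_values(ascending=False) / counts.sum()
cum_share = share.cumsum()

# Share of all cases covered by the top 4% of categories.
top_n = int(round(0.04 * len(counts)))
top_share = cum_share.iloc[top_n - 1]
print(f"Top {top_n} of {len(counts)} categories cover {top_share:.1%} of cases")
```

For these made-up numbers the arithmetic is 40000 / (40000 + 4800), so the top 4 categories cover about 89.3% of cases. A statement like that is often easier for a non-technical audience than any axis scaling.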

Answer 2

Answered by BrutalGames

Just log the axis (I don't have pandas, but it should be similar):

import numpy as np
import matplotlib.pyplot as plt

s2 = np.log([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
plt.plot(s2)
plt.show()
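Note that the snippet above takes the logarithm of the raw values, which include zeros, so np.log returns -inf for those entries. Another way to read "log the axis" is to leave the data alone and put the counts on a log-scaled y axis. The sketch below assumes pandas is available; the helper name plot_counts_logy and the sample Series are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_counts_logy(s, ax=None):
    """Bar-plot the value counts of s on a log-scaled y axis.
    (Hypothetical helper, not from the original answer.)"""
    counts = s.value_counts()
    ax = counts.plot(kind="bar", ax=ax)
    # Only the axis is log-scaled; the plotted counts stay untransformed.
    ax.set_yscale("log")
    ax.set_ylabel("count (log scale)")
    return ax

# Made-up sample whose counts span three orders of magnitude.
sample = pd.Series([0] * 1000 + [1] * 100 + [2] * 10 + [3])
ax = plot_counts_logy(sample)
```

Call plt.show() afterwards to display the figure. Value counts are always at least 1, so the log scale has no zeros to deal with, but a log axis may still need a word of explanation for a business audience.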
