Python: Turn a Pandas DataFrame of strings into a histogram

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/14992644/

Tags: python, pandas, matplotlib, dataframe

Asked by amatsukawa

Suppose I have a DataFrame created like this:

import pandas as pd
s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

The strings in the real data are quite sparse. I would like to create histograms of string occurrences for s1 and s2, one per subplot, that look like what d.hist() generates.

Just doing d.hist() gives this error:

/Library/Python/2.7/site-packages/pandas/tools/plotting.pyc in hist_frame(data, column, by, grid, xlabelsize, xrot, ylabelsize, yrot, ax, sharex, sharey, **kwds)
   1725         ax.xaxis.set_visible(True)
   1726         ax.yaxis.set_visible(True)
-> 1727         ax.hist(data[col].dropna().values, **kwds)
   1728         ax.set_title(col)
   1729         ax.grid(grid)

/Library/Python/2.7/site-packages/matplotlib/axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
   8099             # this will automatically overwrite bins,
   8100             # so that each histogram uses the same bins
-> 8101             m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
   8102             if mlast is None:
   8103                 mlast = np.zeros(len(bins)-1, m.dtype)

/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/lib/function_base.pyc in histogram(a, bins, range, normed, weights, density)
    167             else:
    168                 range = (a.min(), a.max())
--> 169         mn, mx = [mi+0.0 for mi in range]
    170         if mn == mx:
    171             mn -= 0.5

TypeError: cannot concatenate 'str' and 'float' objects

I suppose I could manually go through each series, do a value_counts(), then plot it as a bar plot, and manually create the subplots. I wanted to check if there is a simpler way.
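For reference, the manual route described in the question might look roughly like the sketch below. It is only an illustration: the figure size and the use of the non-interactive "Agg" backend are assumptions added here so the snippet runs without a display, not something the question specifies.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

# Create one subplot per column by hand.
fig, axes = plt.subplots(nrows=1, ncols=len(d.columns), figsize=(8, 3))

for ax, col in zip(axes, d.columns):
    # value_counts() drops NaN by default, so the padding row in s1 is ignored.
    counts = d[col].value_counts()
    counts.plot(kind='bar', ax=ax, title=col)
    ax.set_ylabel('count')

fig.tight_layout()
# Call plt.show() or fig.savefig(...) here to view the result.
```

Each column gets its own axes and only the strings that actually occur in that column, which is the behaviour the question asks for, at the cost of writing the loop yourself.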

Answered by tacaswell

I would shove the Series into a collections.Counter (you might need to convert it to a list first). I am not a pandas expert, but I think you should be able to fold the Counter object back into a Series, indexed by the strings, and use that to make your plots.

d.hist() fails because it (rightly) raises an error when it tries to guess where the bin edges should be, which simply makes no sense for strings.
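A minimal sketch of the Counter idea from this answer, using the s1 data from the question. The dropna() call and the sort_index() call are additions made here for illustration; the answer itself does not mention them.

```python
from collections import Counter

import pandas as pd

s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])

# Counter tallies the values while iterating over the Series.
# dropna() matters when the Series is a DataFrame column padded with NaN.
tally = Counter(s1.dropna())

# Counter is a dict subclass, so pandas can build a Series from it,
# indexed by the strings.
counts = pd.Series(tally).sort_index()
print(counts)
```

The resulting Series can then be drawn with counts.plot(kind='bar'), which avoids the bin-edge guessing that breaks hist() on strings.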

Answered by bmu

You can use pd.value_counts (value_counts is also a Series method):

In [20]: d.apply(pd.value_counts)
Out[20]: 
   s1  s2
a   3   3
b   2 NaN
c   1 NaN
d NaN   1
f NaN   3

and then plot the resulting DataFrame.
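The counting step can be sketched as a self-contained script. Two things here are assumptions rather than part of the answer: the Series method is called through a lambda instead of the top-level pd.value_counts, which newer pandas releases deprecate, and the NaN cells are filled with 0 so the table holds integer counts.

```python
import pandas as pd

s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

# Count each column separately; the results are aligned on the union of strings.
counts = d.apply(lambda col: col.value_counts())

# A string missing from a column shows up as NaN; treat that as a count of 0.
counts = counts.fillna(0).astype(int)
print(counts)
```

Filling with 0 is optional. Leaving the NaN values in place keeps the distinction between "never seen in this column" and a true zero.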

Answered by Aman

Recreating the dataframe:

import pandas as pd
s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

To get the histogram with subplots as desired:

d.apply(pd.value_counts).plot(kind='bar', subplots=True)

The OP mentioned pd.value_counts in the question. I think the missing piece is just that there is no reason to "manually" create the desired bar plot.

The output of d.apply(pd.value_counts) is a pandas DataFrame. We can plot its values like any other DataFrame, and passing subplots=True gives us what we want: one bar chart per column.
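Put together as a script, the approach in this answer might look like the sketch below. The "Agg" backend, the figure size, and the lambda wrapper around the Series value_counts method (used in place of the top-level pd.value_counts, which newer pandas releases deprecate) are assumptions added for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import pandas as pd

s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

# Per-column counts, aligned on the union of all strings.
counts = d.apply(lambda col: col.value_counts())

# subplots=True draws each column on its own axes and returns them as an array.
axes = counts.plot(kind='bar', subplots=True, figsize=(6, 5))

# Save or show through the figure that owns the axes, e.g.
# axes[0].get_figure().savefig(...).
```

Because both subplots are drawn from the same DataFrame index, they show the same set of strings on the x axis, and a string that does not occur in a column simply has no bar there.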