Python: What is the difference between pandas.qcut and pandas.cut?

Question

Asked by WillZ

The documentation says:

http://pandas.pydata.org/pandas-docs/dev/basics.html

"Continuous values can be discretized using the cut (bins based on values) and qcut (bins based on sample quantiles) functions"

Sounds very abstract to me... I can see the differences in the example below but what does qcut (sample quantile) actually do/mean? When would you use qcut versus cut?

Thanks.

import numpy as np
import pandas as pd

factors = np.random.randn(30)

In [11]:
pd.cut(factors, 5)
Out[11]:
[(-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (0.575, 1.561], ..., (-0.411, 0.575], (-1.397, -0.411], (0.575, 1.561], (-2.388, -1.397], (-0.411, 0.575]]
Length: 30
Categories (5, object): [(-2.388, -1.397] < (-1.397, -0.411] < (-0.411, 0.575] < (0.575, 1.561] < (1.561, 2.547]]

In [14]:
pd.qcut(factors, 5)
Out[14]:
[(-0.348, 0.0899], (-0.348, 0.0899], (0.0899, 1.19], (0.0899, 1.19], (0.0899, 1.19], ..., (0.0899, 1.19], (-1.137, -0.348], (1.19, 2.547], [-2.383, -1.137], (-0.348, 0.0899]]
Length: 30
Categories (5, object): [[-2.383, -1.137] < (-1.137, -0.348] < (-0.348, 0.0899] < (0.0899, 1.19] < (1.19, 2.547]]

Answer 1

Accepted answer by JohnE

To begin, note that "quantile" is just the most general term for things like percentiles, quartiles, and medians. You specified five bins in your example, so you are asking qcut for quintiles.

So, when you ask for quintiles with qcut, the bins will be chosen so that you have the same number of records in each bin. You have 30 records, so you should have 6 in each bin (your output should look like this, although the breakpoints will differ due to the random draw):

pd.qcut(factors, 5).value_counts()

[-2.578, -0.829]    6
(-0.829, -0.36]     6
(-0.36, 0.366]      6
(0.366, 0.868]      6
(0.868, 2.617]      6
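
As a rough check of why the counts come out equal, the sketch below compares the edges qcut reports with the sample quintiles computed directly by NumPy. The seeded generator and the variable names are assumptions made here for reproducibility; they are not part of the original question.

```python
import numpy as np
import pandas as pd

# Seeded draw so the run is repeatable (the seed value is arbitrary).
rng = np.random.default_rng(0)
factors = rng.standard_normal(30)

# retbins=True also returns the bin edges that qcut chose.
binned, edges = pd.qcut(factors, 5, retbins=True)

# Quintile edges computed directly: the 0%, 20%, ..., 100% sample quantiles.
expected_edges = np.quantile(factors, np.linspace(0, 1, 6))

# Number of records that landed in each of the five bins.
counts = np.bincount(binned.codes, minlength=5)

print(np.allclose(np.asarray(edges), expected_edges))
print(counts)
```

With 30 distinct values and five bins, each bin should hold six records, whatever the seed.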

Conversely, for cut, you will see something more uneven:

pd.cut(factors, 5).value_counts()

(-2.583, -1.539]    5
(-1.539, -0.5]      5
(-0.5, 0.539]       9
(0.539, 1.578]      9
(1.578, 2.617]      2

That's because cut will choose the bins to be evenly spaced according to the values themselves and not the frequency of those values. Hence, because you drew from a normal distribution, you'll see higher frequencies in the inner bins and fewer in the outer. This is essentially going to be a tabular form of a histogram (which you would expect to be fairly bell shaped with 30 records).
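
To illustrate the histogram remark, here is a minimal sketch that compares the per-bin counts from cut with the counts from np.histogram on the same data. The seeded sample is an assumption for reproducibility. The two functions close their intervals on opposite sides, so the counts are only expected to agree when no value sits exactly on an interior edge, which is the usual case for continuous random data.

```python
import numpy as np
import pandas as pd

# Seeded draw so the run is repeatable (the seed value is arbitrary).
rng = np.random.default_rng(0)
factors = rng.standard_normal(30)

# Five equal-width bins spanning the range of the data.
binned = pd.cut(factors, 5)
cut_counts = np.bincount(binned.codes, minlength=5)

# The same five equal-width bins, counted as a histogram.
hist_counts, hist_edges = np.histogram(factors, bins=5)

print(cut_counts)
print(hist_counts)
```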

Answer 2

Answered by Mir H.

So qcut ensures a more even distribution of the values across the bins, even if the values cluster in the sample space. This means you are less likely to end up with one bin full of very close values and another bin with no values at all. In that sense, qcut gives more balanced bins.

Answer 3

Answered by Ashish Anand

The cut command creates equally spaced bins, but the frequency of samples is unequal across bins.
The qcut command creates bins of unequal width, but the frequency of samples is equal in each bin.

    >>> x=np.array([24,  7,  2, 25, 22, 29])
    >>> x
    array([24,  7,  2, 25, 22, 29])

    >>> pd.cut(x, 3).value_counts()  # bins have an equal width of 9
    (2, 11.0]        2
    (11.0, 20.0]     0
    (20.0, 29.0]     4

    >>> pd.qcut(x, 3).value_counts()  # equal frequency of 2 in each bin
    (1.999, 17.0]     2
    (17.0, 24.333]    2
    (24.333, 29.0]    2
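
To make the equal-width versus equal-frequency contrast explicit, the following sketch asks both functions for their bin edges on the same array via retbins=True and prints the bin widths. The variable names are chosen here for illustration. Note that cut places its lowest edge slightly below the minimum so that the smallest value is included, which makes the first bin marginally wider than the others.

```python
import numpy as np
import pandas as pd

x = np.array([24, 7, 2, 25, 22, 29])

# retbins=True returns the bin edges alongside the binned values.
_, cut_edges = pd.cut(x, 3, retbins=True)
_, qcut_edges = pd.qcut(x, 3, retbins=True)

cut_edges = np.asarray(cut_edges, dtype=float)
qcut_edges = np.asarray(qcut_edges, dtype=float)

# cut: widths are (nearly) identical; qcut: widths vary with the data.
print(np.diff(cut_edges))
print(np.diff(qcut_edges))
```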

Answer 4

Answered by Aditya Anand

pd.qcut distributes the elements of an array so that each bin receives about (no. of elements in array)/(no. of bins) elements, filling the bins one after another in order of sorted value.

pd.cut divides the range of the values into bins of equal width, (largest value - smallest value)/(no. of bins), and then assigns each element to the bin whose range of values it falls in.
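
The "fill the bins in sorted order" description of qcut can be sketched as follows. This is only an illustration under two assumptions: the array has no tied values, and its length is divisible by the number of bins. In that case, sorting the array and splitting it into equal chunks gives the same groups that qcut produces.

```python
import numpy as np
import pandas as pd

x = np.array([24, 7, 2, 25, 22, 29])
n_bins = 3

# Manual approach: sort, then split into n_bins chunks of equal size.
chunks = np.array_split(np.sort(x), n_bins)

# qcut approach: labels=False returns the integer bin index of each element.
codes = pd.qcut(x, n_bins, labels=False)
groups = [np.sort(x[codes == i]) for i in range(n_bins)]

for chunk, group in zip(chunks, groups):
    print(chunk, group)
```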
