Python: What is the difference between pandas.qcut and pandas.cut?

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/30211923/

Date: 2020-08-19 08:03:46  Source: igfitidea

What is the difference between pandas.qcut and pandas.cut?

python, pandas

Asked by WillZ

The documentation says:

http://pandas.pydata.org/pandas-docs/dev/basics.html

"Continuous values can be discretized using the cut (bins based on values) and qcut (bins based on sample quantiles) functions"

Sounds very abstract to me... I can see the differences in the example below but what does qcut (sample quantile) actually do/mean? When would you use qcut versus cut?

Thanks.

import numpy as np
import pandas as pd

factors = np.random.randn(30)

In [11]:
pd.cut(factors, 5)
Out[11]:
[(-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (0.575, 1.561], ..., (-0.411, 0.575], (-1.397, -0.411], (0.575, 1.561], (-2.388, -1.397], (-0.411, 0.575]]
Length: 30
Categories (5, object): [(-2.388, -1.397] < (-1.397, -0.411] < (-0.411, 0.575] < (0.575, 1.561] < (1.561, 2.547]]

In [14]:
pd.qcut(factors, 5)
Out[14]:
[(-0.348, 0.0899], (-0.348, 0.0899], (0.0899, 1.19], (0.0899, 1.19], (0.0899, 1.19], ..., (0.0899, 1.19], (-1.137, -0.348], (1.19, 2.547], [-2.383, -1.137], (-0.348, 0.0899]]
Length: 30
Categories (5, object): [[-2.383, -1.137] < (-1.137, -0.348] < (-0.348, 0.0899] < (0.0899, 1.19] < (1.19, 2.547]]

Accepted answer by JohnE

To begin, note that "quantile" is just the most general term for things like percentiles, quartiles, and medians. You specified five bins in your example, so you are asking qcut for quintiles.

So, when you ask for quintiles with qcut, the bins will be chosen so that you have the same number of records in each bin. You have 30 records, so should have 6 in each bin (your output should look like this, although the breakpoints will differ due to the random draw):

pd.qcut(factors, 5).value_counts()

[-2.578, -0.829]    6
(-0.829, -0.36]     6
(-0.36, 0.366]      6
(0.366, 0.868]      6
(0.868, 2.617]      6
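
To make the quintile point reproducible, here is a minimal sketch that assumes a seeded NumPy generator (the seed value is arbitrary and not part of the original question). It also illustrates that an integer number of bins corresponds to passing the quantile levels explicitly:

```python
import numpy as np
import pandas as pd

# Seeded generator so the draw is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)
factors = rng.standard_normal(30)

# labels=False returns the bin index (0..4) of each value instead of an interval.
# An integer q asks for q equal-frequency bins ...
by_count = pd.qcut(factors, 5, labels=False)
# ... which corresponds to listing the quintile levels yourself.
by_levels = pd.qcut(factors, [0, 0.2, 0.4, 0.6, 0.8, 1.0], labels=False)

counts = np.bincount(by_count, minlength=5)
print(counts)                               # records per bin
print(np.array_equal(by_count, by_levels))  # same assignment either way
```

With 30 distinct values and five bins, each bin should receive six records, whatever the seed.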

Conversely, for cut you will see something more uneven:

pd.cut(factors, 5).value_counts()

(-2.583, -1.539]    5
(-1.539, -0.5]      5
(-0.5, 0.539]       9
(0.539, 1.578]      9
(1.578, 2.617]      2

That's because cut chooses bins that are evenly spaced according to the values themselves, not the frequency of those values. Hence, because you drew from a random normal, you'll see higher frequencies in the inner bins and lower frequencies in the outer ones. This is essentially a tabular form of a histogram (which you would expect to be fairly bell shaped with 30 records).
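
The histogram comparison can be checked directly. The sketch below assumes a seeded draw (arbitrary seed) and compares the per-bin counts from cut with those from np.histogram, which also uses equal-width bins between the minimum and maximum:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
factors = rng.standard_normal(30)

# labels=False gives each value's bin index; retbins=True also returns the edges.
binned, edges = pd.cut(factors, 5, labels=False, retbins=True)
cut_counts = np.bincount(binned, minlength=5)

# np.histogram splits [min, max] into equal-width bins as well.
hist_counts, hist_edges = np.histogram(factors, bins=5)

print(np.diff(edges))  # bin widths: equal, apart from a slightly widened first bin
print(cut_counts)
print(hist_counts)
```

cut lowers the first edge slightly so that the minimum value falls inside the first bin, which is why the first width is marginally larger than the others.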

Answer by Mir H.

So qcut gives each bin roughly the same number of values, even if the values cluster in the sample space. This means you are less likely to end up with one bin packed with nearly identical values and another bin that is empty. In general, it samples the data more evenly.
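
A small illustration of the empty-bin case, using made-up data (five values clustered near 1 plus one far outlier; the numbers are hypothetical and chosen only to show the effect):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a tight cluster and a single outlier.
x = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 100.0])

# labels=False returns the bin index of each value, which np.bincount can tally.
cut_counts = np.bincount(pd.cut(x, 3, labels=False), minlength=3)
qcut_counts = np.bincount(pd.qcut(x, 3, labels=False), minlength=3)

print(cut_counts)   # equal-width bins: the cluster shares one bin, the middle bin is empty
print(qcut_counts)  # equal-frequency bins: two values in each bin
```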

Answer by Ashish Anand

  • The cut command creates equally spaced bins, but the number of samples in each bin is unequal.
  • The qcut command creates bins of unequal width, but the number of samples in each bin is equal.

    >>> x=np.array([24,  7,  2, 25, 22, 29])
    >>> x
    array([24,  7,  2, 25, 22, 29])

    >>> pd.cut(x,3).value_counts()  # Each bin spans an equal interval of 9
    (2, 11.0]        2
    (11.0, 20.0]     0
    (20.0, 29.0]     4

    >>> pd.qcut(x,3).value_counts()  # Equal frequency of 2 in each bin
    (1.999, 17.0]     2
    (17.0, 24.333]    2
    (24.333, 29.0]    2

Answer by Aditya Anand

pd.qcut splits the elements of an array by count: each bin receives roughly (number of elements in array)/(number of bins) elements, filled in order from the smallest values to the largest.

pd.cut splits by value instead: it divides the range of the data into intervals of equal width, (largest element - smallest element)/(number of bins), and then assigns each element to the interval in which its value falls.
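
As a rough check of where the two functions place their bin edges, the sketch below recomputes the edges by hand for the six-element array from the previous answer and compares them with what pandas reports via retbins=True. The variable names are illustrative only:

```python
import numpy as np
import pandas as pd

x = np.array([24, 7, 2, 25, 22, 29])
n_bins = 3

# cut: equal-width edges between the minimum and the maximum;
# each bin is (29 - 2) / 3 = 9 wide for this array.
manual_cut_edges = np.linspace(x.min(), x.max(), n_bins + 1)
_, cut_edges = pd.cut(x, n_bins, retbins=True)

# qcut: edges at the 0, 1/3, 2/3 and 1 sample quantiles.
manual_qcut_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
_, qcut_edges = pd.qcut(x, n_bins, retbins=True)

print(manual_cut_edges, cut_edges)
print(manual_qcut_edges, qcut_edges)
```

The only difference for cut is the first edge, which pandas lowers slightly so that the minimum value is included in the first bin.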
