pandas: How do I quantize data in pandas?
Original question: http://stackoverflow.com/questions/31485526/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How do I quantize data in pandas?
Asked by Bob
I have a DataFrame like this
# assumes: import numpy as np; import pandas as pd
a = pd.DataFrame(np.random.random((10, 5)), columns=['col1','col2','col3','col4','col5'])
I'd like to quantize a specific column, say col4, according to a set of thresholds (the corresponding output could be an integer from 0 to number of levels). Is there an API for that?
Answered by cchi
Perhaps qcut() is what you're seeking. Short answer:
df['quantized'] = pd.qcut(df['col4'], 5, labels=False )
Longer explanation:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(10, 5), columns=['col1','col2','col3','col4','col5'])
>>> df
       col1      col2      col3      col4      col5
0  0.502017  0.290167  0.483311  1.755979 -0.866204
1  0.374881 -1.372040 -0.533093  1.559528 -1.835466
2 -0.110025 -1.071334 -0.474367 -0.250456  0.428927
3 -2.070885  0.095878 -3.133244 -1.295787  0.436325
4 -0.974993  0.591984 -0.839131 -0.949721 -1.130265
5 -0.383469  0.453937 -0.266297 -1.077004  0.123262
6 -2.548547  0.424707 -0.955433  1.147909 -0.249138
7  1.056661  0.949915 -0.234331 -0.146116  0.552332
8  0.029098 -1.016712 -1.252748 -0.216355  0.458309
9  0.262807  0.029040 -0.843372  0.492120  0.128395
You can use pd.qcut() to get the corresponding range.
>>> q = pd.qcut(df['col4'], 5)
>>> q
0       (1.23, 1.756]
1       (1.23, 1.756]
2     (-0.975, -0.23]
3    [-1.296, -0.975]
4     (-0.975, -0.23]
5    [-1.296, -0.975]
6       (0.109, 1.23]
7      (-0.23, 0.109]
8      (-0.23, 0.109]
9       (0.109, 1.23]
Name: col4, dtype: category
Categories (5, object): [[-1.296, -0.975] < (-0.975, -0.23] < (-0.23, 0.109] < (0.109, 1.23] < (1.23, 1.756]]
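If the numeric bin edges are needed rather than the interval labels, qcut can also return them. The following is a minimal sketch, assuming values rounded from the col4 column above. The names values, q and edges are illustrative and not part of the original answer.

```python
import pandas as pd

# col4 from the example above, rounded to two decimals for readability.
values = pd.Series(
    [1.76, 1.56, -0.25, -1.30, -0.95, -1.08, 1.15, -0.15, -0.22, 0.49],
    name="col4",
)

# retbins=True additionally returns the bin edges as a NumPy array.
q, edges = pd.qcut(values, 5, retbins=True)

print(q.cat.categories)  # the five intervals
print(edges)             # six edges, from the minimum to the maximum
print(q.value_counts())  # quantile bins hold (roughly) equal counts
```

Because the edges come from the data's quantiles, each bin ends up with about the same number of rows, which is different from cutting at fixed thresholds.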
You can set the parameter labels=False to get the integer representation.
>>> q = pd.qcut(df['col4'], 5, labels=False)
>>> q
0    4
1    4
2    1
3    0
4    1
5    0
6    3
7    2
8    2
9    3
dtype: int64
- The first argument is an array or Series.
- The second argument is the number of quantiles you'd like.
- See the documentation for more options: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
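As a sketch of the second argument: besides an integer, qcut also accepts a list of quantile fractions. The data and variable names below are made up for illustration.

```python
import pandas as pd

values = pd.Series([5, 1, 9, 3, 7, 2, 8, 4], name="col4")

# An integer asks for that many equal-sized quantile bins.
by_count = pd.qcut(values, 4, labels=False)

# A list of fractions spells out the quantile edges explicitly;
# these particular fractions are equivalent to q=4.
by_fractions = pd.qcut(values, [0, 0.25, 0.5, 0.75, 1.0], labels=False)

print(by_count.tolist())
print(by_fractions.tolist())
```

Either way, the bin edges are derived from the data. If the thresholds are fixed in advance, as in the question, pd.cut or numpy.digitize may be a closer fit.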
Answered by dermen
Most pandas objects are compatible with numpy functions. I would use numpy.digitize:
import numpy as np
import pandas as pd
a = pd.DataFrame(np.random.random((5, 5)), columns=['col1','col2','col3','col4','col5'])
#       col1      col2      col3      col4      col5
#0  0.523311  0.266401  0.939214  0.487241  0.582323
#1  0.274436  0.761046  0.155482  0.630622  0.044595
#2  0.505696  0.953183  0.643918  0.894726  0.466916
#3  0.281888  0.621781  0.900743  0.339057  0.427644
#4  0.927478  0.442643  0.541234  0.450761  0.191215
# pd.np was removed in pandas 2.0, so call NumPy directly
np.digitize(a.col4, bins=[0.3, 0.6, 0.9])
#array([1, 2, 2, 1, 1])
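A small sketch of how numpy.digitize treats values that sit exactly on a threshold, and how the result can be stored as a new column. The data, the thresholds list and the column names level and level_right are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col4": [0.1, 0.3, 0.45, 0.6, 0.95]})
thresholds = [0.3, 0.6, 0.9]

# Default (right=False): bins[i-1] <= x < bins[i], so a value equal to
# a threshold lands in the upper bin.
df["level"] = np.digitize(df["col4"], bins=thresholds)

# right=True: bins[i-1] < x <= bins[i], so a value equal to a threshold
# lands in the lower bin.
df["level_right"] = np.digitize(df["col4"], bins=thresholds, right=True)

print(df)
```

With three thresholds the output ranges from 0 (below the first threshold) to 3 (above the last), which matches the "0 to number of levels" output asked for in the question.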
Answered by Geeocode
Answered by Peter9192
Pandas has a built-in function, pd.cut, which allows you to specify bins and labels. Following dermen's example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((5, 5)), columns=['col1', 'col2', 'col3', 'col4', 'col5'])
#        col1      col2      col3      col4      col5
# 0  0.693759  0.175076  0.260484  0.883670  0.318821
# 1  0.062635  0.413724  0.341535  0.952104  0.854916
# 2  0.837990  0.440695  0.341482  0.833220  0.688664
# 3  0.652480  0.271256  0.338068  0.757838  0.311720
# 4  0.782419  0.567019  0.839786  0.208740  0.245261
pd.cut(df.col4, bins = [0, 0.3, 0.6, 0.9, 1], labels=['A', 'B', 'C', 'D'])
# 0    C
# 1    D
# 2    C
# 3    C
# 4    A
# Name: col4, dtype: category
# Categories (4, object): [A < B < C < D]
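To get integer levels instead of letter labels, pd.cut also accepts labels=False. The sketch below pads the thresholds with infinite edges so that values outside the thresholds still get a level. The data, thresholds and column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col4": [0.1, 0.3, 0.45, 0.6, 0.95]})
thresholds = [0.3, 0.6, 0.9]

# Open-ended outer bins, so nothing falls outside the edges and becomes NaN.
edges = [-np.inf, *thresholds, np.inf]

# Default right=True: intervals are (low, high], so 0.3 goes to level 0.
df["level"] = pd.cut(df["col4"], bins=edges, labels=False)

# right=False: intervals are [low, high), the same convention as the
# default of numpy.digitize.
df["level_left"] = pd.cut(df["col4"], bins=edges, labels=False, right=False)

print(df)
```

Without the infinite edges, values below the first bin edge or above the last one would come back as NaN.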

