Note: this page is a translated reproduction of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/30510562/

Date: 2020-08-19 08:34:07  Source: igfitidea

Get mapping of categorical variables in pandas

Tags: python, pandas

Asked by Bob

I'm doing this to convert categorical variables to numbers:

>>> import pandas as pd
>>> df = pd.DataFrame({'x': ['good', 'bad', 'good', 'great']}, dtype='category')
>>> df
       x
0   good
1    bad
2   good
3  great

How can I get the mapping between the original values and the new values?

Answered by JohnE

Method 1

You can build the mapping by enumerating the categories, in the same way you would build a dictionary from a list by using the list indices as keys:

dict(enumerate(df['x'].cat.categories))

# {0: 'bad', 1: 'good', 2: 'great'}
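
If you need the opposite direction as well (label to code), one option is to invert the dictionary from Method 1. The following is a minimal, self-contained sketch using the same example frame as the question; the names code_to_label and label_to_code are only illustrative and do not come from pandas.

```python
import pandas as pd

df = pd.DataFrame({'x': ['good', 'bad', 'good', 'great']}, dtype='category')

# Method 1: integer code -> category label
code_to_label = dict(enumerate(df['x'].cat.categories))

# Inverse mapping: category label -> integer code
# (illustrative name, built by swapping keys and values)
label_to_code = {label: code for code, label in code_to_label.items()}

print(code_to_label)  # {0: 'bad', 1: 'good', 2: 'great'}
print(label_to_code)  # {'bad': 0, 'good': 1, 'great': 2}
```

Inverting is safe here because category labels are unique, so no two codes can collide on the same label.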

Method 2

Alternatively, you could map the values and codes in every row:

dict(zip(df['x'].cat.codes, df['x']))

# {1: 'good', 0: 'bad', 2: 'great'}  (same pairs as Method 1, in row order)

What is happening here is a little more transparent, and arguably safer for that reason. It is also much less efficient: the arguments to zip() have length len(df), whereas df['x'].cat.categories has only as many entries as there are unique values, which is generally far fewer than len(df).
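
One practical difference is worth keeping in mind: Method 1 lists every declared category, while Method 2 only sees what actually occurs in the rows. The sketch below uses a made-up Series (not the question's data) with an unused category and a missing value to show this; pandas stores missing values in a categorical as code -1.

```python
import pandas as pd

# Hypothetical data: 'bad' and 'great' are declared but never used,
# and one entry is missing.
s = pd.Series(['good', None, 'good'],
              dtype=pd.CategoricalDtype(['bad', 'good', 'great']))

method_1 = dict(enumerate(s.cat.categories))  # all declared categories
method_2 = dict(zip(s.cat.codes, s))          # only what appears in the rows

print(method_1)  # {0: 'bad', 1: 'good', 2: 'great'}
print(method_2)  # code 1 maps to 'good', code -1 maps to NaN
```

So the two methods agree only when every category appears at least once and there are no missing values.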

Additional Discussion

The reason Method 1 works is that the categories have type Index:

type(df['x'].cat.categories)

# pandas.core.indexes.base.Index

and in this case you can look up values in an Index by position, just as you would in a list.
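
As a small illustration of that list-like behaviour, the sketch below indexes the categories by position and then uses Index.take with the codes to recover the label for every row. It assumes the example frame from the question, which has no missing values (a code of -1 would need separate handling).

```python
import pandas as pd

df = pd.DataFrame({'x': ['good', 'bad', 'good', 'great']}, dtype='category')
categories = df['x'].cat.categories

# Positional lookup, exactly like a list
print(categories[1])     # good
print(list(categories))  # ['bad', 'good', 'great']

# Look up the label for each row's code in one call
labels = categories.take(df['x'].cat.codes.to_numpy())
print(list(labels))      # ['good', 'bad', 'good', 'great']
```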

There are a couple of ways to verify that Method 1 works. First, you can just check that a round trip retains the correct values:

mapping = dict(enumerate(df['x'].cat.categories))
(df['x'] == df['x'].cat.codes.map(mapping).astype('category')).all()
# True

or you can check that Method 1 and Method 2 give the same answer:

dict(enumerate(df['x'].cat.categories)) == dict(zip(df['x'].cat.codes, df['x']))

# True