pandas: get mapping of categories to integer value

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/42215354/

Time: 2020-09-14 02:58:37  Source: igfitidea

pandas get mapping of categories to integer value

python pandas

Asked by jxn

I can transform categorical columns to their categorical codes, but how do I get an accurate picture of their mapping? Example:

import pandas as pd
df_labels = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab')})
df_labels['col2'] = df_labels['col2'].astype('category')

df_labels looks like this:

   col1 col2
0     1    a
1     2    b
2     3    c
3     4    a
4     5    b

How do I get an accurate mapping of the category codes to the categories? The StackOverflow answer below says to enumerate the categories. However, I'm not sure whether enumerating is how cat.codes generated the integer values. Is there a more accurate way?

Get mapping of categorical variables in pandas

>>> dict( enumerate(df.five.cat.categories) )

{0: 'bad', 1: 'good'}

What is a good way to get an accurate mapping in the above format?

Accepted answer by Boud

Edited answer (removed cat.categories and changed list to dict):

>>> dict(zip(df_labels.col2.cat.codes, df_labels.col2))

{0: 'a', 1: 'b', 2: 'c'}

The original answer which some of the comments are referring to:

>>> list(zip(df_labels.col2.cat.codes, df_labels.col2.cat.categories))

[(0, 'a'), (1, 'b'), (2, 'c')]

As the comments note, the original answer works in this example only because the first three values happened to be [a,b,c]; it would fail if they were instead [c,b,a] or [b,c,a].
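
To make the failure mode concrete, here is a minimal sketch that uses hypothetical data (the same letters as the question, but first appearing in the order c, b, a; the variable names are illustrative only). It assumes pandas sorts the inferred string categories, which is the default behaviour for sortable values.

```python
import pandas as pd

# Hypothetical data: same letters as the question, but first seen as c, b, a.
s = pd.Series(list("cbacb"), dtype="category")

codes = s.cat.codes            # one code per row: 2, 1, 0, 2, 1
categories = s.cat.categories  # unique categories: a, b, c

# Original answer: pairs the i-th row's code with the i-th category.
positional = {int(code): cat for code, cat in zip(codes, categories)}

# Edited answer: pairs each row's code with that same row's value.
row_wise = {int(code): value for code, value in zip(codes, s)}

print(positional)
print(row_wise)
```

Only the row-wise pairing is guaranteed to agree with the codes pandas assigned; the positional pairing is right only when the first rows happen to be in category order.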

Answer by pomber

I use:

dict([(category, code) for code, category in enumerate(df_labels.col2.cat.categories)])

# {'a': 0, 'b': 1, 'c': 2}

Answer by Neo X

If you want to convert each column/data series from categorical back to the original, you just need to reverse what you did in the for loop over the dataframe. There are two methods to do that:

  1. To get back to the original Series or numpy array, use Series.astype(original_dtype) or np.asarray(categorical).

  2. If you already have codes and categories, you can use the from_codes() constructor to skip the factorize step that the normal constructor performs.
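
The two methods above can be sketched as follows. This is only an illustration on the question's col2 values; it assumes the original dtype was object, and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abcab"), dtype="category")

# Method 1: go back to plain values.
back = s.astype(object)   # assumes the original dtype was object
arr = np.asarray(s)       # plain numpy array holding the values

# Method 2: rebuild a Categorical from codes and categories already at hand.
rebuilt = pd.Categorical.from_codes(
    s.cat.codes.to_numpy(), categories=s.cat.categories
)

print(back.tolist())
print(list(arr))
print(list(rebuilt))
```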

See pandas: Categorical Data

Usage of from_codes

As described in the official documentation, it builds a Categorical from arrays of codes and categories.

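import numpy as np
import pandas as pd
# randomly assign each of 5 rows the code 0 ("train") or 1 ("test")
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
print(splitter)
print(s)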

gives

[0 1 1 0 0]
0    train
1     test
2     test
3    train
4    train
dtype: category
Categories (2, object): [train, test]

For your codes

# after your previous conversion, df['col2'] holds the integer codes
print(df['col2'])
# apply from_codes; the categories argument is the list of categories from the mapping dict
s = pd.Series(pd.Categorical.from_codes(df['col2'], categories=list('abcde')))
print(s)

gives

0    0
1    1
2    2
3    0
4    1
Name: col2, dtype: int8
0    a
1    b
2    c
3    a
4    b
dtype: category
Categories (5, object): [a, b, c, d, e]

Answer by JohnE

OP asks for something "accurate" relative to the answer in the linked question:

dict(enumerate(df_labels.col2.cat.categories))

# {0: 'a', 1: 'b', 2: 'c'}

I believe that the above answer is indeed accurate (full disclosure: it is my answer in the other question that I'm defending). Note also that it is roughly equivalent to @pomber's answer, except that the keys and values are swapped. (Since both keys and values are unique, the direction is in some sense irrelevant, and the mapping is easy enough to reverse.)
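
One way to check this claim is to decode the codes through the enumerated mapping and compare the result with the original values. The sketch below uses hypothetical data in a deliberately unsorted order; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical data, deliberately not in category order.
s = pd.Series(list("cbacb"), dtype="category")

code_to_cat = dict(enumerate(s.cat.categories))
cat_to_code = {cat: code for code, cat in code_to_cat.items()}  # reversed direction

# Decoding every row's code through the mapping should reproduce the column.
decoded = s.cat.codes.map(code_to_cat)

print(code_to_cat)
print(decoded.tolist() == s.tolist())
```

The check passes because each code is the position of the row's value within cat.categories, which is exactly what enumerate() reproduces.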

However, the following way is arguably safer, or at least more transparent as to how it works:

dict(zip(df_labels.col2.cat.codes, df_labels.col2))

# {0: 'a', 1: 'b', 2: 'c'}

This is similar in spirit to @boud's answer, but corrects an error by replacing df_labels.col2.cat.categories with df_labels.col2. It also replaces list() with dict(), which seems more appropriate for a mapping and automatically gets rid of duplicates.

Note that the length of both arguments to zip() is len(df_labels), whereas the length of df_labels.col2.cat.categories is the number of unique values, which will generally be much smaller than len(df_labels).

Also note that this method is quite inefficient, as it maps 0 to 'a' twice, and similarly for 'b'. In large dataframes the difference in speed could be pretty big. It won't cause any error, because dict() removes redundancies like this; it will just be much less efficient than the other method.
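
Besides speed, the two approaches can differ in what they return when a declared category never appears in the data. The sketch below assumes a hypothetical column with an unused category 'c'; the names are illustrative.

```python
import pandas as pd

# Hypothetical column: 'c' is a declared category but never occurs in the rows.
s = pd.Series(pd.Categorical(list("abab"), categories=list("abc")))

# Walks only the unique categories, so unused ones are included.
from_categories = dict(enumerate(s.cat.categories))

# Walks every row, so only categories that actually occur are included.
from_rows = {int(code): value for code, value in zip(s.cat.codes, s)}

print(from_categories)
print(from_rows)
```

Which result is preferable depends on whether the mapping should describe the column's dtype or only the values actually present.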