Original question: http://stackoverflow.com/questions/15356433/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to generate pandas DataFrame column of Categorical from string column?
Asked by smci
I can convert a pandas string column to Categorical, but when I try to insert it as a new DataFrame column it seems to get converted right back to Series of str:
train['LocationNFactor'] = pd.Categorical.from_array(train['LocationNormalized'])
>>> type(pd.Categorical.from_array(train['LocationNormalized']))
<class 'pandas.core.categorical.Categorical'>
# however it got converted back to...
>>> type(train['LocationNFactor'][2])
<type 'str'>
>>> train['LocationNFactor'][2]
'Hampshire'
Guessing this is because Categorical doesn't map to any numpy dtype; so do I have to convert it to some int type, and thus lose the factor labels<->levels association? What's the most elegant workaround to store the levels<->labels association and retain the ability to convert back? (just store as a dict like here, and manually convert when needed?) I think Categorical is still not a first-class datatype for DataFrame, unlike R.
(Using pandas 0.10.1, numpy 1.6.2, python 2.7.3 - the latest macports versions of everything).
Accepted answer by smci
The only workaround I found for pandas pre-0.15 is as follows:
- column must be converted to a Categorical for classifier, but numpy will immediately coerce the levels back to int, losing the factor information
- so store the factor in a global variable outside the dataframe
train_LocationNFactor = pd.Categorical.from_array(train['LocationNormalized']) # default order: alphabetical
train['LocationNFactor'] = train_LocationNFactor.labels # insert in dataframe
[UPDATE: pandas 0.15+ added decent support for Categorical]
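For readers on pandas 0.15 or later, here is a minimal sketch of what that support looks like, assuming the `astype("category")` conversion and the `.cat` accessor. The small `train` frame is a made-up stand-in for the question's data, not the original dataset.

```python
import pandas as pd

# Hypothetical stand-in for the question's `train` DataFrame.
train = pd.DataFrame(
    {"LocationNormalized": ["Hampshire", "London", "Hampshire", "Surrey"]}
)

# astype("category") stores the column with a categorical dtype,
# so it is no longer coerced back to plain strings on insertion.
train["LocationNFactor"] = train["LocationNormalized"].astype("category")

# Integer code for each row, and the Index of labels (sorted by default).
codes = train["LocationNFactor"].cat.codes
categories = train["LocationNFactor"].cat.categories

# The codes map back to the labels through the categories Index.
recovered = categories[codes.to_numpy()]
```

With this approach the labels and their integer codes travel with the column, so there is no need to keep the factor in a separate global variable.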
Answered by HYRY
The labels<->levels mapping is stored in the index object.
- To convert an integer array to string array: index[integer_array]
- To convert a string array to integer array: index.get_indexer(string_array)
Here is an example:
In [56]:
c = pd.Categorical.from_array(['a', 'b', 'c', 'd', 'e'])
idx = c.levels
In [57]:
idx[[1,2,1,2,3]]
Out[57]:
Index([b, c, b, c, d], dtype=object)
In [58]:
idx.get_indexer(["a","c","d","e","a"])
Out[58]:
array([0, 2, 3, 4, 0])
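The session above uses the API of the pandas version current at the time. As far as I know, `levels` was renamed to `categories` in pandas 0.15, and `Categorical.from_array` was later removed. A rough equivalent for newer releases, assuming the plain `pd.Categorical` constructor, might look like this:

```python
import pandas as pd

# Build the categorical with the constructor instead of from_array.
c = pd.Categorical(["a", "b", "c", "d", "e"])
idx = c.categories  # called `levels` in old pandas versions

# Integer array -> string labels.
labels = idx[[1, 2, 1, 2, 3]]

# String array -> integer codes.
codes = idx.get_indexer(["a", "c", "d", "e", "a"])
```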

