pandas: Combining Multiple Categories into One
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/32262982/
Asked by Minh Mai
Let's say I have categories 1 to 10, and I want to assign red to values 3 to 5, green to 1, 6, and 7, and blue to 2, 8, 9, and 10.
How would I do this? If I try
df.cat.rename_categories(['red','green','blue'])
I get an error: ValueError: new categories need to have the same number of items than the old categories! But if I put this in
df.cat.rename_categories(['green','blue','red', 'red', 'red',
                          'green', 'green', 'blue', 'blue', 'blue'])
I'll get an error saying that there are duplicate values.
The only other method I can think of is to write a for loop that goes through a dictionary of the values and replaces them. Is there a more elegant way of resolving this?
Answered by DSM
Not sure about elegance, but if you make a dict of the old to new categories, something like (note the added 'purple'):
>>> m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10], "purple": [11]}
>>> m2 = {v: k for k,vv in m.items() for v in vv}
>>> m2
{1: 'green', 2: 'blue', 3: 'red', 4: 'red', 5: 'red', 6: 'green',
7: 'green', 8: 'blue', 9: 'blue', 10: 'blue', 11: 'purple'}
You can use this to build a new categorical Series:
>>> df.cat.map(m2).astype("category", categories=set(m2.values()))
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
Name: cat, dtype: category
Categories (4, object): [green, purple, red, blue]
You don't need the categories=set(m2.values()) (or an ordered equivalent, if you care about the categorical ordering) if you're sure that all categorical values will be seen in the column. But here, if we didn't do that, we wouldn't have seen purple in the resulting Categorical, because it would have been built only from the categories actually seen.
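In recent pandas releases, as far as I know, astype no longer accepts a categories= keyword, and a CategoricalDtype is passed instead. A minimal sketch of the same idea under that assumption; the small frame with a column named cat is made up here for illustration:

```python
import pandas as pd

# Hypothetical input: a column named "cat" holding the values 1..10
df = pd.DataFrame({"cat": range(1, 11)})

m = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10], "purple": [11]}
m2 = {v: k for k, vv in m.items() for v in vv}

# A sorted list instead of a set, so the category order is deterministic
color_dtype = pd.CategoricalDtype(categories=sorted(set(m2.values())))
colors = df["cat"].map(m2).astype(color_dtype)

# Without explicit categories, only the colors actually present are kept
inferred = df["cat"].map(m2).astype("category")

print(list(colors.cat.categories))
print(list(inferred.cat.categories))
```

Here colors keeps purple as a category even though no row maps to it, while inferred does not.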
Of course, if you already have your list ['green','blue','red', etc.] built, it's equally easy to use it to make a new categorical column directly and bypass this mapping entirely.
Answered by JohnE
I certainly don't see an issue with @DSM's original answer here, but that dictionary comprehension might not be the easiest thing to read for some (although it is a fairly standard approach in Python).
If you don't want to use a dictionary comprehension but are willing to use numpy, then I would suggest np.select, which is roughly as concise as @DSM's answer but perhaps a little more straightforward to read, like @vector07's answer.
import numpy as np
number = [ df.numbers.isin([3,4,5]),
df.numbers.isin([1,6,7]),
df.numbers.isin([2,8,9,10]),
df.numbers.isin([11]) ]
color = [ "red", "green", "blue", "purple" ]
df.numbers = np.select( number, color )
Output (note this is a string or object column, but of course you can easily convert it to a category with astype('category')):
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
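A self-contained sketch of the np.select approach, including the conversion to a category. The input frame and the "unknown" placeholder are assumptions added for illustration. The explicit string default is there because np.select otherwise falls back to 0 for unmatched rows, which mixes an integer with string choices.

```python
import numpy as np
import pandas as pd

# Hypothetical input matching the question: values 1..10
df = pd.DataFrame({"numbers": range(1, 11)})

conditions = [
    df["numbers"].isin([3, 4, 5]),
    df["numbers"].isin([1, 6, 7]),
    df["numbers"].isin([2, 8, 9, 10]),
    df["numbers"].isin([11]),
]
choices = ["red", "green", "blue", "purple"]

# "unknown" is an arbitrary placeholder for values matching no condition
df["colors"] = np.select(conditions, choices, default="unknown")

# Convert the resulting string column to a categorical one
df["colors"] = df["colors"].astype("category")
print(df)
```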
It's basically the same thing, but you could also do this with np.where:
df['numbers2'] = ''
df.numbers2 = np.where( df.numbers.isin([3,4,5]), "red", df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([1,6,7]), "green", df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([2,8,9,10]), "blue", df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([11]), "purple", df.numbers2 )
That's not going to be as efficient as np.select, which is probably the most efficient way to do this (although I didn't time it), but it is arguably more readable in that you can put each key/value pair on the same line.
Answered by Divakar
It seems Series.explode, released with pandas-0.25.0 (July 18, 2019), would fit right in there and hence avoid any looping:
# Mapping dict
In [150]: m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10]}
In [151]: pd.Series(m).explode().sort_values()
Out[151]:
green 1
blue 2
red 3
red 4
red 5
green 6
green 7
blue 8
blue 9
blue 10
dtype: object
So, the result is a pandas Series that holds all the required mappings, with the numbers as its values and the colors as its index. Depending on what is needed, we might use it directly or, if a different format such as a dict or a Series keyed by number is required, swap the index and values. Let's explore those too.
# Mapping obtained
In [152]: s = pd.Series(m).explode().sort_values()
1) Output as dict :
In [153]: dict(zip(s.values, s.index))
Out[153]:
{1: 'green',
2: 'blue',
3: 'red',
4: 'red',
5: 'red',
6: 'green',
7: 'green',
8: 'blue',
9: 'blue',
10: 'blue'}
2) Output as series :
In [154]: pd.Series(s.index, s.values)
Out[154]:
1 green
2 blue
3 red
4 red
5 red
6 green
7 green
8 blue
9 blue
10 blue
dtype: object
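To connect this back to the question, the swapped mapping can be passed to Series.map. A short sketch, assuming a frame with a numbers column like the one in the question (the frame itself is made up here):

```python
import pandas as pd

m = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10]}
s = pd.Series(m).explode()

# Swap index and values to get a number -> color lookup
lookup = dict(zip(s.values, s.index))

# Hypothetical column to recode
df = pd.DataFrame({"numbers": range(1, 11)})
df["colors"] = df["numbers"].map(lookup).astype("category")
print(df)
```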
Answered by vector07
OK, this is slightly simpler, hopefully will stimulate further conversation.
OP's example input:
>>> my_data = {'numbers': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
>>> df = pd.DataFrame(data=my_data)
>>> df.numbers = df.numbers.astype('category')
>>> df.numbers.cat.rename_categories(['green','blue','red', 'red', 'red',
...                                   'green', 'green', 'blue', 'blue', 'blue'])
This yields ValueError: Categorical categories must be unique, as OP states.
My solution:
# write out a dict with the mapping of old to new
>>> remap_cat_dict = {
...     1: 'green',
...     2: 'blue',
...     3: 'red',
...     4: 'red',
...     5: 'red',
...     6: 'green',
...     7: 'green',
...     8: 'blue',
...     9: 'blue',
...     10: 'blue' }
>>> df.numbers = df.numbers.map(remap_cat_dict).astype('category')
>>> df.numbers
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
Name: numbers, dtype: category
Categories (3, object): [blue, green, red]
This forces you to write out a complete dict with an entry for every old category, but it is very readable. The conversion is then pretty straightforward: use Series.map to take each value and substitute it with the appropriate result from remap_cat_dict, then convert the result to a category and overwrite the column.
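One thing to watch with a hand-written dict: Series.map returns NaN for any value that has no key in the dict, so an incomplete mapping fails silently. A small sketch of a check for that, using a deliberately incomplete dict made up for illustration:

```python
import pandas as pd

# Deliberately incomplete mapping: there is no entry for 4
remap_cat_dict = {1: "green", 2: "blue", 3: "red"}

numbers = pd.Series([1, 2, 3, 4], name="numbers")
mapped = numbers.map(remap_cat_dict)

# Values without a key in the dict come back as NaN
missing = numbers[mapped.isna()].tolist()
print(missing)
```

Checking missing before calling astype('category') makes a forgotten key visible instead of leaving NaN entries in the new column.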
I encountered almost this exact problem, where I wanted to create a new column with fewer categories converted from an old column. That works just as easily here (and has the benefit of not overwriting an existing column):
>>> df['colors'] = df.numbers.map(remap_cat_dict).astype('category')
>>> print(df)
numbers colors
0 1 green
1 2 blue
2 3 red
3 4 red
4 5 red
5 6 green
6 7 green
7 8 blue
8 9 blue
9 10 blue
>>> df.colors
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
Name: colors, dtype: category
Categories (3, object): [blue, green, red]
EDIT 5/2/20: Further simplified df.numbers.apply(lambda x: remap_cat_dict[x]) to df.numbers.map(remap_cat_dict) (thanks @JohnE)

