Python: reconstruct a categorical variable from dummies in pandas

Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/26762100/


Reconstruct a categorical variable from dummies in pandas

Tags: python, pandas

Asked by themiurgo

pd.get_dummies allows converting a categorical variable into dummy variables. Reconstructing the categorical variable is trivial to write by hand, but is there a preferred/quick way to do it?


Accepted answer by Jeff

In [46]: s = Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
Out[47]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
Out[49]: 
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1

In [50]: x = df.stack()

# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

So I think we need a function to 'do' this, as it seems to be a natural operation. Maybe get_categories(), see here

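As an aside for readers on newer pandas: version 1.5 added pd.from_dummies, which inverts pd.get_dummies directly. It returns a DataFrame; with no column separator, the reconstructed column is named by the empty string. A minimal sketch, assuming pandas >= 1.5:

```python
import pandas as pd

s = pd.Series(list('aaabbbccddefgh'))
dummies = pd.get_dummies(s)

# from_dummies returns a one-column DataFrame; pull that column back out
restored = pd.from_dummies(dummies)
s2 = restored[restored.columns[0]]
```

The round trip recovers the original values, though not the categorical dtype; wrap the result in pd.Categorical if you need that back.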

Answer by Nathan

It's been a few years, so this may well not have been in the pandas toolkit back when this question was originally asked, but this approach seems a little easier to me. idxmax will return the index corresponding to the largest element (i.e. the one with a 1). We use axis=1 because we want the column name where the 1 occurs.


EDIT: I didn't bother making it categorical instead of just a string, but you can do that the same way @Jeff did, by wrapping it with pd.Categorical (and pd.Series, if desired).

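Concretely, that wrapping looks like this (a small sketch on the same toy data used in this answer):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])
dummies = pd.get_dummies(s)

# idxmax yields plain strings; wrap in pd.Categorical to restore the dtype
s2 = pd.Series(pd.Categorical(dummies.idxmax(axis=1)))
```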

In [1]: import pandas as pd

In [2]: s = pd.Series(['a', 'b', 'a', 'c'])

In [3]: s
Out[3]: 
0    a
1    b
2    a
3    c
dtype: object

In [4]: dummies = pd.get_dummies(s)

In [5]: dummies
Out[5]: 
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1

In [6]: s2 = dummies.idxmax(axis=1)

In [7]: s2
Out[7]: 
0    a
1    b
2    a
3    c
dtype: object

In [8]: (s2 == s).all()
Out[8]: True

EDIT in response to @piRSquared's comment: This solution does indeed assume there's one 1 per row. I think this is usually the format one has. pd.get_dummies can return rows that are all 0 if you have drop_first=True, or if there are NaN values and dummy_na=False (the default) (any cases I'm missing?). A row of all zeros will be treated as if it was an instance of the variable named in the first column (e.g. a in the example above).


If drop_first=True, you have no way to know from the dummies dataframe alone what the name of the "first" variable was, so that operation isn't invertible unless you keep extra information around; I'd recommend leaving drop_first=False (the default).


Since dummy_na=False is the default, this could certainly cause problems. Please set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the "dummification" and your data contains any NaNs. Setting dummy_na=True will always add a "nan" column, even if that column is all 0s, so you probably don't want to set it unless you actually have NaNs. A nice approach is to set dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What's also nice is that the idxmax solution will correctly regenerate your NaNs (not just a string that says "nan").

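To illustrate the NaN round trip, here is a small sketch (the sample values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan, 'a'])

# Only add the "nan" column when the data actually contains NaNs
dummies = pd.get_dummies(s, dummy_na=s.isnull().any())

# idxmax returns the column label, so the NaN column regenerates real NaNs
s2 = dummies.idxmax(axis=1)
```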

It's also worth mentioning that setting drop_first=True and dummy_na=False means that NaNs become indistinguishable from an instance of the first variable, so this should be strongly discouraged if your dataset may contain any NaN values.


Answer by sacuL

This is quite a late answer, but since you ask for a quick way to do it, I assume you're looking for the most performant strategy. On a large dataframe (for instance 10000 rows), you can get a very significant speed boost by using np.where instead of idxmax or get_level_values, and obtain the same result. The idea is to index the column names where the dummy dataframe is not 0:


Method:


Using the same sample data as @Nathan:


>>> dummies
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1

s2 = pd.Series(dummies.columns[np.where(dummies!=0)[1]])

>>> s2
0    a
1    b
2    a
3    c
dtype: object

Benchmark:


On a small dummy dataframe, you won't see much difference in performance. However, testing different strategies to solving this problem on a large series:


s = pd.Series(np.random.choice(['a','b','c'], 10000))

dummies = pd.get_dummies(s)

def np_method(dummies=dummies):
    return pd.Series(dummies.columns[np.where(dummies!=0)[1]])

def idx_max_method(dummies=dummies):
    return dummies.idxmax(axis=1)

def get_level_values_method(dummies=dummies):
    x = dummies.stack()
    return pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))

def dot_method(dummies=dummies):
    return dummies.dot(dummies.columns)

import timeit

# Time each method, 1000 iterations each:

>>> timeit.timeit(np_method, number=1000)
1.0491090340074152

>>> timeit.timeit(idx_max_method, number=1000)
12.119140846014488

>>> timeit.timeit(get_level_values_method, number=1000)
4.109266621991992

>>> timeit.timeit(dot_method, number=1000)
1.6741622970002936

The np.where method is about 4 times faster than the get_level_values method and about 11.5 times faster than the idxmax method! It also beats (but only by a little) the .dot() method outlined in this answer to a similar question.

np.where方法比该get_level_values方法快11.5倍的方法快4倍左右idxmax!它还胜过(但仅略胜一筹)此对类似问题的回答中.dot()概述的方法

They all return the same result:


>>> (get_level_values_method() == np_method()).all()
True
>>> (idx_max_method() == np_method()).all()
True

Answer by piRSquared

Setup


Using @Jeff's setup


s = Series(list('aaabbbccddefgh')).astype('category')
df = pd.get_dummies(s)


If columns are strings


and there is only one 1 per row


df.dot(df.columns)

0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: object


numpy.where


Again! Assuming only one 1 per row


i, j = np.where(df)
pd.Series(df.columns[j], i)

0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a, b, c, d, e, f, g, h]


numpy.where


Not assuming one 1 per row


i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j])))

0   0    a
1   0    a
2   0    a
3   1    b
4   1    b
5   1    b
6   2    c
7   2    c
8   3    d
9   3    d
10  4    e
11  5    f
12  6    g
13  7    h
dtype: object


numpy.where


Where we don't assume one 1 per row and we drop the index


i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j]))).reset_index(-1, drop=True)

0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: object

Answer by Tarun Bhavnani

Converting dat["classification"] to one-hot encodings and back!


import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Encode the string labels as integers, then one-hot encode them
dat["labels"] = le.fit_transform(dat["classification"])
Y = pd.get_dummies(dat["labels"])

# Recover each row's integer label (the position of the 1),
# then map it back to the original string label
tru = []
for i in range(len(Y)):
    tru.append(np.argmax(Y.iloc[i]))
tru = le.inverse_transform(tru)

## Identical check!
(tru == dat["classification"]).value_counts()
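The Python-level loop above can be replaced by a single vectorized argmax over the dummy matrix. A minimal sketch with plain pandas (dropping LabelEncoder, and assuming one 1 per row; the sample values are made up):

```python
import pandas as pd

s = pd.Series(["cat", "dog", "cat", "bird"])
Y = pd.get_dummies(s)

# argmax finds the position of the 1 in each row;
# indexing the column labels with it maps positions back to category names
tru = Y.columns[Y.values.argmax(axis=1)]
```

The same axis=1 argmax also works on the LabelEncoder-based Y above, feeding the result to le.inverse_transform in one call instead of row by row.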