Python: Reconstruct a categorical variable from dummies in pandas
Original source: http://stackoverflow.com/questions/26762100/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Reconstruct a categorical variable from dummies in pandas
Asked by themiurgo
pd.get_dummies allows you to convert a categorical variable into dummy variables. Besides the fact that it's trivial to reconstruct the categorical variable, is there a preferred/quick way to do it?
Accepted answer by Jeff
In [46]: s = pd.Series(list('aaabbbccddefgh')).astype('category')
In [47]: s
Out[47]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
In [48]: df = pd.get_dummies(s)
In [49]: df
Out[49]: 
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1
In [50]: x = df.stack()
# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
So I think we need a function to 'do' this, as it seems to be a natural operation. Maybe get_categories(); see here.
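Newer pandas releases (1.5 and later) ship pd.from_dummies, which did not exist when this answer was written and covers the case described above. A minimal sketch, assuming pandas >= 1.5, a plain string Series rather than the categorical one above, and exactly one 1 per row:

```python
import pandas as pd

# Plain string labels (an assumption made to keep the column index simple)
s = pd.Series(list('aaabbbccddefgh'))

# dtype=int keeps the 0/1 representation shown in the answer above
dummies = pd.get_dummies(s, dtype=int)

# from_dummies returns a one-column DataFrame; take that column as a Series
recovered = pd.from_dummies(dummies).iloc[:, 0]

print((recovered == s).all())
```

from_dummies raises an error for rows that are all zeros unless a default_category is given, so the one-1-per-row assumption is checked rather than silently violated.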
Answer by Nathan
It's been a few years, so this may well not have been in the pandas toolkit back when this question was originally asked, but this approach seems a little easier to me. idxmax will return the index corresponding to the largest element (i.e. the one with a 1). We use axis=1 because we want the column name where the 1 occurs.
EDIT: I didn't bother making it categorical instead of just a string, but you can do that the same way as @Jeff did by wrapping it with pd.Categorical (and pd.Series, if desired).
In [1]: import pandas as pd
In [2]: s = pd.Series(['a', 'b', 'a', 'c'])
In [3]: s
Out[3]: 
0    a
1    b
2    a
3    c
dtype: object
In [4]: dummies = pd.get_dummies(s)
In [5]: dummies
Out[5]: 
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1
In [6]: s2 = dummies.idxmax(axis=1)
In [7]: s2
Out[7]: 
0    a
1    b
2    a
3    c
dtype: object
In [8]: (s2 == s).all()
Out[8]: True
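The pd.Categorical wrapping mentioned in the EDIT note can be sketched as follows. This is only an illustration on the same sample data; passing the dummy column names as categories is an assumption about what one would want, not something the answer prescribes:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])
dummies = pd.get_dummies(s, dtype=int)

# Column label of the largest entry in each row (the position of the 1)
labels = dummies.idxmax(axis=1)

# Wrap the labels as a categorical, reusing the dummy columns as categories
s_cat = pd.Series(pd.Categorical(labels, categories=dummies.columns))

print(s_cat.dtype)
```

Reusing dummies.columns keeps categories that never occur in the data, which a bare pd.Categorical(labels) would drop.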
EDIT in response to @piRSquared's comment:
This solution does indeed assume there's one 1 per row. I think this is usually the format one has. pd.get_dummies can return rows that are all 0 if you have drop_first=True or if there are NaN values and dummy_na=False (default) (any cases I'm missing?). A row of all zeros will be treated as if it was an instance of the variable named in the first column (e.g. a in the example above).
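The all-zeros behaviour can be seen in a short sketch. The sample Series and the masking step at the end are illustrative assumptions, not part of the original answer:

```python
import numpy as np
import pandas as pd

s = pd.Series(['b', np.nan, 'c', 'a'])

# dummy_na=False is the default, so the NaN row becomes all zeros
dummies = pd.get_dummies(s, dtype=int)

# idxmax returns the first column when every entry in the row is 0
recovered = dummies.idxmax(axis=1)

# One possible guard: blank out rows that contain no 1 at all
masked = recovered.where(dummies.sum(axis=1) > 0)

print(recovered.tolist())
print(masked.isna().tolist())
```

Row 1 comes back as 'a' in recovered even though the original value was missing; the masked version restores a missing value there.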
If drop_first=True, you have no way to know from the dummies dataframe alone what the name of the "first" variable was, so that operation isn't invertible unless you keep extra information around; I'd recommend leaving drop_first=False (default).
Since dummy_na=False is the default, this could certainly cause problems. Please set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the "dummification" and your data contains any NaNs. Setting dummy_na=True will always add a "nan" column, even if that column is all 0s, so you probably don't want to set this unless you actually have NaNs. A nice approach might be to set dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What's also nice is that the idxmax solution will correctly regenerate your NaNs (not just a string that says "nan").
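A small round-trip sketch of the conditional dummy_na suggestion. The two sample Series are made up for illustration:

```python
import numpy as np
import pandas as pd

with_nan = pd.Series(['a', np.nan, 'b', 'a'])
clean = pd.Series(['a', 'b', 'a'])

# Only add the NaN indicator column when the data actually has missing values
d_nan = pd.get_dummies(with_nan, dummy_na=with_nan.isnull().any(), dtype=int)
d_clean = pd.get_dummies(clean, dummy_na=clean.isnull().any(), dtype=int)

# The NaN column label is what idxmax returns for the missing row
s2 = d_nan.idxmax(axis=1)

print(d_nan.shape[1], d_clean.shape[1])
print(s2.isna().tolist())
```

The series with a missing value gets three columns (a, b and the NaN indicator), while the clean one gets only two.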
It's also worth mentioning that setting drop_first=True and dummy_na=False means that NaNs become indistinguishable from an instance of the first variable, so this should be strongly discouraged if your dataset may contain any NaN values.
Answer by sacuL
This is quite a late answer, but since you ask for a quick way to do it, I assume you're looking for the most performant strategy. On a large dataframe (for instance 10000 rows), you can get a very significant speed boost by using np.where instead of idxmax or get_level_values, and obtain the same result. The idea is to index the column names where the dummy dataframe is not 0:
Method:
Using the same sample data as @Nathan:
>>> dummies
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1
>>> import numpy as np
>>> s2 = pd.Series(dummies.columns[np.where(dummies!=0)[1]])
>>> s2
0    a
1    b
2    a
3    c
dtype: object
Benchmark:
On a small dummy dataframe, you won't see much difference in performance. However, here are timings for the different strategies on a large series:
s = pd.Series(np.random.choice(['a','b','c'], 10000))
dummies = pd.get_dummies(s)
def np_method(dummies=dummies):
    return pd.Series(dummies.columns[np.where(dummies!=0)[1]])
def idx_max_method(dummies=dummies):
    return dummies.idxmax(axis=1)
def get_level_values_method(dummies=dummies):
    x = dummies.stack()
    return pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
def dot_method(dummies=dummies):
    return dummies.dot(dummies.columns)
import timeit
# Time each method, 1000 iterations each:
>>> timeit.timeit(np_method, number=1000)
1.0491090340074152
>>> timeit.timeit(idx_max_method, number=1000)
12.119140846014488
>>> timeit.timeit(get_level_values_method, number=1000)
4.109266621991992
>>> timeit.timeit(dot_method, number=1000)
1.6741622970002936
The np.where method is about 4 times faster than the get_level_values method and about 11.5 times faster than the idxmax method! It also beats (by a smaller margin, roughly 1.6 times) the .dot() method outlined in this answer to a similar question.
They all return the same result:
>>> (get_level_values_method() == np_method()).all()
True
>>> (idx_max_method() == np_method()).all()
True
Answer by piRSquared
Setup
Using @Jeff's setup
import numpy as np
import pandas as pd
s = pd.Series(list('aaabbbccddefgh')).astype('category')
df = pd.get_dummies(s)
If columns are strings and there is only one 1 per row
df.dot(df.columns)
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: object
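Why the dot product works may not be obvious: multiplying a string by an integer repeats it, so each row reduces to a sum of label * flag terms. The sketch below spells that out with a hypothetical decode helper (not a pandas function) and plain string columns:

```python
import pandas as pd

labels = ['a', 'b', 'c']

def decode(row, labels):
    # 1 * 'b' == 'b' and 0 * 'a' == '', so only flagged labels survive
    return "".join(flag * label for flag, label in zip(row, labels))

one_hot = decode([0, 1, 0], labels)
multi_hot = decode([1, 0, 1], labels)  # more than one 1: labels concatenate

s = pd.Series(['a', 'b', 'a', 'c'])
dummies = pd.get_dummies(s, dtype=int)
via_dot = dummies.dot(dummies.columns)

print(one_hot, multi_hot)
print((via_dot == s).all())
```

The multi-hot case shows why this approach needs exactly one 1 per row: with two flags set, the labels are glued together into a single string.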
numpy.where
Again! Assuming only one 1 per row
i, j = np.where(df)
pd.Series(df.columns[j], i)
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a, b, c, d, e, f, g, h]
numpy.where
Not assuming one 1 per row
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j])))
0   0    a
1   0    a
2   0    a
3   1    b
4   1    b
5   1    b
6   2    c
7   2    c
8   3    d
9   3    d
10  4    e
11  5    f
12  6    g
13  7    h
dtype: object
numpy.where
Where we don't assume one 1 per row and we drop the index
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j]))).reset_index(-1, drop=True)
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: object
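The setup above still has exactly one 1 per row, so the difference between the variants is hard to see. A sketch on a made-up frame with a multi-label row; the groupby step is an addition for illustration, not part of the answer:

```python
import numpy as np
import pandas as pd

# Hypothetical dummy frame: row 2 carries two labels, column 'c' is unused
df = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1], 'c': [0, 0, 0]})

# Row and column positions of every nonzero entry, in row-major order
i, j = np.where(df.to_numpy())

# One entry per (row, label) pair, indexed by the original row label
labels = pd.Series(df.columns.to_numpy()[j], index=df.index[i])

# Collect the labels of each row into a list
per_row = labels.groupby(level=0).agg(list)

print(per_row.tolist())
```

Rows with no 1 at all would simply be absent from per_row, so reindexing against df.index may be needed if such rows can occur.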
Answer by Tarun Bhavnani
Converting dat["classification"] to a one-hot encoding and back:
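import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# dat is assumed to be an existing DataFrame with a "classification" column
le = LabelEncoder()
dat["labels"] = le.fit_transform(dat["classification"])  # labels -> integer codes
Y = pd.get_dummies(dat["labels"])  # one-hot encode the integer codes

# The position of the largest entry in each row is the integer code
tru = []
for i in range(len(Y)):
    tru.append(np.argmax(Y.iloc[i].to_numpy()))
tru = le.inverse_transform(tru)  # integer codes -> original labels

# Identity check: every recovered label should match the original
(tru == dat["classification"]).value_counts()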
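The row-by-row loop can also be written as a single vectorized argmax. The sketch below swaps sklearn's LabelEncoder for pd.factorize so that it runs with pandas alone; that substitution and the sample dat frame are assumptions for illustration, not the answer's method:

```python
import pandas as pd

# Hypothetical stand-in for the dat frame used above
dat = pd.DataFrame({"classification": ["cat", "dog", "cat", "bird"]})

# factorize(sort=True) yields contiguous integer codes plus the sorted labels
codes, uniques = pd.factorize(dat["classification"], sort=True)
Y = pd.get_dummies(pd.Series(codes), dtype=int)

# Column position of the 1 in each row equals the integer code
recovered_codes = Y.to_numpy().argmax(axis=1)
recovered = uniques[recovered_codes]

print((recovered == dat["classification"]).all())
```

This relies on the dummy columns being the codes 0..k-1 in order, which holds here because every code occurs at least once.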

