Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/31498390/
how to get pandas get_dummies to emit N-1 variables to avoid collinearity?
Asked by ihadanny
pandas.get_dummies emits one dummy variable per categorical value. Is there some automated, easy way to ask it to create only N-1 dummy variables (that is, to arbitrarily drop one "baseline" variable)?
This is needed to avoid collinearity in our dataset.
Answered by T.C. Proctor
Pandas version 0.18.0 implemented exactly what you're looking for: the drop_first option. Here's an example:
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: u'0.18.1'
In [3]: s = pd.Series(list('abcbacb'))
In [4]: pd.get_dummies(s, drop_first=True)
Out[4]:
     b    c
0  0.0  0.0
1  1.0  0.0
2  0.0  1.0
3  1.0  0.0
4  0.0  0.0
5  0.0  1.0
6  1.0  0.0
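As a minimal sketch of how the same option behaves on a DataFrame, assuming a made-up frame (the column names color and size and their values are illustrative, not from the original answer): drop_first removes the first level of each encoded column. The dtype of the dummy columns differs between pandas versions, so the sketch casts to int to keep the result comparable.

```python
import pandas as pd

# Hypothetical data: column names and values are illustrative only.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "M", "S", "L"],
})

# drop_first=True drops the first level of each encoded column
# (for plain string columns, the first level in sorted order),
# leaving N-1 dummies per categorical variable.
dummies = pd.get_dummies(df, columns=["color", "size"], drop_first=True)

# Cast to int so the values look the same across pandas versions.
dummies = dummies.astype(int)
print(dummies.columns.tolist())
```

Here "blue" and "L" act as the baselines, so a row of all zeros within one prefix means that row belongs to the dropped level.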
Answered by Ami Tavory
There are a number of ways of doing so.
Possibly the simplest is replacing one of the values by None before calling get_dummies. Say you have:
import pandas as pd
import numpy as np
s = pd.Series(list('babca'))
>>> s
0    b
1    a
2    b
3    c
4    a
Then use:
>>> pd.get_dummies(np.where(s == s.unique()[0], None, s))
   a  c
0  0  0
1  1  0
2  0  0
3  0  1
4  1  0
This drops b, the first unique value in the series.
(Of course, you need to check whether your category column already contains None, because such entries would become indistinguishable from the dropped baseline value.)
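The same idea can be sketched with pandas alone, using Series.mask in place of np.where. This is a variation on the approach above, not the answer's own code; the explicit baseline variable and the guard against pre-existing missing values are assumptions added for illustration.

```python
import pandas as pd

s = pd.Series(list("babca"))

# Assumption: the baseline is chosen explicitly rather than arbitrarily.
baseline = "b"

# Guard: existing missing values would be confused with the baseline.
if s.isna().any():
    raise ValueError("series already contains missing values")

# Series.mask replaces entries where the condition holds with NaN;
# get_dummies ignores NaN when dummy_na=False (the default).
dummies = pd.get_dummies(s.mask(s == baseline)).astype(int)
print(dummies)
```

Rows that held the baseline value end up with zeros in every remaining column.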
Another way is to use the prefix argument to get_dummies:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False)
prefix: string, list of strings, or dict of strings, default None - String to append to DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.
This will add the prefix to the names of all of the resulting columns, and you can then drop one of the columns carrying this prefix (just make sure the prefix is unique).
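A minimal sketch of the prefix approach, assuming a hypothetical frame with a single column named color and a prefix of the same name (both are illustrative, not from the original answer):

```python
import pandas as pd

# Hypothetical frame; the column name and the prefix are illustrative.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Map the column to a unique prefix so its dummies are easy to find.
dummies = pd.get_dummies(df, columns=["color"], prefix={"color": "color"})

# Collect the columns that carry the prefix and drop one as the baseline.
prefixed = [c for c in dummies.columns if c.startswith("color_")]
reduced = dummies.drop(columns=prefixed[0]).astype(int)
print(reduced.columns.tolist())
```

Which column is dropped is arbitrary here (the first one found); any single column per prefix would serve as the baseline.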

