pandas: how to get pandas get_dummies to emit N-1 variables to avoid collinearity?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/31498390/

Date: 2020-09-13 23:38:39  Source: igfitidea

how to get pandas get_dummies to emit N-1 variables to avoid collinearity?

Tags: python, pandas, machine-learning, dummy-variable

Asked by ihadanny

pandas.get_dummies emits one dummy variable per categorical value. Is there an automated, easy way to ask it to create only N-1 dummy variables (that is, to drop one arbitrary "baseline" variable)?

This is needed to avoid collinearity in our dataset.

Answered by T.C. Proctor

Pandas version 0.18.0 implemented exactly what you're looking for: the drop_first option. Here's an example:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: u'0.18.1'

In [3]: s = pd.Series(list('abcbacb'))

In [4]: pd.get_dummies(s, drop_first=True)
Out[4]: 
     b    c
0  0.0  0.0
1  1.0  0.0
2  0.0  1.0
3  1.0  0.0
4  0.0  0.0
5  0.0  1.0
6  1.0  0.0
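
The same option also works on a DataFrame. The following is a minimal sketch, assuming a reasonably recent pandas (the dtype argument requires 0.23 or later); the frame and the column names "color" and "size" are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({"color": list("abcbacb"), "size": [1, 2, 3, 4, 5, 6, 7]})

# Encode only "color"; drop_first=True omits the first level ("a"),
# leaving N-1 = 2 dummy columns. dtype=int gives 0/1 integers.
dummies = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(dummies)
```

Rows where both color_b and color_c are 0 correspond to the dropped baseline level "a".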

Answered by Ami Tavory

There are a number of ways of doing so.

Possibly the simplest is replacing one of the values with None before calling get_dummies. Say you have:

import pandas as pd
import numpy as np
s = pd.Series(list('babca'))
>> s
0    b
1    a
2    b
3    c
4    a

Then use:

>> pd.get_dummies(np.where(s == s.unique()[0], None, s))
    a   c
0   0   0
1   1   0
2   0   0
3   0   1
4   1   0

to drop b.

(Of course, you need to check whether your category column already contains None, since existing missing values would be indistinguishable from the dropped baseline.)
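
A variant of the same idea can be sketched with Series.mask instead of np.where. This is an assumption-labeled alternative rather than the answer's own code: it blanks out the baseline value, and get_dummies (with its default dummy_na=False) then ignores the missing entries.

```python
import pandas as pd

s = pd.Series(list("babca"))

# Guard for the caveat above: the column must not already hold missing values.
assert not s.isna().any()

baseline = s.unique()[0]          # first value seen, here "b"
masked = s.mask(s == baseline)    # baseline entries become missing

# Missing entries get no dummy column, so only N-1 columns remain.
dummies = pd.get_dummies(masked, dtype=int)

print(dummies)
```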



Another way is to use the prefix argument to get_dummies:

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False)

prefix: string, list of strings, or dict of strings, default None. String to append to DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

This adds a prefix to all of the resulting columns, and you can then drop one of the columns carrying that prefix (just make sure the prefix is unique).
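
As a rough sketch of the prefix approach: the column name "letter" and the prefix "cat" below are hypothetical choices for illustration, and dropping the first prefixed column is just one arbitrary way to pick the baseline.

```python
import pandas as pd

# Hypothetical frame with a single categorical column.
df = pd.DataFrame({"letter": list("babca")})

# Every dummy column gets the unique prefix "cat" (joined with prefix_sep="_").
dummies = pd.get_dummies(df, columns=["letter"], prefix="cat", dtype=int)

# Collect the columns carrying that prefix and drop one of them as the baseline.
prefixed = [c for c in dummies.columns if c.startswith("cat_")]
reduced = dummies.drop(columns=prefixed[0])

print(reduced)
```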
