
Note: this content is from a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me): http://stackoverflow.com/questions/34007308/

Linear regression analysis with string/categorical features (variables)?

python machine-learning regression linear-regression feature-selection

Asked by Erba Aitbayev

Regression algorithms seem to work on features represented as numbers. For example:

(image: a simple data set without categorical features)

This data set doesn't contain categorical features/variables. It's quite clear how to do regression on this data and predict the price.



But now I want to do a regression analysis on data that contain categorical features:

(image: a data set with categorical features)

There are 5 features: District, Condition, Material, Security, Type.



How can I do a regression on this data? Do I have to transform all the string/categorical data to numbers manually, i.e. create my own encoding rules and transform all data to numeric values according to those rules?

Is there any simple way to transform string data to numbers without having to create my own encoding rules manually? Maybe there are some libraries in Python that can be used for that? Are there some risks that the regression model will be somehow incorrect due to "bad encoding"?

Accepted answer by MaxNoe

Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent.

Usually there are three possibilities:

  1. One-hot encoding for categorical data
  2. Arbitrary numbers for ordinal data
  3. Use something like group means for categorical data (e.g. mean prices for city districts).

You have to be careful not to introduce information that you would not actually have in the application case.

One-hot encoding

If you have categorical data, you can create dummy variables with 0/1 values for each possible value.

E.g.

idx color
0   blue
1   green
2   green
3   red

to

idx blue green red
0   1    0     0
1   0    1     0
2   0    1     0
3   0    0     1

This can easily be done with pandas:

import pandas as pd

data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
# one 0/1 indicator column per distinct value
print(pd.get_dummies(data))

will result in:

   color_blue  color_green  color_red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1

Numbers for ordinal data

Create a mapping of your sortable categories, e.g. old < renovated < new → 0, 1, 2.

This is also possible with pandas:

data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
data['q'] = data['q'].astype('category')
# define the ordering: old < ren < new
data['q'] = data['q'].cat.reorder_categories(['old', 'ren', 'new'], ordered=True)
# replace each category with its integer code (0, 1, 2)
data['q'] = data['q'].cat.codes
print(data['q'])

Result:

0    0
1    2
2    2
3    1
Name: q, dtype: int8

Using categorical data for groupby operations

You could use the mean price of each category, computed from past (known) events.

Say you have a DataFrame with the last known mean prices for cities:

prices = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [1, 1, 1, 2, 2, 3],
})
# mean price per city, computed from past (known) events
mean_price = prices.groupby('city').mean()
data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})

# look up the known mean price for each row's city
print(data.merge(mean_price, on='city', how='left'))

Result:

  city  price
0    A      1
1    B      2
2    C      3
3    A      1
4    B      2
5    A      1

Answer by burhan

You can use "dummy coding" in this case. There are Python libraries to do dummy coding; you have a few options:

  • You may use the scikit-learn library; take a look at its OneHotEncoder (a minimal sketch follows this list).
  • Or, if you are working with pandas, it has a built-in function to create dummy variables.
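
A minimal sketch of the scikit-learn route, using OneHotEncoder on a made-up single-column color feature (not the question's data):

from sklearn.preprocessing import OneHotEncoder

# toy data: one categorical column (hypothetical, for illustration only)
colors = [['blue'], ['green'], ['green'], ['red']]

encoder = OneHotEncoder()                # returns a sparse matrix by default
encoded = encoder.fit_transform(colors)  # learn the categories, then transform

print(encoder.categories_)   # [array(['blue', 'green', 'red'], dtype=object)]
print(encoded.toarray())     # dense 0/1 matrix, one column per category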

An example with pandas is below:

import pandas as pd

sample_data = [[1, 2, 'a'], [3, 4, 'b'], [5, 6, 'c'], [7, 8, 'b']]
df = pd.DataFrame(sample_data, columns=['numeric1', 'numeric2', 'categorical'])
# one 0/1 column per distinct value of the categorical column
dummies = pd.get_dummies(df.categorical)
# join() returns a new DataFrame, so print (or assign) the result
print(df.join(dummies))

Answer by Harvey

In linear regression with categorical variables you should be careful of the dummy variable trap. The dummy variable trap is a scenario in which the independent variables are multicollinear, i.e. two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. This can produce singularity of the model, meaning your model just won't work. Read about it here.

The idea is to use dummy variable encoding with drop_first=True, which omits one column from each category after converting the categorical variables into dummy/indicator variables. You will NOT lose any relevant information by doing that, simply because every point in the dataset can be fully explained by the rest of the features.
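
As a minimal illustration of what drop_first does, consider a made-up color column (not the housing data):

import pandas as pd

df = pd.DataFrame({'color': ['blue', 'green', 'red']})
print(pd.get_dummies(df))                   # color_blue, color_green, color_red
print(pd.get_dummies(df, drop_first=True))  # color_green, color_red; 'blue' is the all-zero row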

Here is the complete code for how you can do it for your housing dataset.

So you have categorical features:

District, Condition, Material, Security, Type

And one numerical feature that you are trying to predict:

Price

First you need to split your initial dataset into input variables and the prediction target; assuming it is a pandas DataFrame, it would look like this:

Input variables:

X = housing[['District','Condition','Material','Security','Type']]

Prediction:

Y = housing['Price']

Convert the categorical variables into dummy/indicator variables, dropping one in each category:

X = pd.get_dummies(data=X, drop_first=True)

So now if you check the shape of X with drop_first=True, you will see that it has 5 fewer columns - one for each of your categorical variables.
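
A sketch of that check, assuming the housing DataFrame from above (run it before X is overwritten):

# count the columns with and without drop_first (hypothetical housing frame from above)
cat_cols = ['District', 'Condition', 'Material', 'Security', 'Type']
full = pd.get_dummies(housing[cat_cols])
reduced = pd.get_dummies(housing[cat_cols], drop_first=True)
print(full.shape[1] - reduced.shape[1])  # 5: one dropped column per categorical feature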

You can now continue to use them in your linear model. For a scikit-learn implementation it could look like this:

from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=40)

# do not use fit_intercept=False if you have dropped a column after dummy encoding
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
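
For a quick sense of fit quality, you could evaluate R² on the held-out split (a sketch using the variables above):

# coefficient of determination on the test data; closer to 1 is better
print(regr.score(X_test, Y_test))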

Answer by ShikharDua

One way to do regression with categorical variables as independent variables is, as mentioned above, using encoding. Another way is to use an R-like statistical formula with the statsmodels library. Here is a code snippet:

import seaborn as sns
from statsmodels.formula.api import ols

tips = sns.load_dataset("tips")

# C() marks a term as categorical in the R-style formula
model = ols('tip ~ total_bill + C(sex) + C(day) + size', data=tips)
fitted_model = model.fit()
print(fitted_model.summary())
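
If you only need the fitted coefficients rather than the full summary table, the results object also exposes them (a sketch):

print(fitted_model.params)  # one coefficient per formula term, including the levels expanded by C()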

Dataset

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Summary of regression

(image: OLS regression results summary)