Linear regression analysis with string/categorical features (variables)?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/34007308/
Asked by Erba Aitbayev
Regression algorithms seem to work on features represented as numbers. For example:
This data set doesn't contain categorical features/variables. It's quite clear how to do regression on this data and predict price.
But now I want to do a regression analysis on data that contain categorical features:
There are 5 features: District, Condition, Material, Security, Type.
How can I do a regression on this data? Do I have to transform all the string/categorical data to numbers manually? I mean, do I have to create some encoding rules and, according to those rules, transform all the data to numeric values?
Is there any simple way to transform string data to numbers without having to create my own encoding rules manually? Maybe there are some libraries in Python that can be used for that? Is there a risk that the regression model will be somehow incorrect due to "bad encoding"?
Accepted answer by MaxNoe
Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent.
Usually there are three possibilities:
- One-hot encoding for categorical data
- Arbitrary numbers for ordinal data
- Use something like group means for categorical data (e.g. mean prices for city districts)
You have to be careful not to infuse information that you do not have in the application case.
One-hot encoding
If you have categorical data, you can create dummy variables with 0/1 values for each possible value.
E.g.
idx color
0 blue
1 green
2 green
3 red
to
idx blue green red
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
This can easily be done with pandas:
import pandas as pd
data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
print(pd.get_dummies(data))
will result in:
color_blue color_green color_red
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
Numbers for ordinal data
Create a mapping of your sortable categories, e.g. old < renovated < new → 0, 1, 2
This is also possible with pandas:
data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
data['q'] = data['q'].astype('category')
# make the order explicit: old < ren < new
data['q'] = data['q'].cat.reorder_categories(['old', 'ren', 'new'], ordered=True)
# the integer codes now respect that order
data['q'] = data['q'].cat.codes
print(data['q'])
Result:
0 0
1 2
2 2
3 1
Name: q, dtype: int8
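Alternatively, a minimal sketch of the same ordinal mapping written by hand with a plain dict and Series.map (assuming the same old < ren < new ordering as above):
import pandas as pd

data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
order = {'old': 0, 'ren': 1, 'new': 2}  # hand-written ordering: old < ren < new
data['q'] = data['q'].map(order)
print(data['q'])
This keeps the encoding explicit, at the cost of maintaining the mapping yourself.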
Using categorical data for groupby operations
You could use the mean for each category over past (known) events.
Say you have a DataFrame with the last known mean prices for cities:
prices = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [1, 1, 1, 2, 2, 3],
})
# mean price per city, computed from the known data
mean_price = prices.groupby('city').mean()
data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})
print(data.merge(mean_price, on='city', how='left'))
Result:
city price
0 A 1
1 B 2
2 C 3
3 A 1
4 B 2
5 A 1
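Per the caution above about not infusing information you do not have, a minimal sketch (with hypothetical train and new_rows frames) would compute the group means from the known rows only and map them onto the rows being predicted:
import pandas as pd

# known past events with prices
train = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [1, 1, 1, 2, 2, 3],
})
# new rows whose prices are unknown
new_rows = pd.DataFrame({'city': ['A', 'C', 'B']})

# means come from the known data only, so nothing leaks from the rows being predicted
city_mean = train.groupby('city')['price'].mean()
new_rows['city_mean_price'] = new_rows['city'].map(city_mean)
print(new_rows)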
Answer by burhan
You can use "Dummy Coding" in this case. There are Python libraries to do dummy coding; you have a few options:
- You may use the scikit-learn library. Take a look here (a sketch follows the pandas example below).
- Or, if you are working with pandas, it has a built-in function to create dummy variables.
An example with pandas is below:
import pandas as pd

sample_data = [[1, 2, 'a'], [3, 4, 'b'], [5, 6, 'c'], [7, 8, 'b']]
df = pd.DataFrame(sample_data, columns=['numeric1', 'numeric2', 'categorical'])
# one 0/1 indicator column per category value
dummies = pd.get_dummies(df.categorical)
df = df.join(dummies)
print(df)
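For the scikit-learn option from the list above, a rough sketch using OneHotEncoder could look like this (the column names simply reuse this answer's example; get_feature_names_out assumes a reasonably recent scikit-learn version):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'numeric1': [1, 3, 5, 7],
                   'numeric2': [2, 4, 6, 8],
                   'categorical': ['a', 'b', 'c', 'b']})

# fit the encoder on the categorical column; unknown categories at predict time become all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(df[['categorical']])  # sparse matrix by default

dummies = pd.DataFrame(encoded.toarray(),
                       columns=encoder.get_feature_names_out(['categorical']),
                       index=df.index)
print(pd.concat([df[['numeric1', 'numeric2']], dummies], axis=1))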
Answer by Harvey
In linear regression with categorical variables you should be careful of the Dummy Variable Trap. The Dummy Variable Trap is a scenario in which the independent variables are multicollinear, i.e. two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. This can produce singularity of the model, meaning your model just won't work. Read about it here.
The idea is to use dummy variable encoding with drop_first=True; this will omit one column from each category after converting the categorical variables into dummy/indicator variables. You will NOT lose any relevant information by doing that, simply because every point in your dataset can be fully explained by the rest of the features.
Here is the complete code for how you can do it for your housing dataset.
So you have categorical features:
District, Condition, Material, Security, Type
And one numerical feature that you are trying to predict:
Price
First you need to split your initial dataset into input variables and the prediction target; assuming it is a pandas DataFrame, it would look like this:
Input variables:
X = housing[['District','Condition','Material','Security','Type']]
Prediction:
Y = housing['Price']
Convert the categorical variables into dummy/indicator variables and drop one in each category:
X = pd.get_dummies(data=X, drop_first=True)
So now if you check shape of X with drop_first=True
you will see that it has 4 columns less - one for each of your categorical variables.
所以现在如果你检查 X 的形状,drop_first=True
你会发现它少了 4 列 - 每个分类变量都有一个。
You can now continue to use them in your linear model. For a scikit-learn implementation it could look like this:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .20, random_state = 40)
regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
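As a quick sanity check, one possible follow-up (an addition to the sketch above, not part of the original answer) is to score the held-out split with R²:
from sklearn.metrics import r2_score

# coefficient of determination on the test set; closer to 1 is better
print(r2_score(Y_test, predicted))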
Answer by ShikharDua
One way to achieve regression with categorical variables as independent variables is, as mentioned above, using encoding. Another way is to use an R-like statistical formula with the statsmodels library. Here is a code snippet:
import seaborn as sns  # provides the example "tips" dataset
from statsmodels.formula.api import ols

tips = sns.load_dataset("tips")
# C(...) marks a column as categorical in the formula
model = ols('tip ~ total_bill + C(sex) + C(day) + size', data=tips)
fitted_model = model.fit()
print(fitted_model.summary())
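For reference: wrapping a column in C() tells the formula interface (patsy) to treat it as categorical, so statsmodels builds the dummy columns itself and drops one reference level automatically, which also sidesteps the dummy variable trap mentioned in the previous answer.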
Dataset
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Summary of regression