Apply StandardScaler to parts of a data set
Original question: http://stackoverflow.com/questions/38420847/
This content is from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and cite the original URL.
Asked by mitsi
I want to use sklearn's StandardScaler. Is it possible to apply it to some feature columns but not others?
For instance, say my data is:
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})
Age Name Weight
0 18 3 68
1 92 4 59
2 98 6 49
col_names = ['Name', 'Age', 'Weight']
features = data[col_names]
I fit and transform the data:
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features = pd.DataFrame(features, columns = col_names)
Name Age Weight
0 -1.069045 -1.411004 1.202703
1 -0.267261 0.623041 0.042954
2 1.336306 0.787964 -1.245657
But of course the names are not really integers but strings, and I don't want to standardize them. How can I apply the fit and transform methods only to the columns Age and Weight?
Answered by ayhan
Update: currently the best way to handle this is to use ColumnTransformer, which is covered in a later answer on this page.
First create a copy of your dataframe:
scaled_features = data.copy()
Don't include the Name column in the transformation:
col_names = ['Age', 'Weight']
features = scaled_features[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
Now, don't create a new dataframe but assign the result to those two columns:
scaled_features[col_names] = features
print(scaled_features)
Age Name Weight
0 -1.411004 3 1.202703
1 0.623041 4 0.042954
2 0.787964 6 -1.245657
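As a sanity check on the numbers above, here is a minimal sketch that reproduces the scaled Age and Weight values by hand. It assumes StandardScaler's default behaviour of subtracting each column's mean and dividing by its population standard deviation (ddof=0). The names scaled, values and manual are illustrative only and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
col_names = ['Age', 'Weight']

# Scale only the selected columns on a copy, leaving Name untouched
scaled = data.copy()
scaled[col_names] = StandardScaler().fit_transform(data[col_names])

# Manual z-score: (x - mean) / population standard deviation (ddof=0)
values = data[col_names].to_numpy(dtype=float)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# Compare the scaler's result with the hand-computed one
print(np.allclose(scaled[col_names].to_numpy(), manual))
print(scaled)
```

Note that pandas' own std() defaults to the sample standard deviation (ddof=1), so data.std() would not match these values without passing ddof=0.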
Answered by Guy C
ColumnTransformer, introduced in scikit-learn v0.20, applies transformers to a specified set of columns of an array or pandas DataFrame.
import pandas as pd
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})
col_names = ['Name', 'Age', 'Weight']
features = data[col_names]
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
ct = ColumnTransformer([
('somename', StandardScaler(), ['Age', 'Weight'])
], remainder='passthrough')
ct.fit_transform(features)
NB: Like Pipeline, it also has a shorthand version, make_column_transformer, which doesn't require naming the transformers.
Output:
[[-1.41100443,  1.20270298,  3.        ],
 [ 0.62304092,  0.04295368,  4.        ],
 [ 0.78796352, -1.24565666,  6.        ]]
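The make_column_transformer shorthand mentioned in this answer might look like the sketch below. It assumes a scikit-learn version in which each tuple is written as (transformer, columns); very early releases used the opposite order. Since the result is a plain array with the scaled columns first and the passed-through columns last, the sketch also rebuilds a DataFrame with matching labels. The names scale_cols, other_cols and scaled_df are illustrative and not from the original answer.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
scale_cols = ['Age', 'Weight']

# No transformer name is needed; names are generated automatically
ct = make_column_transformer(
    (StandardScaler(), scale_cols),
    remainder='passthrough',
)
result = ct.fit_transform(data)

# Transformed columns come first, followed by the passed-through remainder
other_cols = [c for c in data.columns if c not in scale_cols]
scaled_df = pd.DataFrame(result, columns=scale_cols + other_cols, index=data.index)
print(scaled_df)
```

Because the output array has a single floating-point dtype, the Name values come back as floats in this sketch; cast them back if the original dtype matters.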
Answered by Danil
Another option would be to drop the Name column before scaling, then merge it back in:
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})
# Save the variable you don't want to scale
name_var = data['Name']
# Create the scaler and fit it to the columns that should be scaled
scaler = StandardScaler()
scaler.fit(data.drop('Name', axis = 1))
# Calculate scaled values and store them in a separate object
scaled_values = scaler.transform(data.drop('Name', axis = 1))
# Rebuild the DataFrame from the scaled values, then restore the Name column
data = pd.DataFrame(scaled_values, index = data.index, columns = data.drop('Name', axis = 1).columns)
data['Name'] = name_var
print(data)
Answered by hashcode55
A more Pythonic way to do this:
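from sklearn.preprocessing import StandardScaler
# StandardScaler expects 2-D input, so scale both columns in a single call
# rather than applying it to each 1-D column separately
data[['Age','Weight']] = StandardScaler().fit_transform(data[['Age','Weight']])
data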
Output:
Age Name Weight
0 -1.411004 3 1.202703
1 0.623041 4 0.042954
2 0.787964 6 -1.245657