Original question: http://stackoverflow.com/questions/24645153/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow (not to this page).
pandas dataframe columns scaling with sklearn
Asked by flyingmeatball
I have a pandas dataframe with mixed type columns, and I'd like to apply sklearn's min_max_scaler to some of the columns. Ideally, I'd like to do these transformations in place, but haven't figured out a way to do that yet. I've written the following code that works:
import pandas as pd
import numpy as np
from sklearn import preprocessing

dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
min_max_scaler = preprocessing.MinMaxScaler()

def scaleColumns(df, cols_to_scale):
    # Scale each requested column separately, wrapping it in a one-column DataFrame
    for col in cols_to_scale:
        df[col] = pd.DataFrame(min_max_scaler.fit_transform(pd.DataFrame(df[col])),columns=[col])
    return df
dfTest
       A       B      C
0  14.00  103.02    big
1  90.20  107.26  small
2  90.95  110.35    big
3  96.27  114.23  small
4  91.21  114.68  small
scaled_df = scaleColumns(dfTest,['A','B'])
scaled_df
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small
I'm curious if this is the preferred/most efficient way to do this transformation. Is there a way I could use df.apply that would be better?
I'm also surprised I can't get the following code to work:
bad_output = min_max_scaler.fit_transform(dfTest['A'])
If I pass an entire dataframe to the scaler it works:
dfTest2 = dfTest.drop('C', axis = 1)
good_output = min_max_scaler.fit_transform(dfTest2)
good_output
I'm confused why passing a series to the scaler fails. In my full working code above I had hoped to just pass a series to the scaler then set the dataframe column = to the scaled series. I've seen this question asked a few other places, but haven't found a good answer. Any help understanding what's going on here would be greatly appreciated!
Accepted answer by LetsPlayYahtzee
I am not sure whether previous versions of pandas prevented this, but the following snippet now works perfectly for me and produces exactly what you want without having to use apply:
>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
...                        'B':[103.02,107.26,110.35,114.23,114.68],
...                        'C':['big','small','big','small','small']})
>>> dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A', 'B']])
>>> dfTest
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small
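For reference, the same idea can be wrapped in a small function, which is closer to the scaleColumns helper in the question. This is only a sketch: scale_columns is a hypothetical name (not part of pandas or sklearn), and returning a scaled copy instead of modifying the frame in place is an assumption about what is wanted.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_columns(df, cols_to_scale):
    """Return a copy of df with the listed numeric columns min-max scaled.

    Hypothetical helper for illustration; not a pandas or sklearn function.
    """
    scaled = df.copy()
    # The double-bracket selection is a 2-D DataFrame, which is what the scaler expects
    scaled[cols_to_scale] = MinMaxScaler().fit_transform(scaled[cols_to_scale])
    return scaled


dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                       'B': [103.02, 107.26, 110.35, 114.23, 114.68],
                       'C': ['big', 'small', 'big', 'small', 'small']})
scaled_df = scale_columns(dfTest, ['A', 'B'])
print(scaled_df)
```

Dropping the df.copy() line would make the function modify the caller's frame directly, if that is preferred.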
Answer by CT Zhu
You can do it using pandas only:
import pandas as pd

dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
df = dfTest[['A', 'B']]
# Min-max formula applied column-wise: (x - min) / (max - min)
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
# Re-attach the untouched string column
print(pd.concat((df_norm, dfTest.C), axis=1))
          A         B
0  0.000000  0.000000
1  0.926219  0.363636
2  0.935335  0.628645
3  1.000000  0.961407
4  0.938495  1.000000
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small
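A quick way to check that the pandas-only formula agrees with sklearn is to compare the two results directly. The sketch below assumes the same dfTest as above; the names cols, df_norm and sk_norm are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                       'B': [103.02, 107.26, 110.35, 114.23, 114.68],
                       'C': ['big', 'small', 'big', 'small', 'small']})
cols = ['A', 'B']

# pandas-only min-max scaling
df_norm = (dfTest[cols] - dfTest[cols].min()) / (dfTest[cols].max() - dfTest[cols].min())

# sklearn min-max scaling on the same columns
sk_norm = MinMaxScaler().fit_transform(dfTest[cols])

# Compare element-wise with a floating-point tolerance
same = np.allclose(df_norm.to_numpy(), sk_norm)
print(same)
```

One difference to keep in mind, as far as I know: for a constant column the pandas formula divides zero by zero and yields NaN, whereas MinMaxScaler maps such a column to 0.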
Answer by Eric Czech
Like this?
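import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dfTest = pd.DataFrame({
    'A':[14.00,90.20,90.95,96.27,91.21],
    'B':[103.02,107.26,110.35,114.23,114.68],
    'C':['big','small','big','small','small']
})
# Each column reaches the lambda as a 1-D Series. scikit-learn 0.19 and later
# require 2-D input and raise ValueError here (see the warning in the next answer).
dfTest[['A','B']] = dfTest[['A','B']].apply(
    lambda x: MinMaxScaler().fit_transform(x))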
dfTest
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small
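On scikit-learn 0.19 and later the scaler rejects 1-D input, so the apply approach needs each column reshaped to a single-feature 2-D array first. A minimal sketch of that variant follows; minmax_1d is a hypothetical helper name, not something from the original answer.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dfTest = pd.DataFrame({
    'A': [14.00, 90.20, 90.95, 96.27, 91.21],
    'B': [103.02, 107.26, 110.35, 114.23, 114.68],
    'C': ['big', 'small', 'big', 'small', 'small'],
})


def minmax_1d(col):
    """Scale one column (a Series) to [0, 1] and return a Series with the same index."""
    # reshape(-1, 1) turns the 1-D column into a 2-D array with a single feature
    scaled = MinMaxScaler().fit_transform(col.to_numpy().reshape(-1, 1))
    # ravel() flattens the (n, 1) result back to 1-D
    return pd.Series(scaled.ravel(), index=col.index)


dfTest[['A', 'B']] = dfTest[['A', 'B']].apply(minmax_1d)
print(dfTest)
```

This fits one scaler per column, which gives the same numbers as fitting a single scaler on both columns at once, since min-max scaling works per feature anyway.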
Answer by Low Yield Bond
As mentioned in pir's comment, the .apply(lambda el: scale.fit_transform(el)) method produces the following warning:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
Converting your columns to numpy arrays should do the job (I prefer StandardScaler):
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
# Only the numeric columns can be scaled; 'C' holds strings, so it is left out.
# Note: .as_matrix() was deprecated in pandas 0.23 and removed in pandas 1.0.
dfTest[['A','B']] = scale.fit_transform(dfTest[['A','B']].as_matrix())
-- Edit Nov 2018 (tested with pandas 0.23.4) --
As Rob Murray mentions in the comments, in the current (v0.23.4) version of pandas, .as_matrix() emits a FutureWarning. Therefore, it should be replaced by .values:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit_transform(dfTest[['A','B']].values)
-- Edit May 2019 (tested with pandas 0.24.2) --
As joelostblom mentions in the comments, "Since 0.24.0, it is recommended to use .to_numpy() instead of .values."
Updated example:
import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
dfTest = pd.DataFrame({
    'A':[14.00,90.20,90.95,96.27,91.21],
    'B':[103.02,107.26,110.35,114.23,114.68],
    'C':['big','small','big','small','small']
})
dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A','B']].to_numpy())
dfTest
          A         B      C
0 -1.995290 -1.571117    big
1  0.436356 -0.603995  small
2  0.460289  0.100818    big
3  0.630058  0.985826  small
4  0.468586  1.088469  small
Answer by athlonshi
df = pd.DataFrame(scale.fit_transform(df.values), columns=df.columns, index=df.index)
This should work without deprecation warnings.
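To show what passing columns= and index= buys you, here is a small sketch with a non-default index. The index labels are made up for illustration, and scale is assumed to be a StandardScaler as in the previous answer.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Numeric-only frame with a custom (non-default) index, invented for this example
df = pd.DataFrame(
    {'A': [14.00, 90.20, 90.95, 96.27, 91.21],
     'B': [103.02, 107.26, 110.35, 114.23, 114.68]},
    index=['r1', 'r2', 'r3', 'r4', 'r5'],
)

scale = StandardScaler()
# fit_transform returns a plain NumPy array, so labels must be re-attached by hand
scaled = pd.DataFrame(scale.fit_transform(df.values),
                      columns=df.columns, index=df.index)
print(scaled)
```

Without index=df.index the result would get a fresh 0..n-1 index, which matters if the scaled frame is later joined back to other data by label.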
Answer by WAN
I know it's a very old comment, but still:
Instead of using single brackets (dfTest['A']), use double brackets (dfTest[['A']]).
i.e.: min_max_scaler.fit_transform(dfTest[['A']])
I believe this will give the desired result.
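The reason the brackets matter is the shape of what gets passed: single brackets give a 1-D Series, double brackets give a 2-D one-column DataFrame, and the scaler expects 2-D input. A short sketch, assuming the dfTest and min_max_scaler from the question:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                       'B': [103.02, 107.26, 110.35, 114.23, 114.68],
                       'C': ['big', 'small', 'big', 'small', 'small']})
min_max_scaler = MinMaxScaler()

series_shape = dfTest['A'].shape    # 1-D: a Series
frame_shape = dfTest[['A']].shape   # 2-D: a one-column DataFrame
print(series_shape, frame_shape)

# Scale just column A, writing the result back into the same column
dfTest[['A']] = min_max_scaler.fit_transform(dfTest[['A']])
print(dfTest)
```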