Python: Scaling pandas DataFrame columns with sklearn

Question

Asked by flyingmeatball

I have a pandas dataframe with mixed type columns, and I'd like to apply sklearn's min_max_scaler to some of the columns. Ideally, I'd like to do these transformations in place, but haven't figured out a way to do that yet. I've written the following code that works:

import pandas as pd
import numpy as np
from sklearn import preprocessing

dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
min_max_scaler = preprocessing.MinMaxScaler()

def scaleColumns(df, cols_to_scale):
    # Scale each listed column in place; the scaler is refit for every column.
    for col in cols_to_scale:
        # Wrap the Series in a one-column DataFrame so the scaler receives 2-D input.
        df[col] = pd.DataFrame(min_max_scaler.fit_transform(pd.DataFrame(df[col])),columns=[col])
    return df

dfTest

       A       B      C
0  14.00  103.02    big
1  90.20  107.26  small
2  90.95  110.35    big
3  96.27  114.23  small
4  91.21  114.68  small

scaled_df = scaleColumns(dfTest,['A','B'])
scaled_df

          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small

I'm curious if this is the preferred/most efficient way to do this transformation. Is there a way I could use df.apply that would be better?

I'm also surprised I can't get the following code to work:

bad_output = min_max_scaler.fit_transform(dfTest['A'])

If I pass an entire dataframe to the scaler it works:

dfTest2 = dfTest.drop('C', axis = 1)
good_output = min_max_scaler.fit_transform(dfTest2)
good_output

I'm confused why passing a series to the scaler fails. In my full working code above I had hoped to just pass a series to the scaler then set the dataframe column = to the scaled series. I've seen this question asked a few other places, but haven't found a good answer. Any help understanding what's going on here would be greatly appreciated!

Answer 1

Accepted answer by LetsPlayYahtzee

I am not sure whether previous versions of pandas prevented this, but the following snippet now works perfectly for me and produces exactly what you want without having to use apply:

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler


>>> scaler = MinMaxScaler()

>>> dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
                           'B':[103.02,107.26,110.35,114.23,114.68],
                           'C':['big','small','big','small','small']})

>>> dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A', 'B']])

>>> dfTest
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small
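
If you would rather not modify the original frame, the same assignment can be wrapped in a small helper. This is only a sketch: the function name scale_columns and its signature are my own invention, not part of pandas or sklearn. It returns a scaled copy together with the fitted scaler, so the values can later be mapped back with inverse_transform.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_columns(df, cols, scaler=None):
    """Return (scaled copy of df, fitted scaler). Hypothetical helper."""
    scaler = MinMaxScaler() if scaler is None else scaler
    out = df.copy()
    # Selecting with a list keeps the data 2-D, which the scaler expects.
    out[cols] = scaler.fit_transform(out[cols])
    return out, scaler


dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                       'B': [103.02, 107.26, 110.35, 114.23, 114.68],
                       'C': ['big', 'small', 'big', 'small', 'small']})

scaled, fitted = scale_columns(dfTest, ['A', 'B'])
print(scaled)

# The fitted scaler can map the scaled values back to the original units.
restored = fitted.inverse_transform(scaled[['A', 'B']])
print(restored)
```

The original dfTest is left untouched here, which is the trade-off against the in-place assignment shown above.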

Answer 2

Answered by CT Zhu

You can do it using pandas only:

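import pandas as pd

dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
df = dfTest[['A', 'B']]
# Min-max normalization: subtract each column's minimum, then divide by its range
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
# Reattach the untouched categorical column
print(pd.concat((df_norm, dfTest.C), axis=1))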

          A         B
0  0.000000  0.000000
1  0.926219  0.363636
2  0.935335  0.628645
3  1.000000  0.961407
4  0.938495  1.000000
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small
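
As a quick sanity check, the pandas-only formula can be compared against MinMaxScaler. The sketch below assumes the default feature_range of (0, 1). The constant column K is a made-up example to show one place where the two approaches should differ: the pandas formula divides zero by zero, while sklearn treats a zero range as 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                       'B': [103.02, 107.26, 110.35, 114.23, 114.68],
                       'C': ['big', 'small', 'big', 'small', 'small']})

num = dfTest[['A', 'B']]
by_hand = (num - num.min()) / (num.max() - num.min())
by_sklearn = MinMaxScaler().fit_transform(num)

# Both routes should agree up to floating-point rounding.
print(np.allclose(by_hand.to_numpy(), by_sklearn))

# Edge case (hypothetical data): a column whose values are all identical.
const = pd.DataFrame({'K': [5.0, 5.0, 5.0]})
hand_const = (const - const.min()) / (const.max() - const.min())
sk_const = MinMaxScaler().fit_transform(const)
print(hand_const)  # 0/0 in pandas
print(sk_const)    # sklearn guards against the zero range
```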

Answer 3

Answered by Eric Czech

Like this?

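import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dfTest = pd.DataFrame({
           'A':[14.00,90.20,90.95,96.27,91.21],
           'B':[103.02,107.26,110.35,114.23,114.68],
           'C':['big','small','big','small','small']
         })
# apply passes each column as a 1-D Series; reshape it into a single-feature
# 2-D array (1-D input raises ValueError from sklearn 0.19 on), then flatten.
dfTest[['A','B']] = dfTest[['A','B']].apply(
    lambda x: MinMaxScaler().fit_transform(x.to_numpy().reshape(-1, 1)).ravel())
dfTest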

    A           B           C
0   0.000000    0.000000    big
1   0.926219    0.363636    small
2   0.935335    0.628645    big
3   1.000000    0.961407    small
4   0.938495    1.000000    small

Answer 4

Answered by Low Yield Bond

As mentioned in pir's comment, the .apply(lambda el: scale.fit_transform(el)) method produces the following warning:

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

Converting your columns to numpy arrays should do the job (I prefer StandardScaler):

~~from sklearn.preprocessing import StandardScaler; scale = StandardScaler(); dfTest[['A','B','C']] = scale.fit_transform(dfTest[['A','B','C']].as_matrix())~~

-- Edit Nov 2018 (tested with pandas 0.23.4) --

As Rob Murray mentions in the comments, in the current (v0.23.4) version of pandas .as_matrix() raises a FutureWarning. Therefore, it should be replaced by .values:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit_transform(dfTest[['A','B']].values)

-- Edit May 2019 (tested with pandas 0.24.2) --

As joelostblom mentions in the comments, "Since 0.24.0, it is recommended to use .to_numpy() instead of .values."

Updated example:

import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
dfTest = pd.DataFrame({
               'A':[14.00,90.20,90.95,96.27,91.21],
               'B':[103.02,107.26,110.35,114.23,114.68],
               'C':['big','small','big','small','small']
             })
dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A','B']].to_numpy())
dfTest
          A         B      C
0 -1.995290 -1.571117    big
1  0.436356 -0.603995  small
2  0.460289  0.100818    big
3  0.630058  0.985826  small
4  0.468586  1.088469  small
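
For comparison, the StandardScaler result can be reproduced in pandas alone. This is a sketch, and the one detail to watch is ddof: StandardScaler divides by the population standard deviation, so the pandas call needs std(ddof=0) rather than the default sample standard deviation (ddof=1).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                       'B': [103.02, 107.26, 110.35, 114.23, 114.68],
                       'C': ['big', 'small', 'big', 'small', 'small']})

num = dfTest[['A', 'B']]
# Standardize by hand: subtract the mean, divide by the population std.
by_hand = (num - num.mean()) / num.std(ddof=0)
by_sklearn = StandardScaler().fit_transform(num.to_numpy())

# Both routes should agree up to floating-point rounding.
print(np.allclose(by_hand.to_numpy(), by_sklearn))
print(by_hand)
```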

Answer 5

Answered by athlonshi

df = pd.DataFrame(scale.fit_transform(df.values), columns=df.columns, index=df.index)

This should work without deprecation warnings.
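
A small sketch of why passing index=df.index matters. The answer does not define scale, so here I assume it is a MinMaxScaler, and the row labels 'x', 'y', 'z' are made up for illustration. The scaler returns a plain NumPy array, so without the index argument the rebuilt frame falls back to a default 0..n-1 index.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame with a non-default index.
df = pd.DataFrame({'A': [14.00, 90.20, 90.95],
                   'B': [103.02, 107.26, 110.35]},
                  index=['x', 'y', 'z'])

scale = MinMaxScaler()  # assumption: any sklearn scaler would work here

# Keeps both the column names and the row labels of the original frame.
scaled = pd.DataFrame(scale.fit_transform(df.values),
                      columns=df.columns, index=df.index)

# Without index=, the row labels are lost and replaced by 0, 1, 2.
no_index = pd.DataFrame(scale.fit_transform(df.values), columns=df.columns)

print(scaled)
print(no_index)
```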

Answer 6

Answered by WAN

I know it's a very old comment, but still:

Instead of using single brackets (dfTest['A']), use double brackets (dfTest[['A']]).

i.e., min_max_scaler.fit_transform(dfTest[['A']]).

I believe this will give the desired result.
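
To see why the brackets matter, here is a minimal sketch comparing the shapes. Single brackets return a 1-D Series, double brackets return a 2-D one-column DataFrame, and the scaler wants 2-D input (samples by features). Reshaping the Series with reshape(-1, 1), as the warning quoted in another answer suggests, should be equivalent.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                       'B': [103.02, 107.26, 110.35, 114.23, 114.68],
                       'C': ['big', 'small', 'big', 'small', 'small']})
min_max_scaler = MinMaxScaler()

print(dfTest['A'].shape)    # (5,)   1-D Series
print(dfTest[['A']].shape)  # (5, 1) 2-D DataFrame

# On recent sklearn versions 1-D input is expected to be rejected.
try:
    min_max_scaler.fit_transform(dfTest['A'])
except ValueError as err:
    print('1-D input rejected:', type(err).__name__)

# Either of these gives the scaler the 2-D input it needs.
scaled = min_max_scaler.fit_transform(dfTest[['A']])
scaled_alt = min_max_scaler.fit_transform(dfTest['A'].to_numpy().reshape(-1, 1))

# Flatten back to 1-D before storing the result in a single column.
dfTest['A'] = scaled.ravel()
print(dfTest)
```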
