Python: Can anyone explain StandardScaler to me?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/40758562/

Date: 2020-08-19 23:57:56  Source: igfitidea

Can anyone explain StandardScaler to me?

python machine-learning scikit-learn scaling standardized

Asked by nitinvijay23

I am unable to understand the StandardScaler page in the sklearn documentation.

Can anyone explain this to me in simple terms?


Answer by user6903745

The idea behind StandardScaler is that it transforms your data so that its distribution has a mean of 0 and a standard deviation of 1.
For multivariate data, this is done feature-wise (in other words, independently for each column of the data).
Given the distribution of the data, each value in the dataset has the mean subtracted and is then divided by the standard deviation of the whole dataset (or of the feature, in the multivariate case).
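
As a rough check of the description above, the following sketch standardizes each column by hand with NumPy and compares the result with StandardScaler. The array X is made-up illustration data, not taken from the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 samples (rows), 2 features (columns)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Per-column mean and population standard deviation (ddof=0)
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# Subtract the column mean, then divide by the column standard deviation
manual = (X - mu) / sigma

# StandardScaler should produce the same values
scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```

Each column is handled with its own mean and standard deviation, which is what "feature-wise" means here.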

Answer by seralouk

Intro: I assume that you have a matrix X where each row/line is a sample/observation and each column is a variable/feature (this is the expected input for any sklearn ML function, by the way -- X.shape should be [number_of_samples, number_of_features]).
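
A small sketch of the expected input shape, assuming you start from a 1-D array holding a single feature. The values are hypothetical; reshape(-1, 1) turns them into a column with one row per sample.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 1-D measurements of a single feature
values = np.array([2.0, 4.0, 6.0, 8.0])

# StandardScaler expects 2-D input, so reshape to 4 samples x 1 feature
X = values.reshape(-1, 1)
print(X.shape)

scaled = StandardScaler().fit_transform(X)
print(scaled.shape)
```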



Core of method: The main idea is to normalize/standardize (i.e. μ = 0 and σ = 1) the features/variables/columns of X, individually, before applying any machine learning model.

StandardScaler() will normalize the features, i.e. each column of X, INDIVIDUALLY (!!!), so that each column/feature/variable will have μ = 0 and σ = 1.
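
One common way to apply this before a model, sketched under the assumption that you have separate training and test sets (both arrays below are invented): the scaler learns the column statistics from the training data and reuses them on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test sets with 2 features each
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data
X_test_scaled = scaler.transform(X_test)        # reuse those statistics, do not refit

print(X_train_scaled.mean(axis=0))
print(X_test_scaled)
```

Note that the scaled test data will generally not have mean 0 and standard deviation 1, because it is scaled with the training statistics.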



P.S: I find the most upvoted answer on this page wrong. I am quoting "each value in the dataset will have the sample mean value subtracted" -- this is neither true nor correct.



See also: https://towardsdatascience.com/how-scikit-learns-standardscaler-works-996926c2c832




Example:


from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(data)
# [[0 0]
#  [1 0]
#  [0 1]
#  [1 1]]

print(scaled_data)
# [[-1. -1.]
#  [ 1. -1.]
#  [-1.  1.]
#  [ 1.  1.]]

Verify that the mean of each feature (column) is 0:

scaled_data.mean(axis=0)
# array([0., 0.])

Verify that the std of each feature (column) is 1:

scaled_data.std(axis=0)
# array([1., 1.])


The maths:

z = (x - μ) / σ

where, for each feature (column) with N samples, μ = (1/N) Σ x_i is the mean and σ = sqrt((1/N) Σ (x_i - μ)^2) is the standard deviation.
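
The standardization rule z = (x - mean) / std can also be checked against the statistics a fitted scaler stores. This sketch reuses the 4x2 array from the example above and reads the mean_ and scale_ attributes of the fitted StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler().fit(data)

print(scaler.mean_)   # per-column mean
print(scaler.scale_)  # per-column standard deviation (computed with ddof=0)

# Apply z = (x - mean) / std by hand and compare with transform()
z = (data - scaler.mean_) / scaler.scale_
print(np.allclose(z, scaler.transform(data)))
```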



UPDATE 08/2019: Concerning the input parameters with_mean and with_std being set to False/True, I have provided an answer here: https://stackoverflow.com/a/57381708/5025009
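
For orientation, a minimal sketch of what the two parameters do, using invented data. It is not a summary of the linked answer: with_mean=False skips the centering step and with_std=False skips the division by the standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 samples, 2 features
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Only divide by the standard deviation; the data is not centered
no_centering = StandardScaler(with_mean=False).fit_transform(X)

# Only subtract the mean; the spread is left untouched
no_scaling = StandardScaler(with_std=False).fit_transform(X)

print(no_centering.std(axis=0))
print(no_scaling.mean(axis=0))
```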

Answer by krish___na

StandardScaler performs the task of standardization. A dataset usually contains variables on different scales. For example, an Employee dataset might contain an AGE column with values in the range 20-70 and a SALARY column with values in the range 10000-80000.
Because these two columns are on different scales, they are standardized to a common scale when building a machine learning model.
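
A sketch of the AGE/SALARY case, with employee values made up for illustration. After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical employees: columns are AGE and SALARY
employees = np.array([
    [25, 20000],
    [35, 40000],
    [50, 60000],
    [60, 80000],
], dtype=float)

scaled = StandardScaler().fit_transform(employees)

# Both columns now live on a comparable scale
print(scaled.round(2))
```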

Answer by Riccardo Petraglia

This is useful when you want to compare data measured in different units. In that case, you want to remove the units. To do that consistently for all the data, you transform it so that the variance is 1 and the mean of the series is 0.

Answer by Thom Ives

The answers above are great, but I needed a simple example to alleviate some concerns that I have had in the past. I wanted to make sure it was indeed treating each column separately. I am now reassured and can't find the example that had caused me concern. All columns ARE scaled separately, as described in the answers above.

CODE


import pandas as pd
import scipy.stats as ss
from sklearn.preprocessing import StandardScaler

# 4 samples (rows) and 5 features (columns) on very different scales
data = [[1, 1, 1, 1, 1], [2, 5, 10, 50, 100], [3, 10, 20, 150, 200], [4, 15, 40, 200, 300]]

df = pd.DataFrame(data, columns=['N0', 'N1', 'N2', 'N3', 'N4']).astype('float64')

sc_X = StandardScaler()
scaled = sc_X.fit_transform(df)  # returns a NumPy array, not a DataFrame

# Describe each scaled column separately
num_cols = scaled.shape[1]
for i in range(num_cols):
    col = scaled[:, i]
    col_stats = ss.describe(col)
    print(col_stats)

OUTPUT


DescribeResult(nobs=4, minmax=(-1.3416407864998738, 1.3416407864998738), mean=0.0, variance=1.3333333333333333, skewness=0.0, kurtosis=-1.3599999999999999)
DescribeResult(nobs=4, minmax=(-1.2828087129930659, 1.3778315806221817), mean=-5.551115123125783e-17, variance=1.3333333333333337, skewness=0.11003776770595125, kurtosis=-1.394993095506219)
DescribeResult(nobs=4, minmax=(-1.155344148338584, 1.53471088361394), mean=0.0, variance=1.3333333333333333, skewness=0.48089217736510326, kurtosis=-1.1471008824318165)
DescribeResult(nobs=4, minmax=(-1.2604572012883055, 1.2668071116222517), mean=-5.551115123125783e-17, variance=1.3333333333333333, skewness=0.0056842140599118185, kurtosis=-1.6438177182479734)
DescribeResult(nobs=4, minmax=(-1.338945389819976, 1.3434309690153527), mean=5.551115123125783e-17, variance=1.3333333333333333, skewness=0.005374558840039456, kurtosis=-1.3619131970819205)
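
The variance of about 1.333 in the output above may look surprising. A likely explanation is that scipy.stats.describe uses the sample variance (ddof=1) by default, while StandardScaler normalizes the population variance (ddof=0) to 1. The sketch below illustrates this with two of the columns from the data above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns N0 and N1 from the example data
data = np.array([[1, 1], [2, 5], [3, 10], [4, 15]], dtype=float)
scaled = StandardScaler().fit_transform(data)

# Population variance (divide by n): what StandardScaler normalizes to 1
print(scaled.var(axis=0, ddof=0))

# Sample variance (divide by n - 1): n / (n - 1) = 4 / 3 for 4 rows
print(scaled.var(axis=0, ddof=1))
```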

Answer by LCJ

The following is a simple working example that shows how the standardization calculation works. The theory is already well explained in other answers.

>>> import numpy as np
>>> data = [[6, 2], [4, 2], [6, 4], [8, 2]]
>>> a = np.array(data)

>>> np.std(a, axis=0)
array([1.41421356, 0.8660254 ])

>>> np.mean(a, axis=0)
array([6. , 2.5])

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> scaler.fit(data)
>>> print(scaler.mean_)

# Xchanged = (X - μ)/σ  WHERE σ is Standard Deviation and μ is mean
>>> z = scaler.transform(data)
>>> z

Calculation


As you can see in the output, the mean is [6. , 2.5] and the standard deviation is [1.41421356, 0.8660254 ].

Data in (0,1) position is 2: Standardization = (2 - 2.5)/0.8660254 = -0.57735027

Data in (1,0) position is 4: Standardization = (4 - 6)/1.41421356 = -1.414

Result After Standardization

[Image not preserved: the standardized array z]

Check Mean and Std Deviation After Standardization

[Image not preserved: mean and standard deviation of z for each column]
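
Since the screenshots are missing here, the result and the check can be reproduced with a short sketch that reuses the same four data points. The variable names are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[6, 2], [4, 2], [6, 4], [8, 2]])
z = StandardScaler().fit_transform(data)

print(z)               # standardized values
print(z.mean(axis=0))  # approximately 0 for each column
print(z.std(axis=0))   # approximately 1 for each column
```

The means may come out as tiny nonzero numbers because of floating-point rounding.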

Note: -2.77555756e-17 is very close to 0.


References


  1. Compare the effect of different scalers on data with outliers

  2. What's the difference between Normalization and Standardization?

  3. Mean of data scaled with sklearn StandardScaler is not zero


Answer by Paul

After applying StandardScaler(), each column in X will have a mean of 0 and a standard deviation of 1.

Formulas are listed by others on this page.


Rationale: some algorithms require data to look like this (see sklearn docs).
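
As one possible illustration of that rationale, the sketch below uses a 1-nearest-neighbour classifier, which is distance-based and therefore sensitive to feature scale. The toy data, the query point, and the choice of classifier are all assumptions made for this example. Feature 0 separates the classes, while feature 1 is large-scale noise.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 0 separates the classes, feature 1 is large-scale noise
X = np.array([[0.1, 5000.0], [0.2, 1000.0], [0.9, 4800.0], [1.0, 1200.0]])
y = np.array([0, 0, 1, 1])
query = np.array([[0.15, 3000.0]])

# Without scaling, distances are dominated by the large-valued feature 1
raw_model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# With scaling, both features contribute comparably to the distance
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1)).fit(X, y)

print(raw_model.predict(query))
print(scaled_model.predict(query))
```

With this toy data the two models should disagree on the query point, which is the kind of effect scaling is meant to remove.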
