Python: Can anyone explain StandardScaler to me?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40758562/
Can anyone explain StandardScaler to me?
Asked by nitinvijay23
Answered by user6903745
The idea behind StandardScaler is that it will transform your data such that its distribution will have a mean value of 0 and a standard deviation of 1.
In the case of multivariate data, this is done feature-wise (in other words, independently for each column of the data).
Given the distribution of the data, each value in the dataset will have the mean value subtracted and will then be divided by the standard deviation of the whole dataset (or of the feature in the multivariate case).
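As a minimal sketch of the description above (the array values are made up for illustration), the same transformation can be written by hand with NumPy and compared against StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Per-column mean and population (ddof=0) standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# Subtract each column's mean, then divide by that column's std
manual = (X - mu) / sigma

scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```

Both results should agree, since the statistics are computed per column (axis=0) in each case.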
Answered by seralouk
Intro: I assume that you have a matrix X where each row is a sample/observation and each column is a variable/feature (this is the expected input for any sklearn ML function, by the way: X.shape should be [number_of_samples, number_of_features]).
Core of method: The main idea is to normalize/standardize (i.e. μ = 0 and σ = 1) your features/variables/columns of X, individually, before applying any machine learning model.
StandardScaler() will normalize the features, i.e. each column of X, INDIVIDUALLY (!!!), so that each column/feature/variable will have μ = 0 and σ = 1.
P.S.: I find the most upvoted answer on this page wrong. I am quoting "each value in the dataset will have the sample mean value subtracted". This is neither true nor correct: the mean and standard deviation are computed for each column separately, not over the whole dataset.
See also: https://towardsdatascience.com/how-scikit-learns-standardscaler-works-996926c2c832
Example:
from sklearn.preprocessing import StandardScaler
import numpy as np
# 4 samples/observations and 2 variables/features
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(data)
[[0 0]
 [1 0]
 [0 1]
 [1 1]]
print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]
Verify that the mean of each feature (column) is 0:
scaled_data.mean(axis = 0)
array([0., 0.])
Verify that the std of each feature (column) is 1:
scaled_data.std(axis = 0)
array([1., 1.])
The maths:
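As a sketch of the maths, these are the standard z-score formulas, which agree with the example above (np.std and StandardScaler both divide by N, not N - 1). The symbols x_{ij}, N, \mu_j and \sigma_j are notation chosen here for illustration, with i indexing samples and j indexing features:

```latex
\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{ij}-\mu_j\right)^2}, \qquad
z_{ij} = \frac{x_{ij}-\mu_j}{\sigma_j}
```

For the first column of the example, \mu = 0.5 and \sigma = 0.5, so the value 0 maps to -1 and the value 1 maps to 1.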
UPDATE 08/2019: Concerning the input parameters with_mean and with_std set to False/True, I have provided an answer here: https://stackoverflow.com/a/57381708/5025009
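A minimal sketch of what these two parameters do, reusing the data from the example above. This is only an illustration and is not taken from the linked answer:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# with_std=False: only center each column (subtract the column mean)
centered = StandardScaler(with_std=False).fit_transform(data)

# with_mean=False: only scale each column (divide by the column std)
rescaled = StandardScaler(with_mean=False).fit_transform(data)

print(centered)   # columns have mean 0, spread unchanged
print(rescaled)   # columns have std 1, mean not shifted to 0
```

Disabling centering is mainly useful for sparse input, where subtracting the mean would destroy the sparsity.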
Answered by Tuan Vu
How to calculate it:
You can read more here:
Answered by krish___na
StandardScaler performs the task of standardization. Usually a dataset contains variables that differ in scale. For example, an Employee dataset may contain an AGE column with values in the range 20-70 and a SALARY column with values in the range 10000-80000.
As these two columns differ in scale, they are standardized to a common scale when building a machine learning model.
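To make this concrete, here is a small sketch with made-up AGE and SALARY values (the numbers and the variable name employees are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical employee data: columns are AGE and SALARY
employees = np.array([[25.0, 20000.0],
                      [40.0, 50000.0],
                      [55.0, 80000.0]])

scaler = StandardScaler()
standardized = scaler.fit_transform(employees)

print(scaler.mean_)    # per-column means learned from the data
print(scaler.scale_)   # per-column standard deviations
print(standardized)    # both columns now on a comparable scale
```

After the transformation, a difference of one unit means "one standard deviation" in either column, so neither feature dominates purely because of its units.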
Answered by Riccardo Petraglia
This is useful when you want to compare data that are expressed in different units. In that case, you want to remove the units. To do that consistently across all the data, you transform the data so that the variance is 1 and the mean of the series is 0.
Answered by Thom Ives
The answers above are great, but I needed a simple example to alleviate some concerns that I have had in the past. I wanted to make sure it was indeed treating each column separately. I am now reassured and can't find the example that had caused me concern. All columns ARE scaled separately, as described in the answers above.
CODE
import pandas as pd
import scipy.stats as ss
from sklearn.preprocessing import StandardScaler
data= [[1, 1, 1, 1, 1],[2, 5, 10, 50, 100],[3, 10, 20, 150, 200],[4, 15, 40, 200, 300]]
df = pd.DataFrame(data, columns=['N0', 'N1', 'N2', 'N3', 'N4']).astype('float64')
sc_X = StandardScaler()
df = sc_X.fit_transform(df)
num_cols = len(df[0,:])
for i in range(num_cols):
    col = df[:,i]
    col_stats = ss.describe(col)
    print(col_stats)
OUTPUT
DescribeResult(nobs=4, minmax=(-1.3416407864998738, 1.3416407864998738), mean=0.0, variance=1.3333333333333333, skewness=0.0, kurtosis=-1.3599999999999999)
DescribeResult(nobs=4, minmax=(-1.2828087129930659, 1.3778315806221817), mean=-5.551115123125783e-17, variance=1.3333333333333337, skewness=0.11003776770595125, kurtosis=-1.394993095506219)
DescribeResult(nobs=4, minmax=(-1.155344148338584, 1.53471088361394), mean=0.0, variance=1.3333333333333333, skewness=0.48089217736510326, kurtosis=-1.1471008824318165)
DescribeResult(nobs=4, minmax=(-1.2604572012883055, 1.2668071116222517), mean=-5.551115123125783e-17, variance=1.3333333333333333, skewness=0.0056842140599118185, kurtosis=-1.6438177182479734)
DescribeResult(nobs=4, minmax=(-1.338945389819976, 1.3434309690153527), mean=5.551115123125783e-17, variance=1.3333333333333333, skewness=0.005374558840039456, kurtosis=-1.3619131970819205)
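The variance of 1.333 in the output does not mean the scaling failed. scipy.stats.describe reports the sample variance (ddof=1 by default), whereas StandardScaler uses the population standard deviation (ddof=0). A minimal sketch with the same data shows the difference:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1, 1, 1, 1, 1],
                 [2, 5, 10, 50, 100],
                 [3, 10, 20, 150, 200],
                 [4, 15, 40, 200, 300]], dtype=float)

scaled = StandardScaler().fit_transform(data)

# Population variance (divide by n): what StandardScaler normalizes to 1
print(scaled.var(axis=0, ddof=0))

# Sample variance (divide by n - 1): what ss.describe reports; n/(n-1) = 4/3
print(scaled.var(axis=0, ddof=1))
```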
Answered by LCJ
The following is a simple working example that shows how the standardization calculation works. The theory is already well explained in the other answers.
>>> import numpy as np
>>> data = [[6, 2], [4, 2], [6, 4], [8, 2]]
>>> a = np.array(data)
>>> np.std(a, axis=0)
array([1.41421356, 0.8660254 ])
>>> np.mean(a, axis=0)
array([6. , 2.5])
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> scaler.fit(data)
>>> print(scaler.mean_)
>>> # Xchanged = (X - μ)/σ, where σ is the standard deviation and μ is the mean
>>> z = scaler.transform(data)
>>> z
Calculation
As you can see in the output, the mean is [6., 2.5] and the standard deviation is [1.41421356, 0.8660254].
The data in position (0,1) is 2. Standardization = (2 - 2.5)/0.8660254 = -0.57735027
The data in position (1,0) is 4. Standardization = (4 - 6)/1.41421356 = -1.414
Result After Standardization
Check Mean and Std Deviation After Standardization
Note: -2.77555756e-17 is very close to 0.
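The result and the check described above can be reproduced with a short sketch like the following, which uses the same data as the session above. The exact tiny values printed for the mean may differ between machines because of floating-point rounding:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = [[6, 2], [4, 2], [6, 4], [8, 2]]
scaler = StandardScaler().fit(data)
z = scaler.transform(data)

print(z)               # standardized values
print(z.mean(axis=0))  # approximately 0 for each column
print(z.std(axis=0))   # approximately 1 for each column
```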
Answered by Paul
After applying StandardScaler(), each column in X will have a mean of 0 and a standard deviation of 1.
Formulas are listed by others on this page.
Rationale: some algorithms require data to look like this (see sklearn docs).
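As one illustration of that rationale, distance-based models such as k-nearest neighbors are sensitive to feature scale. The sketch below is an assumption about typical usage rather than something taken from this answer: the data and the choice of KNeighborsClassifier are hypothetical. It chains the scaler and the model with make_pipeline so that the scaling statistics come from the training data only:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 0 is small-scale, feature 1 is large-scale
X_train = np.array([[1.0, 1000.0], [2.0, 1100.0], [8.0, 900.0], [9.0, 1050.0]])
y_train = np.array([0, 0, 1, 1])

# The scaler is fitted on the training data, then reused for new data
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
model.fit(X_train, y_train)

X_new = np.array([[1.5, 950.0]])
print(model.predict(X_new))
```

Putting the scaler inside the pipeline means new samples are transformed with the training mean and standard deviation, not with statistics recomputed from the new data.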