Python NumPy: zero-mean data and standardization

Disclaimer: this page is a side-by-side Chinese/English translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must comply with the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/45834276/

Date: 2020-08-19 17:18:17  Source: igfitidea

NumPy: zero mean data and standardization

Tags: python, numpy, image-preprocessing

Asked by econ

I saw in a tutorial (there was no further explanation) that we can process data to zero mean with x -= np.mean(x, axis=0) and normalize data with x /= np.std(x, axis=0). Can anyone elaborate on these two pieces of code? The only thing I got from the documentation is that np.mean calculates the arithmetic mean along a specific axis and np.std does the same for the standard deviation.

Answered by Jonas Adler

This is also called the z-score.

SciPy has a utility for it:

    >>> from scipy import stats
    >>> stats.zscore([ 0.7972,  0.0767,  0.4383,  0.7866,  0.8091,
    ...                0.1954,  0.6307,  0.6599,  0.1065,  0.0508])
    array([ 1.1273, -1.247 , -0.0552,  1.0923,  1.1664, -0.8559,  0.5786,
            0.6748, -1.1488, -1.3324])
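
To connect this back to the two lines in the question, here is a minimal sketch (assuming NumPy and SciPy are installed) that checks the manual computation against stats.zscore on the same numbers. The variable names manual and from_scipy are only illustrative.

```python
import numpy as np
from scipy import stats

x = np.array([0.7972, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])

# Manual z-score: subtract the mean, then divide by the standard deviation.
# np.std uses ddof=0 by default, which matches the default of stats.zscore.
manual = (x - np.mean(x)) / np.std(x)
from_scipy = stats.zscore(x)

print(np.allclose(manual, from_scipy))  # True
```

If you want the sample standard deviation instead, both np.std and stats.zscore accept a ddof argument; the two results only agree when the same ddof is used in both.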

Answered by Clock Slave

Follow the comments in the code below:

import numpy as np

# create x
x = np.asarray([1, 2, 3, 4], dtype=np.float64)

np.mean(x)        # calculates the mean of the array x
x - np.mean(x)    # subtracts the mean of x from each value in x (the result is not stored)
x -= np.mean(x)   # the -= can be read as x = x - np.mean(x); x is now centered

np.std(x)         # calculates the standard deviation of the (centered) array
x /= np.std(x)    # the /= can be read as x = x / np.std(x)
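
One caveat to reading -= as "x = x - ...": for NumPy arrays the augmented form modifies the existing array in place, while the plain form creates a new array. The sketch below illustrates the difference; the names alias and ints are made up for this example, and the integer case is included because it is a common stumbling block.

```python
import numpy as np

x = np.asarray([1, 2, 3, 4], dtype=np.float64)
alias = x            # a second name for the same array object

x -= np.mean(x)      # in place: the existing buffer is modified
print(alias)         # alias sees the change: [-1.5 -0.5  0.5  1.5]

x = x / np.std(x)    # not in place: x now names a new array
print(alias is x)    # False, alias still holds the centered values

# With an integer array the in-place form fails, because the float
# result cannot be stored back into an integer buffer.
ints = np.asarray([1, 2, 3, 4])
try:
    ints -= np.mean(ints)
    in_place_failed = False
except TypeError:
    in_place_failed = True
print(in_place_failed)   # True
```

This is presumably why the answer creates x with dtype=np.float64 before applying the in-place operators.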

Answered by Jürg Merlin Spaak

From the syntax you give, I conclude that your array is multidimensional. I will first discuss the case where your x is just a linear (one-dimensional) array:

np.mean(x) will compute the mean; by broadcasting, x - np.mean(x) subtracts the mean of x from all the entries. x -= np.mean(x, axis=0) is equivalent to x = x - np.mean(x, axis=0), except that the update happens in place. Similarly for x / np.std(x).

In the case of multidimensional arrays the same thing happens, but instead of computing the mean over the entire array, you just compute the mean over the first "axis". Axis is the numpy word for dimension. So if your x is two-dimensional, then np.mean(x, axis=0) = [np.mean(x[:,0]), np.mean(x[:,1]), ...]. Broadcasting will again ensure that this is done to all elements.
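
As a small illustration of the two-dimensional case, here is a sketch on a made-up 3x2 array (the values and the names col_means, by_hand and centered are only for this example):

```python
import numpy as np

x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])      # shape (3, 2): 3 rows, 2 columns

col_means = np.mean(x, axis=0)   # one mean per column, shape (2,)
print(col_means)                 # [ 2. 20.]

# The same thing written column by column
by_hand = np.array([np.mean(x[:, 0]), np.mean(x[:, 1])])
print(np.array_equal(col_means, by_hand))   # True

# Broadcasting stretches the shape-(2,) means across all 3 rows
centered = x - col_means
print(centered.mean(axis=0))     # [0. 0.]
```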

Note that this only works with the first dimension; otherwise the shapes will not match for broadcasting. If you want to normalize with respect to another axis n, you need to do something like:

x -= np.expand_dims(np.mean(x,axis = n),n)
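
A runnable sketch of that idea on a made-up 2x3 array follows. It also shows the keepdims=True argument of np.mean, which is an alternative to np.expand_dims that leaves the reduced axis in place with length 1; which of the two to use is a matter of taste.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 60.0]])   # shape (2, 3)
n = 1                                # normalize along the second axis

# The plain mean has shape (2,), which does not broadcast against (2, 3).
row_means = np.mean(x, axis=n)
try:
    x - row_means
    shapes_matched = True
except ValueError:
    shapes_matched = False

# Option 1: re-insert the reduced axis, as in the line above
a = x - np.expand_dims(row_means, n)        # (2, 3) - (2, 1)

# Option 2: ask NumPy to keep the reduced axis with length 1
b = x - np.mean(x, axis=n, keepdims=True)   # (2, 3) - (2, 1)

print(shapes_matched)         # False
print(np.array_equal(a, b))   # True
print(b.mean(axis=n))         # [0. 0.]
```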

Answered by Ando Jurai

The key here is the augmented assignment operators. They perform an operation on the original variable: a += c is essentially a = a + c.

So indeed a (in your case x) has to be defined beforehand.

Each function takes an array or iterable (x) as input and returns a single value (or an array, if the input is multidimensional and an axis is given), which is then used in your assignment operations.
With axis=0, the mean or std is computed down the rows: for each column, you take the values from every row and compute their mean or std, giving one result per column. axis=1 would instead take the values from every column of a given row, giving one result per row.

What you do with both operations is this: first you subtract the mean, so that each column is now centered around 0. Then, when you divide by the std, you rescale the spread of the data around this zero so that each column has a standard deviation of 1. Most values then fall within a few units of 0, although they are not confined to the [-1, +1] interval.
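
To check what the two steps do to each column, here is a small sketch on synthetic data. The random normal data, the seed and the array size are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(1000, 2))   # 2 columns, mean ~5, std ~3

x -= np.mean(x, axis=0)   # center each column
x /= np.std(x, axis=0)    # scale each column to unit standard deviation

print(np.allclose(x.mean(axis=0), 0.0))   # True
print(np.allclose(x.std(axis=0), 1.0))    # True

# Fraction of values within one standard deviation of 0
# (roughly 0.68 for normally distributed data)
within_one = np.mean(np.abs(x) <= 1.0)
print(within_one)

# The largest absolute value is well outside [-1, +1]
print(np.abs(x).max() > 1.0)              # True
```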

So now, each of your columns is centered around zero and standardized.

There are other scaling techniques, such as subtracting the minimum (or maximum) value and dividing by the range of values.
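
A minimal sketch of that min-max variant, applied per column, might look like this. The sample array is made up, and the code assumes no column is constant (otherwise the range is 0 and the division is undefined).

```python
import numpy as np

x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [5.0, 300.0]])

col_min = x.min(axis=0)
col_range = x.max(axis=0) - col_min   # assumes range > 0 for every column
scaled = (x - col_min) / col_range    # each column now spans 0 to 1

print(scaled.min(axis=0))   # [0. 0.]
print(scaled.max(axis=0))   # [1. 1.]
```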
