Python Pandas - Compute z-score for all columns

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/24761998/

Time: 2020-08-19 05:08:52  Source: igfitidea

Pandas - Compute z-score for all columns

Tags: python, pandas, indexing, statistics

Asked by Slavatron

I have a dataframe containing a single column of IDs and all other columns are numerical values for which I want to compute z-scores. Here's a subsection of it:

ID      Age    BMI    Risk Factor
PT 6    48     19.3    4
PT 8    43     20.9    NaN
PT 2    39     18.1    3
PT 9    41     19.5    NaN

Some of my columns contain NaN values, which I do not want to include in the z-score calculations, so I intend to use a solution offered in answer to this question: how to zscore normalize pandas column with nans?

df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)

I'm interested in applying this solution to all of my columns except the ID column to produce a new dataframe which I can save as an Excel file using

df2.to_excel("Z-Scores.xlsx")

So basically: how can I compute z-scores for each column (ignoring NaN values) and push everything into a new dataframe?

SIDENOTE: there is a concept in pandas called "indexing" which intimidates me because I do not understand it well. If indexing is a crucial part of solving this problem, please dumb down your explanation of indexing.

Accepted answer by EdChum

Build a list from the columns and remove the column you don't want to calculate the Z score for:

In [66]:
cols = list(df.columns)
cols.remove('ID')
df[cols]

Out[66]:
   Age  BMI  Risk  Factor
0    6   48  19.3       4
1    8   43  20.9     NaN
2    2   39  18.1       3
3    9   41  19.5     NaN
In [68]:
# now iterate over the remaining columns and create a new zscore column
for col in cols:
    col_zscore = col + '_zscore'
    df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
df
Out[68]:
   ID  Age  BMI  Risk  Factor  Age_zscore  BMI_zscore  Risk_zscore  \
0  PT    6   48  19.3       4   -0.093250    1.569614    -0.150946   
1  PT    8   43  20.9     NaN    0.652753    0.074744     1.459148   
2  PT    2   39  18.1       3   -1.585258   -1.121153    -1.358517   
3  PT    9   41  19.5     NaN    1.025755   -0.523205     0.050315   

   Factor_zscore  
0              1  
1            NaN  
2             -1  
3            NaN  
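
The loop above adds the z-score columns to the original dataframe, whereas the question asks for a new one. A minimal sketch of that variant follows, assuming the sample table from the question is loaded with `ID` as an ordinary column; the name `zscores` is only illustrative:

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ID": ["PT 6", "PT 8", "PT 2", "PT 9"],
    "Age": [48, 43, 39, 41],
    "BMI": [19.3, 20.9, 18.1, 19.5],
    "Risk Factor": [4, np.nan, 3, np.nan],
})

# Every column except ID is assumed to be numeric.
cols = [c for c in df.columns if c != "ID"]

# mean() and std() skip NaN by default; ddof=0 is the population standard deviation.
zscores = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

# Build a new DataFrame and leave the original untouched.
df2 = pd.concat([df[["ID"]], zscores.add_suffix("_zscore")], axis=1)
print(df2)
```

Rows that were NaN in the input stay NaN in the output, but they do not affect the mean or standard deviation of the other rows.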

Answer by Josh Chartier

The almost one-liner solution:

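# .ix was removed in pandas 1.0; .iloc[:, 1:] selects every column after the first (ID) by position
df2 = (df.iloc[:, 1:] - df.iloc[:, 1:].mean()) / df.iloc[:, 1:].std()
df2['ID'] = df['ID']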
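
One detail worth checking in the one-liner: pandas' `std()` defaults to `ddof=1` (sample standard deviation), while the formula quoted in the question passes `ddof=0`. A small sketch of the difference, using the Age values from the question (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series([48, 43, 39, 41], name="Age")  # Age values from the question

sample_std = s.std()            # pandas default: ddof=1 (sample standard deviation)
population_std = s.std(ddof=0)  # matches the formula quoted in the question

z_sample = (s - s.mean()) / sample_std
z_population = (s - s.mean()) / population_std
print(z_sample.round(6).tolist())
print(z_population.round(6).tolist())
```

The two results differ by a constant factor of sqrt(n / (n - 1)), which matters mostly for small samples.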

Answer by Manuel

Using SciPy's zscore function:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])
df

|    |   A |   B |   C |
|---:|----:|----:|----:|
|  0 | 163 | 163 | 159 |
|  1 | 120 | 153 | 181 |
|  2 | 130 | 199 | 108 |
|  3 | 108 | 188 | 157 |
|  4 | 109 | 171 | 119 |

from scipy.stats import zscore
df.apply(zscore)

|    |         A |         B |         C |
|---:|----------:|----------:|----------:|
|  0 |  1.83447  | -0.708023 |  0.523362 |
|  1 | -0.297482 | -1.30804  |  1.3342   |
|  2 |  0.198321 |  1.45205  | -1.35632  |
|  3 | -0.892446 |  0.792025 |  0.449649 |
|  4 | -0.842866 | -0.228007 | -0.950897 |

If some columns of your data frame are not numeric, you can apply the z-score function only to the numeric columns by using the select_dtypes function:

# Note that `select_dtypes` returns a data frame; `.columns` keeps only its column labels
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols].apply(zscore)

|    |         A |         B |         C |
|---:|----------:|----------:|----------:|
|  0 |  1.83447  | -0.708023 |  0.523362 |
|  1 | -0.297482 | -1.30804  |  1.3342   |
|  2 |  0.198321 |  1.45205  | -1.35632  |
|  3 | -0.892446 |  0.792025 |  0.449649 |
|  4 | -0.842866 | -0.228007 | -0.950897 |
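
Because the question involves NaN values, it may help to see how `zscore` treats them. The sketch below assumes a SciPy version in which `zscore` accepts the `nan_policy` argument, and uses two columns from the question's table:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

df = pd.DataFrame({
    "Age": [48.0, 43.0, 39.0, 41.0],
    "Risk Factor": [4.0, np.nan, 3.0, np.nan],
})

# Default nan_policy='propagate': a single NaN turns the whole column into NaN.
propagated = df.apply(zscore)

# nan_policy='omit': mean and std are computed from the non-NaN values only.
omitted = df.apply(zscore, nan_policy="omit")
print(propagated)
print(omitted)
```

`DataFrame.apply` forwards extra keyword arguments to the function, which is why `nan_policy` can be passed this way.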

Answer by Joe Bathelt

If you want to calculate the zscore for all of the columns, you can just use the following:

df_zscore = (df - df.mean())/df.std()
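
With a text column such as `ID` still among the columns, the subtraction above has nothing numeric to work with for that column. One possible way around this, which also touches on the indexing sidenote in the question, is to move `ID` into the index so that it labels the rows instead of being a data column. A sketch under that assumption, using `ddof=0` to match the question's formula:

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ID": ["PT 6", "PT 8", "PT 2", "PT 9"],
    "Age": [48, 43, 39, 41],
    "BMI": [19.3, 20.9, 18.1, 19.5],
    "Risk Factor": [4, np.nan, 3, np.nan],
})

# Move ID into the index: it becomes the row label rather than a data column.
numeric = df.set_index("ID")

# Column-wise mean and std are broadcast across the rows; NaN values are skipped.
df_zscore = (numeric - numeric.mean()) / numeric.std(ddof=0)
print(df_zscore)

# df_zscore.to_excel("Z-Scores.xlsx") would then write ID as the first column
# (this needs an Excel writer such as openpyxl to be installed).
```

The index is simply the set of row labels; arithmetic like the above only touches the columns, so the IDs ride along unchanged.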

Answer by Deninhos

When we are dealing with time series, calculating z-scores (or anomalies, which are not the same thing, but you can adapt this code easily) is a bit more complicated. For example, suppose you have 10 years of temperature data measured weekly. To calculate z-scores for the whole time series, you have to know the mean and standard deviation for each day of the year. So, let's get started:

Assume you have a pandas DataFrame. First of all, you need a DatetimeIndex. If you don't have one yet but do have a column of dates, just make that column your index; pandas will try to guess the date format. You can check the index type with:

type(df.index)

If you don't have one, let's make it.

df.index = pd.DatetimeIndex(df[datecolumn])
df = df.drop(datecolumn,axis=1)

The next step is to calculate the mean and standard deviation for each day of the year. For this, we use the groupby method.

mean = df.groupby(df.index.dayofyear).mean() #pd.groupby was removed from pandas; NaN values are skipped by default
std = df.groupby(df.index.dayofyear).std(ddof=0) #ddof=0 matches the default of np.nanstd

Finally, we loop through all the dates, performing the calculation (value - mean)/stddev; however, as mentioned, for time-series this is not so straightforward.

df2 = df.copy() #keep a copy for future comparisons
for y in np.unique(df.index.year):
    for d in np.unique(df.index.dayofyear):
        mask = (df.index.year==y) & (df.index.dayofyear==d) #rows for day d of year y
        df2.loc[mask] = (df.loc[mask] - mean.loc[d])/std.loc[d] #.ix was removed in pandas 1.0; .loc looks up by label
df2.index.name = 'date' #this is just to look nicer

df2 #this is your z-score dataset.

The logic inside the for loops is: for a given year we have to match each dayofyear to its mean and stdev. We run this for all the years in your time-series.
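
The same per-day-of-year matching can probably be expressed without explicit loops by using `groupby(...).transform`. The sketch below is not the answer's code; it runs on hypothetical synthetic data (three years of daily temperatures in a made-up `temp` column) and uses `ddof=0`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: three non-leap years of daily temperatures with a seasonal cycle.
rng = np.random.default_rng(0)
dates = pd.date_range("2001-01-01", "2003-12-31", freq="D")
seasonal = 10 * np.sin(2 * np.pi * dates.dayofyear.to_numpy() / 365.25)
df = pd.DataFrame({"temp": seasonal + rng.normal(size=len(dates))}, index=dates)

# Group rows by day of year and standardize each value within its own group.
# transform returns a result with the same shape and index as the input.
df2 = df.groupby(df.index.dayofyear).transform(
    lambda x: (x - x.mean()) / x.std(ddof=0)
)
df2.index.name = "date"
print(df2.head())
```

Each value is compared only with the values recorded on the same day of the year, which is what the nested loops do one (year, day) pair at a time.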

Answer by Surya

Here's another way of getting the z-score using a custom function:

In [6]: import pandas as pd; import numpy as np

In [7]: np.random.seed(0) # Fixes the random seed

In [8]: df = pd.DataFrame(np.random.randn(5,3), columns=["randomA", "randomB","randomC"])

In [9]: df # watch output of dataframe
Out[9]:
    randomA   randomB   randomC
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

## Create custom function to compute Zscore 
In [10]: def z_score(df):
   ....:         # add_suffix returns a renamed copy, so the caller's column names are left untouched
   ....:         return ((df - df.mean())/df.std(ddof=0)).add_suffix("_zscore")
   ....:

## make sure you filter or select columns of interest before passing dataframe to function
In [11]: z_score(df) # compute Zscore
Out[11]:
   randomA_zscore  randomB_zscore  randomC_zscore
0        0.798350       -0.106335        0.731041
1        1.505002        1.939828       -1.577295
2       -0.407899       -0.875374       -0.545799
3       -1.207392       -0.463464        1.292230
4       -0.688061       -0.494655        0.099824

Result reproduced using scipy.stats zscore

In [12]: from scipy.stats import zscore

In [13]: df.apply(zscore) # (Credit: Manuel)
Out[13]:
    randomA   randomB   randomC
0  0.798350 -0.106335  0.731041
1  1.505002  1.939828 -1.577295
2 -0.407899 -0.875374 -0.545799
3 -1.207392 -0.463464  1.292230
4 -0.688061 -0.494655  0.099824

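Answer by ibozkurt79

For the z-score, we can stick to the documented call and pass the columns as an array instead of using the 'apply' function:

import scipy.stats
df_zscore = scipy.stats.zscore(df[cols].to_numpy(), axis=0)  # axis=0 gives one z-score per column; axis=1 would standardize across each row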