Note: this page reproduces a popular Stack Overflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/40197156/
Python pandas: Best way to normalize data?
Asked by Rnaldinho
I have a large pandas DataFrame with about 80 columns. Each column reports daily traffic statistics for one website (the columns are the websites).
Since I don't want to work with the raw traffic statistics, I would rather normalize all of my columns (except for the first, which is the date), either from 0 to 1 or (even better) from 0 to 100.
Date A B ...
10/10/2010 100.0 402.0 ...
11/10/2010 250.0 800.0 ...
12/10/2010 800.0 2000.0 ...
13/10/2010 400.0 1800.0 ...
That being said, I wonder which normalization to apply: min-max scaling or z-score normalization (standardization)? Some of my columns have strong outliers. It would be great to have an example. I am sorry that I cannot provide the full data.
Answered by User191919
First, turn your Date column into an index.
dates = df.pop('Date')
df.index = dates
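A minimal sketch of this step on the four sample rows from the question. Two things here are assumptions rather than part of the original answer: the dates are parsed as day/month/year (the sample value 13/10/2010 only makes sense that way), and `set_index` is used as a one-call alternative to the pop-and-assign shown above.

```python
import pandas as pd

# Sample rows copied from the question; the real frame has about 80 site columns.
df = pd.DataFrame({
    "Date": ["10/10/2010", "11/10/2010", "12/10/2010", "13/10/2010"],
    "A": [100.0, 250.0, 800.0, 400.0],
    "B": [402.0, 800.0, 2000.0, 1800.0],
})

# Assumption: the strings are day/month/year, so state the format explicitly.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Move the Date column into the index so only numeric columns remain.
df = df.set_index("Date")

print(df)
```

With the date out of the way, arithmetic on `df` applies to the traffic columns only.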
Then either use z-score normalization:
df1 = (df - df.mean())/df.std()
or min-max scaling:
df2 = (df - df.min())/(df.max() - df.min())
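Both formulas can be tried on the sample rows from the question. The sketch below also multiplies the min-max result by 100, since the question asks for a 0 to 100 range; the variable names are illustrative only.

```python
import pandas as pd

# Sample rows from the question, with Date already used as the index.
df = pd.DataFrame(
    {"A": [100.0, 250.0, 800.0, 400.0], "B": [402.0, 800.0, 2000.0, 1800.0]},
    index=pd.Index(
        ["10/10/2010", "11/10/2010", "12/10/2010", "13/10/2010"], name="Date"
    ),
)

# z-score: each column ends up with mean 0 and sample standard deviation 1
# (pandas' std() uses ddof=1 by default).
z_scored = (df - df.mean()) / df.std()

# Min-max: each column is mapped onto [0, 1] ...
min_max = (df - df.min()) / (df.max() - df.min())

# ... and multiplying by 100 gives the 0 to 100 scale the question asks for.
min_max_100 = min_max * 100

print(z_scored.round(2))
print(min_max_100.round(1))
```

Note that a column whose values are all identical has zero spread, so both formulas would return NaN for it.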
I would probably advise z-score normalization, because min-max scaling is highly susceptible to outliers.
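To see the outlier effect, here is a small sketch on made-up numbers (a hypothetical site with steady traffic and one extreme spike, not data from the question).

```python
import pandas as pd

# Hypothetical traffic column: five ordinary days plus one extreme spike.
traffic = pd.Series([100.0, 110.0, 90.0, 105.0, 95.0, 10000.0], name="site")

min_max = (traffic - traffic.min()) / (traffic.max() - traffic.min())
z_scored = (traffic - traffic.mean()) / traffic.std()

# Under min-max the spike becomes 1.0 and every ordinary day is squeezed
# into a sliver just above 0, because the spike alone sets the range.
print(min_max.round(4).tolist())

# Under z-score the result is not confined to a fixed range; the spike shows
# up as a large positive value and the ordinary days as small negative ones.
print(z_scored.round(2).tolist())
```

The mean and standard deviation are themselves pulled by the spike, so z-scores are not immune to outliers either; they just do not pin the whole scale to the single most extreme value.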