Python pandas: What is the best way to normalize data?

Question

Asked by Rnaldinho

I have a large pandas dataframe with about 80 columns. Each of the 80 columns reports daily traffic statistics for a website (the columns are the websites).

As I don't want to work with the raw traffic statistics, I would rather normalize all of my columns (except for the first, which is the date), either from 0 to 1 or (even better) from 0 to 100.

Date        A      B      ...
10/10/2010  100.0  402.0  ...
11/10/2010  250.0  800.0  ...
12/10/2010  800.0  2000.0 ...
13/10/2010  400.0  1800.0 ...

That being said, I wonder which normalization to apply: min-max scaling or z-score normalization (standardization)? Some of my columns have strong outliers. It would be great to have an example. I am sorry that I cannot provide the full data.

Answer 1

Answered by User191919

First, turn your Date column into an index.

dates = df.pop('Date')
df.index = dates
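
As a minimal sketch of the same step, the snippet below rebuilds the four sample rows from the question and uses `set_index` instead of `pop`. Parsing the strings as day/month/year is an assumption based on the `13/10/2010` row; the real data may be stored differently.

```python
import pandas as pd

# Sample rows copied from the question; the real frame has about 80 columns.
df = pd.DataFrame({
    "Date": ["10/10/2010", "11/10/2010", "12/10/2010", "13/10/2010"],
    "A": [100.0, 250.0, 800.0, 400.0],
    "B": [402.0, 800.0, 2000.0, 1800.0],
})

# Assumption: dates are day/month/year strings. Parsing them gives a DatetimeIndex.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Move the Date column into the index so only numeric columns remain.
df = df.set_index("Date")
```

With the date out of the way, arithmetic on `df` touches only the traffic columns.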

Then use either z-score normalization:

df1 = (df - df.mean())/df.std()

or min-max scaling:

df2 = (df-df.min())/(df.max()-df.min())
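
Both formulas can be tried on the sample rows from the question. The sketch below is self-contained and also covers the 0 to 100 range the question asks about, which is just the min-max result multiplied by 100. The variable names are illustrative, not from the answer.

```python
import pandas as pd

# Sample values from the question, with the dates already set as the index.
df = pd.DataFrame(
    {"A": [100.0, 250.0, 800.0, 400.0], "B": [402.0, 800.0, 2000.0, 1800.0]},
    index=pd.to_datetime(["2010-10-10", "2010-10-11", "2010-10-12", "2010-10-13"]),
)

# z-score: each column ends up with mean 0 and standard deviation 1.
# df.std() uses the sample standard deviation (ddof=1) by default.
z_scored = (df - df.mean()) / df.std()

# min-max: each column is mapped onto [0, 1], column by column.
min_max = (df - df.min()) / (df.max() - df.min())

# Rescale to [0, 100] as requested in the question.
min_max_100 = min_max * 100
```

A column whose values are all identical has zero range and zero standard deviation, so both formulas return NaN for it.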

I would probably advise z-score normalization, because min-max scaling is highly susceptible to outliers.
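
To see the effect, here is a small sketch on a made-up column with one extreme value (the numbers are hypothetical, not from the question). It also includes median/IQR scaling, which is not part of this answer but is a common alternative when outliers are strong, since the mean and standard deviation used by the z-score are themselves pulled by extreme values.

```python
import pandas as pd

# Hypothetical column with one extreme value (made-up numbers for illustration).
s = pd.Series([100.0, 120.0, 110.0, 130.0, 10000.0])

# Min-max: the outlier defines the top of the range, squeezing the rest near 0.
min_max = (s - s.min()) / (s.max() - s.min())

# z-score: not bounded, but the outlier still inflates the mean and std.
z_score = (s - s.mean()) / s.std()

# Alternative not mentioned in the answer: centre on the median, scale by the IQR.
iqr = s.quantile(0.75) - s.quantile(0.25)
robust = (s - s.median()) / iqr

# Side-by-side view of the three scalings.
comparison = pd.DataFrame(
    {"raw": s, "min_max": min_max, "z_score": z_score, "robust": robust}
)
```

In `comparison`, the four ordinary values are nearly indistinguishable after min-max scaling and stay close together after z-scoring too, while the median/IQR version keeps them spread out. Which behaviour is preferable depends on whether the outliers are meaningful traffic spikes or noise.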
