Python: how to normalize data in a range of columns in a pandas data frame

Question

Asked by Jeremy

Suppose I have a pandas data frame surveyData:

I want to normalize the data in each column by performing:

surveyData_norm = (surveyData - surveyData.mean()) / (surveyData.max() - surveyData.min())

This would work fine if my data table only contained the columns I wanted to normalize. However, those columns are preceded by some columns containing string data, like this:

Name  State  Gender  Age  Income  Height
Sam   CA     M        13   10000    70
Bob   AZ     M        21   25000    55
Tom   FL     M        30   100000   45

I only want to normalize the Age, Income, and Height columns, but my method above does not work because of the string data in the Name, State, and Gender columns.

Answer 1

Accepted answer by cwharland

You can perform operations on a subset of rows or columns in pandas in a number of ways. One useful way is indexing:

# Assuming same lines from your example
cols_to_norm = ['Age','Height']
survey_data[cols_to_norm] = survey_data[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

This applies the function only to the columns you list and assigns the result back to those columns. Alternatively, you could assign the result to new, normalized columns and keep the originals. Note that the lambda above performs min-max scaling; to get the formula from the question, use x.mean() instead of x.min() in the numerator.
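As a minimal sketch of the "keep the originals" variant, the following applies the mean normalization from the question to Age, Income, and Height and stores the results in new columns. The sample data is copied from the question; the "_norm" suffix is only an illustrative naming choice, not something the answer prescribes.

```python
import pandas as pd

# Sample data copied from the question.
survey_data = pd.DataFrame({
    "Name": ["Sam", "Bob", "Tom"],
    "State": ["CA", "AZ", "FL"],
    "Gender": ["M", "M", "M"],
    "Age": [13, 21, 30],
    "Income": [10000, 25000, 100000],
    "Height": [70, 55, 45],
})

cols_to_norm = ["Age", "Income", "Height"]
subset = survey_data[cols_to_norm]

# Mean normalization, as in the question: (x - mean) / (max - min).
# The arithmetic is column-wise because mean(), max() and min() return
# one value per column and pandas aligns them on the column labels.
normalized = (subset - subset.mean()) / (subset.max() - subset.min())

# Hypothetical naming scheme: keep the originals and add "<col>_norm" columns.
survey_data = survey_data.join(normalized.add_suffix("_norm"))
print(survey_data)
```

Because the string columns are never part of the arithmetic, they pass through unchanged.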

.....

Answer 2

Answer by Alvaro Joao

A simple and more efficient way is to pre-calculate the mean, maximum, and minimum once. Calling dropna() first keeps missing data out of those statistics.

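# Compute the statistics once, ignoring missing values
mean_age = survey_data.Age.dropna().mean()
max_age = survey_data.Age.dropna().max()
min_age = survey_data.Age.dropna().min()

# Apply mean normalization to every value in the Age column
survey_data['Age'] = survey_data['Age'].apply(lambda x: (x - mean_age) / (max_age - min_age))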

This approach will work.
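For comparison, here is a hedged sketch of the same calculation written as plain column arithmetic instead of apply. It relies on the fact that Series.mean(), max(), and min() skip NaN by default (skipna=True), so the explicit dropna() is not strictly required. The data, including the extra row with a missing age, is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the rows from the question plus one with a missing age.
survey_data = pd.DataFrame({
    "Name": ["Sam", "Bob", "Tom", "Ann"],
    "Age": [13.0, 21.0, 30.0, np.nan],
})

age = survey_data["Age"]

# mean(), max() and min() ignore NaN by default, so the statistics are
# computed from the three known ages only.
survey_data["Age"] = (age - age.mean()) / (age.max() - age.min())

# The missing age stays NaN; the other rows are mean-normalized.
print(survey_data)
```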

Answer 3

Answer by Yaron

I think it's better to use sklearn.preprocessing in this case, since it offers many more scaling options. In your case, using StandardScaler would look like this:

from sklearn.preprocessing import StandardScaler
cols_to_norm = ['Age','Height']
surveyData[cols_to_norm] = StandardScaler().fit_transform(surveyData[cols_to_norm])
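Note that StandardScaler standardizes each column to zero mean and unit variance (a z-score), which differs from the formula in the question. As a rough illustration of what that transform computes, the sketch below reproduces it in pandas alone. It uses std(ddof=0) because StandardScaler divides by the population standard deviation, whereas pandas defaults to ddof=1. The sample data is taken from the question.

```python
import pandas as pd

# Sample rows from the question (string columns trimmed for brevity).
survey_data = pd.DataFrame({
    "Name": ["Sam", "Bob", "Tom"],
    "Age": [13, 21, 30],
    "Height": [70, 55, 45],
})

cols_to_norm = ["Age", "Height"]
subset = survey_data[cols_to_norm].astype(float)

# z-score per column: subtract the mean, divide by the population
# standard deviation (ddof=0).
survey_data[cols_to_norm] = (subset - subset.mean()) / subset.std(ddof=0)
print(survey_data)
```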

Answer 4

Answer by Gauravdeep

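import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Let dataset be your DataFrame
minmax = MinMaxScaler()

# Scale every int64 column to the [0, 1] range, one column at a time
for x in dataset.columns[dataset.dtypes == 'int64']:
    # fit_transform expects a 2-D array; ravel() flattens the result back to 1-D
    dataset[x] = minmax.fit_transform(np.array(dataset[x]).reshape(-1, 1)).ravel()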
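The idea of picking columns by dtype can also be sketched without scikit-learn. The example below assumes that every numeric column should be scaled and uses DataFrame.select_dtypes to find those columns, then applies min-max scaling to all of them at once. The data is the sample from the question; the variable names are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
dataset = pd.DataFrame({
    "Name": ["Sam", "Bob", "Tom"],
    "State": ["CA", "AZ", "FL"],
    "Gender": ["M", "M", "M"],
    "Age": [13, 21, 30],
    "Income": [10000, 25000, 100000],
    "Height": [70, 55, 45],
})

# Select every numeric column (ints and floats), not only int64.
numeric_cols = dataset.select_dtypes(include="number").columns

# Min-max scaling: (x - min) / (max - min), computed per column.
col_min = dataset[numeric_cols].min()
col_range = dataset[numeric_cols].max() - col_min
dataset[numeric_cols] = (dataset[numeric_cols] - col_min) / col_range
print(dataset)
```

A column whose values are all equal would have a range of zero and produce NaN here, so such columns would need separate handling.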
