Python: How to normalize the data in a range of columns in a pandas DataFrame

Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original: http://stackoverflow.com/questions/28576540/

How can I normalize the data in a range of columns in my pandas dataframe

python pandas

Asked by Jeremy

Suppose I have a pandas data frame surveyData:

I want to normalize the data in each column by performing:

surveyData_norm = (surveyData - surveyData.mean()) / (surveyData.max() - surveyData.min())

This would work fine if my data table only contained the columns I wanted to normalize. However, the table also has some columns of string data preceding them, like this:

Name  State  Gender  Age  Income  Height
Sam   CA     M       13   10000   70
Bob   AZ     M       21   25000   55
Tom   FL     M       30   100000  45

I only want to normalize the Age, Income, and Height columns, but my method above does not work because of the string data in the Name, State, and Gender columns.

Accepted answer by cwharland

You can perform operations on a subset of rows or columns in pandas in a number of ways. One useful way is indexing:

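# Assuming the same data as in your example, loaded as survey_data
# Min-max scaling: each selected column is rescaled to the range [0, 1]
cols_to_norm = ['Age', 'Height']
survey_data[cols_to_norm] = survey_data[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))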

This will apply it to only the columns you desire and assign the result back to those columns. Alternatively you could set them to new, normalized columns and keep the originals if you want.
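
As a rough illustration of the second option (keeping the originals), here is a minimal sketch that applies the mean normalization from the question to the Age, Income, and Height columns and stores the results in new columns. The sample data is copied from the question; the `_norm` suffix is a hypothetical naming choice, not something from the original answer.

```python
import pandas as pd

# Sample data copied from the question
survey_data = pd.DataFrame({
    "Name": ["Sam", "Bob", "Tom"],
    "State": ["CA", "AZ", "FL"],
    "Gender": ["M", "M", "M"],
    "Age": [13, 21, 30],
    "Income": [10000, 25000, 100000],
    "Height": [70, 55, 45],
})

cols_to_norm = ["Age", "Income", "Height"]
subset = survey_data[cols_to_norm]

# Mean normalization from the question, computed column by column on the subset only
normalized = (subset - subset.mean()) / (subset.max() - subset.min())

# Hypothetical "_norm" suffix: the original columns stay untouched
survey_data = survey_data.join(normalized.add_suffix("_norm"))

print(survey_data)
```

Because the string columns are never part of `subset`, the arithmetic only ever sees numeric data.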

Answer by Alvaro Joao

A simple and more efficient way:
Pre-calculate the mean, maximum, and minimum.
dropna() excludes missing values from the calculation.

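# Pre-calculate the statistics once, ignoring missing values
mean_age = survey_data.Age.dropna().mean()
max_age = survey_data.Age.dropna().max()
min_age = survey_data.Age.dropna().min()

# Apply the mean normalization to every value in the Age column
survey_data['Age'] = survey_data['Age'].apply(lambda x: (x - mean_age) / (max_age - min_age))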

This approach works.
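
The same calculation can probably also be written without `apply`, as arithmetic on the whole Series. The sketch below is one possible variant, not the answer author's code; it uses the Age values from the question plus one hypothetical missing entry, and relies on the fact that `mean()`, `max()`, and `min()` skip NaN by default.

```python
import numpy as np
import pandas as pd

# Age values from the question plus one hypothetical missing entry
age = pd.Series([13, 21, 30, np.nan], name="Age")

# mean(), max() and min() skip NaN by default (skipna=True)
mean_age, max_age, min_age = age.mean(), age.max(), age.min()

# Vectorized arithmetic on the whole Series; the missing entry stays NaN
age_norm = (age - mean_age) / (max_age - min_age)

print(age_norm)
```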

Answer by Yaron

I think it's better to use 'sklearn.preprocessing' in this case, since it offers many more scaling options. In your case, using StandardScaler, it would be:

from sklearn.preprocessing import StandardScaler
cols_to_norm = ['Age','Height']
surveyData[cols_to_norm] = StandardScaler().fit_transform(surveyData[cols_to_norm])
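
Note that StandardScaler standardizes to zero mean and unit variance, which differs from the formula in the question (division by the range). To show the arithmetic involved, here is a pandas-only sketch of the equivalent z-score. It is an illustration, not sklearn's implementation, and it assumes the population standard deviation (`ddof=0`), which is what StandardScaler uses, whereas pandas defaults to `ddof=1`.

```python
import pandas as pd

# Two numeric columns from the question's sample data
survey_data = pd.DataFrame({"Age": [13, 21, 30], "Height": [70, 55, 45]})
cols_to_norm = ["Age", "Height"]
subset = survey_data[cols_to_norm]

# z-score with the population standard deviation (ddof=0);
# pandas' std() would otherwise use ddof=1
z_scores = (subset - subset.mean()) / subset.std(ddof=0)

print(z_scores)
```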

Answer by Gauravdeep

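import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# `dataset` is assumed to be your DataFrame
minmax = MinMaxScaler()

# Rescale every int64 column to the [0, 1] range, one column at a time
for x in dataset.columns[dataset.dtypes == 'int64']:
    # fit_transform expects a 2-D array; ravel() flattens the result back to one column
    dataset[x] = minmax.fit_transform(np.array(dataset[x]).reshape(-1, 1)).ravel()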
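
If the numeric columns are not all `int64`, one option is to select them with `select_dtypes` instead of comparing dtypes directly. The following is a minimal pandas-only sketch of that idea under the assumption that every numeric column should be min-max scaled; the small `dataset` frame is built from part of the question's sample data.

```python
import pandas as pd

# Part of the sample data from the question
dataset = pd.DataFrame({
    "Name": ["Sam", "Bob", "Tom"],
    "Age": [13, 21, 30],
    "Income": [10000, 25000, 100000],
})

# Pick every numeric column (any integer or float dtype), not only int64
num_cols = dataset.select_dtypes(include="number").columns
numeric = dataset[num_cols]

# Min-max scaling to [0, 1]; string columns such as Name are left untouched
dataset[num_cols] = (numeric - numeric.min()) / (numeric.max() - numeric.min())

print(dataset)
```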