Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/28576540/
How can I normalize the data in a range of columns in my pandas dataframe
Asked by Jeremy
Suppose I have a pandas data frame surveyData:
I want to normalize the data in each column by performing:
surveyData_norm = (surveyData - surveyData.mean()) / (surveyData.max() - surveyData.min())
This would work fine if my data table contained only the columns I want to normalize. However, those columns are preceded by some columns containing string data, like this:
Name  State  Gender  Age  Income  Height
Sam   CA     M       13   10000   70
Bob   AZ     M       21   25000   55
Tom   FL     M       30   100000  45
I only want to normalize the Age, Income, and Height columns, but my method above does not work because of the string data in the Name, State, and Gender columns.
Accepted answer by cwharland
You can perform operations on a subset of rows or columns in pandas in a number of ways. One useful way is indexing:
# Assuming same lines from your example
cols_to_norm = ['Age','Height']
survey_data[cols_to_norm] = survey_data[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
This applies the function only to the columns you select and assigns the result back to those columns. Alternatively, you could assign the result to new, normalized columns and keep the originals.
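As a minimal sketch of the second option (keeping the originals), the snippet below rebuilds the three sample rows from the question and stores the normalized values in new columns. The `_norm` suffix and the use of the question's mean-based formula are assumptions made for illustration, not part of the answer.

```python
import pandas as pd

# Sample rows copied from the question
survey_data = pd.DataFrame({
    "Name": ["Sam", "Bob", "Tom"],
    "State": ["CA", "AZ", "FL"],
    "Gender": ["M", "M", "M"],
    "Age": [13, 21, 30],
    "Income": [10000, 25000, 100000],
    "Height": [70, 55, 45],
})

cols_to_norm = ["Age", "Income", "Height"]
subset = survey_data[cols_to_norm]

# Formula from the question: (x - mean) / (max - min), column by column
normalized = (subset - subset.mean()) / (subset.max() - subset.min())

# Hypothetical naming scheme: add "_norm" columns next to the originals
survey_data = survey_data.join(normalized.add_suffix("_norm"))

print(survey_data)
```

Because the string columns are never part of `subset`, they pass through untouched, and the original Age, Income, and Height values remain available for comparison.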
Answer by Alvaro Joao
A simpler and more efficient way: pre-calculate the mean, maximum, and minimum. Calling dropna() first avoids missing data.
mean_age = survey_data.Age.dropna().mean()
max_age = survey_data.Age.dropna().max()
min_age = survey_data.Age.dropna().min()
survey_data['Age'] = survey_data['Age'].apply(lambda x: (x - mean_age) / (max_age - min_age))
This approach works.
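The same calculation can also be written without `apply`. This is a minimal sketch, relying on the fact that pandas `Series.mean()`, `max()`, and `min()` skip NaN values by default (`skipna=True`), so the explicit `dropna()` calls become optional. The sample values, including the missing one, are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing age (illustrative data only)
survey_data = pd.DataFrame({"Age": [13.0, 21.0, np.nan, 30.0]})

age = survey_data["Age"]

# mean(), max() and min() ignore NaN by default (skipna=True)
survey_data["Age"] = (age - age.mean()) / (age.max() - age.min())

# The missing entry stays NaN; the other rows are normalized
print(survey_data)
```

Whole-column arithmetic like this avoids calling a Python lambda once per row, which is usually the slower path in pandas.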
Answer by Yaron
I think it is better to use sklearn.preprocessing here, since it offers many more scaling options. In your case, using StandardScaler would look like this:
from sklearn.preprocessing import StandardScaler
cols_to_norm = ['Age','Height']
surveyData[cols_to_norm] = StandardScaler().fit_transform(surveyData[cols_to_norm])
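Answer by Gauravdeep
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# `dataset` is your DataFrame
minmax = MinMaxScaler()
# Scale every int64 column to the [0, 1] range, one column at a time
for x in dataset.columns[dataset.dtypes == 'int64']:
    # fit_transform expects a 2-D array, so reshape to a single column first
    dataset[x] = minmax.fit_transform(np.array(dataset[x]).reshape(-1, 1)).ravel()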
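Filtering on `'int64'` skips float columns and other integer widths. One hedged alternative is to pick the columns with `DataFrame.select_dtypes(include="number")`. The sketch below pairs that with a plain pandas min-max formula instead of `MinMaxScaler`; the rows are adapted from the question, and the Income column is stored as floats on purpose to show that it is still selected.

```python
import pandas as pd

# Rows adapted from the question; Income is deliberately a float column
dataset = pd.DataFrame({
    "Name": ["Sam", "Bob", "Tom"],
    "Age": [13, 21, 30],
    "Income": [10000.0, 25000.0, 100000.0],
})

# Select every numeric column, whatever its exact dtype
numeric_cols = dataset.select_dtypes(include="number").columns
numeric = dataset[numeric_cols]

# Min-max scaling: (x - min) / (max - min), column by column
dataset[numeric_cols] = (numeric - numeric.min()) / (numeric.max() - numeric.min())

print(dataset)
```

A column whose values are all identical would produce a zero denominator here, so such columns may need to be excluded or handled separately.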