Python: Normalize columns of a pandas DataFrame

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/26414913/

Date: 2020-08-19 00:27:33  Source: igfitidea

Normalize columns of pandas data frame

Tags: python, pandas, normalize

Asked by ahajib

I have a dataframe in pandas where each column has a different value range. For example:

df:

A     B   C
1000  10  0.5
765   5   0.35
800   7   0.09

Any idea how I can normalize the columns of this dataframe so that each value is between 0 and 1?

My desired output is:

A     B    C
1     1    1
0.765 0.5  0.7
0.8   0.7  0.18(which is 0.09/0.5)

Accepted answer by Sandman

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

x = df.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# pass columns and index so the labels of df are kept
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)

For more information, look at the scikit-learn documentation on preprocessing data: scaling features to a range.
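
As a rough illustration of what this scaling does, the sketch below reproduces the min-max formula in plain pandas on the question's data. To my understanding, `MinMaxScaler` with its default `feature_range=(0, 1)` applies the same per-column formula. Note that it maps each column's minimum to 0, so the result differs from the output requested in the question, which only divides by the maximum. The name `min_max` is just for illustration.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({"A": [1000, 765, 800], "B": [10, 5, 7], "C": [0.5, 0.35, 0.09]})

# Min-max scaling per column: (x - min) / (max - min).
min_max = (df - df.min()) / (df.max() - df.min())

# Each column's minimum maps to 0 and its maximum to 1.
print(min_max)
```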

Answer by tschm

Your problem is actually a simple transform acting on the columns:

def f(s):
    return s/s.max()

frame.apply(f, axis=0)

Or even more terse:

frame.apply(lambda x: x/x.max(), axis=0)
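
For a quick check, here is a small sketch that runs the one-liner on the question's data. The name `scaled` is mine; the result should match the desired output in the question.

```python
import pandas as pd

# The example frame from the question.
frame = pd.DataFrame({"A": [1000, 765, 800], "B": [10, 5, 7], "C": [0.5, 0.35, 0.09]})

# axis=0 passes each column to the lambda as a Series.
scaled = frame.apply(lambda x: x / x.max(), axis=0)
print(scaled.round(3))
```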

Answer by Daniele

I think that a better way to do that in pandas is just:

import numpy as np
df = df/df.max().astype(np.float64)

Edit: If your data frame contains negative numbers, you should instead use:

# divide each column by its largest absolute value; results lie in [-1, 1]
df = df / df.abs().max()
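
One way to handle negative numbers, sketched here on made-up data, is to divide each column by its largest absolute value. That keeps the sign and puts every value in [-1, 1]. The data and the name `scaled` are hypothetical.

```python
import pandas as pd

# Hypothetical data with negative entries (not from the question).
df = pd.DataFrame({"A": [-200.0, 50.0, 100.0], "B": [-1.0, 4.0, -8.0]})

# Divide each column by its largest absolute value.
scaled = df / df.abs().max()
print(scaled)
```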

Answer by Michael Aquilina

Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

You can do the following:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

You don't need to worry about whether your values are negative or positive, and the values will be spread between 0 and 1.
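
One caveat: if a column holds a single repeated value, max - min is zero and the division produces NaN. Below is a minimal sketch of one way to guard against that. `normalize_safe` and the sample data are hypothetical, and mapping constant columns to 0 is a design choice of mine, not something the answer specifies.

```python
import pandas as pd

def normalize_safe(df):
    """Min-max scale each column; constant columns become 0 instead of NaN."""
    value_range = df.max() - df.min()
    # Replace a zero range with 1 so the division is defined.
    value_range = value_range.replace(0, 1)
    return (df - df.min()) / value_range

# Hypothetical data: column "const" has a single repeated value.
example = pd.DataFrame({"x": [-5.0, 0.0, 5.0], "const": [3.0, 3.0, 3.0]})
print(normalize_safe(example))
```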

Answer by Cina

One easy way using pandas (here I use mean normalization, i.e. subtracting the mean and dividing by the standard deviation):

normalized_df=(df-df.mean())/df.std()

To use min-max normalization:

normalized_df=(df-df.min())/(df.max()-df.min())

Edit: To address some concerns, note that pandas automatically applies these operations column-wise in the code above.
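
To make the column-wise behaviour concrete, here is a small sketch on the question's data. `df.mean()` and `df.std()` return one value per column, and the arithmetic aligns on the column labels. The name `standardized` is mine.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({"A": [1000, 765, 800], "B": [10, 5, 7], "C": [0.5, 0.35, 0.09]})

# df.mean() and df.std() are Series indexed by column name, so the
# arithmetic below aligns on column labels and works column by column.
standardized = (df - df.mean()) / df.std()

# Each column now has mean ~0 and sample standard deviation ~1.
print(standardized.mean().round(6))
print(standardized.std().round(6))
```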

Answer by j sad

If you like using the sklearn package, you can keep the column and index names by using pandas loc like so:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
scaled_values = scaler.fit_transform(df) 
df.loc[:,:] = scaled_values

Answer by cyber-math

The solutions given by Sandman and Praveen are very good. The only problem is that if you have categorical variables in other columns of your data frame, this method needs some adjustments.

My solution to this type of issue is the following:

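import pandas as pd
from sklearn import preprocessing

# Numerical1..3 and Categoricals stand for your own column names
# axis=1 keeps the numeric columns side by side (2-D input for the scaler)
x = pd.concat([df.Numerical1, df.Numerical2, df.Numerical3], axis=1)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
x_new = pd.DataFrame(x_scaled, columns=x.columns, index=df.index)
df = pd.concat([df.Categoricals, x_new], axis=1)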
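
If you would rather not list the numeric columns by hand, a pandas-only variant might look like the sketch below. It uses `select_dtypes` and the min-max formula directly instead of sklearn. The frame and its column names (`color`, `height`, `weight`) are made up for illustration.

```python
import pandas as pd

# Hypothetical frame mixing a categorical column with numeric ones.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "height": [1.0, 2.0, 3.0],
    "weight": [10.0, 30.0, 20.0],
})

# Pick the numeric columns automatically instead of listing them by hand.
numeric_cols = df.select_dtypes(include="number").columns

# Min-max scale only those columns; "color" is left untouched.
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].min()) / (
    df[numeric_cols].max() - df[numeric_cols].min()
)
print(df)
```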

Answer by Basil Musa

Simple is Beautiful:

df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()

Answer by shg

import numpy as np

def normalize(x):
    # divide the column by its L1 norm (sum of absolute values)
    return x / np.linalg.norm(x, ord=1)

data = data.apply(normalize)

According to the pandas documentation, a DataFrame can apply an operation (function) to itself:

DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

Applies function along input axis of DataFrame. Objects passed to functions are Series objects having index either the DataFrame's index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.

You can apply a custom function to operate on the DataFrame.
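
As a small illustration of `apply` with an L1 norm, the sketch below divides each column of the question's data by the sum of its absolute values. The name `l1_scaled` is mine. Note that this makes each column sum to 1 in absolute value, so the largest entry is generally below 1, unlike the output requested in the question.

```python
import numpy as np
import pandas as pd

# The example frame from the question.
data = pd.DataFrame({"A": [1000, 765, 800], "B": [10, 5, 7], "C": [0.5, 0.35, 0.09]})

# apply() defaults to axis=0, so each column is passed as a Series.
l1_scaled = data.apply(lambda col: col / np.linalg.norm(col, ord=1))

# Every column's absolute values now sum to 1.
print(l1_scaled.abs().sum())
```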

Answer by raullalves

You can create a list of columns that you want to normalize:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
x = df[column_names_to_normalize].values
min_max_scaler = MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index=df.index)
df[column_names_to_normalize] = df_temp

Your pandas DataFrame is now normalized only in the columns you want.

However, if you want the opposite, i.e. to select a list of columns that you DON'T want to normalize, you can simply create a list of all columns and remove the undesired ones:

column_names_to_not_normalize = ['B', 'J', 'K']
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize ]
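
Putting the exclusion list to work might look like the sketch below. The frame is made up, and for brevity the scaling uses the plain pandas min-max formula in place of the `min_max_scaler` object from the answer; `selected` is a name I introduce.

```python
import pandas as pd

# Hypothetical frame; "B" is the column to leave unscaled.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [7, 8, 9], "C": [10.0, 20.0, 40.0]})

column_names_to_not_normalize = ["B"]
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize]

# Min-max scale the selected columns in plain pandas.
selected = df[column_names_to_normalize]
df[column_names_to_normalize] = (selected - selected.min()) / (selected.max() - selected.min())
print(df)
```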