pandas: Replace NaN in a dataframe with random values

Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/30647247/

Time: 2020-09-13 23:26:02  Source: igfitidea

Replace NaN in a dataframe with random values

python, pandas

Asked by Sam

I have a data frame (data_train) with NaN values. A sample is given below:

republican                n                          y   
republican                n                          NaN   
democrat                 NaN                         n
democrat                  n                          y   

I want to replace all the NaN values with random values, like this:

republican                n                           y   
republican                n                          rnd2
democrat                 rnd1                         n
democrat                  n                           y   

How do I do it?

I tried the following, but had no luck:


df_rand = pd.DataFrame(np.random.randn(data_train.shape[0], data_train.shape[1]))
data_train[pd.isnull(data_train)] = df_rand[pd.isnull(data_train)]

When I apply the above to a dataframe containing random numerical data, the script works fine.

Answered by fixxxer

Well, if you use fillna to fill the NaN values, the random generator runs only once, so every N/A is filled with the same number.
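The behaviour described above can be checked with a small sketch. The Series `s` is a made-up toy column, not data from the question.

```python
import random

import numpy as np
import pandas as pd

# Hypothetical toy column with three missing entries
s = pd.Series([np.nan, 1.0, np.nan, np.nan])

# random.random() is evaluated once, before fillna is called,
# so fillna only ever receives one ordinary float
filled = s.fillna(random.random())

# Look at the cells that used to be NaN
former_nan_values = filled[s.isnull()]
print(former_nan_values.nunique())  # prints 1: a single shared value
```

The argument is computed before the call, so pandas has no way to draw a new number per cell.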

So, make sure that a new random number is generated and used each time. For a dataframe like this:

          Date         A       B
0   2015-01-01       NaN     NaN
1   2015-01-02       NaN     NaN
2   2015-01-03       NaN     NaN
3   2015-01-04       NaN     NaN
4   2015-01-05       NaN     NaN
5   2015-01-06       NaN     NaN
6   2015-01-07       NaN     NaN
7   2015-01-08       NaN     NaN
8   2015-01-09       NaN     NaN
9   2015-01-10       NaN     NaN
10  2015-01-11       NaN     NaN
11  2015-01-12       NaN     NaN
12  2015-01-13       NaN     NaN
13  2015-01-14       NaN     NaN
14  2015-01-15       NaN     NaN
15  2015-01-16       NaN     NaN

I used the following code to fill the NaNs in column A:

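import random
import pandas as pd
# Draw a fresh random number for each NaN; leave existing values untouched
x['A'] = x['A'].apply(lambda v: random.random() * 1000 if pd.isnull(v) else v)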

Which will give us something like:


          Date           A       B
0   2015-01-01   96.538211     NaN
1   2015-01-02  404.683392     NaN
2   2015-01-03  849.614253     NaN
3   2015-01-04  590.030660     NaN
4   2015-01-05  203.167519     NaN
5   2015-01-06  980.508258     NaN
6   2015-01-07  221.088002     NaN
7   2015-01-08  285.013762     NaN
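For larger frames, the same per-cell idea can be written without `apply`. This is a minimal sketch under assumptions: the frame `x` below is invented for illustration, and the seed is there only to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column A mixes real values and NaNs, column B is all NaN
x = pd.DataFrame({"A": [1.5, np.nan, np.nan, 4.0], "B": [np.nan] * 4})

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Boolean mask of the cells to fill
mask = x["A"].isnull()

# One independent draw per missing cell, scaled to [0, 1000)
x.loc[mask, "A"] = rng.random(mask.sum()) * 1000

print(x)
```

Only the masked cells are assigned, so the values already present in column A stay as they were and column B is untouched.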

Answered by Abramodj

You can use the pandas update method, this way:

1) Generate a random DataFrame with the same columns and index as the original one:


import numpy as np; import pandas as pd
M = len(df.index)
N = len(df.columns)
ran = pd.DataFrame(np.random.randn(M,N), columns=df.columns, index=df.index)

2) Then use update with overwrite=False, so that only the NaN values in df are replaced by the generated random values (by default, update would also overwrite the existing values):

df.update(ran, overwrite=False)
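A minimal end-to-end sketch of the two steps, assuming a small invented numeric frame. Note that `DataFrame.update` overwrites existing values by default (`overwrite=True`); passing `overwrite=False` restricts it to the cells that are NaN in `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with two missing cells
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

# Random frame aligned on the same index and columns
ran = pd.DataFrame(np.random.randn(*df.shape), columns=df.columns, index=df.index)

# overwrite=False: only cells that are NaN in df take values from ran
df.update(ran, overwrite=False)

print(df)
```

`update` works in place and returns None, so there is nothing to assign back.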


In the above example I used values from a standard normal distribution, but you can also use values randomly picked from the original DataFrame:

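import numpy as np; import pandas as pd

M = len(df.index)
N = len(df.columns)

# Collect the non-missing values of df as the pool to sample from
val = np.ravel(df.values)
val = val[~pd.isnull(val)]  # pd.isnull also handles non-numeric (object) data
val = np.random.choice(val, size=(M,N))
ran = pd.DataFrame(val, columns=df.columns, index=df.index)

# Fill only the NaN cells; existing values are left untouched
df.update(ran, overwrite=False)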
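For string data like the sample in the question, one variation is to sample per column, so that a missing vote is never filled with a party name. This is a sketch under assumptions, not the answer's own method: the column names `party`, `vote1` and `vote2` are invented here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the question's sample
df = pd.DataFrame({
    "party": ["republican", "republican", "democrat", "democrat"],
    "vote1": ["n", "n", np.nan, "n"],
    "vote2": ["y", np.nan, "n", "y"],
})

filled = df.copy()
for col in filled.columns:
    missing = filled[col].isnull()
    observed = filled.loc[~missing, col].to_numpy()
    if missing.any() and len(observed) > 0:
        # Draw one observed value (with replacement) per missing cell
        filled.loc[missing, col] = np.random.choice(observed, size=missing.sum())

print(filled)
```

Sampling from the observed values rather than the distinct ones keeps the fill roughly proportional to how often each value occurs in the column.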

Answered by Mangnier Loïc

If you want to replace the NaN values in your columns using the hot deck technique, I can suggest an approach like this:

import numpy as np

def hot_deck(dataframe):
    # Work on a copy so the caller's frame is not modified
    dataframe = dataframe.copy()
    for col in dataframe.columns:
        assert dataframe[col].dtype == np.float64 or dataframe[col].dtype == np.int64
        missing = dataframe[col].isnull()
        # Donor pool: the distinct observed (non-NaN) values of this column
        donors = dataframe.loc[~missing, col].unique()
        if missing.any() and len(donors) > 0:
            dataframe.loc[missing, col] = np.random.choice(donors, size=missing.sum())
    return dataframe

Alternatively, if you prefer to replace each NaN with a new random value, you can do something like the following. You only need to choose the maximum value for the random draws.

import numpy as np

def hot_deck(dataframe, max_value):
    # Work on a copy so the caller's frame is not modified
    dataframe = dataframe.copy()
    for col in dataframe.columns:
        assert dataframe[col].dtype == np.float64 or dataframe[col].dtype == np.int64
        missing = dataframe[col].isnull()
        # One random integer in [0, max_value) per missing cell
        dataframe.loc[missing, col] = np.random.randint(0, max_value, size=missing.sum())
    return dataframe

Answered by farhawa

Just use fillna this way:

import random
# Note: random.random() is evaluated once, so every NaN gets the same value
data_train = data_train.fillna(random.random())
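If a different value per cell is wanted while still using `fillna`, one option is to pass a Series of random numbers instead of a scalar. A minimal sketch, assuming a made-up numeric column `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries
s = pd.Series([np.nan, 2.0, np.nan, np.nan])

# One random number per row, indexed like s
randoms = pd.Series(np.random.random(len(s)), index=s.index)

# fillna aligns on the index, so each NaN takes the draw from its own row
filled = s.fillna(randoms)

print(filled)
```

Draws for rows that were not missing are simply ignored, so this generates more random numbers than it strictly needs.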