pandas: Replace NaN in a DataFrame with random values

Question

Asked by Sam

I have a data frame (data_train) with NaN values. A sample is given below:

republican                n                          y   
republican                n                          NaN   
democrat                 NaN                         n
democrat                  n                          y

I want to replace every NaN with a random value, like this:

republican                n                           y   
republican                n                          rnd2
democrat                 rnd1                         n
democrat                  n                           y

How do I do it?

I tried the following, but had no luck:

df_rand = pd.DataFrame(np.random.randn(data_train.shape[0], data_train.shape[1]))
data_train[pd.isnull(data_train)] = df_rand[pd.isnull(data_train)]

When I do the same with a dataframe of random numerical data, the script works fine.

Answer 1

Answered by fixxxer

Well, if you use fillna to fill the NaN, the random generator is called only once, so every NaN is filled with the same number.

So, make sure that a new random number is generated for each value. For a dataframe like this:

          Date         A       B
0   2015-01-01       NaN     NaN
1   2015-01-02       NaN     NaN
2   2015-01-03       NaN     NaN
3   2015-01-04       NaN     NaN
4   2015-01-05       NaN     NaN
5   2015-01-06       NaN     NaN
6   2015-01-07       NaN     NaN
7   2015-01-08       NaN     NaN
8   2015-01-09       NaN     NaN
9   2015-01-10       NaN     NaN
10  2015-01-11       NaN     NaN
11  2015-01-12       NaN     NaN
12  2015-01-13       NaN     NaN
13  2015-01-14       NaN     NaN
14  2015-01-15       NaN     NaN
15  2015-01-16       NaN     NaN

I used the following code to fill the NaNs in column A:

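import random
import pandas as pd
# Draw a fresh random number for each missing value; keep existing values unchanged
x['A'] = x['A'].apply(lambda v: random.random() * 1000 if pd.isna(v) else v)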

Which will give us something like:

          Date           A       B
0   2015-01-01   96.538211     NaN
1   2015-01-02  404.683392     NaN
2   2015-01-03  849.614253     NaN
3   2015-01-04  590.030660     NaN
4   2015-01-05  203.167519     NaN
5   2015-01-06  980.508258     NaN
6   2015-01-07  221.088002     NaN
7   2015-01-08  285.013762     NaN
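
The same idea can be written without a Python-level loop. The following is a minimal sketch, not part of the original answer: the helper name fill_column_random and its scale and seed parameters are hypothetical, and it assumes a numeric column. It draws one random number per missing cell and leaves existing values alone.

```python
import numpy as np
import pandas as pd

def fill_column_random(series, scale=1000.0, seed=None):
    """Return a copy of `series` where each NaN gets its own random draw."""
    rng = np.random.default_rng(seed)
    out = series.copy()
    missing = out.isna()
    # One uniform draw in [0, scale) per missing cell
    out[missing] = rng.random(int(missing.sum())) * scale
    return out

# Small illustrative frame (made-up data)
x = pd.DataFrame({"A": [np.nan, 5.0, np.nan, np.nan], "B": [np.nan] * 4})
x["A"] = fill_column_random(x["A"], seed=0)
```

Column B is untouched here; apply the helper to each column you want filled.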

Answer 2

Answered by Abramodj

You can use the pandas update method, this way:

1) Generate a random DataFrame with the same columns and index as the original one:

import numpy as np; import pandas as pd
M = len(df.index)
N = len(df.columns)
ran = pd.DataFrame(np.random.randn(M,N), columns=df.columns, index=df.index)

2) Then use update with overwrite=False, so that only the NaN values in df are replaced by the generated random values (with the default overwrite=True, every value in df would be replaced):

df.update(ran, overwrite=False)

In the above example I used values drawn from a standard normal distribution, but you can also use values randomly picked from the original DataFrame:

import numpy as np; import pandas as pd

M = len(df.index)
N = len(df.columns)

val = np.ravel(df.values)
val = val[~np.isnan(val)]
val = np.random.choice(val, size=(M,N))
ran = pd.DataFrame(val, columns=df.columns, index=df.index)

# overwrite=False fills only the NaN cells and keeps existing values
df.update(ran, overwrite=False)
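
Note that np.isnan raises a TypeError on object arrays, so the snippet above assumes numeric data. For categorical data like the y/n columns in the question, a column-wise variant may be easier. This is a rough sketch under that assumption: the function name fill_from_observed and the column names party, vote1, and vote2 are made up for illustration. Each NaN is replaced by a value sampled from the observed entries of the same column.

```python
import numpy as np
import pandas as pd

def fill_from_observed(df, seed=None):
    """Fill each NaN with a random draw from the non-missing values of its column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        missing = out[col].isna()
        donors = out.loc[~missing, col].to_numpy()
        # Skip columns with nothing to fill or nothing to sample from
        if missing.any() and len(donors) > 0:
            out.loc[missing, col] = rng.choice(donors, size=int(missing.sum()))
    return out

# Illustrative data shaped like the sample in the question
data_train = pd.DataFrame({
    "party": ["republican", "republican", "democrat", "democrat"],
    "vote1": ["n", "n", np.nan, "n"],
    "vote2": ["y", np.nan, "n", "y"],
})
filled = fill_from_observed(data_train, seed=1)
```

Sampling per column keeps the filled values within the categories that column already contains.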

Answer 3

Answered by Mangnier Loïc

If you want to replace the NaN values in your columns using the hot deck technique (filling each missing value with a value observed elsewhere in the same column), I can propose something like this:

import numpy as np

def hot_deck(dataframe):
    dataframe = dataframe.copy()
    for col in dataframe.columns:
        assert dataframe[col].dtype in (np.float64, np.int64)
        missing = dataframe[col].isna()
        # Donors are the distinct observed (non-missing) values of the column
        donors = dataframe[col].dropna().unique()
        if missing.any():
            dataframe.loc[missing, col] = np.random.choice(donors, size=missing.sum())
    return dataframe

If you would rather replace each NaN with a newly generated random value, you can do something like the following. You only need to choose the upper bound (exclusive) for the random draws.

import numpy as np

def hot_deck(dataframe, max_value):
    dataframe = dataframe.copy()
    for col in dataframe.columns:
        assert dataframe[col].dtype in (np.float64, np.int64)
        missing = dataframe[col].isna()
        if missing.any():
            # One independent integer draw from [0, max_value) per missing cell
            dataframe.loc[missing, col] = np.random.randint(0, max_value, size=missing.sum())
    return dataframe

Answer 4

Answered by farhawa

Just use fillna this way:

import random
# Note: random.random() is evaluated once, so every NaN receives the same value
data_train = data_train.fillna(random.random())
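
As the first answer points out, passing a single number to fillna fills every NaN with that one number. A small sketch of the difference, using a made-up numeric frame: fillna also accepts a DataFrame with matching index and columns, in which case each NaN takes the value at the same position, so a frame of random numbers gives a different fill per cell.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"A": [1.0, np.nan, np.nan], "B": [np.nan, 2.0, np.nan]})

# Scalar fill: the random number is drawn once and reused for every NaN
same = df.fillna(rng.random())

# Frame fill: one random number per cell, aligned on index and columns
ran = pd.DataFrame(rng.random(df.shape), index=df.index, columns=df.columns)
distinct = df.fillna(ran)
```

In both results the existing values (1.0 and 2.0) are kept; only the NaN cells change.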
