numpy array: replace nan values with average of columns

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/18689235/

Time: 2020-08-19 11:28:44  Source: igfitidea

python arrays numpy nan

Asked by piokuc

I've got a numpy array filled mostly with real numbers, but there are a few nan values in it as well.

How can I replace the nans with the averages of the columns where they are?

Accepted answer by Daniel

No loops required:

print(a)
[[ 0.93230948         nan  0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785         nan]
 [ 0.64940216  0.74414127         nan         nan]]

#Obtain mean of columns as you need, nanmean is convenient.
col_mean = np.nanmean(a, axis=0)
print(col_mean)
[ 0.86726219  0.7030395   0.44528687  0.66640474]

#Find indices that you need to replace
inds = np.where(np.isnan(a))

#Place column means in the indices. Align the arrays using take
a[inds] = np.take(col_mean, inds[1])

print(a)
[[ 0.93230948  0.7030395   0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785  0.66640474]
 [ 0.64940216  0.74414127  0.44528687  0.66640474]]
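
The steps above can be wrapped into a self-contained sketch. The function name fill_nan_with_column_mean and the small sample array are made up for illustration, and the function works on a copy rather than modifying the input in place, which is a design choice not taken from the answer:

```python
import numpy as np

def fill_nan_with_column_mean(arr):
    """Return a float copy of arr with each NaN replaced by its column's mean.

    Hypothetical helper: the column mean is computed with np.nanmean,
    so NaN entries are ignored when averaging.
    """
    out = np.array(arr, dtype=float)        # copy, so the caller's array is untouched
    col_mean = np.nanmean(out, axis=0)      # one mean per column, NaNs ignored
    rows, cols = np.where(np.isnan(out))    # positions of the NaN entries
    out[rows, cols] = col_mean[cols]        # pick the mean matching each NaN's column
    return out

a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 9.0, 9.0]])
filled = fill_nan_with_column_mean(a)
print(filled)
```

Indexing col_mean[cols] does the same alignment as np.take(col_mean, inds[1]) in the answer.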

Answered by Hammer

This isn't very clean, but I can't think of a way to do it other than iterating:

import numpy as np

# example
a = np.arange(16, dtype=float).reshape(4, 4)
a[2, 2] = np.nan
a[3, 3] = np.nan

# np.where returns a tuple of (row indices, column indices) of the NaN entries
indices = np.where(np.isnan(a))
for row, col in zip(*indices):
    # mean of the non-NaN entries in this column
    a[row, col] = np.mean(a[~np.isnan(a[:, col]), col])
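
If iterating is acceptable, a possible variation is to loop once per column instead of once per NaN entry. This is only a sketch on the same example data; skipping columns that are entirely NaN is an assumption added here, not something the answer addresses:

```python
import numpy as np

a = np.arange(16, dtype=float).reshape(4, 4)
a[2, 2] = np.nan
a[3, 3] = np.nan

for col in range(a.shape[1]):
    mask = np.isnan(a[:, col])              # True where this column has NaN
    # Assumption: leave a column alone if it has no NaN or is all NaN
    if mask.any() and not mask.all():
        a[mask, col] = a[~mask, col].mean() # mean of the valid entries only

print(a)
```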

Answered by ifryed

You might want to try this built-in function:

x = np.array([np.inf, -np.inf, np.nan, -128, 128])
np.nan_to_num(x)
array([  1.79769313e+308,  -1.79769313e+308,   0.00000000e+000,
-1.28000000e+002,   1.28000000e+002])
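
As the output above shows, np.nan_to_num with its default arguments replaces nan with 0.0 rather than with a column mean. A small sketch on a made-up 2-D array makes the difference visible:

```python
import numpy as np

x = np.array([[1.0, np.nan],
              [3.0, 4.0]])

zero_filled = np.nan_to_num(x)      # default behaviour: NaN becomes 0.0
col_mean = np.nanmean(x, axis=0)    # column means ignoring NaN, for comparison

print(zero_filled[0, 1])  # value that nan_to_num put in place of the NaN
print(col_mean[1])        # mean of the valid entries in that column
```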

Answered by Ulf Aslak

Alternative: Replacing NaNs with interpolation of columns.

def interpolate_nans(X):
    """Overwrite NaNs with column value interpolations."""
    for j in range(X.shape[1]):
        mask_j = np.isnan(X[:,j])
        X[mask_j,j] = np.interp(np.flatnonzero(mask_j), np.flatnonzero(~mask_j), X[~mask_j,j])
    return X

Example use:

X_incomplete = np.array([[10,     20,     30    ],
                         [np.nan, 30,     np.nan],
                         [np.nan, np.nan, 50    ],
                         [40,     50,     np.nan]])

X_complete = interpolate_nans(X_incomplete)

print(X_complete)
[[10. 20. 30.]
 [20. 30. 40.]
 [30. 40. 50.]
 [40. 50. 50.]]

I use this bit of code for time series data in particular, where columns are attributes and rows are time-ordered samples.

Answered by Donald Hobson

If partial is your original data, and replace is an array of the same shape containing averaged values, then this code will use the value from partial if one exists, and the value from replace otherwise.

Complete = np.where(np.isnan(partial), replace, partial)
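
The answer does not say how replace is built. One way to get an array of the same shape holding column averages, shown here only as a sketch with made-up data, is to broadcast the np.nanmean result to the shape of partial:

```python
import numpy as np

partial = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [np.nan, 8.0]])

col_mean = np.nanmean(partial, axis=0)               # per-column mean, NaNs ignored
replace = np.broadcast_to(col_mean, partial.shape)   # read-only view with partial's shape
complete = np.where(np.isnan(partial), replace, partial)

print(complete)
```

np.where would also broadcast col_mean on its own, so the explicit np.broadcast_to step is there only to match the "same shape" wording.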

Answered by LetsPlayYahtzee

To extend Donald's answer, I provide a minimal example. Let's say a is an ndarray and we want to replace its zero values with the mean of the column.

In [231]: a
Out[231]: 
array([[0, 3, 6],
       [2, 0, 0]])


In [232]: col_mean = np.nanmean(a, axis=0)
Out[232]: array([ 1. ,  1.5,  3. ])

In [228]: np.where(np.equal(a, 0), col_mean, a)
Out[228]: 
array([[ 1. ,  3. ,  6. ],
       [ 2. ,  1.5,  3. ]])
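
Note that the column means in the example above include the zeros themselves (for instance 1.5 is the mean of 3 and 0). If the zeros stand for missing data and should be left out of the average, one possible variation, which is an assumption and not part of the answer, is to turn them into nan before averaging:

```python
import numpy as np

a = np.array([[0, 3, 6],
              [2, 0, 0]])

is_zero = (a == 0)
# Treat zeros as missing: convert to float and mark them as NaN
as_missing = np.where(is_zero, np.nan, a.astype(float))
col_mean_nonzero = np.nanmean(as_missing, axis=0)   # means over non-zero entries only
filled = np.where(is_zero, col_mean_nonzero, a)

print(filled)
```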

Answered by Praveen

Using masked arrays

The standard way to do this using only numpy would be to use the masked array module.

Scipy is a pretty heavy package which relies on external libraries, so it's worth having a numpy-only method. This borrows from @DonaldHobson's answer.

Edit: np.nanmean is now a numpy function. However, it doesn't handle all-nan columns...

Suppose you have an array a:

>>> a
array([[  0.,  nan,  10.,  nan],
       [  1.,   6.,  nan,  nan],
       [  2.,   7.,  12.,  nan],
       [  3.,   8.,  nan,  nan],
       [ nan,   9.,  14.,  nan]])

>>> import numpy.ma as ma
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=0), a)    
array([[  0. ,   7.5,  10. ,   0. ],
       [  1. ,   6. ,  12. ,   0. ],
       [  2. ,   7. ,  12. ,   0. ],
       [  3. ,   8. ,  12. ,   0. ],
       [  1.5,   9. ,  14. ,   0. ]])

Note that the masked array's mean does not need to be the same shape as a, because we're taking advantage of the implicit broadcasting over rows.

Also note how the all-nan column is nicely handled. The mean is zero since you're taking the mean of zero elements. The method using nanmean doesn't handle all-nan columns:

>>> col_mean = np.nanmean(a, axis=0)
/home/praveen/.virtualenvs/numpy3-mkl/lib/python3.4/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
  warnings.warn("Mean of empty slice", RuntimeWarning)
>>> inds = np.where(np.isnan(a))
>>> a[inds] = np.take(col_mean, inds[1])
>>> a
array([[  0. ,   7.5,  10. ,   nan],
       [  1. ,   6. ,  12. ,   nan],
       [  2. ,   7. ,  12. ,   nan],
       [  3. ,   8. ,  12. ,   nan],
       [  1.5,   9. ,  14. ,   nan]])
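
Since the all-nan column is where the two approaches differ, it may be clearer to state the fallback value explicitly. The sketch below uses made-up data and calls filled(0.0) on the masked mean; the choice of 0.0 as the fallback is an assumption made to match the output shown earlier:

```python
import numpy as np
import numpy.ma as ma

a = np.array([[0.0, np.nan, np.nan],
              [1.0, 6.0, np.nan],
              [np.nan, 8.0, np.nan]])

masked = ma.array(a, mask=np.isnan(a))
# Mean over unmasked values; a fully masked column yields a masked entry,
# which filled() turns into the chosen fallback (0.0 here, an assumption)
col_mean = masked.mean(axis=0).filled(0.0)
filled = np.where(np.isnan(a), col_mean, a)

print(filled)
```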


Explanation

Converting a into a masked array gives you

>>> ma.array(a, mask=np.isnan(a))
masked_array(data =
 [[0.0 --  10.0 --]
  [1.0 6.0 --   --]
  [2.0 7.0 12.0 --]
  [3.0 8.0 --   --]
  [--  9.0 14.0 --]],
             mask =
 [[False  True False  True]
 [False False  True  True]
 [False False False  True]
 [False False  True  True]
 [ True False False  True]],
       fill_value = 1e+20)

And taking the mean over columns gives you the correct answer, normalizing only over the non-masked values:

>>> ma.array(a, mask=np.isnan(a)).mean(axis=0)
masked_array(data = [1.5 7.5 12.0 --],
             mask = [False False False  True],
       fill_value = 1e+20)

Further, note how the mask nicely handles the column which is all-nan!

Finally, np.where does the job of replacement.



Row-wise mean

Replacing nan values with the row-wise mean instead of the column-wise mean requires a tiny change for broadcasting to take effect nicely:

>>> a
array([[  0.,   1.,   2.,   3.,  nan],
       [ nan,   6.,   7.,   8.,   9.],
       [ 10.,  nan,  12.,  nan,  14.],
       [ nan,  nan,  nan,  nan,  nan]])

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1), a)
ValueError: operands could not be broadcast together with shapes (4,5) (4,) (4,5)

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1)[:, np.newaxis], a)
array([[  0. ,   1. ,   2. ,   3. ,   1.5],
       [  7.5,   6. ,   7. ,   8. ,   9. ],
       [ 10. ,  12. ,  12. ,  12. ,  14. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ]])
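
The same broadcasting fix can be sketched with np.nanmean, whose keepdims=True argument keeps the reduced axis so the row means come back with shape (rows, 1). The example reuses the first three rows of the array above and leaves out the all-nan row, which np.nanmean would not handle:

```python
import numpy as np

a = np.array([[0.0, 1.0, 2.0, 3.0, np.nan],
              [np.nan, 6.0, 7.0, 8.0, 9.0],
              [10.0, np.nan, 12.0, np.nan, 14.0]])

# keepdims=True gives shape (3, 1), which broadcasts across the columns
row_mean = np.nanmean(a, axis=1, keepdims=True)
filled = np.where(np.isnan(a), row_mean, a)

print(filled)
```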

Answered by rnso

Using simple functions with loops:

a=[[0.93230948, np.nan, 0.47773439, 0.76998063],
  [0.94460779, 0.87882456, 0.79615838, 0.56282885],
  [0.94272934, 0.48615268, 0.06196785, np.nan],
  [0.64940216, 0.74414127, np.nan, np.nan],
  [0.64940216, 0.74414127, np.nan, np.nan]]

print("------- original array -----")
for aa in a:
    print(aa)

# GET COLUMN MEANS: 
ta = np.array(a).T.tolist()                         # transpose the array; 
col_means = list(map(lambda x: np.nanmean(x), ta))  # get means; 
print("column means:", col_means)

# REPLACE NAN ENTRIES WITH COLUMN MEANS: 
nrows = len(a); ncols = len(a[0]) # get number of rows & columns; 
for r in range(nrows):
    for c in range(ncols):
        if np.isnan(a[r][c]):
            a[r][c] = col_means[c]

print("------- means added -----")
for aa in a:
    print(aa)

Output:

------- original array -----
[0.93230948, nan, 0.47773439, 0.76998063]
[0.94460779, 0.87882456, 0.79615838, 0.56282885]
[0.94272934, 0.48615268, 0.06196785, nan]
[0.64940216, 0.74414127, nan, nan]
[0.64940216, 0.74414127, nan, nan]

column means: [0.82369018599999999, 0.71331494500000003, 0.44528687333333333, 0.66640474000000005]

------- means added -----
[0.93230948, 0.71331494500000003, 0.47773439, 0.76998063]
[0.94460779, 0.87882456, 0.79615838, 0.56282885]
[0.94272934, 0.48615268, 0.06196785, 0.66640474000000005]
[0.64940216, 0.74414127, 0.44528687333333333, 0.66640474000000005]
[0.64940216, 0.74414127, 0.44528687333333333, 0.66640474000000005]

The for loops can also be written as a list comprehension:

new_a = [[col_means[c] if np.isnan(a[r][c]) else a[r][c] 
            for c in range(ncols) ]
        for r in range(nrows) ]