Python numpy array: replace nan values with the average of columns
Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/18689235/
numpy array: replace nan values with average of columns
Asked by piokuc
I've got a numpy array filled mostly with real numbers, but there are a few nan values in it as well.
How can I replace the nans with the averages of the columns they are in?
Accepted answer by Daniel
No loops required:
import numpy as np

a = np.array([[0.93230948, np.nan, 0.47773439, 0.76998063],
              [0.94460779, 0.87882456, 0.79615838, 0.56282885],
              [0.94272934, 0.48615268, 0.06196785, np.nan],
              [0.64940216, 0.74414127, np.nan, np.nan]])
print(a)
# [[ 0.93230948 nan 0.47773439 0.76998063]
#  [ 0.94460779 0.87882456 0.79615838 0.56282885]
#  [ 0.94272934 0.48615268 0.06196785 nan]
#  [ 0.64940216 0.74414127 nan nan]]

# Obtain the mean of each column; nanmean ignores the nan entries.
col_mean = np.nanmean(a, axis=0)
print(col_mean)
# [ 0.86726219 0.7030395 0.44528687 0.66640474]

# Find the (row, column) indices that need to be replaced.
inds = np.where(np.isnan(a))

# Place the column means at those indices. take() picks the mean for each column index.
a[inds] = np.take(col_mean, inds[1])
print(a)
# [[ 0.93230948 0.7030395 0.47773439 0.76998063]
#  [ 0.94460779 0.87882456 0.79615838 0.56282885]
#  [ 0.94272934 0.48615268 0.06196785 0.66640474]
#  [ 0.64940216 0.74414127 0.44528687 0.66640474]]
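The steps above can be wrapped in a small helper. This is only a sketch: the function name `fill_nan_with_column_mean` is made up for illustration, and it assumes a 2-D array in which every column has at least one non-nan value. Unlike the snippet above, it works on a copy instead of modifying the input in place.

```python
import numpy as np

def fill_nan_with_column_mean(arr):
    """Return a float copy of a 2-D array with NaNs replaced by column means."""
    out = np.array(arr, dtype=float)        # copy, so the caller's array is untouched
    col_mean = np.nanmean(out, axis=0)      # per-column mean, ignoring NaNs
    rows, cols = np.where(np.isnan(out))    # positions of the NaNs
    out[rows, cols] = col_mean[cols]        # equivalent to np.take(col_mean, cols)
    return out

a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 9.0, 9.0]])
filled = fill_nan_with_column_mean(a)
print(filled)
```

Plain fancy indexing (`col_mean[cols]`) is used here in place of `np.take`; for a 1-D array the two select the same elements.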
Answer by Hammer
This isn't very clean, but I can't think of a way to do it other than iterating:
# example
import numpy as np

a = np.arange(16, dtype=float).reshape(4, 4)
a[2, 2] = np.nan
a[3, 3] = np.nan
indices = np.where(np.isnan(a))  # returns arrays of row and column indices
for row, col in zip(*indices):
    # mean of the non-nan values in this column
    a[row, col] = np.mean(a[~np.isnan(a[:, col]), col])
Answer by ifryed
You might want to try this built-in function:
x = np.array([np.inf, -np.inf, np.nan, -128, 128])
np.nan_to_num(x)
array([ 1.79769313e+308, -1.79769313e+308, 0.00000000e+000,
-1.28000000e+002, 1.28000000e+002])
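As the output above shows, `np.nan_to_num` turns nan into 0 and the infinities into very large finite numbers, so by itself it does not produce column means. A small side-by-side sketch (the array `x` and the variable names are made up for illustration) shows the difference from the `nanmean` approach:

```python
import numpy as np

x = np.array([[1.0, np.nan],
              [3.0, 4.0]])

# nan_to_num: every NaN becomes 0.0
zero_filled = np.nan_to_num(x)

# column-mean replacement: every NaN becomes the mean of its column
mean_filled = np.where(np.isnan(x), np.nanmean(x, axis=0), x)

print(zero_filled)
print(mean_filled)
```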
Answer by Ulf Aslak
Alternative: replacing NaNs by interpolating along each column.
def interpolate_nans(X):
    """Overwrite NaNs with column value interpolations."""
    for j in range(X.shape[1]):
        mask_j = np.isnan(X[:, j])
        X[mask_j, j] = np.interp(np.flatnonzero(mask_j), np.flatnonzero(~mask_j), X[~mask_j, j])
    return X
Example use:
X_incomplete = np.array([[10, 20, 30 ],
[np.nan, 30, np.nan],
[np.nan, np.nan, 50 ],
[40, 50, np.nan ]])
X_complete = interpolate_nans(X_incomplete)
print(X_complete)
[[10, 20, 30 ],
[20, 30, 40 ],
[30, 40, 50 ],
[40, 50, 50 ]]
I use this bit of code for time series data in particular, where columns are attributes and rows are time-ordered samples.
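One thing to watch for: `np.interp` raises an error when a column has no valid values to interpolate from. A possible variant, sketched here under the assumption that all-NaN columns should simply be left alone, skips such columns and works on a copy. The name `interpolate_nans_safe` is hypothetical.

```python
import numpy as np

def interpolate_nans_safe(X):
    """Return a copy of X with NaNs linearly interpolated down each column.

    Columns with no valid values are left unchanged (an assumed policy).
    """
    out = np.array(X, dtype=float)
    for j in range(out.shape[1]):
        mask = np.isnan(out[:, j])
        if mask.all() or not mask.any():
            continue  # nothing to interpolate from, or nothing to fill
        out[mask, j] = np.interp(np.flatnonzero(mask),
                                 np.flatnonzero(~mask),
                                 out[~mask, j])
    return out

X = np.array([[10.0, np.nan],
              [np.nan, np.nan],
              [30.0, np.nan]])
result = interpolate_nans_safe(X)
print(result)
```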
Answer by Donald Hobson
If partial is your original data, and replace is an array of the same shape containing averaged values, then this code will use the value from partial wherever one exists, and the value from replace otherwise.
Complete = np.where(np.isnan(partial), replace, partial)
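The answer does not say how `replace` is built. One way to do it, shown here purely as an assumption, is to broadcast the column means from `np.nanmean` to the shape of `partial`:

```python
import numpy as np

partial = np.array([[1.0, np.nan],
                    [3.0, 8.0],
                    [np.nan, 4.0]])

# Assumed construction of `replace`: every row holds the column means.
replace = np.broadcast_to(np.nanmean(partial, axis=0), partial.shape)

# Keep the value from `partial` where it exists, otherwise use `replace`.
complete = np.where(np.isnan(partial), replace, partial)
print(complete)
```

Since `np.where` broadcasts its arguments, passing the 1-D result of `np.nanmean` directly would also work here; the explicit `broadcast_to` just makes the "same shape" requirement visible.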
Answer by LetsPlayYahtzee
To extend Donald's answer, here is a minimal example. Let's say a is an ndarray and we want to replace its zero values with the mean of the column.
In [231]: a
Out[231]:
array([[0, 3, 6],
[2, 0, 0]])
In [232]: col_mean = np.nanmean(a, axis=0)
Out[232]: array([ 1. , 1.5, 3. ])
In [228]: np.where(np.equal(a, 0), col_mean, a)
Out[228]:
array([[ 1. , 3. , 6. ],
[ 2. , 1.5, 3. ]])
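Note that the column means in the session above (1, 1.5, 3) include the zeros themselves. If the zeros stand for missing values and should be left out of the average, one option (a sketch, not part of the original answer) is to turn them into nan first:

```python
import numpy as np

a = np.array([[0, 3, 6],
              [2, 0, 0]])

as_float = a.astype(float)                           # NaN needs a float array
masked = np.where(as_float == 0, np.nan, as_float)   # treat zeros as missing
col_mean_nonzero = np.nanmean(masked, axis=0)        # mean of the non-zero entries only
result = np.where(as_float == 0, col_mean_nonzero, as_float)
print(result)
```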
Answer by Praveen
Using masked arrays
The standard way to do this using only numpy would be to use the masked array module.
SciPy is a pretty heavy package which relies on external libraries, so it's worth having a numpy-only method. This borrows from @DonaldHobson's answer.
Edit: np.nanmean is now a numpy function. However, it doesn't handle all-nan columns...
Suppose you have an array a:
>>> a
array([[ 0., nan, 10., nan],
[ 1., 6., nan, nan],
[ 2., 7., 12., nan],
[ 3., 8., nan, nan],
[ nan, 9., 14., nan]])
>>> import numpy.ma as ma
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=0), a)
array([[ 0. , 7.5, 10. , 0. ],
[ 1. , 6. , 12. , 0. ],
[ 2. , 7. , 12. , 0. ],
[ 3. , 8. , 12. , 0. ],
[ 1.5, 9. , 14. , 0. ]])
Note that the masked array's mean does not need to be the same shape as a, because we're taking advantage of the implicit broadcasting over rows.
Also note how the all-nan column is nicely handled: it is filled with zero, since you're taking the mean of zero elements. The method using nanmean doesn't handle all-nan columns:
>>> col_mean = np.nanmean(a, axis=0)
/home/praveen/.virtualenvs/numpy3-mkl/lib/python3.4/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
warnings.warn("Mean of empty slice", RuntimeWarning)
>>> inds = np.where(np.isnan(a))
>>> a[inds] = np.take(col_mean, inds[1])
>>> a
array([[ 0. , 7.5, 10. , nan],
[ 1. , 6. , 12. , nan],
[ 2. , 7. , 12. , nan],
[ 3. , 8. , 12. , nan],
[ 1.5, 9. , 14. , nan]])
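If you would rather stay with `np.nanmean`, one way around the all-nan column is to silence the warning and substitute a fallback for any column mean that comes out as nan. This is a sketch only: the function name `fill_nan_with_fallback` and the `fallback` parameter are made up for illustration, and 0.0 is just an assumed default.

```python
import warnings
import numpy as np

def fill_nan_with_fallback(arr, fallback=0.0):
    """Replace NaNs by column means; all-NaN columns get `fallback` instead."""
    out = np.array(arr, dtype=float)
    with warnings.catch_warnings():
        # nanmean warns ("Mean of empty slice") on all-NaN columns
        warnings.simplefilter("ignore", category=RuntimeWarning)
        col_mean = np.nanmean(out, axis=0)
    col_mean = np.where(np.isnan(col_mean), fallback, col_mean)
    return np.where(np.isnan(out), col_mean, out)

a = np.array([[0.0, np.nan, np.nan],
              [2.0, 6.0, np.nan]])
filled = fill_nan_with_fallback(a)
print(filled)
```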
Explanation
Converting a into a masked array gives you
>>> ma.array(a, mask=np.isnan(a))
masked_array(data =
[[0.0 -- 10.0 --]
[1.0 6.0 -- --]
[2.0 7.0 12.0 --]
[3.0 8.0 -- --]
[-- 9.0 14.0 --]],
mask =
[[False True False True]
[False False True True]
[False False False True]
[False False True True]
[ True False False True]],
fill_value = 1e+20)
And taking the mean over columns gives you the correct answer, normalizing only over the non-masked values:
>>> ma.array(a, mask=np.isnan(a)).mean(axis=0)
masked_array(data = [1.5 7.5 12.0 --],
mask = [False False False True],
fill_value = 1e+20)
Further, note how the mask nicely handles the column which is all-nan!
Finally, np.where does the job of replacement.
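The same idea can be written a little more explicitly with `ma.masked_invalid`, which masks nan (and inf) entries, and `filled`, which lets you choose what an all-masked column should become. The 0.0 used below is an assumption chosen to match the result shown earlier, not something the masked array module prescribes.

```python
import numpy as np
import numpy.ma as ma

a = np.array([[0.0, np.nan, np.nan],
              [2.0, 6.0, np.nan]])

masked = ma.masked_invalid(a)                # mask the NaN (and inf) entries
col_mean = masked.mean(axis=0).filled(0.0)   # all-masked columns get the assumed fallback 0.0
result = np.where(np.isnan(a), col_mean, a)
print(result)
```

Calling `filled` makes the fallback explicit instead of relying on whatever the masked result happens to hold in the masked slot.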
Row-wise mean
Replacing nan values with the row-wise mean instead of the column-wise mean requires a tiny change for broadcasting to work:
>>> a
array([[ 0., 1., 2., 3., nan],
[ nan, 6., 7., 8., 9.],
[ 10., nan, 12., nan, 14.],
[ nan, nan, nan, nan, nan]])
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1), a)
ValueError: operands could not be broadcast together with shapes (4,5) (4,) (4,5)
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1)[:, np.newaxis], a)
array([[ 0. , 1. , 2. , 3. , 1.5],
[ 7.5, 6. , 7. , 8. , 9. ],
[ 10. , 12. , 12. , 12. , 14. ],
[ 0. , 0. , 0. , 0. , 0. ]])
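For comparison, the row-wise version can also be sketched with `np.nanmean` and `keepdims=True`, which keeps the reduced axis so the result has shape (rows, 1) and broadcasts without the `[:, np.newaxis]` step. The small array here is made up for illustration and has no all-nan row, since `nanmean` would return nan for one.

```python
import numpy as np

a = np.array([[0.0, 1.0, 2.0, np.nan],
              [np.nan, 6.0, 7.0, 8.0]])

# keepdims=True gives shape (2, 1), which broadcasts across the columns
row_mean = np.nanmean(a, axis=1, keepdims=True)
result = np.where(np.isnan(a), row_mean, a)
print(result)
```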
Answer by rnso
Using simple functions with loops:
import numpy as np

a = [[0.93230948, np.nan, 0.47773439, 0.76998063],
     [0.94460779, 0.87882456, 0.79615838, 0.56282885],
     [0.94272934, 0.48615268, 0.06196785, np.nan],
     [0.64940216, 0.74414127, np.nan, np.nan],
     [0.64940216, 0.74414127, np.nan, np.nan]]
print("------- original array -----")
for aa in a:
    print(aa)
# GET COLUMN MEANS:
ta = np.array(a).T.tolist()  # transpose the array;
col_means = list(map(lambda x: np.nanmean(x), ta))  # get means;
print("column means:", col_means)
# REPLACE NAN ENTRIES WITH COLUMN MEANS:
nrows = len(a); ncols = len(a[0])  # get number of rows & columns;
for r in range(nrows):
    for c in range(ncols):
        if np.isnan(a[r][c]):
            a[r][c] = col_means[c]
print("------- means added -----")
for aa in a:
    print(aa)
Output:
------- original array -----
[0.93230948, nan, 0.47773439, 0.76998063]
[0.94460779, 0.87882456, 0.79615838, 0.56282885]
[0.94272934, 0.48615268, 0.06196785, nan]
[0.64940216, 0.74414127, nan, nan]
[0.64940216, 0.74414127, nan, nan]
column means: [0.82369018599999999, 0.71331494500000003, 0.44528687333333333, 0.66640474000000005]
------- means added -----
[0.93230948, 0.71331494500000003, 0.47773439, 0.76998063]
[0.94460779, 0.87882456, 0.79615838, 0.56282885]
[0.94272934, 0.48615268, 0.06196785, 0.66640474000000005]
[0.64940216, 0.74414127, 0.44528687333333333, 0.66640474000000005]
[0.64940216, 0.74414127, 0.44528687333333333, 0.66640474000000005]
The for loops can also be written as a list comprehension:
new_a = [[col_means[c] if np.isnan(a[r][c]) else a[r][c]
for c in range(ncols) ]
for r in range(nrows) ]
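If the data really is a list of lists, the same loop-based idea can be sketched without numpy at all, using `math.isnan` and `zip` instead of index arithmetic. The small input here is made up, and the sketch assumes every column has at least one non-nan value (otherwise the division fails).

```python
import math

a = [[1.0, float("nan")],
     [3.0, 4.0]]

# zip(*a) transposes the rows into columns
col_means = []
for col in zip(*a):
    valid = [v for v in col if not math.isnan(v)]
    col_means.append(sum(valid) / len(valid))  # assumes at least one valid value

# pair each value with the mean of its column and pick whichever applies
new_a = [[m if math.isnan(v) else v for v, m in zip(row, col_means)]
         for row in a]
print(new_a)
```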