Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/41190852/

Time: 2020-09-14 02:38:22  Source: igfitidea

Most efficient way to forward-fill NaN values in numpy array

Tags: python, arrays, performance, pandas, numpy

Asked by Xukrao

Example Problem

As a simple example, consider the numpy array arr as defined below:

import numpy as np
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

where arr looks like this in console output:

array([[  5.,  nan,  nan,   7.,   2.],
       [  3.,  nan,   1.,   8.,  nan],
       [  4.,   9.,   6.,  nan,  nan]])

I would now like to row-wise 'forward-fill' the nan values in array arr. By that I mean replacing each nan value with the nearest valid value from the left. The desired result would look like this:

array([[  5.,   5.,   5.,  7.,  2.],
       [  3.,   3.,   1.,  8.,  8.],
       [  4.,   9.,   6.,  6.,  6.]])


Tried thus far

I've tried using for-loops:

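for row_idx in range(arr.shape[0]):
    # Start at column 1: column 0 has no left neighbour, and index -1 would wrap around to the last column
    for col_idx in range(1, arr.shape[1]):
        if np.isnan(arr[row_idx, col_idx]):
            arr[row_idx, col_idx] = arr[row_idx, col_idx - 1]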

I've also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):

import pandas as pd
df = pd.DataFrame(arr)
df = df.ffill(axis=1)   # fillna(method='ffill') is deprecated in recent pandas
arr = df.to_numpy()     # as_matrix() was removed in pandas 1.0

Both of the above strategies produce the desired result, but I keep on wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?

Summary

Is there another, more efficient way to 'forward-fill' nan values in numpy arrays? (e.g. by using numpy vectorized operations)



Update: Solutions Comparison

I've tried to time all solutions thus far. This was my setup script:

import numba as nb
import numpy as np
import pandas as pd

def random_array():
    choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
    out = np.random.choice(choices, size=(1000, 10))
    return out

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

@nb.jit
def numba_loops_fill(arr):
    '''Numba decorator solution provided by shx2.'''
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def pandas_fill(arr):
    df = pd.DataFrame(arr)
    # ffill() replaces the deprecated fillna(method='ffill')
    df = df.ffill(axis=1)
    # to_numpy() replaces as_matrix(), which was removed in pandas 1.0
    out = df.to_numpy()
    return out

def numpy_fill(arr):
    '''Solution provided by Divakar.'''
    mask = np.isnan(arr)
    idx = np.where(~mask,np.arange(mask.shape[1]),0)
    np.maximum.accumulate(idx,axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out

followed by this console input:

%timeit -n 1000 loops_fill(random_array())
%timeit -n 1000 numba_loops_fill(random_array())
%timeit -n 1000 pandas_fill(random_array())
%timeit -n 1000 numpy_fill(random_array())

resulting in this console output:

1000 loops, best of 3: 9.64 ms per loop
1000 loops, best of 3: 377 μs per loop
1000 loops, best of 3: 455 μs per loop
1000 loops, best of 3: 351 μs per loop
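
Note that the %timeit calls above also time random_array() on every iteration. The following is a minimal sketch of a benchmark outside IPython that generates the input once and times only the fill functions. It assumes only NumPy and the standard-library timeit module; the helper name time_fill and the repeat counts are hypothetical choices, not part of the original post, and the numbers it prints will differ from machine to machine.

```python
import timeit
import numpy as np

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def numpy_fill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]

def time_fill(func, arr, number=20):
    """Hypothetical helper: best per-call time in seconds over 3 repeats."""
    timings = timeit.repeat(lambda: func(arr), number=number, repeat=3)
    return min(timings) / number

# Build the test input once so its generation is not part of the timing.
rng = np.random.default_rng(0)
choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
sample = rng.choice(choices, size=(1000, 10))

# Both strategies should agree (NaNs compare equal here) before timing them.
assert np.allclose(loops_fill(sample), numpy_fill(sample), equal_nan=True)

for func in (loops_fill, numpy_fill):
    print(f"{func.__name__}: {time_fill(func, sample) * 1e6:.1f} µs per call")
```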

Answered by Divakar

Here's one approach -

mask = np.isnan(arr)
idx = np.where(~mask,np.arange(mask.shape[1]),0)
np.maximum.accumulate(idx,axis=1, out=idx)
out = arr[np.arange(idx.shape[0])[:,None], idx]

If you don't want to create another array and just want to fill the NaNs in arr itself, replace the last step with this -

arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]

Sample input, output -

In [179]: arr
Out[179]: 
array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
       [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
       [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])

In [180]: out
Out[180]: 
array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
       [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
       [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])
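
To see why this works, it can help to look at the intermediate index array. The sketch below replays the approach on the 3x5 array from the question; the variable names out_inplace and mask_copy are illustrative additions, everything else follows the answer's code. Each valid cell keeps its own column index, each NaN cell gets 0, and the running maximum then carries the index of the last valid column to the right.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

mask = np.isnan(arr)
# Column index where the value is valid, 0 where it is NaN:
# [[0 0 0 3 4]
#  [0 0 2 3 0]
#  [0 1 2 0 0]]
idx = np.where(~mask, np.arange(mask.shape[1]), 0)

# Running maximum along each row = index of the most recent valid column:
# [[0 0 0 3 4]
#  [0 0 2 3 3]
#  [0 1 2 2 2]]
np.maximum.accumulate(idx, axis=1, out=idx)

# Gather the values at those column indices, row by row.
out = arr[np.arange(idx.shape[0])[:, None], idx]
print(out)

# In-place variant from the answer, applied to a copy for comparison.
out_inplace = arr.copy()
mask_copy = np.isnan(out_inplace)
out_inplace[mask_copy] = out_inplace[np.nonzero(mask_copy)[0], idx[mask_copy]]
print(np.array_equal(out, out_inplace))
```

A row that starts with NaN keeps that NaN, because index 0 then points at the NaN itself.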

Answered by shx2

Use Numba. This should give a significant speedup:

import numba
@numba.jit
def loops_fill(arr):
    ...

Answered by cchwala

For those who came here looking for the backward-fill of NaN values, I modified the solution provided by Divakar above to do exactly that. The trick is that you have to do the accumulation on the reversed array, using the minimum instead of the maximum.

Here is the code:

import numpy as np

# As provided in the answer by Divakar
def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out

# My modification to do a backward-fill
def bfill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out


# Test both functions
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print('Array:')
print(arr)

print('\nffill')
print(ffill(arr))

print('\nbfill')
print(bfill(arr))

Output:

Array:
[[ 5. nan nan  7.  2.]
 [ 3. nan  1.  8. nan]
 [ 4.  9.  6. nan nan]]

ffill
[[5. 5. 5. 7. 2.]
 [3. 3. 1. 8. 8.]
 [4. 9. 6. 6. 6.]]

bfill
[[ 5.  7.  7.  7.  2.]
 [ 3.  1.  1.  8. nan]
 [ 4.  9.  6. nan nan]]

Edit: Updated according to a comment by MS_.
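
As a cross-check, a backward-fill can also be expressed as a forward-fill on the column-reversed array, flipped back afterwards. This is a sketch of that alternative, not the method used in this answer; the name bfill_via_flip is a hypothetical helper introduced only for illustration.

```python
import numpy as np

def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]

def bfill_via_flip(arr):
    """Hypothetical helper: reverse the columns, forward-fill, reverse back."""
    return ffill(arr[:, ::-1])[:, ::-1]

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

# Trailing NaNs have no valid value to their right, so they stay NaN.
print(bfill_via_flip(arr))
```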

Answered by christian_bock

For those who are interested in the problem of leading np.nan values remaining after forward-filling, the following works:

mask = np.isnan(arr)
first_valid_idx = (~mask).argmax(axis=1)  # index of the first non-NaN value in each row
# For each row, repeat the first valid value over the leading NaNs and keep the rest
arr = np.array([ np.hstack([
             [arr[i, first_valid]] * first_valid,
             arr[i, first_valid:]])
             for i, first_valid in enumerate(first_valid_idx) ])
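
The same idea can probably be done without a Python-level loop by reusing the index trick from Divakar's answer. The sketch below is an assumption-laden variant, not this answer's code: the function name ffill_with_leading is hypothetical, and it forward-fills and fills the leading NaNs in one pass by never letting the gather index fall below the first valid column of each row.

```python
import numpy as np

def ffill_with_leading(arr):
    """Hypothetical helper: forward-fill rows, then fill leading NaNs
    with the first valid value of the row."""
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    # Cells before the first valid value still point at column 0 (a NaN);
    # redirect them to the first valid column of their row.
    first_valid = (~mask).argmax(axis=1)
    idx = np.maximum(idx, first_valid[:, None])
    return arr[np.arange(arr.shape[0])[:, None], idx]

demo = np.array([[np.nan, np.nan, 1, np.nan, 2],
                 [3, np.nan, np.nan, 4, np.nan]])
print(ffill_with_leading(demo))
```

A row that contains only NaN is left unchanged, since it has no valid value to copy from.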

Answered by RichieV

I liked Divakar's answer on pure numpy. Here's a generalized function for n-dimensional arrays:

def np_ffill(arr, axis):
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim==i else np.newaxis
        for dim in range(len(arr.shape))])]
        for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]

AFAIK pandas can only work with two dimensions, despite having multi-index to make up for it. The only way to accomplish this in pandas would be to flatten the array into a DataFrame, unstack the desired level, restack, and finally reshape to the original shape. This unstacking/restacking/reshaping, with the pandas sorting involved, is just unnecessary overhead to achieve the same result.

Testing:

def random_array(shape):
    choices = [1, 2, 3, 4, np.nan]
    out = np.random.choice(choices, size=shape)
    return out

ra = random_array((2, 4, 8))
print('arr')
print(ra)
print('\nffull')
print(np_ffill(ra, 1))
raise SystemExit

Output:

arr

[[[ 3. nan  4.  1.  4.  2.  2.  3.]
  [ 2. nan  1.  3. nan  4.  4.  3.]
  [ 3.  2. nan  4. nan nan  3.  4.]
  [ 2.  2.  2. nan  1.  1. nan  2.]]

 [[ 2.  3.  2. nan  3.  3.  3.  3.]
  [ 3.  3.  1.  4.  1.  4.  1. nan]
  [ 4.  2. nan  4.  4.  3. nan  4.]
  [ 2.  4.  2.  1.  4.  1.  3. nan]]]

ffull
[[[ 3. nan  4.  1.  4.  2.  2.  3.]
  [ 2. nan  1.  3.  4.  4.  4.  3.]
  [ 3.  2.  1.  4.  4.  4.  3.  4.]
  [ 2.  2.  2.  4.  1.  1.  3.  2.]]

 [[ 2.  3.  2. nan  3.  3.  3.  3.]
  [ 3.  3.  1.  4.  1.  4.  1.  3.]
  [ 4.  2.  1.  4.  4.  3.  1.  4.]
  [ 2.  4.  2.  1.  4.  1.  3.  4.]]]
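
A shorter way to write the n-dimensional gather is np.take_along_axis, which is available in NumPy 1.15 and later. The sketch below is an alternative formulation under that assumption, not RichieV's code; the name ffill_along_axis is hypothetical.

```python
import numpy as np

def ffill_along_axis(arr, axis):
    """Hypothetical helper: forward-fill NaNs along `axis` of an n-d array."""
    # Positions 0..n-1 laid out along `axis`, broadcastable against arr.
    shape = [1] * arr.ndim
    shape[axis] = arr.shape[axis]
    positions = np.arange(arr.shape[axis]).reshape(shape)
    # Own position where valid, 0 where NaN; the running maximum then
    # yields the position of the most recent valid value.
    idx = np.where(~np.isnan(arr), positions, 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    return np.take_along_axis(arr, idx, axis=axis)

cube = np.array([[[1, np.nan], [np.nan, 2]],
                 [[np.nan, 3], [4, np.nan]]])
print(ffill_along_axis(cube, 1))
print(ffill_along_axis(cube, 2))
```

As with the 2-D version, a NaN at the start of the chosen axis has nothing before it and stays NaN.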