pandas: replace NaN with an empty list in a pandas DataFrame

Question

Asked by moku

I'm trying to replace some NaN values in my data with an empty list []. However, the replacement is stored as the str '[]' rather than a list, so len() returns 2 instead of 0. Is there any way to replace a NaN value with an actual empty list in pandas?

In [28]: d = pd.DataFrame({'x' : [[1,2,3], [1,2], np.nan, np.nan], 'y' : [1,2,3,4]})

In [29]: d
Out[29]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2        NaN  3
3        NaN  4

In [32]: d.x.replace(np.nan, '[]', inplace=True)

In [33]: d
Out[33]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [34]: d.x.apply(len)
Out[34]:
0    3
1    2
2    2
3    2
Name: x, dtype: int64

Answer 1

Answered by EdChum

This works using isnull and loc to mask the series:

In [90]:
d.loc[d.isnull()] = d.loc[d.isnull()].apply(lambda x: [])
d

Out[90]:
0    [1, 2, 3]
1       [1, 2]
2           []
3           []
dtype: object

In [91]:
d.apply(len)

Out[91]:
0    3
1    2
2    0
3    0
dtype: int64

You have to do this using apply so that the list object is not interpreted as an array to assign back to the df, which would try to align its shape with the original series.
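As a minimal sketch of the same idea (one new list per missing row), the variant below applies a function to the whole column instead of assigning through loc. It is not the answer's code: the isinstance check is an assumption that every non-list value in the column should become an empty list, which holds for the sample data from the question.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question.
d = pd.DataFrame({'x': [[1, 2, 3], [1, 2], np.nan, np.nan], 'y': [1, 2, 3, 4]})

# Keep values that are already lists; anything else (here, NaN) becomes a
# brand-new empty list. The lambda runs once per row, so no list is shared.
d['x'] = d['x'].apply(lambda v: v if isinstance(v, list) else [])

lengths = d['x'].apply(len).tolist()
print(lengths)  # [3, 2, 0, 0]
```

Because the whole column is rebuilt, this sidesteps the shape alignment that happens when a list is assigned through loc, at the cost of visiting every row rather than only the missing ones.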

EDIT


Using your updated sample, the following works:

In [100]:
d.loc[d['x'].isnull(),['x']] = d.loc[d['x'].isnull(),'x'].apply(lambda x: [])
d

Out[100]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [102]:    
d['x'].apply(len)

Out[102]:
0    3
1    2
2    0
3    0
Name: x, dtype: int64

Answer 2

Answered by ieaves

To extend the accepted answer: apply calls can be particularly expensive. The same task can be accomplished without apply by constructing a numpy array from scratch.

isna = df['x'].isna()
df.loc[isna, 'x'] = pd.Series([[]] * isna.sum()).values

A quick timing comparison:


def empty_assign_1(s):
    s.isna().apply(lambda x: [])

def empty_assign_2(s):
    pd.Series([[]] * s.isna().sum()).values

series = pd.Series(np.random.choice([1, 2, np.nan], 1000000))

%timeit empty_assign_1(series)
>>> 172 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit empty_assign_2(series)
>>> 19.5 ms ± 116 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Roughly 9 times faster (172 ms vs. 19.5 ms)!
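One caveat worth checking before relying on the [[]] * isna.sum() expression above: multiplying a list repeats references, so every filled row would hold the same list object, and an in-place append to one row would show up in the others. The sketch below illustrates the aliasing and one possible way around it. The names shared, filled, shared_is_aliased and rows_are_distinct are illustrative only, and the comprehension has not been benchmarked against either approach timed above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [[1, 2, 3], [1, 2], np.nan, np.nan], 'y': [1, 2, 3, 4]})
isna = df['x'].isna()

# Multiplying a list repeats references, so both slots hold the same list.
shared = [[]] * int(isna.sum())
shared_is_aliased = shared[0] is shared[1]  # True

# A comprehension builds a separate empty list for each missing row.
filled = [[] if missing else value for value, missing in zip(df['x'], isna)]
df['x'] = pd.Series(filled, index=df.index, dtype=object)

rows_are_distinct = df.at[2, 'x'] is not df.at[3, 'x']  # True
```

If the lists are only ever read (for example with len), the shared reference is harmless and the faster numpy construction is a reasonable choice; separate lists matter only when rows are later mutated in place.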
