pandas 用熊猫数据框中的空列表替换 NaN

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/31567218/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 23:39:44  来源:igfitidea点击:

Replace NaN with empty list in a pandas dataframe

pythonpandasdataframe

提问by moku

I'm trying to replace some NaN values in my data with an empty list []. However the list is represented as a str and doesn't allow me to properly apply the len() function. is there anyway to replace a NaN value with an actual empty list in pandas?

我正在尝试用空列表 [] 替换数据中的一些 NaN 值。但是,列表表示为 str 并且不允许我正确应用 len() 函数。有没有用Pandas中的实际空列表替换 NaN 值?

In [28]: d = pd.DataFrame({'x' : [[1,2,3], [1,2], np.NaN, np.NaN], 'y' : [1,2,3,4]})

In [29]: d
Out[29]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2        NaN  3
3        NaN  4

In [32]: d.x.replace(np.NaN, '[]', inplace=True)

In [33]: d
Out[33]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [34]: d.x.apply(len)
Out[34]:
0    3
1    2
2    2
3    2
Name: x, dtype: int64

回答by EdChum

This works using isnulland locto mask the series:

这可以使用isnullloc来掩盖系列:

In [90]:
d.loc[d.isnull()] = d.loc[d.isnull()].apply(lambda x: [])
d

Out[90]:
0    [1, 2, 3]
1       [1, 2]
2           []
3           []
dtype: object

In [91]:
d.apply(len)

Out[91]:
0    3
1    2
2    0
3    0
dtype: int64

You have to do this using applyin order for the list object to not be interpreted as an array to assign back to the df which will try to align the shape back to the original series

您必须这样做apply,以便列表对象不会被解释为要分配回 df 的数组,该 df 将尝试将形状与原始系列对齐

EDIT

编辑

Using your updated sample the following works:

使用您更新的示例进行以下工作:

In [100]:
d.loc[d['x'].isnull(),['x']] = d.loc[d['x'].isnull(),'x'].apply(lambda x: [])
d

Out[100]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [102]:    
d['x'].apply(len)

Out[102]:
0    3
1    2
2    0
3    0
Name: x, dtype: int64

回答by ieaves

To extend the accepted answer, apply calls can be particularly expensive - the same task can be accomplished without it by constructing a numpy array from scratch.

为了扩展公认的答案,apply 调用可能特别昂贵 - 通过从头开始构造一个 numpy 数组,可以在没有它的情况下完成相同的任务。

isna = df['x'].isna()
df.loc[isna, 'x'] = pd.Series([[]] * isna.sum()).values

A quick timing comparison:

快速时序比较:

def empty_assign_1(s):
    s.isna().apply(lambda x: [])

def empty_assign_2(s):
    pd.Series([[]] * s.isna().sum()).values

series = pd.Series(np.random.choice([1, 2, np.nan], 1000000))

%timeit empty_assign_1(series)
>>> 172 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit empty_assign_2(series)
>>> 19.5 ms ± 116 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Nearly 10 times faster!

快了近 10 倍!