Notice: this page reproduces a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/37060385/

Time: 2020-09-14 01:11:00  Source: igfitidea

Pandas Apply lambda function null values

Tags: python, pandas

Asked by flyingmeatball

I'm trying to split a column in two, but I know there are null values in my data. Imagine this dataframe:

df = pd.DataFrame(['fruit: apple','vegetable: asparagus',None, 'fruit: pear'], columns = ['text'])

df

                   text
0          fruit: apple
1  vegetable: asparagus
2                   None
3           fruit: pear

I'd like to split this into multiple columns like so:

df['cat'] = df['text'].apply(lambda x: 'unknown' if x == None else x.split(': ')[0])
df['value'] = df['text'].apply(lambda x: 'unknown' if x == None else x.split(': ')[1])

print(df)

                   text        cat      value
0          fruit: apple      fruit      apple
1  vegetable: asparagus  vegetable  asparagus
2                  None    unknown    unknown
3           fruit: pear      fruit       pear

However, if I have the following df instead:

df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], columns = ['text'])

splitting results in the following error:

df['cat'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[0])

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-159-8e5bca809635> in <module>()
      1 df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], columns = ['text'])
      2 #df.columns = ['col_name']
----> 3 df['cat'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[0])
      4 df['value'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[1])

C:\Python27\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2158             values = lib.map_infer(values, lib.Timestamp)
   2159 
-> 2160         mapped = lib.map_infer(values, f, convert=convert_dtype)
   2161         if len(mapped) and isinstance(mapped[0], Series):
   2162             from pandas.core.frame import DataFrame

pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:62187)()

<ipython-input-159-8e5bca809635> in <lambda>(x)
      1 df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], columns = ['text'])
      2 #df.columns = ['col_name']
----> 3 df['cat'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[0])
      4 df['value'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[1])

AttributeError: 'float' object has no attribute 'split'

How do I do the same split with NaN values? More generally, is there a better way to apply a split function that ignores null values?

Imagine this weren't a string example, and I had the following instead:

df = pd.DataFrame([2,4,6,8,10,np.nan,12], columns = ['numerics'])
df['numerics'].apply(lambda x: np.nan if pd.isnull(x) else x/2.0)

I feel like Series.apply should almost take an argument that instructs it to skip null rows and just output them as nulls. I haven't found a better generic way to transform a series without having to manually avoid nulls.

Answered by unutbu

Instead of apply with a custom function, you could use the Series.str.extract method:

import numpy as np
import pandas as pd
# df = pd.DataFrame(['fruit: apple','vegetable: asparagus',None, 'fruit: pear'], 
#                   columns = ['text'])
df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], 
                  columns = ['text'])
df[['cat', 'value']] = df['text'].str.extract(r'([^:]+):?(.*)', expand=True).fillna('unknown')
print(df)

yields

                   text        cat       value
0          fruit: apple      fruit       apple
1  vegetable: asparagus  vegetable   asparagus
2                   NaN    unknown     unknown
3           fruit: pear      fruit        pear
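
One detail worth noting: with the pattern above, the second group appears to keep the space that follows the colon, so the values come out as ' apple' rather than 'apple'. If that matters, a possible alternative (a sketch, not part of the original answer) is Series.str.split with expand=True, splitting once on the literal ': ' separator. The variable name parts is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['fruit: apple', 'vegetable: asparagus', np.nan, 'fruit: pear'],
                  columns=['text'])

# Split at most once on ': '; a NaN row stays missing in every resulting column.
parts = df['text'].str.split(pat=': ', n=1, expand=True)

# Fill the missing pieces with the same placeholder the question uses.
parts = parts.fillna('unknown')
df['cat'] = parts[0]
df['value'] = parts[1]
print(df)
```

Like str.extract, this stays within the vectorized string methods, so null rows need no special handling in user code.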


apply with a custom function is generally slower than equivalent code that uses vectorized methods such as Series.str.extract. Under the hood, apply (with an unvectorizable function) essentially calls the custom function in a Python for-loop.
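
Because the function is called once per element, the NaN entry reaches the lambda as a plain float. The check x == np.nan in the question never matches, since NaN does not compare equal to anything, including itself, so the lambda falls through to x.split and raises the AttributeError. If you do want to keep apply, a minimal sketch of a fix is to test with pd.isnull instead, which is true for both None and NaN:

```python
import numpy as np
import pandas as pd

# NaN never compares equal, even to itself, so `x == np.nan` is always False.
print(np.nan == np.nan)   # False
print(pd.isnull(np.nan))  # True

df = pd.DataFrame(['fruit: apple', 'vegetable: asparagus', np.nan, 'fruit: pear'],
                  columns=['text'])

# pd.isnull catches the missing row, so .split is only called on real strings.
df['cat'] = df['text'].apply(lambda x: 'unknown' if pd.isnull(x) else x.split(': ')[0])
df['value'] = df['text'].apply(lambda x: 'unknown' if pd.isnull(x) else x.split(': ')[1])
print(df)
```

This still runs element by element, so the vectorized string methods remain the faster choice on large data.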

Regarding the edited question: If you have

df = pd.DataFrame([2,4,6,8,10,np.nan,12], columns = ['numerics'])

then use

In [207]: df['numerics']/2
Out[207]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    NaN
6    6.0
Name: numerics, dtype: float64

instead of

df['numerics'].apply(lambda x: np.nan if pd.isnull(x) else x/2.0)

Again, vectorized arithmetic beats apply with a custom function, here by roughly a factor of nine:

In [210]: df = pd.concat([df]*100, ignore_index=True)

In [211]: %timeit df['numerics']/2
10000 loops, best of 3: 93.8 μs per loop

In [212]: %timeit df['numerics'].apply(lambda x: np.nan if pd.isnull(x) else x/2.0)
1000 loops, best of 3: 836 μs per loop
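
For transformations that have no vectorized equivalent, the "skip the nulls" switch the question wishes for does exist on Series.map rather than Series.apply: passing na_action='ignore' propagates missing values without handing them to the function. The following is a small sketch of that option, not part of the original answer; the names cat and halved are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['fruit: apple', 'vegetable: asparagus', np.nan, 'fruit: pear'],
                  columns=['text'])

# na_action='ignore' leaves missing entries as-is instead of calling the function on them.
cat = df['text'].map(lambda x: x.split(': ')[0], na_action='ignore')
print(cat)

# The same switch works for the numeric example, with no manual null check.
halved = pd.Series([2, 4, 6, 8, 10, np.nan, 12]).map(lambda x: x / 2.0, na_action='ignore')
print(halved)
```

The function is still called once per non-null element, so this is a convenience rather than a speed-up; chain .fillna('unknown') afterwards if a placeholder is wanted.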