Pythonic way for calculating length of lists in pandas dataframe column

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/41340341/

Time: 2020-08-20 00:50:31  Source: igfitidea

Tags: python, python-2.7, pandas

Asked by MYGz

I have a dataframe like this:

                                                    CreationDate
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]

I am calculating the length of the lists in the CreationDate column and creating a new Length column like this:

df['Length'] = df.CreationDate.apply(lambda x: len(x))

Which gives me this:

                                                    CreationDate  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4

Is there a more pythonic way to do this?

Answered by ayhan

You can use the str accessor for some list operations as well. In this example,

df['CreationDate'].str.len()

returns the length of each list. See the docs for str.len.

df['Length'] = df['CreationDate'].str.len()
df
Out: 
                                                    CreationDate  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4
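
For anyone who wants to try this without the original data, here is a minimal self-contained sketch. The frame is a reconstruction of the sample shown in the question; treating the row labels as a datetime index is an assumption, since the question does not say how the frame was built.

```python
import pandas as pd

# Assumed reconstruction of the sample frame: timestamps as the index,
# tag lists in the (oddly named) CreationDate column.
df = pd.DataFrame(
    {
        "CreationDate": [
            ["ubuntu", "mac-osx", "syslinux"],
            ["ubuntu", "mod-rewrite", "laconica", "apache-2.2"],
            ["ubuntu", "nat", "squid", "mikrotik"],
        ]
    },
    index=pd.to_datetime(
        ["2013-12-22 15:25:02", "2009-12-14 14:29:32", "2013-12-22 15:42:00"]
    ),
)

# The str accessor computes the length of each list element.
lengths_str = df["CreationDate"].str.len()

# Passing len directly does the same job as the lambda in the question.
lengths_apply = df["CreationDate"].apply(len)

df["Length"] = lengths_str
print(df["Length"].tolist())  # [3, 4, 4]
```

Both approaches give the same lengths here; they only start to differ once the column contains missing values.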

For these operations, vanilla Python is generally faster, although pandas handles NaNs for you. Here are the timings:

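import random
import string
import pandas as pd

# One million lists, each holding 1 to 20 random letters
ser = pd.Series([random.sample(string.ascii_letters, 
                               random.randint(1, 20)) for _ in range(10**6)])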

%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop

%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop

%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop

%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop
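
The NaN remark above can be illustrated with a small sketch. The three-element series is made up for illustration and is not part of the original answer; the guarded comprehension at the end is one possible workaround, not something the answer proposes.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing entry between two lists.
ser_nan = pd.Series([["a", "b"], np.nan, ["c"]])

# .str.len() passes the missing value through as NaN, so the result is float.
str_lengths = ser_nan.str.len()
print(str_lengths.tolist())  # [2.0, nan, 1.0]

# The plain-Python version calls len() on the float NaN and raises TypeError.
try:
    plain_lengths = [len(x) for x in ser_nan]
except TypeError as exc:
    plain_lengths = None
    print("list comprehension failed:", exc)

# A guarded comprehension keeps the loop but skips non-list entries.
guarded_lengths = [len(x) if isinstance(x, list) else np.nan for x in ser_nan]
```

So the faster list comprehension is a reasonable choice when the column is known to hold only lists, while str.len() is the safer default when missing values may be present.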