Pythonic way to compute the length of lists in a pandas DataFrame column

Question

Asked by MYGz

I have a dataframe like this:

                                                    CreationDate
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]

I am calculating the length of the lists in the CreationDate column and storing it in a new Length column like this:

df['Length'] = df.CreationDate.apply(lambda x: len(x))

Which gives me this:

                                                    CreationDate  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4

Is there a more pythonic way to do this?

Answer 1

Answered by ayhan

You can use the str accessor for some list operations as well. In this example,

df['CreationDate'].str.len()

returns the length of each list. See the docs for str.len.

df['Length'] = df['CreationDate'].str.len()
df
Out: 
                                                    CreationDate  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4
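
For anyone who wants to try this without the original data, here is a minimal sketch that rebuilds the three-row frame from the question and checks that str.len() agrees with the apply-based version. The way the frame is constructed here is an assumption, since the question does not show how the data was loaded; the variable name via_apply is only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question. How the original data
# was loaded is not shown, so this construction is an assumption.
df = pd.DataFrame(
    {
        "CreationDate": [
            ["ubuntu", "mac-osx", "syslinux"],
            ["ubuntu", "mod-rewrite", "laconica", "apache-2.2"],
            ["ubuntu", "nat", "squid", "mikrotik"],
        ]
    },
    index=pd.to_datetime(
        ["2013-12-22 15:25:02", "2009-12-14 14:29:32", "2013-12-22 15:42:00"]
    ),
)

# The .str accessor works element-wise on list values too.
df["Length"] = df["CreationDate"].str.len()

# The apply-based version from the question, with len passed directly
# instead of being wrapped in a lambda.
via_apply = df["CreationDate"].apply(len)

print(df["Length"].tolist())  # [3, 4, 4]
print(via_apply.tolist())     # [3, 4, 4]
```

Passing len directly to apply behaves the same as lambda x: len(x) and avoids the extra function call per row.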

For these operations, vanilla Python is generally faster. The pandas str.len() method does handle NaNs, though: it returns NaN for a missing value, whereas calling len() on a NaN raises a TypeError. Here are timings:

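import random
import string
import pandas as pd

# One million lists, each holding 1 to 20 distinct random letters
ser = pd.Series([random.sample(string.ascii_letters,
                               random.randint(1, 20)) for _ in range(10**6)])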

%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop

%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop

%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop

%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop
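
The point about NaNs can be illustrated with a small sketch. The series below is hypothetical (two lists and one missing value), and the guarded comprehension at the end is just one possible workaround, not something from the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical series: two lists and one missing value.
ser = pd.Series([["a", "b"], np.nan, ["c"]])

# str.len() propagates the missing value instead of raising.
lengths = ser.str.len()
print(lengths.tolist())  # [2.0, nan, 1.0]

# A plain comprehension calls len() on the float NaN and fails.
try:
    [len(x) for x in ser]
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# One possible workaround: only call len() on actual lists.
guarded = [len(x) if isinstance(x, list) else np.nan for x in ser]
```

Note that the result of str.len() becomes a float column once a NaN is present, so the speed advantage of the list comprehension comes at the cost of handling missing values yourself.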
