Pythonic way for calculating length of lists in pandas dataframe column
Original question: http://stackoverflow.com/questions/41340341/
This content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and keep the same license.
Asked by MYGz
I have a dataframe like this:
                     CreationDate
2013-12-22 15:25:02  [ubuntu, mac-osx, syslinux]
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]
2013-12-22 15:42:00  [ubuntu, nat, squid, mikrotik]
I am calculating the length of the lists in the CreationDate column and creating a new Length column like this:
df['Length'] = df.CreationDate.apply(lambda x: len(x))
Which gives me this:
                     CreationDate                                 Length
2013-12-22 15:25:02  [ubuntu, mac-osx, syslinux]                  3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]  4
2013-12-22 15:42:00  [ubuntu, nat, squid, mikrotik]               4
Is there a more pythonic way to do this?
Answered by ayhan
You can use the str accessor for some list operations as well. In this example, df['CreationDate'].str.len() returns the length of each list. See the docs for str.len.
df['Length'] = df['CreationDate'].str.len()
df
Out:
                     CreationDate                                 Length
2013-12-22 15:25:02  [ubuntu, mac-osx, syslinux]                  3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]  4
2013-12-22 15:42:00  [ubuntu, nat, squid, mikrotik]               4
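For anyone who wants to try this without the original data, here is a minimal self-contained sketch. It rebuilds a small frame from the three rows shown in the question (the construction itself is an assumption, since the question does not show how df was created) and applies the str accessor to it.

```python
import pandas as pd

# Rebuild the sample frame from the question. How the original df was
# created is not shown, so this constructor is only an assumption.
df = pd.DataFrame(
    {
        "CreationDate": [
            ["ubuntu", "mac-osx", "syslinux"],
            ["ubuntu", "mod-rewrite", "laconica", "apache-2.2"],
            ["ubuntu", "nat", "squid", "mikrotik"],
        ]
    },
    index=pd.to_datetime(
        ["2013-12-22 15:25:02", "2009-12-14 14:29:32", "2013-12-22 15:42:00"]
    ),
)

# The str accessor also works element-wise on list values.
df["Length"] = df["CreationDate"].str.len()

# Plain Python list of the computed lengths, handy for inspection.
lengths = df["Length"].tolist()
print(lengths)
```

The apply(lambda x: len(x)) version from the question should produce the same column on this data. It can also be written as apply(len), since the lambda only forwards its argument.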
For these operations, vanilla Python is generally faster, although pandas handles NaNs. Here are timings:
import random
import string
import pandas as pd

# One million lists of 1 to 20 random letters each
ser = pd.Series([random.sample(string.ascii_letters,
                               random.randint(1, 20)) for _ in range(10**6)])
%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop
%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop
%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop
%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop
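The remark that pandas handles NaNs can be illustrated with a small sketch. The three-element series below is made up for illustration and is not part of the original answer. With a missing value in the column, str.len() leaves a NaN in place, while the plain list comprehension fails because a float NaN has no length.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing entry (not from the original answer).
ser = pd.Series([["a", "b"], np.nan, ["c"]])

# pandas propagates the missing value: the result holds NaN at position 1.
pandas_lengths = ser.str.len()

# The plain-Python version calls len() on the float NaN and raises TypeError.
try:
    plain_lengths = [len(x) for x in ser]
except TypeError:
    plain_lengths = None

print(pandas_lengths)
print(plain_lengths)
```

If the column can contain missing values, the faster list comprehension would need an explicit check per element, which narrows the speed gap somewhat.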