"Reduce" function for a pandas Series
Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/35004945/
Asked by hlin117
Is there an analog of reduce for a pandas Series?
For example, the analog of map is pd.Series.apply, but I can't find any analog of reduce.
My application: I have a pandas Series of lists:
>>> business["categories"].head()
0 ['Doctors', 'Health & Medical']
1 ['Nightlife']
2 ['Active Life', 'Mini Golf', 'Golf']
3 ['Shopping', 'Home Services', 'Internet Servic...
4 ['Bars', 'American (New)', 'Nightlife', 'Loung...
Name: categories, dtype: object
I'd like to merge the Series of lists together using reduce, like so:
categories = reduce(lambda l1, l2: l1 + l2, categories)
but this takes a horrific amount of time because merging two lists together is O(n) in Python. I'm hoping that pd.Series has a vectorized way to perform this faster.
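For context, the slowness comes from each l1 + l2 allocating a fresh list, so the total work grows quadratically with the number of elements. A minimal reproduction, using a synthetic Series since the business data is not shown:

```python
from functools import reduce

import pandas as pd

# Synthetic stand-in for business["categories"]: 2000 small lists.
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# Each step copies the accumulated list, so total work is O(n^2).
merged = reduce(lambda l1, l2: l1 + l2, categories)
print(len(merged))  # 1000 * 2 + 1000 * 3 = 5000 elements
```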
Accepted answer by Mike Müller
Use itertools.chain() on the values
This could be faster:
from itertools import chain
categories = list(chain.from_iterable(categories.values))
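As a side note (my addition, not part of the original answer): chain.from_iterable also accepts the Series directly, since iterating a Series yields its values; .values just skips the Series iteration machinery. A quick equivalence check on the same synthetic data:

```python
from itertools import chain

import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# Both forms flatten the Series of lists into one list.
flat_from_values = list(chain.from_iterable(categories.values))
flat_from_series = list(chain.from_iterable(categories))
assert flat_from_values == flat_from_series
print(flat_from_values[:5])  # ['a', 'b', 'c', 'd', 'e']
```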
Performance
from functools import reduce
from itertools import chain
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)
%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 μs per loop
%timeit list(chain(*categories.values.flat))
1000 loops, best of 3: 237 μs per loop
%timeit reduce(lambda l1, l2: l1 + l2, categories)
100 loops, best of 3: 15.8 ms per loop
For this data set the chaining is about 68x faster.
Vectorization?
Vectorization works when you have native NumPy data types (pandas uses NumPy for its data after all). Since we have lists in the Series already and want a list as result, it is rather unlikely that vectorization will speed things up. The conversion between standard Python objects and pandas/NumPy data types will likely eat up all the performance you might get from the vectorization. I made one attempt to vectorize the algorithm in another answer.
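As a point of comparison (my addition, not from the original answers): newer pandas versions (0.25 and later) offer Series.explode, which flattens a Series of lists into a Series of scalars entirely inside pandas. It gives the same elements, though for this workload it is typically not faster than chain:

```python
import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# explode() turns each list element into its own row (pandas >= 0.25).
flat = categories.explode().tolist()
print(len(flat))  # 5000
```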
Answered by Mike Müller
Vectorized but slow
You can use NumPy's concatenate:
import numpy as np
list(np.concatenate(categories.values))
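A small note (my addition): np.concatenate returns a flat NumPy array here, which is why the answer wraps it in list() and why the timings below measure both forms. A quick check with the same synthetic data:

```python
import numpy as np
import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# Concatenating the object array of lists yields one flat NumPy array.
flat = np.concatenate(categories.values)
print(type(flat).__name__, len(flat))  # ndarray 5000
```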
Performance
But we have lists, i.e. Python objects, already. So the vectorization has to switch back and forth between Python objects and NumPy data types. This makes things slow:
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)
%timeit list(np.concatenate(categories.values))
100 loops, best of 3: 7.66 ms per loop
%timeit np.concatenate(categories.values)
100 loops, best of 3: 5.33 ms per loop
%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 μs per loop
Answered by ssm
You can try your luck with business["categories"].str.join(''), but I am guessing that pandas uses Python's string functions. I doubt you can do better than what Python already offers you.
Answered by Muhammad Mubashirullah Durrani
I used "".join(business["categories"])
It is much faster than business["categories"].str.join('') but still 4 times slower than the itertools.chain method. I preferred it because it is more readable and no import is required.