"Reduce" function for a pandas Series

Disclaimer: this page is a Chinese–English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35004945/

Posted: 2020-09-14 00:33:57  Source: igfitidea

"Reduce" function for Series

python · performance · pandas · vectorization · reduce

Asked by hlin117

Is there an analog of reduce for a pandas Series?

For example, the analog of map is pd.Series.apply, but I can't find any analog of reduce.



My application is this: I have a pandas Series of lists:

>>> business["categories"].head()

0                      ['Doctors', 'Health & Medical']
1                                        ['Nightlife']
2                 ['Active Life', 'Mini Golf', 'Golf']
3    ['Shopping', 'Home Services', 'Internet Servic...
4    ['Bars', 'American (New)', 'Nightlife', 'Loung...
Name: categories, dtype: object

I'd like to merge the Series of lists together using reduce, like so:

categories = reduce(lambda l1, l2: l1 + l2, categories)

but this takes a horrific amount of time, because merging two lists is O(n) in Python. I'm hoping that pd.Series has a vectorized way to perform this faster.
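To make the complaint concrete, a minimal self-contained sketch (synthetic data, not the real business table):

```python
from functools import reduce

# Synthetic stand-in for business["categories"]
lists = [['a', 'b'], ['c', 'd', 'e']] * 1000

# Each `+` allocates a brand-new list and copies everything accumulated
# so far, so the total work grows quadratically with the output length.
merged = reduce(lambda l1, l2: l1 + l2, lists)
print(len(merged))  # 5000
```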

Accepted answer by Mike Müller

With itertools.chain() on the values

This could be faster:

from itertools import chain
categories = list(chain.from_iterable(categories.values))
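As a quick sanity check, a tiny self-contained sketch (sample values made up, assuming pandas is installed):

```python
from itertools import chain

import pandas as pd

categories = pd.Series([['Doctors', 'Health & Medical'], ['Nightlife']])

# chain.from_iterable lazily walks each inner list in order
flat = list(chain.from_iterable(categories.values))
print(flat)  # ['Doctors', 'Health & Medical', 'Nightlife']
```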

Performance


from functools import reduce
from itertools import chain

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 μs per loop

%timeit list(chain(*categories.values.flat))
1000 loops, best of 3: 237 μs per loop

%timeit reduce(lambda l1, l2: l1 + l2, categories)
100 loops, best of 3: 15.8 ms per loop

For this data set the chaining is about 68x faster.


Vectorization?


Vectorization works when you have native NumPy data types (pandas uses NumPy for its data, after all). Since we already have lists in the Series and want a list as the result, it is rather unlikely that vectorization will speed things up: the conversion between standard Python objects and pandas/NumPy data types would likely eat up any performance gained from vectorization. I made one attempt to vectorize the algorithm in another answer.

Answer by Mike Müller

Vectorized but slow


You can use NumPy's concatenate:

import numpy as np

list(np.concatenate(categories.values))

Performance


But we already have lists, i.e. Python objects. So the vectorized code has to switch back and forth between Python objects and NumPy data types, and this makes things slow:

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(np.concatenate(categories.values))
100 loops, best of 3: 7.66 ms per loop

%timeit np.concatenate(categories.values)
100 loops, best of 3: 5.33 ms per loop

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 μs per loop

Answer by ssm

You can try your luck with business["categories"].str.join(''), but I am guessing that pandas uses Python's string functions under the hood. I doubt you can do better than what Python already offers you.
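For reference, a small self-contained sketch of what Series.str.join does with a Series of lists (sample data made up): it joins within each row, producing one string per row, rather than merging the rows into one list.

```python
import pandas as pd

s = pd.Series([['Doctors', 'Health & Medical'], ['Nightlife']])

# Joins the elements of each row's list with the separator
joined = s.str.join(', ')
print(joined.tolist())  # ['Doctors, Health & Medical', 'Nightlife']
```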

Answer by Muhammad Mubashirullah Durrani

I used "".join(business["categories"])


It is much faster than business["categories"].str.join('') but still about 4 times slower than the itertools.chain method. I preferred it because it is more readable and requires no import.
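A minimal sketch of this approach, assuming the column holds plain strings (with actual list entries, "".join would raise a TypeError, since str.join requires string elements; the sample frame below is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the business table, with string entries
business = pd.DataFrame({'categories': ['Doctors', 'Nightlife', 'Golf']})

# str.join iterates the Series and concatenates its string elements
merged = "".join(business['categories'])
print(merged)  # DoctorsNightlifeGolf
```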