"Reduce" function for a pandas Series
Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/35004945/
Asked by hlin117
Is there an analog of reduce for a pandas Series?
For example, the analog of map is pd.Series.apply, but I can't find any analog of reduce.
My application: I have a pandas Series of lists:
>>> business["categories"].head()
0 ['Doctors', 'Health & Medical']
1 ['Nightlife']
2 ['Active Life', 'Mini Golf', 'Golf']
3 ['Shopping', 'Home Services', 'Internet Servic...
4 ['Bars', 'American (New)', 'Nightlife', 'Loung...
Name: categories, dtype: object
I'd like to merge the Series of lists together using reduce, like so:
categories = reduce(lambda l1, l2: l1 + l2, categories)
but this takes a horrific amount of time because merging two lists together is O(n) in Python. I'm hoping that pd.Series has a vectorized way to perform this faster.
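For context, the slowness comes from each l1 + l2 allocating a fresh list, so the total work grows quadratically with the number of elements. A minimal reproduction, using a synthetic Series since the business data is not shown:

```python
from functools import reduce

import pandas as pd

# Synthetic stand-in for business["categories"]: 2000 small lists.
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# Each step copies the accumulated list, so total work is O(n^2).
merged = reduce(lambda l1, l2: l1 + l2, categories)
print(len(merged))  # 1000 * 2 + 1000 * 3 = 5000 elements
```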
Accepted answer by Mike Müller
Use itertools.chain() on the values
This could be faster:
from itertools import chain
categories = list(chain.from_iterable(categories.values))
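As a side note (my addition, not part of the original answer): chain.from_iterable also accepts the Series directly, since iterating a Series yields its values; .values just skips the Series iteration machinery. A quick equivalence check on the same synthetic data:

```python
from itertools import chain

import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# Both forms flatten the Series of lists into one list.
flat_from_values = list(chain.from_iterable(categories.values))
flat_from_series = list(chain.from_iterable(categories))
assert flat_from_values == flat_from_series
print(flat_from_values[:5])  # ['a', 'b', 'c', 'd', 'e']
```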
Performance
from functools import reduce
from itertools import chain
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)
%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 μs per loop
%timeit list(chain(*categories.values.flat))
1000 loops, best of 3: 237 μs per loop
%timeit reduce(lambda l1, l2: l1 + l2, categories)
100 loops, best of 3: 15.8 ms per loop
For this data set the chaining is about 68x faster.
Vectorization?
Vectorization works when you have native NumPy data types (pandas uses NumPy for its data after all). Since we have lists in the Series already and want a list as result, it is rather unlikely that vectorization will speed things up. The conversion between standard Python objects and pandas/NumPy data types will likely eat up all the performance you might get from the vectorization. I made one attempt to vectorize the algorithm in another answer.
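As a point of comparison (my addition, not from the original answers): newer pandas versions (0.25 and later) offer Series.explode, which flattens a Series of lists into a Series of scalars entirely inside pandas. It gives the same elements, though for this workload it is typically not faster than chain:

```python
import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# explode() turns each list element into its own row (pandas >= 0.25).
flat = categories.explode().tolist()
print(len(flat))  # 5000
```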
Answered by Mike Müller
Vectorized but slow
You can use NumPy's concatenate:
import numpy as np
list(np.concatenate(categories.values))
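A small note (my addition): np.concatenate returns a flat NumPy array here, which is why the answer wraps it in list() and why the timings below measure both forms. A quick check with the same synthetic data:

```python
import numpy as np
import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# Concatenating the object array of lists yields one flat NumPy array.
flat = np.concatenate(categories.values)
print(type(flat).__name__, len(flat))  # ndarray 5000
```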
Performance
But we have lists, i.e. Python objects, already. So the vectorization has to switch back and forth between Python objects and NumPy data types. This makes things slow:
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)
%timeit list(np.concatenate(categories.values))
100 loops, best of 3: 7.66 ms per loop
%timeit np.concatenate(categories.values)
100 loops, best of 3: 5.33 ms per loop
%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 μs per loop
Answered by ssm
You can try your luck with business["categories"].str.join(''), but I am guessing that pandas uses Python's string functions. I doubt you can do better than what Python already offers you.
Answered by Muhammad Mubashirullah Durrani
I used "".join(business["categories"])
It is much faster than business["categories"].str.join('') but still 4 times slower than the itertools.chain method. I preferred it because it is more readable and no import is required.