Warning: this page is based on a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/27626542/

Time: 2020-09-13 22:47:15  Source: igfitidea

Counting consecutive positive value in Python array

Tags: python, pandas, statistics

Asked by alex314159

I'm trying to count consecutive up days in equity return data. If a positive day is 1 and a negative day is 0, a list y=[0,0,1,1,1,0,0,1,0,1,1] should return z=[0,0,1,2,3,0,0,1,0,1,2].

I've come to a solution which is neat in terms of number of lines of code, but is very slow:

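from functools import reduce  # not a builtin in Python 3
import pandas
y = pandas.Series([0,0,1,1,1,0,0,1,0,1,1])
def f(x):
    # fold left to right: a 1 extends the running streak, a 0 resets it
    return reduce(lambda a, b: (a + b) * b, x)
# pandas.expanding_apply was removed from later pandas releases;
# y.expanding().apply(f) is the equivalent call
z = y.expanding().apply(f, raw=True)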

I'm guessing I'm looping through the whole list y too many times. Is there a nice Pythonic way of achieving what I want while only going through the data once? I could write a loop myself but wondering if there's a better way.

Thanks!

Accepted answer by Coding Orange

why the obsession with the ultra-pythonic way of doing things? readability + efficiency trumps "leet hackerz style."

I'd just do it like so:

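a = [0,0,1,1,1,0,0,1,0,1,1]
b = [0] * len(a)  # b[i] will hold the length of the run of 1s ending at index i

for i in range(len(a)):
    if a[i] == 1:
        # at i == 0, b[i-1] is b[-1], which is still 0 here, so a leading 1 gives 1
        b[i] = b[i-1] + 1
    else:
        b[i] = 0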

Answer by DSM

>>> y = pandas.Series([0,0,1,1,1,0,0,1,0,1,1])

The following may seem a little magical, but actually uses some common idioms: since pandas doesn't yet have nice native support for a contiguous groupby, you often find yourself needing something like this.

>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     1
8     0
9     1
10    2
dtype: int64


Some explanation: first, we compare y against a shifted version of itself to find where the contiguous groups begin:

>>> y != y.shift()
0      True
1     False
2      True
3     False
4     False
5      True
6     False
7      True
8      True
9      True
10    False
dtype: bool

Then (since False == 0 and True == 1) we can apply a cumulative sum to get a number for the groups:

>>> (y != y.shift()).cumsum()
0     1
1     1
2     2
3     2
4     2
5     3
6     3
7     4
8     5
9     6
10    6
dtype: int32

We can use groupby and cumcount to get an integer counting up within each group:

>>> y.groupby((y != y.shift()).cumsum()).cumcount()
0     0
1     1
2     0
3     1
4     2
5     0
6     1
7     0
8     0
9     0
10    1
dtype: int64

Add one:

>>> y.groupby((y != y.shift()).cumsum()).cumcount() + 1
0     1
1     2
2     1
3     2
4     3
5     1
6     2
7     1
8     1
9     1
10    2
dtype: int64

And finally zero the values where we had zero to begin with:

>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     1
8     0
9     1
10    2
dtype: int64
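
As a compact recap, the steps above can be wrapped into a single function. This is only a sketch: the function name `consecutive_ones` and the intermediate variable names are made up for illustration, and it assumes the input Series contains only 0s and 1s.

```python
import pandas as pd

def consecutive_ones(y: pd.Series) -> pd.Series:
    """Length of the run of 1s ending at each position (0 where y is 0)."""
    # A new run starts wherever the value differs from the previous one.
    run_id = (y != y.shift()).cumsum()
    # Count positions within each run, starting from 1.
    position = y.groupby(run_id).cumcount() + 1
    # Multiplying by y zeroes out the positions that belong to runs of 0s.
    return y * position

y = pd.Series([0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
z = consecutive_ones(y)
print(z.tolist())  # [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```

For raw return data, something like `(returns > 0).astype(int)` would presumably produce the 0/1 Series first.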

Answer by osa

If something is clear, it is "pythonic". Frankly, I cannot even make your original solution work. Also, if it does work, I am curious if it is faster than a loop. Did you compare?

Now, since we've started discussing efficiency, here are some insights.

Loops in Python are inherently slow, no matter what you do. Of course, if you are using pandas, you are also using numpy underneath, with all the performance advantages. Just don't destroy them by looping. This is not to mention that Python lists take a lot more memory than you may think; potentially much more than 8 bytes * length, as every integer may be wrapped into a separate object and placed into a separate area in memory, and pointed at by a pointer from the list.
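
To make the memory point concrete, here is a rough comparison, assuming 64-bit CPython. The numbers vary by interpreter and version, so treat it as an illustration rather than a benchmark. Note that CPython caches small integers (including 0 and 1), so the sketch deliberately uses larger values to show the general per-object overhead.

```python
import sys
import numpy as np

n = 100_000
# Values above 256, so CPython's small-integer cache does not share the objects.
values = list(range(1_000, 1_000 + n))

# The list stores one pointer per element, and each int is a separate object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# A NumPy int64 array stores the raw 8-byte values contiguously.
array_bytes = np.array(values, dtype=np.int64).nbytes

print(list_bytes, array_bytes)
```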

Vectorization provided by numpy should be sufficient IF you can find some way to express this function without looping. In fact, I wonder if there is some way to represent it using expressions such as A+B*C. If you can construct this function out of functions in LAPACK, then you can potentially even beat ordinary C++ code compiled with optimization.
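
One possible loop-free formulation, which is not part of the original answer, uses a cumulative sum together with a running maximum. The function name `consecutive_ones_np` is hypothetical, and the sketch assumes the input contains only 0s and 1s.

```python
import numpy as np

def consecutive_ones_np(y):
    y = np.asarray(y)
    # Running total of 1s seen so far.
    c = np.cumsum(y)
    # Value of the running total at the most recent 0 (0 if no 0 seen yet).
    # c never decreases, so a running maximum carries that value forward.
    reset = np.maximum.accumulate(np.where(y == 0, c, 0))
    # Ones counted since the last 0, which is the current streak length.
    return c - reset

z = consecutive_ones_np([0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
print(z.tolist())  # [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```

This does three vectorized passes over the data instead of one Python-level loop and returns a new array rather than modifying the input.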

You can also use one of the compiled approaches to speed up your loops. See a solution with Numba on numpy arrays below. Another option is to use PyPy, though you probably can't properly combine it with pandas.

In [140]: import pandas as pd
In [141]: import numpy as np
In [143]: a=np.random.randint(2,size=1000000)

# Try the simple approach
In [147]: def simple(L):
              for i in range(len(L)):
                  if L[i]==1:
                      L[i] += L[i-1]


# L is not defined in the session as posted; presumably a list copy of a, e.g. L = list(a)
In [148]: %time simple(L)
CPU times: user 255 ms, sys: 20.8 ms, total: 275 ms
Wall time: 248 ms


# Just-In-Time compilation
In [149]: from numba import jit
@jit
def faster(z):
    prev=0
    for i in range(len(z)):
        cur=z[i]
        if cur==0:
             prev=0
        else:
             prev=prev+cur
             z[i]=prev

In [151]: %time faster(a)
CPU times: user 51.9 ms, sys: 1.12 ms, total: 53 ms
Wall time: 51.9 ms


In [159]: list(L)==list(a)
Out[159]: True

In fact, most of the time in the second example above was spent on just-in-time compilation. Timing the function again on fresh copies shows this (remember to copy, as the function changes the array):

b=a.copy()
In [38]: %time faster(b)
CPU times: user 55.1 ms, sys: 1.56 ms, total: 56.7 ms
Wall time: 56.3 ms

# c is not defined in the session as posted; presumably another fresh copy, e.g. c = a.copy()
In [39]: %time faster(c)
CPU times: user 10.8 ms, sys: 42 μs, total: 10.9 ms
Wall time: 10.9 ms

So for subsequent calls we have a 25x speedup compared to the simple version. I suggest you read High Performance Python if you want to know more.

Answer by Dan

Keeping things simple, using one array, one loop, and one conditional.

a = [0,0,1,1,1,0,0,1,0,1,1]

for i in range(1, len(a)):
    if a[i] == 1:
        a[i] += a[i - 1]
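
If modifying the input in place is not desired, the same single-pass recurrence can be sketched with `itertools.accumulate` from the standard library. It reuses the `(a + b) * b` step from the question's `reduce` attempt but visits each element only once. The lambda parameter names are illustrative, and the input is assumed to contain only 0s and 1s.

```python
from itertools import accumulate

a = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]

# streak is the running count so far; day is the current 0/1 value.
# A 1 extends the streak by one, a 0 multiplies it back down to zero.
z = list(accumulate(a, lambda streak, day: (streak + day) * day))

print(z)  # [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```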