Custom rolling apply function in Python pandas

Question

Asked by Maxim Zaslavsky

Setup

I have a DataFrame with three columns:

"Category" contains True and False, and I have done df.groupby('Category') to group by these values.
"Time" contains timestamps (measured in seconds) at which values have been recorded.
"Value" contains the values themselves.

At each time instance, two values are recorded: one has category "True", and the other has category "False".

Rolling apply question

Within each category group, I want to compute a number for each time and store it in column Result. Result is the percentage of values recorded between time t-60 and t that fall between 1 and 3.

The easiest way to accomplish this is probably to calculate the total number of values in that time interval via rolling_count, then execute rolling_apply to count only the values from that interval that fall between 1 and 3.

Here is my code so far:

groups = df.groupby(['Category'])
for key, grp in groups:
    grp = grp.reindex(grp['Time']) # reindex by time so we can count with rolling windows
    grp['total'] = pd.rolling_count(grp['Value'], window=60) # count number of values in the last 60 seconds
    grp['in_interval'] = ? ## Need to count number of values where 1<v<3 in the last 60 seconds

    grp['Result'] = grp['in_interval'] / grp['total'] # percentage of values between 1 and 3 in the last 60 seconds

What is the proper rolling_apply() call to find grp['in_interval']?

Answer 1

Answered by unutbu

Let's work through an example:

import pandas as pd
import numpy as np
np.random.seed(1)

def setup(regular=True):
    N = 10
    x = np.arange(N)
    a = np.arange(N)
    b = np.arange(N)

    if regular:
        timestamps = np.linspace(0, 120, N)
    else:
        timestamps = np.random.uniform(0, 120, N)

    df = pd.DataFrame({
        'Category': [True]*N + [False]*N,
        'Time': np.hstack((timestamps, timestamps)),
        'Value': np.hstack((a,b))
        })
    return df

df = setup(regular=False)
df.sort(['Category', 'Time'], inplace=True)

So the DataFrame, df, looks like this:

In [4]: df
Out[4]: 
   Category       Time  Value
12    False   0.013725      2
15    False  11.080631      5
14    False  17.610707      4
16    False  22.351225      6
13    False  36.279909      3
17    False  41.467287      7
18    False  47.612097      8
10    False  50.042641      0
19    False  64.658008      9
11    False  86.438939      1
2      True   0.013725      2
5      True  11.080631      5
4      True  17.610707      4
6      True  22.351225      6
3      True  36.279909      3
7      True  41.467287      7
8      True  47.612097      8
0      True  50.042641      0
9      True  64.658008      9
1      True  86.438939      1

Now, copying @herrfz, let's define

def between(a, b):
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage

between(1,3) is a function which takes a Series as input and returns the fraction of its elements which lie in the half-open interval [1,3). For example,

In [9]: series = pd.Series([1,2,3,4,5])

In [10]: between(1,3)(series)
Out[10]: 0.4
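
As a side note, the same fraction can be written without len, because the mean of a boolean mask is the fraction of its True entries. Here is a minimal sketch of that variant; the name between_mean is hypothetical and not part of the original answer:

```python
import pandas as pd


def between_mean(a, b):
    # Hypothetical variant of between(): fraction of elements in [a, b).
    def fraction(series):
        mask = (a <= series) & (series < b)
        # The mean of a boolean mask is the share of True entries.
        return float(mask.mean())
    return fraction


series = pd.Series([1, 2, 3, 4, 5])
print(between_mean(1, 3)(series))  # 0.4
```

One difference to keep in mind: for an empty Series this variant returns NaN, whereas the len-based version raises ZeroDivisionError.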

Now we are going to take our DataFrame, df, and group by Category:

df.groupby(['Category'])

For each group in the groupby object, we will want to apply a function:

df['Result'] = df.groupby(['Category']).apply(toeach_category)

The function, toeach_category, will take a (sub)DataFrame as input, and return a DataFrame as output. The entire result will be assigned to a new column of df called Result.

Now what exactly must toeach_category do? If we write toeach_category like this:

def toeach_category(subf):
    print(subf)

then we see each subf is a DataFrame such as this one (when Category is False):

   Category       Time  Value
12    False   0.013725      2
15    False  11.080631      5
14    False  17.610707      4
16    False  22.351225      6
13    False  36.279909      3
17    False  41.467287      7
18    False  47.612097      8
10    False  50.042641      0
19    False  64.658008      9
11    False  86.438939      1

We want to take the Time column, and for each time, apply a function. That's done with applymap:

def toeach_category(subf):
    result = subf[['Time']].applymap(percentage)

The function percentage will take a time value as input, and return a value as output. The value will be the fraction of rows with values between 1 and 3. applymap is very strict: percentage cannot take any other arguments.

Given a time t, we can select the Values from subf whose times are in the half-open interval (t-60, t] using the ix method:

subf.ix[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value']

And so we can find the percentage of those Values between 1 and 3 by applying between(1,3):

between(1,3)(subf.ix[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
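
A note for readers on a recent pandas: .ix was deprecated in pandas 0.20 and removed in 1.0, so the same boolean-mask selection would now be written with .loc. The sketch below assumes a small hand-made subf; the data is illustrative only and is not the DataFrame from this answer:

```python
import pandas as pd

# Illustrative data only; not the DataFrame used in the answer.
subf = pd.DataFrame({
    'Time': [0.0, 10.0, 50.0, 65.0],
    'Value': [2, 5, 1, 9],
})

t = 65.0
# Rows whose Time lies in the half-open interval (t-60, t].
in_window = (t - 60 < subf['Time']) & (subf['Time'] <= t)
values = subf.loc[in_window, 'Value']
print(values.tolist())  # [5, 1, 9]

# Fraction of the selected values that lie in [1, 3).
fraction = ((1 <= values) & (values < 3)).mean()
```

Here only the value 1 lies in [1, 3), so fraction is one third.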

Now remember that we want a function percentage which takes t as input and returns the above expression as output:

def percentage(t):
    return between(1,3)(subf.ix[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])

But notice that percentage depends on subf, and we are not allowed to pass subf to percentage as an argument (again, because applymap is very strict).

So how do we get out of this jam? The solution is to define percentage inside toeach_category. Python's scoping rules say that a bare name like subf is first looked for in the Local scope, then the Enclosing scope, then the Global scope, and lastly in the Builtin scope. When percentage(t) is called, and Python encounters subf, Python first looks in the Local scope for the value of subf. Since subf is not a local variable in percentage, Python looks for it in the Enclosing scope of the function toeach_category. It finds subf there. Perfect. That is just what we need.
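
The lookup order described here can be seen in isolation with a tiny closure. This is only an illustration of the scoping rule; make_scaler and scale are hypothetical names that have nothing to do with pandas:

```python
def make_scaler(factor):
    # `factor` is local to make_scaler, which is the enclosing scope of scale().
    def scale(x):
        # `factor` is not defined inside scale(), so Python finds it
        # in the enclosing scope of make_scaler().
        return x * factor
    return scale


double = make_scaler(2)
# Like applymap, map() calls the function with exactly one argument.
print(list(map(double, [1, 2, 3])))  # [2, 4, 6]
```

The inner function takes a single argument, yet still has access to the extra value it needs, which is the same trick percentage uses to reach subf.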

So now we have our function toeach_category:

def toeach_category(subf):
    def percentage(t):
        return between(1, 3)(
            subf.ix[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    result = subf[['Time']].applymap(percentage)
    return result

Putting it all together,

import pandas as pd
import numpy as np
np.random.seed(1)


def setup(regular=True):
    N = 10
    x = np.arange(N)
    a = np.arange(N)
    b = np.arange(N)

    if regular:
        timestamps = np.linspace(0, 120, N)
    else:
        timestamps = np.random.uniform(0, 120, N)

    df = pd.DataFrame({
        'Category': [True] * N + [False] * N,
        'Time': np.hstack((timestamps, timestamps)),
        'Value': np.hstack((a, b))
    })
    return df


def between(a, b):
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage


def toeach_category(subf):
    def percentage(t):
        return between(1, 3)(
            subf.ix[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    result = subf[['Time']].applymap(percentage)
    return result


df = setup(regular=False)
df.sort(['Category', 'Time'], inplace=True)
df['Result'] = df.groupby(['Category']).apply(toeach_category)
print(df)

yields

   Category       Time  Value    Result
12    False   0.013725      2  1.000000
15    False  11.080631      5  0.500000
14    False  17.610707      4  0.333333
16    False  22.351225      6  0.250000
13    False  36.279909      3  0.200000
17    False  41.467287      7  0.166667
18    False  47.612097      8  0.142857
10    False  50.042641      0  0.125000
19    False  64.658008      9  0.000000
11    False  86.438939      1  0.166667
2      True   0.013725      2  1.000000
5      True  11.080631      5  0.500000
4      True  17.610707      4  0.333333
6      True  22.351225      6  0.250000
3      True  36.279909      3  0.200000
7      True  41.467287      7  0.166667
8      True  47.612097      8  0.142857
0      True  50.042641      0  0.125000
9      True  64.658008      9  0.000000
1      True  86.438939      1  0.166667
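
The script above targets the pandas of its day: df.sort and .ix have since been removed from pandas. For readers on a recent release, here is a sketch of the same computation under that assumption. It swaps in sort_values and plain NumPy masks in place of applymap, and the helper name window_fraction is hypothetical:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
N = 10
timestamps = np.random.uniform(0, 120, N)
df = pd.DataFrame({
    'Category': [True] * N + [False] * N,
    'Time': np.hstack((timestamps, timestamps)),
    'Value': np.hstack((np.arange(N), np.arange(N))),
})
df = df.sort_values(['Category', 'Time'])


def window_fraction(sub, lo=1, hi=3, width=60):
    """Hypothetical helper: for each row, the fraction of Values in [lo, hi)
    among the rows whose Time lies in (t - width, t]."""
    times = sub['Time'].to_numpy()
    values = sub['Value'].to_numpy()
    out = []
    for t in times:
        in_window = (t - width < times) & (times <= t)
        window_values = values[in_window]
        # The window always contains the row at time t, so it is never empty.
        out.append(((lo <= window_values) & (window_values < hi)).mean())
    return pd.Series(out, index=sub.index)


# Compute per category, then let pandas align the pieces on the row index.
parts = [window_fraction(sub) for _, sub in df.groupby('Category')]
df['Result'] = pd.concat(parts)
print(df)
```

With the same seed this should reproduce the Result column shown above. The loop is quadratic in the group size, which is fine for a sketch but may be slow on large groups.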

Answer 2

Answered by herrfz

If I understand your problem statement correctly, you could probably skip rolling_count if you use it only for the sake of computing the percentage. rolling_apply takes as an argument a function that performs aggregation, i.e. a function that takes an array as input and returns a number as output.

Having this in mind, let's first define a function:

def between_1_3_perc(x):
    # pandas Series is basically a numpy array, we can do boolean indexing
    return float(len(x[(x > 1) & (x < 3)])) / float(len(x))

Then use the function name as an argument of rolling_apply in the for-loop:

grp['Result'] = pd.rolling_apply(grp['Value'], 60, between_1_3_perc)
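
Two caveats for readers on a recent pandas. The module-level pd.rolling_apply was deprecated in pandas 0.18 and later removed in favor of the .rolling(...).apply(...) method. Also, an integer window of 60 counts rows, not seconds. The sketch below assumes the timestamps are converted to a DatetimeIndex so that a '60s' offset window can be used; the data is illustrative only:

```python
import pandas as pd


def between_1_3_perc(x):
    # With raw=True, x arrives as a NumPy array of the values in the window.
    return float(len(x[(x > 1) & (x < 3)])) / float(len(x))


# Illustrative data only: values recorded at 0, 11, 17, 22, 36 and 64 seconds.
seconds = [0, 11, 17, 22, 36, 64]
values = pd.Series([2, 5, 4, 6, 3, 9],
                   index=pd.to_datetime(seconds, unit='s'))

# A '60s' window covers (t - 60s, t] for each timestamp t.
result = values.rolling('60s').apply(between_1_3_perc, raw=True)
print(result)
```

Note that this function uses strict inequalities, so it counts values in the open interval (1, 3), unlike between(1, 3) in the other answer, which includes 1.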
