Finding consecutive segments in a pandas DataFrame

Question

Asked by languitar

I have a pandas.DataFrame with measurements taken at consecutive points in time. Along with each measurement the system under observation had a distinct state at each point in time. Hence, the DataFrame also contains a column with the state of the system at each measurement. State changes are much slower than the measurement interval. As a result, the column indicating the states might look like this (index: state):

Is there an easy way to retrieve the indices of each segment of consecutively equal states? That is, I would like to get something like this:

[[1,2,3,4], [5,6,7,8], [9,10,11,12,13]]

The result could also be in a form other than plain lists.

The only solution I could think of so far is to iterate over the rows manually, find the segment change points, and reconstruct the indices from those change points, but I hope there is an easier solution.

Answer 1

Answered by Zelazny7

One-liner:

df.reset_index().groupby('A')['index'].apply(np.array)

Code for example:

In [1]: import numpy as np

In [2]: from pandas import *

In [3]: df = DataFrame([3]*4+[4]*4+[1]*4, columns=['A'])
In [4]: df
Out[4]:
    A
0   3
1   3
2   3
3   3
4   4
5   4
6   4
7   4
8   1
9   1
10  1
11  1

In [5]: df.reset_index().groupby('A')['index'].apply(np.array)
Out[5]:
A
1    [8, 9, 10, 11]
3      [0, 1, 2, 3]
4      [4, 5, 6, 7]

You can also directly access the information from the groupby object:

In [1]: grp = df.groupby('A')

In [2]: grp.indices
Out[2]:
{1L: array([ 8,  9, 10, 11], dtype=int64),
 3L: array([0, 1, 2, 3], dtype=int64),
 4L: array([4, 5, 6, 7], dtype=int64)}

In [3]: grp.indices[3]
Out[3]: array([0, 1, 2, 3], dtype=int64)
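
Note that grp.indices holds integer row positions, not index labels; the two coincide above only because df uses the default 0-based index. A minimal sketch of the difference, assuming a hypothetical frame whose index labels start at 10 (this frame is not part of the original answer):

```python
import pandas as pd

# Hypothetical frame: same states as above, but index labels run from 10 to 21.
df = pd.DataFrame({"A": [3] * 4 + [4] * 4 + [1] * 4}, index=range(10, 22))
grp = df.groupby("A")

# .indices maps each group key to integer positions (0-based row numbers).
positions = grp.indices[3].tolist()

# .groups maps each group key to the index labels of its rows.
labels = grp.groups[3].tolist()

print(positions)  # [0, 1, 2, 3]
print(labels)     # [10, 11, 12, 13]
```

If the goal is to recover index labels, as in the question, grp.groups is probably the better fit.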

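To address the situation that DSM mentioned (the same state value reappearing in a later, separate segment, which a plain groupby on A would merge into one group), you could do something like: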

In [1]: df = DataFrame([3]*4+[4]*4+[1]*4+[3]*4, columns=['A'])

In [2]: df['block'] = (df.A.shift(1) != df.A).astype(int).cumsum()

In [3]: df
Out[3]:
    A  block
0   3      1
1   3      1
2   3      1
3   3      1
4   4      2
5   4      2
6   4      2
7   4      2
8   1      3
9   1      3
10  1      3
11  1      3
12  3      4
13  3      4
14  3      4
15  3      4

Now group by both columns and apply np.array:

In [77]: df.reset_index().groupby(['A','block'])['index'].apply(np.array)
Out[77]:
A  block
1  3          [8, 9, 10, 11]
3  1            [0, 1, 2, 3]
   4        [12, 13, 14, 15]
4  2            [4, 5, 6, 7]
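
Since the block column alone already identifies each run, the result can also be collected as plain lists in order of appearance, which is closer to the output the question asks for. The following is a minimal sketch, not the answer's own code; the helper name consecutive_segments is hypothetical:

```python
import pandas as pd

def consecutive_segments(states):
    """Return one list of index labels per run of consecutive equal values."""
    # A new block starts wherever a value differs from its predecessor.
    block = (states != states.shift()).cumsum()
    # Put the index labels into a Series so they can be grouped by block id.
    index_labels = pd.Series(states.index)
    # Group by the raw block ids (a NumPy array) to avoid index alignment.
    return [labels.tolist() for _, labels in index_labels.groupby(block.to_numpy())]

states = pd.Series([3] * 4 + [4] * 4 + [1] * 4 + [3] * 4)
segments = consecutive_segments(states)
print(segments)
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
```

One caveat: NaN never compares equal to itself, so each missing value would start its own block in this sketch.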

Answer 2

Answered by Rutger Kassies

You could use np.diff() to test where a segment starts or ends, and then iterate over those results. It's a very simple solution, so probably not the most performant one.

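import numpy as np

a = np.array([3,3,3,3,3,4,4,4,4,4,1,1,1,1,4,4,12,12,12])

prev = 0
# Segment end positions: one past each change point, plus the end of the array.
splits = np.append(np.where(np.diff(a) != 0)[0], len(a) - 1) + 1

for split in splits:
    # Print the 1-based indices of the current segment.
    print(np.arange(1, a.size + 1, 1)[prev:split])
    prev = split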

Results in:

[1 2 3 4 5]
[ 6  7  8  9 10]
[11 12 13 14]
[15 16]
[17 18 19]
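
The explicit loop can probably be avoided by handing the change points to np.split. This is a sketch of that variant under the same 1-based indexing, not part of the original answer; the names starts and segments are illustrative:

```python
import numpy as np

a = np.array([3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 1, 1, 1, 1, 4, 4, 12, 12, 12])

# Positions where a new segment starts (value differs from the previous one).
starts = np.flatnonzero(np.diff(a) != 0) + 1

# 1-based indices, split into sub-arrays at every segment start.
segments = np.split(np.arange(1, a.size + 1), starts)

for seg in segments:
    print(seg)
```

This prints the same five segments as the loop above.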
