Pandas: How to find a particular pattern in a dataframe column?

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/44967805/

python pandas

Asked by vestland

I'd like to find a particular pattern in a pandas dataframe column, and return the corresponding index values in order to subset the dataframe.

Here's a sample dataframe with a possible pattern:

Snippet to produce dataframe:

import pandas as pd
import numpy as np

Observations = 10
Columns = 2
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
                  columns = ['ColA','ColB'])
# Daily dates starting on 2017-07-07, one per observation
datelist = pd.date_range('2017-07-07', periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])

pattern = [100,90,105]
print(df)

Dataframe:

            ColA  ColB
Dates                 
2017-07-07   103    92
2017-07-08    92    96
2017-07-09   107   109
2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99
2017-07-13    90   104
2017-07-14    90   105
2017-07-15   109   104
2017-07-16    94    90

Here, the pattern of interest occurs in Column A on the dates 2017-07-10 to 2017-07-12, and that's what I'd like to end up with:

Desired output:

2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99

If the same pattern occurs several times, I would like to subset the dataframe the same way and also count how many times the pattern occurs, but I hope that will be more straightforward once I get the first step sorted out.

Thank you for any suggestions!

Accepted answer by FLab

Here is a solution:

Check whether the pattern occurs in any of the columns using rolling. This flags the last index of each group that matches the pattern:

matched = df.rolling(len(pattern)).apply(lambda x: all(np.equal(x, pattern)))
matched = matched.sum(axis = 1).astype(bool)   #Sum to perform boolean OR

matched
Out[129]: 
Dates
2017-07-07    False
2017-07-08    False
2017-07-09    False
2017-07-10    False
2017-07-11    False
2017-07-12     True
2017-07-13    False
2017-07-14    False
2017-07-15    False
2017-07-16    False
dtype: bool

For each match, add the indexes of the complete pattern:

idx_matched = np.where(matched)[0]
subset = [range(match-len(pattern)+1, match+1) for match in idx_matched]

Get all the patterns:

result = pd.concat([df.iloc[subs,:] for subs in subset], axis = 0)

result
Out[128]: 
            ColA  ColB
Dates                 
2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99
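
The question also asks how to count the occurrences. The following is a minimal sketch along the same lines as the rolling approach, not part of the original answer. It rebuilds the sample dataframe from the values printed in the question, and the names hits, counts, ends and parts are only illustrative. It also guards against the case where nothing matches, since pd.concat raises on an empty list.

```python
import numpy as np
import pandas as pd

# Rebuild the sample dataframe from the values printed in the question.
df = pd.DataFrame(
    {"ColA": [103, 92, 107, 100, 90, 105, 90, 90, 109, 94],
     "ColB": [92, 96, 109, 91, 107, 99, 104, 105, 104, 90]},
    index=pd.date_range("2017-07-07", periods=10, name="Dates"),
)
pattern = [100, 90, 105]

# 1.0 where the window ending at a row equals the pattern, 0.0 otherwise
# (NaN for the first rows, where the window is not yet full).
hits = df.rolling(len(pattern)).apply(
    lambda x: float(np.array_equal(x, pattern)), raw=True
)

# Number of complete matches per column (sum skips the NaN rows).
counts = hits.sum().astype(int)

# Integer positions of the last row of each match, in any column.
ends = np.flatnonzero(hits.fillna(0).to_numpy().any(axis=1))
parts = [df.iloc[end - len(pattern) + 1 : end + 1] for end in ends]

# Fall back to an empty frame when the pattern never occurs.
result = pd.concat(parts) if parts else df.iloc[0:0]

print(counts)
print(result)
```

Overlapping matches are counted separately here, and their rows would appear more than once in result.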

Answer by Deena

Using the magic of list comprehensions:

[df.index[i - len(pattern)]                               # Start date of the matching window
 for i in range(len(pattern), len(df) + 1)                # i is the exclusive end of each window of 3 consecutive rows
 if all(df['ColA'].iloc[i - len(pattern):i] == pattern)]  # Keep the window if it equals the pattern

# [Timestamp('2017-07-10 00:00:00')]
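
If the same check is needed for several patterns or columns, the comprehension could be wrapped in a small helper. This is only a sketch: find_pattern_starts is a hypothetical name, not something from pandas or from the answer, and it compares plain Python lists rather than Series slices.

```python
import pandas as pd


def find_pattern_starts(series, pattern):
    """Return the index labels at which `pattern` starts in `series`.

    Hypothetical helper for illustration; overlapping matches are all reported.
    """
    n = len(pattern)
    values = series.tolist()
    target = list(pattern)
    return [
        series.index[i]
        for i in range(len(values) - n + 1)  # every window start, including the last one
        if values[i:i + n] == target
    ]


# ColA from the question, with the same daily dates as index.
s = pd.Series(
    [103, 92, 107, 100, 90, 105, 90, 90, 109, 94],
    index=pd.date_range("2017-07-07", periods=10),
)

starts = find_pattern_starts(s, [100, 90, 105])
print(starts)       # start date(s) of the pattern
print(len(starts))  # number of occurrences
```

The length of the returned list gives the number of occurrences the question asks about.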

Answer by baloo

The shortest way is to find the index at which the pattern starts. Then you just need to select that row and the two rows that follow it.

In order to find these indexes, a one-liner is enough:

indexes=df[(df.ColA==pattern[0])&(df["ColA"].shift(-1)==pattern[1])&(df["ColA"].shift(-2)==pattern[2])].index

Then do as the other answer says to get the subsets that you want.
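
As a rough sketch of that last step, and of how the shift idea might extend to a pattern of any length, something like the following could work. It is an illustration rather than part of the answer: the dataframe is rebuilt from the values printed in the question, and masks, start_pos and rows are made-up names.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Rebuild the sample dataframe from the values printed in the question.
df = pd.DataFrame(
    {"ColA": [103, 92, 107, 100, 90, 105, 90, 90, 109, 94],
     "ColB": [92, 96, 109, 91, 107, 99, 104, 105, 104, 90]},
    index=pd.date_range("2017-07-07", periods=10, name="Dates"),
)
pattern = [100, 90, 105]
col = df["ColA"]

# One boolean mask per pattern element: shift(-k) lines up the value
# k rows ahead with the candidate start row.
masks = [col.shift(-k) == value for k, value in enumerate(pattern)]
starts_mask = reduce(lambda a, b: a & b, masks)

# Integer positions of the start rows, then every row each match covers.
start_pos = np.flatnonzero(starts_mask.to_numpy())
rows = np.unique(
    np.array([p + k for p in start_pos for k in range(len(pattern))], dtype=int)
)
subset = df.iloc[rows]

print(len(start_pos))  # number of occurrences
print(subset)
```

Because np.unique removes duplicates, rows shared by overlapping matches appear only once in subset.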

Answer by SayPy

for col in df:
    index = df[col][(df[col] == pattern[0]) & (df[col].shift(-1) == pattern[1]) & (df[col].shift(-2) == pattern[2])].index
    if not index.empty: print(index)