Original URL: http://stackoverflow.com/questions/44967805/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas: How to find a particular pattern in a dataframe column?
Asked by vestland
I'd like to find a particular pattern in a pandas dataframe column, and return the corresponding index values in order to subset the dataframe.
Here's a sample dataframe with a possible pattern:
Snippet to produce dataframe:
import pandas as pd
import numpy as np
Observations = 10
Columns = 2
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90, 110, size=(Observations, Columns)),
                  columns=['ColA', 'ColB'])
# pd.datetime is no longer available in recent pandas; a date string works directly
datelist = pd.date_range('2017-07-07', periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])
pattern = [100, 90, 105]
print(df)
Dataframe:
            ColA  ColB
Dates
2017-07-07   103    92
2017-07-08    92    96
2017-07-09   107   109
2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99
2017-07-13    90   104
2017-07-14    90   105
2017-07-15   109   104
2017-07-16    94    90
Here, the pattern of interest occurs in Column A on the dates 2017-07-10 to 2017-07-12, and that's what I'd like to end up with:
Desired output:
2017-07-10 100 91
2017-07-11 90 107
2017-07-12 105 99
If the same pattern occurs several times, I would like to subset the dataframe the same way and also count how many times the pattern occurs, but I hope that will be more straightforward once I get the first step sorted out.
Thank you for any suggestions!
Accepted answer by FLab
Here is a solution:
Check whether the pattern occurs in any of the columns using rolling. This gives you the last index of each group that matches the pattern:
matched = df.rolling(len(pattern)).apply(lambda x: all(np.equal(x, pattern)))
matched = matched.sum(axis = 1).astype(bool) #Sum to perform boolean OR
matched
Out[129]:
Dates
2017-07-07 False
2017-07-08 False
2017-07-09 False
2017-07-10 False
2017-07-11 False
2017-07-12 True
2017-07-13 False
2017-07-14 False
2017-07-15 False
2017-07-16 False
dtype: bool
For each match, add the indexes of the complete pattern:
idx_matched = np.where(matched)[0]
subset = [range(match-len(pattern)+1, match+1) for match in idx_matched]
Get all the patterns:
result = pd.concat([df.iloc[subs,:] for subs in subset], axis = 0)
result
Out[128]:
ColA ColB
Dates
2017-07-10 100 91
2017-07-11 90 107
2017-07-12 105 99
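The question also asks how to count occurrences. A minimal sketch of the rolling approach wrapped in a function that returns both the matching sub-frames and their count is shown below. The helper name `find_pattern` is hypothetical, the frame is rebuilt from the values printed in the question rather than from the random seed, and `raw=True` is an assumption made so that each window arrives as a plain NumPy array.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table printed in the question
df = pd.DataFrame(
    {"ColA": [103, 92, 107, 100, 90, 105, 90, 90, 109, 94],
     "ColB": [92, 96, 109, 91, 107, 99, 104, 105, 104, 90]},
    index=pd.date_range("2017-07-07", periods=10, name="Dates"),
)
pattern = [100, 90, 105]

def find_pattern(frame, pattern):
    """Hypothetical helper: return (list of matching sub-frames, match count)."""
    n = len(pattern)
    # 1.0 where the window ending at a row equals the pattern, 0.0 otherwise;
    # the first n - 1 rows are NaN because their windows are incomplete
    hits = frame.rolling(n).apply(
        lambda w: float(np.array_equal(w, pattern)), raw=True
    )
    # Positions of rows where at least one column ends a matching window
    any_hit = (hits.fillna(0).sum(axis=1) > 0).to_numpy()
    ends = np.flatnonzero(any_hit)
    pieces = [frame.iloc[end - n + 1:end + 1] for end in ends]
    return pieces, len(pieces)

pieces, count = find_pattern(df, pattern)
print(count)
print(pieces[0])
```

Note that this counts rows at which any column completes the pattern, so a match that ends on the same row in two columns is counted once.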
Answer by Deena
Using the magic of list comprehensions:
[df.index[i - len(pattern)]                                # Timestamp where the match starts
 for i in range(len(pattern), len(df) + 1)                 # For each window of 3 consecutive elements (including the last one)
 if all(df['ColA'].iloc[i - len(pattern):i] == pattern)]   # If the window equals the pattern
# [Timestamp('2017-07-10 00:00:00')]
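The list comprehension returns only the timestamps at which a match starts. One way to turn such a scan into the row subsets the question asks for is sketched below. It works with positional starts instead of timestamps, which is a variation on the answer rather than the answer's own code, and it rebuilds the frame from the values printed in the question.

```python
import pandas as pd

# Sample data copied from the table printed in the question
df = pd.DataFrame(
    {"ColA": [103, 92, 107, 100, 90, 105, 90, 90, 109, 94],
     "ColB": [92, 96, 109, 91, 107, 99, 104, 105, 104, 90]},
    index=pd.date_range("2017-07-07", periods=10, name="Dates"),
)
pattern = [100, 90, 105]
n = len(pattern)
col = df["ColA"]

# Positional start of every window whose values equal the pattern.
# The range ends at len(df) - n inclusive so the final window is checked too.
starts = [i for i in range(len(df) - n + 1)
          if (col.iloc[i:i + n] == pattern).all()]

# One sub-frame per occurrence, sliced by position
subsets = [df.iloc[s:s + n] for s in starts]

print(df.index[starts])   # timestamps where each occurrence begins
print(subsets[0])
```

The number of occurrences is then simply `len(starts)`.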
Answer by baloo
The shortest way is to find the index at which the pattern starts. Then you just need to select that row and the two rows that follow it.
In order to find these indexes, a one-liner is enough:
indexes = df[(df["ColA"] == pattern[0]) & (df["ColA"].shift(-1) == pattern[1]) & (df["ColA"].shift(-2) == pattern[2])].index
Then do as the other answer says to get the subsets that you want.
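The one-liner hard-codes a pattern of length three. A possible generalization to any pattern length, which also expands the start positions into a mask over all matched rows, might look like the sketch below. The names `start_mask` and `row_mask` are illustrative, and the frame is rebuilt from the values printed in the question.

```python
import pandas as pd

# Sample data copied from the table printed in the question
df = pd.DataFrame(
    {"ColA": [103, 92, 107, 100, 90, 105, 90, 90, 109, 94],
     "ColB": [92, 96, 109, 91, 107, 99, 104, 105, 104, 90]},
    index=pd.date_range("2017-07-07", periods=10, name="Dates"),
)
pattern = [100, 90, 105]
col = df["ColA"]

# True on rows where an occurrence starts:
# the value k rows ahead must equal pattern[k] for every k
start_mask = pd.Series(True, index=df.index)
for k, value in enumerate(pattern):
    start_mask &= col.shift(-k) == value

# Spread each start forward so that every row of an occurrence is marked
row_mask = pd.Series(False, index=df.index)
for k in range(len(pattern)):
    row_mask |= start_mask.shift(k, fill_value=False)

result = df[row_mask]          # all rows that belong to an occurrence
count = int(start_mask.sum())  # number of occurrences

print(count)
print(result)
```

With a boolean row mask, overlapping or adjacent occurrences merge into one run of rows, so slice per start position instead if each occurrence must stay separate.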
Answer by SayPy
for col in df:
    index = df[col][(df[col] == pattern[0]) & (df[col].shift(-1) == pattern[1]) & (df[col].shift(-2) == pattern[2])].index
    if not index.empty:
        print(index)