Find String Pattern Match in Pandas Dataframe and Return Matched String
Original question: http://stackoverflow.com/questions/22703494/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by horatio1701d
I have a dataframe column containing variable-length, comma-separated text, and I am trying to extract the values that also appear in another list. My dataframe looks like this:
col1 | col2
-----------
 x   | a,b
import re

listformatch = ['c', 'd', 'f', 'b']
pattern = '|'.join(listformatch)

def test_for_pattern(x):
    if re.search(pattern, x):
        return pattern  # returns the whole pattern string, not the matched text
    else:
        return x
# col2.str.contains(pattern) can also be used for the same results
The above filtering works great, but when it finds a match it returns the whole pattern, such as a|b, instead of just b. I want to create another column containing the part of the pattern that was actually found, such as b.
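One way to get the matched text rather than the whole pattern is to return the match object's group(). The following is a minimal sketch, assuming sample data modelled on the question; the helper name first_match and the second data row are hypothetical, and re.escape is added on the assumption that the list entries should be matched literally.

```python
import re

import pandas as pd

# Sample data modelled on the question; the values are illustrative.
df = pd.DataFrame({"col1": ["x", "y"], "col2": ["a,b", "a,e"]})
listformatch = ["c", "d", "f", "b"]

# Escape each term so regex metacharacters are matched literally (an assumption).
pattern = "|".join(re.escape(term) for term in listformatch)


def first_match(text):
    """Return the first substring of `text` matching `pattern`, else None."""
    match = re.search(pattern, text)
    return match.group() if match else None


# New column holding the matched term (or a missing value when nothing matches).
df["matched"] = df["col2"].map(first_match)
print(df)
```

Only the first match in each cell is returned here; a cell containing several listed values would need a different approach, such as re.findall.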
I am still getting "UserWarning: This pattern has match groups. To actually get the groups, use str.extract.", which I wish I could resolve. Here is my final function:
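import re
import pandas as pd

def matching_func(fin, fin1, col1):
    # fin: CSV path with the search terms; fin1: Excel path to search; col1: column name used in both
    file1 = pd.read_csv(fin)
    file2 = pd.read_excel(fin1, 0, skiprows=1)
    pattern = '|'.join(file1[col1].tolist())
    file2['new_col'] = file2[col1].map(lambda x: re.search(pattern, x).group()
                                       if re.search(pattern, x) else None)
    return file2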
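As far as I can tell, the UserWarning quoted above is issued by str.contains when its pattern contains a capturing group, rather than by re.search. A possible way around it, sketched below under that assumption, is to use a non-capturing group for the boolean test and keep the capturing group for str.extract. The sample Series and the words list are illustrative, not taken from the question's files.

```python
import pandas as pd

# Illustrative data; not the files from the question.
s = pd.Series(["a", "a,b", "c,d"])
words = ["c", "d", "f", "b"]
alternatives = "|".join(words)

# (?:...) groups the alternatives without creating a match group,
# so str.contains has no capture group to warn about.
mask = s.str.contains("(?:{})".format(alternatives))

# (...) is a capturing group, which str.extract needs in order to
# return the matched text; expand=False gives back a Series.
found = s.str.extract("({})".format(alternatives), expand=False)

print(mask)
print(found)
```

For a plain alternation like this, the pattern could also be passed to str.contains with no parentheses at all.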
I think I understand how pandas extract works now, but I am probably still rusty on regex. How do I create a pattern variable to use in the example below?
df[col1].str.extract('(word1|word2)')
Instead of writing the words directly in the argument, I want to create a variable such as pattern = 'word1|word2', but that won't work because of the way the string is being created.
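The likely reason a bare 'word1|word2' fails is that str.extract requires at least one capturing group, so the alternation has to be wrapped in parentheses when the string is built. A minimal sketch, assuming an illustrative dataframe and a hypothetical words list (re.escape is optional and only matters if the words contain regex metacharacters):

```python
import re

import pandas as pd

# Illustrative data; the column contents are made up for this sketch.
df = pd.DataFrame({"col1": ["has word1 here", "nothing", "word2 at start"]})
words = ["word1", "word2"]

# Build the alternation, then wrap it in one capturing group.
pattern = "|".join(re.escape(w) for w in words)   # 'word1|word2'
capture = "({})".format(pattern)                  # '(word1|word2)'

# expand=False returns a Series instead of a one-column DataFrame.
df["match"] = df["col1"].str.extract(capture, expand=False)
print(df)
```

Rows with no match get a missing value in the new column.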
My final and preferred version with vectorized string method in pandas 0.13:
Using values from one column to extract from a second column:
df[col1].str.extract('({})'.format('|'.join(df[col2])))
Answered by Andy Hayden
You might like to use extract, or one of the other vectorised string methods:
In [11]: s = pd.Series(['a', 'a,b'])
In [12]: s.str.extract('([cdfb])')
Out[12]:
0    NaN
1      b
dtype: object
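The session above reflects the pandas version of the time. In recent pandas versions, str.extract returns a DataFrame by default, and passing expand=False gives a Series like the one shown. The sketch below also uses str.findall, one of the other vectorised string methods, to collect every match in a cell rather than only the first; the third sample value is an illustrative addition.

```python
import pandas as pd

s = pd.Series(["a", "a,b", "b,c"])

# First match per row; expand=False returns a Series rather than a DataFrame.
first = s.str.extract("([cdfb])", expand=False)

# Every match per row, as a list; no capturing group is needed here.
every = s.str.findall("[cdfb]")

print(first)
print(every)
```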

