Notice: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not the translator). Original Stack Overflow question: http://stackoverflow.com/questions/39901550/

Date: 2020-09-14 02:09:27. Source: igfitidea

Python: UserWarning: This pattern has match groups. To actually get the groups, use str.extract

Tags: python, regex, pandas

Asked by Petr Petrov

I have a DataFrame and I am trying to select the rows where one of the columns contains certain strings. The DataFrame looks like this:

member_id,event_path,event_time,event_duration
30595,"2016-03-30 12:27:33",yandex.ru/,1
30595,"2016-03-30 12:31:42",yandex.ru/,0
30595,"2016-03-30 12:31:43",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:44",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:45",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:46",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:49",kinogo.co/,1
30595,"2016-03-30 12:32:11",kinogo.co/melodramy/,0

And another DataFrame with URL patterns:

url
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_bq_phoenix
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_
003\.ru\/sonyxperia
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony\/brands5D5Bbr_23
1click\.ru\/sonyxperia
1click\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/chasy-motorola

I use:

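import pandas as pd

# Note: newer pandas releases replace error_bad_lines=False with on_bad_lines='skip'.
urls = pd.read_csv('relevant_url1.csv', error_bad_lines=False)
substr = urls.url.values.tolist()  # list of regex patterns
# Read the large file in chunks of 50000 rows
data = pd.read_csv('data_nts2.csv', error_bad_lines=False, chunksize=50000)
result = pd.DataFrame()
for i, df in enumerate(data):
    # Keep the rows whose 'event_time' value matches any of the patterns
    res = df[df['event_time'].str.contains('|'.join(substr), regex=True)]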

but it returns:

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.

How can I fix that?

Accepted answer by unutbu

At least one of the regex patterns in urls must use a capturing group. str.contains only returns True or False for each row in df['event_time']; it does not make use of the capturing group. Thus, the UserWarning is alerting you that the regex uses a capturing group but the captured match is not used.

If you wish to remove the UserWarning, you could find and remove the capturing group from the regex pattern(s). They are not shown in the regex patterns you posted, but they must be there in your actual file. Look for parentheses outside of the character classes.
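
One way to locate the offending patterns is to compile each one and check how many capturing groups it reports. The following is a minimal sketch, not the answerer's code; the pattern list is hypothetical and only stands in for the contents of the real URL file.

```python
import re

# Hypothetical patterns standing in for the rows of the URL file.
patterns = [
    r"003\.ru\/sonyxperia",            # no parentheses at all
    r"1click\.ru\/(sony|motorola)",    # capturing group: triggers the warning
    r"003\.ru\/[a-z()]+\/phone",       # parentheses inside a character class are literals
]

# A compiled pattern exposes the number of capturing groups as .groups
with_groups = [p for p in patterns if re.compile(p).groups > 0]
print(with_groups)
```

Only the second pattern is reported, because parentheses inside a character class do not form a group.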

Alternatively, you could suppress this particular UserWarning by putting

import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')

before the call to str.contains.
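
If a global filter feels too broad, the suppression can be limited to the single call. This is a sketch under two assumptions: the exact wording of the warning may differ between pandas versions, so the filter matches on the "match groups" fragment only, and the small DataFrame is made up for illustration.

```python
import warnings

import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"event_time": ["gouda", "stilton", "gruyere"]})

# The filter is active only inside the with-block and is restored afterwards
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message=".*match groups", category=UserWarning)
    mask = df["event_time"].str.contains("g(.*)", regex=True)

print(df[mask])
```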

Here is a simple example which demonstrates the problem (and solution):

# import warnings
# warnings.filterwarnings("ignore", 'This pattern has match groups') # uncomment to suppress the UserWarning

import pandas as pd

df = pd.DataFrame({ 'event_time': ['gouda', 'stilton', 'gruyere']})

urls = pd.DataFrame({'url': ['g(.*)']})   # With a capturing group, there is a UserWarning
# urls = pd.DataFrame({'url': ['g.*']})   # Without a capturing group, there is no UserWarning. Uncommenting this line avoids the UserWarning.

substr = urls.url.values.tolist()
df[df['event_time'].str.contains('|'.join(substr), regex=True)]

prints

  script.py:10: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  df[df['event_time'].str.contains('|'.join(substr), regex=True)]

Removing the capturing group from the regex pattern:

urls = pd.DataFrame({'url': ['g.*']})   

avoids the UserWarning.

Answer by climatebrad

An alternative way to get rid of the warning is to change the regex so that it uses a non-capturing group instead of a capturing group. That is the (?:...) notation.

Thus, if the capturing group is (url1|url2), it should be replaced by (?:url1|url2).
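
As a rough illustration of this suggestion, the sketch below records any warnings raised while calling str.contains with a non-capturing group. The Series contents and the pattern are made up for the example.

```python
import warnings

import pandas as pd

# Made-up data for illustration
s = pd.Series(["gouda", "stilton", "gruyere"])

# Record warnings instead of printing them, so they can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mask = s.str.contains(r"(?:gou|gru)", regex=True)  # non-capturing group

user_warnings = [w for w in caught if issubclass(w.category, UserWarning)]
print(mask.tolist())
print(len(user_warnings))
```

The grouping still works for the alternation, but no group is captured, so str.contains has nothing to warn about.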

Answer by Chankey Pathak

Since regex=True is provided, the pattern built from substr gets treated as a regex, which in your case contains capturing groups (strings enclosed in parentheses).

You get the warning because str.contains is of no use if you want to capture something: it only returns a boolean depending on whether the provided pattern is found in the string.

Obviously, you can suppress the warnings, but it's better to fix them.

Either escape the parentheses or use str.extract if you really want to capture something.
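
For the case where the captured text is actually wanted, a minimal sketch of str.extract might look like this. The Series and the pattern are made up; expand=False is used so that a single group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Made-up data for illustration
s = pd.Series(["gouda", "stilton", "gruyere"])

# Returns the text captured by the group, or a missing value where nothing matched
captured = s.str.extract(r"g(.*)", expand=False)
print(captured)
```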

Answer by Rob

You should use re.escape(yourString) on the string you are passing to contains. This treats the string as literal text, so it only applies when the string is not meant to be a regex.
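
A small sketch of the escaping approach, assuming the search string is literal text that happens to contain parentheses. The data and the search string are made up; passing regex=False is shown as an equivalent route for plain substring tests.

```python
import re

import pandas as pd

# Made-up data for illustration
s = pd.Series(["price (usd)", "price usd", "cost (eur)"])
literal = "(usd)"  # literal text, not a regex

# Escaping turns the parentheses into literal characters, so no group is created
escaped_mask = s.str.contains(re.escape(literal), regex=True)

# For a plain substring test, regex=False avoids regex handling altogether
plain_mask = s.str.contains(literal, regex=False)

print(escaped_mask.tolist())
print(plain_mask.tolist())
```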

Answer by toto_tico

You can use str.match instead. In your code:

res = df[df['event_time'].str.match('|'.join(substr))]  # str.match has no regex parameter




Explanation

The warning is triggered by str.contains when the regular expression includes groups, e.g. in the regex r'foo(bar)', the (bar) part is considered a group because it is in parentheses. Therefore you could theoretically extract that part with the regex.

However, the warning doesn't make sense in the first place: contains is only supposed to "test if pattern or regex is contained within a string of a Series or Index" (pandas documentation). There is nothing about extracting groups.

In any case, str.match does not raise the warning and currently behaves almost the same as str.contains, except that (1) the pattern must match at the beginning of the string (it uses re.match rather than re.search, so it need not match the whole string) and (2) regex matching cannot be deactivated in str.match (str.contains has a regex parameter for that).
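
The difference in anchoring can be seen in a short sketch. The Series is made up for illustration; str.fullmatch is included only to contrast "match at the start" with "match the whole string".

```python
import pandas as pd

# Made-up data for illustration
s = pd.Series(["gouda", "big gouda", "stilton"])

anywhere = s.str.contains(r"gou")   # pattern may occur anywhere in the string
at_start = s.str.match(r"gou")      # pattern must match at the start of the string
whole = s.str.fullmatch(r"gou")     # pattern must match the entire string

print(anywhere.tolist())
print(at_start.tolist())
print(whole.tolist())
```

If the URLs in the data always begin with the domain, as in the sample rows, the start anchoring of str.match may be acceptable; otherwise str.contains with non-capturing groups is the closer equivalent.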