Original URL: http://stackoverflow.com/questions/39901550/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Python: UserWarning: This pattern has match groups. To actually get the groups, use str.extract
Asked by Petr Petrov
I have a dataframe and I am trying to select the rows where one of the columns contains certain strings. The df looks like:
member_id,event_path,event_time,event_duration
30595,"2016-03-30 12:27:33",yandex.ru/,1
30595,"2016-03-30 12:31:42",yandex.ru/,0
30595,"2016-03-30 12:31:43",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:44",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:45",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:46",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:49",kinogo.co/,1
30595,"2016-03-30 12:32:11",kinogo.co/melodramy/,0
And another df with URL patterns:
url
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_bq_phoenix
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_
003\.ru\/sonyxperia
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony\/brands5D5Bbr_23
1click\.ru\/sonyxperia
1click\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/chasy-motorola
I use:
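import pandas as pd

# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; use on_bad_lines='skip' there.
urls = pd.read_csv('relevant_url1.csv', error_bad_lines=False)
substr = urls.url.values.tolist()
data = pd.read_csv('data_nts2.csv', error_bad_lines=False, chunksize=50000)
result = pd.DataFrame()
for i, df in enumerate(data):
    # Keep the rows whose event_time value matches any of the URL patterns
    res = df[df['event_time'].str.contains('|'.join(substr), regex=True)]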
but it returns:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
How can I fix that?
Accepted answer by unutbu
At least one of the regex patterns in urls must use a capturing group. str.contains only returns True or False for each row in df['event_time']; it does not make use of the capturing group. Thus, the UserWarning is alerting you that the regex uses a capturing group but the captured match is not used.
If you wish to remove the UserWarning, you could find and remove the capturing group from the regex pattern(s). They are not shown in the regex patterns you posted, but they must be there in your actual file. Look for parentheses outside of the character classes.
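One way to locate the offending patterns is to compile each one and check its groups attribute, which counts the capturing groups. The sketch below is only illustrative: the patterns list is hypothetical and stands in for the rows of the url column.

```python
import re

# Hypothetical patterns standing in for the rows of the url column.
patterns = [
    r'003\.ru\/sonyxperia',
    r'1click\.ru\/(phones|tablets)',   # parentheses outside a character class
    r'kinogo\.co\/[a-z()]+\/',         # parentheses inside [...] are plain literals
]

# re.compile(p).groups is the number of capturing groups in pattern p.
offenders = [p for p in patterns if re.compile(p).groups > 0]
print(offenders)
```

Only the second pattern is reported, because parentheses inside a character class do not form a group.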
Alternatively, you could suppress this particular UserWarning by putting
import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')
before the call to str.contains.
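If the filter should apply to a single call only rather than to the whole program, it can be scoped with warnings.catch_warnings. This is a minimal sketch: noisy_contains is a hypothetical stand-in that emits the same message, used here so the snippet runs without the original CSV files.

```python
import warnings

def noisy_contains():
    # Hypothetical stand-in for the str.contains call; it emits the same message.
    warnings.warn(
        "This pattern has match groups. To actually get the groups, use str.extract.",
        UserWarning,
    )
    return True

# The filter is active only inside the with block; the previous
# warning settings are restored when the block exits.
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore", message="This pattern has match groups", category=UserWarning
    )
    result = noisy_contains()
```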
Here is a simple example which demonstrates the problem (and solution):
# import warnings
# warnings.filterwarnings("ignore", 'This pattern has match groups') # uncomment to suppress the UserWarning
import pandas as pd
df = pd.DataFrame({ 'event_time': ['gouda', 'stilton', 'gruyere']})
urls = pd.DataFrame({'url': ['g(.*)']}) # With a capturing group, there is a UserWarning
# urls = pd.DataFrame({'url': ['g.*']}) # Without a capturing group, there is no UserWarning. Uncommenting this line avoids the UserWarning.
substr = urls.url.values.tolist()
df[df['event_time'].str.contains('|'.join(substr), regex=True)]
prints
script.py:10: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
df[df['event_time'].str.contains('|'.join(substr), regex=True)]
Removing the capturing group from the regex pattern:
urls = pd.DataFrame({'url': ['g.*']})
avoids the UserWarning.
Answer by climatebrad
An alternative way to get rid of the warning is to change the regex so that it uses a non-capturing group instead of a capturing group. That is the (?:) notation.
Thus, if the capturing group is (url1|url2), it should be replaced by (?:url1|url2).
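The difference can be checked with the standard re module. The sketch below compares the two forms on a few hypothetical sample strings: both match the same inputs, but only the first defines a capturing group.

```python
import re

capturing = re.compile(r'(url1|url2)')
non_capturing = re.compile(r'(?:url1|url2)')

# Number of capturing groups in each compiled pattern.
group_counts = (capturing.groups, non_capturing.groups)

# Hypothetical sample strings; both patterns agree on every one of them.
samples = ['see url1 here', 'url2', 'nothing']
same_matches = all(
    bool(capturing.search(s)) == bool(non_capturing.search(s)) for s in samples
)
print(group_counts, same_matches)
```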
Answer by Chankey Pathak
Since regex=True is provided, the pattern built from substr gets treated as a regex, which in your case contains capturing groups (strings enclosed in parentheses).
You get the warning because, if you want to capture something, str.contains is of no use: it only returns a boolean depending on whether the provided pattern is contained within the string or not.
Obviously, you can suppress the warnings, but it's better to fix them.
Either escape the parenthesis blocks or use str.extract if you really want to capture something.
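For the case where the captured text is actually wanted, a minimal str.extract sketch might look like this. The sample values reuse the cheese names from the example in the accepted answer, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample values, reusing the cheese names from the earlier example.
s = pd.Series(['gouda', 'stilton', 'gruyere'])

# With expand=False and a single capturing group, str.extract returns a Series
# holding the captured text, or NaN where the pattern did not match.
captured = s.str.extract(r'g(.*)', expand=False)
print(captured)
```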
Answer by Rob
You should use re.escape(yourString) for the string you are passing to contains.
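A short sketch of what re.escape does, using a hypothetical literal string. Note that escaping makes every character match literally, so this suits plain substrings rather than patterns that are meant to be regexes, such as the ones in the question.

```python
import re

literal = 'price(usd)'            # hypothetical literal text to search for
escaped = re.escape(literal)      # the parentheses are now backslash-escaped

# The escaped pattern has no capturing groups and matches the literal text.
n_groups = re.compile(escaped).groups
found = re.search(escaped, 'total price(usd): 10') is not None
print(escaped, n_groups, found)
```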
Answer by toto_tico
You can use str.match instead. In your code (str.match has no regex parameter, so it is omitted):
res = df[df['event_time'].str.match('|'.join(substr))]
Explanation
The warning is triggered by str.contains when the regular expression includes groups, e.g. in the regex r'foo(bar)', the (bar) part is considered a group because it is in parentheses. Therefore you could theoretically extract that from a regex.
However, the warning doesn't make sense in the first place: contains is supposed to only "test if pattern or regex is contained within a string of a Series or Index" (pandas documentation). There is nothing about extracting groups.
In any case, str.match does not throw the warning, and currently does almost the same as str.contains, except that (1) the match must start at the beginning of the string and (2) one cannot deactivate regex in str.match (str.contains has a regex parameter to deactivate it).
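The anchoring difference matters when swapping one method for the other. The sketch below uses the standard re module and assumes, as the pandas documentation describes, that str.contains searches anywhere in each string (like re.search) while str.match only matches from the start (like re.match). The URL is a hypothetical example.

```python
import re

pattern = re.compile(r'kinogo\.co\/')
url = 'www.kinogo.co/melodramy/'   # hypothetical value with a leading "www."

# search looks anywhere in the string; match only at the very beginning.
found_anywhere = pattern.search(url) is not None
found_at_start = pattern.match(url) is not None
print(found_anywhere, found_at_start)
```

If the patterns may occur in the middle of the value, str.match would drop rows that str.contains keeps.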