Python regex matching error
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/1268761/
Regex Matching Error
Asked by Btibert3
I am new to Python (I don't have any programming training either), so please keep that in mind as I ask my question.
I am trying to search a retrieved webpage and find all links using a specified pattern. I have done this successfully in other scripts, but I am getting an error that says
raise error, v # invalid expression
sre_constants.error: multiple repeat
I have to admit I do not know why, but again, I am new to Python and regular expressions. However, even when I don't use patterns and use a specific link (just to test the matching), I do not believe I get any matches (nothing is sent to the window when I print match.group(0)). The link I tested is commented out below.
Any ideas? It usually is easier for me to learn by example, but any advice you can give is greatly appreciated!
Brock
import urllib2
from BeautifulSoup import BeautifulSoup
import re
url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
pattern = r'<a href="http://forums.epicgames.com/archive/index.php?t-([0-9]+).html">(.?+)</a> <i>((.?+) replies)'
#pattern = r'href="http://forums.epicgames.com/archive/index.php?t-622233.html">Gears of War 2: Horde Gameplay</a> <i>(20 replies)'
for match in re.finditer(pattern, page, re.S):
    print match(0)
Accepted answer by hughdbrown
import urllib2
import re
from BeautifulSoup import BeautifulSoup
url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
# Get all the links
links = [str(match) for match in soup('a')]
s = r'<a href="http://forums.epicgames.com/archive/index.php\?t-\d+.html">(.+?)</a>'
r = re.compile(s)
for link in links:
    m = r.match(link)
    if m:
        print m.groups(1)[0]
Answer by retracile
You need to escape the literal '?' and the literal '(' and ')' that you are trying to match.
Also, instead of '?+', I think you're looking for the non-greedy matching provided by '+?'.
For your case, try this:
pattern = r'<a href="http://forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a> <i>\((.+?) replies\)'
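As a quick check of the advice above, here is a minimal sketch that runs a corrected pattern against a single line of markup. It is written for Python 3, whereas the code in this thread is Python 2. The `sample` string is an assumption modeled on the link commented out in the question, not the real page. The dots are escaped as well, which goes slightly beyond what the answer suggests.

```python
import re

# Assumed sample, modeled on the commented-out link in the question.
# The real page markup may differ.
sample = (
    '<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
    'Gears of War 2: Horde Gameplay</a> <i>(20 replies)'
)

# Literal '?', '(' and ')' are escaped; '+?' makes each group non-greedy.
# Escaping the dots is an extra precaution not mentioned in the answer.
pattern = (
    r'<a href="http://forums\.epicgames\.com/archive/index\.php\?t-([0-9]+)\.html">'
    r'(.+?)</a> <i>\((.+?) replies\)'
)

# Each match yields (thread id, title, reply count) as strings.
matches = [m.groups() for m in re.finditer(pattern, sample, re.S)]
print(matches)
```

Each tuple holds the thread id, the link text, and the reply count, so the loop in the question could print `m.group(2)` to get just the title.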
Answer by Unknown
That means your regular expression has an error.
(.?+)</a> <i>((.?+)
What does ?+ mean? Both ? and + are metacharacters that do not make sense right next to each other. Maybe you forgot to escape the '?' or something.
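One way to find this kind of mistake before running a whole script is to compile the pattern on its own and look at the error. The sketch below is Python 3, and `check_pattern` is a hypothetical helper name, not part of the `re` module. It uses `a**` as the failing example because Python 3.11 and later accept `?+` as a possessive quantifier, so the asker's exact fragment may compile there even though it still would not mean what was intended.

```python
import re

def check_pattern(pattern):
    """Return None if the pattern compiles, otherwise the error message."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# Two repeat operators stacked directly on each other are rejected.
print(check_pattern(r'a**'))

# A non-greedy group, as suggested in the other answers, compiles fine.
print(check_pattern(r'(.+?)'))

# The fragment from the question; the result depends on the Python version.
print(check_pattern(r'(.?+)'))
```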
Answer by Ned Deily
As you're discovering, parsing arbitrary HTML is not easy to do correctly. That's what packages like Beautiful Soup do. Note that you're calling it in your script but then not using the results. Refer to its documentation for examples of how to make your task a lot easier!
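To illustrate the parser-based approach without relying on any particular Beautiful Soup version, the sketch below swaps in Python 3's standard-library `html.parser` instead. It is not the Beautiful Soup API, only a stand-in showing the same idea of letting a parser find the links. `LinkCollector` is a hypothetical class name, and the `sample` markup is an assumption modeled on the link in the question.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

# Assumed sample markup; the real page may differ.
sample = (
    '<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
    'Gears of War 2: Horde Gameplay</a> <i>(20 replies)</i>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

With the links extracted this way, a much smaller regular expression (or a plain string check on the href) is enough to pick out the thread links.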
Answer by machineghost
To extend on what others wrote:
.? means "one or zero of any character"
.+ means "one or more of any character"
As you can hopefully see, combining the two makes no sense; they are different and contradictory "repeat" characters. So, your error about "multiple repeats" is because you combined those two "repeat" characters in your regular expression. To fix it, just decide which one you actually meant to use, and delete the other.
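The difference between these quantifiers is easier to see on a small string. The following Python 3 sketch uses a made-up `text` value (an assumption, not taken from the forum page) and also shows `.+?`, the non-greedy form that the other answers suggest was probably intended.

```python
import re

# Made-up input with two adjacent tags, chosen to expose greedy matching.
text = "<b>one</b><b>two</b>"

# '.?' matches at most one character (and matches nothing on empty input).
first_char = re.match(r'.?', text).group(0)
empty_ok = re.match(r'.?', '').group(0)

# '.+' is greedy: it runs to the last '</b>' it can still match before.
greedy = re.search(r'<b>(.+)</b>', text).group(1)

# '.+?' is non-greedy: it stops at the first '</b>'.
lazy = re.search(r'<b>(.+?)</b>', text).group(1)

print(repr(first_char), repr(empty_ok))
print(greedy)
print(lazy)
```

On a page full of links, the greedy form would swallow everything between the first opening tag and the last closing tag, which is why the non-greedy form is usually the one wanted here.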