Regex pattern in Python for parsing HTML title tags
Original question: http://stackoverflow.com/questions/20045955/
Source: Stack Overflow. This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and link to the original question.
Asked by rahuL
I am learning to use both the re module and the urllib module in Python and am attempting to write a simple web scraper. Here's the code I've written to scrape just the titles of websites:
#!/usr/bin/python
import urllib
import re
urls=["http://google.com","https://facebook.com","http://reddit.com"]
i=0
these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)
while(i<len(urls)):
    htmlfile=urllib.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext)
    print titles
    i+=1
This gives the correct output for Google and Reddit but not for Facebook:
['Google']
[]
['reddit: the front page of the internet']
I found that this is because the title tag on Facebook's page looks like this: <title id="pageTitle">. To accommodate the additional id= attribute, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:
[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]
How would I combine both so that I can take into account any additional attributes passed within the title tag?
Accepted answer by Martijn Pieters
You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from. I recommend you use BeautifulSoup, a popular third-party library.
BeautifulSoup example:
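import urllib2
from bs4 import BeautifulSoup
url = "https://facebook.com"  # any page URL; this one is taken from the question
response = urllib2.urlopen(url)
# Pass the charset from the HTTP headers (if any) so the bytes are decoded correctly
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text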
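If installing a third-party package is not an option, the standard library's html.parser module can do the same job. The following is a minimal Python 3 sketch rather than part of the original answer; the TitleParser class, the extract_title helper, and the sample markup are hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as id="pageTitle" arrive in attrs and are simply ignored.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect the pieces.
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the stripped title text, or None if no title was found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    title = "".join(parser.parts).strip()
    return title or None


sample = '<html><head><title id="pageTitle">Welcome to Facebook</title></head></html>'
print(extract_title(sample))
```

Because the parser handles the tag syntax itself, extra attributes on the title tag need no special treatment.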
Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
Your specific problem can be solved by optionally matching additional characters within the title tag:
r'<title[^>]*>([^<]+)</title>'
Here [^>]* matches 0 or more characters that are not the closing > bracket. The '0 or more' lets you match both a tag with extra attributes and the plain <title> tag.
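The difference between the three patterns discussed so far can be checked on two small strings. This is a Python 3 sketch; the sample strings are simplified stand-ins for the real pages, and the dictionary keys are labels made up for this illustration.

```python
import re

# Simplified stand-ins for the title tags described in the question.
plain = "<title>Google</title>"
with_id = '<title id="pageTitle">Welcome to Facebook</title>'

patterns = {
    "original": r"<title>(.+?)</title>",            # no attributes allowed
    "one_or_more": r"<title.+?>(.+?)</title>",      # requires at least one extra character
    "zero_or_more": r"<title[^>]*>([^<]+)</title>",  # attributes are optional
}

# For each pattern, collect the matches on the plain tag and on the tag with an id.
results = {
    name: [re.findall(pattern, text) for text in (plain, with_id)]
    for name, pattern in patterns.items()
}

for name, found in results.items():
    print(name, found)
# original [['Google'], []]
# one_or_more [[], ['Welcome to Facebook']]
# zero_or_more [['Google'], ['Welcome to Facebook']]
```

The one_or_more pattern fails on the plain tag because .+? must consume at least one character before the closing >, which a bare <title> does not have.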
Answer by K DawG
It is recommended that you use Beautiful Soup or any other parser to parse HTML, but if you badly want a regex, the following pattern will do the job.
The regex code:
<title.*?>(.+?)</title>
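How it works: <title.*?> matches the opening tag together with any attributes, (.+?) lazily captures the title text, and </title> matches the closing tag.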
Produces:
['Google']
['Welcome to Facebook - Log In, Sign Up or Learn More']
['reddit: the front page of the internet']
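One caveat with this pattern: by default, . does not match a newline and the match is case-sensitive, so a title written in uppercase or spread over several lines is missed. The Python 3 sketch below shows one way to loosen it with the re.IGNORECASE and re.DOTALL flags; the sample markup is invented for illustration and real pages may be formatted differently.

```python
import re

# Hypothetical markup: uppercase tag, an attribute, and a title spanning several lines.
html_text = """<head>
<TITLE id="pageTitle">
  Example page
</TITLE>
</head>"""

strict = re.compile(r"<title.*?>(.+?)</title>")
tolerant = re.compile(r"<title.*?>(.+?)</title>", re.IGNORECASE | re.DOTALL)

# The strict pattern finds nothing because the tag is uppercase.
strict_titles = strict.findall(html_text)

# The tolerant pattern matches; strip() removes the surrounding whitespace.
titles = [t.strip() for t in tolerant.findall(html_text)]

print(strict_titles)  # []
print(titles)         # ['Example page']
```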
Answer by Harsh Gupta
If you wish to identify the HTML tags in a page, you can use the following. Note that this pattern only matches lowercase opening tags without attributes, such as <html> or <body>.
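import re
# html is assumed to hold the page source as a string
batRegex = re.compile(r'(<[a-z]*>)')
m1 = batRegex.search(html)  # first match only
print batRegex.findall(html)  # list of every match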
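To see what this pattern does and does not pick up, here is a small Python 3 sketch run on an invented snippet. The broader pattern in it is an assumption added for comparison, not part of the original answer, and it would still break on attribute values that contain a > character.

```python
import re

# Invented sample markup for illustration.
html_text = ('<html><head><title id="pageTitle">Hi</title></head>'
             '<body><h1>Hi</h1></body></html>')

# Pattern from the answer: lowercase letters only, no attributes, no closing tags.
narrow = re.compile(r"(<[a-z]*>)").findall(html_text)

# Broader sketch: optional slash, a tag name with letters and digits, optional attributes.
broad = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*[^>]*>").findall(html_text)

print(narrow)  # ['<html>', '<head>', '<body>']
print(broad)   # all ten opening and closing tags, including <title id="pageTitle">
```

The narrow pattern skips <title id="pageTitle"> (it has an attribute), <h1> (it contains a digit), and every closing tag.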


