How to extract URLs from an HTML page in Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/15517483/
Asked by user2189704
I have to write a web crawler in Python. I don't know how to parse a page and extract the URLs from HTML. Where should I go and study to write such a program?
In other words, is there a simple python program which can be used as a template for a generic web crawler? Ideally it should use modules which are relatively simple to use and it should include plenty of comments to describe what each line of code is doing.
Answered by Shankar
Look at the example code below. The script fetches the HTML of a web page (here, the Python home page) and extracts all the links on that page. Hope this helps.
#!/usr/bin/env python
import requests
from bs4 import BeautifulSoup

url = "http://www.python.org"
response = requests.get(url)
# parse the HTML, then flatten it back to a string for manual scanning
page = str(BeautifulSoup(response.content, "html.parser"))


def getURL(page):
    """
    :param page: html of web page (here: Python home page)
    :return: urls in that page
    """
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote


while True:
    url, n = getURL(page)
    page = page[n:]  # drop the part already scanned
    if url:
        print(url)
    else:
        break
Output:
/
#left-hand-navigation
#content-body
/search
/about/
/news/
/doc/
/download/
/getit/
/community/
/psf/
http://docs.python.org/devguide/
/about/help/
http://pypi.python.org/pypi
/download/releases/2.7.3/
http://docs.python.org/2/
/ftp/python/2.7.3/python-2.7.3.msi
/ftp/python/2.7.3/Python-2.7.3.tar.bz2
/download/releases/3.3.0/
http://docs.python.org/3/
/ftp/python/3.3.0/python-3.3.0.msi
/ftp/python/3.3.0/Python-3.3.0.tar.bz2
/community/jobs/
/community/merchandise/
/psf/donations/
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/Languages
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.psfmember.org
...
...
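A side note (not part of the original answer): the script above already builds a BeautifulSoup parse tree before flattening it to a string, so the same links can be read directly from the tree instead of scanning for quote characters:

soup = BeautifulSoup(response.content, "html.parser")
for a in soup.find_all("a", href=True):  # only anchor tags that actually carry an href
    print(a["href"])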
Answered by TerryA
For parsing pages, check out the BeautifulSoup module. It's simple to use and lets you parse pages containing HTML. You can extract URLs from the HTML simply by doing str.find('a').
Answered by Sushant Gupta
You can use beautifulsoup. Follow the documentation and see what matches your requirements. The documentation contains code snippets for how to extract URLs as well.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc holds the page's HTML as a string
soup.find_all('a')  # finds all <a> tags; each tag's 'href' attribute holds the URL
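For instance, html_doc could be fetched with requests (a hypothetical usage sketch; the URL is just an example):

import requests

html_doc = requests.get("http://www.python.org").text  # the raw HTML of the page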
Answered by Scy
import re
import urllib2
import urlparse

tocrawl = set(["http://www.facebook.com/"])
crawled = set([])
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

while 1:
    try:
        crawling = tocrawl.pop()
        print crawling
    except KeyError:
        break  # nothing left to crawl
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except:
        continue
    msg = response.read()
    # pull the page title, if any
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos + 7)
        if endPos != -1:
            title = msg[startPos + 7:endPos]
            print title
    # pull the meta keywords, if any
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0]
        keywordlist = keywordlist.split(", ")
        print keywordlist
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in (links.pop(0) for _ in xrange(len(links))):
        # turn relative links into absolute URLs
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)
Reference: Python Web Crawler in Less Than 50 Lines (the site is slow or no longer works; it does not load for me)
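A side note (not part of the original answer): the manual prefixing inside the loop above can be replaced by urlparse.urljoin, which resolves relative paths, fragments, and scheme-relative links against the current page in one call:

# inside the for loop, instead of the startswith() chain:
link = urlparse.urljoin(crawling, link)  # e.g. ('http://example.com/a/', '../b') -> 'http://example.com/b'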
Answered by pradyunsg
You can use BeautifulSoup, as many have also stated. It can parse HTML, XML, etc. To see some of its features, see here.
Example:
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.google.co.in/'
conn = urllib2.urlopen(url)
html = conn.read()
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')
for tag in links:
    link = tag.get('href', None)
    if link is not None:
        print link
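Note that urllib2 exists only in Python 2; it was folded into urllib.request in Python 3. The same example under Python 3, as a sketch:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.google.co.in/').read()
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all('a'):
    link = tag.get('href')
    if link is not None:
        print(link)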

