How to extract URLs from an HTML page in Python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/15517483/



Tags: python, url, web-crawler

Asked by user2189704

I have to write a web crawler in Python. I don't know how to parse a page and extract the URLs from HTML. Where should I go and study to write such a program?

In other words, is there a simple python program which can be used as a template for a generic web crawler? Ideally it should use modules which are relatively simple to use and it should include plenty of comments to describe what each line of code is doing.

Answered by Shankar

Look at the example code below. The script fetches the HTML of a web page (here the Python home page) and extracts all the links on that page. Hope this helps.

#!/usr/bin/env python

import requests
from bs4 import BeautifulSoup

url = "http://www.python.org"
response = requests.get(url)
# parse the html into one plain string we can scan for links
page = str(BeautifulSoup(response.content, "html.parser"))


def getURL(page):
    """

    :param page: html of web page (here: Python home page) 
    :return: urls in that page 
    """
    start_link = page.find("a href")             # locate the next anchor tag
    if start_link == -1:                         # no more links on the page
        return None, 0
    start_quote = page.find('"', start_link)     # opening quote of the href value
    end_quote = page.find('"', start_quote + 1)  # closing quote
    url = page[start_quote + 1: end_quote]
    return url, end_quote

while True:
    url, n = getURL(page)
    if url is None:      # no more links found
        break
    page = page[n:]      # resume scanning after the last match
    print(url)

Output:

/
#left-hand-navigation
#content-body
/search
/about/
/news/
/doc/
/download/
/getit/
/community/
/psf/
http://docs.python.org/devguide/
/about/help/
http://pypi.python.org/pypi
/download/releases/2.7.3/
http://docs.python.org/2/
/ftp/python/2.7.3/python-2.7.3.msi
/ftp/python/2.7.3/Python-2.7.3.tar.bz2
/download/releases/3.3.0/
http://docs.python.org/3/
/ftp/python/3.3.0/python-3.3.0.msi
/ftp/python/3.3.0/Python-3.3.0.tar.bz2
/community/jobs/
/community/merchandise/
/psf/donations/
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/Languages
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.psfmember.org

...

...

Answered by TerryA

For parsing pages, check out the BeautifulSoup module. It's simple to use and lets you parse HTML pages; you can extract the URLs from the HTML simply with soup.find_all('a').

Don't use regular expressions for parsing HTML

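A minimal sketch of that approach (an editor's illustration; the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# fetch and parse the page
response = requests.get("http://www.python.org")
soup = BeautifulSoup(response.text, "html.parser")

# href=True keeps only the anchors that actually carry a URL
for a in soup.find_all("a", href=True):
    print(a["href"])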

Answered by Sushant Gupta

You can use BeautifulSoup. Follow the documentation and see what matches your requirements; it also contains code snippets showing how to extract URLs.

from bs4 import BeautifulSoup

# html_doc holds the HTML of the page you want to parse
soup = BeautifulSoup(html_doc, "html.parser")

soup.find_all('a')  # finds every anchor (<a>) tag in the document
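Note that find_all('a') returns Tag objects rather than URL strings; to get the URLs themselves, read each tag's href attribute. A small sketch along the same lines:

# collect the href attribute of every anchor, skipping anchors without one
urls = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(urls)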

Answered by Scy

import re
import urllib.parse
import urllib.request

tocrawl = {"http://www.facebook.com/"}  # frontier: URLs still to visit
crawled = set()                         # URLs already visited
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

while tocrawl:
    crawling = tocrawl.pop()
    print(crawling)
    url = urllib.parse.urlparse(crawling)
    try:
        response = urllib.request.urlopen(crawling)
    except Exception:
        continue
    msg = response.read().decode("utf-8", errors="ignore")
    # print the page title, if there is one
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos + 7)
        if endPos != -1:
            title = msg[startPos + 7:endPos]
            print(title)
    # print the meta keywords, if there are any
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0].split(", ")
        print(keywordlist)
    # queue every link on the page that has not been crawled yet
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in links:
        if link.startswith('/'):           # root-relative link
            link = 'http://' + url.netloc + link
        elif link.startswith('#'):         # fragment on the current page
            link = 'http://' + url.netloc + url.path + link
        elif not link.startswith('http'):  # document-relative link
            link = 'http://' + url.netloc + '/' + link
        if link not in crawled:
            tocrawl.add(link)

Reference: Python Web Crawler in Less Than 50 Lines (slow or no longer works; the page does not load for me)

Answered by pradyunsg

You can use BeautifulSoup, as many have also stated. It can parse HTML, XML, etc. To see some of its features, see here.

Example:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.google.co.in/'

# fetch the page
conn = urllib.request.urlopen(url)
html = conn.read()

# parse it and collect every anchor tag
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')

# print the href attribute of each anchor that has one
for tag in links:
    link = tag.get('href')
    if link is not None:
        print(link)
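A follow-up note (an editor's sketch, not part of the original answer): many extracted hrefs are relative, as in the output earlier on this page. urllib.parse.urljoin resolves them against the page they came from, which avoids the manual prefixing done in the crawler above; the example paths here are just placeholders.

from urllib.parse import urljoin

base = 'http://www.google.co.in/'
# a root-relative href resolves against the site root
print(urljoin(base, '/intl/en/about.html'))   # http://www.google.co.in/intl/en/about.html
# an absolute href passes through unchanged
print(urljoin(base, 'http://example.com/x'))  # http://example.com/x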