
Warning: this page is a Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/4462061/


Beautiful Soup to parse url to get another urls data

python · html · parsing · beautifulsoup

提问by tim

I need to parse a url to get a list of urls that link to a detail page. Then, from each detail page, I need to get all the details. I need to do it this way because the detail page urls change irregularly rather than incrementing predictably, but the event list page stays the same.


Basically:


example.com/events/
    <a href="http://example.com/events/1">Event 1</a>
    <a href="http://example.com/events/2">Event 2</a>

example.com/events/1
    ...some detail stuff I need

example.com/events/2
    ...some detail stuff I need

采纳答案by Tauquir

# Python 2 / BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)
# findAll('a', href=True) keeps only anchors that carry an href attribute
for anchor in soup.findAll('a', href=True):
    print anchor['href']

It will give you the list of urls. Now you can iterate over those urls and parse the data.


  • inner_div = soup.findAll("div", {"id": "y-shade"}) — this is an example. You can go through the BeautifulSoup tutorials.
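For a BeautifulSoup 4 / Python 3 setting, the same link-collection step from this answer can be sketched offline by parsing an HTML string directly (the listing markup below is a hypothetical stand-in for example.com/events/):

```python
from bs4 import BeautifulSoup

# Hypothetical listing-page HTML, standing in for example.com/events/
listing_html = (
    '<a href="http://example.com/events/1">Event 1</a>'
    '<a href="http://example.com/events/2">Event 2</a>'
)

soup = BeautifulSoup(listing_html, 'html.parser')
# href=True keeps only anchors that actually carry an href attribute
urls = [a['href'] for a in soup.find_all('a', href=True)]
```

`find_all` is the BS4 spelling of BS3's `findAll`; both work in BS4.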

回答by line break

Use urllib2 to get the page, then use Beautiful Soup to get the list of links; also try scraperwiki.com


Edit:


Recent discovery: Using BeautifulSoup through lxml with


from lxml.html.soupparser import fromstring

is miles better than just BeautifulSoup. It lets you do dom.cssselect('your selector') which is a life saver. Just make sure you have a good version of BeautifulSoup installed. 3.2.1 works a treat.


dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]
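If lxml isn't an option, BeautifulSoup 4 offers similar CSS-selector support of its own via `select()` — a minimal sketch on a hypothetical navigation snippet matching the `#navigation a` selector above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup matching the '#navigation a' selector
html = '<div id="navigation"><a href="/home">Home</a><a href="/about">About</a></div>'

dom = BeautifulSoup(html, 'html.parser')
# select() takes a CSS selector and returns matching Tag objects
navigation_links = [a.get('href') for a in dom.select('#navigation a')]
```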

回答by disuse

For the next group of people that come across this: as of this post, BeautifulSoup has been upgraded to v4, since v3 is no longer being updated.


$ easy_install beautifulsoup4

$ pip install beautifulsoup4

To use in Python...


from bs4 import BeautifulSoup

回答by Sevenearths

FULL PYTHON 3 EXAMPLE


Packages


# urllib is part of the Python 3 standard library; only beautifulsoup4 needs installing
# pip3 install beautifulsoup4

Example:


import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://www.wikipedia.org/') as f:
    data = f.read().decode('utf-8')

d = BeautifulSoup(data, 'html.parser')

print(d.title.string)

The above should print out 'Wikipedia'

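Tying this back to the original question, the full two-step pipeline — parse the stable listing page for links, then visit each detail page — can be sketched with the same BeautifulSoup 4 stack. The URLs and markup below are hypothetical, and the `fetch` helper stands in for `urllib.request.urlopen(url).read().decode('utf-8')` so the sketch runs without network access:

```python
from bs4 import BeautifulSoup

# Canned pages standing in for network fetches (hypothetical URLs and markup)
PAGES = {
    'http://example.com/events/': (
        '<a href="http://example.com/events/1">Event 1</a>'
        '<a href="http://example.com/events/2">Event 2</a>'
    ),
    'http://example.com/events/1': '<h1>Event 1</h1><p>Details for event 1</p>',
    'http://example.com/events/2': '<h1>Event 2</h1><p>Details for event 2</p>',
}

def fetch(url):
    # Stand-in for urllib.request.urlopen(url).read().decode('utf-8')
    return PAGES[url]

# Step 1: parse the stable listing page for detail-page links
listing = BeautifulSoup(fetch('http://example.com/events/'), 'html.parser')
event_urls = [a['href'] for a in listing.find_all('a', href=True)]

# Step 2: visit each detail page and pull out the fields
details = {}
for url in event_urls:
    page = BeautifulSoup(fetch(url), 'html.parser')
    details[url] = page.find('p').get_text()
```

Because only the listing page's URL is hard-coded, the detail pages can change their URLs freely between runs.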