
Warning: this page is a Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/4462061/


Beautiful Soup to parse url to get another urls data

python · html · parsing · beautifulsoup

提问by tim

I need to parse a url to get a list of urls that link to a detail page. Then, from each detail page, I need to get all the details. I need to do it this way because the detail page urls change irregularly rather than incrementing predictably, but the event list page stays the same.


Basically:


example.com/events/
    <a href="http://example.com/events/1">Event 1</a>
    <a href="http://example.com/events/2">Event 2</a>

example.com/events/1
    ...some detail stuff I need

example.com/events/2
    ...some detail stuff I need

采纳答案by Tauquir

# Python 2 / BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)
# findAll('a', href=True) keeps only anchors that carry an href attribute
for anchor in soup.findAll('a', href=True):
    print anchor['href']

It will give you the list of urls. Now you can iterate over those urls and parse the data.


  • inner_div = soup.findAll("div", {"id": "y-shade"}) — this is an example. You can go through the BeautifulSoup tutorials.
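For a BeautifulSoup 4 / Python 3 setting, the same link-collection step from this answer can be sketched offline by parsing an HTML string directly (the listing markup below is a hypothetical stand-in for example.com/events/):

```python
from bs4 import BeautifulSoup

# Hypothetical listing-page HTML, standing in for example.com/events/
listing_html = (
    '<a href="http://example.com/events/1">Event 1</a>'
    '<a href="http://example.com/events/2">Event 2</a>'
)

soup = BeautifulSoup(listing_html, 'html.parser')
# href=True keeps only anchors that actually carry an href attribute
urls = [a['href'] for a in soup.find_all('a', href=True)]
```

`find_all` is the BS4 spelling of BS3's `findAll`; both work in BS4.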

回答by line break

Use urllib2 to get the page, then use Beautiful Soup to get the list of links; also try scraperwiki.com


Edit:


Recent discovery: Using BeautifulSoup through lxml with


from lxml.html.soupparser import fromstring

is miles better than just BeautifulSoup. It lets you do dom.cssselect('your selector') which is a life saver. Just make sure you have a good version of BeautifulSoup installed. 3.2.1 works a treat.


dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]
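If lxml isn't an option, BeautifulSoup 4 offers similar CSS-selector support of its own via `select()` — a minimal sketch on a hypothetical navigation snippet matching the `#navigation a` selector above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup matching the '#navigation a' selector
html = '<div id="navigation"><a href="/home">Home</a><a href="/about">About</a></div>'

dom = BeautifulSoup(html, 'html.parser')
# select() takes a CSS selector and returns matching Tag objects
navigation_links = [a.get('href') for a in dom.select('#navigation a')]
```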

回答by disuse

For the next group of people that come across this: as of this post, BeautifulSoup has been upgraded to v4, since v3 is no longer being updated.


$ easy_install beautifulsoup4

$ pip install beautifulsoup4

To use in Python...


from bs4 import BeautifulSoup

回答by Sevenearths

FULL PYTHON 3 EXAMPLE


Packages


# urllib is part of the Python 3 standard library; only beautifulsoup4 needs installing
# pip3 install beautifulsoup4

Example:


import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://www.wikipedia.org/') as f:
    data = f.read().decode('utf-8')

d = BeautifulSoup(data, 'html.parser')

print(d.title.string)

The above should print out 'Wikipedia'

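Tying this back to the original question, the full two-step pipeline — parse the stable listing page for links, then visit each detail page — can be sketched with the same BeautifulSoup 4 stack. The URLs and markup below are hypothetical, and the `fetch` helper stands in for `urllib.request.urlopen(url).read().decode('utf-8')` so the sketch runs without network access:

```python
from bs4 import BeautifulSoup

# Canned pages standing in for network fetches (hypothetical URLs and markup)
PAGES = {
    'http://example.com/events/': (
        '<a href="http://example.com/events/1">Event 1</a>'
        '<a href="http://example.com/events/2">Event 2</a>'
    ),
    'http://example.com/events/1': '<h1>Event 1</h1><p>Details for event 1</p>',
    'http://example.com/events/2': '<h1>Event 2</h1><p>Details for event 2</p>',
}

def fetch(url):
    # Stand-in for urllib.request.urlopen(url).read().decode('utf-8')
    return PAGES[url]

# Step 1: parse the stable listing page for detail-page links
listing = BeautifulSoup(fetch('http://example.com/events/'), 'html.parser')
event_urls = [a['href'] for a in listing.find_all('a', href=True)]

# Step 2: visit each detail page and pull out the fields
details = {}
for url in event_urls:
    page = BeautifulSoup(fetch(url), 'html.parser')
    details[url] = page.find('p').get_text()
```

Because only the listing page's URL is hard-coded, the detail pages can change their URLs freely between runs.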