Extracting data from a website using Python

Question

Asked by gameoverman

I recently started learning Python, and one of my first projects was to scrape updates from my son's classroom web page and send myself a notification whenever the site was updated. That turned out to be easy, so I wanted to expand on it and create a script that automatically checks whether any of our lotto numbers hit. Unfortunately, I haven't been able to figure out how to get the data from the website. Here is one of my attempts from last night.

from bs4 import BeautifulSoup
import urllib.request

webpage = "http://www.masslottery.com/games/lottery/large-winningnumbers.html"

# Download the page and parse the raw HTML as served
websource = urllib.request.urlopen(webpage)
soup = BeautifulSoup(websource.read(), "html.parser")

# Look up the span with id "winning_num_0" and print it
span = soup.find("span", {"id": "winning_num_0"})
print(span)

Here is the output:
<span id="winning_num_0"></span>

The output above is also what I see if I "view source" in a web browser. When I "Inspect Element" in the browser, I can see the winning numbers in the inspector panel. Unfortunately, I'm not even sure how or where the browser is getting the data. Is it loading from another page or from a script in the background? I thought the following tutorial was going to help me, but I wasn't able to get the data using similar commands.

http://zevross.com/blog/2014/05/16/using-the-python-library-beautifulsoup-to-extract-data-from-a-webpage-applied-to-world-cup-rankings/

Any help is appreciated. Thanks

Answer 1

Answered by Wayne Werner

If you look closely at the source of the page (I just used curl), you can see this block:

<script type="text/javascript">
    // <![CDATA[
    var dataPath = '../../';
    var json_filename = 'data/json/games/lottery/recent.json';
    var games = new Array();
    var sessions = new Array();
    // ]]>
</script>

That recent.json stuck out like a sore thumb (I actually missed the dataPath part at first).
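
For what it's worth, the two variables in that script block appear to combine into a path relative to the page itself. Here is a minimal sketch of how that resolves, assuming the page's JavaScript simply concatenates dataPath and json_filename (the script that consumes them isn't shown here, so that part is a guess). The names page_url, data_path, and json_url are just illustrative.

```python
from urllib.parse import urljoin

# Address of the page the question was scraping
page_url = "http://www.masslottery.com/games/lottery/large-winningnumbers.html"

# Values copied from the page's <script> block
data_path = "../../"
json_filename = "data/json/games/lottery/recent.json"

# Assumption: the page concatenates the two strings and requests the
# result relative to its own address. urljoin resolves the "../" parts.
json_url = urljoin(page_url, data_path + json_filename)
print(json_url)
# → http://www.masslottery.com/data/json/games/lottery/recent.json
```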

After giving that a try, I came up with this:

curl http://www.masslottery.com/data/json/games/lottery/recent.json

Which, as lari points out in the comments, is way easier than scraping HTML. This easy, in fact:

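import json
import urllib.request
from pprint import pprint

# Fetch the JSON file directly instead of the HTML page
websource = urllib.request.urlopen('http://www.masslottery.com/data/json/games/lottery/recent.json')
# Decode the response bytes to text and parse it as JSON
data = json.loads(websource.read().decode())
pprint(data)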

data is now a dict, and you can do whatever kind of dict-like things you'd like to do with it. And good luck ;)
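
Since the layout of recent.json isn't shown here, a reasonable first step is to look at which keys and value types the dict contains before writing any lotto-checking logic. Below is a rough sketch of a helper for that; outline is a made-up name, and the sample dict is purely hypothetical, NOT the real structure of the lottery data.

```python
def outline(obj, depth=0, max_depth=3):
    """Return lines describing the keys and value types of nested JSON data."""
    pad = "  " * depth
    if depth >= max_depth:
        # Stop descending so very deep data stays readable
        return [f"{pad}..."]
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            if isinstance(value, (dict, list)):
                lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Describe only the first element, assuming the rest look similar
        lines.append(f"{pad}[0]: {type(obj[0]).__name__} (of {len(obj)})")
        if isinstance(obj[0], (dict, list)):
            lines.extend(outline(obj[0], depth + 1, max_depth))
    return lines


# Hypothetical sample for illustration only, not the real recent.json layout
sample = {"games": [{"name": "Example", "numbers": [1, 2, 3]}]}
print("\n".join(outline(sample)))
```

With the real data you would call outline(data) instead of outline(sample), then index into whichever keys turn out to hold the winning numbers.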
