Extract data from a website using Python
Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/39510830/
Asked by gameoverman
I recently started learning Python, and one of my first projects was to scrape updates from my son's classroom web page and send myself a notification whenever the site was updated. That turned out to be an easy project, so I wanted to expand on it and write a script that automatically checks whether any of our lotto numbers hit. Unfortunately, I haven't been able to figure out how to get the data from the website. Here is one of my attempts from last night.
from bs4 import BeautifulSoup
import urllib.request
# Download the static HTML of the winning-numbers page.
webpage = "http://www.masslottery.com/games/lottery/large-winningnumbers.html"
websource = urllib.request.urlopen(webpage)
soup = BeautifulSoup(websource.read(), "html.parser")
# Look up the span with id "winning_num_0" in the parsed HTML.
span = soup.find("span", {"id": "winning_num_0"})
print(span)
Here is the output:
<span id="winning_num_0"></span>
The output above is also what I see if I "view source" in a web browser. When I use "Inspect Element" in the browser, I can see the winning numbers in the inspector panel. Unfortunately, I'm not even sure how or where the browser is getting the data. Is it loading from another page, or from a script in the background? I thought the following tutorial was going to help me, but I wasn't able to get the data using similar commands.
Any help is appreciated. Thanks
Answered by Wayne Werner
If you look closely at the source of the page (I just used curl), you can see this block:
<script type="text/javascript">
// <![CDATA[
var dataPath = '../../';
var json_filename = 'data/json/games/lottery/recent.json';
var games = new Array();
var sessions = new Array();
// ]]>
</script>
That recent.json stuck out like a sore thumb (I actually missed the dataPath part at first).
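To make the connection between those two variables and the final address explicit, here is a minimal sketch. It assumes the page's JavaScript simply concatenates dataPath and json_filename and requests the result relative to the page URL; that behaviour is inferred from the snippet, not confirmed. The helper js_string_var is a hypothetical name introduced for illustration, and the script text is pasted in as a string rather than downloaded.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "http://www.masslottery.com/games/lottery/large-winningnumbers.html"

# The relevant lines of the script block, pasted in as a string.
SCRIPT = """
var dataPath = '../../';
var json_filename = 'data/json/games/lottery/recent.json';
"""


def js_string_var(name, source):
    """Return the value of a simple `var name = '...';` assignment, or None.

    Hypothetical helper: it only handles single-quoted string literals.
    """
    match = re.search(r"var\s+" + re.escape(name) + r"\s*=\s*'([^']*)'", source)
    return match.group(1) if match else None


data_path = js_string_var("dataPath", SCRIPT)
json_filename = js_string_var("json_filename", SCRIPT)

# Assumption: the browser resolves dataPath + json_filename against the page URL.
json_url = urljoin(PAGE_URL, data_path + json_filename)
print(json_url)
```

Because '../../' climbs two directories up from /games/lottery/, the relative path resolves to a file under the site root.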
After giving that a try, I came up with this:
curl http://www.masslottery.com/data/json/games/lottery/recent.json
Which, as lari points out in the comments, is way easier than scraping HTML. This easy, in fact:
import json
import urllib.request
from pprint import pprint
# Fetch the JSON file that the page's JavaScript loads.
websource = urllib.request.urlopen('http://www.masslottery.com/data/json/games/lottery/recent.json')
# Decode the response bytes and parse them as JSON.
data = json.loads(websource.read().decode())
pprint(data)
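For a script that runs unattended, a slightly more defensive variant of the same fetch may be worth considering. This is only a sketch: fetch_json is a hypothetical helper name, the 10-second timeout is an arbitrary choice, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Download `url` and parse the response body as JSON.

    Hypothetical helper: closes the connection via `with`, applies a
    timeout, and uses the declared charset or UTF-8 as a fallback.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))
```

Calling fetch_json with the recent.json URL from above should give the same data object as the shorter version, while network errors still surface as exceptions from urllib.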
data is now a dict, and you can do whatever kind of dict-like things you'd like to do with it. And good luck ;)
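As one example of such dict-like things, the sketch below lists the paths inside a nested structure and counts how many of your own numbers appear in a draw. The layout of recent.json is not shown in the answer, so the sample dict, its key names, and the numbers are entirely hypothetical; key_paths and count_matches are illustrative helper names.

```python
def key_paths(obj, prefix=""):
    """Yield a path string for every leaf value in nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from key_paths(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from key_paths(value, f"{prefix}[{index}]")
    else:
        yield prefix


def count_matches(my_numbers, winning_numbers):
    """Return how many of my_numbers also appear in winning_numbers."""
    return len(set(my_numbers) & set(winning_numbers))


# HYPOTHETICAL sample: the real recent.json will have different keys and values.
sample = {"games": [{"name": "Example Game", "winning_numbers": [3, 14, 27, 35, 41]}]}

# Explore the structure first to find where the numbers live.
paths = list(key_paths(sample))

# Then compare your ticket against the draw you located.
matches = count_matches([3, 14, 20, 35, 50], sample["games"][0]["winning_numbers"])
print(paths[0], matches)
```

Running key_paths on the real data first shows which keys hold the winning numbers, and the lookup in the last step can then be adjusted to match.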