Fetch data of variables inside script tag in Python or Content added from js
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/24118337/
Fetch data of variables inside script tag in Python or Content added from js
Asked by Inforian
I want to fetch data from another URL, for which I am using urllib and Beautiful Soup. My data is inside a table tag (which I figured out using the Firefox console). But when I tried to fetch the table using its id, the result was None, so I guess this table must be added dynamically via some JS code.
I have tried both parsers, 'lxml' and 'html5lib', but I still can't get that table data.
I have also tried one more thing:
import urllib
from bs4 import BeautifulSoup

web = urllib.urlopen("my url")
html = web.read()
soup = BeautifulSoup(html, 'lxml')
js = soup.find("script")   # grabs only the first <script> tag on the page
ss = js.prettify()
print ss
Result:
<script type="text/javascript">
myPage = 'ETFs';
sectionId = 'liQuotes'; //section tab
breadCrumbId = 'qQuotes'; //page
is_dartSite = "quotes";
is_dartZone = "news";
propVar = "ETFs";
</script>
But now I don't know how I can get the data of these JS variables.
Now I have two options: either get the table content or get the JS variables. Either one would fulfil my task, but unfortunately I don't know how to get either of them, so please tell me how I can solve either of these problems.
Thanks
Answered by mhawke
EDIT
This will do the trick, using the re module to extract the data and loading it as JSON:
import urllib
import json
import re
from bs4 import BeautifulSoup

web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx")
soup = BeautifulSoup(web.read(), 'lxml')
data = soup.find_all("script")[19].string      # the data happens to live in the 20th <script> tag
p = re.compile('var table_body = (.*?);')      # non-greedy capture of the JavaScript array literal
m = p.match(data)
stocks = json.loads(m.groups()[0])             # the captured array parses as JSON
>>> for stock in stocks:
... print stock
...
[u'ASPS', u'Altisource Portfolio Solutions S.A.', 116.96, 2.2, 1.92, 86635, u'N', u'N']
[u'AGNC', u'American Capital Agency Corp.', 23.76, 0.13, 0.55, 3184303, u'N', u'N']
.
.
.
[u'ZION', u'Zions Bancorporation', 29.79, 0.46, 1.57, 2154017, u'N', u'N']
The problem with this is that the script tag offset is hard-coded and there is not a reliable way to locate it within the page. Changes to the page could break your code.
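If the page layout does shift, one way to make the lookup a little less fragile is to select the script by its content instead of its index. A minimal sketch, reusing the soup object built in the snippet above and assuming bs4's text= filter accepts a compiled regex:

import re
import json

# Find the <script> whose text contains the table_body assignment,
# rather than relying on it being the 20th script tag on the page.
# ('soup' is the BeautifulSoup object created above.)
script = soup.find('script', text=re.compile('var table_body'))
m = re.search(r'var table_body = (.*?);', script.string)
stocks = json.loads(m.group(1))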
ORIGINAL answer
Rather than trying to screen-scrape the data, you can download a CSV representation of the same data from http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx?render=download.
Then use the Python csv module to parse and process it. Not only is this more convenient, it will also be a more resilient solution, because any changes to the HTML could easily break your screen-scraping code.
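A minimal sketch of that approach, in the same Python 2 style as the rest of this answer; it assumes the render=download URL still returns a plain CSV file and simply prints each raw row, since the exact column layout is not shown here:

import csv
import urllib

# Download the CSV rendering of the same data and parse it with the csv module.
url = "http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx?render=download"
response = urllib.urlopen(url)   # file-like object, iterable line by line
reader = csv.reader(response)
header = next(reader)            # the first row holds the column names
print header
for row in reader:
    print row                    # each remaining row is a list of strings, one per column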
Otherwise, if you look at the actual HTML you will find that the data is available within the page in the following script tag:
<script type="text/javascript">var table_body = [["ATVI", "Activision Blizzard, Inc", 20.92, 0.21, 1.01, 6182877, .1, "N", "N"],
["ADBE", "Adobe Systems Incorporated", 66.91, 1.44, 2.2, 3629837, .6, "N", "N"],
["AKAM", "Akamai Technologies, Inc.", 57.47, 1.57, 2.81, 2697834, .3, "N", "N"],
["ALXN", "Alexion Pharmaceuticals, Inc.", 170.2, 0.7, 0.41, 659817, .1, "N", "N"],
["ALTR", "Altera Corporation", 33.82, -0.06, -0.18, 1928706, .0, "N", "N"],
["AMZN", "Amazon.com, Inc.", 329.67, 6.1, 1.89, 5246300, 2.5, "N", "N"],
....
["YHOO", "Yahoo! Inc.", 35.92, 0.98, 2.8, 18705720, .9, "N", "N"]];
Answered by parkerproject
Just to add to @mhawke's answer: rather than hardcoding the offset of the script tag, you can loop through all the script tags and match the one that matches your pattern:
import urllib
import json
import re
from bs4 import BeautifulSoup

web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx")
pattern = re.compile('var table_body = (.*?);')
soup = BeautifulSoup(web.read(), "lxml")
scripts = soup.find_all('script')
for script in scripts:
    match = pattern.match(str(script.string))   # script.string may be None, hence str()
    if match:
        stock = json.loads(match.groups()[0])
        print stock
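For the other option the asker mentioned (reading the plain variables such as myPage and sectionId), the same regex idea works on the first script tag. A minimal sketch, using a hard-coded copy of the script text from the question purely for illustration; in practice js_text would be script.string pulled from the soup:

import re

# Script text copied from the question; in practice this would come from
# soup.find("script").string, as in the asker's own snippet.
js_text = """
myPage = 'ETFs';
sectionId = 'liQuotes'; //section tab
breadCrumbId = 'qQuotes'; //page
is_dartSite = "quotes";
"""

# Capture simple  name = 'value';  or  name = "value";  assignments.
assignments = dict(re.findall(r"""(\w+)\s*=\s*['"]([^'"]*)['"]\s*;""", js_text))
print assignments['myPage']       # ETFs
print assignments['is_dartSite']  # quotes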