
Note: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/17597424/


How to retrieve the values of dynamic html content using Python

Tags: python, html, templates, urllib

Asked by Tagc

I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:


from urllib import request

url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url)

response = request.urlopen(url)
data = str(response.read(10000))

# str() on the raw bytes leaves literal "\n" escape sequences in the text,
# so un-escape them before printing
data = data.replace("\\n", "\n")
print(data)

Where I'm trying to find a particular value, I'm finding a template instead, e.g. "{{formatPrice median}}" instead of "4.48".


How can I make it so that I can retrieve the value instead of the placeholder text?


Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}.


Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.


The code I have now is:


from bs4 import BeautifulSoup
from selenium import webdriver

#...

driver = webdriver.Firefox()
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html)

print("Finding...")

for tag in soup.find_all('formatPrice median'):
    print(tag.text)

Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.

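The find_all('formatPrice median') call above looks for a tag literally named "formatPrice median", which never exists, so it returns nothing. A quick way to check what the fetched source actually contains is to search the document's text for the template string instead; a minimal, untested sketch using the html variable from the snippet above (string= is the BeautifulSoup 4.4+ keyword argument; older releases call it text=):

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# If the page source is still unrendered, the literal "{{formatPrice median}}"
# placeholder will show up as document text.
for hit in soup.find_all(string=re.compile(r"formatPrice median")):
    print(hit)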

Accepted answer by will-hart

Assuming you are trying to get values from a page that is rendered using javascript templates (for instance something like handlebars), then this is what you will get with any of the standard solutions (i.e. beautifulsoup or requests).


This is because the browser uses javascript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:


  1. parse the ajax JSON directly (a rough sketch of this follows below)
  2. use an offline JavaScript interpreter to process the request (SpiderMonkey, crowbar)
  3. use a browser automation tool (splinter)
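For option 1, the idea is to call the same endpoint the page's JavaScript fetches (it can be found in the browser's network tab) and read the value straight from the parsed response, skipping the HTML entirely. A rough sketch of that approach, where the endpoint URL and the "median" field are made-up placeholders rather than eve-central's real API:

import requests

# Hypothetical endpoint and response shape -- replace with whatever the
# browser's network tab shows the page actually requesting.
AJAX_URL = "http://example.com/api/quicklook?typeid=34"

resp = requests.get(AJAX_URL, timeout=10)
resp.raise_for_status()
payload = resp.json()

# "median" is an assumed key, for illustration only.
print(payload["median"])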

This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and it's pretty handy.




EDIT


From your comments it looks like it is a handlebars-driven site. I'd recommend selenium and beautiful soup. This answer gives a good code example which may be useful:


from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

html = driver.page_source
soup = BeautifulSoup(html)

# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)

Basically selenium gets the rendered HTML from your browser and then you can parse it using BeautifulSoup from the page_source property. Good luck :)

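One caveat with this approach: page_source reflects whatever has been rendered at the moment you read it, so on a slow page you can still get the raw {{...}} placeholders. A small sketch using Selenium's explicit-wait helpers, where ".median-cell" is a made-up CSS selector standing in for whichever element actually holds the median price:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

# Wait up to 10 seconds for an element matching the (hypothetical) selector
# to appear in the DOM before grabbing page_source.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".median-cell"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
for tag in soup.select(".median-cell"):
    print(tag.get_text(strip=True))

driver.quit()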