
Note: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/17597424/


How to retrieve the values of dynamic html content using Python

Tags: python, html, templates, urllib

Asked by Tagc

I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:


from urllib import request

url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url)

response = request.urlopen(url)
data = str(response.read(10000))

# str() on the raw bytes leaves literal "\n" escape sequences in the text,
# so un-escape them before printing
data = data.replace("\\n", "\n")
print(data)

Where I'm trying to find a particular value, I'm finding a template instead, e.g. "{{formatPrice median}}" instead of "4.48".


How can I make it so that I can retrieve the value instead of the placeholder text?


Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}.


Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.


The code I have now is:


from bs4 import BeautifulSoup
from selenium import webdriver

#...

driver = webdriver.Firefox()
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html)

print("Finding...")

for tag in soup.find_all('formatPrice median'):
    print(tag.text)

Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.

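The find_all('formatPrice median') call above looks for a tag literally named "formatPrice median", which never exists, so it returns nothing. A quick way to check what the fetched source actually contains is to search the document's text for the template string instead; a minimal, untested sketch using the html variable from the snippet above (string= is the BeautifulSoup 4.4+ keyword argument; older releases call it text=):

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# If the page source is still unrendered, the literal "{{formatPrice median}}"
# placeholder will show up as document text.
for hit in soup.find_all(string=re.compile(r"formatPrice median")):
    print(hit)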

Accepted answer by will-hart

Assuming you are trying to get values from a page that is rendered using javascript templates (for instance something like handlebars), then this is what you will get with any of the standard solutions (i.e. beautifulsoup or requests).


This is because the browser uses javascript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:


  1. parse the ajax JSON directly (a rough sketch of this follows below)
  2. use an offline JavaScript interpreter to process the request (SpiderMonkey, crowbar)
  3. use a browser automation tool (splinter)
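For option 1, the idea is to call the same endpoint the page's JavaScript fetches (it can be found in the browser's network tab) and read the value straight from the parsed response, skipping the HTML entirely. A rough sketch of that approach, where the endpoint URL and the "median" field are made-up placeholders rather than eve-central's real API:

import requests

# Hypothetical endpoint and response shape -- replace with whatever the
# browser's network tab shows the page actually requesting.
AJAX_URL = "http://example.com/api/quicklook?typeid=34"

resp = requests.get(AJAX_URL, timeout=10)
resp.raise_for_status()
payload = resp.json()

# "median" is an assumed key, for illustration only.
print(payload["median"])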

This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and it's pretty handy.




EDIT


From your comments it looks like it is a handlebars-driven site. I'd recommend selenium and beautiful soup. This answer gives a good code example which may be useful:


from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

html = driver.page_source
soup = BeautifulSoup(html)

# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)

Basically selenium gets the rendered HTML from your browser and then you can parse it using BeautifulSoup from the page_source property. Good luck :)

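One caveat with this approach: page_source reflects whatever has been rendered at the moment you read it, so on a slow page you can still get the raw {{...}} placeholders. A small sketch using Selenium's explicit-wait helpers, where ".median-cell" is a made-up CSS selector standing in for whichever element actually holds the median price:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

# Wait up to 10 seconds for an element matching the (hypothetical) selector
# to appear in the DOM before grabbing page_source.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".median-cell"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
for tag in soup.select(".median-cell"):
    print(tag.get_text(strip=True))

driver.quit()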