Original source: http://stackoverflow.com/questions/25027339/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to get data from inspect element of a webpage using Python
Asked by user3783999
I'd like to get the data from inspect element using Python. I'm able to download the source code using BeautifulSoup, but now I need the text from inspect element of a webpage. I'd truly appreciate it if you could advise me how to do it.
Edit: By "inspect element" I mean that in Google Chrome, right-clicking gives an option called Inspect Element, which shows the code related to each element of that particular page. I'd like to extract that code, or just its text strings.
Accepted answer by Jason S
If you want to automatically fetch a web page from Python in a way that runs JavaScript, you should look into Selenium. It can automatically drive a web browser (even a headless web browser such as PhantomJS, so you don't have to have a window open).
To get the HTML, you'll need to evaluate some JavaScript. Here is some simple sample code; alter it to suit:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://google.com")
# This will get the initial html - before javascript
html1 = driver.page_source
# This will get the html after on-load javascript
html2 = driver.execute_script("return document.documentElement.innerHTML;")
Note 1: If you want a specific element or elements, you actually have a couple of options -- parse the HTML in Python, or write more specific JavaScript that returns what you want.
Note 2: If you actually need specific information from Chrome's tools that is not just dynamically generated HTML, you'll need a way to hook into Chrome itself. There is no way around that.
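PhantomJS development has since been suspended and newer Selenium releases no longer include a PhantomJS driver, so the sample above may not run as written on a current install. Below is a minimal sketch of the same idea with headless Chrome, assuming Selenium and a matching Chrome/chromedriver are installed. The helper names (make_headless_chrome, rendered_html, element_texts) are hypothetical and not part of Selenium; the second helper illustrates the "more specific JavaScript" option from Note 1.

```python
def make_headless_chrome():
    """Start Chrome without a visible window (assumes Selenium and Chrome are installed)."""
    # Imported lazily so the helpers below work with any WebDriver-like object.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def rendered_html(driver, url):
    """Load url and return the DOM as the browser currently sees it."""
    driver.get(url)
    # outerHTML includes the <html> tag itself, unlike innerHTML.
    return driver.execute_script("return document.documentElement.outerHTML;")


def element_texts(driver, css_selector):
    """Return the text of every element matching css_selector, evaluated in the browser."""
    script = (
        "return Array.from(document.querySelectorAll(arguments[0]))"
        ".map(function (el) { return el.textContent.trim(); });"
    )
    # Extra arguments to execute_script are exposed to the script as arguments[0], ...
    return driver.execute_script(script, css_selector)
```

Usage would look like driver = make_headless_chrome(), then html = rendered_html(driver, "http://google.com"), and driver.quit() when finished. Content that a page loads later (for example via AJAX) may need an explicit wait first, which this sketch does not attempt.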
Answer by Serial
For a static page (one whose content is not built by JavaScript), Inspect Element shows the same HTML that you get by fetching the page with urllib.
Do something like this:
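from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
# URL and tag_name are placeholders for the page address and the tag to look for
html = urlopen(URL).read()
soup = BS(html, "html.parser")
# find_all returns a list of tags, so take the text of each one in turn
for tag in soup.find_all(tag_name):
    print(tag.get_text())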
Answer by flyingfoxlee
BeautifulSoup can be used to parse the HTML document and extract anything you want. It is not designed for downloading. You can find the elements you want by their class and id.
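As a rough illustration of looking elements up by class and id, here is a sketch that uses only Python's standard-library html.parser instead of BeautifulSoup, so it runs without extra installs. With BeautifulSoup the comparable lookups would be soup.find(id="main") and soup.find_all(class_="note"). The names TextByAttr and text_by and the sample HTML are hypothetical, and the sketch assumes reasonably well-formed markup with closing tags.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextByAttr(HTMLParser):
    """Collect the text of every element whose id or class matches."""

    def __init__(self, id_=None, class_=None):
        super().__init__()
        self.id_ = id_
        self.class_ = class_
        self.depth = 0       # greater than 0 while inside a matching element
        self.results = []
        self._buffer = []

    def _matches(self, attrs):
        attrs = dict(attrs)
        if self.id_ is not None and attrs.get("id") == self.id_:
            return True
        classes = (attrs.get("class") or "").split()
        return self.class_ is not None and self.class_ in classes

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth or self._matches(attrs):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # The matching element just closed: store its accumulated text.
            self.results.append("".join(self._buffer).strip())
            self._buffer = []

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def text_by(html, id_=None, class_=None):
    """Return a list with the text of each element matching the id or class."""
    parser = TextByAttr(id_=id_, class_=class_)
    parser.feed(html)
    parser.close()
    return parser.results


sample = '<div id="main"><p class="note">Hello <b>world</b></p><p>skip</p></div>'
print(text_by(sample, class_="note"))  # ['Hello world']
```

The HTML string passed to text_by could equally be the page source obtained from a download or from a browser-driven fetch.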
Answer by Jakub
I would like to update the answer from Jason S. I wasn't able to start PhantomJS on OS X:
>>> driver = webdriver.PhantomJS()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
    self.service.start()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/service.py", line 74, in start
    raise WebDriverException("Unable to start phantomjs with ghostdriver.", e)
selenium.common.exceptions.WebDriverException: Message: Unable to start phantomjs with ghostdriver.
I resolved this, following another answer, by downloading the PhantomJS executables and passing the path of the binary to the driver:
driver = webdriver.PhantomJS("phantomjs-2.0.0-macosx/bin/phantomjs")

