How to get data from a web page's inspect element using Python

Question

Asked by user3783999

I'd like to get the data shown in inspect element using Python. I'm able to download the source code using BeautifulSoup, but now I need the text from the inspect element view of a webpage. I'd truly appreciate it if you could advise me how to do it.

Edit: By inspect element I mean the option that Google Chrome offers on right click, called "Inspect element", which shows the code related to each element of that particular page. I'd like to extract that code, or just its text strings.

Answer 1

Accepted answer by Jason S

If you want to automatically fetch a web page from Python in a way that runs JavaScript, you should look into Selenium. It can automatically drive a web browser (even a headless web browser such as PhantomJS, so you don't have to have a window open).

To get the HTML, you'll need to evaluate some JavaScript. Simple sample code, alter to suit:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://google.com")

# This will get the initial html - before javascript
html1 = driver.page_source

# This will get the html after on-load javascript
html2 = driver.execute_script("return document.documentElement.innerHTML;")

Note 1: If you want a specific element or elements, you actually have a couple of options -- parse the HTML in Python, or write more specific JavaScript that returns what you want.

Note 2: If you actually need specific information from Chrome's tools that is not just dynamically generated HTML, you'll need a way to hook into Chrome itself. There is no way around that.
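
As a hedged update to the sample code above: PhantomJS is no longer maintained and Selenium 4 no longer ships webdriver.PhantomJS, so headless Chrome is one possible substitute. The sketch below assumes Selenium 4 with Chrome and a matching driver available on the machine. The function name fetch_rendered_html and its css_selector parameter are illustrative, not part of Selenium. It also shows the "more specific JavaScript" option from Note 1.

```python
def fetch_rendered_html(url, css_selector=None):
    """Load url in headless Chrome and return content after scripts have run.

    With no css_selector, return the full rendered HTML of the page.
    With a css_selector, return the text of the first matching element,
    or None if nothing matches.
    """
    # Imported inside the function so this module can still be loaded
    # on a machine where Selenium is not installed (an assumption of this sketch).
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        if css_selector is None:
            # Serialize the DOM as it currently stands in the browser.
            return driver.execute_script(
                "return document.documentElement.outerHTML;"
            )
        # More specific JavaScript: return only the text of one element.
        return driver.execute_script(
            "var el = document.querySelector(arguments[0]);"
            "return el ? el.textContent : null;",
            css_selector,
        )
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()

# Example call (not executed here, it needs Chrome and network access):
# html = fetch_rendered_html("http://google.com")
```

The returned HTML string can then be handed to a parser such as BeautifulSoup, which is the other option mentioned in Note 1.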

Answer 2

Answered by Serial

Inspect element shows all the HTML of the page, which is the same as fetching the HTML using urllib.

Do something like this:

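from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

# URL and tag_name are placeholders: the page address and the tag to extract
html = urlopen(URL).read()

soup = BS(html, "html.parser")

# find_all returns a list of tags, so take the text of each one in turn
for tag in soup.find_all(tag_name):
    print(tag.get_text())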

Answer 3

Answered by flyingfoxlee

BeautifulSoup can be used to parse the HTML document and extract anything you want. It's not designed for downloading. You can find the elements you want by their class and id.
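
A minimal sketch of looking elements up by id and by class with BeautifulSoup. It assumes the bs4 package is installed. The sample HTML and the helper names text_by_id and texts_by_class are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, standing in for HTML you have already downloaded.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">Example page</h1>
  <p class="note">First note</p>
  <p class="note">Second note</p>
  <p>Unrelated paragraph</p>
</body></html>
"""

def text_by_id(html, element_id):
    """Return the text of the element with the given id, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(id=element_id)
    return element.get_text(strip=True) if element is not None else None

def texts_by_class(html, class_name):
    """Return the text of every element carrying the given class."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ has a trailing underscore because class is a Python keyword.
    return [el.get_text(strip=True) for el in soup.find_all(class_=class_name)]
```

Calling text_by_id(SAMPLE_HTML, "title") gives the heading text, and texts_by_class(SAMPLE_HTML, "note") gives a list with the text of both note paragraphs.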

Answer 4

Answered by Jakub

I would like to update the answer from Jason S. I wasn't able to start PhantomJS on OS X:

driver = webdriver.PhantomJS()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
    self.service.start()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/service.py", line 74, in start
    raise WebDriverException("Unable to start phantomjs with ghostdriver.", e)
selenium.common.exceptions.WebDriverException: Message: Unable to start phantomjs with ghostdriver.

Resolved, following a linked answer, by downloading the executables and passing their path to the driver:

driver = webdriver.PhantomJS("phantomjs-2.0.0-macosx/bin/phantomjs")
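
One possible cause of the error above is that the phantomjs executable cannot be found on the PATH, which would explain why passing an explicit path helps. Below is a minimal sketch, using only the standard library, of checking for an executable before creating the driver. The helper name resolve_executable is hypothetical.

```python
import os
import shutil

def resolve_executable(name, explicit_path=None):
    """Return a usable path to an executable, or None if none can be found.

    An explicit path is preferred when it points at an executable file;
    otherwise the PATH is searched for the given name.
    """
    if (
        explicit_path
        and os.path.isfile(explicit_path)
        and os.access(explicit_path, os.X_OK)
    ):
        return os.path.abspath(explicit_path)
    # shutil.which returns None when the name is not found on the PATH.
    return shutil.which(name)

# Example use with the path from this answer (not executed here):
# path = resolve_executable("phantomjs", "phantomjs-2.0.0-macosx/bin/phantomjs")
# if path is None:
#     raise RuntimeError("phantomjs executable not found")
# driver = webdriver.PhantomJS(path)
```

Checking first turns a driver start-up failure into a clearer message about the missing file.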
