Python: how to get the HTML source of a specific element using Selenium?

Question

Asked by Rivka

The page I'm looking at contains:

<div id='1'> <p> text 1 <h1> text 2 </h1> text 3 <p> text 4 </p> </p> </div>

I want to get all the text in the div, except for the text that is inside the <h> element (that is, I want "text 1", "text 3" and "text 4"). There may be several <h> elements, or none at all. There may also be several <p> elements, possibly nested inside one another, or none.

I thought of doing this by getting the full HTML source of the div and using a regex to remove the <h> elements. But selenium.get_text does not return the HTML, just the text (all of it!).

I know I can use selenium.get_html_source and then look for the element I need with a regex, but that seems wasteful, since Selenium already knows how to find the element.

Does anyone have a better solution? Thanks :)

Answer 1

Answered by luc

The following code will give you the HTML in the div element:

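from selenium import selenium  # legacy Selenium RC client

# browser and my_url are assumed to be defined elsewhere
sel = selenium('localhost', 4444, browser, my_url)
html = sel.get_eval("this.browserbot.getCurrentWindow().document.getElementById('1').innerHTML")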

Then you can use BeautifulSoup to parse it and extract what you really want.
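As a rough illustration of that parsing step, here is a minimal sketch that collects the text outside heading elements. Note that it swaps in the standard library's html.parser in place of BeautifulSoup so that it runs without extra dependencies; the class name NonHeadingText, the function text_outside_headings, and the sample string (the inner HTML of the div from the question) are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Heading tags whose text should be skipped.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class NonHeadingText(HTMLParser):
    """Collects text chunks that are not inside an <h1>..<h6> element."""

    def __init__(self):
        super().__init__()
        self.heading_depth = 0  # > 0 while inside a heading element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.heading_depth += 1

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self.heading_depth > 0:
            self.heading_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text only when we are outside every heading.
        if text and self.heading_depth == 0:
            self.chunks.append(text)


def text_outside_headings(html):
    """Return the stripped text chunks of html that lie outside headings."""
    parser = NonHeadingText()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical input: the innerHTML of the div from the question.
sample = "<p> text 1 <h1> text 2 </h1> text 3 <p> text 4 </p> </p>"
print(text_outside_headings(sample))  # ['text 1', 'text 3', 'text 4']
```

Because the parser only tracks whether it is currently inside a heading, it should cope with any number of <h> and nested <p> elements, which is where a regex approach tends to get fragile.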

I hope it helps

Answer 2

Answered by int3

Use xpath. From selenium.py:

Without an explicit locator prefix, Selenium uses the following default strategies:
- dom, for locators starting with "document."
- xpath, for locators starting with "//"
- identifier, otherwise

In your case, you could try

selenium.get_text("//div[@id='1']/descendant::*[not(self::h1)]")

You can learn more about xpath here.

P.S. I haven't found any good HTML documentation for python-selenium. On the other hand, the docstrings in the selenium.py file seem to constitute comprehensive documentation, so I'd suggest reading the source to get a better understanding of how it works.

Answer 3

Answered by hminaya

What about using jQuery?

Edit:

First you have to add the required .js files to the page; you can get them from www.jQuery.com.

Then all you need to do is call a simple jQuery selector:

alert($("div#1").html());

Answer 4

Answered by Michael SM

The selected answer does not work in Python 3 at the time of writing. Instead use this:

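from selenium import webdriver

wd = webdriver.Firefox()
wd.get(url)  # url: address of the page to load, defined elsewhere
# Double-quote the Python string so the JavaScript can single-quote the id.
html = wd.execute_script("return document.getElementById('1').innerHTML")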