Python: get the inner HTML of an element in lxml

Question

Asked by Sudip Kafle

I am trying to get the HTML content of a child node with lxml and XPath in Python. As shown in the code below, I want to get the HTML content of each of the product nodes. Is there a method like product.html?

productGrids = tree.xpath("//div[@class='name']/parent::*")
for product in productGrids:
    print #html content of product

Answer 1

Accepted answer by Walty Yeung

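from lxml import etree
# `root` is any lxml element, e.g. one `product` from the question's loop;
# tostring() returns bytes on Python 3 (use encoding="unicode" for a str).
print(etree.tostring(root, pretty_print=True))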

You can find more examples in the lxml tutorial: http://lxml.de/tutorial.html

Answer 2

Answered by vezult

I believe you want to use the tostring() method:

from lxml import etree

tree = etree.fromstring('<html><head><title>foo</title></head><body><div class="name"><p>foo</p></div><div class="name"><ul><li>bar</li></ul></div></body></html>')
for elem in tree.xpath("//div[@class='name']"):
    # pretty_print=True indents the output; encoding="unicode" returns a str
    # instead of bytes, so it prints cleanly on Python 3.
    print(etree.tostring(elem, pretty_print=True, encoding="unicode"))

Answer 3

Answered by Saurabh Chandra Patel

Another way to do this:

x = doc.xpath("//div[@class='name']/parent::*")
# map() is lazy on Python 3, so wrap it in list() before printing.
print(list(map(etree.tostring, x)))

Answer 4

Answered by randompast

After right-clicking the field you want in Chrome's inspector and choosing Copy > Copy XPath, you might get something like this:

//*[@id="specialID"]/div[12]/div[2]/h4/text()[1]

If you want that text node for every div under "specialID" instead of only the twelfth, drop the index on the first div:

//*[@id="specialID"]/div/div[2]/h4/text()[1]

You can select another field with the union operator (|), and the results will be interleaved:

//*[@id="specialID"]/div/div[2]/h4/text()[1] | //*[@id="specialID"]/div/some/weird/path[95]
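
To make the interleaving concrete, here is a small sketch that runs a union of two paths against an inline fragment. The markup, the "specialID" layout and the second path ending in p are assumptions chosen for illustration, not the page from the answer. lxml returns the combined node-set in document order, so the two kinds of result alternate rather than being grouped by path.

```python
from lxml import html

# Made-up markup; the id and structure are assumptions for illustration.
SAMPLE = """
<div id="specialID">
  <div><div>skip</div><div><h4>first title</h4><p>first note</p></div></div>
  <div><div>skip</div><div><h4>second title</h4><p>second note</p></div></div>
</div>
"""

root = html.fromstring(SAMPLE)

# Two paths joined with "|": the result is the union of both node-sets,
# returned in document order, so titles and notes alternate.
results = root.xpath(
    '//*[@id="specialID"]/div/div[2]/h4/text()[1]'
    ' | //*[@id="specialID"]/div/div[2]/p/text()'
)
print(results)
```

If the two fields need to stay paired per item, it may be simpler to loop over the container elements and run a relative XPath for each field inside the loop.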

The example could be improved, but it illustrates the point:

//*[@id="mw-content-text"]/div/ul[1]/li[11]/text()

from lxml import html
import requests
page = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
tree = html.fromstring(page.content)
data = tree.xpath('//*[@id="mw-content-text"]/div/ul[1]/li/a/text() | //*[@id="mw-content-text"]/div/ul[1]/li/text()[1]')
print(len(data))
for item in data:
    print(item)

Answer 5

Answered by Virako

You can use product.text_content(). Note that this returns only the text, with the markup stripped, and is available on elements parsed with lxml.html.

Answer 6

Answered by Sivashanmugam Kannan

A simple function to get the innerHTML or innerXML. Try it out directly: https://pyfiddle.io/fiddle/631aa049-2785-4c58-bf82-eff4e2f8bedb/

Function:

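from lxml import etree

def innerXML(elem):
    # Name of the document's root element. The absolute XPath below means
    # this only works when `elem` is that root.
    elemName = elem.xpath('name(/*)')
    resultStr = ''
    for e in elem.xpath('/' + elemName + '/node()'):
        if isinstance(e, str):
            # Text nodes come back as str; keep them instead of dropping them.
            resultStr += e
        else:
            # with_tail=False: the tail text is already a separate text node.
            resultStr += etree.tostring(e, encoding='unicode', with_tail=False)
    return resultStr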

Invocation:

XMLElem = etree.fromstring("<div>I am<name>Jhon <last.name> Corner</last.name></name>.I work as <job>software engineer</job><end meta='bio' />.</div>")
print(innerXML(XMLElem))

Logic behind it:

Get the outermost element name first.
Then get all the child nodes.
Convert each child node to a string using tostring.
Concatenate them.
