原文地址: http://stackoverflow.com/questions/603287/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Need python lxml syntax help for parsing html
Asked by Shaheeb Roshan
I am brand new to python, and I need some help with the syntax for finding and iterating through html tags using lxml. Here are the use-cases I am dealing with:
HTML file is fairly well formed (but not perfect). Has multiple tables on screen, one containing a set of search results, and one each for a header and footer. Each result row contains a link for the search result detail.
I need to find the middle table with the search result rows (this one I was able to figure out):
self.mySearchTables = self.mySearchTree.findall(".//table")
self.myResultRows = self.mySearchTables[1].findall(".//tr")
I need to find the links contained in this table (this is where I'm getting stuck):
for searchRow in self.myResultRows:
    searchLink = searchRow.findall(".//a")
It doesn't seem to actually locate the link elements.
I need the plain text of the link. I imagine it would be something like
searchLink.text
if I actually got the link elements in the first place.
Finally, in the actual API reference for lxml, I wasn't able to find information on the find and findall calls. I gleaned these from bits of code I found on Google. Am I missing something about how to effectively find and iterate over HTML tags using lxml?
Answered by Van Gale
Okay, first, with regard to parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott, at least use the BeautifulSoup support included with lxml. That way you will also reap the benefit of a nice xpath or css selector interface.
However, I personally prefer Ian Bicking's HTML parser included in lxml.
Secondly, .find() and .findall() come from lxml trying to be compatible with ElementTree, and those two methods are described in XPath Support in ElementTree.
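As a minimal sketch of how those two ElementTree-style methods behave, the snippet below uses the standard library's xml.etree.ElementTree rather than lxml, so it only accepts well-formed markup. The PAGE string and its links are made up for illustration. The point it shows is that .findall() returns a list of elements, so .text has to be read from each element, while .find() returns the first match or None.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the search page: header table,
# results table, footer table.
PAGE = """
<html><body>
  <table><tr><td><a href="/home">Home</a></td></tr></table>
  <table>
    <tr><td><a href="/detail/1">First result</a></td></tr>
    <tr><td><a href="/detail/2">Second result</a></td></tr>
  </table>
  <table><tr><td><a href="/about">About</a></td></tr></table>
</body></html>
"""

root = ET.fromstring(PAGE)

# The middle table (index 1) holds the result rows, as in the question.
tables = root.findall(".//table")
result_rows = tables[1].findall(".//tr")

# findall() returns a list, so read .text from each element in it,
# not from the list itself.
link_texts = []
for row in result_rows:
    for link in row.findall(".//a"):
        link_texts.append(link.text)

# find() returns only the first match, or None when nothing matches.
first_link = result_rows[0].find(".//a")
missing = result_rows[0].find(".//img")

print(link_texts)
print(first_link.get("href"), missing)
```

With lxml, the same find and findall calls should work on a tree built from imperfect HTML, which the standard library parser would reject.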
Those two functions are fairly easy to use, but they support only a very limited subset of XPath. I recommend using either the full lxml xpath() method or, if you are already familiar with CSS, the cssselect() method.
Here are some examples, with an HTML string parsed like this:
from lxml.html import fromstring
mySearchTree = fromstring(your_input_string)
Using the css selector class your program would roughly look something like this:
# Find all 'a' elements inside 'tr' table rows with css selector
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
A roughly equivalent version using the xpath() method (it matches links one level below a row, for example inside a td cell) would be:
# Find all 'a' elements inside 'tr' table rows with xpath
for a in mySearchTree.xpath('.//tr/*/a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
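To tie this back to the original question, here is a hedged sketch that restricts the search to the middle table and collects the plain text of each link. The PAGE markup and its URLs are invented for illustration, and it assumes lxml is installed. It uses text_content() instead of .text, because .text stops at the first child element when the link contains nested markup.

```python
from lxml.html import fromstring

# Hypothetical page: header table, results table, footer table.
PAGE = """
<html><body>
<table><tr><td><a href="/home">Home</a></td></tr></table>
<table>
<tr><td><a href="/detail/1">Result <b>one</b></a></td></tr>
<tr><td><a href="/detail/2">Result two</a></td></tr>
</table>
<table><tr><td><a href="/about">About</a></td></tr></table>
</body></html>
"""

tree = fromstring(PAGE)

# (//table)[2] selects the second table in document order (XPath counts
# from 1); //tr//a then matches links at any depth inside its rows.
links = []
for a in tree.xpath('(//table)[2]//tr//a'):
    # text_content() joins the text of the element and all descendants.
    links.append((a.text_content(), a.get('href')))

for text, href in links:
    print('found "%s" link to href "%s"' % (text, href))
```

Selecting the table by position mirrors the question's mySearchTables[1]; if the real results table carries an id or class attribute, matching on that would likely be more robust than relying on its position.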
Answered by zweiterlinde
Is there a reason you're not using Beautiful Soup for this project? It will make dealing with imperfectly formed documents much easier.