Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1577487/

Time: 2020-11-03 22:37:11  Source: igfitidea

python, lxml and xpath - html table parsing

python xpath lxml

Asked by user191131

I'm new to lxml, quite new to Python, and could not find a solution to the following:

I need to import a few tables with 3 columns and an undefined number of rows starting at row 3.

When the second column of any row is empty, this row is discarded and the processing of the table is aborted.

The following code prints the table's data fine (but I'm unable to reuse the data afterwards):

from lxml.html import parse

def process_row(row):  
    for cell in row.xpath('./td'):  
        print cell.text_content()  
        yield cell.text_content()  

def process_table(table):  
    return [process_row(row) for row in table.xpath('./tr')]

doc = parse(url).getroot()  
tbl = doc.xpath("/html//table[2]")[0]  
data = process_table(tbl)  

This only prints the first column :(

for i in data:  
    print i.next()

The following only imports the third row, not the subsequent ones:

tbl = doc.xpath("//body/table[2]//tr[position()>2]")[0]

Does anyone know an elegant way to get all the data from row 3 onward into tbl and copy it into an array, so it can be processed in a module with no lxml dependency?

Thanks in advance for your help, Alex

Answered by Robert Rossney

This is a generator:

def process_row(row):
    for cell in row.xpath('./td'):
        print cell.text_content()
        yield cell.text_content()

You're calling it as though you thought it returns a list. It doesn't. There are contexts in which it behaves like a list:

print [r for r in process_row(row)]

but that's only because a generator and a list both expose the same interface to for loops. Using it in a context where it gets evaluated just one time, e.g.:

return [process_row(row) for row in table.xpath('./tr')]

just creates a new generator object for each value of row. Nothing is evaluated at that point, so the resulting list holds generators rather than lists of cell text.
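
One way to see what that list comprehension actually holds is the minimal sketch below. It is written in Python 3 syntax and uses plain lists of strings as a hypothetical stand-in for the lxml rows, so the data and the simplified process_row body are assumptions for illustration only.

```python
import types

def process_row(row):
    # Generator function: calling it returns a generator object and runs nothing yet.
    for cell in row:
        yield cell.strip()

# Hypothetical stand-in for the <tr> elements: each row is a list of cell strings.
rows = [[" a1 ", "b1", "c1"], ["a2", "b2", "c2"]]

lazy = [process_row(row) for row in rows]         # list of generator objects
eager = [list(process_row(row)) for row in rows]  # list of lists of cell text

# One next() call per generator yields only the first cell of each row.
first_cells = [next(gen) for gen in lazy]

are_generators = all(isinstance(gen, types.GeneratorType) for gen in lazy)
```

Here first_cells contains only the first column, which matches the symptom described in the question, while eager holds every cell of every row.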

So that's your first problem. Your second one is that you're expecting:

tbl = doc.xpath("//body/table[2]//tr[position()>2]")[0]

to give you the third and all subsequent rows, and it's only setting tbl to the third row. Well, the call to xpath is returning the third and all subsequent rows. It's the [0] at the end that's messing you up.
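
The effect of the trailing [0] can be illustrated without lxml. The sketch below is an assumption-laden stand-in: it uses the standard library's xml.etree.ElementTree on a made-up table, and because ElementTree's limited XPath has no position()>2 predicate, list slicing does the row skipping instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample table; a real page would be parsed with lxml instead.
table = ET.fromstring(
    "<table>"
    "<tr><td>title</td></tr>"
    "<tr><td>header</td></tr>"
    "<tr><td>row 3</td></tr>"
    "<tr><td>row 4</td></tr>"
    "</table>"
)

all_rows = table.findall("tr")  # a list of <tr> elements, much like the list xpath() returns
from_third = all_rows[2:]       # third row and everything after it
only_third = all_rows[2:][0]    # indexing with [0] keeps a single element

from_third_text = [tr[0].text for tr in from_third]  # ['row 3', 'row 4']
only_third_text = only_third[0].text                 # 'row 3'
```

With lxml, the equivalent change would presumably be to keep the whole list returned by xpath rather than indexing it.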

Answered by interjay

You need to use a loop to access the row's data, like this:

for row in data:  
    for col in row:
        print col

Calling next() once as you did will access only the first item, which is why you see one column.

Note that due to the nature of generators, you can only access them once. If you changed the call process_row(row) into list(process_row(row)), the generator would be converted to a list which can be reused.
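
A small sketch of the once-only behaviour, again with a plain list as a hypothetical stand-in for a row and a simplified process_row, both of which are assumptions rather than the asker's real data:

```python
def process_row(row):
    for cell in row:
        yield cell

row = ["a", "b", "c"]  # hypothetical cell texts for one row

gen = process_row(row)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # already exhausted, so this is empty

cells = list(process_row(row))             # materialize once ...
reused = [cells[1], cells[1], len(cells)]  # ... then index and re-read freely
```

The second pass over gen comes back empty, whereas the list in cells can be indexed and traversed as often as needed.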

Update: If you need just the 3rd row and on, use data[2:]
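
Putting the pieces together, the sketch below shows one possible way to end up with plain lists that carry no parser dependency: skip the first two rows, stop at the first row whose second column is empty, and store each remaining row as a list of strings. It is not taken from the answers. The names extract_rows, cell_text and skip are hypothetical, the table is made up, and xml.etree.ElementTree stands in for lxml so the example runs with the standard library alone.

```python
import xml.etree.ElementTree as ET

def cell_text(cell):
    # Rough stand-in for lxml's text_content(): all nested text, stripped.
    return "".join(cell.itertext()).strip()

def extract_rows(table, skip=2):
    """Return the rows after the first `skip` as lists of strings, stopping at
    the first row whose second column is empty (that row is dropped)."""
    data = []
    for tr in table.findall("tr")[skip:]:
        cells = [cell_text(td) for td in tr.findall("td")]
        if len(cells) < 2 or not cells[1]:
            break  # empty second column: discard this row and abort
        data.append(cells)
    return data

# Hypothetical sample table with a title row, a header row and four data rows.
table = ET.fromstring(
    "<table>"
    "<tr><td>Title</td><td></td><td></td></tr>"
    "<tr><td>Name</td><td>Qty</td><td>Price</td></tr>"
    "<tr><td>apple</td><td>3</td><td>1.20</td></tr>"
    "<tr><td>pear</td><td><b>5</b></td><td>0.80</td></tr>"
    "<tr><td>plum</td><td></td><td>2.00</td></tr>"
    "<tr><td>fig</td><td>1</td><td>4.00</td></tr>"
    "</table>"
)

data = extract_rows(table)
```

Because data is a plain list of lists, it can be handed to a module that knows nothing about the parser. The fig row is never reached, since processing stops at the plum row.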
