Using BeautifulSoup To Extract Specific TD Table Elements Text?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/22746176/

Time: 2020-08-19 01:35:13  Source: igfitidea

Tags: python, html, beautifulsoup

Asked by Pike Man

I am trying to extract IP addresses from an autogenerated HTML table using the BeautifulSoup library, and I'm having a little trouble.

The HTML is structured like so:

<html>
<body>
    <table class="mainTable">
    <thead>
        <tr>
            <th>IP</th>
            <th>Country</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><a href="hello.html">127.0.0.1<a></td>
            <td><img src="uk.gif" /><a href="uk.com">uk</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">192.168.0.1<a></td>
            <td><img src="uk.gif" /><a href="us.com">us</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">255.255.255.0<a></td>
            <td><img src="uk.gif" /><a href="br.com">br</a></td>
        </tr>
    </tbody>
</table>

The short snippet below extracts the text from both td cells of each row, but I only need the IP data, not the IP and country data:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.htm"))

table = soup.find('table', {'class': 'mainTable'})
for row in table.findAll("a"):
    print(row.text)

This outputs:

127.0.0.1
uk
192.168.0.1
us
255.255.255.0
br

What I need is the text of the IP table.tbody.tr.td.a elements, not the country table.tbody.tr.td.img.a elements.

Do any experienced BeautifulSoup users have an idea of how to do this selection and extraction?

Thanks.

Answered by m.wasowski

Search just the first <td> of each row in tbody:

import bs4

# html should contain the page content:
[row.find('td').getText() for row in bs4.BeautifulSoup(html).find('tbody').find_all('tr')]

Or, perhaps more readably:

rows = bs4.BeautifulSoup(html).find('tbody').find_all('tr')
iplist = [row.find('td').getText() for row in rows]
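
As a self-contained sketch of the first-cell approach, the version below pastes the sample markup from the question (unclosed <a> tags included) into a string and parses it with Python's built-in "html.parser". The parser choice and the variable names html, rows, and ips are assumptions made for illustration; other parsers may build a slightly different tree from the malformed anchors.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, keeping the unclosed <a> tags as posted.
html = """
<table class="mainTable">
<thead>
<tr><th>IP</th><th>Country</th></tr>
</thead>
<tbody>
<tr><td><a href="hello.html">127.0.0.1<a></td><td><img src="uk.gif" /><a href="uk.com">uk</a></td></tr>
<tr><td><a href="hello.html">192.168.0.1<a></td><td><img src="uk.gif" /><a href="us.com">us</a></td></tr>
<tr><td><a href="hello.html">255.255.255.0<a></td><td><img src="uk.gif" /><a href="br.com">br</a></td></tr>
</tbody>
</table>
"""

# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup(html, "html.parser")

# Only rows inside <tbody> are considered, so the header row is skipped.
rows = soup.find("tbody").find_all("tr")

# find("td") returns the first cell of each row, which holds the IP address.
ips = [row.find("td").get_text(strip=True) for row in rows]
print(ips)
```

Because only the first <td> of each row is read, the country column never enters the result, whatever tags it contains.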

Answered by behzad.nouri

This gives you the right list:

>>> pred = lambda tag: tag.parent.find('img') is None
>>> list(filter(pred, soup.find('tbody').find_all('a')))
[<a href="hello.html">127.0.0.1<a></a></a>, <a></a>, <a href="hello.html">192.168.0.1<a></a></a>, <a></a>, <a href="hello.html">255.255.255.0<a></a></a>, <a></a>]

Just apply .text to the elements of this list.

There are multiple empty <a></a> tags in the list above because the <a> tags in the HTML are not closed properly. To get rid of them, you may use

pred = lambda tag: tag.parent.find('img') is None and tag.text

and ultimately:

>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a'))]
['127.0.0.1', '192.168.0.1', '255.255.255.0']

Answered by salmanwahed

You can use a small regular expression to pick out the IP addresses. BeautifulSoup combined with regular expressions works well for scraping.

import re

# Matches text that looks like a dotted-quad IPv4 address.
ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for row in table.findAll("a"):
    if ip_pat.match(row.text):
        print(row.text)
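
Note that the pattern above only checks the shape of the text, so a string such as 999.1.1.1 would also match. If stricter validation matters, one option (not part of the original answer) is the standard library's ipaddress module. The sketch below is illustrative: the helper name is_ipv4 and the candidates list, which stands in for the link texts printed in the question, are assumptions.

```python
import ipaddress


def is_ipv4(text):
    """Return True if text parses as a valid IPv4 address."""
    try:
        ipaddress.IPv4Address(text.strip())
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


# Stand-in for the link texts from the question, plus one out-of-range value.
candidates = ["127.0.0.1", "uk", "192.168.0.1", "us", "255.255.255.0", "br", "999.1.1.1"]

ips = [text for text in candidates if is_ipv4(text)]
print(ips)
```

The same helper could replace ip_pat.match(row.text) in the loop above.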

Answered by unutbu

First, note that the HTML is not well-formed: it does not close the a tags. Two <a> tags are opened here:

<a href="hello.html">127.0.0.1<a>

If you print table, you'll see that BeautifulSoup parses the HTML as:

<td>
<a href="hello.html">127.0.0.1</a><a></a>
</td>
...

Each a is followed by an empty a.

Given the presence of those extra <a> tags, if you want every third <a> tag, then

for row in table.findAll("a")[::3]:
    print(row.get_text())

suffices:

127.0.0.1
192.168.0.1
255.255.255.0

On the other hand, if the occurrence of <a> tags is not so regular and you only want those <a> tags with no previous sibling (such as, but not limited to, <img>), then

for row in table.findAll("a"):
    sibling = row.findPreviousSibling()
    if sibling is None:
        print(row.get_text())

would work.
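
For a version that can be run on its own, here is a minimal sketch of the previous-sibling filter. It assumes the anchors are properly closed (unlike the markup in the question), so that each <a> in the country cell has the <img> as its previous sibling and each IP <a> has none. The shortened two-row sample and the names html and ips are illustrative.

```python
from bs4 import BeautifulSoup

# Two sample rows with the <a> tags closed properly (an assumption; the
# markup in the question leaves them open).
html = """
<table class="mainTable"><tbody>
<tr><td><a href="hello.html">127.0.0.1</a></td><td><img src="uk.gif" /><a href="uk.com">uk</a></td></tr>
<tr><td><a href="hello.html">192.168.0.1</a></td><td><img src="uk.gif" /><a href="us.com">us</a></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "mainTable"})

# Keep only the <a> tags that are not preceded by another tag in their cell.
ips = [
    a.get_text()
    for a in table.find_all("a")
    if a.find_previous_sibling() is None
]
print(ips)
```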

If you have lxml, the criteria can be expressed more succinctly using XPath:

import lxml.html as LH
doc = LH.parse("data.htm")
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()')
print(ips)

The XPath used above has the following meaning:

//table                            select all <table> tags
    [@class="mainTable"]           that have a class="mainTable" attribute
//                                 from these tags, select descendants
  td/a                             that are <a> tags with a <td> parent
    [not(preceding-sibling::img)]  and that have no preceding sibling <img> tag
    /text()                        and return the text of each such <a> tag

It does take a little time to learn XPath, but once you learn it you may never want to use BeautifulSoup again.
