Python: extracting the text of specific TD table elements with BeautifulSoup?

Question

Asked by Pike Man

I am trying to extract IP addresses from an autogenerated HTML table using the BeautifulSoup library, and I'm having a little trouble.

The HTML is structured like so:

<html>
<body>
    <table class="mainTable">
    <thead>
        <tr>
            <th>IP</th>
            <th>Country</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><a href="hello.html">127.0.0.1<a></td>
            <td><img src="uk.gif" /><a href="uk.com">uk</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">192.168.0.1<a></td>
            <td><img src="uk.gif" /><a href="us.com">us</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">255.255.255.0<a></td>
            <td><img src="uk.gif" /><a href="br.com">br</a></td>
        </tr>
    </tbody>
</table>

The short snippet below extracts the text from both td cells of each row, but I only need the IP data, not the IP and Country data:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.htm"))

table = soup.find('table', {'class': 'mainTable'})
for row in table.findAll("a"):
    print(row.text)

This outputs:

127.0.0.1
uk
192.168.0.1
us
255.255.255.0
br

What I need is the text of the IP table.tbody.tr.td.a elements, not the country table.tbody.tr.td.img.a elements.

Are there any experienced BeautifulSoup users who have an idea of how to do this selection and extraction?

Thanks.

Answer 1

Answered by m.wasowski

Search just the first <td> of each row in tbody:

import bs4

# html should contain the page content:
[row.find('td').getText() for row in bs4.BeautifulSoup(html).find('tbody').find_all('tr')]

Or, perhaps more readably:

rows = bs4.BeautifulSoup(html).find('tbody').find_all('tr')
iplist = [row.find('td').getText() for row in rows]
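
As a self-contained illustration of this approach, here is a minimal sketch that inlines a compacted copy of the question's table (unclosed <a> tags included) instead of reading data.htm. The variable names and the explicit "html.parser" argument are assumptions made for this example, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Compacted copy of the table from the question, unclosed <a> tags included.
html = """
<table class="mainTable">
<thead><tr><th>IP</th><th>Country</th></tr></thead>
<tbody>
<tr><td><a href="hello.html">127.0.0.1<a></td>
    <td><img src="uk.gif" /><a href="uk.com">uk</a></td></tr>
<tr><td><a href="hello.html">192.168.0.1<a></td>
    <td><img src="uk.gif" /><a href="us.com">us</a></td></tr>
<tr><td><a href="hello.html">255.255.255.0<a></td>
    <td><img src="uk.gif" /><a href="br.com">br</a></td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("tbody").find_all("tr")

# find() returns only the first match, so this picks the IP cell of each row
# and never looks at the Country cell.
iplist = [row.find("td").get_text(strip=True) for row in rows]
print(iplist)
```

Because only the first cell of each row is read, the result does not depend on how the parser repairs the stray <a> tags.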

Answer 2

Answered by behzad.nouri

This gives you the right list:

>>> pred = lambda tag: tag.parent.find('img') is None
>>> list(filter(pred, soup.find('tbody').find_all('a')))
[<a href="hello.html">127.0.0.1<a></a></a>, <a></a>, <a href="hello.html">192.168.0.1<a></a></a>, <a></a>, <a href="hello.html">255.255.255.0<a></a></a>, <a></a>]

Just apply .text to the elements of this list.

There are multiple empty <a></a> tags in the list above because the <a> tags in the HTML are not closed properly. To get rid of them, you may use

pred = lambda tag: tag.parent.find('img') is None and tag.text

And ultimately:

>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a'))]
['127.0.0.1', '192.168.0.1', '255.255.255.0']

Answer 3

Answered by salmanwahed

You can use a small regular expression to extract the IP addresses. BeautifulSoup together with regular expressions is a nice combination for scraping.

import re

# table is the mainTable element found in the question's code
ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for row in table.findAll("a"):
    if ip_pat.match(row.text):
        print(row.text)
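
Note that the pattern above only checks the shape of the string, so it would also accept text such as 999.999.999.999. If stricter validation matters, one option (a substitute technique, not part of the original answer) is the standard-library ipaddress module. The helper name is_ipv4 and the sample strings below are hypothetical.

```python
import ipaddress


def is_ipv4(text):
    """Return True if text is a valid dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text.strip())
        return True
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False


# Hypothetical link texts, standing in for the row.text values from the table.
candidates = ["127.0.0.1", "uk", "192.168.0.1", "999.999.999.999", ""]
ips = [c for c in candidates if is_ipv4(c)]
print(ips)
```

In the loop above, is_ipv4(row.text) could then replace ip_pat.match(row.text).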

Answer 4

Answered by unutbu

First, note that the HTML is not well-formed: it does not close the a tags. Two <a> tags are opened here:

<a href="hello.html">127.0.0.1<a>

If you print table, you'll see that BeautifulSoup is parsing the HTML as:

<td>
<a href="hello.html">127.0.0.1</a><a></a>
</td>
...

Each a is followed by an empty a.

Given the presence of those extra <a> tags, if you want every third <a> tag, then

for row in table.findAll("a")[::3]:
    print(row.get_text())

suffices:

127.0.0.1
192.168.0.1
255.255.255.0

On the other hand, if the occurrence of <a> tags is not so regular and you only want those <a> tags that have no previous sibling (such as, but not limited to, <img>), then

for row in table.findAll("a"):
    sibling = row.findPreviousSibling()
    if sibling is None:
        print(row.get_text())

would work.

If you have lxml, the criteria can be expressed more succinctly using XPath:

import lxml.html as LH
doc = LH.parse("data.htm")
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()')
print(ips)

The XPath used above has the following meaning:

//table                            select all <table> tags
    [@class="mainTable"]           that have a class="mainTable" attribute
//                                 from these tags select descendants
  td/a                             which are td tags with a child <a> tag
    [not(preceding-sibling::img)]  such that it does not have a preceding sibling <img> tag
    /text()                        return the text of the <a> tag

It does take a little time to learn XPath, but once you learn it you may never want to use BeautifulSoup again.
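
For readers who prefer to stay within BeautifulSoup, a roughly comparable selection can be sketched with a CSS selector passed to select(). This is an alternative technique, not part of the original answer: it picks the first cell of each body row rather than testing for a preceding <img>, and it assumes a BeautifulSoup version whose select() understands :first-child. The inlined HTML and the variable names are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Compacted copy of the table from the question, unclosed <a> tags included.
html = """
<table class="mainTable">
<thead><tr><th>IP</th><th>Country</th></tr></thead>
<tbody>
<tr><td><a href="hello.html">127.0.0.1<a></td>
    <td><img src="uk.gif" /><a href="uk.com">uk</a></td></tr>
<tr><td><a href="hello.html">192.168.0.1<a></td>
    <td><img src="uk.gif" /><a href="us.com">us</a></td></tr>
<tr><td><a href="hello.html">255.255.255.0<a></td>
    <td><img src="uk.gif" /><a href="br.com">br</a></td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the first <td> of every row in the body of the mainTable table.
cells = soup.select("table.mainTable tbody tr > td:first-child")
ips = [cell.get_text(strip=True) for cell in cells]
print(ips)
```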
