Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/22746176/
Using BeautifulSoup To Extract Specific TD Table Elements Text?
Asked by Pike Man
I am trying to extract IP addresses from an autogenerated HTML table using the BeautifulSoup library, and I'm having a little trouble.
The HTML is structured like so:
<html>
<body>
<table class="mainTable">
<thead>
<tr>
<th>IP</th>
<th>Country</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="hello.html">127.0.0.1<a></td>
<td><img src="uk.gif" /><a href="uk.com">uk</a></td>
</tr>
<tr>
<td><a href="hello.html">192.168.0.1<a></td>
<td><img src="uk.gif" /><a href="us.com">us</a></td>
</tr>
<tr>
<td><a href="hello.html">255.255.255.0<a></td>
<td><img src="uk.gif" /><a href="br.com">br</a></td>
</tr>
</tbody>
</table>
The short snippet below extracts the text from both td cells of each row, but I only need the IP data, not the IP and country data:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.htm"))
table = soup.find('table', {'class': 'mainTable'})
for row in table.findAll("a"):
    print(row.text)
This outputs:
127.0.0.1
uk
192.168.0.1
us
255.255.255.0
br
What I need is the text of the IP elements (table.tbody.tr.td.a), not the text of the country elements (table.tbody.tr.td.img.a).
Are there any experienced BeautifulSoup users who have an idea of how to do this selection and extraction?
Thanks.
Answered by m.wasowski
Search just the first <td> for each row in tbody:
import bs4
# html should contain the page content:
[row.find('td').getText() for row in bs4.BeautifulSoup(html).find('tbody').find_all('tr')]
or maybe more readable:
rows = bs4.BeautifulSoup(html).find('tbody').find_all('tr')
iplist = [row.find('td').getText() for row in rows]
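To try the first-cell idea without a data file, here is a minimal self-contained sketch. It assumes two rows copied from the question's table (unclosed <a> tags included) and passes "html.parser" explicitly, which the original answer does not do; the function name first_column_text is made up for illustration.

```python
from bs4 import BeautifulSoup

# Two rows copied from the question, keeping the unclosed <a> tags.
HTML = """
<table class="mainTable">
<thead><tr><th>IP</th><th>Country</th></tr></thead>
<tbody>
<tr>
<td><a href="hello.html">127.0.0.1<a></td>
<td><img src="uk.gif" /><a href="uk.com">uk</a></td>
</tr>
<tr>
<td><a href="hello.html">192.168.0.1<a></td>
<td><img src="uk.gif" /><a href="us.com">us</a></td>
</tr>
</tbody>
</table>
"""


def first_column_text(html):
    """Return the text of the first <td> of every row in <tbody>."""
    # Assumption: the stdlib-backed parser is named explicitly so the
    # result does not depend on which optional parsers are installed.
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("tbody").find_all("tr")
    # find() returns only the first matching <td>, i.e. the IP cell.
    return [row.find("td").get_text(strip=True) for row in rows]


print(first_column_text(HTML))
```

Because only the first cell of each row is read, the stray empty <a> tags never need to be filtered out.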
Answered by behzad.nouri
This gives you the right list:
>>> pred = lambda tag: tag.parent.find('img') is None
>>> list(filter(pred, soup.find('tbody').find_all('a')))
[<a href="hello.html">127.0.0.1<a></a></a>, <a></a>, <a href="hello.html">192.168.0.1<a></a></a>, <a></a>, <a href="hello.html">255.255.255.0<a></a></a>, <a></a>]
Just apply .text to the elements of this list.
There are multiple empty <a></a> tags in the list above because the <a> tags in the HTML are not closed properly. To get rid of them, you may use
pred = lambda tag: tag.parent.find('img') is None and tag.text
and ultimately:
>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a'))]
['127.0.0.1', '192.168.0.1', '255.255.255.0']
Answered by salmanwahed
You can use a small regular expression to extract the IP addresses. BeautifulSoup combined with regular expressions is a nice combination for scraping.
import re
ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for row in table.findAll("a"):
    if ip_pat.match(row.text):
        print(row.text)
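One caveat with a pattern built from \d{1,3} is that it also accepts out-of-range strings such as 999.1.1.1. As a possible alternative (not part of the original answer), the standard library's ipaddress module can do the validation. The helper name is_ipv4 and the candidate list below are made up for illustration.

```python
import ipaddress


def is_ipv4(text):
    """Return True if text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text.strip())
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


# Hypothetical link texts, similar to what table.findAll("a") yields.
candidates = ["127.0.0.1", "uk", "192.168.0.1", "999.1.1.1", ""]
print([c for c in candidates if is_ipv4(c)])
```

In the loop from the answer, the test `ip_pat.match(row.text)` could then be swapped for `is_ipv4(row.text)`.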
Answered by unutbu
First note that the HTML is not well-formed: it does not close the a tags. Two <a> tags are opened here:
<a href="hello.html">127.0.0.1<a>
If you print table, you'll see that BeautifulSoup parses the HTML as:
<td>
<a href="hello.html">127.0.0.1</a><a></a>
</td>
...
Each a is followed by an empty a.
Given the presence of those extra <a> tags, if you want every third <a> tag, then
for row in table.findAll("a")[::3]:
    print(row.get_text())
suffices:
127.0.0.1
192.168.0.1
255.255.255.0
On the other hand, if the occurrence of <a> tags is not so regular and you only want the <a> tags with no previous sibling (such as, but not limited to, <img>), then
for row in table.findAll("a"):
    sibling = row.findPreviousSibling()
    if sibling is None:
        print(row.get_text())
would work.
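How the unclosed <a> tags end up in the tree depends on the parser BeautifulSoup uses, so a variant that is less sensitive to that may help. The sketch below is a narrower version of the sibling check, not the answer's exact code: it rejects only links preceded by an <img> sibling and also skips links with empty text. The sample rows, the explicit "html.parser" argument, and the function name ips_without_img_sibling are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Two rows copied from the question, keeping the unclosed <a> tags.
HTML = """
<table class="mainTable">
<tbody>
<tr>
<td><a href="hello.html">127.0.0.1<a></td>
<td><img src="uk.gif" /><a href="uk.com">uk</a></td>
</tr>
<tr>
<td><a href="hello.html">192.168.0.1<a></td>
<td><img src="uk.gif" /><a href="us.com">us</a></td>
</tr>
</tbody>
</table>
"""


def ips_without_img_sibling(html):
    """Collect link texts, skipping links that follow an <img> sibling."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "mainTable"})
    result = []
    for link in table.find_all("a"):
        text = link.get_text(strip=True)
        # Skip the empty <a> tags produced by the malformed markup, and
        # any link that has an <img> before it inside the same parent.
        if text and link.find_previous_sibling("img") is None:
            result.append(text)
    return result


print(ips_without_img_sibling(HTML))
```

The empty-text check is what absorbs the parser difference: the stray <a> may be parsed as a sibling or as a nested child, but in both cases it has no text.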
If you have lxml, the criteria can be expressed more succinctly using XPath:
import lxml.html as LH
doc = LH.parse("data.htm")
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()')
print(ips)
The XPath used above has the following meaning:
//table                          select all <table> tags
[@class="mainTable"]             that have a class="mainTable" attribute
//                               from these tags select descendants
td/a                             which are td tags with a child <a> tag
[not(preceding-sibling::img)]    such that it does not have a preceding sibling <img> tag
/text()                          return the text of the <a> tag
It does take a little time to learn XPath, but once you learn it you may never want to use BeautifulSoup again.