Python BeautifulSoup:<div class <span class></span><span class>TEXT I WANT</span>
Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/17613606/
BeautifulSoup: <div class <span class></span><span class>TEXT I WANT</span>
Asked by
I am trying to extract the string enclosed by the span with id="titleDescriptionID" using BeautifulSoup.
<div class="itemText">
<div class="wrapper">
<span class="itemPromo">Customer Choice Award Winner</span>
<a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" title="View Details" >
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
<span class="itemDescription" id="lineDescriptionID" style="display:none">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
</a>
</div>
Code snippet
from bs4 import BeautifulSoup as bs

# Read the saved page as bytes, decode it, and drop non-ASCII characters
with open('egg.data', 'rb') as f:
    content = f.read()
content = content.decode('utf-8', 'replace')
content = ''.join([x for x in content if ord(x) < 128])
soup = bs(content)
for itemText in soup.find_all('div', attrs={'class':'itemText'}):
    wrapper = itemText.div
    wrapper_href = wrapper.a
    for child in wrapper_href.descendants:
        # This is the line that raises the TypeError
        if child['id'] == 'titleDescriptionID':
            print(child, "\n")
Traceback Error:
Traceback (most recent call last):
  File "egg.py", line 66, in <module>
    if child['id'] == 'titleDescriptionID':
TypeError: string indices must be integers
Accepted answer by zhangyangyu
spans = soup.find_all('span', attrs={'id':'titleDescriptionID'})
for span in spans:
    print(span.string)
In your code, wrapper_href.descendants contains at least 4 elements: the 2 span tags and the 2 strings enclosed by those span tags. It walks the children recursively, so the loop reaches plain strings as well as tags, and child['id'] fails on a string.
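As a rough illustration of the point above, the sketch below lists what .descendants yields and keeps only the Tag objects before looking at their id. The markup is a shortened stand-in for the question's HTML, and the use of bs4 with the built-in "html.parser" is an assumption, not something stated in the answer.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the question's markup (product text abbreviated).
html = (
    '<a href="#">\n'
    '<span class="itemDescription" id="titleDescriptionID">Intel Core i7-3770K</span>\n'
    '<span class="itemDescription" id="lineDescriptionID">Intel Core i7-3770K</span>\n'
    '</a>'
)

soup = BeautifulSoup(html, "html.parser")
wrapper_href = soup.a

# .descendants yields tags and also the strings inside and between them.
kinds = [type(child).__name__ for child in wrapper_href.descendants]
print(kinds)

# Only Tag objects support attribute lookup. Tag.get() returns None
# when the attribute is absent instead of raising KeyError.
matches = [
    child
    for child in wrapper_href.descendants
    if isinstance(child, Tag) and child.get("id") == "titleDescriptionID"
]

for tag in matches:
    print(tag.string)
```

Filtering with isinstance keeps the original loop structure; calling find_all directly, as in the answer, is shorter when only one kind of tag is wanted.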
Answer by Martijn Pieters
wrapper_href.descendants includes any NavigableString objects, which is what you are tripping over. NavigableString objects are essentially string objects, and you are trying to index one with the child['id'] line:
>>> next(wrapper_href.descendants)
u'\n'
Why not just load the tag directly using itemText.find('span', id='titleDescriptionID')?
Demo:
>>> for itemText in soup.find_all('div', attrs={'class':'itemText'}):
...     print(itemText.find('span', id='titleDescriptionID'))
...     print(itemText.find('span', id='titleDescriptionID').text)
...
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K
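One caveat with the direct lookup: find() returns None when nothing matches, so calling .text on the result fails for any itemText block that lacks the span. The sketch below guards against that. The second, span-less div is a made-up case added for illustration, and bs4 with "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Two itemText blocks: the second one is hypothetical and has no matching span.
html = (
    '<div class="itemText"><div class="wrapper">'
    '<a href="#"><span id="titleDescriptionID">Intel Core i7-3770K</span></a>'
    '</div></div>'
    '<div class="itemText"><div class="wrapper">No description here</div></div>'
)

soup = BeautifulSoup(html, "html.parser")

titles = []
for item_text in soup.find_all("div", class_="itemText"):
    span = item_text.find("span", id="titleDescriptionID")
    if span is None:
        # Nothing matched in this block; skip it rather than fail on None.
        continue
    titles.append(span.get_text(strip=True))

print(titles)
```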
Answer by Sudipta
# BeautifulSoup 3 import; with bs4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup
pool = BeautifulSoup(html)  # where html contains the whole html as string
for item in pool.findAll('span', attrs={'id' : 'titleDescriptionID'}):
    print(item.string)
When we search for a tag using BeautifulSoup, we get a BeautifulSoup.Tag object, which can be used directly to access its other attributes, such as inner content, style, href, etc.
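A minimal sketch of that attribute access, assuming bs4 rather than the BeautifulSoup 3 import used in the answer, and using a trimmed copy of the question's markup with the product text abbreviated:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (product text abbreviated).
html = (
    '<a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" '
    'title="View Details">'
    '<span class="itemDescription" id="titleDescriptionID" style="display:inline">'
    'Intel Core i7-3770K</span></a>'
)

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", id="titleDescriptionID")
link = soup.find("a")

inner_text = span.string      # the text enclosed by the tag
style = span["style"]         # dictionary-style access to an attribute
classes = span["class"]       # class is multi-valued, so bs4 returns a list
href = link.get("href")       # .get() returns None if the attribute is absent
target = link.get("target")   # this attribute is not present in the markup

print(inner_text, style, classes, href, target)
```

Indexing with tag["name"] raises KeyError for a missing attribute, so .get() is the safer choice when an attribute may not be present.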