Python BeautifulSoup: <div class <span class></span><span class>TEXT I WANT</span>

Question

I am trying to use BeautifulSoup to extract the string enclosed by the span with id="titleDescriptionID".

<div class="itemText">
    <div class="wrapper">
        <span class="itemPromo">Customer Choice Award Winner</span>
        <a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" title="View Details" >
            <span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz &#40;3.9GHz Turbo&#41; LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
            <span class="itemDescription" id="lineDescriptionID" style="display:none">Intel Core i7-3770K Ivy Bridge 3.5GHz &#40;3.9GHz Turbo&#41; LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
        </a>
    </div>
</div>

Code snippet

f = open('egg.data', 'rb')
content = f.read()
content = content.decode('utf-8', 'replace')
content = ''.join([x for x in content if ord(x) < 128])

soup = bs(content)

for itemText in soup.find_all('div', attrs={'class':'itemText'}):
    wrapper = itemText.div
    wrapper_href = wrapper.a
    for child in wrapper_href.descendants:
        if child['id'] == 'titleDescriptionID':
           print(child, "\n")

Traceback Error:

Traceback (most recent call last):
  File "egg.py", line 66, in <module>
    if child['id'] == 'titleDescriptionID':
TypeError: string indices must be integers

Answer 1

Accepted answer by zhangyangyu

# Look the span up directly by its id attribute
spans = soup.find_all('span', attrs={'id': 'titleDescriptionID'})
for span in spans:
    print(span.string)

In your code, wrapper_href.descendants contains at least 4 elements: the 2 span tags and the 2 strings enclosed by those tags. It walks the children recursively, so the loop reaches plain strings as well as tags, and indexing a string with child['id'] raises the TypeError.
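
To make that concrete, here is a minimal sketch, assuming bs4 is installed and using a shortened, made-up stand-in for the markup in the question. It lists what .descendants yields and keeps only Tag objects before reading id:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the question's markup (illustrative, not the real page).
html = (
    '<a href="#">\n'
    '<span id="titleDescriptionID">Title text</span>\n'
    '<span id="lineDescriptionID">Line text</span>\n'
    '</a>'
)
soup = BeautifulSoup(html, 'html.parser')

# .descendants yields Tag objects and the strings between and inside them.
kinds = [type(child).__name__ for child in soup.a.descendants]

# Skip the strings, then use .get() so tags without an id do not raise.
matches = [
    child for child in soup.a.descendants
    if isinstance(child, Tag) and child.get('id') == 'titleDescriptionID'
]
print(kinds)
print(matches[0].string)
```

The isinstance check is what the original loop was missing. Whether to filter descendants or search by id directly is a matter of taste, though the direct search is shorter.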

Answer 2

Answered by Martijn Pieters

wrapper_href.descendants includes any NavigableString objects, which is what you are tripping over. NavigableString objects are essentially strings, and you are trying to index one with the child['id'] line:

>>> next(wrapper_href.descendants)
u'\n'

Why not just load the tag directly using itemText.find('span', id='titleDescriptionID')?

Demo:

>>> for itemText in soup.find_all('div', attrs={'class':'itemText'}):
...     print(itemText.find('span', id='titleDescriptionID'))
...     print(itemText.find('span', id='titleDescriptionID').text)
... 
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K
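
One caveat with the direct lookup: find() returns None when nothing matches, so chaining .text fails on any item that lacks the span. Below is a hedged sketch with a guard, assuming bs4; the markup is made up and title_texts is a hypothetical helper name, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Made-up markup: the second item deliberately lacks the title span.
html = """
<div class="itemText"><a href="#"><span id="titleDescriptionID">First title</span></a></div>
<div class="itemText"><a href="#"><span id="lineDescriptionID">No title here</span></a></div>
"""

def title_texts(markup):
    """Return the text of each titleDescriptionID span found in an itemText div."""
    soup = BeautifulSoup(markup, 'html.parser')
    texts = []
    for item_text in soup.find_all('div', class_='itemText'):
        span = item_text.find('span', id='titleDescriptionID')
        if span is not None:  # find() returns None when there is no match
            texts.append(span.get_text(strip=True))
    return texts

titles = title_texts(html)
print(titles)
```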

Answer 3

Answered by Sudipta

from bs4 import BeautifulSoup

# html is assumed to hold the whole HTML document as a string
pool = BeautifulSoup(html, 'html.parser')

# find_all is the bs4 name for the older findAll
for item in pool.find_all('span', attrs={'id': 'titleDescriptionID'}):
    print(item.string)

When we search for a tag using BeautifulSoup, we get a Tag object (BeautifulSoup.Tag in BeautifulSoup 3, bs4.element.Tag in bs4), which can be used directly to access the tag's contents and attributes, such as its inner text, style, or href.
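
As a rough illustration of that point, assuming bs4 and a trimmed copy of the question's markup (the description text is shortened), a Tag gives access to its text, its attributes by key, and its parent:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (description text shortened).
html = (
    '<a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" title="View Details">'
    '<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K</span>'
    '</a>'
)
soup = BeautifulSoup(html, 'html.parser')
span = soup.find('span', id='titleDescriptionID')

text = span.string              # inner text, since the span holds a single string
style = span['style']           # attribute lookup by key; raises KeyError if absent
classes = span.get('class')     # class is multi-valued, so bs4 returns a list
href = span.parent.get('href')  # the enclosing <a> tag carries the link

print(text, style, classes, href)
```

Using .get() instead of square brackets returns None for a missing attribute, which tends to be safer when scraping pages whose markup varies.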
