Python: Extract the `src` attribute from an `img` tag using BeautifulSoup

Question

Asked by iDelusion

<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>

I am using bs4. I cannot get the src with a.attrs['src'], but I can get the href. What should I do?

Answer 1

Answered by Abu Shoeb

You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags themselves, but the same approach works for a URL when combined with urllib2.

For URLs


# Note: this snippet targets Python 2 with BeautifulSoup 3 (urllib2, print statement)
from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    # print image source
    print image['src']
    # print alternate text
    print image['alt']

For text containing img tags

from BeautifulSoup import BeautifulSoup as BSHTML
# each img tag is closed so that both are parsed as separate elements
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    print image['src']

Answer 2

Answered by mx0

A link (the a tag) doesn't have a src attribute; you have to target the actual img tag.

import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# this will return src attrib from img tag that is inside 'a' tag
soup.a.img['src']

>>> 'some'

# if you have more than one 'a' tag
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])

>>> 'some'
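
To make the reason concrete, here is a minimal sketch that inspects the attribute dictionaries of both tags and then collects the src values defensively. The second link without an image and the helper name img_srcs_in_links are assumptions added for illustration; they are not part of the original question.

```python
from bs4 import BeautifulSoup

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
    <a href="no-image">plain link</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# The <a> tag only carries its own attributes, so 'src' is absent there.
link_attrs = soup.a.attrs
# The nested <img> tag is where 'src' actually lives.
img_attrs = soup.a.img.attrs


def img_srcs_in_links(soup):
    """Return the src of the first <img> inside each <a>, skipping links without one."""
    srcs = []
    for a in soup.find_all("a"):
        img = a.find("img")
        # .get() returns None instead of raising KeyError when src is missing
        if img is not None and img.get("src"):
            srcs.append(img["src"])
    return srcs


print(link_attrs)
print(img_attrs)
print(img_srcs_in_links(soup))
```

Looking up a.attrs['src'] fails because the key simply is not in the link's attribute dictionary, while a.attrs['href'] succeeds for the same reason in reverse.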

Answer 3

Answered by Gray


The solution in the top-rated answer no longer works with Python 3. Here is a working implementation:

For URLs


from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'

response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.find_all('img')

for image in images:
    # print image source
    print(image['src'])
    # print alternate text
    print(image['alt'])

For text containing img tags

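from bs4 import BeautifulSoup as BSHTML
# each img tag is closed so that both are parsed as separate elements
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BSHTML(htmlText, "html.parser")
images = soup.find_all('img')
for image in images:
    print(image['src'])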
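
Real pages often contain img tags that lack a src or alt attribute, and indexing a missing attribute on a bs4 tag raises KeyError. The following is a minimal sketch of one way to guard against that, not a definitive implementation; the sample markup and the variable names html_text and pairs are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

html_text = """
<img src="https://src1.com/" alt="first" />
<img src="https://src2.com/" />
<img alt="no source" />
"""

soup = BeautifulSoup(html_text, "html.parser")

# src=True keeps only <img> tags that actually have a src attribute
images = soup.find_all("img", src=True)

# .get() returns the given default instead of raising KeyError when alt is absent
pairs = [(img["src"], img.get("alt", "")) for img in images]

for src, alt in pairs:
    print(src, alt)
```

Filtering with src=True skips the third tag entirely, and the empty-string default keeps the loop from failing on the second tag, which has no alt text.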
