Python: extract the `src` attribute from an `img` tag using BeautifulSoup

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/43982002/

Time: 2020-08-19 23:36:01  Source: igfitidea

Extract `src` attribute from `img` tag using BeautifulSoup

python regex bs4

Asked by iDelusion

<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>

I use bs4 and I cannot use a.attrs['src'] to get the src, but I can get the href. What should I do?

Answered by Abu Shoeb

You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tag itself, but the same approach also works for a URL together with urllib2.

For URLs

# Legacy example: Python 2 with BeautifulSoup 3 and urllib2
from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    # print image source
    print image['src']
    # print alternate text
    print image['alt']

For Texts with img tag

from BeautifulSoup import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    print image['src']

Answered by mx0

A link (the a tag) doesn't have a src attribute; you have to target the actual img tag.

import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# this returns the src attribute of the img tag inside the first 'a' tag
print(soup.a.img['src'])  # prints: some

# if you have more than one 'a' tag
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])  # prints: some
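
To make the point above concrete, here is a minimal sketch, assuming bs4 is installed, that reuses the HTML from the question. It shows that the attrs of the a tag hold only href, and then reads src from the nested img tag. The variable names (link_attrs, has_src, img_src, all_srcs) and the CSS selector are illustrative choices, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
a = soup.a

# The a tag carries only its own attributes, so 'src' is not among them.
# (link_attrs and has_src are illustrative names.)
link_attrs = a.attrs
has_src = "src" in link_attrs

# The nested img tag is where src lives; .get() returns None if it is absent.
img_src = a.img.get("src")

# A CSS selector collects the src of every img nested inside an a tag.
all_srcs = [img["src"] for img in soup.select("a img[src]")]

print(link_attrs, has_src, img_src, all_srcs)
```

Using .get() instead of indexing is a defensive choice: indexing a missing attribute raises KeyError, which is what happens with a.attrs['src'] in the question.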

Answered by Gray

You can use BeautifulSoup to extract src attribute of an html img tag. In my example, the htmlText contains the img tag itself but this can be used for a URL too along with urllib2.

The solution in the top-rated answer no longer works with Python 3. Here is an implementation that does:

For URLs

from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'  # placeholder: replace with the page to fetch

response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.find_all('img')

for image in images:
    # print the image source (None if the tag has no src)
    print(image.get('src'))
    # print the alternate text (None if the tag has no alt)
    print(image.get('alt'))

For Texts with img tag

from bs4 import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText, "html.parser")
images = soup.find_all('img')
for image in images:
    print(image['src'])
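
Real pages often contain img tags without a src or alt attribute, in which case indexing with image['src'] raises KeyError. The following is a minimal sketch, assuming bs4 is installed, of one way to tolerate that. The sample markup and the names html_text and pairs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second img has no src, the third has no alt.
html_text = """<img src="https://src1.com/" alt="first" />
<img alt="no source here" />
<img src="https://src2.com/" />"""

soup = BeautifulSoup(html_text, "html.parser")

# src=True keeps only img tags that actually have a src attribute.
images = soup.find_all("img", src=True)

# .get() with a default avoids a KeyError when alt is missing.
pairs = [(img["src"], img.get("alt", "")) for img in images]

for src, alt in pairs:
    print(src, alt)
```

Filtering in find_all keeps the loop body simple; an alternative is to call image.get('src') inside the loop and skip None values.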