Extract `src` attribute from `img` tag using BeautifulSoup
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/43982002/
Asked by iDelusion
<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>
I use bs4 and I cannot use `a.attrs['src']` to get the `src`, but I can get the `href`. What should I do?
Answered by Abu Shoeb
You can use `BeautifulSoup` to extract the `src` attribute of an HTML `img` tag. In my example, `htmlText` contains the `img` tag itself, but this can be used for a URL too, along with `urllib2`.
For URLs
# Python 2 with the legacy BeautifulSoup 3 package
from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2

page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    # print image source
    print image['src']
    # print alternate text
    print image['alt']
For text containing img tags
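# Python 2 with the legacy BeautifulSoup 3 package
from BeautifulSoup import BeautifulSoup as BSHTML

# each img tag is closed so that both are parsed as separate tags
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" />"""
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    print image['src']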
Answered by mx0
A link doesn't have a `src` attribute; you have to target the actual `img` tag.
import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# this returns the src attribute of the img tag inside the first 'a' tag
soup.a.img['src']
# 'some'

# if you have more than one 'a' tag
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])
# prints: some
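The reason `a.attrs['src']` fails can be checked directly: the `a` tag carries only `href`. The following is a minimal sketch, assuming `bs4` is installed and reusing the HTML from the question. The variable names `a_attrs`, `a_src`, and `srcs` are illustrative, and the CSS selector is one possible alternative rather than the answerer's own method.

```python
from bs4 import BeautifulSoup

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")

# The 'a' tag only has an href attribute, so its attrs dict has no 'src' key.
a_attrs = dict(a.attrs)

# .get() returns None for a missing attribute instead of raising KeyError.
a_src = a.get("src")

# CSS selector: every img that has a src attribute and sits inside an 'a' tag.
srcs = [img["src"] for img in soup.select("a img[src]")]

print(a_attrs, a_src, srcs)
```

Using `select` avoids the explicit `if a.img` check, since links without a nested image simply produce no match.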
Answered by Gray
You can use BeautifulSoup to extract src attribute of an html img tag. In my example, the htmlText contains the img tag itself but this can be used for a URL too along with urllib2.
The solution in the top-rated answer no longer works with Python 3. Here is a working implementation:
For URLs
from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'  # placeholder: replace with the page to fetch
response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.find_all('img')
for image in images:
    # print image source
    print(image['src'])
    # print alternate text
    print(image['alt'])
For text containing img tags
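from bs4 import BeautifulSoup as BSHTML

# each img tag is closed so that both are parsed as separate tags
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" />"""
# name the parser explicitly to avoid the "no parser specified" warning
soup = BSHTML(htmlText, "html.parser")
images = soup.find_all('img')
for image in images:
    print(image['src'])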
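On real pages some `img` tags lack a `src` or `alt` attribute, and indexing a tag with a missing key raises `KeyError`. The following is a minimal sketch of one way to guard against that, assuming `bs4` is installed. The sample markup and the names `html_text` and `results` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the third image has no src, the second has no alt.
html_text = """
<img src="https://src1.com/" alt="first" />
<img src="https://src2.com/" />
<img alt="no source" />
"""

soup = BeautifulSoup(html_text, "html.parser")

# src=True keeps only the img tags that actually have a src attribute.
images = soup.find_all("img", src=True)

# .get() falls back to the default instead of raising KeyError when alt is absent.
results = [(img["src"], img.get("alt", "")) for img in images]

for src, alt in results:
    print(src, alt)
```

Filtering in `find_all` keeps the loop body simple; whether to skip images without `src` or report them is a design choice that depends on the use case.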