Extracting image src based on attribute with BeautifulSoup in Python

Disclaimer: this page is based on a popular StackOverflow Q&A and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute the original authors (not this site). Original question: http://stackoverflow.com/questions/18304532/

Tags: python, html-parsing, web-scraping, beautifulsoup

Asked by

I'm using BeautifulSoup to get an HTML page from IMDb, and I would like to extract the poster image from the page. I've found the image element based on one of its attributes, but I don't know how to extract the data inside it.

Here's my code:

url = 'http://www.imdb.com/title/tt%s/' % (id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print("before FOR")
for src in soup.find(itemprop="image"): 
    print("inside FOR")
    print(link.get('src'))

Accepted answer by Zero Piraeus

You're almost there - just a couple of mistakes. soup.find() gets the first element that matches, not a list, so you don't need to iterate over it. Once you have the element, you can get its attributes (like src) using dictionary access. Here's a reworked version:

film_id = '0423409'
url = 'http://www.imdb.com/title/tt%s/' % (film_id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
link = soup.find(itemprop="image")
print(link["src"])
# output:
# http://ia.media-imdb.com/images/M/MV5BMTg2ODMwNTY3NV5BMl5BanBnXkFtZTcwMzczNjEzMQ@@._V1_SY317_CR0,0,214,317_.jpg

I've changed id to film_id, because id() is a built-in function, and it's bad practice to shadow those.

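A side note not in the original answer: if the page happens to have no matching element, soup.find() returns None, so indexing it with ["src"] would fail. A small defensive variant, assuming the same soup object as above:

link = soup.find(itemprop="image")
if link is not None:
    # .get() returns None instead of raising KeyError when the attribute is missing
    print(link.get("src"))
else:
    print('no element with itemprop="image" found')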

Answer by David Maust

I believe your example is very close. You need to use findAll() instead of find(), and in your loop you start iterating with src but then reference link. In the example below I renamed the loop variable to tag.

This code works for me with BeautifulSoup 4:

url = 'http://www.imdb.com/title/tt%s/' % (id,)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print "before FOR"
for tag in soup.findAll(itemprop="image"): 
    print "inside FOR"
    print(tag['src'])
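For reference, findAll() still works in BeautifulSoup 4, but the same method is also exposed as find_all(), and on Python 3 urllib2 has become urllib.request. A minimal sketch of the same idea under those assumptions (IMDb's markup may of course have changed since these answers were written):

from urllib.request import urlopen
from bs4 import BeautifulSoup

film_id = '0423409'  # example id from the accepted answer
url = 'http://www.imdb.com/title/tt%s/' % film_id
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
for tag in soup.find_all(itemprop="image"):
    print(tag.get('src'))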

Answer by Pablo Ruiz Ruiz

If I understand correctly, you are looking for the src of the image so that you can then download or process it.

First, you need to find (using the browser inspector) where in the HTML the image sits. For example, in my particular case, where I was scraping soccer team shields, I needed:

# Assumed imports for the aliases used below (uOpen = urlopen, BS = BeautifulSoup)
from urllib.request import urlopen as uOpen
from bs4 import BeautifulSoup as BS

m_url = 'http://www.marca.com/futbol/primera/equipos.html'
client = uOpen(m_url)
page = client.read()
client.close()

page_soup = BS(page, 'html.parser')

teams = page_soup.findAll('li', {'id': 'nombreEquipo'})
for team in teams:
    name = team.h2.text           # team name
    shield_url = team.img['src']  # src attribute of the shield image
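Note that, as written, name and shield_url are overwritten on every pass of the loop; if all the pairs are needed later, one illustrative variation is to collect them into a dict first:

shields = {}
for team in teams:
    shields[team.h2.text] = team.img['src']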

Then, you need to process the image. You have two options.

1st: using NumPy (with OpenCV to decode the bytes):

import cv2          # OpenCV - assumed import for the cv2 calls below
import numpy as np  # assumed import for the np alias

def url_to_image(url):
    '''
    Fetch an image from a URL and decode it into an OpenCV/NumPy array.
    '''
    resp = uOpen(url)  # uOpen: the urlopen alias imported above
    image = np.asarray(bytearray(resp.read()), dtype='uint8')
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

shield = url_to_image(shield_url)

2nd: using the scikit-image library (which you will probably need to install):

from skimage import io  # assumed import for the io module used below
shield = io.imread('http:' + shield_url)
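If scikit-image is not installed yet, the package name on PyPI is scikit-image, so a typical install (assuming pip is available) is:

pip install scikit-image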

Note: just in this particular example I needed to prepend http: to the URL.
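More generally, a relative or protocol-relative src can be resolved against the page URL with urllib.parse.urljoin instead of hard-coding the scheme; a small sketch, not part of the original answer:

from urllib.parse import urljoin

# Resolves values such as '//host/img.png' or '/img.png' against the scraped page's URL.
absolute_url = urljoin(m_url, shield_url)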

Hope it helps!
