Extracting image src based on attribute with BeautifulSoup in Python

Disclaimer: this page is based on a popular StackOverflow Q&A and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute the original authors (not this site). Original question: http://stackoverflow.com/questions/18304532/

Tags: python, html-parsing, web-scraping, beautifulsoup

Asked by

I'm using BeautifulSoup to get an HTML page from IMDb, and I would like to extract the poster image from the page. I've found the image element based on one of its attributes, but I don't know how to extract the data inside it.

Here's my code:

url = 'http://www.imdb.com/title/tt%s/' % (id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print("before FOR")
for src in soup.find(itemprop="image"): 
    print("inside FOR")
    print(link.get('src'))

Accepted answer by Zero Piraeus

You're almost there - just a couple of mistakes. soup.find() gets the first element that matches, not a list, so you don't need to iterate over it. Once you have the element, you can get its attributes (like src) using dictionary access. Here's a reworked version:

film_id = '0423409'
url = 'http://www.imdb.com/title/tt%s/' % (film_id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
link = soup.find(itemprop="image")
print(link["src"])
# output:
# http://ia.media-imdb.com/images/M/MV5BMTg2ODMwNTY3NV5BMl5BanBnXkFtZTcwMzczNjEzMQ@@._V1_SY317_CR0,0,214,317_.jpg

I've changed id to film_id, because id() is a built-in function, and it's bad practice to shadow those.

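A side note not in the original answer: if the page happens to have no matching element, soup.find() returns None, so indexing it with ["src"] would fail. A small defensive variant, assuming the same soup object as above:

link = soup.find(itemprop="image")
if link is not None:
    # .get() returns None instead of raising KeyError when the attribute is missing
    print(link.get("src"))
else:
    print('no element with itemprop="image" found')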

Answer by David Maust

I believe your example is very close. You need to use findAll() instead of find(), and in your loop you start iterating with src but then reference link. In the example below I renamed the loop variable to tag.

This code works for me with BeautifulSoup 4:

url = 'http://www.imdb.com/title/tt%s/' % (id,)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print "before FOR"
for tag in soup.findAll(itemprop="image"): 
    print "inside FOR"
    print(tag['src'])
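For reference, findAll() still works in BeautifulSoup 4, but the same method is also exposed as find_all(), and on Python 3 urllib2 has become urllib.request. A minimal sketch of the same idea under those assumptions (IMDb's markup may of course have changed since these answers were written):

from urllib.request import urlopen
from bs4 import BeautifulSoup

film_id = '0423409'  # example id from the accepted answer
url = 'http://www.imdb.com/title/tt%s/' % film_id
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
for tag in soup.find_all(itemprop="image"):
    print(tag.get('src'))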

Answer by Pablo Ruiz Ruiz

If I understand correctly, you are looking for the src of the image so that you can then download or process it.

First, you need to find (using the browser inspector) where in the HTML the image sits. For example, in my particular case, where I was scraping soccer team shields, I needed:

# Assumed imports for the aliases used below (uOpen = urlopen, BS = BeautifulSoup)
from urllib.request import urlopen as uOpen
from bs4 import BeautifulSoup as BS

m_url = 'http://www.marca.com/futbol/primera/equipos.html'
client = uOpen(m_url)
page = client.read()
client.close()

page_soup = BS(page, 'html.parser')

teams = page_soup.findAll('li', {'id': 'nombreEquipo'})
for team in teams:
    name = team.h2.text           # team name
    shield_url = team.img['src']  # src attribute of the shield image
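Note that, as written, name and shield_url are overwritten on every pass of the loop; if all the pairs are needed later, one illustrative variation is to collect them into a dict first:

shields = {}
for team in teams:
    shields[team.h2.text] = team.img['src']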

Then, you need to process the image. You have two options.

1st: using NumPy (with OpenCV to decode the bytes):

import cv2          # OpenCV - assumed import for the cv2 calls below
import numpy as np  # assumed import for the np alias

def url_to_image(url):
    '''
    Fetch an image from a URL and decode it into an OpenCV/NumPy array.
    '''
    resp = uOpen(url)  # uOpen: the urlopen alias imported above
    image = np.asarray(bytearray(resp.read()), dtype='uint8')
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

shield = url_to_image(shield_url)

2nd: using the scikit-image library (which you will probably need to install):

from skimage import io  # assumed import for the io module used below
shield = io.imread('http:' + shield_url)
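If scikit-image is not installed yet, the package name on PyPI is scikit-image, so a typical install (assuming pip is available) is:

pip install scikit-image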

Note: just in this particular example I needed to prepend http: to the URL.
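More generally, a relative or protocol-relative src can be resolved against the page URL with urllib.parse.urljoin instead of hard-coding the scheme; a small sketch, not part of the original answer:

from urllib.parse import urljoin

# Resolves values such as '//host/img.png' or '/img.png' against the scraped page's URL.
absolute_url = urljoin(m_url, shield_url)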

Hope it helps!
