How to extract and download all images from a website using BeautifulSoup? (Python)

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/18408307/

Date: 2020-08-19 10:39:51  Source: igfitidea


Tags: python, beautifulsoup

Asked by user2711817

I am trying to extract and download all images from a url. I wrote a script


import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "http://filmygyan.in/katrina-kaifs-top-10-cutest-pics-gallery/"
urlContent = urllib2.urlopen(url).read()
# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)

# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName,'wb')
        output.write(imgData)
        output.close()
    except:
        pass

I don't want to extract the images of this page (see this screenshot: http://i.share.pho.to/1c9884b1_l.jpeg). I just want to get all of the images without clicking the "Next" button. How can I get all of the pics that are under the "Next" class? What changes should I make to findall?

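As an aside, the script above is Python 2 (urllib2 and urlparse no longer exist in Python 3). A minimal Python 3 port of the same extraction logic, run here against a hardcoded HTML snippet instead of a live page so it can be shown offline, might look like:

```python
import re
from os.path import basename
from urllib.parse import urlsplit  # replaces Python 2's urlparse.urlsplit

# Stand-in for urllib2.urlopen(url).read(); a live request would go here,
# but the snippet below is hardcoded for illustration.
url_content = """
<img src="http://example.com/pics/cute1.jpg" alt="one"/>
<img src="http://example.com/pics/cute2.jpg" alt="two"/>
"""

# Same regex as the original script
img_urls = re.findall(r'img .*?src="(.*?)"', url_content)
file_names = [basename(urlsplit(u).path) for u in img_urls]
print(file_names)  # ['cute1.jpg', 'cute2.jpg']
```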

Answered by 4d4c

If you only want the pictures, then you can just download them without even scraping the webpage. They all share the same URL pattern:


http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute1.jpg
http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute2.jpg
...
http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute10.jpg

So code as simple as this will give you all of the images:


import os
import urllib


baseUrl = "http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-"\
      "cutest-pics-gallery/cute%s.jpg"

for i in range(1,11):
    url = baseUrl % i
    urllib.urlretrieve(url, os.path.basename(url))
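On Python 3, urllib.urlretrieve has moved to urllib.request.urlretrieve. A sketch of the same idea that builds the URL list but keeps the network-dependent download inside an uncalled helper function:

```python
import os
from urllib.request import urlretrieve  # urllib.urlretrieve in Python 2

base_url = ("http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-"
            "cutest-pics-gallery/cute%s.jpg")

# Build the ten gallery URLs without touching the network
urls = [base_url % i for i in range(1, 11)]

def download_all(urls, dest="."):
    """Fetch every URL into dest; requires network access, so not called here."""
    for url in urls:
        urlretrieve(url, os.path.join(dest, os.path.basename(url)))

print(len(urls), urls[0])
```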

With BeautifulSoup you will have to click or go to the next page to scrape the images. If you want to scrape each page individually, try to scrape them using their class, which is shutterset_katrina-kaifs-top-10-cutest-pics-gallery

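Filtering by that class with BeautifulSoup could look like the sketch below. The markup here is made up for illustration (only the class name comes from the answer above); on the real site you would parse the fetched page instead:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the gallery structure
html = """
<a class="shutterset_katrina-kaifs-top-10-cutest-pics-gallery"
   href="http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute1.jpg">pic 1</a>
<a class="shutterset_katrina-kaifs-top-10-cutest-pics-gallery"
   href="http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute2.jpg">pic 2</a>
<a class="unrelated" href="http://example.com/other.jpg">other</a>
"""

soup = BeautifulSoup(html, "html.parser")
# class_ avoids clashing with Python's reserved word "class"
links = soup.find_all("a", class_="shutterset_katrina-kaifs-top-10-cutest-pics-gallery")
urls = [a["href"] for a in links]
print(urls)
```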

Answered by Jonathan

The following should extract all images from a given page and write them to the directory where the script is run.


import re
import requests
from bs4 import BeautifulSoup

site = 'http://pixabay.com'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags if img.get('src')]


for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regex didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative 
            # if it is provide the base url which also happens 
            # to be the site variable atm. 
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
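The manual base-URL concatenation above can be replaced with the standard library's urljoin, which also handles scheme-relative sources (src="//cdn...") that plain string formatting would mangle:

```python
from urllib.parse import urljoin

site = 'http://pixabay.com'

# Relative path: urljoin prepends the base URL
abs_url = urljoin(site, '/static/img/logo.png')
print(abs_url)  # http://pixabay.com/static/img/logo.png

# Scheme-relative source: urljoin keeps the other host, unlike '{}{}'.format
cdn_url = urljoin(site, '//cdn.example.com/img/logo.png')
print(cdn_url)  # http://cdn.example.com/img/logo.png
```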