How to extract and download all images from a website using BeautifulSoup? (Python)

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/18408307/

Date: 2020-08-19 10:39:51  Source: igfitidea


Tags: python, beautifulsoup

Asked by user2711817

I am trying to extract and download all images from a url. I wrote a script


import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "http://filmygyan.in/katrina-kaifs-top-10-cutest-pics-gallery/"
urlContent = urllib2.urlopen(url).read()
# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)

# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName,'wb')
        output.write(imgData)
        output.close()
    except:
        pass

I don't want to extract the images of this page (see this screenshot: http://i.share.pho.to/1c9884b1_l.jpeg). I just want to get all of the images without clicking the "Next" button. How can I get all of the pics that are under the "Next" class? What changes should I make to findall?

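As an aside, the script above is Python 2 (urllib2 and urlparse no longer exist in Python 3). A minimal Python 3 port of the same extraction logic, run here against a hardcoded HTML snippet instead of a live page so it can be shown offline, might look like:

```python
import re
from os.path import basename
from urllib.parse import urlsplit  # replaces Python 2's urlparse.urlsplit

# Stand-in for urllib2.urlopen(url).read(); a live request would go here,
# but the snippet below is hardcoded for illustration.
url_content = """
<img src="http://example.com/pics/cute1.jpg" alt="one"/>
<img src="http://example.com/pics/cute2.jpg" alt="two"/>
"""

# Same regex as the original script
img_urls = re.findall(r'img .*?src="(.*?)"', url_content)
file_names = [basename(urlsplit(u).path) for u in img_urls]
print(file_names)  # ['cute1.jpg', 'cute2.jpg']
```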

Answered by 4d4c

If you only want the pictures, then you can just download them without even scraping the webpage. They all share the same URL pattern:


http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute1.jpg
http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute2.jpg
...
http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute10.jpg

So code as simple as this will give you all of the images:


import os
import urllib


baseUrl = "http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-"\
      "cutest-pics-gallery/cute%s.jpg"

for i in range(1,11):
    url = baseUrl % i
    urllib.urlretrieve(url, os.path.basename(url))
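On Python 3, urllib.urlretrieve has moved to urllib.request.urlretrieve. A sketch of the same idea that builds the URL list but keeps the network-dependent download inside an uncalled helper function:

```python
import os
from urllib.request import urlretrieve  # urllib.urlretrieve in Python 2

base_url = ("http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-"
            "cutest-pics-gallery/cute%s.jpg")

# Build the ten gallery URLs without touching the network
urls = [base_url % i for i in range(1, 11)]

def download_all(urls, dest="."):
    """Fetch every URL into dest; requires network access, so not called here."""
    for url in urls:
        urlretrieve(url, os.path.join(dest, os.path.basename(url)))

print(len(urls), urls[0])
```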

With BeautifulSoup you will have to click or go to the next page to scrape the images. If you want to scrape each page individually, try to scrape them using their class, which is shutterset_katrina-kaifs-top-10-cutest-pics-gallery

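Filtering by that class with BeautifulSoup could look like the sketch below. The markup here is made up for illustration (only the class name comes from the answer above); on the real site you would parse the fetched page instead:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the gallery structure
html = """
<a class="shutterset_katrina-kaifs-top-10-cutest-pics-gallery"
   href="http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute1.jpg">pic 1</a>
<a class="shutterset_katrina-kaifs-top-10-cutest-pics-gallery"
   href="http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute2.jpg">pic 2</a>
<a class="unrelated" href="http://example.com/other.jpg">other</a>
"""

soup = BeautifulSoup(html, "html.parser")
# class_ avoids clashing with Python's reserved word "class"
links = soup.find_all("a", class_="shutterset_katrina-kaifs-top-10-cutest-pics-gallery")
urls = [a["href"] for a in links]
print(urls)
```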

Answered by Jonathan

The following should extract all images from a given page and write them to the directory where the script is run.


import re
import requests
from bs4 import BeautifulSoup

site = 'http://pixabay.com'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags if img.get('src')]


for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regex didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative 
            # if it is provide the base url which also happens 
            # to be the site variable atm. 
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
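The manual base-URL concatenation above can be replaced with the standard library's urljoin, which also handles scheme-relative sources (src="//cdn...") that plain string formatting would mangle:

```python
from urllib.parse import urljoin

site = 'http://pixabay.com'

# Relative path: urljoin prepends the base URL
abs_url = urljoin(site, '/static/img/logo.png')
print(abs_url)  # http://pixabay.com/static/img/logo.png

# Scheme-relative source: urljoin keeps the other host, unlike '{}{}'.format
cdn_url = urljoin(site, '//cdn.example.com/img/logo.png')
print(cdn_url)  # http://cdn.example.com/img/logo.png
```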