Disclaimer: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/20716842/

Time: 2020-08-18 21:04:30  Source: igfitidea

Python - Download images from Google Image Search?

Tags: python, web-scraping

Asked by user3116355

I want to download all the images from a Google image search using Python. The code I am using sometimes fails. My code is:

import os
import sys
import time
from urllib import FancyURLopener
import urllib2
import simplejson

# Define search term
searchTerm = "parrot"

# Replace spaces ' ' in search term with '%20' in order to comply with request
searchTerm = searchTerm.replace(' ','%20')


# Start FancyURLopener with defined version
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()

# Set count to 0
count = 0

for i in range(0,10):
    # Notice that the start changes for each iteration in order to request a new set of images for each loop
    url = ('https://ajax.googleapis.com/ajax/services/search/images?' + 'v=1.0&q='+searchTerm+'&start='+str(i*10)+'&userip=MyIP')
    print url
    request = urllib2.Request(url, None, {'Referer': 'testing'})
    response = urllib2.urlopen(request)

    # Get results using JSON
    results = simplejson.load(response)
    data = results['responseData']
    dataInfo = data['results']

    # Iterate for each result and get unescaped url
    for myUrl in dataInfo:
        count = count + 1
        my_url = myUrl['unescapedUrl']
        myopener.retrieve(myUrl['unescapedUrl'],str(count)+'.jpg')

After downloading a few pages, I get the following error:

Traceback (most recent call last):
  File "C:\Python27\img_google3.py", line 37, in <module>
    dataInfo = data['results']
TypeError: 'NoneType' object has no attribute '__getitem__'

What should I do?

Answered by jobin

The Google Image Search API is deprecated; you need to use Google Custom Search for what you want to achieve. To fetch the images you need to do this:

import urllib2
import simplejson
import cStringIO

fetcher = urllib2.build_opener()
searchTerm = 'parrot'
startIndex = 0
# startIndex is an int, so convert it before concatenating it into the URL
searchUrl = "http://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=" + searchTerm + "&start=" + str(startIndex)
f = fetcher.open(searchUrl)
deserialized_output = simplejson.load(f)

This will give you 4 results as JSON. To get more, you need to fetch the results iteratively by incrementing the startIndex in the API request.
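As a rough illustration of incrementing startIndex, the sketch below only builds the request URLs for successive pages; it does not contact the service, which the answer describes as deprecated and which may no longer respond. It is written for Python 3, and the function name `build_search_urls` and the `page_size` parameter are hypothetical names introduced here, with the page size of 4 taken from the answer's statement about the number of results per request.

```python
from urllib.parse import urlencode

# Endpoint copied from the answer above; the API is described there as deprecated.
BASE_URL = "http://ajax.googleapis.com/ajax/services/search/images"


def build_search_urls(term, pages, page_size=4):
    """Return one request URL per page, advancing `start` by page_size each time."""
    urls = []
    for page in range(pages):
        params = {"v": "1.0", "q": term, "start": page * page_size}
        # urlencode also escapes spaces and other special characters in the term
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


urls = build_search_urls("red parrot", pages=3)
for url in urls:
    print(url)
```

Using urlencode rather than manual string concatenation avoids both the int-to-str problem and the need to escape spaces by hand.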

To load the downloaded image data, you can use a library like cStringIO, which wraps the bytes in an in-memory file object.

For example, to access the first image, you need to do this:

import urllib
from PIL import Image  # assumption: PIL/Pillow provides the Image class used below

imageUrl = deserialized_output['responseData']['results'][0]['unescapedUrl']
file = cStringIO.StringIO(urllib.urlopen(imageUrl).read())
img = Image.open(file)
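The traceback in the question indicates that `responseData` was `None` on some requests, so indexing into it directly can fail. A minimal defensive sketch in Python 3 follows; the helper name `extract_image_urls` is hypothetical, and the shape of the failing payload shown here is an illustrative assumption rather than a documented response.

```python
def extract_image_urls(payload):
    """Return the unescaped image URLs from a decoded JSON response.

    Returns an empty list when 'responseData' is missing or None
    instead of raising a TypeError.
    """
    data = payload.get("responseData")
    if not data:
        return []
    results = data.get("results") or []
    return [r["unescapedUrl"] for r in results if "unescapedUrl" in r]


# Illustrative payloads (assumed shapes, not captured from the real service)
ok = {"responseData": {"results": [{"unescapedUrl": "http://example.com/a.jpg"}]}}
failed = {"responseData": None, "responseStatus": 403}

print(extract_image_urls(ok))
print(extract_image_urls(failed))
```

A caller could stop paginating as soon as the helper returns an empty list.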

Answered by rishabhr0y

I have modified my code. It can now download 100 images for a given query, and the images are full resolution; that is, the original images are downloaded.

I am downloading the images using urllib2 and Beautiful Soup.

from bs4 import BeautifulSoup
import requests
import re
import urllib2
import os
import cookielib
import json

def get_soup(url,header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url,headers=header)),'html.parser')


query = raw_input("query image")# you can change the query for the image  here
image_type="ActiOn"
query= query.split()
query='+'.join(query)
url="https://www.google.co.in/search?q="+query+"&source=lnms&tbm=isch"
print url
#add the directory for your image here
DIR="Pictures"
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"
}
soup = get_soup(url,header)


ActualImages=[]# contains the link for Large original images, type of  image
for a in soup.find_all("div",{"class":"rg_meta"}):
    link , Type =json.loads(a.text)["ou"]  ,json.loads(a.text)["ity"]
    ActualImages.append((link,Type))

print  "there are total" , len(ActualImages),"images"

if not os.path.exists(DIR):
    os.mkdir(DIR)
DIR = os.path.join(DIR, query.split()[0])

if not os.path.exists(DIR):
    os.mkdir(DIR)
###print images
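for i , (img , Type) in enumerate( ActualImages):
    try:
        # header is already a dict holding the User-Agent entry, so pass it directly
        req = urllib2.Request(img, headers=header)
        raw_img = urllib2.urlopen(req).read()

        # count files already saved; a distinct name avoids overwriting the loop variable i
        cntr = len([name for name in os.listdir(DIR) if image_type in name]) + 1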
        print cntr
        if len(Type)==0:
            f = open(os.path.join(DIR , image_type + "_"+ str(cntr)+".jpg"), 'wb')
        else :
            f = open(os.path.join(DIR , image_type + "_"+ str(cntr)+"."+Type), 'wb')


        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : "+img
        print e

I hope this helps.
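The code above relies on each `div` with class `rg_meta` containing a JSON object whose "ou" key holds the original image URL and whose "ity" key holds the image type. The Python 3 sketch below shows that extraction step on a small hand-written HTML snippet. It swaps in the standard library's `html.parser` for Beautiful Soup so it runs without extra packages; the class name `RgMetaParser` and the sample markup are hypothetical, and Google's real page layout may differ or have changed.

```python
import json
from html.parser import HTMLParser


class RgMetaParser(HTMLParser):
    """Collect (url, type) pairs from <div class="rg_meta">{...json...}</div> blocks."""

    def __init__(self):
        super().__init__()
        self._in_meta = False
        self._buffer = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "rg_meta" in classes:
            self._in_meta = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_meta:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._in_meta:
            meta = json.loads("".join(self._buffer))
            # "ou" is the original URL, "ity" the image type (may be absent or empty)
            self.images.append((meta["ou"], meta.get("ity", "")))
            self._in_meta = False


# Hand-written sample markup, not real search output
sample_html = (
    '<div class="rg_meta">{"ou": "http://example.com/a.jpg", "ity": "jpg"}</div>'
    '<div class="other">ignored</div>'
    '<div class="rg_meta">{"ou": "http://example.com/b", "ity": ""}</div>'
)

parser = RgMetaParser()
parser.feed(sample_html)
parser.close()
print(parser.images)
```

Parsing each JSON blob once and reading both keys from the result also avoids the double `json.loads` call in the original loop.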

Answered by Mostafa

Google deprecated their API, and scraping Google is complicated, so I would suggest using the Bing API instead:

https://datamarket.azure.com/dataset/5BA839F1-12CE-4CCE-BF57-A49D98D29A44

Google is not so good, and Microsoft is not so evil.

Answered by Piees

I haven't looked into your code, but this is an example solution made with Selenium that tries to get 400 pictures for the search term.

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import json
import os
import urllib2

searchterm = 'vannmelon' # will also be the name of the folder
url = "https://www.google.co.in/search?q="+searchterm+"&source=lnms&tbm=isch"
browser = webdriver.Firefox()
browser.get(url)
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
counter = 0
succounter = 0

if not os.path.exists(searchterm):
    os.mkdir(searchterm)

for _ in range(500):
    browser.execute_script("window.scrollBy(0,10000)")

for x in browser.find_elements_by_xpath("//div[@class='rg_meta']"):
    counter = counter + 1
    print "Total Count:", counter
    print "Successful Count:", succounter
    print "URL:",json.loads(x.get_attribute('innerHTML'))["ou"]

    img = json.loads(x.get_attribute('innerHTML'))["ou"]
    imgtype = json.loads(x.get_attribute('innerHTML'))["ity"]
    try:
        # header is already a dict holding the User-Agent entry, so pass it directly
        req = urllib2.Request(img, headers=header)
        raw_img = urllib2.urlopen(req).read()
        File = open(os.path.join(searchterm , searchterm + "_" + str(counter) + "." + imgtype), "wb")
        File.write(raw_img)
        File.close()
        succounter = succounter + 1
    except Exception as e:
        print "can't get img:", e

print succounter, "pictures successfully downloaded"
browser.close()

Answered by Suat Atan PhD

You can also use Selenium with Python. Here is how:

from selenium import webdriver
import urllib
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome('C:/Python27/Scripts/chromedriver.exe')
word="apple"
url="http://images.google.com/search?q="+word+"&tbm=isch&sout=1"
driver.get(url)
imageXpathSelector='//*[@id="ires"]/table/tbody/tr[1]/td[1]/a/img'
img=driver.find_element_by_xpath(imageXpathSelector)
src=(img.get_attribute('src'))
urllib.urlretrieve(src, word+".jpg")
driver.close()

(This code works on Python 2.7.) Note that you need to install the Selenium package with 'pip install selenium' and download chromedriver.exe from here

Unlike the other web scraping techniques, Selenium opens a real browser and downloads the items, because Selenium was designed for testing rather than scraping.

Answered by atif93

Adding to Piees's answer: to download any number of images from the search results, we need to simulate a click on the 'Show more results' button after the first 400 results are loaded.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import json
import urllib2
import sys
import time

# adding path to geckodriver to the OS environment variable
# assuming that it is stored at the same path as this script
os.environ["PATH"] += os.pathsep + os.getcwd()
download_path = "dataset/"

def main():
    searchtext = sys.argv[1] # the search query
    num_requested = int(sys.argv[2]) # number of images to download
    number_of_scrolls = num_requested / 400 + 1 
    # number_of_scrolls * 400 images will be opened in the browser

    if not os.path.exists(download_path + searchtext.replace(" ", "_")):
        os.makedirs(download_path + searchtext.replace(" ", "_"))

    url = "https://www.google.co.in/search?q="+searchtext+"&source=lnms&tbm=isch"
    driver = webdriver.Firefox()
    driver.get(url)

    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    extensions = {"jpg", "jpeg", "png", "gif"}
    img_count = 0
    downloaded_img_count = 0

    for _ in xrange(number_of_scrolls):
        for __ in xrange(10):
            # multiple scrolls needed to show all 400 images
            driver.execute_script("window.scrollBy(0, 1000000)")
            time.sleep(0.2)
        # to load next 400 images
        time.sleep(0.5)
        try:
            driver.find_element_by_xpath("//input[@value='Show more results']").click()
        except Exception as e:
            print "Less images found:", e
            break

    # imges = driver.find_elements_by_xpath('//div[@class="rg_meta"]') # not working anymore
    imges = driver.find_elements_by_xpath('//div[contains(@class,"rg_meta")]')
    print "Total images:", len(imges), "\n"
    for img in imges:
        img_count += 1
        img_url = json.loads(img.get_attribute('innerHTML'))["ou"]
        img_type = json.loads(img.get_attribute('innerHTML'))["ity"]
        print "Downloading image", img_count, ": ", img_url
        try:
            if img_type not in extensions:
                img_type = "jpg"
            req = urllib2.Request(img_url, headers=headers)
            raw_img = urllib2.urlopen(req).read()
            f = open(download_path+searchtext.replace(" ", "_")+"/"+str(downloaded_img_count)+"."+img_type, "wb")
            f.write(raw_img)
            f.close()
            downloaded_img_count += 1
        except Exception as e:
            print "Download failed:", e
        finally:
            print
        if downloaded_img_count >= num_requested:
            break

    print "Total downloaded: ", downloaded_img_count, "/", img_count
    driver.quit()

if __name__ == "__main__":
    main()

Full code is here.
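Two small pieces of the script above are easy to check in isolation: how many scroll rounds are planned for a requested image count, and how an unrecognized image type falls back to "jpg". The Python 3 sketch below restates them as functions; the names `scrolls_needed` and `pick_extension` are hypothetical. Note that the original is Python 2, where `/` on two ints is integer division; in Python 3 the same expression yields a float, so `//` is used here.

```python
EXTENSIONS = {"jpg", "jpeg", "png", "gif"}


def scrolls_needed(num_requested, per_page=400):
    """Number of 'load more' rounds, assuming about per_page images per round."""
    return num_requested // per_page + 1


def pick_extension(img_type, allowed=EXTENSIONS, default="jpg"):
    """Keep a known extension; fall back to the default for anything else."""
    return img_type if img_type in allowed else default


print(scrolls_needed(1000))
print(pick_extension("png"), pick_extension("webp"), pick_extension(""))
```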

Answered by CumminUp07

I know this question is old, but I ran across it recently and none of the previous answers work anymore. So I wrote this script to gather images from Google. As of right now, it can download as many images as are available.

Here is a GitHub link to it as well: https://github.com/CumminUp07/imengine/blob/master/get_google_images.py

DISCLAIMER: DUE TO COPYRIGHT ISSUES, IMAGES GATHERED SHOULD BE USED FOR RESEARCH AND EDUCATION PURPOSES ONLY

from bs4 import BeautifulSoup as Soup
import urllib2
import json
import urllib

#programmatically go through google image ajax json return and save links to list#
#num_images is more of a suggestion                                            #  
#it will get the ceiling of the nearest 100 if available                       #
def get_links(query_string, num_images):
    #initialize place for links
    links = []
    #step by 100 because each return gives up to 100 links
    for i in range(0,num_images,100):
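        # build the URL from adjacent string pieces; a backslash continuation inside
        # the quotes would embed the next line's leading spaces in the URL
        url = ('https://www.google.com/search?ei=1m7NWePfFYaGmQG51q7IBg&hl=en&q='+query_string+
               '&tbm=isch&ved=0ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ&start='+str(i)+
               '&yv=2&vet=10ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ.1m7NWePfFYaGmQG51q7IBg.i&ijn=1&asearch=ichunk&async=_id:rg_s,_pms:s')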

        #set user agent to avoid 403 error
        request = urllib2.Request(url, None, {'User-Agent': 'Mozilla/5.0'}) 

        #returns json formatted string of the html
        json_string = urllib2.urlopen(request).read() 

        #parse as json
        page = json.loads(json_string) 

        #html found here
        html = page[1][1] 

        #use BeautifulSoup to parse as html
        new_soup = Soup(html,'lxml')

        #all img tags, only returns results of search
        imgs = new_soup.find_all('img')

        #loop through images and put src in links list
        for j in range(len(imgs)):
            links.append(imgs[j]["src"])

    return links

#download images                              #
#takes list of links, directory to save to    # 
#and prefix for file names                    #
#saves images in directory as a one up number #
#with prefix added                            #
#all images will be .jpg                      #
def get_images(links,directory,pre):
    for i in range(len(links)):
        urllib.urlretrieve(links[i], "./"+directory+"/"+str(pre)+str(i)+".jpg")

#main function to search images                 #
#takes two lists, base term and secondary terms #
#also takes number of images to download per    #
#combination                                    #
#it runs every combination of search terms      #
#with base term first then secondary            #
def search_images(base,terms,num_images):
    for y in range(len(base)):
        for x in range(len(terms)):
            all_links = get_links(base[y]+'+'+terms[x],num_images)
            get_images(all_links,"images",x)

if __name__ == '__main__':
    terms = ["cars","numbers","scenery","people","dogs","cats","animals"]
    base = ["animated"]
    search_images(base,terms,1000)
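The comment in `get_links` says that `num_images` is "more of a suggestion" and that the count is rounded up to the next 100. The short Python 3 sketch below makes that concrete by listing the `start` offsets the loop would request; `chunk_starts` and `max_links` are hypothetical helper names, and the figure of up to 100 links per request comes from the comment in the script.

```python
def chunk_starts(num_images, step=100):
    """Offsets passed as the 'start' parameter, one per request."""
    return list(range(0, num_images, step))


def max_links(num_images, step=100):
    """Upper bound on links gathered if every request returns a full chunk."""
    return len(chunk_starts(num_images, step)) * step


print(chunk_starts(250))
print(max_links(250))
```

So asking for 250 images issues three requests and can return up to 300 links, which matches the "ceiling of the nearest 100" note.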

Answered by Sam Watkins

Here's my latest Google image snarfer, written in Python, using Selenium and headless Chrome.

It requires python-selenium, the chromium-driver, and a module called retry from pip.

Link: http://sam.aiki.info/b/google-images.py

Example Usage:

google-images.py tiger 10 --opts isz:lt,islt:svga,itp:photo > urls.txt
parallel=5
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
(i=0; while read url; do wget -e robots=off -T10 --tries 10 -U"$user_agent" "$url" -O`printf %04d $i`.jpg & i=$(($i+1)) ; [ $(($i % $parallel)) = 0 ] && wait; done < urls.txt; wait)
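The shell loop above runs wget on up to five URLs at a time and names the files with a zero-padded counter. For readers who prefer to stay in Python, here is a rough equivalent using a thread pool. It is a sketch under assumptions, not the author's method: `fetch`, `numbered_names` and `download_all` are hypothetical names, the User-Agent string is a placeholder, and unlike wget there are no retries.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

USER_AGENT = "Mozilla/5.0"  # placeholder; the shell example uses a full browser string


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def numbered_names(urls):
    """0000.jpg, 0001.jpg, ... in the same order as the input URLs."""
    return ["%04d.jpg" % index for index, _ in enumerate(urls)]


def download_all(urls, fetcher=fetch, workers=5):
    """Fetch all URLs with at most `workers` in flight; failed downloads map to None."""
    def safe_fetch(url):
        try:
            return fetcher(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        bodies = list(pool.map(safe_fetch, urls))  # map preserves input order
    return dict(zip(numbered_names(urls), bodies))
```

The function returns the downloaded bytes keyed by file name rather than writing to disk, so a caller would read the lines of urls.txt into a list, call `download_all`, and write each non-None body to its file in binary mode.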

Help Usage:

$ google-images.py --help
usage: google-images.py [-h] [--safe SAFE] [--opts OPTS] query n

Fetch image URLs from Google Image Search.

positional arguments:
  query        image search query
  n            number of images (approx)

optional arguments:
  -h, --help   show this help message and exit
  --safe SAFE  safe search [off|active|images]
  --opts OPTS  search options, e.g.
               isz:lt,islt:svga,itp:photo,ic:color,ift:jpg

Code:

#!/usr/bin/env python3

# requires: selenium, chromium-driver, retry

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import selenium.common.exceptions as sel_ex
import sys
import time
import urllib.parse
from retry import retry
import argparse
import logging

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger()
retry_logger = None

css_thumbnail = "img.Q4LuWd"
css_large = "img.n3VNCb"
css_load_more = ".mye4qd"
selenium_exceptions = (sel_ex.ElementClickInterceptedException, sel_ex.ElementNotInteractableException, sel_ex.StaleElementReferenceException)

def scroll_to_end(wd):
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")

@retry(exceptions=KeyError, tries=6, delay=0.1, backoff=2, logger=retry_logger)
def get_thumbnails(wd, want_more_than=0):
    wd.execute_script("document.querySelector('{}').click();".format(css_load_more))
    thumbnails = wd.find_elements_by_css_selector(css_thumbnail)
    n_results = len(thumbnails)
    if n_results <= want_more_than:
        raise KeyError("no new thumbnails")
    return thumbnails

@retry(exceptions=KeyError, tries=6, delay=0.1, backoff=2, logger=retry_logger)
def get_image_src(wd):
    actual_images = wd.find_elements_by_css_selector(css_large)
    sources = []
    for img in actual_images:
        src = img.get_attribute("src")
        if src.startswith("http") and not src.startswith("https://encrypted-tbn0.gstatic.com/"):
            sources.append(src)
    if not len(sources):
        raise KeyError("no large image")
    return sources

@retry(exceptions=selenium_exceptions, tries=6, delay=0.1, backoff=2, logger=retry_logger)
def retry_click(el):
    el.click()

def get_images(wd, start=0, n=20, out=None):
    thumbnails = []
    count = len(thumbnails)
    while count < n:
        scroll_to_end(wd)
        try:
            thumbnails = get_thumbnails(wd, want_more_than=count)
        except KeyError as e:
            logger.warning("cannot load enough thumbnails")
            break
        count = len(thumbnails)
    sources = []
    for tn in thumbnails:
        try:
            retry_click(tn)
        except selenium_exceptions as e:
            logger.warning("main image click failed")
            continue
        sources1 = []
        try:
            sources1 = get_image_src(wd)
        except KeyError as e:
            pass
            # logger.warning("main image not found")
        if not sources1:
            tn_src = tn.get_attribute("src")
            if not tn_src.startswith("data"):
                logger.warning("no src found for main image, using thumbnail")          
                sources1 = [tn_src]
            else:
                logger.warning("no src found for main image, thumbnail is a data URL")
        for src in sources1:
            if not src in sources:
                sources.append(src)
                if out:
                    print(src, file=out)
                    out.flush()
        if len(sources) >= n:
            break
    return sources

def google_image_search(wd, query, safe="off", n=20, opts='', out=None):
    search_url_t = "https://www.google.com/search?safe={safe}&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img&tbs={opts}"
    search_url = search_url_t.format(q=urllib.parse.quote(query), opts=urllib.parse.quote(opts), safe=safe)
    wd.get(search_url)
    sources = get_images(wd, n=n, out=out)
    return sources

def main():
    parser = argparse.ArgumentParser(description='Fetch image URLs from Google Image Search.')
    parser.add_argument('--safe', type=str, default="off", help='safe search [off|active|images]')
    parser.add_argument('--opts', type=str, default="", help='search options, e.g. isz:lt,islt:svga,itp:photo,ic:color,ift:jpg')
    parser.add_argument('query', type=str, help='image search query')
    parser.add_argument('n', type=int, default=20, help='number of images (approx)')
    args = parser.parse_args()

    opts = Options()
    opts.add_argument("--headless")
    # opts.add_argument("--blink-settings=imagesEnabled=false")
    with webdriver.Chrome(options=opts) as wd:
        sources = google_image_search(wd, args.query, safe=args.safe, n=args.n, opts=args.opts, out=sys.stdout)

if __name__ == "__main__":
    main()
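The script leans on the `retry` decorator with `tries=6, delay=0.1, backoff=2` to wait for thumbnails and large images to appear. To illustrate what those parameters mean, here is a small stand-in written from scratch; it is not the code of the `retry` package. The name `retry_sketch` and the injectable `sleep` argument are hypothetical, added so the behaviour can be observed without real waiting.

```python
import functools
import time


def retry_sketch(exceptions, tries=6, delay=0.1, backoff=2, sleep=time.sleep):
    """Retry the wrapped call up to `tries` times, multiplying the wait by `backoff`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: let the last error propagate
                    sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator


# Demonstration: record the waits instead of sleeping
waits = []
calls = {"n": 0}


@retry_sketch(KeyError, tries=4, delay=0.1, backoff=2, sleep=waits.append)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise KeyError("not ready yet")
    return "ok"


result = flaky()
print(result, calls["n"], waits)
```

With the script's settings, the waits between the six attempts would be 0.1, 0.2, 0.4, 0.8 and 1.6 seconds, which is why raising `KeyError` in `get_thumbnails` and `get_image_src` acts as a short polling loop.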