javascript - How can I download images on a page using puppeteer?

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me), citing the original source: http://stackoverflow.com/questions/52542149/


How can I download images on a page using puppeteer?

javascript, web-scraping, puppeteer, google-chrome-headless

Asked by supermario

I'm new to web scraping and want to download all images on a webpage using puppeteer:


const puppeteer = require('puppeteer');

let scrape = async () => {
  // Actual Scraping goes Here...

  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://memeculture69.tumblr.com/');

  //   Right click and save images

};

scrape().then((value) => {
    console.log(value); // Success!
});

I have looked at the API docs but could not figure out how to achieve this, so I'd appreciate your help.

Answered by Braden Brown

Here is another example. It performs a generic Google search and downloads the Google image at the top left.


const puppeteer = require('puppeteer');
const fs = require('fs');

async function run() {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    await page.setViewport({ width: 1200, height: 1200 });
    await page.goto('https://www.google.com/search?q=.net+core&rlz=1C1GGRV_enUS785US785&oq=.net+core&aqs=chrome..69i57j69i60l3j69i65j69i60.999j0j7&sourceid=chrome&ie=UTF-8');

    const IMAGE_SELECTOR = '#tsf > div:nth-child(2) > div > div.logo > a > img';
    let imageHref = await page.evaluate((sel) => {
        return document.querySelector(sel).getAttribute('src').replace('/', '');
    }, IMAGE_SELECTOR);

    console.log("https://www.google.com/" + imageHref);
    // Navigate to the image URL so we can read the response body directly
    const viewSource = await page.goto("https://www.google.com/" + imageHref);
    // Write synchronously so the file is on disk before the browser closes
    fs.writeFileSync(".googles-20th-birthday-us-5142672481189888-s.png", await viewSource.buffer());
    console.log("The file was saved!");

    await browser.close();
}

run();

If you have a list of images you want to download, you could change the selector programmatically as needed and go down the list of images, downloading them one at a time.

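The "go down the list" loop can be sketched as a small sequential downloader. Everything here is illustrative: `downloadAll` and `downloadOne` are hypothetical names, and the actual download function (e.g. an `https.get`-based one) is injected rather than hard-coded:

```javascript
// Download a list of image URLs one at a time, in order.
// `downloadOne(url, filename)` is any function returning a Promise;
// it is injected so the loop stays independent of how files are fetched.
async function downloadAll(imageUrls, downloadOne) {
  const saved = [];
  for (let i = 0; i < imageUrls.length; i++) {
    // Derive a local filename from the loop index
    const filename = `image-${i}.png`;
    await downloadOne(imageUrls[i], filename);
    saved.push(filename);
  }
  return saved;
}
```

Downloading sequentially (rather than firing all requests at once) keeps memory use flat and is gentler on the target server.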

Answered by Grant Miller

You can use the following to scrape an array of all the src attributes of all images on the page:


const images = await page.evaluate(() => Array.from(document.images, e => e.src));

Then you can use the Node File System Module and the HTTP or HTTPS Module to download each image.


Complete Example:


'use strict';

const fs = require('fs');
const https = require('https');
const puppeteer = require('puppeteer');

/* ============================================================
  Promise-Based Download Function
============================================================ */

const download = (url, destination) => new Promise((resolve, reject) => {
  const file = fs.createWriteStream(destination);

  https.get(url, response => {
    response.pipe(file);

    file.on('finish', () => {
      file.close(resolve(true));
    });
  }).on('error', error => {
    fs.unlink(destination, () => {}); // remove the partial file; ignore unlink errors

    reject(error.message);
  });
});

/* ============================================================
  Download All Images
============================================================ */

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  let result;

  await page.goto('https://www.example.com/');

  const images = await page.evaluate(() => Array.from(document.images, e => e.src));

  for (let i = 0; i < images.length; i++) {
    result = await download(images[i], `image-${i}.png`);

    if (result === true) {
      console.log('Success:', images[i], 'has been downloaded successfully.');
    } else {
      console.log('Error:', images[i], 'was not downloaded.');
      console.error(result);
    }
  }

  await browser.close();
})();

Answered by Ben Adam

If you want to skip the manual DOM traversal, you can write the images to disk directly from the page response.


Example:


const puppeteer = require('puppeteer');
const fs = require('fs');
const path = require('path');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    page.on('response', async response => {
        const url = response.url();
        if (response.request().resourceType() === 'image') {
            response.buffer().then(file => {
                const fileName = url.split('/').pop();
                const filePath = path.resolve(__dirname, fileName);
                // end() writes the buffer and closes the stream in one call
                fs.createWriteStream(filePath).end(file);
            });
        }
    });
    await page.goto('https://memeculture69.tumblr.com/');
    await browser.close();
})();

Answered by Naimur Rahman

The logic is simple, I think. You just need to make a function that takes the URL of an image and saves it to your directory. Puppeteer just scrapes the image URL and passes it to the downloader function. Here is an example:


const puppeteer = require("puppeteer");
const fs = require("fs");
const request = require("request");

// This is the main download function which takes the url of your image
function download(uri, filename, callback) {
  request.head(uri, function (err, res, body) {
    request(uri)
      .pipe(fs.createWriteStream(filename))
      .on("close", callback);
  });
}

let scrape = async () => {
  // Actual scraping goes here...
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://memeculture69.tumblr.com/");
  await page.waitFor(1000);
  // Note the .src -- returning the element itself would not
  // serialize back from the page context
  const imageUrl = await page.evaluate(
    () => document.querySelector("img.image").src // image selector
  ); // here we got the image url.
  // Now simply pass the image url to the downloader function to
  // download the image.
  download(imageUrl, "image.png", function () {
    console.log("Image downloaded");
  });
};

scrape();

Answered by Lovesh Dongre

To download an image by its selector, I did the following:


  1. Obtained the uri for the image using a selector
  2. Passed the uri to the download function

    const puppeteer = require('puppeteer');
    const fs = require('fs');
    var request = require('request');
    
    //download function
    var download = function (uri, filename, callback) {
        request.head(uri, function (err, res, body) {
            console.log('content-type:', res.headers['content-type']);
            console.log('content-length:', res.headers['content-length']);
            request(uri).pipe(fs.createWriteStream(filename)).on('close', callback);
        });
    };
    
    (async () => {
         const browser = await puppeteer.launch({
          headless: true,
          args: ['--no-sandbox', '--disable-setuid-sandbox'], //for no sandbox
        });
        const page = await browser.newPage();
        await page.goto('http://example.com');// your url here
    
        let imageLink = await page.evaluate(() => {
            const image = document.querySelector('#imageId');
            return image.src;
        })
    
        await download(imageLink, 'myImage.png', function () {
            console.log('done');
        });
    
        ...
    })();
    
    

Resource: Downloading images with node.js


Answered by Gabriel Furstenheim

It is possible to get all the images without visiting each URL independently. You need to listen to all the responses from the server:


await page.setRequestInterception(true)
// page.on is a plain event-listener registration, not a Promise
page.on('request', request => {
   request.continue()
})
page.on('response', async response => {
   // Filter those responses that are interesting
   const data = await response.buffer()
   // data contains the img information
})

Answered by Sergey Gurin

This code saves all images found on the page into an images folder:


page.on('response', async (response) => {
  const url = response.url();
  if (/\.(jpg|png|svg|gif)$/.test(url)) {
    // Use the last URL segment as the filename; the full URL match
    // would contain slashes and is not a valid path
    const fileName = url.split('/').pop();
    const buffer = await response.buffer();
    fs.writeFileSync(`images/${fileName}`, buffer);
  }
});