javascript - How can I download images on a page using puppeteer?

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me), citing the original source: http://stackoverflow.com/questions/52542149/


How can I download images on a page using puppeteer?

javascript, web-scraping, puppeteer, google-chrome-headless

Asked by supermario

I'm new to web scraping and want to download all images on a webpage using puppeteer:


const puppeteer = require('puppeteer');

let scrape = async () => {
  // Actual Scraping goes Here...

  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://memeculture69.tumblr.com/');

  //   Right click and save images

};

scrape().then((value) => {
    console.log(value); // Success!
});

I have looked at the API docs but could not figure out how to achieve this, so I'd appreciate your help.

Answered by Braden Brown

Here is another example. It performs a generic Google search and downloads the Google image at the top left.


const puppeteer = require('puppeteer');
const fs = require('fs');

async function run() {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    await page.setViewport({ width: 1200, height: 1200 });
    await page.goto('https://www.google.com/search?q=.net+core&rlz=1C1GGRV_enUS785US785&oq=.net+core&aqs=chrome..69i57j69i60l3j69i65j69i60.999j0j7&sourceid=chrome&ie=UTF-8');

    const IMAGE_SELECTOR = '#tsf > div:nth-child(2) > div > div.logo > a > img';
    let imageHref = await page.evaluate((sel) => {
        return document.querySelector(sel).getAttribute('src').replace('/', '');
    }, IMAGE_SELECTOR);

    console.log("https://www.google.com/" + imageHref);
    // Navigate to the image URL so we can read the response body directly
    const viewSource = await page.goto("https://www.google.com/" + imageHref);
    // Write synchronously so the file is on disk before the browser closes
    fs.writeFileSync(".googles-20th-birthday-us-5142672481189888-s.png", await viewSource.buffer());
    console.log("The file was saved!");

    await browser.close();
}

run();

If you have a list of images you want to download, you could change the selector programmatically as needed and go down the list of images, downloading them one at a time.

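The "go down the list" loop can be sketched as a small sequential downloader. Everything here is illustrative: `downloadAll` and `downloadOne` are hypothetical names, and the actual download function (e.g. an `https.get`-based one) is injected rather than hard-coded:

```javascript
// Download a list of image URLs one at a time, in order.
// `downloadOne(url, filename)` is any function returning a Promise;
// it is injected so the loop stays independent of how files are fetched.
async function downloadAll(imageUrls, downloadOne) {
  const saved = [];
  for (let i = 0; i < imageUrls.length; i++) {
    // Derive a local filename from the loop index
    const filename = `image-${i}.png`;
    await downloadOne(imageUrls[i], filename);
    saved.push(filename);
  }
  return saved;
}
```

Downloading sequentially (rather than firing all requests at once) keeps memory use flat and is gentler on the target server.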

Answered by Grant Miller

You can use the following to scrape an array of all the src attributes of all images on the page:


const images = await page.evaluate(() => Array.from(document.images, e => e.src));

Then you can use the Node File System Module and the HTTP or HTTPS Module to download each image.


Complete Example:


'use strict';

const fs = require('fs');
const https = require('https');
const puppeteer = require('puppeteer');

/* ============================================================
  Promise-Based Download Function
============================================================ */

const download = (url, destination) => new Promise((resolve, reject) => {
  const file = fs.createWriteStream(destination);

  https.get(url, response => {
    response.pipe(file);

    file.on('finish', () => {
      file.close(resolve(true));
    });
  }).on('error', error => {
    fs.unlink(destination, () => {}); // remove the partial file; ignore unlink errors

    reject(error.message);
  });
});

/* ============================================================
  Download All Images
============================================================ */

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  let result;

  await page.goto('https://www.example.com/');

  const images = await page.evaluate(() => Array.from(document.images, e => e.src));

  for (let i = 0; i < images.length; i++) {
    result = await download(images[i], `image-${i}.png`);

    if (result === true) {
      console.log('Success:', images[i], 'has been downloaded successfully.');
    } else {
      console.log('Error:', images[i], 'was not downloaded.');
      console.error(result);
    }
  }

  await browser.close();
})();

Answered by Ben Adam

If you want to skip the manual DOM traversal, you can write the images to disk directly from the page response.


Example:


const puppeteer = require('puppeteer');
const fs = require('fs');
const path = require('path');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    page.on('response', async response => {
        const url = response.url();
        if (response.request().resourceType() === 'image') {
            response.buffer().then(file => {
                const fileName = url.split('/').pop();
                const filePath = path.resolve(__dirname, fileName);
                // end() writes the buffer and closes the stream in one call
                fs.createWriteStream(filePath).end(file);
            });
        }
    });
    await page.goto('https://memeculture69.tumblr.com/');
    await browser.close();
})();

Answered by Naimur Rahman

The logic is simple, I think. You just need to make a function that takes the URL of an image and saves it to your directory. Puppeteer just scrapes the image URL and passes it to the downloader function. Here is an example:


const puppeteer = require("puppeteer");
const fs = require("fs");
const request = require("request");

// This is the main download function which takes the url of your image
function download(uri, filename, callback) {
  request.head(uri, function (err, res, body) {
    request(uri)
      .pipe(fs.createWriteStream(filename))
      .on("close", callback);
  });
}

let scrape = async () => {
  // Actual scraping goes here...
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://memeculture69.tumblr.com/");
  await page.waitFor(1000);
  // Note the .src -- returning the element itself would not
  // serialize back from the page context
  const imageUrl = await page.evaluate(
    () => document.querySelector("img.image").src // image selector
  ); // here we got the image url.
  // Now simply pass the image url to the downloader function to
  // download the image.
  download(imageUrl, "image.png", function () {
    console.log("Image downloaded");
  });
};

scrape();

Answered by Lovesh Dongre

To download an image by its selector, I did the following:


  1. Obtained the uri for the image using a selector
  2. Passed the uri to the download function

    const puppeteer = require('puppeteer');
    const fs = require('fs');
    var request = require('request');
    
    //download function
    var download = function (uri, filename, callback) {
        request.head(uri, function (err, res, body) {
            console.log('content-type:', res.headers['content-type']);
            console.log('content-length:', res.headers['content-length']);
            request(uri).pipe(fs.createWriteStream(filename)).on('close', callback);
        });
    };
    
    (async () => {
         const browser = await puppeteer.launch({
          headless: true,
          args: ['--no-sandbox', '--disable-setuid-sandbox'], //for no sandbox
        });
        const page = await browser.newPage();
        await page.goto('http://example.com');// your url here
    
        let imageLink = await page.evaluate(() => {
            const image = document.querySelector('#imageId');
            return image.src;
        })
    
        await download(imageLink, 'myImage.png', function () {
            console.log('done');
        });
    
        ...
    })();
    
    

Resource: Downloading images with node.js


Answered by Gabriel Furstenheim

It is possible to get all the images without visiting each URL independently. You need to listen to all the responses from the server:


await page.setRequestInterception(true)
// page.on is a plain event-listener registration, not a Promise
page.on('request', request => {
   request.continue()
})
page.on('response', async response => {
   // Filter those responses that are interesting
   const data = await response.buffer()
   // data contains the img information
})

Answered by Sergey Gurin

This code saves all images found on the page into an images folder:


page.on('response', async (response) => {
  const url = response.url();
  if (/\.(jpg|png|svg|gif)$/.test(url)) {
    // Use the last URL segment as the filename; the full URL match
    // would contain slashes and is not a valid path
    const fileName = url.split('/').pop();
    const buffer = await response.buffer();
    fs.writeFileSync(`images/${fileName}`, buffer);
  }
});