How can I download images on a page using puppeteer?

Note: This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me) on Stack Overflow.

Original question: http://stackoverflow.com/questions/52542149/
Asked by supermario
I'm new to web scraping and want to download all images on a webpage using puppeteer:
const puppeteer = require('puppeteer');

let scrape = async () => {
  // Actual Scraping goes Here...
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://memeculture69.tumblr.com/');
  // Right click and save images
};

scrape().then((value) => {
  console.log(value); // Success!
});
I have looked at the API docs but could not figure out how to achieve this. I'd appreciate your help.
Answer by Braden Brown
Here is another example. It goes to a generic Google search and downloads the Google image at the top left.
const puppeteer = require('puppeteer');
const fs = require('fs');

async function run() {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1200, height: 1200 });
  await page.goto('https://www.google.com/search?q=.net+core&rlz=1C1GGRV_enUS785US785&oq=.net+core&aqs=chrome..69i57j69i60l3j69i65j69i60.999j0j7&sourceid=chrome&ie=UTF-8');

  const IMAGE_SELECTOR = '#tsf > div:nth-child(2) > div > div.logo > a > img';
  let imageHref = await page.evaluate((sel) => {
    return document.querySelector(sel).getAttribute('src').replace('/', '');
  }, IMAGE_SELECTOR);

  console.log("https://www.google.com/" + imageHref);

  var viewSource = await page.goto("https://www.google.com/" + imageHref);
  fs.writeFile(".googles-20th-birthday-us-5142672481189888-s.png", await viewSource.buffer(), function (err) {
    if (err) {
      return console.log(err);
    }
    console.log("The file was saved!");
  });

  browser.close();
}

run();
If you have a list of images you want to download, you could change the selector programmatically as needed and go down the list of images, downloading them one at a time.
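As a minimal sketch of that idea (the selector list, page URL, and output filenames below are hypothetical placeholders, not part of the original answer):

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.example.com/'); // placeholder url

  // Hypothetical list of selectors, one per image to save
  const selectors = ['#image1', '#image2', '#image3'];

  // Resolve every selector to its image url inside the page first
  const urls = await page.evaluate(
    (sels) => sels.map((sel) => document.querySelector(sel).src),
    selectors
  );

  // Then navigate to each url and write the response body to disk
  for (let i = 0; i < urls.length; i++) {
    const response = await page.goto(urls[i]);
    fs.writeFileSync(`image-${i}.png`, await response.buffer());
  }

  await browser.close();
})();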
Answer by Grant Miller
You can use the following to scrape an array of the src attributes of all the images on the page:
const images = await page.evaluate(() => Array.from(document.images, e => e.src));
Then you can use the Node.js File System module and the HTTP or HTTPS module to download each image.
Complete Example:
'use strict';

const fs = require('fs');
const https = require('https');
const puppeteer = require('puppeteer');

/* ============================================================
   Promise-Based Download Function
   ============================================================ */

const download = (url, destination) => new Promise((resolve, reject) => {
  const file = fs.createWriteStream(destination);

  https.get(url, response => {
    response.pipe(file);

    file.on('finish', () => {
      file.close(resolve(true));
    });
  }).on('error', error => {
    fs.unlink(destination, () => {}); // remove the partial file; a callback is required in modern Node.js
    reject(error.message);
  });
});

/* ============================================================
   Download All Images
   ============================================================ */

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  let result;

  await page.goto('https://www.example.com/');

  const images = await page.evaluate(() => Array.from(document.images, e => e.src));

  for (let i = 0; i < images.length; i++) {
    result = await download(images[i], `image-${i}.png`);
    if (result === true) {
      console.log('Success:', images[i], 'has been downloaded successfully.');
    } else {
      console.log('Error:', images[i], 'was not downloaded.');
      console.error(result);
    }
  }

  await browser.close();
})();
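One caveat about the helper above: it uses the https module, so it only handles https:// URLs. A small sketch (my assumption, not part of the original answer) of picking the client module from the URL's protocol:

const http = require('http');
const https = require('https');

// Choose the client module based on the url's protocol,
// so both http:// and https:// images can be fetched
const clientFor = (url) => (new URL(url).protocol === 'http:' ? http : https);

// Inside the download helper, replace https.get(url, ...) with:
// clientFor(url).get(url, response => { ... });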
Answer by Ben Adam
If you want to skip the manual DOM traversal, you can write the images to disk directly from the page responses.
Example:
const puppeteer = require('puppeteer');
const fs = require('fs');
const path = require('path');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  page.on('response', async response => {
    const url = response.url();
    if (response.request().resourceType() === 'image') {
      response.buffer().then(file => {
        const fileName = url.split('/').pop();
        const filePath = path.resolve(__dirname, fileName);
        const writeStream = fs.createWriteStream(filePath);
        writeStream.write(file);
      });
    }
  });

  await page.goto('https://memeculture69.tumblr.com/');

  await browser.close();
})();
Answer by Naimur Rahman
The logic is simple, I think. You just need to make a function that takes the URL of an image and saves it to your directory. Puppeteer only scrapes the image URL and passes it to the downloader function. Here is an example:
const puppeteer = require("puppeteer");
const fs = require("fs");
const request = require("request");

// This is the main download function, which takes the url of your image
function download(uri, filename, callback) {
  request.head(uri, function(err, res, body) {
    request(uri)
      .pipe(fs.createWriteStream(filename))
      .on("close", callback);
  });
}

let scrape = async () => {
  // Actual scraping goes here...
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://memeculture69.tumblr.com/");
  await page.waitFor(1000);
  const imageUrl = await page.evaluate(
    () => document.querySelector("img.image").src // image selector; here we get the image url
  );
  // Now simply pass the image url to the downloader function to download the image.
  download(imageUrl, "image.png", function() {
    console.log("Image downloaded");
  });
};

scrape();
Answer by Lovesh Dongre
To download an image by its selector, I did the following:

- Obtained the uri for the image using its selector
- Passed the uri to the download function

const puppeteer = require('puppeteer');
const fs = require('fs');
var request = require('request');

// download function
var download = function (uri, filename, callback) {
  request.head(uri, function (err, res, body) {
    console.log('content-type:', res.headers['content-type']);
    console.log('content-length:', res.headers['content-length']);
    request(uri).pipe(fs.createWriteStream(filename)).on('close', callback);
  });
};

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox'], // for no sandbox
  });
  const page = await browser.newPage();
  await page.goto('http://example.com'); // your url here
  let imageLink = await page.evaluate(() => {
    const image = document.querySelector('#imageId');
    return image.src;
  });
  await download(imageLink, 'myImage.png', function () {
    console.log('done');
  });
  // ...
})();
Resource: Downloading images with node.js
Answer by Gabriel Furstenheim
It is possible to get all the images without visiting each URL independently. You need to listen to all the requests to the server:
await page.setRequestInterception(true);

page.on('request', function (request) {
  request.continue();
});

page.on('response', async function (response) {
  // Filter those responses that are interesting
  const data = await response.buffer();
  // data contains the image information
});
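To actually persist those intercepted images, one possible extension of the handler (a sketch; the filename logic is my assumption, not part of the original answer):

const fs = require('fs');
const path = require('path');

page.on('response', async function (response) {
  if (response.request().resourceType() !== 'image') return; // only keep images
  try {
    const data = await response.buffer();
    // Derive a file name from the last path segment of the url
    const fileName = path.basename(new URL(response.url()).pathname) || 'image';
    fs.writeFileSync(path.join(__dirname, fileName), data);
  } catch (err) {
    // response.buffer() is unavailable for redirect responses, so skip those
  }
});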
Answer by Sergey Gurin
This code saves all images found on the page into the images folder:
page.on('response', async (response) => {
  const matches = /.*\/(.*)\.(jpg|png|svg|gif)$/.exec(response.url());
  if (matches && (matches.length === 3)) {
    // Derive the file name from the last path segment of the url
    const fileName = matches[1];
    const extension = matches[2];
    const buffer = await response.buffer();
    fs.writeFileSync(`images/${fileName}.${extension}`, buffer);
  }
});
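Note that fs.writeFileSync throws if the images directory does not exist, so you would want to create it up front, for example:

const fs = require('fs');

// Create the output folder once before navigating (no-op if it already exists)
fs.mkdirSync('images', { recursive: true });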

