Original source: http://stackoverflow.com/questions/18680261/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Extract images from PDF file with JavaScript
Asked by Mika H.
I want to write JavaScript code to extract all image files from a PDF file, perhaps getting them as JPG or some other image format. There is already some JavaScript code for reading a PDF file, for example in the PDF viewer pdf-js.
window.addEventListener('change', function webViewerChange(evt) {
  var files = evt.target.files;
  if (!files || files.length === 0)
    return;
  // Read the local file into a Uint8Array.
  var fileReader = new FileReader();
  fileReader.onload = function webViewerChangeFileReaderOnload(evt) {
    var buffer = evt.target.result;
    var uint8Array = new Uint8Array(buffer);
    PDFView.open(uint8Array, 0);
  };
  var file = files[0];
  fileReader.readAsArrayBuffer(file);
  PDFView.setTitleUsingUrl(file.name);
  ........
Can I use this code to help read and extract the image files?
Answered by Jason Siefken
If you open a page with pdf.js, for example
PDFJS.getDocument({url: <pdf file>}).then(function (doc) {
  doc.getPage(1).then(function (page) {
    window.page = page;
  })
})
you can then use getOperatorList to search for paintJpegXObject objects and grab the resources.
window.objs = []
page.getOperatorList().then(function (ops) {
  for (var i=0; i < ops.fnArray.length; i++) {
    if (ops.fnArray[i] == PDFJS.OPS.paintJpegXObject) {
      window.objs.push(ops.argsArray[i][0])
    }
  }
})
Now window.objs will have a list of the resources from that page that you need to fetch.
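The loop above can be factored into a small helper that is easier to test away from a real PDF. This is only a sketch: collectImageNames, FAKE_PAINT_JPEG, and the mock operator list are hypothetical names and values invented for illustration, and the only thing assumed about the operator list is the { fnArray, argsArray } shape used in the snippet above.

```javascript
// Collect the first argument (the resource name) of every operator whose
// code appears in `wantedOps`, skipping names that were already seen.
// `ops` is assumed to have the { fnArray, argsArray } shape shown above.
function collectImageNames(ops, wantedOps) {
  var names = [];
  for (var i = 0; i < ops.fnArray.length; i++) {
    if (wantedOps.indexOf(ops.fnArray[i]) !== -1) {
      var name = ops.argsArray[i][0];
      if (names.indexOf(name) === -1) {
        names.push(name);
      }
    }
  }
  return names;
}

// Mock operator list. The numeric codes and resource names are made up;
// a real list would come from page.getOperatorList().
var FAKE_PAINT_JPEG = 82;
var mockOps = {
  fnArray: [10, 82, 11, 82, 82],
  argsArray: [[], ['img_p0_1', 100, 50], [], ['img_p0_2', 20, 20], ['img_p0_1', 100, 50]]
};

// Logs the two distinct names, img_p0_1 and img_p0_2.
console.log(collectImageNames(mockOps, [FAKE_PAINT_JPEG]));
```

Against the API used in this answer, the call would presumably be collectImageNames(ops, [PDFJS.OPS.paintJpegXObject]). Taking an array of operator codes is a deliberate choice: operator names may differ between pdf.js versions, so check which image-painting operators your build exposes before relying on a single one.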
console.log(window.objs.map(function (a) { return page.objs.get(a); }))
should print to the console a bunch of <img /> objects with data-URI src= attributes. These can be directly inserted into the page, or you can do more scripting to get at the raw data.
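As one way of "getting at the raw data", the src of such an <img /> can be decoded back into bytes. The following is a minimal sketch, assuming the src is a base64 data URI; dataUriToBytes is a hypothetical helper name, and the sample URI holds only the first four bytes of a JPEG header rather than a real image.

```javascript
// Decode a base64 data URI (for example the src of an <img>) into raw bytes.
// Only base64 payloads are handled; anything else throws.
function dataUriToBytes(dataUri) {
  var comma = dataUri.indexOf(',');
  if (dataUri.slice(0, 5) !== 'data:' || comma === -1) {
    throw new Error('not a data URI');
  }
  var header = dataUri.slice(5, comma); // e.g. "image/jpeg;base64"
  if (!/;base64$/i.test(header)) {
    throw new Error('only base64 data URIs are supported in this sketch');
  }
  var payload = dataUri.slice(comma + 1);
  // atob exists in browsers and recent Node.js; fall back to Buffer otherwise.
  var binary = typeof atob === 'function'
    ? atob(payload)
    : Buffer.from(payload, 'base64').toString('binary');
  var bytes = new Uint8Array(binary.length);
  for (var i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return { mimeType: header.split(';')[0], bytes: bytes };
}

// "/9j/4A==" is base64 for the bytes FF D8 FF E0 (start of a JPEG file).
var decoded = dataUriToBytes('data:image/jpeg;base64,/9j/4A==');
console.log(decoded.mimeType, Array.prototype.slice.call(decoded.bytes));
```

In a browser, the resulting Uint8Array could then be wrapped in a Blob for download or upload; with the objects from the answer, the input would be the element's src attribute.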
It only works for embedded JPEG objects, but it's a start!