Original source: http://stackoverflow.com/questions/17424639/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Extract text from pdf file using javascript
Asked by Coccinelle
I want to extract text from a PDF file using only JavaScript on the client side, without involving a server. I have already found some JavaScript code at the following link: extract text from pdf in Javascript
and then at
http://hublog.hubmed.org/archives/001948.html
and at:
https://github.com/hubgit/hubgit.github.com/tree/master/2011/11/pdftotext
1) Which of the files from the links above are actually necessary for the extraction? 2) I don't know exactly how to adapt this code for use in an application rather than on the web.
Any answer is welcome. Thank you.
Accepted answer by Allanon
Here is a nice example of how to use pdf.js to extract the text: http://git.macropus.org/2011/11/pdftotext/example/
Of course, you will have to remove a lot of the code for your purpose, but it should do the job.
Answer by Carlos Delgado
I've put together a simpler approach that uses the same library, pdf.js (latest version), and doesn't need to post messages between iframes.
The following example extracts all the text from only the first page of the PDF:
/**
 * Retrieves the text of a specific page within a PDF document obtained through pdf.js
 *
 * @param {number} pageNum Specifies the number of the page (the first page is 1)
 * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
 * @returns {Promise} Resolved with the text of the page as a string
 **/
function getPageText(pageNum, PDFDocumentInstance) {
    // Return a Promise that is resolved once the text of the page is retrieved
    return new Promise(function (resolve, reject) {
        PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
            // The main trick to obtain the text of the PDF page: use the getTextContent method
            pdfPage.getTextContent().then(function (textContent) {
                var textItems = textContent.items;
                var finalString = "";
                // Concatenate the string of each item to the final string
                for (var i = 0; i < textItems.length; i++) {
                    var item = textItems[i];
                    finalString += item.str + " ";
                }
                // Resolve the promise with the text retrieved from the page
                resolve(finalString);
            }, reject); // Forward text extraction errors instead of leaving the promise pending
        }, reject); // Forward page loading errors to the caller
    });
}

/**
 * Extract the text from the PDF
 */
var PDF_URL = '/path/to/example.pdf';

PDFJS.getDocument(PDF_URL).then(function (PDFDocumentInstance) {
    // Total number of pages (not used below, since only the first page is extracted)
    var totalPages = PDFDocumentInstance.pdfInfo.numPages;
    var pageNumber = 1;

    // Extract the text
    getPageText(pageNumber, PDFDocumentInstance).then(function (textPage) {
        // Show the text of the page in the console
        console.log(textPage);
    }, function (reason) {
        // Page loading or text extraction error
        console.error(reason);
    });
}, function (reason) {
    // PDF loading error
    console.error(reason);
});
You can read the article about this solution here. As @xarxziux mentioned, the library has changed since the first solution was posted, so that solution should no longer work with the latest version of pdf.js. The approach above should work in most cases.
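The example above reads the total page count but only extracts page 1. As a minimal sketch of how the same idea might extend to every page: getAllPagesText below is a hypothetical helper (it is not part of pdf.js), and it only assumes that the document object exposes numPages and getPage(n), with each page offering getTextContent().

```javascript
// Hypothetical helper (not a pdf.js function): collects the text of every page.
// Assumes pdfDocument exposes numPages and getPage(n), as pdf.js documents do.
function getAllPagesText(pdfDocument) {
    var pagePromises = [];
    // pdf.js page numbers start at 1, not 0
    for (var pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
        pagePromises.push(
            pdfDocument.getPage(pageNum).then(function (pdfPage) {
                return pdfPage.getTextContent();
            }).then(function (textContent) {
                // Join the string of every text item with a single space
                return textContent.items.map(function (item) {
                    return item.str;
                }).join(" ");
            })
        );
    }
    // Promise.all keeps the results in page order and rejects on the first failure
    return Promise.all(pagePromises);
}
```

Inside the getDocument callback, this could be called as getAllPagesText(PDFDocumentInstance).then(function (pages) { console.log(pages.join("\n\n")); }). All pages are requested at once here; for very large documents, processing them one after another may be gentler on memory.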
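On the point that the library keeps changing: in more recent pdf.js releases the global is usually named pdfjsLib rather than PDFJS, and getDocument() returns a loading task whose promise property resolves to the document. The sketch below shows how the first-page extraction might look under those assumptions. extractFirstPageText is a hypothetical name, and the library object is passed in as a parameter so that the snippet does not assume how pdf.js was loaded. Check the documentation of the version you actually use before relying on it.

```javascript
// Sketch for recent pdf.js releases; extractFirstPageText is a hypothetical
// helper name, not a pdf.js function. pdfjsLib is passed in explicitly.
function extractFirstPageText(pdfjsLib, pdfUrl) {
    // getDocument() returns a loading task; its `promise` resolves to the document
    return pdfjsLib.getDocument(pdfUrl).promise.then(function (pdfDocument) {
        // Page numbers start at 1
        return pdfDocument.getPage(1);
    }).then(function (pdfPage) {
        return pdfPage.getTextContent();
    }).then(function (textContent) {
        // Each text item carries its string in the `str` property
        return textContent.items.map(function (item) {
            return item.str;
        }).join(" ");
    });
}
```

In a browser you would typically also need to point pdfjsLib.GlobalWorkerOptions.workerSrc at the worker script before calling this, and then use it as extractFirstPageText(pdfjsLib, '/path/to/example.pdf').then(console.log, console.error).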