javascript: Extract text from a PDF file using JavaScript

Question

Asked by Coccinelle

I want to extract text from a PDF file using only JavaScript on the client side, without using a server. I've already found some JavaScript code at the following link: extract text from pdf in Javascript

and then in

http://hublog.hubmed.org/archives/001948.html

and in:

https://github.com/hubgit/hubgit.github.com/tree/master/2011/11/pdftotext

1) Which of the files from the links above are actually needed for the extraction? 2) I don't know exactly how to adapt this code for use in an application rather than on the web.

Any answer is welcome. Thank you.

Answer 1

Accepted answer by Allanon

Here is a nice example of how to use pdf.js to extract the text: http://git.macropus.org/2011/11/pdftotext/example/

Of course, you will have to remove a lot of code for your purpose, but it should do the job.

Answer 2

Answered by Carlos Delgado

I've put together a simpler approach that uses the same library, pdf.js (latest version), and doesn't need to post messages between iframes.

The following example extracts the text from only the first page of the PDF:

/**
 * Retrieves the text of a specific page within a PDF document obtained through pdf.js
 *
 * @param {Integer} pageNum Specifies the number of the page
 * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
 **/
function getPageText(pageNum, PDFDocumentInstance) {
    // Return a Promise that resolves once the text of the page has been retrieved.
    // Returning the chain directly also passes any loading error on to the caller.
    return PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
        // The main trick to obtain the text of the PDF page: use the getTextContent method
        return pdfPage.getTextContent();
    }).then(function (textContent) {
        var textItems = textContent.items;
        var finalString = "";

        // Concatenate the string of each item to the final string
        for (var i = 0; i < textItems.length; i++) {
            var item = textItems[i];

            finalString += item.str + " ";
        }

        // Resolve with the text retrieved from the page
        return finalString;
    });
}

/**
 * Extract the text from the PDF
 */

var PDF_URL = '/path/to/example.pdf';
PDFJS.getDocument(PDF_URL).then(function (PDFDocumentInstance) {

    // Total number of pages (not used below, since only one page is read)
    var totalPages = PDFDocumentInstance.pdfInfo.numPages;
    var pageNumber = 1;

    // Extract the text
    getPageText(pageNumber, PDFDocumentInstance).then(function (textPage) {
        // Show the text of the page in the console
        console.log(textPage);
    });

}, function (reason) {
    // PDF loading error
    console.error(reason);
});
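
To read the whole document rather than a single page, the same calls can be repeated for every page number. The following is only a sketch: the helper names `joinTextItems` and `getAllPagesText` are made up for illustration, and it assumes the document object exposes a `numPages` property alongside `getPage`. It also joins the items with single spaces instead of appending a trailing space after each one.

```javascript
// Hypothetical helper: join the `str` field of each text item with spaces.
function joinTextItems(textItems) {
    return textItems.map(function (item) {
        return item.str;
    }).join(" ");
}

// Hypothetical helper: resolves with an array holding the text of every page.
// Assumption: `pdfDocument` exposes `numPages` and `getPage(n)`.
function getAllPagesText(pdfDocument) {
    var pagePromises = [];

    // Page numbers passed to getPage start at 1, not 0
    for (var pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
        pagePromises.push(
            pdfDocument.getPage(pageNum).then(function (pdfPage) {
                return pdfPage.getTextContent();
            }).then(function (textContent) {
                return joinTextItems(textContent.items);
            })
        );
    }

    // Promise.all keeps the results in the same order as the pages
    return Promise.all(pagePromises);
}
```

Inside the `getDocument` callback this could be called as `getAllPagesText(PDFDocumentInstance).then(function (pages) { console.log(pages.join("\n")); });`. All pages are requested at once, which is simple but may be heavy for very large documents.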

Read the article about this solution here. As @xarxziux mentioned, the library has changed since the first solution was posted, so that solution should no longer work with the latest version of pdf.js. The code above should work in most cases.
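
Because the library keeps changing, it may help to isolate the version-dependent parts. The sketch below is not from the article. It assumes that newer pdf.js builds expose the library as `pdfjsLib` and return a loading task whose `promise` property resolves to the document, while older builds expose `PDFJS` and return an object that can be used as a promise directly. The helper names `openPdf` and `getPageCount` are hypothetical.

```javascript
// Hypothetical helper: open a document with either generation of the API.
// `pdfLib` is the library object (assumed: `pdfjsLib` in newer builds,
// `PDFJS` in older ones); `source` is whatever getDocument accepts.
function openPdf(pdfLib, source) {
    var loadingTask = pdfLib.getDocument(source);

    // Newer builds: the loading task carries the promise.
    // Older builds: the returned object is itself thenable, so wrap it.
    return loadingTask.promise ? loadingTask.promise : Promise.resolve(loadingTask);
}

// Hypothetical helper: read the page count under either naming.
function getPageCount(pdfDocument) {
    if (typeof pdfDocument.numPages === "number") {
        return pdfDocument.numPages;
    }
    return pdfDocument.pdfInfo.numPages;
}
```

A call site could then pick whichever global exists, for example `var lib = typeof pdfjsLib !== "undefined" ? pdfjsLib : PDFJS;`, and use `openPdf(lib, PDF_URL).then(...)` in place of calling `getDocument` directly.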
