Can I read PDF or Word Docs with Node.js?
Original question: http://stackoverflow.com/questions/9038231/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Shamoon
I can't find any packages to do this. I know PHP has a ton of libraries for PDFs (like http://www.fpdf.org/), but is there anything for Node?
Accepted answer by Tim
You can easily convert one into another, or use for example a .doc template to generate a .pdf file, but you will probably want to use an existing web service for this task.
This can be done using, for example, the Livedocx service.
To use this service from Node, see node-livedocx. (Disclaimer: I am the author of this Node module.)
Answer by James_1x0
Answer by timoxley
It looks like there are a few for PDF, but I didn't find any for Word.
CPU-bound processing like that isn't really Node's strong point anyway (i.e. you get no additional benefit from using Node to do it over any other language). A pragmatic approach would be to find a good tool and utilise it from Node.
I have heard good things around the office about docsplit http://documentcloud.github.com/docsplit/
While it's not Node, you could easily invoke it from Node with http://nodejs.org/docs/latest/api/all.html#child_process.exec
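As a rough sketch of that approach, the snippet below shells out to an external tool from Node. It uses `child_process.execFile`, a sibling of `exec` that skips the shell, so file names need no quoting. The `docsplit text <file> --output <dir>` command line is an assumption based on the docsplit documentation, and `buildDocsplitArgs`, `extractText`, and the file names are hypothetical, not part of any library.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: builds the argument list for `docsplit text`.
// The `--output` flag is assumed from the docsplit docs; check `docsplit --help`.
function buildDocsplitArgs(pdfPath, outputDir) {
  return ['text', pdfPath, '--output', outputDir];
}

// Hypothetical helper: runs docsplit and resolves once it exits successfully.
function extractText(pdfPath, outputDir) {
  return new Promise((resolve, reject) => {
    execFile('docsplit', buildDocsplitArgs(pdfPath, outputDir), (error, stdout, stderr) => {
      if (error) {
        // Covers both a missing binary and a non-zero exit code.
        reject(new Error(`docsplit failed: ${stderr || error.message}`));
        return;
      }
      resolve(outputDir);
    });
  });
}

// Usage (requires docsplit to be installed and on the PATH):
// extractText('report.pdf', 'out').then(dir => console.log('text written to', dir));
```

Because the heavy lifting happens in a separate process, the Node event loop stays free to serve other requests while the conversion runs.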
Answer by Tracker1
I would suggest looking into unoconv for your initial conversion. It uses LibreOffice or OpenOffice for the actual conversion, which adds some overhead.
I'd set up a few workers with all the necessary dependencies installed, and use a request/response queue to handle the conversions... (you may want to look into kue or zmq).
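To make the worker idea concrete, here is a minimal sketch of a queue that caps how many conversions run at once. It is a plain in-process stand-in, not kue or zmq, which would let the workers live in separate processes or machines. `createQueue` and `convertToPdf` are hypothetical names, and the `unoconv -f pdf <file>` invocation assumes unoconv is installed and on the PATH.

```javascript
const { execFile } = require('child_process');

// Minimal in-process job queue: runs at most `concurrency` jobs at once.
// A stand-in for a real queue such as kue or zmq, for illustration only.
function createQueue(worker, concurrency) {
  const pending = [];
  let active = 0;

  function next() {
    while (active < concurrency && pending.length > 0) {
      const { job, resolve, reject } = pending.shift();
      active += 1;
      Promise.resolve()
        .then(() => worker(job))
        .then(resolve, reject)
        .finally(() => {
          active -= 1;
          next(); // a slot is free, start the next waiting job
        });
    }
  }

  return {
    // Enqueue a job; the returned promise settles with the worker's result.
    push(job) {
      return new Promise((resolve, reject) => {
        pending.push({ job, resolve, reject });
        next();
      });
    },
  };
}

// Hypothetical worker: converts one document to PDF by shelling out to unoconv.
function convertToPdf(file) {
  return new Promise((resolve, reject) => {
    execFile('unoconv', ['-f', 'pdf', file], (err) => (err ? reject(err) : resolve(file)));
  });
}

// Usage: const queue = createQueue(convertToPdf, 2);
//        queue.push('letter.docx').then(f => console.log('converted', f));
```

Capping concurrency matters here because each conversion starts a heavyweight office process; an unbounded number of them can exhaust memory quickly.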
In general this is a CPU-bound, heavy task that should be offloaded... Pandoc and others specifically mention .docx, not .doc, so they may or may not be options as well.
Note: I know this question is old, just wanted to provide a current answer for others coming across this.
Answer by iwayankit
You can use pdf-text for PDF files. It extracts text from a PDF into an array of text 'chunks', which is useful for doing fuzzy parsing on structured PDF text.
var pdfText = require('pdf-text')
var fs = require('fs')
var pathToPdf = __dirname + "/info.pdf"

// Option 1: pass the path to the PDF
pdfText(pathToPdf, function(err, chunks) {
  if (err) return console.error(err)
  // chunks is an array of strings
  // loosely corresponding to text objects within the pdf
  // for a more concrete example, view the test file in this repo
})

// Option 2: pass a Buffer holding the PDF contents
var buffer = fs.readFileSync(pathToPdf)
pdfText(buffer, function(err, chunks) {
  if (err) return console.error(err)
  console.log(chunks)
})
For .docx files you can use mammoth, which extracts the text from them.
var mammoth = require("mammoth");
mammoth.extractRawText({path: "./doc.docx"})
  .then(function(result){
    var text = result.value; // The raw text
    console.log(text);
    var messages = result.messages; // Any warnings raised during extraction
  })
  .catch(function(error){
    console.error(error);
  });
I hope this will help.
Answer by Vlad Bezden
Answer by Philip Kirkbride
Another good option if you only need to convert from Word documents is Mammoth.js.
Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, and convert them to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document, and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.
There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.
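A minimal sketch of that conversion, assuming the `mammoth` package is installed, might look like the following. `convertToHtml` is Mammoth's own call; `docxToHtml`, the optional `converter` parameter, and the `./doc.docx` path are hypothetical and exist only for illustration.

```javascript
// Converts a .docx file to HTML with Mammoth.
// `converter` defaults to the real mammoth module; passing one in is only a
// convenience for trying the function out without the package installed.
function docxToHtml(docxPath, converter) {
  const mammoth = converter || require('mammoth');
  return mammoth.convertToHtml({ path: docxPath }).then((result) => {
    // result.messages holds warnings, e.g. about styles Mammoth could not map.
    result.messages.forEach((m) => console.warn(m.message));
    return result.value; // the generated HTML string
  });
}

// Usage: docxToHtml('./doc.docx').then((html) => console.log(html));
```

The warnings are worth logging: given the structural mismatch described above, they are often the first hint that a document uses styling Mammoth has no mapping for.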
Answer by sdgfsdh
Here is an example showing how to download and extract text from a PDF using PDF.js:
import _ from 'lodash';
import superagent from 'superagent';
import pdf from 'pdfjs-dist';

const url = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';

const main = async () => {
  // Download the PDF into memory
  const response = await superagent.get(url).buffer();
  const data = response.body;

  // getDocument returns a loading task; await its `promise` to get the document
  const doc = await pdf.getDocument({ data }).promise;

  // PDF.js numbers pages from 1, so shift the zero-based index
  for (const i of _.range(doc.numPages)) {
    const page = await doc.getPage(i + 1);
    const content = await page.getTextContent();
    for (const { str } of content.items) {
      console.log(str);
    }
  }
};

main().catch(error => console.error(error));

