Note: this page is based on a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/215620/

Date: 2020-08-03 18:27:58. Source: igfitidea

How to load text of MS Word document in C# (.NET)?

Tags: c#, .net, ms-word, docx, doc

Asked by Skuta

How do I load an MS Word document (.doc or .docx) into memory (a variable) without doing this?

wordApp.Documents.Open

I don't want to open MS Word; I just want the text inside.

You gave me an answer for DOCX, but what about DOC? I want a free, high-performance solution, not one that opens 12,000 instances of Word to process all of them. :( Aspose is a commercial product, and $900 is way too much for what I do.

Accepted answer by Cihan Ucar

You can use wordconv.exe, which is part of the Office Compatibility Pack, to convert from doc to docx.

http://www.microsoft.com/downloads/details.aspx?familyid=941b3470-3ae9-4aee-8f43-c6bb74cd1466&displaylang=en

Just call the command like so: "C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme InputFile OutputFile

I'm not sure whether you need Word installed for it to run, but it does work. I use it locally as a Windows shell command to convert old Office files to the 2007 format whenever I want.
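
The command above could be driven from C# roughly as follows. This is only a sketch: the class and method names are made up for illustration, the converter path is simply the one quoted in this answer, and treating "the output file exists" as success is an assumption, since the tool's exit codes are not described here.

```csharp
using System.Diagnostics;
using System.IO;

static class DocToDocx
{
    // Path quoted in the answer above; it may differ on your machine.
    public const string DefaultConverterPath =
        @"C:\Program Files\Microsoft Office\Office12\wordconv.exe";

    // Builds the argument string: -oice -nme "input" "output".
    // Paths are quoted so that file names containing spaces survive.
    public static string BuildArguments(string inputPath, string outputPath)
    {
        return $"-oice -nme \"{inputPath}\" \"{outputPath}\"";
    }

    // Runs the converter without a console window and waits for it to exit.
    // Assumption: success is judged by the output file existing afterwards.
    public static bool Convert(string inputPath, string outputPath,
                               string converterPath = DefaultConverterPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = converterPath,
            Arguments = BuildArguments(inputPath, outputPath),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
        }
        return File.Exists(outputPath);
    }
}
```

For a large batch, `DocToDocx.Convert` could be called once per .doc file, with the resulting .docx files then read by whatever DOCX approach you choose.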

Answer by Jobi Joy

If you are dealing with docx, you can do this without any interop with Word. A .docx file is actually a ZIP archive that contains XML files, so you can read the XML directly. Please refer to the links below:

http://conceptdev.blogspot.com/2007/03/open-docx-using-c-to-extract-text-for.html

Office (2007) Open XML File Formats
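
To make the ZIP-plus-XML idea concrete, here is a minimal sketch using only the .NET base class library (System.IO.Compression, available from .NET 4.5). It is not the code from the linked articles. It assumes the main document part lives at the conventional location `word/document.xml`, and it only collects `w:t` text runs, one line per `w:p` paragraph; tabs, line breaks, headers, footers and footnotes are ignored, and text inside text boxes may be repeated. The class name `DocxText` is made up for illustration.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

static class DocxText
{
    // WordprocessingML main namespace used by elements in word/document.xml.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string Extract(string docxPath)
    {
        using (FileStream stream = File.OpenRead(docxPath))
        {
            return Extract(stream);
        }
    }

    public static string Extract(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, true))
        {
            // Assumption: the main part is at the conventional path.
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (Stream xml = entry.Open())
            {
                XDocument document = XDocument.Load(xml, LoadOptions.PreserveWhitespace);
                var text = new StringBuilder();
                foreach (XElement paragraph in document.Descendants(W + "p"))
                {
                    // Concatenate every text run in the paragraph.
                    foreach (XElement run in paragraph.Descendants(W + "t"))
                        text.Append(run.Value);
                    text.AppendLine();
                }
                return text.ToString();
            }
        }
    }
}
```

Calling `DocxText.Extract("input.docx")` returns the body text as a single string without starting Word, which is the property the question asks for.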

Answer by Jason Whitehorn

For docx-formatted Word documents, I found this interesting article on The Code Project:

Using DocxToText to Extract Text from DOCX Files

In the article the author discusses stripping out just the words themselves.

For your doc (non-docx) Word documents, instead of using the Office APIs and (in the background) spawning an instance of Word, you could try shelling out to one of the many Doc2Docx converters on the market and then applying the above process to both kinds of file.

Answer by bill_the_loser

I don't mean to be an antagonist, but why?

I've extracted data from Word documents on Linux servers using Word2X or AbiWord, and depending on the number and variety of documents, there will always be errors in the extraction. The more bullets, page breaks, document sections and other "special" features there are, the worse it gets.

I understand there are options now to automate OpenOffice to process documents, but my advice is, if you can, just use Word to process Word documents.

Answer by Rick Minerich

I recently did some research on this topic. It turns out that to manipulate Word files programmatically without opening Word itself, you need some very expensive tools.

There's an article over at Code Project on manipulating Word; you might find it useful. The author built a C# COM wrapper for dealing with calls to Word. It looks like it actually pops open the Word application, though.

This post over at the Neowin forums looks promising too. It includes quite a few P/Invoke calls for the purpose of text extraction.

Maybe if you could find a way to keep the window hidden it would be acceptable.
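
One way the hidden-window idea might look is sketched below. It is not taken from the linked articles. It assumes Word is installed on a Windows machine, uses late-bound COM through `dynamic` so that no interop assembly reference is needed, and reuses a single hidden Word instance for every file rather than starting one per document. The class name `HiddenWordReader` is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static class HiddenWordReader
{
    public static Dictionary<string, string> ReadAll(IReadOnlyCollection<string> paths)
    {
        var texts = new Dictionary<string, string>();
        if (paths.Count == 0)
            return texts; // nothing to do, so Word is never started

        // Late-bound COM activation; throws if Word is not registered.
        Type wordType = Type.GetTypeFromProgID("Word.Application", true);
        dynamic word = Activator.CreateInstance(wordType);
        try
        {
            word.Visible = false;   // keep the application window hidden
            word.DisplayAlerts = 0; // 0 = wdAlertsNone, suppress dialogs
            foreach (string path in paths)
            {
                // Positional arguments: FileName, ConfirmConversions, ReadOnly.
                dynamic document = word.Documents.Open(path, false, true);
                try
                {
                    texts[path] = (string)document.Content.Text;
                }
                finally
                {
                    document.Close(false); // close without saving changes
                }
            }
        }
        finally
        {
            word.Quit();
            Marshal.ReleaseComObject((object)word);
        }
        return texts;
    }
}
```

This still runs Word in the background, so it only addresses the visible window and the one-instance-per-file cost, not the dependency on Word itself.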

Answer by Cihan Ucar

Aspose has a component to read, modify and write Word documents. Here is the product link: Aspose.Words for .NET and Java

Aspose.Words enables .NET and Java applications to read, modify and write Word® documents without utilizing Microsoft Word®. Aspose.Words supports a wide array of features including document creation, content and formatting manipulation, powerful mail merge abilities, comprehensive support of DOC, OOXML, RTF, WordprocessingML, HTML, OpenDocument and PDF formats. Aspose.Words is truly the most affordable, fastest and feature rich Word component on the market.

Answer by edi9999

With docxtemplater, you can easily get the full text of a Word document (works with docx only).

Here's the code (Node.js):

const DocxTemplater = require('docxtemplater');
const doc = new DocxTemplater().loadFromFile("input.docx");
const result = doc.getFullText();

This is just three lines of code and doesn't depend on any Word instance (it's all plain JS).
