How to extract text from MS Office documents in C#

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/1011234/

Date: 2020-08-06 05:29:44  Source: igfitidea

How to extract text from MS office documents in C#

c# ms-office text-extraction

Asked by Elias Haileselassie

I am trying to extract text (a string) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried NPOI, but I could not find a sample showing how to use it.

Accepted answer by adrianbanks

Using P/Invoke, you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool). You can simply ask the IFilter to return the text from the file. There are several sets of example code (here is one such example).

Answer by Skurmedel

I wrote a docx text extractor once, and it was very simple. Basically a docx file, and I presume the other (new) formats too, is a zip file containing a bunch of XML files. The text can be extracted with an XmlReader, using only .NET classes.

I don't seem to have the code anymore :(, but I found a guy who has a similar solution.

This may not be viable if you need to read .doc and .xls files, though, since they are binary formats and probably much harder to parse.

There is also the OpenXML SDK released by Microsoft, though it is still in CTP.
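
The zip-plus-XML approach described in this answer could look roughly like the sketch below. It is not the author's lost code, just a minimal illustration under some assumptions: the class and method names are made up, the main text is assumed to live in the word/document.xml entry, and only w:t (text) and w:p (paragraph) elements are handled.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class DocxXmlReaderSketch
{
    private const string WordNs =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Streams through word/document.xml and collects the text of every w:t element,
    // adding a line break at the end of each w:p paragraph.
    public static string ExtractText(Stream docxStream)
    {
        var text = new StringBuilder();
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (Stream part = entry.Open())
            using (XmlReader reader = XmlReader.Create(part))
            {
                bool insideText = false;
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            if (reader.NamespaceURI != WordNs) break;
                            if (reader.LocalName == "t" && !reader.IsEmptyElement)
                                insideText = true;
                            else if (reader.LocalName == "p" && reader.IsEmptyElement)
                                text.AppendLine(); // empty paragraph
                            break;

                        case XmlNodeType.Text:
                        case XmlNodeType.CDATA:
                        case XmlNodeType.Whitespace:
                        case XmlNodeType.SignificantWhitespace:
                            if (insideText) text.Append(reader.Value);
                            break;

                        case XmlNodeType.EndElement:
                            if (reader.NamespaceURI != WordNs) break;
                            if (reader.LocalName == "t") insideText = false;
                            else if (reader.LocalName == "p") text.AppendLine();
                            break;
                    }
                }
            }
        }
        return text.ToString();
    }
}
```

It could be called with something like `DocxXmlReaderSketch.ExtractText(File.OpenRead(path))`. Headers, footers, footnotes and comments live in other parts of the package and are not read here.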

Answer by joshcomley

Simple!

These two steps will get you there:

1) Use the Office Interop library to convert DOC to DOCX
2) Use DOCX2TXT to extract the text from the new DOCX

The link for 1) has a very good explanation of how to do the conversion, and even a code sample.

An alternative to 2) is to just unzip the DOCX file in C# and scan for the files you need. You can read about the structure of the ZIP file here.

Edit: Ah yes, I forgot to point out, as Skurmedel did below, that you must have Office installed on the system on which you want to do the conversion.
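
For the "unzip and scan" alternative to step 2, a first step might be simply listing what is inside the package. The sketch below uses only System.IO.Compression; the class and method names are invented for illustration, and filtering on the .xml extension is an assumption about which entries are of interest.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class DocxPartLister
{
    // Returns the names of all XML entries inside a DOCX (zip) stream, sorted by name.
    public static List<string> ListXmlParts(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries
                .Select(entry => entry.FullName)
                .Where(name => name.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                .OrderBy(name => name, StringComparer.Ordinal)
                .ToList();
        }
    }
}
```

In a typical Word document the main body text is in the word/document.xml entry, so that is usually the one to scan first.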

Answer by KyleM

For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files you can use the Open XML SDK. This snippet of code opens a document and returns its contents as text. It is especially useful for anyone trying to use regular expressions to parse the contents of a Word document. To use this solution you need to reference DocumentFormat.OpenXml.dll, which is part of the OpenXML SDK.

See: http://msdn.microsoft.com/en-us/library/bb448854.aspx

    using System;
    using System.IO;
    using System.Text;
    using System.Xml;
    using DocumentFormat.OpenXml.Packaging; // Open XML SDK
    using Microsoft.SharePoint;             // SPFile

    public static class WordTextExtractor
    {
        public static string TextFromWord(SPFile file)
        {
            const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

            StringBuilder textBuilder = new StringBuilder();
            using (Stream fileStream = file.OpenBinaryStream())
            using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(fileStream, false))
            {
                // Manage namespaces to perform XPath queries.
                NameTable nt = new NameTable();
                XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
                nsManager.AddNamespace("w", wordmlNamespace);

                // Get the document part from the package.
                // Load the XML in the document part into an XmlDocument instance.
                XmlDocument xdoc = new XmlDocument(nt);
                xdoc.Load(wdDoc.MainDocumentPart.GetStream());

                // Append the text of every w:t node, one line per w:p paragraph.
                XmlNodeList paragraphNodes = xdoc.SelectNodes("//w:p", nsManager);
                foreach (XmlNode paragraphNode in paragraphNodes)
                {
                    XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t", nsManager);
                    foreach (XmlNode textNode in textNodes)
                    {
                        textBuilder.Append(textNode.InnerText);
                    }
                    textBuilder.Append(Environment.NewLine);
                }
            }
            return textBuilder.ToString();
        }
    }
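
As a small illustration of the regular-expression use case mentioned in this answer, the sketch below searches already-extracted text. The helper name and the yyyy-MM-dd date pattern are purely illustrative assumptions; any pattern could be substituted.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the Open XML SDK.
public static class ExtractedTextSearch
{
    // Returns every ISO-style date (yyyy-MM-dd) found in the extracted document text.
    public static List<string> FindIsoDates(string documentText)
    {
        var dates = new List<string>();
        foreach (Match match in Regex.Matches(documentText, @"\b\d{4}-\d{2}-\d{2}\b"))
        {
            dates.Add(match.Value);
        }
        return dates;
    }
}
```

Because the extractor emits one line per paragraph, a match will not span paragraphs unless the pattern explicitly allows line breaks.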

Answer by Jordan

Let me slightly correct the answer given by KyleM. I added processing for two extra nodes that influence the result: w:tab, which is emitted as a horizontal tab ("\t"), and w:br, which is emitted as a vertical tab ("\v"). Here is the code:

    public static string ReadAllTextFromDocx(FileInfo fileInfo)
    {
        StringBuilder stringBuilder;
        using(WordprocessingDocument wordprocessingDocument = WordprocessingDocument.Open(fileInfo.FullName, false))
        {
            NameTable nameTable = new NameTable();
            XmlNamespaceManager xmlNamespaceManager = new XmlNamespaceManager(nameTable);
            xmlNamespaceManager.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

            string wordprocessingDocumentText;
            using(StreamReader streamReader = new StreamReader(wordprocessingDocument.MainDocumentPart.GetStream()))
            {
                wordprocessingDocumentText = streamReader.ReadToEnd();
            }

            stringBuilder = new StringBuilder(wordprocessingDocumentText.Length);

            XmlDocument xmlDocument = new XmlDocument(nameTable);
            xmlDocument.LoadXml(wordprocessingDocumentText);

            XmlNodeList paragraphNodes = xmlDocument.SelectNodes("//w:p", xmlNamespaceManager);
            foreach(XmlNode paragraphNode in paragraphNodes)
            {
                XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t | .//w:tab | .//w:br", xmlNamespaceManager);
                foreach(XmlNode textNode in textNodes)
                {
                    switch(textNode.Name)
                    {
                        case "w:t":
                            stringBuilder.Append(textNode.InnerText);
                            break;

                        case "w:tab":
                            stringBuilder.Append("\t");
                            break;

                        case "w:br":
                            stringBuilder.Append("\v");
                            break;
                    }
                }

                stringBuilder.Append(Environment.NewLine);
            }
        }

        return stringBuilder.ToString();
    }
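
The method above writes each w:br as a vertical tab ("\v"). If the extracted text is meant for display or line-based processing, one option is to turn those into ordinary line breaks afterwards. This is only a sketch of that post-processing step; the helper name is hypothetical, and whether a w:br should become a full line break is an assumption about what the caller wants.

```csharp
using System;

// Hypothetical helper for post-processing extracted text.
public static class ExtractedTextCleanup
{
    // Replaces every vertical tab ("\v") with the platform's line break.
    public static string NormalizeLineBreaks(string extractedText)
    {
        if (extractedText == null) throw new ArgumentNullException(nameof(extractedText));
        return extractedText.Replace("\v", Environment.NewLine);
    }
}
```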

Answer by Sep

Tika is very helpful and makes it easy to extract text from different kinds of documents, including Microsoft Office files.

You can use this project, a very nice piece of work by Kevin Miller: http://kevm.github.io/tikaondotnet/

Simply add this NuGet package: https://www.nuget.org/packages/TikaOnDotNet/

Then this one line of code will do the magic:

var text = new TikaOnDotNet.TextExtractor().Extract("fileName.docx  / pdf  / .... ").Text;

Answer by lxa

A bit late to the party, but nowadays you don't need to download anything; everything is already installed with .NET. Just make sure to add references to System.IO.Compression and System.IO.Compression.FileSystem.

using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml;
using System.Text;
using System.IO.Compression;

public static class DocxTextExtractor
{
    public static string Extract(string filename)
    {
        XmlNamespaceManager NsMgr = new XmlNamespaceManager(new NameTable());
        NsMgr.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

        using (var archive = ZipFile.OpenRead(filename))
        {
            return XDocument
                .Load(archive.GetEntry(@"word/document.xml").Open())
                .XPathSelectElements("//w:p", NsMgr)
                .Aggregate(new StringBuilder(), (sb, p) => p
                    .XPathSelectElements(".//w:t|.//w:tab|.//w:br", NsMgr)
                    .Select(e => { switch (e.Name.LocalName) { case "br": return "\v"; case "tab": return "\t"; } return e.Value; })
                    .Aggregate(sb, (sb1, v) => sb1.Append(v)))
                .ToString();
        }
    }
}

Answer by Chris

Use the Microsoft Office Interop. It's free and slick. Here is how I pulled all the words from a doc.

    using Microsoft.Office.Interop.Word;

    // Open the document
    string docPath = @"C:\docLocation.doc";
    Application app = new Application();
    Document doc = app.Documents.Open(docPath);

    // Get all words
    string allWords = doc.Content.Text;
    doc.Close();
    app.Quit();

Then do whatever you want with the words.

Answer by Erik Felde

If you're looking for ASP.NET options, the interop won't work unless you install Office on the server. Even then, Microsoft says not to do it.

I used Spire.Doc, and it worked beautifully (Spire.Doc download). It even read documents that were really .txt files saved with a .doc extension. There are free and paid versions. You can also get a trial license that removes some warnings from documents that you create, but I didn't create any documents, I only searched them, so the free version worked like a charm.
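
Since a file's extension does not always match its real format (as with the .txt files saved as .doc mentioned above), it can help to look at the first bytes before picking an extraction path. The sketch below is independent of Spire.Doc and only tells containers apart: zip-based packages such as .docx start with the bytes "PK" 03 04, and legacy binary .doc/.xls/.ppt files are OLE compound files starting with D0 CF 11 E0 A1 B1 1A E1. The type and member names are hypothetical.

```csharp
using System;
using System.IO;

public enum OfficeContainerKind { Unknown, ZipPackage, OleCompoundFile }

// Hypothetical helper, not part of any library.
public static class OfficeFormatSniffer
{
    private static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };
    private static readonly byte[] OleSignature = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Reads up to 8 bytes from the current position and classifies the container.
    public static OfficeContainerKind Detect(Stream stream)
    {
        var header = new byte[8];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break; // end of stream
            read += n;
        }

        if (StartsWith(header, read, OleSignature)) return OfficeContainerKind.OleCompoundFile;
        if (StartsWith(header, read, ZipSignature)) return OfficeContainerKind.ZipPackage;
        return OfficeContainerKind.Unknown;
    }

    private static bool StartsWith(byte[] buffer, int length, byte[] signature)
    {
        if (length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (buffer[i] != signature[i]) return false;
        }
        return true;
    }
}
```

A result of Unknown for a file named .doc would suggest treating it as plain text or another format rather than handing it to a Word parser. Note that a zip signature only shows the file is a zip archive, not which Office format it holds.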

Answer by Usman Aziz

One suitable option for extracting text from Office documents in C# is the GroupDocs.Parser for .NET API. The following code samples extract plain text as well as formatted text.

Extracting Text

// Create an instance of Parser class
using(Parser parser = new Parser("sample.docx"))
{
    // Extract a text into the reader
    using(TextReader reader = parser.GetText())
    {
        // Print a text from the document
        // If text extraction isn't supported, a reader is null
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}

Extracting Formatted Text

// Create an instance of Parser class
using (Parser parser = new Parser("sample.docx"))
{
    // Extract a formatted text into the reader
    using (TextReader reader = parser.GetFormattedText(new FormattedTextOptions(FormattedTextMode.Html)))
    {
        // Print a formatted text from the document
        // If formatted text extraction isn't supported, a reader is null
        Console.WriteLine(reader == null ? "Formatted text extraction isn't supported" : reader.ReadToEnd());
    }
}

Disclosure: I work as Developer Evangelist at GroupDocs.
