C#: Convert from Word document to HTML

Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/2266097/

Time: 2020-08-07 00:53:47  Source: igfitidea

Convert from Word document to HTML

Tags: c#, html, ms-word

Asked by Pankaj

I want to save a Word document as HTML using Word Viewer, without having Word installed on my machine. Is there any way to accomplish this in C#?


Answered by Tim S. Van Haren

You will need to have MS Word installed to do this, I believe.


Check out this article for details on the implementation.


Answered by Bryan

According to this Stack Overflow question, it isn't possible with Word Viewer. You will need Word installed in order to use COM Interop to interact with it.


Answered by ZombieSheep

I think this will depend on the version of the Word document. If you have them in docx format, I believe they are stored within the file as XML data (but it has been so long since I looked at the specification that I am perfectly happy to be corrected on that).

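To illustrate the point about docx being XML inside the file: a .docx is a ZIP package, and the main body conventionally lives in the part named word/document.xml. The following is only a minimal sketch under that assumption (strictly, the part is located through the package relationships); the class and method names are made up for illustration, and it needs nothing beyond System.IO.Compression.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class DocxPeek
{
    // Returns the raw XML of the main document part of a .docx package.
    // Assumes the conventional part name "word/document.xml".
    public static string ReadMainDocumentXml(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (var reader = new StreamReader(entry.Open()))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Calling it with File.OpenRead on a .docx path returns WordprocessingML markup, which still has to be transformed into HTML by a converter. A legacy binary .doc file is not a ZIP package, so this will fail on it.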

Answered by dnagirl

If you're open to not using C#, you could print to file using PrimoPDF (which would turn the .doc into a .pdf) and then use a PDF to HTML converter to go the rest of the way. After that you can edit the HTML however you like.


Answered by ternaryOperator

Using the document conversion tools available in OpenOffice.org is probably the only possible option - the .doc format is only designed to be opened via Microsoft products so any libraries dealing with it will need to have reverse engineered the entire format.


Answered by Krantisinh Patil

To convert a .docx file to HTML, you can use OpenXmlPowerTools. Make sure to add a reference to OpenXmlPowerTools.dll.


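using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;   // WordprocessingDocument lives in this namespace
using OpenXmlPowerTools;

// DocxFilePath and HTMLFilePath are strings holding the input and output paths.
byte[] byteArray = File.ReadAllBytes(DocxFilePath);
using (MemoryStream memoryStream = new MemoryStream())
{
     // Copy the file into an expandable in-memory stream so the document
     // can be opened for editing without modifying the file on disk.
     memoryStream.Write(byteArray, 0, byteArray.Length);
     using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
     {
          HtmlConverterSettings settings = new HtmlConverterSettings()
          {
               PageTitle = "My Page Title"
          };
          XElement html = HtmlConverter.ConvertToHtml(doc, settings);

          // Write the generated XHTML tree out as text.
          File.WriteAllText(HTMLFilePath, html.ToStringNewLineOnAttributes());
     }
}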

Answered by Michael Williamson

I wrote Mammoth for .NET, which is a library that converts docx files to HTML, and is available on NuGet.


Mammoth tries to produce clean HTML by looking at semantic information -- for instance, mapping paragraph styles in Word (such as Heading 1) to appropriate tags and style in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, then Mammoth probably isn't for you. If you have something that's already well-structured and want to convert that to tidy HTML, Mammoth might do the trick.

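To make the idea of mapping paragraph styles to tags concrete, here is a small hand-rolled sketch using LINQ to XML. It is not Mammoth's API or implementation; it only illustrates the concept. It assumes the paragraph is a WordprocessingML w:p element and that the "Heading 1" style carries the style id Heading1, which is typical for English-language documents but not guaranteed. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;

public static class SemanticMapSketch
{
    // WordprocessingML main namespace.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Maps one w:p element to an HTML element based on its paragraph style.
    public static string ParagraphToHtml(XElement paragraph)
    {
        // The style id sits at w:pPr/w:pStyle/@w:val; it is null when no style is set.
        string styleId = (string)paragraph
            .Element(W + "pPr")?.Element(W + "pStyle")?.Attribute(W + "val");

        // Concatenate the text of all runs in the paragraph.
        string text = string.Concat(paragraph.Descendants(W + "t").Select(t => t.Value));

        // Assumption: "Heading1" is the id of the "Heading 1" style.
        string tag = styleId == "Heading1" ? "h1" : "p";
        return "<" + tag + ">" + WebUtility.HtmlEncode(text) + "</" + tag + ">";
    }
}
```

A real converter also has to deal with lists, tables, images, run-level formatting and custom styles, which is the work a library such as Mammoth takes on.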

Answered by Ravi Gaurav Pandey

I have posted on a similar topic: "Convert Word to HTML then render HTML on webpage". You might find it helpful if you are still working on this. There is a freely distributed DLL for this; I have given the link there.


Answered by Bimzee

You can try Microsoft.Office.Interop.Word (this requires Word to be installed on the machine):


    using System;
    using Word = Microsoft.Office.Interop.Word;

    public static void ConvertDocToHtml(object Sourcepath, object TargetPath)
    {
        object Unknown = Type.Missing;
        object doNotSave = Word.WdSaveOptions.wdDoNotSaveChanges;

        Word._Application newApp = new Word.Application();
        try
        {
            Word.Documents d = newApp.Documents;
            Word.Document od = d.Open(ref Sourcepath, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown, ref Unknown);
            object format = Word.WdSaveFormat.wdFormatHTML;

            // Save the document that was just opened rather than relying on ActiveDocument.
            od.SaveAs(ref TargetPath, ref format,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown);

            od.Close(ref doNotSave, ref Unknown, ref Unknown);
        }
        finally
        {
            // Always quit Word; otherwise a hidden WINWORD.EXE process is left running.
            newApp.Quit(ref doNotSave, ref Unknown, ref Unknown);
        }
    }

Answered by Mike W

GemBox works pretty well. It even converts images in the Word doc to base64-encoded strings in img tags.
