Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/1043768/

Time: 2020-08-25 00:45:18  Source: igfitidea

Quickly Convert (.rtf|.doc) Files to Markdown Syntax with PHP

Tags: php, automation, markdown, file-conversion, .doc

Asked by Sampson

I've been manually converting articles into Markdown syntax for a few days now, and it's getting rather tedious. Some of these are 3 or 4 pages long, with italics and other emphasized text throughout. Is there a faster way to convert (.rtf|.doc) files to clean Markdown syntax that I can take advantage of?

Answered by David

If you happen to be on a Mac, textutil does a good job of converting doc, docx, and rtf to HTML, and pandoc does a good job of converting the resulting HTML to Markdown:

$ textutil -convert html file.doc -stdout | pandoc -f html -t markdown -o file.md
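
For converting many files, the same pipeline could be driven from a small script. The following is a minimal Python sketch, not the answerer's own script: the function names (build_pipeline, as_shell_line, convert) are hypothetical, and it assumes textutil and pandoc are on the PATH and that the output should sit next to the input with a .md extension. The flags are exactly the ones in the command above.

```python
import shlex
import subprocess
from pathlib import Path


def build_pipeline(doc_path):
    """Return the textutil and pandoc argument lists for one input file."""
    src = Path(doc_path)
    dest = src.with_suffix(".md")  # assumption: write file.md next to file.doc
    textutil_cmd = ["textutil", "-convert", "html", str(src), "-stdout"]
    pandoc_cmd = ["pandoc", "-f", "html", "-t", "markdown", "-o", str(dest)]
    return textutil_cmd, pandoc_cmd


def as_shell_line(doc_path):
    """Render the pipeline as one shell line, quoting unusual file names."""
    textutil_cmd, pandoc_cmd = build_pipeline(doc_path)
    return shlex.join(textutil_cmd) + " | " + shlex.join(pandoc_cmd)


def convert(doc_path):
    """Run textutil, then feed its HTML output to pandoc on standard input."""
    textutil_cmd, pandoc_cmd = build_pipeline(doc_path)
    html = subprocess.run(textutil_cmd, check=True, capture_output=True).stdout
    subprocess.run(pandoc_cmd, input=html, check=True)
```

Calling convert("file.doc") on a Mac with both tools installed should behave like the one-liner; as_shell_line is only there to preview the command before running it.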

I have a script that I threw together a while back that tries to use textutil, pdf2html, and pandoc to convert whatever I throw at it to Markdown.

Answered by Taj Moore

ProgTips has a possible solution with a Word macro (source download):

A simple macro (source download) for converting the most trivial things automatically. This macro does:

  • Replace bold and italics
  • Replace headings (marked heading 1-6)
  • Replace numbered and bulleted lists

It's very buggy, and I believe it hangs on larger documents; however, I'm NOT stating it's a stable release anyway! :-) Experimental use only. Recode and reuse it as you like, and post a comment if you've found a better solution.

Source: ProgTips

Macro source

Installation

  • open WinWord,
  • press Alt+F11 to open the VBA editor,
  • right click the first project in the project browser
  • choose insert->module
  • paste the code from the file
  • close macro editor
  • go to Tools > Macro > Macros; run the macro named MarkDown

Source: ProgTips

Source

Macro source for safe keeping if ProgTips deletes the post or the site gets wiped out:

'*** A simple MsWord->Markdown replacement macro by Kriss Rauhvargers, 2006.02.02.
'*** This tool does NOT implement all the markup specified in MarkDown definition by John Gruber, only
'*** the most simple things. These are:
'*** 1) Replaces all non-list paragraphs to ^p paragraph so MarkDown knows it is a stand-alone paragraph
'*** 2) Converts tables to text. In fact, tables get lost.
'*** 3) Adds a single indent to all indented paragraphs
'*** 4) Replaces all the text in italics to _text_
'*** 5) Replaces all the text in bold to **text**
'*** 6) Replaces Heading1-6 to #..#Heading (Heading numbering gets lost)
'*** 7) Replaces bulleted lists with ^p *  listitem ^p*  listitem2...
'*** 8) Replaces numbered lists with ^p 1. listitem ^p2.  listitem2...
'*** Feel free to use and redistribute this code
Sub MarkDown()
    Dim bReplace As Boolean
    Dim i As Integer
    Dim oPara As Paragraph


    'remove formatting from paragraph sign so that we dont get **blablabla^p** but rather **blablabla**^p
    Call RemoveBoldEnters


    For i = Selection.Document.Tables.Count To 1 Step -1
            Call Selection.Document.Tables(i).ConvertToText
    Next

    'simple text indent + extra paragraphs for non-numbered paragraphs
    For i = Selection.Document.Paragraphs.Count To 1 Step -1
        Set oPara = Selection.Document.Paragraphs(i)
        If oPara.Range.ListFormat.ListType = wdListNoNumbering Then
            If oPara.LeftIndent > 0 Then
                oPara.Range.InsertBefore (">")
            End If
            oPara.Range.InsertBefore (vbCrLf)
        End If


    Next

    'italic -> _italic_
    Selection.HomeKey Unit:=wdStory
    bReplace = ReplaceOneItalic  'first replacement
    While bReplace 'other replacements
        bReplace = ReplaceOneItalic
    Wend

    'bold-> **bold**
    Selection.HomeKey Unit:=wdStory
    bReplace = ReplaceOneBold 'first replacement
    While bReplace
        bReplace = ReplaceOneBold 'other replacements
    Wend



    'Heading -> ##heading
    For i = 1 To 6 'heading1 to heading6
        Selection.HomeKey Unit:=wdStory
        bReplace = ReplaceH(i) 'first replacement
        While bReplace
            bReplace = ReplaceH(i) 'other replacements
        Wend
    Next

    Call ReplaceLists


    Selection.HomeKey Unit:=wdStory
End Sub


'***************************************************************
' Function to replace bold with **bold**, only the first occurrence
' Returns true if any occurrence found, false otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'***************************************************************
Function ReplaceOneBold() As Boolean
    Dim bReturn As Boolean

    Selection.Find.ClearFormatting
    With Selection.Find
        .Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Font.Bold = True
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With

    bReturn = False
    While Selection.Find.Execute = True
        bReturn = True
        Selection.Text = "**" & Selection.Text & "**"
        Selection.Font.Bold = False
        Selection.Find.Execute
    Wend

    ReplaceOneBold = bReturn
End Function

'*******************************************************************
' Function to replace italic with _italic_, only the first occurrence
' Returns true if any occurrence found, false otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'********************************************************************
Function ReplaceOneItalic() As Boolean
    Dim bReturn As Boolean

        Selection.Find.ClearFormatting

    With Selection.Find
        .Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Font.Italic = True
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With

    bReturn = False
    While Selection.Find.Execute = True
        bReturn = True
        Selection.Text = "_" & Selection.Text & "_"
        Selection.Font.Italic = False
        Selection.Find.Execute
    Wend
    ReplaceOneItalic = bReturn
End Function

'*********************************************************************
' Function to replace headingX with #heading, only the first occurrence
' Returns true if any occurrence found, false otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'*********************************************************************
Function ReplaceH(ByVal ipNumber As Integer) As Boolean
    Dim sReplacement As String
    Dim bReturn As Boolean 'declared explicitly so the function compiles under Option Explicit

    Select Case ipNumber
    Case 1: sReplacement = "#"
    Case 2: sReplacement = "##"
    Case 3: sReplacement = "###"
    Case 4: sReplacement = "####"
    Case 5: sReplacement = "#####"
    Case 6: sReplacement = "######"
    End Select

    Selection.Find.ClearFormatting
    Selection.Find.Style = ActiveDocument.Styles("Heading " & ipNumber)
    With Selection.Find
        .Text = ""
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With


     bReturn = False
    While Selection.Find.Execute = True
        bReturn = True
        Selection.Range.InsertBefore (vbCrLf & sReplacement & " ")
        Selection.Style = ActiveDocument.Styles("Normal")
        Selection.Find.Execute
    Wend

    ReplaceH = bReturn
End Function



'***************************************************************
' A fix-up for paragraph marks that are bold or italic
'***************************************************************
Sub RemoveBoldEnters()
    Selection.HomeKey Unit:=wdStory
    Selection.Find.ClearFormatting
    Selection.Find.Font.Italic = True
    Selection.Find.Replacement.ClearFormatting
    Selection.Find.Replacement.Font.Bold = False
    Selection.Find.Replacement.Font.Italic = False
    With Selection.Find
        .Text = "^p"
        .Replacement.Text = "^p"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

    Selection.HomeKey Unit:=wdStory
    Selection.Find.ClearFormatting
    Selection.Find.Font.Bold = True
    Selection.Find.Replacement.ClearFormatting
    Selection.Find.Replacement.Font.Bold = False
    Selection.Find.Replacement.Font.Italic = False
    With Selection.Find
        .Text = "^p"
        .Replacement.Text = "^p"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

'***************************************************************
' Sub to replace Word lists with Markdown list markers:
' bulleted items get "*", numbered items get their list value and "."
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'***************************************************************
Sub ReplaceLists()
    Dim i As Integer
    Dim j As Integer
    Dim Para As Paragraph

    Selection.HomeKey Unit:=wdStory

    'iterate through all the lists in the document
    For i = Selection.Document.Lists.Count To 1 Step -1
        'check each paragraph in the list
        For j = Selection.Document.Lists(i).ListParagraphs.Count To 1 Step -1
            Set Para = Selection.Document.Lists(i).ListParagraphs(j)
            'if it's a bulleted list
            If Para.Range.ListFormat.ListType = wdListBullet Then
                        Para.Range.InsertBefore (ListIndent(Para.Range.ListFormat.ListLevelNumber, "*"))
            'if it's a numbered list
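            'compare ListType against each constant; "x = A Or B Or C" is always true
            ElseIf Para.Range.ListFormat.ListType = wdListSimpleNumbering Or _
                   Para.Range.ListFormat.ListType = wdListMixedNumbering Or _
                   Para.Range.ListFormat.ListType = wdListListNumOnly Then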
                Para.Range.InsertBefore (Para.Range.ListFormat.ListValue & ".  ")
            End If
        Next j
        'inserts paragraph marks before and after, removes the list itself
        Selection.Document.Lists(i).Range.InsertParagraphBefore
        Selection.Document.Lists(i).Range.InsertParagraphAfter
        Selection.Document.Lists(i).RemoveNumbers
    Next i
End Sub

'***********************************************************
' Returns the MarkDown indent text
'***********************************************************
Function ListIndent(ByVal ipNumber As Integer, ByVal spChar As String) As String
    Dim i  As Integer
    For i = 1 To ipNumber - 1
        ListIndent = ListIndent & "    "
    Next
    ListIndent = ListIndent & spChar & "    "
End Function

Source: ProgTips
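
To make the macro's output conventions easier to see outside of Word, here is a minimal Python sketch of the text rules it applies. This is not part of the ProgTips macro and it does not read Word documents; the function names are hypothetical and only mirror what ReplaceH, ListIndent, ReplaceOneBold, and ReplaceOneItalic insert.

```python
def heading_prefix(level):
    """Prefix the macro puts before a 'Heading 1' to 'Heading 6' paragraph."""
    if not 1 <= level <= 6:
        raise ValueError("heading level must be between 1 and 6")
    return "#" * level + " "


def list_indent(level, marker="*"):
    """Mirror ListIndent: four spaces per nesting level above the first,
    then the marker followed by four spaces."""
    return "    " * (level - 1) + marker + "    "


def bold(text):
    """Bold runs become **text**."""
    return "**" + text + "**"


def italic(text):
    """Italic runs become _text_."""
    return "_" + text + "_"
```

For example, heading_prefix(3) + "Setup" gives "### Setup", and list_indent(2) + "item" gives a bullet nested one level deep.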

Answered by matb33

If you're open to using the .docx format, you could use this PHP script that I put together. It extracts the XML, runs some XSL transformations, and outputs a pretty decent Markdown equivalent:

https://github.com/matb33/docx2md

Note that it is meant to work from the command-line, and is rather basic in its interface. However, it will get the job done!

If the script doesn't work well enough for you, I encourage you to send me your .docx files so I can reproduce your problem and fix it. Log an issue on GitHub or contact me directly if you prefer.

Answered by Mike Eng

Pandoc is a good command-line conversion tool, but again, you will first need to get the input into one of the formats that Pandoc can read:

  • markdown
  • reStructuredText
  • textile
  • HTML
  • LaTeX
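
A small helper can decide whether a file is already in one of these formats or needs converting first. This is only a sketch under stated assumptions: the mapping from file extensions to formats is my own choice for illustration, the function names are hypothetical, and the format list is limited to the one given in this answer. The values are the reader names passed to pandoc's -f option.

```python
from pathlib import Path

# Extension -> pandoc reader name (value for -f), limited to the formats
# listed in the answer. The extension choices are illustrative assumptions.
READERS = {
    ".md": "markdown",
    ".markdown": "markdown",
    ".rst": "rst",
    ".textile": "textile",
    ".html": "html",
    ".htm": "html",
    ".tex": "latex",
}


def pandoc_reader(path):
    """Return the reader name for path, or None if it is not in the list."""
    return READERS.get(Path(path).suffix.lower())


def needs_preconversion(path):
    """True when the file must be converted to a listed format first."""
    return pandoc_reader(path) is None
```

With this list, needs_preconversion("article.doc") is true, which matches the point of the answer: a .doc or .rtf file has to go through another tool before Pandoc.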

Answered by janpaul123

We had the same problem of having to convert Word documents to markdown. Some were more complicated and (very) large documents, with math equations and images and such. So I made this script which converts using a number of different tools: https://github.com/Versal/word2markdown

Because it uses a chain of several tools it is a bit more error-prone, but it can be a good starting point if you have more complicated documents. Hope it can be helpful! :)

Update: It currently only works on Mac OS X, and it requires Word, Pandoc, HTML Tidy, git, and node/npm to be installed. For it to work properly, you also need to open an empty Word document and do: File->Save As Webpage->Compatibility->Encoding->UTF-8. This encoding is then saved as the default. See the README for more details on how to set up.

Then run this in the console:

$ git clone git@github.com:Versal/word2markdown.git
$ cd word2markdown
$ npm install
(copy over the Word files, for example, "document.docx")
$ ./doc-to-md.sh document.docx document_files > document.md

Then you can find the Markdown in document.md and the images in the directory document_files.

It's perhaps a bit complicated now, so I would welcome any contributions that make this easier or make this work on other operating systems! :)

Answered by user626528

Have you tried this one? Not sure about feature richness, but it works for simple texts. http://markitdown.medusis.com/

Answered by Valentin

As part of a university Ruby course, I developed a tool that converts OpenOffice text documents (.odt) to Markdown. A lot of assumptions have to be made to produce correct formatting. For example, it is hard to determine what text size should be treated as a heading. However, the only thing you can lose with this conversion is formatting: any text that is encountered is always appended to the Markdown document. The tool I've developed supports lists, bold and italic text, and it has syntax for tables.

http://github.com/bostko/doc2text Give it a try and please give me your feedback.
