Quickly Convert (.rtf|.doc) Files to Markdown Syntax with PHP
Original question: http://stackoverflow.com/questions/1043768/
This page is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Sampson
I've been manually converting articles into Markdown syntax for a few days now, and it's getting rather tedious. Some of these are 3 or 4 pages long, with italics and other emphasized text throughout. Is there a faster way to convert (.rtf|.doc) files to clean Markdown syntax that I can take advantage of?
Answered by David
If you happen to be on a Mac, textutil does a good job of converting doc, docx, and rtf to html, and pandoc does a good job of converting the resulting html to markdown:
$ textutil -convert html file.doc -stdout | pandoc -f html -t markdown -o file.md
I have a script that I threw together a while back that tries to use textutil, pdf2html, and pandoc to convert whatever I throw at it to markdown.
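The textutil and pandoc chain above can be wrapped in a small helper. The following is a minimal Python sketch, not the script mentioned in the answer: the names `build_pipeline` and `convert` are hypothetical, and only the two command lines quoted above are used. It covers doc, docx, and rtf inputs only, since the answer gives no pdf2html invocation.

```python
import subprocess
from pathlib import Path

# Extensions handled by the textutil -> pandoc chain quoted in the answer.
TEXTUTIL_FORMATS = {".doc", ".docx", ".rtf"}


def build_pipeline(source, target=None):
    """Return the two argv lists for the textutil | pandoc pipeline."""
    src = Path(source)
    if src.suffix.lower() not in TEXTUTIL_FORMATS:
        raise ValueError(f"unsupported input format: {src.suffix or source}")
    # Default output name: same path with a .md extension (an assumption).
    out = Path(target) if target else src.with_suffix(".md")
    to_html = ["textutil", "-convert", "html", str(src), "-stdout"]
    to_markdown = ["pandoc", "-f", "html", "-t", "markdown", "-o", str(out)]
    return to_html, to_markdown


def convert(source, target=None):
    """Run the pipeline; needs macOS textutil and pandoc on the PATH."""
    to_html, to_markdown = build_pipeline(source, target)
    # Capture the HTML from textutil, then feed it to pandoc on stdin.
    html = subprocess.run(to_html, check=True, capture_output=True).stdout
    subprocess.run(to_markdown, input=html, check=True)
```

Calling `convert("file.doc")` should behave like the one-liner above and write `file.md`. Keeping `build_pipeline` separate makes it possible to inspect the commands without running them.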
Answered by Taj Moore
ProgTips has a possible solution with a Word macro (source download):
A simple macro (source download) for converting the most trivial things automatically. This macro does:
- Replace bold and italics
- Replace headings (marked heading 1-6)
- Replace numbered and bulleted lists
It's very buggy, I believe it hangs on larger documents, however I'm NOT stating it's a stable release anyway! :-) Experimental use only, recode and reuse it as you like, post a comment if you've found a better solution.
Source: ProgTips
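To make the bold and italics replacement concrete outside Word, here is a minimal Python sketch. It assumes the document has already been split into runs of `(text, bold, italic)`, which is a hypothetical representation and not something the macro produces. It uses the same markers as the macro (`_text_` and `**text**`) and keeps surrounding whitespace outside the markers, so output reads `**bold**` followed by the line break rather than the break ending up inside the markers.

```python
def wrap(text, marker):
    """Wrap text in marker, keeping leading/trailing whitespace outside."""
    core = text.strip()
    if not core:
        return text  # whitespace-only run: nothing to emphasize
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    return f"{lead}{marker}{core}{marker}{trail}"


def runs_to_markdown(runs):
    """Convert (text, bold, italic) runs into one Markdown string."""
    parts = []
    for text, bold, italic in runs:
        if italic:
            text = wrap(text, "_")
        if bold:
            text = wrap(text, "**")
        parts.append(text)
    return "".join(parts)


runs = [
    ("Plain ", False, False),
    ("bold\n", True, False),
    ("and ", False, False),
    ("slanted", False, True),
]
print(runs_to_markdown(runs))
```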
Macro source
Installation
- open WinWord,
- press Alt+F11 to open the VBA editor,
- right click the first project in the project browser
- choose insert->module
- paste the code from the file
- close macro editor
- go to Tools > Macro > Macros and run the macro named MarkDown
Source: ProgTips
Source
Macro source for safe keeping if ProgTips deletes the post or the site gets wiped out:
'*** A simple MsWord->Markdown replacement macro by Kriss Rauhvargers, 2006.02.02.
'*** This tool does NOT implement all the markup specified in MarkDown definition by John Gruber, only
'*** the most simple things. These are:
'*** 1) Replaces all non-list paragraphs to ^p paragraph so MarkDown knows it is a stand-alone paragraph
'*** 2) Converts tables to text. In fact, tables get lost.
'*** 3) Adds a single indent to all indented paragraphs
'*** 4) Replaces all the text in italics to _text_
'*** 5) Replaces all the text in bold to **text**
'*** 6) Replaces Heading1-6 to #..#Heading (Heading numbering gets lost)
'*** 7) Replaces bulleted lists with ^p * listitem ^p* listitem2...
'*** 8) Replaces numbered lists with ^p 1. listitem ^p2. listitem2...
'*** Feel free to use and redistribute this code
Sub MarkDown()
Dim bReplace As Boolean
Dim i As Integer
Dim oPara As Paragraph
'remove formatting from paragraph sign so that we dont get **blablabla^p** but rather **blablabla**^p
Call RemoveBoldEnters
For i = Selection.Document.Tables.Count To 1 Step -1
Call Selection.Document.Tables(i).ConvertToText
Next
'simple text indent + extra paragraphs for non-numbered paragraphs
For i = Selection.Document.Paragraphs.Count To 1 Step -1
Set oPara = Selection.Document.Paragraphs(i)
If oPara.Range.ListFormat.ListType = wdListNoNumbering Then
If oPara.LeftIndent > 0 Then
oPara.Range.InsertBefore (">")
End If
oPara.Range.InsertBefore (vbCrLf)
End If
Next
'italic -> _italic_
Selection.HomeKey Unit:=wdStory
bReplace = ReplaceOneItalic 'first replacement
While bReplace 'other replacements
bReplace = ReplaceOneItalic
Wend
'bold-> **bold**
Selection.HomeKey Unit:=wdStory
bReplace = ReplaceOneBold 'first replacement
While bReplace
bReplace = ReplaceOneBold 'other replacements
Wend
'Heading -> ##heading
For i = 1 To 6 'heading1 to heading6
Selection.HomeKey Unit:=wdStory
bReplace = ReplaceH(i) 'first replacement
While bReplace
bReplace = ReplaceH(i) 'other replacements
Wend
Next
Call ReplaceLists
Selection.HomeKey Unit:=wdStory
End Sub
'***************************************************************
' Function to replace bold with **bold**, only the first occurrence
' Returns true if any occurrence found, false otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'***************************************************************
Function ReplaceOneBold() As Boolean
Dim bReturn As Boolean
Selection.Find.ClearFormatting
With Selection.Find
.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Font.Bold = True
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
bReturn = False
While Selection.Find.Execute = True
bReturn = True
Selection.Text = "**" & Selection.Text & "**"
Selection.Font.Bold = False
Selection.Find.Execute
Wend
ReplaceOneBold = bReturn
End Function
'*******************************************************************
' Function to replace italic with _italic_, only the first occurrence
' Returns true if any occurrence found, false otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'********************************************************************
Function ReplaceOneItalic() As Boolean
Dim bReturn As Boolean
Selection.Find.ClearFormatting
With Selection.Find
.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Font.Italic = True
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
bReturn = False
While Selection.Find.Execute = True
bReturn = True
Selection.Text = "_" & Selection.Text & "_"
Selection.Font.Italic = False
Selection.Find.Execute
Wend
ReplaceOneItalic = bReturn
End Function
'*********************************************************************
' Function to replace headingX with #heading, only the first occurrence
' Returns true if any occurrence found, false otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'*********************************************************************
Function ReplaceH(ByVal ipNumber As Integer) As Boolean
Dim sReplacement As String
Dim bReturn As Boolean
Select Case ipNumber
Case 1: sReplacement = "#"
Case 2: sReplacement = "##"
Case 3: sReplacement = "###"
Case 4: sReplacement = "####"
Case 5: sReplacement = "#####"
Case 6: sReplacement = "######"
End Select
Selection.Find.ClearFormatting
Selection.Find.Style = ActiveDocument.Styles("Heading " & ipNumber)
With Selection.Find
.Text = ""
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
bReturn = False
While Selection.Find.Execute = True
bReturn = True
Selection.Range.InsertBefore (vbCrLf & sReplacement & " ")
Selection.Style = ActiveDocument.Styles("Normal")
Selection.Find.Execute
Wend
ReplaceH = bReturn
End Function
'***************************************************************
' A fix-up for paragraph marks that are bold or italic
'***************************************************************
Sub RemoveBoldEnters()
Selection.HomeKey Unit:=wdStory
Selection.Find.ClearFormatting
Selection.Find.Font.Italic = True
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.Font.Bold = False
Selection.Find.Replacement.Font.Italic = False
With Selection.Find
.Text = "^p"
.Replacement.Text = "^p"
.Forward = True
.Wrap = wdFindContinue
.Format = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
Selection.HomeKey Unit:=wdStory
Selection.Find.ClearFormatting
Selection.Find.Font.Bold = True
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.Font.Bold = False
Selection.Find.Replacement.Font.Italic = False
With Selection.Find
.Text = "^p"
.Replacement.Text = "^p"
.Forward = True
.Wrap = wdFindContinue
.Format = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
End Sub
'***************************************************************
' Sub to replace bulleted and numbered lists with MarkDown list
' markers (* item / 1. item), then remove Word's own list formatting
'***************************************************************
Sub ReplaceLists()
Dim i As Integer
Dim j As Integer
Dim Para As Paragraph
Selection.HomeKey Unit:=wdStory
'iterate through all the lists in the document
For i = Selection.Document.Lists.Count To 1 Step -1
'check each paragraph in the list
For j = Selection.Document.Lists(i).ListParagraphs.Count To 1 Step -1
Set Para = Selection.Document.Lists(i).ListParagraphs(j)
'if it's a bulleted list
If Para.Range.ListFormat.ListType = wdListBullet Then
Para.Range.InsertBefore (ListIndent(Para.Range.ListFormat.ListLevelNumber, "*"))
'if it's a numbered list
ElseIf Para.Range.ListFormat.ListType = wdListSimpleNumbering Or _
Para.Range.ListFormat.ListType = wdListMixedNumbering Or _
Para.Range.ListFormat.ListType = wdListListNumOnly Then
Para.Range.InsertBefore (Para.Range.ListFormat.ListValue & ". ")
End If
Next j
'inserts paragraph marks before and after, removes the list itself
Selection.Document.Lists(i).Range.InsertParagraphBefore
Selection.Document.Lists(i).Range.InsertParagraphAfter
Selection.Document.Lists(i).RemoveNumbers
Next i
End Sub
'***********************************************************
' Returns the MarkDown indent text
'***********************************************************
Function ListIndent(ByVal ipNumber As Integer, ByVal spChar As String) As String
Dim i As Integer
For i = 1 To ipNumber - 1
ListIndent = ListIndent & " "
Next
ListIndent = ListIndent & spChar & " "
End Function
Source: ProgTips
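For readers who do not use VBA, the heading and list logic of the macro (ReplaceH and ListIndent) can be sketched in a few lines of Python. The function names and the four-space indent per nesting level are assumptions made for this illustration. Unlike the macro, which only indents bulleted items, the sketch applies the same indent to numbered items too.

```python
def heading_prefix(level):
    """Return the run of '#' characters for Heading 1 to Heading 6."""
    if not 1 <= level <= 6:
        raise ValueError("Markdown supports heading levels 1 to 6")
    return "#" * level


def list_item(text, level=1, number=None, indent="    "):
    """Format one list paragraph; number=None means a bulleted item."""
    marker = "*" if number is None else f"{number}."
    # Level 1 gets no indent; each deeper level adds one indent unit.
    return f"{indent * (level - 1)}{marker} {text}"


print(heading_prefix(2), "Installation")
print(list_item("open WinWord"))
print(list_item("nested bullet", level=2))
print(list_item("second step", number=2))
```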
Answered by matb33
If you're open to using the .docx format, you could use this PHP script that I put together that will extract the XML, run some XSL transformations and output a pretty decent Markdown equivalent:
https://github.com/matb33/docx2md
Note that it is meant to work from the command-line, and is rather basic in its interface. However, it will get the job done!
If the script doesn't work well enough for you, I encourage you to send me your .docx files so I can reproduce your problem and fix it. Log an issue in GitHub or contact me directly if you prefer.
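The "extract the XML" step works because a .docx file is a ZIP archive whose main body lives in `word/document.xml`. The sketch below illustrates that idea in Python. It is not docx2md: it walks the XML with ElementTree instead of running XSL transformations, and the function names are hypothetical. It handles only paragraphs with bold and italic runs, and it treats any `w:b` or `w:i` element as "on", ignoring an explicit `w:val="false"`.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def run_to_markdown(run):
    """Return the text of one w:r run, wrapped in emphasis markers."""
    text = "".join(t.text or "" for t in run.iter(W + "t"))
    props = run.find(W + "rPr")
    if not text.strip() or props is None:
        return text
    if props.find(W + "i") is not None:
        text = f"_{text}_"
    if props.find(W + "b") is not None:
        text = f"**{text}**"
    return text


def docx_to_markdown(docx):
    """docx: a path or a binary file object holding a .docx archive."""
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        line = "".join(run_to_markdown(r) for r in para.iter(W + "r"))
        if line.strip():
            paragraphs.append(line)
    # Blank line between paragraphs, as Markdown expects.
    return "\n\n".join(paragraphs)
```

Headings, lists, tables, and images need more mapping rules than this, which is where a dedicated stylesheet or converter earns its keep.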
Answered by Mike Eng
Answered by janpaul123
We had the same problem of having to convert Word documents to markdown. Some were more complicated and (very) large documents, with math equations and images and such. So I made this script which converts using a number of different tools: https://github.com/Versal/word2markdown
Because it uses a chain of several tools it is a bit more error-prone, but it can be a good starting point if you have more complicated documents. Hope it can be helpful! :)
Update: It currently only works on Mac OS X, and you need to have some requirements installed (Word, Pandoc, HTML Tidy, git, node/npm). For it to work properly, you also need to open an empty Word document, and do: File->Save As Webpage->Compatibility->Encoding->UTF-8. Then this encoding is saved as default. See the README for more details on how to set up.
Then run this in the console:
$ git clone https://github.com/Versal/word2markdown
$ cd word2markdown
$ npm install
(copy over the Word files, for example, "document.docx")
$ ./doc-to-md.sh document.docx document_files > document.md
Then you can find the Markdown in document.md and images in the directory document_files.
It's perhaps a bit complicated now, so I would welcome any contributions that make this easier or make this work on other operating systems! :)
Answered by user626528
Have you tried this one? Not sure about feature richness, but it works for simple texts. http://markitdown.medusis.com/
Answered by Valentin
As part of a university Ruby course I developed a tool which can convert OpenOffice word-processor files (.odt) to Markdown. A lot of assumptions have to be made in order to produce correct formatting. For example, it is hard to determine the size of text that has to be considered a heading. However, the only thing you can lose with this conversion is formatting: any text that is encountered is always appended to the Markdown document. The tool I've developed supports lists, bold and italic text, and it has syntax for tables.
http://github.com/bostko/doc2text
Give it a try and please give me your feedback.
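One way to picture the heading-size assumption is a simple heuristic: treat the most common font size as body text and rank every larger size as a heading level. The Python sketch below is only an illustration of that idea, not how doc2text works. The function name and the input format (one font size in points per paragraph) are hypothetical.

```python
from collections import Counter


def heading_levels(font_sizes, max_level=6):
    """Map each paragraph's font size to a heading level (0 = body text)."""
    if not font_sizes:
        return []
    # Assumption: the most frequent size is the body text size.
    body_size = Counter(font_sizes).most_common(1)[0][0]
    # Larger sizes become headings: the largest is level 1, and so on.
    larger = sorted({s for s in font_sizes if s > body_size}, reverse=True)
    level_of = {size: min(rank, max_level) for rank, size in enumerate(larger, 1)}
    return [level_of.get(size, 0) for size in font_sizes]


sizes = [24, 12, 12, 16, 12, 12, 16, 12]
print(heading_levels(sizes))
```

A heuristic like this breaks down on documents that use bold body-size text as headings, which is one reason such converters have to make further assumptions.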

