Linux: convert doc to txt via the command line

Notice: this page is derived from a Chinese-English translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/6510272/

Date: 2020-08-05 04:48:24  Source: igfitidea

Convert doc to txt via commandline

linux ms-word doc

Asked by user698601

We're looking for a program that converts a doc or docx document to a txt file. We're working on Linux and want to start a website that converts user-uploaded doc files. We don't want to use OpenOffice/LibreOffice because we have had bad experiences with them. Pandoc can't handle doc files :/

Does anyone have an idea?

Answered by Paul Sanwald

Here is a Perl project which claims to do it. I have also done a lot of this by hand, using XSLT on document.xml. A docx file is just a zip archive, so you can unzip it and inspect the elements. This is not hard to do for specific files, but it is very hard in the general case, because of the lack of documentation on how Word stores things internally and the variance in internal representation.
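
As a rough illustration of the "unzip it and inspect the elements" approach, here is a minimal sketch. It assumes the main body is stored at word/document.xml inside the archive, which is the usual layout for files saved by Word. The function names strip_wordml and docx_to_text are made up for this example. It does not decode XML entities such as &amp;, and it ignores headers, footnotes and table structure, so treat it as a starting point rather than a general converter.

```shell
# strip_wordml: hypothetical helper. Reads WordprocessingML on stdin and
# writes rough plain text: a line break after each paragraph, all tags dropped.
strip_wordml() {
  # First expression: turn every paragraph end tag into a newline
  # (backslash-newline is the portable way to write a newline in sed).
  # Second expression: delete every remaining XML tag.
  sed -e 's#</w:p>#\
#g' -e 's#<[^>]*>##g'
}

# docx_to_text: hypothetical helper. Streams the main document part out of
# the zip archive (assumed path: word/document.xml) and strips the markup.
docx_to_text() {
  unzip -p "$1" word/document.xml | strip_wordml
}
```

Usage would be something like `docx_to_text foo.docx > foo.txt`, assuming unzip is installed.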

Answered by harlandski

You will have to use two different command-line tools, depending on whether you are working with the .doc or the .docx format.

For .doc, use catdoc:

catdoc foo.doc > foo.txt

For .docx, use docx2txt:

docx2txt foo.docx

The latter writes a file called foo.txt to the same directory as the original.
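
For a site that accepts both formats, the two commands above could be wrapped in a small dispatcher. This is only a sketch: converter_for and to_txt are hypothetical names, the match on the file extension is case-sensitive, and an extension on an uploaded file is not proof of its real format.

```shell
# converter_for: hypothetical helper. Prints the name of the tool to use
# for a given file name, or fails for anything that is not .doc/.docx.
converter_for() {
  case "$1" in
    *.docx) echo docx2txt ;;
    *.doc)  echo catdoc ;;
    *)      echo "unsupported file type: $1" >&2; return 1 ;;
  esac
}

# to_txt: hypothetical wrapper. Produces foo.txt next to foo.doc or foo.docx.
to_txt() {
  tool=$(converter_for "$1") || return 1
  case "$tool" in
    catdoc)   catdoc "$1" > "${1%.doc}.txt" ;;  # catdoc writes to stdout
    docx2txt) docx2txt "$1" ;;                  # docx2txt creates foo.txt itself
  esac
}
```

With both tools installed, `to_txt foo.doc` and `to_txt foo.docx` should each leave a foo.txt beside the input file.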

I'm not sure which Linux distribution you are using, but both catdoc and docx2txt are available from, for example, the Ubuntu repositories:

apt-get install docx2txt

Or, on a Mac, with Homebrew:

brew install docx2txt

Answered by Mishari

For doc files you can use antiword; it is available through Homebrew and in the Ubuntu repositories.
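
Like catdoc, antiword prints the extracted text to standard output, so the basic call is `antiword foo.doc > foo.txt`. For a whole directory of uploads, a loop along these lines might do. The function name convert_dir and its optional second argument (a different converter that also writes to stdout) are assumptions made for this sketch.

```shell
# convert_dir: hypothetical helper. Converts every .doc file in a directory
# to a .txt file beside it, using a converter that writes to stdout.
#   $1 = directory, $2 = converter command (defaults to antiword)
convert_dir() {
  dir=$1
  tool=${2:-antiword}
  for f in "$dir"/*.doc; do
    [ -e "$f" ] || continue          # skip the literal pattern if nothing matched
    "$tool" "$f" > "${f%.doc}.txt"   # foo.doc -> foo.txt
  done
}
```

For example, `convert_dir /path/to/uploads` would use antiword, and `convert_dir /path/to/uploads catdoc` would use catdoc instead.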