Linux: convert doc to txt via the command line
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/6510272/
Convert doc to txt via commandline
Asked by user698601
We're looking for a program that converts a doc or docx document to a txt file. We're working on Linux and want to launch a website that converts user-uploaded doc files. We don't want to use OpenOffice/LibreOffice because we've had bad experiences with them. Pandoc can't handle doc files :/
Does anyone have an idea?
Answered by Paul Sanwald
Here is a Perl project which claims to do it. I have also done a lot of this by hand, using XSLT on document.xml. A docx file is just a zip file; you can unzip it and inspect the elements. This is not hard to do for specific files, but it is very hard to do in the general case, because of the lack of documentation for how Word stores things internally and the variance in internal representation.
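As a rough illustration of the unzip-and-inspect idea, the sketch below streams the main document part out of a docx archive and strips the markup with awk. The function names `docx_xml_to_text` and `docx_to_text` are made up for this example. It assumes `unzip` and `awk` are installed and that the body text lives in `word/document.xml` inside the archive. This is a crude regex filter, not an XML parser and not the XSLT approach mentioned in the answer: it ignores tables, headers and footnotes, and it does not decode entities such as `&amp;`.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Read WordprocessingML from stdin and print plain text:
# each closing paragraph tag (</w:p>) becomes a line break,
# then every remaining tag is dropped.
docx_xml_to_text() {
    awk '{
        gsub(/<\/w:p>/, "\n")   # end of paragraph -> newline
        gsub(/<[^>]*>/, "")     # strip all other tags
        print
    }'
}

# Stream word/document.xml out of the zip archive without unpacking to disk.
docx_to_text() {
    unzip -p "$1" word/document.xml | docx_xml_to_text
}
```

Usage would look like `docx_to_text foo.docx > foo.txt`. Running `unzip -l foo.docx` first lists the parts inside the archive, which helps when a particular file stores its text somewhere unexpected.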
Answered by harlandski
You will have to use two different command-line tools, depending on whether you are working with the .doc or the .docx format.
For .doc use catdoc:
catdoc foo.doc > foo.txt
For .docx use docx2txt:
docx2txt foo.docx
The latter will produce a file called foo.txt in the same directory as the original.
I'm not sure which Linux distribution you are using, but both catdoc and docx2txt are available from the Ubuntu repositories, for example:
apt-get install docx2txt
Or with Homebrew on Mac:
brew install docx2txt
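For a site that accepts both formats, a small wrapper could pick the converter from the file extension. The following is a minimal sketch, not a hardened implementation: the function name `to_txt` is hypothetical, and it assumes `catdoc` and `docx2txt` are on the PATH and behave as described in this answer (`catdoc` prints to standard output, `docx2txt` writes `foo.txt` next to `foo.docx`). The match is case-sensitive, and an extension is only a hint about the real format of an uploaded file.

```shell
#!/bin/sh
# Hypothetical wrapper: choose the converter by file extension.
to_txt() {
    case "$1" in
        *.docx)
            # docx2txt creates the .txt file next to the input by itself
            docx2txt "$1"
            ;;
        *.doc)
            # catdoc prints to stdout, so redirect into foo.txt
            catdoc "$1" > "${1%.doc}.txt"
            ;;
        *)
            echo "unsupported file type: $1" >&2
            return 1
            ;;
    esac
}
```

Calling `to_txt upload.doc` would then leave `upload.txt` beside the input, and the non-zero return value for other extensions lets the calling code reject the upload.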
Answered by Mishari
For doc files you may use antiword; it's available on Homebrew and Ubuntu.
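antiword prints the extracted text to standard output by default, so redirecting it yields a txt file. As a small sketch (the function name `doc_to_txt` is hypothetical), the helper below checks that antiword is installed before converting `foo.doc` to `foo.txt`:

```shell
#!/bin/sh
# Hypothetical helper: convert foo.doc to foo.txt with antiword.
doc_to_txt() {
    if ! command -v antiword >/dev/null 2>&1; then
        echo "antiword is not installed" >&2
        return 127
    fi
    # antiword writes the extracted text to standard output
    antiword "$1" > "${1%.doc}.txt"
}
```

A one-off conversion without the helper is simply `antiword foo.doc > foo.txt`.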