Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/1110409/
Convert Word doc or docx files into text files?
Asked by CheeseConQueso
I need a way to convert .doc or .docx files to .txt without installing anything. I also don't want to have to open Word manually to do this, obviously. It just needs to run automatically.
I was thinking that either Perl or VBA could do the trick, but I can't find anything online for either.
Any suggestions?
Accepted answer by Sinan Ünür
Note that an excellent source of information for Microsoft Office applications is the Object Browser. You can access it via Tools → Macro → Visual Basic Editor. Once you are in the editor, hit F2 to browse the interfaces, methods, and properties provided by Microsoft Office applications.
Here is an example using Win32::OLE:
#!/usr/bin/perl

use strict;
use warnings;

use File::Spec::Functions qw( catfile );

use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';
$Win32::OLE::Warn = 3;    # die on any OLE error

my $word = get_word();
$word->{Visible} = 0;     # keep the Word window hidden

my $doc = $word->{Documents}->Open(catfile $ENV{TEMP}, 'test.docx');

# Save a plain-text copy, then close the document without saving changes
$doc->SaveAs(
    catfile($ENV{TEMP}, 'test.txt'),
    wdFormatTextLineBreaks
);

$doc->Close(0);

# Reuse a running Word instance if there is one, otherwise start a new one
sub get_word {
    my $word;
    eval {
        $word = Win32::OLE->GetActiveObject('Word.Application');
    };
    die "$@\n" if $@;

    unless(defined $word) {
        $word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit })
            or die "Oops, cannot start Word: ",
                Win32::OLE->LastError, "\n";
    }
    return $word;
}

__END__
Answer by jeje
A simple Perl only solution for docx:
Use Archive::Zip to get the word/document.xml file from your docx file. (A docx is just a zipped archive.)
Use XML::LibXML to parse it.
Then use XML::LibXSLT to transform it into text or HTML format. Search the web to find a nice docx2txt.xsl file :)
Cheers !
J.
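The same unzip-then-parse idea can be sketched in Python using only the standard library instead of the Perl modules named above. This is a rough illustration, not the answer's own code: the function name docx_to_text is made up here, and it assumes the main body lives in word/document.xml, with paragraphs in w:p elements and text runs in w:t elements. Headers, footnotes and nested text boxes are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the plain text of a .docx given a path or a file-like object.

    Hypothetical helper for illustration; it only reads the main body part.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each w:t element holds one run of visible text.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Writing the result to a file is then just open("out.txt", "w", encoding="utf-8").write(docx_to_text("input.docx")), where both file names are placeholders.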
Answer by Nick A Miller
For .doc, I've had some success with the Linux command line tool antiword. It extracts the text from .doc very quickly, giving a good rendering of indentation. Then you can pipe that to a text file in bash.
For .docx, I've used the OOXML SDK as some other users mentioned. It is just a .NET library to make it easier to work with the OOXML that is zipped up in an OOXML file. There is a lot of metadata that you will want to discard if you are only interested in the text. I see that some other people have already written the code: DocXToText.
I have also found that Aspose.Words has a very simple API with great support.
There is also this bash command from commandlinefu.com which works by unzipping the .docx:
unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
Answer by Ether
If you have some flavour of Unix installed, you can use the 'strings' utility to find and extract all readable strings from the document. There will be some mess before and after the text you are looking for, but the results will be readable.
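For anyone without a Unix box at hand, here is a rough Python imitation of what 'strings' does. It is only a sketch under assumptions: extract_strings is a name invented for this example, and it keeps runs of at least four printable ASCII bytes, which approximates the tool's default behaviour rather than reproducing it exactly. Text that Word stored as 16-bit characters would be missed.

```python
import re


def extract_strings(data, min_len=4):
    """Return runs of printable ASCII bytes at least min_len long.

    Hypothetical helper: a crude stand-in for the Unix `strings` tool.
    """
    # \x20-\x7e covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]
```

Typical use would be to read the file in binary mode, e.g. extract_strings(open("input.doc", "rb").read()), and join the pieces with newlines. Expect the same clutter before and after the real text that the answer warns about.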
Answer by Jim
I strongly recommend Aspose.Words if you can do Java or .NET. It can convert, without Word installed, between all major text file types.
Answer by vladr
Note that you can also use OpenOffice to perform miscellaneous document, drawing, spreadsheet etc. conversions on both Windows and *nix platforms.
You can access OpenOffice programmatically (in a way analogous to COM on Windows) via UNO from a variety of languages for which a UNO binding exists, including from Perl via the OpenOffice::UNO module.
On the OpenOffice::UNO page you will also find a sample Perl scriptlet which opens a document. All you then need to do is export it to txt by using the document.storeToURL() method. See a Python example, which can be easily adapted to your Perl needs.
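A rough idea of what the storeToURL() route might look like from Python follows. Treat it as a sketch under several assumptions, not as the linked example: it assumes an office instance is already running and listening on a UNO socket (host and port 2002 are placeholders), that the Python interpreter bundled with the office suite provides the uno module, and that "Text" is the export filter you want. The helper names to_file_url and convert_to_txt are invented for this illustration.

```python
from pathlib import Path


def to_file_url(path):
    """Convert a filesystem path to the file:// URL form that UNO expects."""
    return Path(path).resolve().as_uri()


def convert_to_txt(src, dst, host="localhost", port=2002):
    """Load src in a running office instance and export it as plain text."""
    # Imported lazily: the `uno` module ships with the office suite's Python.
    import uno
    from com.sun.star.beans import PropertyValue

    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    # Assumes the office was started with a matching socket "accept" option.
    ctx = resolver.resolve(
        f"uno:socket,host={host},port={port};urp;StarOffice.ComponentContext")
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    hidden = PropertyValue()
    hidden.Name, hidden.Value = "Hidden", True
    document = desktop.loadComponentFromURL(
        to_file_url(src), "_blank", 0, (hidden,))
    try:
        text_filter = PropertyValue()
        text_filter.Name, text_filter.Value = "FilterName", "Text"
        document.storeToURL(to_file_url(dst), (text_filter,))
    finally:
        document.close(True)
```

The office side has to be started separately with its socket listener enabled before convert_to_txt("input.docx", "output.txt") can connect; both file names are placeholders.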
Answer by AlbertoPL
.doc files that use WordprocessingML, and .docx files with their XML format, can have their XML parsed to retrieve the actual text of the document. You'll have to read their specifications to figure out which tags contain readable text.
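Before digging through the specifications, a quick survey of which tags actually hold text can narrow things down. The Python sketch below is only an illustration: text_bearing_tags is a made-up name, it strips namespaces down to the local tag name, and it counts only text stored directly inside an element (not text trailing after a child element).

```python
from collections import Counter
import xml.etree.ElementTree as ET


def text_bearing_tags(xml_source):
    """Count characters of non-whitespace text held directly by each tag.

    Hypothetical helper for exploring an unfamiliar document XML.
    """
    counts = Counter()
    for elem in ET.fromstring(xml_source).iter():
        text = (elem.text or "").strip()
        if text:
            # ElementTree names namespaced tags "{uri}local"; keep the local part.
            local = elem.tag.rsplit("}", 1)[-1]
            counts[local] += len(text)
    return counts
```

Running it on the XML of a sample document and looking at the largest counts gives a short list of candidate tags to check against the specification.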
Answer by Jean-Francois T.
The method of Sinan Ünür works well.
However, I got some crashes with the files I was converting.
Another method is to use Win32::OLE and Win32::Clipboard as follows:
- Open the Word document
- Select all the text
- Copy it to the clipboard
- Write the content of the clipboard to a txt file
- Empty the clipboard and close the Word document
Based on the script given by Sigvald Refsu in http://computer-programming-forum.com/53-perl/c44063de8613483b.htm, I came up with the following script.
Note: I chose to save the txt file with the same basename as the .docx file and in the same folder, but this can easily be changed.
###########################################
use strict;
use File::Spec::Functions qw( catfile );
use FindBin '$Bin';
use Win32::OLE qw(in with);
use Win32::OLE::Const 'Microsoft Word';
use Win32::Clipboard;

my $monitor_word = 0;    # set to 1 to watch MS Word being opened and closed

sub docx2txt {
    # Note: the path shall be in the form "C:\dir\ with\ space\file.docx"
    my $docx_file = shift;

    # MS Word object
    my $Word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Couldn't run Word";

    # Monitor what happens in MS Word
    $Word->{Visible} = 1 if $monitor_word;

    # Open file
    my $Doc = $Word->Documents->Open($docx_file);
    with ($Doc, ShowRevisions => 0);    # turn off revision marks

    # Select the complete document
    $Doc->Select();
    my $Range = $Word->Selection();
    with ($Range, ExtendMode => 1);
    $Range->SelectAll();

    # Copy selection to clipboard
    $Range->Copy();

    # Create txt file (same basename and folder as the .docx file)
    my $txt_file = $docx_file;
    $txt_file =~ s/\.docx$/.txt/;
    open(my $txt_fh, '>', $txt_file)
        or die "Error while trying to write to $txt_file ($!)";
    print {$txt_fh} Win32::Clipboard::Get(), "\n";
    close $txt_fh;

    # Empty the clipboard (to prevent warning about "huge amount of data in clipboard")
    Win32::Clipboard::Set("");

    # Close Word file without saving
    $Doc->Close({SaveChanges => wdDoNotSaveChanges});

    # Disconnect OLE
    undef $Word;
}
Hope it helps.
Answer by edi9999
With docxtemplater, you can easily get the full text of a Word document (works with docx only).
Here's the code (Node.js):
var DocxTemplater = require('docxtemplater');
var doc = new DocxTemplater().loadFromFile("input.docx");
var result = doc.getFullText();
This is just three lines of code and doesn't depend on any Word instance (all plain JS).
Answer by fortran
I need a way to convert .doc or .docx extensions to .txt without installing anything
for I in *.doc *.docx; do mv "$I" "$(echo "$I" | sed -E 's/\.docx?$/.txt/')"; done
Just joking.
You could use antiword for the older versions of Word documents, and try to parse the XML of the new ones.

