Java: Apache POI: Can I get clean text from MS Word (.doc) files?

Question

Asked by XenoRo

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.

When using the following code:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());

the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.

The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
  System.out.println(paragraph);
}

I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?

If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?

Answer 1

Accepted answer by Gagravarr

There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).

The first option is to use WordExtractor, but pass each paragraph it returns through its static stripFields(String) method. That removes the text-based fields embedded in the text, such as the HYPERLINK entries you've seen. Your code would become:

// 'file' is the java.io.File pointing at your .doc
NPOIFSFileSystem fs = new NPOIFSFileSystem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());

for (String rawText : extractor.getParagraphText()) {
    // stripFields is static, so call it on the class
    String text = WordExtractor.stripFields(rawText);
    System.out.println(text);
}
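
To answer the "write a filter of my own" part of the question: if you can't use stripFields for some reason, here is a rough sketch of what such a filter could look like. It is not POI's implementation. It assumes the usual Word field markers in the extracted text (0x13 begins a field, 0x14 separates the field instruction from its displayed result, 0x15 ends the field), and it also drops other control characters, which is my own guess at what the "boxes" are. The class and method names are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FieldFilter {
    // Field marker characters assumed to appear in text extracted from .doc files.
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    // Hypothetical helper: keeps field results, drops field instructions and control characters.
    static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        // One entry per open field: true while we are still inside its instruction part.
        Deque<Boolean> open = new ArrayDeque<>();
        int hidden = 0; // how many open fields are currently in their instruction part

        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_BEGIN) {
                open.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from "instruction" to "result".
                if (!open.isEmpty() && open.peek()) {
                    open.pop();
                    open.push(Boolean.FALSE);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                // A field without a separator ends while still in its instruction part.
                if (!open.isEmpty() && open.pop()) {
                    hidden--;
                }
            } else if (hidden == 0 && (c >= 0x20 || c == '\n' || c == '\r' || c == '\t')) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See " + FIELD_BEGIN + " HYPERLINK \\l \"_Toc123\" " + FIELD_SEPARATOR
                + "Chapter 1" + FIELD_END + " for details.";
        System.out.println(stripFieldCodes(raw)); // prints "See Chapter 1 for details."
    }
}
```

You would call stripFieldCodes on each string returned by getParagraphText(). In practice the built-in method is the safer choice, since it is maintained along with the extractor.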

The other option is to use Apache Tika. Tika provides text extraction and metadata for a wide variety of file types, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text from your Word document (you can also get XHTML if you'd rather), you'd do something like:

TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
// parse the stream opened above; the extracted text accumulates in the handler
tika.getParser().parse(stream, handler, metadata, new ParseContext());
String text = handler.toString();

Answer 2

Answered by Vyas

This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:

/*
 * This class is used to read .doc and .docx files
 * 
 * @author Developer
 *
 */

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.net.URL; 
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

class TextExtractor { 
    private OutputStream outputstream;
    private ParseContext context;
    private Detector detector;
    private Parser parser;
    private Metadata metadata;
    private String extractedText;

    public TextExtractor() {
        context = new ParseContext();
        detector = new DefaultDetector();
        parser = new AutoDetectParser(detector);
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
    }

    public void process(String filename) throws Exception {
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            url = file.toURI().toURL();
        } else {
            url = new URL(filename);
        }
        InputStream input = TikaInputStream.get(url, metadata);
        ContentHandler handler = new BodyContentHandler(outputstream);
        parser.parse(input, handler, metadata, context); 
        input.close();
    }

    public void getString() {
        //Get the text into a String object
        extractedText = outputstream.toString();
        //Do whatever you want with this String object.
        System.out.println(extractedText);
    }

    public static void main(String args[]) throws Exception {
        if (args.length == 1) {
            TextExtractor textExtractor = new TextExtractor();
            textExtractor.process(args[0]);
            textExtractor.getString();
        } else { 
            throw new Exception("Usage: java TextExtractor <file-or-URL>");
        }
    }
}

To compile:

javac -cp ".:tika-app-1.2.jar" TextExtractor.java

To run:

java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc

Answer 3

Answered by Steven Bellens

Try this; it works for me and is purely a POI solution. Note that it uses XWPFDocument, which reads .docx files (Word 2007 and later). For a binary .doc file (Word 97-2003) you will have to use the HWPFDocument counterpart instead.

// read the file
InputStream inputstream = new FileInputStream(m_filepath);
// load it as an XWPF (.docx) document
XWPFDocument adoc = new XWPFDocument(inputstream);

// get the full text
String aString = new XWPFWordExtractor(adoc).getText();

Now, if you only want certain parts, don't go through the text extractor; call getParagraphText() directly on each paragraph, like this:

for (XWPFParagraph p : adoc.getParagraphs()) 
{ 
    System.out.println(p.getParagraphText());
}
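
If you aren't sure which of the two formats a given file is, one rough way to decide is to look at its first few bytes: binary .doc files are OLE2 compound documents and start with the signature D0 CF 11 E0 A1 B1 1A E1, while .docx files are ZIP packages and start with "PK" followed by 03 04. The sketch below uses only the JDK, and the class and method names are my own, not part of POI. Keep in mind that other legacy Office files (.xls, .ppt) share the OLE2 signature and any ZIP archive shares the ZIP one, so this only tells you which document class is worth trying.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class WordFormatSniffer {
    // OLE2 compound document signature (binary .doc and other legacy Office files).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature (.docx is a ZIP package).
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    // Returns "HWPF", "XWPF" or "UNKNOWN" depending on the leading bytes.
    static String guessFormat(byte[] header) {
        if (startsWith(header, OLE2)) {
            return "HWPF";
        }
        if (startsWith(header, ZIP)) {
            return "XWPF";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java WordFormatSniffer <file>");
            return;
        }
        byte[] header = new byte[8];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            n = in.read(header); // a short read just yields a shorter header
        }
        System.out.println(guessFormat(Arrays.copyOf(header, Math.max(n, 0))));
    }
}
```

With the result in hand you can branch to HWPFDocument/WordExtractor for "HWPF" and XWPFDocument/XWPFWordExtractor for "XWPF".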
