Java Apache POI: read a Word (.doc) file and get the named styles used

Question

Asked by Ashaelon

I am trying to read a Microsoft Word 2003 Document (.doc) using poi-scratchpad-3.8 (HWPF). I need to either read the file word by word, or character by character. Either way is fine for what I need. Once I have read either a character or word, I need to get the style name that is applied to the word/character. So, the question is, how do I get the style name used for a word or character when reading the .doc file?

EDIT

I am adding the code that I used to attempt this. If anyone wants to attempt this, good luck.

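import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.StyleDescription;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

private void processDoc(String path) throws Exception {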
    System.out.println(path);
    POIFSFileSystem fis = new POIFSFileSystem(new FileInputStream(path));
    HWPFDocument wdDoc = new HWPFDocument(fis);

    // list all style names and indexes in stylesheet
    for (int j = 0; j < wdDoc.getStyleSheet().numStyles(); j++) {
        if (wdDoc.getStyleSheet().getStyleDescription(j) != null) {
            System.out.println(j + ": " + wdDoc.getStyleSheet().getStyleDescription(j).getName());
        } else {
            // getStyleDescription returned null
            System.out.println(j + ": " + null);
        }
    }

    // set range for entire document
    Range range = wdDoc.getRange();

    // loop through all paragraphs in range
    for (int i = 0; i < range.numParagraphs(); i++) {
        Paragraph p = range.getParagraph(i);

        // only look up the style if its index is within the stylesheet
        if (wdDoc.getStyleSheet().numStyles() > p.getStyleIndex()) {
            System.out.println(wdDoc.getStyleSheet().numStyles() + " -> " + p.getStyleIndex());
            StyleDescription style = wdDoc.getStyleSheet().getStyleDescription(p.getStyleIndex());
            // getStyleDescription can return null, as the listing loop above already allows for
            String styleName = (style != null) ? style.getName() : null;
            // write style name and associated text
            System.out.println(styleName + " -> " + p.text());
        } else {
            System.out.println("\n" + wdDoc.getStyleSheet().numStyles() + " ----> " + p.getStyleIndex());
        }
    }
}

Answer 1

Answered by Gagravarr

I would suggest taking a look at the source code of WordExtractor in Apache Tika, as it's a great example of getting text and styling from a Word document using Apache POI.

Based on what you did and didn't say in your question, I suspect you're looking for something a little like this:

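    // "document" is an already-opened HWPFDocument
    Range r = document.getRange();
    for(int i=0; i<r.numParagraphs(); i++) {
       Paragraph p = r.getParagraph(i);
       // HWPF ranges expose their text via text(), as in the question's code
       String text = p.text();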
       if( ! text.contains("What I'm Looking For")) {
          // Try the next paragraph
          continue;
       }

       if (document.getStyleSheet().numStyles()>p.getStyleIndex()) {
          StyleDescription style =
               document.getStyleSheet().getStyleDescription(p.getStyleIndex());
          String styleName = style.getName();
          System.out.println(styleName + " -> " + text);
       }
       else {
          // Text has an unknown or invalid style
       }
    }

For anything more advanced, take a look at the WordExtractor source code and see what else you can do with this sort of thing!
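
The snippet above works per paragraph, while the question asks for the style of each word. One way to bridge that gap, once you have collected (text, style name) pairs for each paragraph or character run, is to split every piece on whitespace and carry the style name along. Below is a minimal sketch in plain Java with no POI calls; `StyledText`, `StyledWordSplitter` and `splitIntoWords` are hypothetical names that are not part of POI or Tika, and the sample strings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder: a piece of text plus the style name already resolved for it.
class StyledText {
    final String text;
    final String styleName;

    StyledText(String text, String styleName) {
        this.text = text;
        this.styleName = styleName;
    }

    @Override
    public String toString() {
        return styleName + " -> " + text;
    }
}

class StyledWordSplitter {
    // Splits each styled piece on whitespace; every word keeps the style name of its piece.
    static List<StyledText> splitIntoWords(List<StyledText> pieces) {
        List<StyledText> words = new ArrayList<>();
        for (StyledText piece : pieces) {
            // trim() also drops a trailing paragraph mark such as '\r'
            for (String word : piece.text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(new StyledText(word, piece.styleName));
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up input; in practice these pairs would come from the paragraph loop.
        List<StyledText> pieces = new ArrayList<>();
        pieces.add(new StyledText("Chapter One\r", "Heading 1"));
        pieces.add(new StyledText("It was a dark night.\r", "Normal"));
        for (StyledText word : splitIntoWords(pieces)) {
            System.out.println(word);
        }
    }
}
```

Feeding it the `styleName` and `text` values from the paragraph loop gives one entry per word. If the pieces are character runs rather than paragraphs, a word that straddles two runs would come out as two entries, so this is only a starting point.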
