Java Apache POI read Word (.doc) file and get named styles used

Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/12753233/

Time: 2020-10-31 10:04:19  Source: igfitidea

Tags: java, ms-word, apache-poi

Asked by Ashaelon

I am trying to read a Microsoft Word 2003 Document (.doc) using poi-scratchpad-3.8 (HWPF). I need to either read the file word by word, or character by character. Either way is fine for what I need. Once I have read either a character or word, I need to get the style name that is applied to the word/character. So, the question is, how do I get the style name used for a word or character when reading the .doc file?

EDIT

I am adding the code that I used to attempt this. If anyone wants to attempt this, good luck.

import java.io.FileInputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.StyleDescription;
import org.apache.poi.hwpf.model.StyleSheet;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

private void processDoc(String path) throws Exception {
    System.out.println(path);
    HWPFDocument wdDoc;
    // the stream is fully read while the document is built, so it can be closed right after
    try (FileInputStream in = new FileInputStream(path)) {
        wdDoc = new HWPFDocument(new POIFSFileSystem(in));
    }
    StyleSheet styles = wdDoc.getStyleSheet();

    // list all style names and indexes in stylesheet
    for (int j = 0; j < styles.numStyles(); j++) {
        StyleDescription description = styles.getStyleDescription(j);
        if (description != null) {
            System.out.println(j + ": " + description.getName());
        } else {
            // getStyleDescription returned null
            System.out.println(j + ": " + null);
        }
    }

    // set range for entire document
    Range range = wdDoc.getRange();

    // loop through all paragraphs in range
    for (int i = 0; i < range.numParagraphs(); i++) {
        Paragraph p = range.getParagraph(i);

        // check that the style index is within the number of styles
        if (styles.numStyles() > p.getStyleIndex()) {
            System.out.println(styles.numStyles() + " -> " + p.getStyleIndex());
            StyleDescription style = styles.getStyleDescription(p.getStyleIndex());
            // the listing loop above shows this lookup can return null
            String styleName = (style != null) ? style.getName() : null;
            // write style name and associated text
            System.out.println(styleName + " -> " + p.text());
        } else {
            System.out.println("\n" + styles.numStyles() + " ----> " + p.getStyleIndex());
        }
    }
}

Answered by Gagravarr

I would suggest that you take a look at the source code of WordExtractor from Apache Tika, as it's a great example of getting text and styling from a Word document using Apache POI.

Based on what you did and didn't say in your question, I suspect you're looking for something a little like this:

    // "document" is assumed to be an already-opened HWPFDocument
    Range r = document.getRange();
    StyleSheet styles = document.getStyleSheet();
    for (int i = 0; i < r.numParagraphs(); i++) {
       Paragraph p = r.getParagraph(i);
       String text = p.text();
       if (!text.contains("What I'm Looking For")) {
          // Try the next paragraph
          continue;
       }

       if (styles.numStyles() > p.getStyleIndex()) {
          StyleDescription style =
               styles.getStyleDescription(p.getStyleIndex());
          // the lookup can return null, so guard before asking for the name
          String styleName = (style != null) ? style.getName() : null;
          System.out.println(styleName + " -> " + text);
       }
       else {
          // Text has an unknown or invalid style
       }
    }
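The loop above reports one style per paragraph, while the question asks for a style per word or character. A minimal sketch of that finer-grained step, in plain Java with no POI dependency: `StyledRun`, `StyledWord`, `styleNameAt` and `wordsWithStyles` are hypothetical names introduced here for illustration. With HWPF the runs would presumably come from `p.numCharacterRuns()` and `p.getCharacterRun(j)`, using each run's `text()` and `getStyleIndex()`; treat that mapping as an assumption to check against your POI version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: none of these types are part of Apache POI.
final class RunStyleSketch {

    /** Hypothetical stand-in for one character run: its text plus its style index. */
    static final class StyledRun {
        final String text;
        final int styleIndex;
        StyledRun(String text, int styleIndex) {
            this.text = text;
            this.styleIndex = styleIndex;
        }
    }

    /** One word paired with the style name of the run it came from. */
    static final class StyledWord {
        final String word;
        final String styleName;
        StyledWord(String word, String styleName) {
            this.word = word;
            this.styleName = styleName;
        }
    }

    /** Bounds-checked, null-safe lookup, mirroring the numStyles() check above. */
    static String styleNameAt(List<String> styleNames, int index) {
        if (index < 0 || index >= styleNames.size()) {
            return "(invalid style index " + index + ")";
        }
        String name = styleNames.get(index);
        return name != null ? name : "(unnamed style " + index + ")";
    }

    /** Splits every run on whitespace; each word inherits the style name of its run. */
    static List<StyledWord> wordsWithStyles(List<StyledRun> runs, List<String> styleNames) {
        List<StyledWord> out = new ArrayList<>();
        for (StyledRun run : runs) {
            String styleName = styleNameAt(styleNames, run.styleIndex);
            for (String word : run.text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new StyledWord(word, styleName));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a stylesheet and two runs
        List<String> styleNames = Arrays.asList("Normal", "Heading 1", null);
        List<StyledRun> runs = Arrays.asList(
                new StyledRun("Hello world ", 1),
                new StyledRun("again", 5));
        for (StyledWord w : wordsWithStyles(runs, styleNames)) {
            System.out.println(w.styleName + " -> " + w.word);
        }
    }
}
```

A word that straddles two runs is reported as two pieces here, one per run. If that matters, iterate character by character instead and group by style name.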

For anything more advanced, take a look at the WordExtractor source code and see what else you can do with this sort of thing!