Reading property sets from Office 2007+ documents with Java POI

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/18635107/

Retrieved: 2020-08-12 09:42:44. Source: igfitidea

Tags: java, excel, apache-poi

Asked by Ionescu Alexandru

I tried to read property sets from Office 2007+ documents (docx, xlsx) and found a solution at http://poi.apache.org/hpsf/how-to.html. The example there targets the Office 2003 and earlier formats (doc, xls, without the "x"):

    public class ReadSummaryInformation {
        public static void main(final String[] args) throws IOException {
            final String filename = "C://file.docx";
            POIFSReader r = new POIFSReader();
            r.registerListener(new MyPOIFSReaderListener(),
                               "\005SummaryInformation");
            r.read(new FileInputStream(filename));
        }

        static class MyPOIFSReaderListener implements POIFSReaderListener {
            public void processPOIFSReaderEvent(final POIFSReaderEvent event) {
                SummaryInformation si = null;
                try {
                    si = (SummaryInformation) PropertySetFactory.create(event.getStream());
                } catch (Exception ex) {
                    throw new RuntimeException("Property set stream \""
                            + event.getPath() + event.getName() + "\": " + ex);
                }
                final String title = si.getTitle();
                if (title != null)
                    System.out.println("Title: \"" + title + "\"");
                else
                    System.out.println("Document has no title.");
            }
        }
    }

I tried to open docx and xlsx files with this code (that is, I tried to read "\005SummaryInformation" from them), and guess what? I got this exception:

    Exception in thread "main" org.apache.poi.poifs.filesystem.OfficeXmlFileException:
    The supplied data appears to be in the Office 2007+ XML. You are calling the part
    of POI that deals with OLE2 Office Documents. You need to call a different part of
    POI to process this data (eg XSSF instead of HSSF)

http://poi.apache.org/ states loud and clear that:

"Office OpenXML Format is the new standards based XML file format found in Microsoft Office 2007 and 2008. This includes XLSX, DOCX and PPTX. The project provides a low level API to support the Open Packaging Conventions using openxml4j."

Then I went through POI's API and found that HPSF has PropertySet, the class that actually accesses the metadata I want, but XSSF has no equivalent. That is just one of the explanations I found for the exception.

My question is: can I read this marvelous "\005SummaryInformation" from Office 2007+ files with POI? I have a strong feeling that the authors of the source code left the API structure up in the air and started a new one when the Office 2007 format came out.

Thank you in advance!

I tried the POIXMLProperties approach, but I got an exception:

    try {
       OPCPackage pkg = OPCPackage.open(new FileInputStream(new File("D:\\file.docx")));
       POIXMLProperties props;
       props = new POIXMLProperties(pkg);
       System.out.println("The title is " + props.getCoreProperties().getTitle());
    } catch (Exception e) {
       // TODO Auto-generated catch block
       e.printStackTrace();
    }

    Exception in thread "main" java.lang.NoClassDefFoundError: org/dom4j/DocumentException
           at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:154)
           at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:141)
           at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54)
           at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
           at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:267)
           at ReadSummaryInformation.main(ReadSummaryInformation.java:38)
    Caused by: java.lang.ClassNotFoundException: org.dom4j.DocumentException
           at java.net.URLClassLoader$1.run(Unknown Source)
           at java.security.AccessController.doPrivileged(Native Method)
           at java.net.URLClassLoader.findClass(Unknown Source)
           at java.lang.ClassLoader.loadClass(Unknown Source)
           at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
           at java.lang.ClassLoader.loadClass(Unknown Source)
           ... 6 more

My classpath looks like this:

    .;C:\Program Files (x86)\Java\jre6\lib\ext\QTJava.zip;D:\kituri\Java\JDBC
    driver\mysql-connector-java-5.1.22\mysql-connector-java-5.1.22-bin.jar;%JAVA_HOME%
    \lib;%XMLBEANS_HOME%\lib\xbean.jar;D:\work\Workspace\document_archive01-2212
    \src\RunClass.java;D:\work\Workspace\document_archive01-2212\poi-3.9\ooxml-
    lib\dom4j-1.6.1.jar

And my path looks like this:

    C:\oraclexe\app\oracle\product.2.0\server\bin;;C:\Oracle11g\product.2.0\dbhome_1
    \bin;%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;%SYSTEMROOT%
    \System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\ATI Technologies\ATI.ACE
    \Core-Static;C:\Program Files\WIDCOMM\Bluetooth Software\;C:\Program Files\WIDCOMM
    \Bluetooth Software\syswow64;C:\Program Files (x86)\QuickTime\QTSystem\;C:\Program
    Files (x86)\Java\apache-maven-3.0.4\bin;C:\Program Files (x86)\Java\jdk1.7.0_07\bin;D:
    \ChromeDriver;%XMLBEANS_HOME%\bin

These jars are imported in the project and set on the build path:

  • poi-3.9-20121203.jar
  • xbean.jar
  • poi-ooxml-3.9-20121203.jar

I spent 4 days trying to find the problem (reimporting the libraries and setting the path variable), but I got dizzy, and I don't really have time to deal with a problem that doesn't seem to be clear at all. I even checked the integrity of the imported libraries (I made sure that the .class files are present in the jars).

Accepted answer by Gagravarr

The properties in an OOXML file are similar, but not quite identical, to their OLE2 cousins. So you can't use the HPSF SummaryInformation code directly, but there is something similar.

The class you'll want is POIXMLProperties, something like:

    OPCPackage pkg = OPCPackage.open(new File("file.xlsx"));
    POIXMLProperties props = new POIXMLProperties(pkg);
    System.out.println("The title is " + props.getCoreProperties().getTitle());

From POIXMLProperties you can get access to all the built-in properties, and the custom ones too!
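To make it clearer what those built-in properties are on disk, here is a minimal sketch that uses only the JDK, not POI. It assumes the core properties part sits at the conventional name docProps/core.xml and carries the title in a Dublin Core dc:title element. A robust reader (which is what POI does for you) resolves the part through the package relationships rather than a fixed name. The class CorePropsPeek and its methods are hypothetical names for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper, not part of Apache POI. */
final class CorePropsPeek {
    // Assumption: the conventional part name; real packages declare it in _rels/.rels.
    static final String CORE_PART = "docProps/core.xml";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the dc:title text, or null if the part or the element is absent. */
    static String readTitle(InputStream packageStream) {
        try (ZipInputStream zip = new ZipInputStream(packageStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!CORE_PART.equals(entry.getName())) continue;
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(drain(zip)));
                NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
                return titles.getLength() == 0 ? null : titles.item(0).getTextContent();
            }
            return null;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        } catch (ParserConfigurationException | SAXException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Copies the current ZIP entry into memory so the XML parser cannot close the archive.
    private static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        for (int n; (n = in.read(chunk)) != -1; ) buf.write(chunk, 0, n);
        return buf.toByteArray();
    }

    /** Builds a tiny in-memory ZIP holding only a core properties part (title is not XML-escaped). */
    static byte[] samplePackage(String title) {
        String xml = "<cp:coreProperties"
                + " xmlns:cp=\"http://schemas.openxmlformats.org/package/2006/metadata/core-properties\""
                + " xmlns:dc=\"" + DC_NS + "\"><dc:title>" + title + "</dc:title></cp:coreProperties>";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(CORE_PART));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Pass a .docx/.xlsx path, or run without arguments to use the in-memory sample.
        try (InputStream in = args.length > 0
                ? new FileInputStream(args[0])
                : new ByteArrayInputStream(samplePackage("Demo title"))) {
            System.out.println("Title: " + readTitle(in));
        }
    }
}
```

This is only meant to show why the HPSF stream name "\005SummaryInformation" has no counterpart here: the metadata is plain XML inside a ZIP container. For real code, POIXMLProperties remains the better choice.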


(Note that to work with OOXML files, you need some additional jars on your classpath. The Apache POI Components page has all the details.)
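A NoClassDefFoundError such as the one for org/dom4j/DocumentException in the question usually means one of those additional jars did not reach the running JVM. The following sketch, using only the JDK, prints the classpath the JVM actually received and reports which probe classes cannot be loaded. The class ClasspathCheck is a hypothetical helper, and the probe list is an assumption based on the stack trace and on POI 3.9 package names (newer POI versions moved POIXMLProperties to a different package), so adjust it for your setup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical diagnostic helper, not part of Apache POI. */
final class ClasspathCheck {
    // Assumed probe classes for a POI 3.9 OOXML setup; edit to match your version.
    static final List<String> PROBES = Arrays.asList(
            "org.apache.poi.POIXMLProperties",
            "org.apache.poi.openxml4j.opc.OPCPackage",
            "org.dom4j.DocumentException",
            "org.apache.xmlbeans.XmlObject");

    /** Returns the names from classNames that the current class loader cannot load. */
    static List<String> findMissing(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run static initializers.
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        List<String> missing = findMissing(PROBES);
        System.out.println(missing.isEmpty()
                ? "All probe classes were found."
                : "Missing classes: " + missing);
    }
}
```

Run it the same way you run the failing program (same IDE launch configuration or same command line). When launched from an IDE, the effective classpath generally comes from the project's build path, so it can differ from the CLASSPATH environment variable.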

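As for the OfficeXmlFileException in the question: it is raised because a ZIP-based OOXML file was handed to the OLE2 reader. If a program has to accept both old and new files, one option is to look at the first bytes before choosing between the HPSF route and the POIXMLProperties route. The sketch below is a JDK-only illustration with a hypothetical class name, ContainerSniffer. A ZIP signature only suggests OOXML, since any ZIP archive starts the same way.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper, not part of Apache POI. */
final class ContainerSniffer {
    enum Container { OLE2, ZIP, UNKNOWN }

    // OLE2 compound documents (doc, xls) start with this 8-byte signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP archives (docx, xlsx, pptx) start with a local file header: "PK" 0x03 0x04.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Classifies a file by its leading bytes; shorter input yields UNKNOWN. */
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Container.ZIP;
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        byte[] head = new byte[8];
        try (InputStream in = new FileInputStream(args[0])) {
            int n = in.read(head); // a single read is normally enough for 8 bytes of a local file
            System.out.println(args[0] + " looks like: "
                    + sniff(Arrays.copyOf(head, Math.max(n, 0))));
        }
    }
}
```

With a result of OLE2 the HPSF code from the question applies; with ZIP, the POIXMLProperties code from the answer is the one to try.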