Java reading .doc file using POI
Notice: this page reproduces a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/19358643/
Asked by Rijo Joseph
Hi, I am trying to read text from .doc and .docx files. For .doc files I am doing this:
package test;

import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadFile {
    public static void main(String[] args) {
        File file = null;
        WordExtractor extractor = null;
        try {
            // Backslashes inside a Java string literal must be escaped.
            file = new File("C:\\Users\\rijo\\Downloads\\r.doc");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            String fileData = extractor.getText();
            System.out.println(fileData);
        } catch (Exception exep) {
            // Note: any Exception thrown above is silently swallowed here.
        }
    }
}
But this gives me an org/apache/poi/OldFileFormatException exception.
Any idea how to fix this?
Also, I need to read .docx and PDF files. Is there a good way to read all types of files?
Answer by SudoRahul
If you look at the javadocs of OldFileFormatException, you can see the reason for that:
Base class of all the exceptions that POI throws in the event that it's given a file that's older than currently supported.
This means that the r.doc you're using is not supported by HWPFDocument. Maybe it only supports the latest format (docx has also been around for quite a long time now). I'm not sure whether Apache POI supports the doc format in HWPFDocument.
Answer by Levenal
Using the following jars (in case version numbers are playing a role here):
dom4j-1.7-20060614
poi-3.9-20121203
poi-ooxml-3.9-20121203
poi-ooxml-schemas-3.9-20121203
poi-scratchpad-3.9-20121203
xmlbeans-2.4.0
I typed this up:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class SO {
    public static void main(String[] args) {
        // Alternate between the two to check what works.
        // Backslashes inside a Java string literal must be escaped.
        //String FilePath = "D:\\Users\\username\\Desktop\\Doc1.docx";
        String FilePath = "D:\\Users\\username\\Desktop\\Bob.doc";
        FileInputStream fis;

        if (FilePath.substring(FilePath.length() - 1).equals("x")) { // is a docx
            try {
                fis = new FileInputStream(new File(FilePath));
                // XWPF classes handle the XML-based .docx format.
                XWPFDocument doc = new XWPFDocument(fis);
                XWPFWordExtractor extract = new XWPFWordExtractor(doc);
                System.out.println(extract.getText());
            } catch (IOException e) {
                e.printStackTrace();
            }
        } else { // is not a docx
            try {
                fis = new FileInputStream(new File(FilePath));
                // HWPF classes handle the binary .doc format.
                HWPFDocument doc = new HWPFDocument(fis);
                WordExtractor extractor = new WordExtractor(doc);
                System.out.println(extractor.getText());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
This allowed me to read text from both a .docx and a .doc file. If this doesn't work on your PC, you may well have an issue with the external jars you are using.
Give it a go though :) Good luck!
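The code above picks the reader from the last character of the file name, which goes wrong if a file has been renamed or carries the wrong extension. As a rough sketch of an alternative, the file's leading bytes can be inspected instead: binary .doc files are stored in an OLE2 compound container, while .docx files are ZIP packages. The class name WordFormatSniffer, the enum Kind, and the method sniff are hypothetical names made up for this illustration and are not part of POI. The check uses only the JDK and has limits: it cannot tell whether an OLE2 file is a Word version that HWPFDocument accepts, nor whether a ZIP file is really a .docx.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class WordFormatSniffer {

    /** Hypothetical result type for this sketch; not a POI class. */
    public enum Kind { OLE2, ZIP, UNKNOWN }

    // First 8 bytes of an OLE2 compound file (container used by binary .doc).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // First 4 bytes of a ZIP local file header (.docx is a ZIP package).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Classifies a file by the bytes found at its start. */
    public static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    /** Reads up to 8 leading bytes of the file and classifies them. */
    public static Kind sniff(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (read < header.length
                    && (n = in.read(header, read, header.length - read)) > 0) {
                read += n;
            }
        }
        // Trim to the bytes actually read, so short files compare safely.
        return sniff(Arrays.copyOf(header, read));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WordFormatSniffer <file>");
            return;
        }
        System.out.println(sniff(Paths.get(args[0])));
    }
}
```

With something like this in place, the OLE2 result would lead to the HWPFDocument branch and the ZIP result to the XWPFDocument branch, with UNKNOWN reported to the user instead of being handed to either reader.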
Answer by Darius Miliauskas
I do not know why you are using WordExtractor just to get text from a .doc file. For me it was enough to use one method:
import org.apache.poi.hwpf.HWPFDocument;
...
File fin = new File(yourFilePath);
FileInputStream fis = new FileInputStream(fin);
HWPFDocument doc = new HWPFDocument(fis);
String text = doc.getDocumentText();
System.out.println(text);
...
To work with .pdf files, use another Apache library: PDFBox.