Java: How to index a PDF file with Lucene
Note: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors on StackOverflow.
Original question: http://stackoverflow.com/questions/23762015/
Asked by JV_MI
I have to build a full-text search with Lucene in my project, so I need to index a BLOB column in a MySQL database that contains PDF, DOC, XSL, XML and image files. With DOC, XSL and XML I have no problems, but with PDF files I get no results.
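import java.io.File;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Indexfile {
    public static void main(String[] args) throws Exception {
        // RemoteControlServiceConnection is not a Lucene or JDBC class; presumably the asker's own helper
        RemoteControlServiceConnection a = new RemoteControlServiceConnection(
                "jdbc:mysql://localhost:3306/Test", "root", "root");
        Connection conn = a.getConnexionMySQL();
        final File INDEX_DIR = new File("index");
        // true = create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
        String query = "SELECT id, name ,document FROM Table_document";
        Statement statement = conn.createStatement();
        ResultSet result = statement.executeQuery(query);
        while (result.next()) {
            Document document = new Document();
            document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NO));
            document.add(new Field("name", result.getString("name"), Field.Store.YES, Field.Index.TOKENIZED));
            // The BLOB is read as a plain string here, whatever the file type
            document.add(new Field("document", result.getString("document"), Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(document);
        }
        writer.close();
    }
}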
For searching I use:
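import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocCollector;

public class searchlucene {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        String qu = "montbel*"; // put your keyword here
        // String IndexStoreDir = "index-directory";
        try {
            // Search the "document" field, which holds the file content
            Query q = new QueryParser("document", analyzer).parse(qu);
            int hitspp = 100; // hits per page
            IndexSearcher searcher = new IndexSearcher(IndexReader.open("index"));
            TopDocCollector collector = new TopDocCollector(hitspp);
            searcher.search(q, collector);
            ScoreDoc[] hits = collector.topDocs().scoreDocs;
            System.out.println("Found " + hits.length + " hits.");
            for (int i = 0; i < hits.length; ++i) {
                int docId = hits[i].doc;
                Document d = searcher.doc(docId);
                System.out.println((i + 1) + ". " + d.get("name"));
            }
            searcher.close();
        } catch (Exception ex1) {
            // Report the failure instead of silently swallowing it
            ex1.printStackTrace();
        }
    }
}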
Accepted answer by Salah
First, you need to convert the PDF file content to text, then add that text to the index.
For example, you can use PDFBox to convert the PDF content to text:
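String contents = "";
PDDocument doc = null;
try {
    doc = PDDocument.load(file); // file: the PDF to index
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setLineSeparator("\n");
    stripper.setStartPage(1);
    stripper.setEndPage(5); // this means that only the first 5 pages will be indexed
    contents = stripper.getText(doc);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (doc != null) {
        try {
            doc.close(); // release the resources held by the PDF document
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}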
Then add the content to the Lucene Document, for example:
luceneDoc.add(new Field(CONTENT_FIELD, contents, Field.Store.NO, Field.Index.TOKENIZED));
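Because the table in the question mixes PDF, DOC, XSL, XML and image files in one BLOB column, the indexer has to decide for each row whether PDF extraction is needed. The following is only a minimal sketch of that decision, not part of Lucene or PDFBox. `BlobTextRouter`, `isPdf` and `toIndexableText` are hypothetical names, the `pdfExtractor` parameter stands in for a PDFBox-based method like the snippet above, and decoding every non-PDF blob as UTF-8 text is an assumption that will not hold for images or other binary formats.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical helper (not a Lucene or PDFBox class): picks how to turn a BLOB into indexable text.
final class BlobTextRouter {

    // A well-formed PDF file starts with the ASCII signature "%PDF-".
    private static final byte[] PDF_SIGNATURE = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the blob begins with the PDF signature. */
    static boolean isPdf(byte[] blob) {
        if (blob == null || blob.length < PDF_SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < PDF_SIGNATURE.length; i++) {
            if (blob[i] != PDF_SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the text to store in the Lucene content field.
     * pdfExtractor is a placeholder for a PDFBox-based extraction method.
     * Assumption: non-PDF blobs are text that can be decoded as UTF-8.
     */
    static String toIndexableText(byte[] blob, Function<byte[], String> pdfExtractor) {
        if (blob == null) {
            return "";
        }
        return isPdf(blob) ? pdfExtractor.apply(blob) : new String(blob, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4 binary data".getBytes(StandardCharsets.US_ASCII);
        byte[] xml = "<note>montbeliard</note>".getBytes(StandardCharsets.UTF_8);
        // Stub extractor for the demo; a real one would run PDFTextStripper on the bytes.
        Function<byte[], String> stub = bytes -> "text extracted from PDF";
        System.out.println(toIndexableText(pdf, stub));
        System.out.println(toIndexableText(xml, stub));
    }
}
```

In the indexing loop from the question, the bytes would presumably come from `result.getBytes("document")` rather than `result.getString("document")`, so that the binary PDF data reaches the extractor unchanged.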
Answer by rohit
First, you can read your PDF with iText, like this:
String content = "";
try {
    PdfReader readerObj = new PdfReader("file path");
    int n = readerObj.getNumberOfPages(); // total number of pages in the PDF
    content = PdfTextExtractor.getTextFromPage(readerObj, 2); // extract the content of one particular page (here page 2)
    readerObj.close();
} catch (Exception e) {
    e.printStackTrace();
}
Then add your PDF content to the Lucene document:
doc.add(new Field("pdfContent", content, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
Answer by user3186357
To parse any kind of file, use the Tika project, then index the extracted text with Lucene. Tika already bundles many parsing libraries (PDFBox, ...).