Java: How to index PDF files with Lucene

Question

Asked by JV_MI

I need to build a full-text search with Lucene in my project, so I have to index a BLOB column in a MySQL database (it contains pdf, doc, xsl, xml, and image files). With doc, xsl, and xml I have no problems, but with PDF files I get no results.

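    import java.io.File;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class Indexfile {
        public static void main(String[] args) throws Exception {
            // RemoteControlServiceConnection is not shown in the question; it hands out the JDBC connection.
            RemoteControlServiceConnection a = new RemoteControlServiceConnection(
                    "jdbc:mysql://localhost:3306/Test", "root", "root");
            Connection conn = a.getConnexionMySQL();
            final File INDEX_DIR = new File("index");
            // true = create a new index, overwriting any existing one
            IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);

            String query = "SELECT id, name, document FROM Table_document";
            Statement statement = conn.createStatement();
            ResultSet result = statement.executeQuery(query);

            while (result.next()) {
                Document document = new Document();
                document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NO));
                document.add(new Field("name", result.getString("name"), Field.Store.YES, Field.Index.TOKENIZED));
                // The BLOB column is read as a plain string here, whatever the file type.
                document.add(new Field("document", result.getString("document"), Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(document);
            }

            writer.close();
        }
    }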

For searching I use:

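    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocCollector;

    public class searchlucene {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            String qu = "montbel*"; // put your keyword here
            // String IndexStoreDir = "index-directory";
            try {
                Query q = new QueryParser("document", analyzer).parse(qu);
                int hitspp = 100; // hits per page
                IndexSearcher searcher = new IndexSearcher(IndexReader.open("index"));
                TopDocCollector collector = new TopDocCollector(hitspp);
                searcher.search(q, collector);
                ScoreDoc[] hits = collector.topDocs().scoreDocs;
                System.out.println("Found " + hits.length + " hits.");
                for (int i = 0; i < hits.length; ++i) {
                    int docId = hits[i].doc;
                    Document d = searcher.doc(docId);
                    System.out.println((i + 1) + ". " + d.get("name"));
                }
                searcher.close();
            } catch (Exception ex1) {
                ex1.printStackTrace(); // report errors instead of swallowing them silently
            }
        }
    }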

Answer 1

Accepted answer by Salah

First, you need to convert the PDF file's content to text, then add that text to the index.

For example, you can use PDFBox to convert the PDF content to text:

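    // PDFTextStripper lives in org.apache.pdfbox.util (PDFBox 1.x)
    // or org.apache.pdfbox.text (PDFBox 2.x).
    String contents = "";
    PDDocument doc = null;
    try {
        // `file` is the PDF to read (a java.io.File)
        doc = PDDocument.load(file);
        PDFTextStripper stripper = new PDFTextStripper();

        stripper.setLineSeparator("\n");
        stripper.setStartPage(1);
        stripper.setEndPage(5); // this means only the first 5 pages will be indexed
        contents = stripper.getText(doc);

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (doc != null) {
            try {
                doc.close(); // release the resources held by the loaded document
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }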

Then add the content to the Lucene Document, for example:

    luceneDoc.add(new Field(CONTENT_FIELD, contents, Field.Store.NO, Field.Index.TOKENIZED));
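
Since the BLOB column in the question mixes PDF, DOC, XSL, XML, and image data, the indexing loop needs some way to decide which rows to send through the PDF extractor. Below is a minimal sketch of one way to do that, assuming the blob has been read as raw bytes (for example with `ResultSet.getBytes`) rather than with `getString`. The class `PdfSniffer` and its method `looksLikePdf` are hypothetical names made up for this illustration; they are not part of Lucene or PDFBox. The check relies only on the fact that a well-formed PDF file begins with the ASCII marker `%PDF-`.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of Lucene or PDFBox. */
public class PdfSniffer {

    // A well-formed PDF file starts with the ASCII marker "%PDF-" followed by a version number.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes start with the PDF header marker. */
    public static boolean looksLikePdf(byte[] blob) {
        if (blob == null || blob.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (blob[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] xml = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdf)); // prints "true"
        System.out.println(looksLikePdf(xml)); // prints "false"
    }
}
```

In the indexing loop, rows for which `looksLikePdf` returns true could go through the PDFBox extraction above, while the other rows keep the existing path. This check is deliberately strict: a file with stray bytes before the header would not be recognized, so treat it as a starting point rather than a complete detector.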

Answer 2

Answered by rohit

First, you can read your PDF with iText, like this:

    String content = "";
    try {
        PdfReader readerObj = new PdfReader("file path");
        int n = readerObj.getNumberOfPages(); // total number of pages in the PDF
        // Extract the content of one particular page (here, page 2).
        content = PdfTextExtractor.getTextFromPage(readerObj, 2);
        readerObj.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

Then add your PDF content to the Lucene document:

    doc.add(new Field("pdfContent", content, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));

Answer 3

Answered by user3186357

To parse any kind of file, use the Tika project, then index the extracted text with Lucene. Tika already bundles many parsing libraries (PDFBox, ...).
