Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/4091441/

How do I index and search text files in Lucene 3.0.2?

java, indexing, lucene, text-files

Asked by celsowm

I am a newbie in Lucene, and I'm having some problems creating simple code to query a text file collection.

I tried this example, but it is incompatible with the new version of Lucene.

UPDATE: This is my new code, but it still doesn't work.

Answered by ffriend

Lucene is quite a big topic with a lot of classes and methods to cover, and you normally cannot use it without understanding at least some basic concepts. If you need a quickly available service, use Solr instead. If you need full control of Lucene, keep reading. I will cover some core Lucene concepts and the classes that represent them. (For information on how to read text files into memory, see, for example, this article.)
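
Since the question is about indexing text files, a minimal, Lucene-free helper for reading a whole file into memory could look like the sketch below. The class and method names are purely illustrative, not part of any library:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class FileUtil {

    // Reads an entire text file into memory, normalizing line breaks to '\n'.
    public static String readFile(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader br = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } finally {
            br.close();  // always release the file handle
        }
        return sb.toString();
    }
}
```

The resulting String can then be passed straight to a Lucene Field, as shown further below.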

Whatever you are going to do in Lucene - indexing or searching - you need an analyzer. The goal of an analyzer is to tokenize (break into words) and stem (get the base form of a word) your input text. It also throws out the most frequent words like "a", "the", etc. You can find analyzers for more than 20 languages, or you can use SnowballAnalyzer and pass the language as a parameter.
To create an instance of SnowballAnalyzer for English, you do this:

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
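
To make the tokenize/stop-word behavior concrete, here is a deliberately simplified, Lucene-free sketch of what an analyzer does (real analyzers also stem words and use much larger stop lists; all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToyAnalyzer {

    // A tiny stop-word list; real analyzers ship much larger ones.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "of", "and"));

    // Lower-cases the text, splits it on non-letters, and drops stop words.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (t.length() > 0 && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

For example, `ToyAnalyzer.tokenize("The Art of Computer Science")` yields `[art, computer, science]` - the same kind of token stream Lucene builds (and additionally stems) before writing terms to the index.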

If you are going to index texts in different languages, and want to select the analyzer automatically, you can use Tika's LanguageIdentifier.

You need to store your index somewhere. There are 2 major possibilities for this: an in-memory index, which is easy to try, and a disk index, which is the most widespread one.
Use either of the following 2 lines:

Directory directory = new RAMDirectory();   // RAM index storage
Directory directory = FSDirectory.open(new File("/path/to/index"));  // disk index storage

When you want to add, update or delete a document, you need an IndexWriter:

IndexWriter writer = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));

Any document (a text file in your case) is a set of fields. To create a document that will hold information about your file, use this:

Document doc = new Document();
String title = nameOfYourFile;
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));  // adding title field
String content = contentsOfYourFile;
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED)); // adding content field
writer.addDocument(doc);  // writing new document to the index

The Field constructor takes the field's name, its text and at least 2 more parameters. The first is a flag that shows whether Lucene must store this field. If it equals Field.Store.YES, you will be able to get all your text back from the index; otherwise only index information about it will be stored.
The second parameter shows whether Lucene must index this field or not. Use Field.Index.ANALYZED for any field you are going to search on.
Normally, you use both parameters as shown above.

Don't forget to close your IndexWriter after the job is done:

writer.close();

Searching is a bit tricky. You will need several classes: Query and QueryParser to make a Lucene query from a string, IndexSearcher for the actual searching, TopScoreDocCollector to store the results (it is passed to IndexSearcher as a parameter) and ScoreDoc to iterate through the results. The next snippet shows how this is all composed:

IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
Query query = parser.parse("terms to search");
TopScoreDocCollector collector = TopScoreDocCollector.create(HOW_MANY_RESULTS_TO_COLLECT, true);
searcher.search(query, collector);

ScoreDoc[] hits = collector.topDocs().scoreDocs;
// hits[i].doc is the internal Lucene document number. Note that this number may change after document deletion
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = searcher.doc(hits[i].doc);  // getting actual document
    System.out.println("Title: " + hitDoc.get("title"));
    System.out.println("Content: " + hitDoc.get("content"));
    System.out.println();
}

Note the second argument to the QueryParser constructor - it is the default field, i.e. the field that will be searched if no qualifier is given. For example, if your query is "title:term", Lucene will search for the word "term" in the field "title" of all docs, but if your query is just "term" it will search in the default field, in this case "content". For more info see Lucene Query Syntax.
QueryParser also takes an analyzer as its last argument. This must be the same analyzer you used to index your text.

The last thing you must know is the first parameter of TopScoreDocCollector.create. It is just a number that represents how many results you want to collect. For example, if it equals 100, Lucene will collect only the first (by score) 100 results and drop the rest. This is just an optimization - you collect the best results, and if you're not satisfied with them, you repeat the search with a larger number.
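
Conceptually, this "keep only the best N" collection behaves like a bounded min-heap over scores. A Lucene-free sketch of the idea (the names are illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopKCollector {

    // Keeps only the k highest scores seen so far, roughly what
    // TopScoreDocCollector does for hits during a search.
    public static List<Double> topK(double[] scores, int k) {
        // Min-heap: the smallest of the current top-k sits at the head,
        // so each new score only has to beat the weakest kept result.
        PriorityQueue<Double> heap = new PriorityQueue<Double>();
        for (double s : scores) {
            if (heap.size() < k) {
                heap.offer(s);
            } else if (s > heap.peek()) {
                heap.poll();   // evict the weakest kept score
                heap.offer(s);
            }
        }
        // Drain the heap and reverse so the best score comes first.
        List<Double> result = new ArrayList<Double>();
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }
}
```

For example, `topK(new double[]{0.1, 0.9, 0.5, 0.7}, 2)` returns `[0.9, 0.7]` - everything below the top 2 is dropped, which is exactly the optimization described above.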

Finally, don't forget to close the searcher and directory so you don't leak system resources:

searcher.close();
directory.close();

EDIT: Also see the IndexFiles demo class from the Lucene 3.0 sources.

Answered by smartnut007

package org.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;


import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class LuceneSimple {

    private static void addDoc(IndexWriter w, String value) throws IOException {
        Document doc = new Document();
        doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }

    public static void main(String[] args) throws CorruptIndexException, LockObtainFailedException, IOException, ParseException {

        File dir = new File("F:/tmp/dir");

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

        Directory index = new RAMDirectory();
        //Directory index = FSDirectory.open(new File("lucDirHello"));

        IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

        w.setRAMBufferSizeMB(200);

        System.out.println(index.getClass() + " RamBuff:" + w.getRAMBufferSizeMB());

        addDoc(w, "Lucene in Action");
        addDoc(w, "Lucene for Dummies");
        addDoc(w, "Managing Gigabytes");
        addDoc(w, "The Art of Computer Science");
        addDoc(w, "Computer Science ! what is that ?");

        long n = 0L;

        for (File f : dir.listFiles()) {
            BufferedReader br = new BufferedReader(new FileReader(f));
            String line = null;
            while ((line = br.readLine()) != null) {
                if (line.length() < 140) continue;
                addDoc(w, line);
                ++n;
            }
            br.close();
        }

        w.close();

        // 2. query
        String querystr = "Computer";

        Query q = new QueryParser(Version.LUCENE_30, "title", analyzer).parse(querystr);

        // search
        int hitsPerPage = 10;

        IndexSearcher searcher = new IndexSearcher(index, true);

        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);

        searcher.search(q, collector);

        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("title"));
        }

        searcher.close();
    }

}

Answered by Aravind Yarram

I suggest you look into Solr @ http://lucene.apache.org/solr/ rather than working with the Lucene API directly.