Counting results in categories with Lucene

Date: 2020-03-06 14:54:43   Source: igfitidea

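I am trying to use Lucene Java 2.3.2 to implement search over a product catalog. Besides the regular product fields, there is a field called "Category". A product can belong to multiple categories. Currently I use FilteredQuery to run the same search term against every category, to get the number of results per category.

This leads to 20-30 internal search calls per query just to display the results, which slows the search down considerably. Is there a faster way to get the same result with Lucene?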

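Solutions

You may want to consider walking through all the documents that match a category using a TermDocs iterator.

This example code goes through each "Category" term and then counts the number of documents that match that term.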

public static void countDocumentsInCategories(IndexReader reader) throws IOException {
    TermEnum terms = null;
    TermDocs td = null;

    try {
        terms = reader.terms(new Term("Category", ""));
        td = reader.termDocs();
        do {
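            Term currentTerm = terms.term();

            // Stop when the enumeration is exhausted or has moved past the "Category" field.
            if (currentTerm == null || !currentTerm.field().equals("Category")) {
                break;
            }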

            int numDocs = 0;
            td.seek(terms);
            while (td.next()) {
                numDocs++;
            }

            System.out.println(currentTerm.field() + " : " + currentTerm.text() + " --> " + numDocs);
        } while (terms.next());
    } finally {
        if (td != null) td.close();
        if (terms != null) terms.close();
    }
}

This code should run reasonably fast even for large indexes.

Here is some code that tests the method:

public static void main(String[] args) throws Exception {
    RAMDirectory store = new RAMDirectory();

    IndexWriter w = new IndexWriter(store, new StandardAnalyzer());
    addDocument(w, 1, "Apple", "fruit", "computer");
    addDocument(w, 2, "Orange", "fruit", "colour");
    addDocument(w, 3, "Dell", "computer");
    addDocument(w, 4, "Cumquat", "fruit");
    w.close();

    IndexReader r = IndexReader.open(store);
    countDocumentsInCategories(r);
    r.close();
}

private static void addDocument(IndexWriter w, int id, String name, String... categories) throws IOException {
    Document d = new Document();
    d.add(new Field("ID", String.valueOf(id), Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("Name", name, Field.Store.NO, Field.Index.UN_TOKENIZED));

    for (String category : categories) {
        d.add(new Field("Category", category, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }

    w.addDocument(d);
}

I don't have enough reputation to comment (!), but in Matt Quail's answer I'm pretty sure you could replace this:

int numDocs = 0;
td.seek(terms);
while (td.next()) {
    numDocs++;
}

with this:

int numDocs = terms.docFreq();

and then get rid of the td variable altogether. This should make it even faster.

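So let me see if I understand the question correctly: given a query from the user, you want to show how many matches there are for the query in each category. Correct?

Think of it this way: the query is effectively "originalQuery AND (category1 OR category2 OR ...)", except that besides the overall score you also want a count for each category. Unfortunately, the interface for collecting hits in Lucene is very narrow and only gives you an overall score for the query. But you could implement a custom Scorer/Collector.

Have a look at the source of org.apache.lucene.search.DisjunctionSumScorer. You could copy parts of it to write a custom scorer that iterates through the category matches while the main search is running, and keep a Map<String, Long> to track the matches in each category.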
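The bookkeeping half of that idea can be sketched without any Lucene classes. The snippet below assumes that some collector hands over the category values of each matching document (how it obtains them is left open here) and keeps the running totals in a Map<String, Long>. CategoryTally and recordHit are hypothetical names used only for illustration, and Map.merge requires Java 8 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CategoryTally {
    // Category name -> number of matching documents seen so far.
    private final Map<String, Long> counts = new TreeMap<String, Long>();

    /** Call once per matching document, passing that document's categories. */
    public void recordHit(List<String> categoriesOfDoc) {
        for (String category : categoriesOfDoc) {
            // Start at 1 for a new category, otherwise add 1 to the running total.
            counts.merge(category, 1L, Long::sum);
        }
    }

    public Map<String, Long> getCounts() {
        return counts;
    }

    public static void main(String[] args) {
        CategoryTally tally = new CategoryTally();
        tally.recordHit(Arrays.asList("fruit", "computer")); // e.g. the "Apple" document
        tally.recordHit(Arrays.asList("fruit"));             // e.g. the "Cumquat" document
        System.out.println(tally.getCounts());
    }
}
```

The hard part, getting the per-document categories cheaply inside the scorer or collector, is not covered by this sketch.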

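Here is what I did, though it is a bit heavy on memory:

You need to create a bunch of BitSets in advance, one per category, each containing the doc IDs of all the documents in that category. Then, at search time, you use a HitCollector and check each doc ID against the BitSets.

Here is the code to create the bit sets: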

public BitSet[] getBitSets(IndexSearcher indexSearcher, 
                           Category[] categories) {
    BitSet[] bitSets = new BitSet[categories.length];
    for(int i=0; i<categories.length; i++)
    {
        Query query = categories[i].getQuery();
        final BitSet bitSet = new BitSet();
        indexSearcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                bitSet.set(doc);
            }
        });
        bitSets[i] = bitSet;
    }
    return bitSets;
}

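This is just one way to do it. If your categories are simple enough, you could use TermDocs instead of running a full search, but either way this only has to run once, when the index is loaded.

Now, when it is time to count the categories of the search results, you do this: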

public int[] getCategoryCount(IndexSearcher indexSearcher, 
                               Query query, 
                               final BitSet[] bitSets) {
    final int[] count = new int[bitSets.length];
    indexSearcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            for(int i=0; i<bitSets.length; i++) {
                if(bitSets[i].get(doc)) count[i]++;
            }
        }
    });
    return count;
}

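What you end up with is an array containing the count for each category within the search results. If you also need the search results themselves, you should add a TopDocCollector to your hit collector (yo dawg...). Or you could simply run the search again. Two searches are better than 30.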
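The collector above tests every category bit set for each hit. One possible variation, sketched here with only java.util.BitSet and no Lucene classes, is to record the hits in a single BitSet and intersect it with each category's set afterwards. The class name BitSetFacetCounter, the helper bitsOf, and the sample doc IDs are made up for illustration; in a real setup the hits set would be filled from a HitCollector.

```java
import java.util.BitSet;

public class BitSetFacetCounter {

    /** For each category, counts the hit doc IDs that are also set in that category's bit set. */
    public static int[] countPerCategory(BitSet hits, BitSet[] categoryBits) {
        int[] counts = new int[categoryBits.length];
        for (int i = 0; i < categoryBits.length; i++) {
            // Work on a copy so the cached category bit set is left untouched.
            BitSet intersection = (BitSet) categoryBits[i].clone();
            intersection.and(hits);
            counts[i] = intersection.cardinality();
        }
        return counts;
    }

    /** Illustration helper: builds a BitSet from a list of doc IDs. */
    public static BitSet bitsOf(int... docIds) {
        BitSet bits = new BitSet();
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet fruit = bitsOf(1, 2, 4);    // docs in category "fruit"
        BitSet computer = bitsOf(1, 3);    // docs in category "computer"
        BitSet hits = bitsOf(1, 3, 4);     // docs matched by the user's query
        int[] counts = countPerCategory(hits, new BitSet[] { fruit, computer });
        System.out.println("fruit=" + counts[0] + ", computer=" + counts[1]);
    }
}
```

Whether this beats the per-hit loop depends on how many hits and categories there are, so it is worth measuring on the real index before adopting it.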

Sachin, I believe what you want is faceted search. Lucene does not provide it out of the box. I suggest you try SOLR, which offers faceting as a major and convenient feature.
