Note: This page is based on a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/24968697/

Time: 2020-08-14 16:03:46  Source: igfitidea

How to implement auto suggest using Lucene's new AnalyzingInfixSuggester API?

Tags: java, autocomplete, lucene, search-suggestion

Asked by user1586977

I am new to Lucene, and I want to implement auto-suggest like Google's: when I type a character such as 'G', it gives me a list of suggestions. You can try it yourself.

I have searched the whole net. Nobody has done this, even though Lucene gives us some new tools in its suggest package.

But I need an example that shows me how to do it.

Can anyone help?

Answered by John Wiseman

I'll give you a pretty complete example that shows you how to use AnalyzingInfixSuggester. In this example we'll pretend that we're Amazon, and we want to autocomplete a product search field. We'll take advantage of features of the Lucene suggestion system to implement the following:

  1. Ranked results: We will suggest the most popular matching products first.
  2. Region-restricted results: We will only suggest products that we sell in the customer's country.
  3. Product photos: We will store product photo URLs in the suggestion index so we can display them in the search results, without having to do an additional database lookup.

First I'll define a simple class to hold information about a product in Product.java:

import java.util.Set;

class Product implements java.io.Serializable
{
    String name;
    String image;
    String[] regions;
    int numberSold;

    public Product(String name, String image, String[] regions,
                   int numberSold) {
        this.name = name;
        this.image = image;
        this.regions = regions;
        this.numberSold = numberSold;
    }
}

To index records with AnalyzingInfixSuggester's build method, you need to pass it an object that implements the org.apache.lucene.search.suggest.InputIterator interface. An InputIterator gives access to the key, contexts, payload and weight for each record.

The key is the text you actually want to search on and autocomplete against. In our example, it will be the name of the product.

The contexts are a set of additional, arbitrary data that you can filter records against. In our example, the contexts are the set of ISO codes for the countries we will ship a particular product to.

The payload is additional arbitrary data you want to store in the index for the record. In this example, we will actually serialize each Product instance and store the resulting bytes as the payload. Then when we later do lookups, we can deserialize the payload and access information in the product instance, like the image URL.
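
As a rough, Lucene-free illustration of that round trip, here is a minimal sketch that uses only java.io. PayloadItem and PayloadCodec are made-up names standing in for Product and the suggester plumbing. The decoder takes an offset and a length on the assumption that a payload may be a slice of a larger byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the Product class used in this answer.
class PayloadItem implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final String image;
    final int numberSold;

    PayloadItem(String name, String image, int numberSold) {
        this.name = name;
        this.image = image;
        this.numberSold = numberSold;
    }
}

// Hypothetical helper: turns an item into payload bytes and back again.
class PayloadCodec {
    // Serialize the item; these bytes are what would be stored as the payload.
    static byte[] encode(PayloadItem item) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(item);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Deserialize from a slice of a byte array, honoring offset and length.
    static PayloadItem decode(byte[] bytes, int offset, int length) {
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes, offset, length))) {
            return (PayloadItem) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = encode(new PayloadItem(
            "Electric Guitar", "http://images.example/electric-guitar.jpg", 100));
        PayloadItem back = decode(payload, 0, payload.length);
        System.out.println(back.name + " / " + back.image + " / " + back.numberSold);
    }
}
```

The encode step corresponds to what a payload() method would wrap in a BytesRef, and the decode step to what you would run on the payload of a lookup result.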

The weight is used to order suggestion results; results with a higher weight are returned first. We'll use the number of sales for a given product as its weight.
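
To make the roles of key, contexts and weight concrete, here is a plain-Java model of what a lookup does conceptually. It is only a sketch under simplifying assumptions: SuggestionModel and Entry are hypothetical names, matching is a case-insensitive prefix test on the whitespace-separated words of the key (a single-word query only), and nothing here reflects how Lucene actually stores or searches its index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SuggestionModel {
    // Hypothetical record: the key, contexts and weight described above.
    static final class Entry {
        final String key;
        final long weight;
        final Set<String> contexts;

        Entry(String key, long weight, String... contexts) {
            this.key = key;
            this.weight = weight;
            this.contexts = new HashSet<>(Arrays.asList(contexts));
        }
    }

    // True if any word of the key starts with the (single-word) prefix.
    static boolean matches(String key, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String word : key.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    // Keep entries in the given context that match, highest weight first.
    static List<String> suggest(List<Entry> entries, String prefix,
                                String context, int num) {
        List<Entry> hits = new ArrayList<>();
        for (Entry e : entries) {
            if (e.contexts.contains(context) && matches(e.key, prefix)) {
                hits.add(e);
            }
        }
        hits.sort(Comparator.comparingLong((Entry e) -> e.weight).reversed());
        List<String> keys = new ArrayList<>();
        for (Entry e : hits.subList(0, Math.min(num, hits.size()))) {
            keys.add(e.key);
        }
        return keys;
    }

    // The same four sample products that the driver program indexes.
    static List<Entry> sampleEntries() {
        return Arrays.asList(
            new Entry("Electric Guitar", 100, "US", "CA"),
            new Entry("Electric Train", 100, "US", "CA"),
            new Entry("Acoustic Guitar", 80, "US", "ZA"),
            new Entry("Guarana Soda", 130, "ZA", "IE"));
    }

    public static void main(String[] args) {
        System.out.println(suggest(sampleEntries(), "Gu", "ZA", 2));
    }
}
```

In this model the context acts as a hard filter and the weight only decides the order among the entries that survive it.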

Here's the contents of ProductIterator.java:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UnsupportedEncodingException;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import org.apache.lucene.search.suggest.InputIterator;
import org.apache.lucene.util.BytesRef;


class ProductIterator implements InputIterator
{
    private Iterator<Product> productIterator;
    private Product currentProduct;

    ProductIterator(Iterator<Product> productIterator) {
        this.productIterator = productIterator;
    }

    public boolean hasContexts() {
        return true;
    }

    public boolean hasPayloads() {
        return true;
    }

    public Comparator<BytesRef> getComparator() {
        return null;
    }

    // This method needs to return the key for the record; this is the
    // text we'll be autocompleting against.
    public BytesRef next() {
        if (productIterator.hasNext()) {
            currentProduct = productIterator.next();
            try {
                return new BytesRef(currentProduct.name.getBytes("UTF8"));
            } catch (UnsupportedEncodingException e) {
                throw new Error("Couldn't convert to UTF-8");
            }
        } else {
            return null;
        }
    }

    // This method returns the payload for the record, which is
    // additional data that can be associated with a record and
    // returned when we do suggestion lookups.  In this example the
    // payload is a serialized Java object representing our product.
    public BytesRef payload() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bos);
            out.writeObject(currentProduct);
            out.close();
            return new BytesRef(bos.toByteArray());
        } catch (IOException e) {
            throw new Error("Well that's unfortunate.");
        }
    }

    // This method returns the contexts for the record, which we can
    // use to restrict suggestions.  In this example we use the
    // regions in which a product is sold.
    public Set<BytesRef> contexts() {
        try {
            Set<BytesRef> regions = new HashSet<BytesRef>();
            for (String region : currentProduct.regions) {
                regions.add(new BytesRef(region.getBytes("UTF8")));
            }
            return regions;
        } catch (UnsupportedEncodingException e) {
            throw new Error("Couldn't convert to UTF-8");
        }
    }

    // This method helps us order our suggestions.  In this example we
    // use the number of products of this type that we've sold.
    public long weight() {
        return currentProduct.numberSold;
    }
}

In our driver program, we will do the following things:

  1. Create an index directory in RAM.
  2. Create a StandardAnalyzer.
  3. Create an AnalyzingInfixSuggester using the RAM directory and analyzer.
  4. Index a number of products using ProductIterator.
  5. Print the results of some sample lookups.

Here's the driver program, SuggestProducts.java:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
import org.apache.lucene.search.suggest.Lookup;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class SuggestProducts
{
    // Get suggestions given a prefix and a region.
    private static void lookup(AnalyzingInfixSuggester suggester, String name,
                               String region) {
        try {
            List<Lookup.LookupResult> results;
            HashSet<BytesRef> contexts = new HashSet<BytesRef>();
            contexts.add(new BytesRef(region.getBytes("UTF8")));
            // Do the actual lookup.  We ask for the top 2 results.
            results = suggester.lookup(name, contexts, 2, true, false);
            System.out.println("-- \"" + name + "\" (" + region + "):");
            for (Lookup.LookupResult result : results) {
                System.out.println(result.key);
                Product p = getProduct(result);
                if (p != null) {
                    System.out.println("  image: " + p.image);
                    System.out.println("  # sold: " + p.numberSold);
                }
            }
        } catch (IOException e) {
            System.err.println("Error");
        }
    }

    // Deserialize a Product from a LookupResult payload.
    private static Product getProduct(Lookup.LookupResult result)
    {
        try {
            BytesRef payload = result.payload;
            if (payload != null) {
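                // A BytesRef may be a slice of a larger array, so honor
                // its offset and length instead of reading the whole array.
                ByteArrayInputStream bis = new ByteArrayInputStream(
                    payload.bytes, payload.offset, payload.length);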
                ObjectInputStream in = new ObjectInputStream(bis);
                Product p = (Product) in.readObject();
                return p;
            } else {
                return null;
            }
        } catch (IOException|ClassNotFoundException e) {
            throw new Error("Could not decode payload :(");
        }
    }

    public static void main(String[] args) {
        try {
            RAMDirectory index_dir = new RAMDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
            AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(
                Version.LUCENE_48, index_dir, analyzer);

            // Create our list of products.
            ArrayList<Product> products = new ArrayList<Product>();
            products.add(
                new Product(
                    "Electric Guitar",
                    "http://images.example/electric-guitar.jpg",
                    new String[]{"US", "CA"},
                    100));
            products.add(
                new Product(
                    "Electric Train",
                    "http://images.example/train.jpg",
                    new String[]{"US", "CA"},
                    100));
            products.add(
                new Product(
                    "Acoustic Guitar",
                    "http://images.example/acoustic-guitar.jpg",
                    new String[]{"US", "ZA"},
                    80));
            products.add(
                new Product(
                    "Guarana Soda",
                    "http://images.example/soda.jpg",
                    new String[]{"ZA", "IE"},
                    130));

            // Index the products with the suggester.
            suggester.build(new ProductIterator(products.iterator()));

            // Do some example lookups.
            lookup(suggester, "Gu", "US");
            lookup(suggester, "Gu", "ZA");
            lookup(suggester, "Gui", "CA");
            lookup(suggester, "Electric guit", "US");
        } catch (IOException e) {
            System.err.println("Error!");
        }
    }
}

And here is the output from the driver program:

-- "Gu" (US):
Electric Guitar
  image: http://images.example/electric-guitar.jpg
  # sold: 100
Acoustic Guitar
  image: http://images.example/acoustic-guitar.jpg
  # sold: 80
-- "Gu" (ZA):
Guarana Soda
  image: http://images.example/soda.jpg
  # sold: 130
Acoustic Guitar
  image: http://images.example/acoustic-guitar.jpg
  # sold: 80
-- "Gui" (CA):
Electric Guitar
  image: http://images.example/electric-guitar.jpg
  # sold: 100
-- "Electric guit" (US):
Electric Guitar
  image: http://images.example/electric-guitar.jpg
  # sold: 100

Appendix

There's a way to avoid writing a full InputIterator that you might find easier. You can write a stub InputIterator that returns null from its next, payload and contexts methods. Pass an instance of it to AnalyzingInfixSuggester's build method:

suggester.build(new ProductIterator(new ArrayList<Product>().iterator()));

Then for each item you want to index, call the AnalyzingInfixSuggester add method:

suggester.add(text, contexts, weight, payload)

After you've indexed everything, call refresh:

suggester.refresh();

If you're indexing large amounts of data, it's possible to significantly speed up indexing by using this method with multiple threads: call build, then use multiple threads to add items, then finally call refresh.
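
The ordering of those steps can be sketched without Lucene. In the example below, FakeSuggester is a hypothetical stand-in with a thread-safe add and a refresh that publishes whatever has been added so far; it is not the real suggester. The point is only that every add task must finish before refresh is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelIndexing {
    // Hypothetical stand-in for the suggester: add() is thread-safe and
    // refresh() makes everything added so far visible to lookups.
    static final class FakeSuggester {
        private final ConcurrentLinkedQueue<String> pending =
            new ConcurrentLinkedQueue<>();
        private volatile List<String> visible = new ArrayList<>();

        void add(String text) {
            pending.add(text);
        }

        void refresh() {
            visible = new ArrayList<>(pending);
        }

        int visibleCount() {
            return visible.size();
        }
    }

    // Adds all items from a thread pool, then refreshes exactly once.
    static int indexInParallel(List<String> items, int threads) {
        FakeSuggester suggester = new FakeSuggester();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String item : items) {
            pool.submit(() -> suggester.add(item));
        }
        pool.shutdown();
        try {
            // Wait for every add task before publishing the results.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        suggester.refresh();
        return suggester.visibleCount();
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            items.add("product " + i);
        }
        System.out.println(indexInParallel(items, 4));
    }
}
```

Calling refresh once at the end, rather than after each add, is the design choice this sketch is meant to show.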

[Edited 2015-04-23 to demonstrate deserializing info from the LookupResult payload.]