Java: How to query Lucene with the "like" operator?

Notice: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/3307890/

Time: 2020-08-13 21:57:57  Source: igfitidea

How to query lucene with "like" operator?

java, lucene, sql-like

Asked by Freewind

The wildcard * can only be used at the end of a word, like user*.


I want to query with something like LIKE %user%. How can I do that?


Accepted answer by Pascal Dimassimo

Lucene provides the ReverseStringFilter, which allows leading wildcard searches like *user. It works by indexing all terms in reverse order.
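To make the reversal idea concrete, here is a minimal plain-Java sketch. It is not Lucene code: the class ReversedTermIndex and its methods are hypothetical names made up for this illustration, and it only assumes that terms are kept in a sorted set. Storing each term reversed turns a leading-wildcard pattern such as *user into an ordinary prefix lookup on "resu".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene.
final class ReversedTermIndex {
    // Terms are stored reversed and sorted, so a suffix of the
    // original term becomes a prefix of the stored entry.
    private final TreeSet<String> reversedTerms = new TreeSet<>();

    void add(String term) {
        reversedTerms.add(reverse(term));
    }

    /** Returns all terms ending with the suffix, i.e. matches for "*suffix". */
    List<String> endingWith(String suffix) {
        String prefix = reverse(suffix);
        List<String> matches = new ArrayList<>();
        // Jump to the first entry >= prefix, then scan forward.
        for (String candidate : reversedTerms.tailSet(prefix, true)) {
            if (!candidate.startsWith(prefix)) {
                break; // sorted order: no later entry can match
            }
            matches.add(reverse(candidate));
        }
        return matches;
    }

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }
}
```

For example, after adding user, superuser, username and users, endingWith("user") returns user and superuser. A pattern like *user* still cannot be answered this way, because neither end of the term is fixed.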


But I think there is no way to do something similar to 'LIKE %user%'.


Answer by Andreas Dolk

Since Lucene 2.1 you can use


QueryParser.setAllowLeadingWildcard(true);

but this can kill performance. The Lucene FAQ has some more information on this.


Answer by Stephen C

When you think about it, it is not entirely surprising that Lucene's support for wildcards is (normally) restricted to a wildcard at the end of a word pattern.


Keyword search engines work by creating an inverted index of all words in the corpus, sorted in word order. When you do a normal non-wildcard search, the engine makes use of the fact that index entries are sorted to locate the entry or entries for your word in O(log N) steps, where N is the number of words or entries. For a word pattern with a suffix wildcard, the same lookup finds the first matching word, and the other matches are found by scanning the following entries until the fixed part of the pattern no longer matches.
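The lookup described above can be sketched in plain Java as a binary search followed by a short forward scan. This is only an illustration of the principle, not how Lucene stores its term dictionary; SortedTermDictionary and withPrefix are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene.
final class SortedTermDictionary {
    private final String[] terms; // distinct terms in sorted order

    SortedTermDictionary(String... terms) {
        // TreeSet sorts the terms and drops duplicates.
        this.terms = new TreeSet<>(Arrays.asList(terms)).toArray(new String[0]);
    }

    /** Returns all terms starting with the prefix, i.e. matches for "prefix*". */
    List<String> withPrefix(String prefix) {
        // O(log N): find the prefix itself or the position where it would be inserted.
        int pos = Arrays.binarySearch(terms, prefix);
        int start = pos >= 0 ? pos : -(pos + 1);
        List<String> matches = new ArrayList<>();
        // Scan forward until the fixed part of the pattern no longer matches.
        for (int i = start; i < terms.length && terms[i].startsWith(prefix); i++) {
            matches.add(terms[i]);
        }
        return matches;
    }
}
```

With the terms abuser, superuser, user, username, users and utility, withPrefix("user") returns user, username and users without ever looking at abuser or superuser.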


However, for a word pattern with a wildcard prefix and a wildcard suffix, the engine would have to look at all entries in the index. This would be O(N), unless the engine built a whole stack of secondary indexes for matching literal substrings of words (and that would make indexing a whole lot more expensive). For more complex patterns (e.g. regexes), the problem would be even worse for the search engine.
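One way to picture such a secondary index is a character trigram index, sketched below in plain Java next to the brute-force scan. This is an assumption chosen for illustration, not something the answer says Lucene does; TrigramTermIndex and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene.
final class TrigramTermIndex {
    private static final int GRAM = 3;
    private final Set<String> allTerms = new TreeSet<>();
    // Secondary index: each 3-character substring maps to the terms containing it.
    private final Map<String, Set<String>> termsByGram = new HashMap<>();

    void add(String term) {
        allTerms.add(term);
        for (int i = 0; i + GRAM <= term.length(); i++) {
            termsByGram.computeIfAbsent(term.substring(i, i + GRAM), g -> new TreeSet<>()).add(term);
        }
    }

    /** Baseline for "*needle*": checks every term, so it is linear in the number of terms. */
    List<String> containingByScan(String needle) {
        List<String> matches = new ArrayList<>();
        for (String term : allTerms) {
            if (term.contains(needle)) {
                matches.add(term);
            }
        }
        return matches;
    }

    /** Narrows the candidates via the trigram index, then verifies each candidate. */
    List<String> containingByGrams(String needle) {
        if (needle.length() < GRAM) {
            return containingByScan(needle); // too short for the gram index
        }
        Set<String> candidates = null;
        for (int i = 0; i + GRAM <= needle.length(); i++) {
            Set<String> withGram =
                    termsByGram.getOrDefault(needle.substring(i, i + GRAM), Collections.<String>emptySet());
            if (candidates == null) {
                candidates = new TreeSet<>(withGram);
            } else {
                candidates.retainAll(withGram);
            }
        }
        List<String> matches = new ArrayList<>();
        for (String term : candidates) {
            if (term.contains(needle)) { // sharing all grams does not guarantee a match
                matches.add(term);
            }
        }
        return matches;
    }
}
```

Both methods return the same terms. The trade-off is the one named above: every term is indexed once per trigram, so the index grows and indexing gets slower in exchange for faster substring lookups.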


Answer by Jon

The trouble with LIKE queries is that they are expensive in terms of the time taken to execute. You can set up QueryParser to allow leading wildcards with the following:


QueryParser.setAllowLeadingWildcard(true)


And this will allow you to do searches like:


*user*


But this will take a long time to execute. Sometimes when people say they want a LIKE query, what they actually want is a fuzzy query. This would allow you to do the following search:


user~


This would match the terms users and fuser. You can control how close a term must be to the term in your query with a float value between 0 and 1, where a value closer to 1 requires a closer match. For example, user~0.8 would match fewer terms than user~0.5. (Since Lucene 4.0 the value is instead a maximum number of edits, from 0 to 2.)
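To see why users and fuser count as close to user, here is a small sketch of the classic Levenshtein edit distance in plain Java. It is only an illustration of the idea behind fuzzy matching, not Lucene's implementation; the class name EditDistance is hypothetical.

```java
// Hypothetical illustration class, not part of Lucene.
final class EditDistance {
    /** Minimum number of single-character insertions, deletions and substitutions turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }
}
```

Both levenshtein("user", "users") and levenshtein("user", "fuser") are 1, since each needs a single inserted character, while levenshtein("user", "username") is 4.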


I suggest you also take a look at regex query, which supports regular expression syntax for Lucene searches. It may be closer to what you really need. Perhaps something like:


.*user.*
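As a rough picture of what that pattern selects, the sketch below applies it to a list of terms using java.util.regex as a stand-in. Lucene's own regular expression syntax is not identical to Java's, so treat this as an approximation; RegexTermFilter is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Lucene.
final class RegexTermFilter {
    /** Returns the terms that match the regular expression as a whole. */
    static List<String> matching(String regex, List<String> terms) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            // matches() requires the entire term to match, not just a part of it.
            if (pattern.matcher(term).matches()) {
                matches.add(term);
            }
        }
        return matches;
    }
}
```

Given the terms user, superuser, username, utility and fuse, the pattern .*user.* keeps user, superuser and username. Like the leading-wildcard search, this checks every term, so it carries the same performance cost.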
